Comment by kianN
Some statistical notes for those interested:
Under the hood, this model resembles LDA, but replaces its Dirichlet priors with Pitman–Yor Processes (PYPs), which better capture the power-law behavior of word distributions. It also supports arbitrary hierarchical priors, allowing metadata-aware modeling.
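To make the power-law point concrete, here's a minimal sketch (not our actual implementation) of the Pitman–Yor "Chinese restaurant" sampler: with a positive discount, new tables keep opening as the sample grows, which produces the heavy-tailed size distribution that a Dirichlet prior can't capture:

    import random

    def pyp_crp(n, discount=0.5, concentration=1.0, seed=0):
        # Pitman-Yor CRP: customer n+1 opens a new table with probability
        # (concentration + discount*K) / (concentration + n), where K is the
        # current table count; otherwise they join table k with probability
        # proportional to (count_k - discount).
        rng = random.Random(seed)
        counts = []   # counts[k] = customers seated at table k
        total = 0
        for _ in range(n):
            k = len(counts)
            if total == 0 or rng.random() < (concentration + discount * k) / (concentration + total):
                counts.append(1)  # open a new table
            else:
                r = rng.random() * (total - discount * k)
                for i, c in enumerate(counts):
                    r -= c - discount
                    if r <= 0:
                        counts[i] += 1
                        break
                else:
                    counts[-1] += 1  # guard against floating-point fall-through
            total += 1
        return counts

    # With discount > 0 the table sizes are heavy-tailed (power law);
    # discount = 0 recovers the ordinary Dirichlet-process CRP.
    print(sorted(pyp_crp(10000), reverse=True)[:10])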
For example, in an earnings-transcript corpus, a typical LDA might have a flat structure: Prior → Document
Our model instead uses a hierarchical graph: Uniform Prior → Global Topics → Ticker → Quarter → Paragraph
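As a rough illustration (level names below just mirror the diagram; the real API looks different), you can think of the hierarchy as a chain where each level's PYP uses the level above it as its base distribution:

    # Hypothetical representation, for intuition only.
    HIERARCHY = [
        "uniform_prior",   # root base measure
        "global_topics",   # corpus-wide topics
        "ticker",          # per-company
        "quarter",         # per-earnings-call
        "paragraph",       # leaf: observed words
    ]

    def base_distribution(level):
        # Each level's PYP draws from the parent level, so paragraph-level
        # topics are smoothed toward quarter, ticker, and global ones.
        i = HIERARCHY.index(level)
        return HIERARCHY[i - 1] if i > 0 else None

    print(base_distribution("quarter"))  # -> "ticker"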
This hierarchical structure, combined with the PYP statistics, consistently yields more coherent and fine-grained topics than standard LDA does. There's also a "fast mode" that collapses some hierarchy levels for quicker runs; it's a handy option if you're curious to see the impact the hierarchy has on the results (or are in a rush).
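Conceptually, fast mode just prunes intermediate levels from that chain; exactly which levels get collapsed is an implementation detail, so treat this as a hypothetical sketch:

    FULL = ["uniform_prior", "global_topics", "ticker", "quarter", "paragraph"]

    def fast_mode(levels, collapse=("ticker", "quarter")):
        # Assumption: which levels are collapsed isn't specified above.
        return [lvl for lvl in levels if lvl not in collapse]

    print(fast_mode(FULL))  # ['uniform_prior', 'global_topics', 'paragraph']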
Curious what you use to productionize this; it's so cool and inspiring to see hierarchical Bayes applications like this.
What's the go-to "production" stack for something like this nowadays? Is Stan dead? Do you do HMC, or approximations with e.g. Pyro?