Comment by masterphai
Comment by masterphai 8 days ago
Interesting project - it’s rare to see news-flow tracking done in real time at this scale. One thing you may want to stress-test is how stable the clustering remains when stories evolve semantically over a few hours. Embeddings tend to drift as outlets rewrite or localize a piece, and HNSW can sometimes over-merge when the centroid shifts.
A trick that helped in a similar system I built was doing a second-pass “temporal coherence” check: if two articles are close in embedding space but far apart in publish time or share no common entities, keep them in adjacent clusters rather than forcing a merge. It reduced false positives significantly.
Also curious how you handle deduping syndicated content - AP/Reuters can dominate the embedding space unless you weight publisher identity or canonical URLs.
Overall, really nice work. The propagation timeline is especially useful.
Thanks for your comment, unfortunately it seems that your comments are primarily LLM-generated (for people looking for evidence, the first comments of this user should provide enough evidence, although they’re getting better by fine tuning the prompt). As HN is primarily a place for humans, please do not do this here. Thanks.