Show HN: Real-time system that tracks how news spreads across 200k websites
(yandori.io) 253 points by antiochIst 8 days ago
I built a system that monitors ~200,000 news RSS feeds in near real-time and clusters related articles to show how stories spread across the web.
It uses Snowflake’s Arctic model for embeddings and HNSW for fast similarity search. Each “story cluster” shows who published first, how fast it propagated, and how the narrative evolved as more outlets picked it up.
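To make the clustering step concrete, here is a minimal sketch of the idea, not the production code. It assumes each article already has an embedding vector, and it uses a brute-force cosine comparison against cluster centroids as a stand-in for the HNSW lookup. The names `Article`, `Cluster`, `assign`, and the 0.8 threshold are illustrative assumptions.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Article:
    url: str
    published: float          # Unix timestamp taken from the feed entry
    embedding: list[float]    # assumed to come from the embedding model


@dataclass
class Cluster:
    members: list[Article] = field(default_factory=list)
    centroid: list[float] = field(default_factory=list)

    def add(self, article: Article) -> None:
        # Keep a running mean of member embeddings as the cluster centroid.
        n = len(self.members)
        if n == 0:
            self.centroid = list(article.embedding)
        else:
            self.centroid = [
                (c * n + x) / (n + 1)
                for c, x in zip(self.centroid, article.embedding)
            ]
        self.members.append(article)

    def first_publisher(self) -> Article:
        # "Who published first" by feed timestamp only.
        return min(self.members, key=lambda a: a.published)


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def assign(article: Article, clusters: list[Cluster], threshold: float = 0.8) -> Cluster:
    # Brute-force nearest centroid; a real system would query an ANN index here.
    best, best_sim = None, threshold
    for cluster in clusters:
        sim = cosine(article.embedding, cluster.centroid)
        if sim >= best_sim:
            best, best_sim = cluster, sim
    if best is None:
        best = Cluster()
        clusters.append(best)
    best.add(article)
    return best
```

Calling `assign` for each incoming article grows `clusters` in place, and `first_publisher()` on any cluster returns the member with the earliest timestamp. The linear scan over centroids is the part an approximate nearest-neighbor index would replace at scale.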
Would love feedback on the architecture, scaling approach, and any ways to make the clusters more accurate or useful.
Live demo: https://yandori.io/news-flow/
This is interesting, but it seems to be tracking stories with similar headlines, and that's not always how news propagates. Frequently a blogger will read an interview, select a quote from it, and write a new headline around the quote they cherry-picked. It used to be common practice to link the original source, but that doesn't always happen.
I have long thought that search engines, news aggregators and social media companies have a journalistic responsibility to favor the original/primary source of every story, but things have not worked out that way. If you can manage to truly develop something like this it would be a valuable tool for rewarding the work of reporting over SEO.
Anyway, please consider that headlines and timestamps do not tell the entire story when it comes to sourcing.
For example: your website offers this story (https://hotspotatl.com/6587626/dr-jackie-married-to-medicine...) as first to publish. But right in the text it cites another website, BOSSIP, as the source of the interview.
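A rough sketch of the kind of check I mean, with the caveat that this is a crude heuristic and not a claim about how your pipeline works: within a cluster, count how often each article's outlet name or hostname is mentioned in the other articles' text, and prefer the most-cited article over the earliest timestamp. The `Report` type, `cites`, and `likely_origin` are hypothetical names, and plain substring matching will produce false positives on short or common outlet names.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass
class Report:
    outlet: str       # human-readable outlet name
    url: str
    published: float  # Unix timestamp from the feed
    text: str         # article body as plain text


def cites(a: Report, b: Report) -> bool:
    # True if a's body mentions b's outlet name or hostname (case-insensitive).
    body = a.text.lower()
    name = b.outlet.lower()
    host = urlparse(b.url).netloc.lower().removeprefix("www.")
    return (name != "" and name in body) or (host != "" and host in body)


def likely_origin(reports: list[Report]) -> Report:
    # Count in-cluster citations, then rank by citations first, timestamp second.
    cited = {r.url: 0 for r in reports}
    for a in reports:
        for b in reports:
            if a is not b and cites(a, b):
                cited[b.url] += 1
    return min(reports, key=lambda r: (-cited[r.url], r.published))
```

With this ordering, an article that carries the earlier timestamp but credits another outlet in its text loses to the outlet it credits. When nobody in the cluster cites anybody, it falls back to the earliest timestamp.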
Also: there doesn't appear to be a way to link to individual results on your website.