Comment by andai
>the similarity check doesn't appear to do translation
This surprises me. The system is based on embeddings. AFAIK embeddings cluster the same concept in different languages in roughly the same place? Maybe it depends on the model (or maybe it's not exact and the clustering cutoff loses it).
I'm basically throwing away non-English articles for now. I'll probably get them in later, but I want to get English right first before trying to move to other languages.
The embeddings themselves will probably cluster OK across languages, but I have not tested this yet.
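To make the "clustering cutoff loses it" point concrete, here is a rough sketch of a cosine-similarity check with a cutoff. Everything in it is an assumption for illustration: the 3-d vectors are made up (real embeddings would come from whatever model the system uses), and `is_same_story` and the cutoff values are hypothetical names and numbers, not the actual implementation.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def is_same_story(a, b, cutoff):
    """Hypothetical duplicate check: similar enough counts as the same story."""
    return cosine_similarity(a, b) >= cutoff

# Made-up toy vectors standing in for real article embeddings.
english = [0.90, 0.10, 0.20]      # an English article
english_dup = [0.88, 0.12, 0.21]  # a near-duplicate English article
french = [0.70, 0.40, 0.30]       # same story in French: nearby, but not as close

# A strict cutoff keeps the English pair together but drops the French one.
strict_english = is_same_story(english, english_dup, cutoff=0.95)
strict_french = is_same_story(english, french, cutoff=0.95)

# A looser cutoff would pull the French article into the same cluster.
loose_french = is_same_story(english, french, cutoff=0.85)
```

If cross-language pairs really do land "roughly" rather than exactly in the same place, a cutoff tuned on English-only duplicates could plausibly be too strict for them, which is the scenario the toy numbers imitate.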