Comment by authorfly 5 days ago
Nice. I like the tiny size a lot; that's already an advantage over SBERT's smallest models.
But it seems quite dated technically, which I understand is a tradeoff for performance. Could you provide a way to toggle between different types of similarity (e.g. semantic, NLI, noun-abstract)?
E.g. I sometimes want "Freezing" and "Burning" to be very similar (a score of 1) when, say, grouping/clustering newspaper articles into categories like "Extreme environmental events", as on MTEB sentence-similarity tasks and as classic Word2Vec/GloVe would score them. But if this were a chemistry article, I'd want them to be near-opposites, as ChatGPT embeddings would score them. And sometimes I want to use NLI embeddings to work out the causal link between two statements. Because the latter two embedding types are more recent (2019+), they are where the technical opportunity is, not the older MTEB/semantic-similarity ones, which have been performant enough for many use cases since around 2014 and received a big boost with mini-lm-v2 etc.
For the above three embedding types I can use SBERT, but the dimensions are large, the models are quite large, and having to load multiple models for different similarity types strains resources; it often takes about 6GB, because generative embedding models (or E5 etc.) are large, as are NLI models.
Great ideas. I'll run some experiments and see how feasible it is. I'd want to see how performance holds up if I train on a single type of similarity. Without any contextual computation, I'm not sure there are other options for doing it. It may require switching between models, but that's not much of an issue.
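For what it's worth, the memory pressure from holding several similarity-type models at once can be eased by loading models lazily and keeping only the active one resident. Here is a minimal sketch of that pattern; the similarity-type names and the stand-in "encoders" are illustrative placeholders, not real SBERT checkpoints:

```python
# Sketch: keep one embedding model in memory at a time, loading on demand
# when the requested similarity type changes. The loaders below are trivial
# stand-in functions, not actual embedding models.

class ModelSwitcher:
    def __init__(self, loaders):
        self._loaders = loaders      # similarity type -> zero-arg loader
        self._active_type = None
        self._active_model = None

    def encode(self, sim_type, texts):
        if sim_type != self._active_type:
            # Load the requested model; the previously active one becomes
            # unreferenced and can be garbage-collected, freeing its memory.
            self._active_model = self._loaders[sim_type]()
            self._active_type = sim_type
        return [self._active_model(t) for t in texts]

# Stand-in "models": toy per-text encoders used purely for demonstration.
loaders = {
    "semantic": lambda: (lambda text: [len(text)]),
    "nli": lambda: (lambda text: [text.count(" ") + 1]),
}

switcher = ModelSwitcher(loaders)
print(switcher.encode("semantic", ["Freezing", "Burning"]))  # [[8], [7]]
print(switcher.encode("nli", ["A causes B"]))                # [[3]]
```

With real sentence-transformers models the loaders would each construct a `SentenceTransformer(...)`, so only one 1–2GB model is resident at a time instead of all three, at the cost of a reload whenever the similarity type changes.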