authorfly 5 days ago

Nice. I like the tiny size a lot, that's already an advantage over SBERT's smallest models.

It seems quite dated technically - which I understand is a tradeoff for performance - but could you provide a way to toggle between different types of similarity (e.g. semantic, NLI, noun-abstract)?

E.g. I sometimes want "Freezing" and "Burning" to be very similar (close to 1), say when grouping/clustering newspaper articles into categories like "Extreme environmental events" - as on MTEB/Sentence-Similarity, and as classic Word2Vec/GloVe would do. But if this were a chemistry article, I'd want them to be near opposites, as ChatGPT embeddings would give. And sometimes I want to use NLI embeddings to work out the causal link between two things. Because the latter two embedding types are more recent (2019+), they are where the technical opportunity is - not the older MTEB/semantic-similarity ones, which have been performant enough for many use cases since 2014 and received a big boost in 2019 with MiniLM-v2 etc.

For the above 3 embedding types I can use SBERT, but the dimensions are large, the models are quite large, and having to load multiple models for different similarity types strains resources; it often takes about 6GB, because generative embedding models (or E5 etc.) are large, as are NLI models.

deepsquirrelnet 5 days ago

Great ideas - I'll run some experiments and see how feasible it is. I'd want to see how performance looks if I train on a single type of similarity. Without any contextual computation, I am not sure there are other options for doing it. It may require switching between models, but that's not much of an issue.

refulgentis 5 days ago

It's a 17 MB model that benchmarks clearly worse than MiniLM v2 (which is SBERT). I run V3 on ONNX on every platform you can think of with a 23 MB model.

I don't intend for that to be read as dismissive; it's just important to understand work like this in context. Here, the context is a cool trick: once you reach an advanced understanding of LLMs, you notice they have embeddings too, and if that is your lens, it's much more straightforward to take a step forward and mess with those than to take a step back and survey the state of embeddings.

curl-up 5 days ago

I assume that by "ChatGPT embeddings" you mean OpenAI embedding models. In that case, "burning" and "freezing" are not opposite at all, with a cosine similarity of 0.46 (running on text-embedding-3-large with 1024 dimensions). "Perfectly opposite" embeddings would have a cosine similarity of -1.

It's a common mistake people make: expecting words with opposite meanings to have opposite embeddings. Instead, words with opposite meanings have a lot in common, e.g. "burning" and "freezing" are both related to temperature and physics, both English words, both words that can be a verb, a noun, and an adjective (not that many such words), both spelled correctly, etc. All these features end up being part of the embedding.
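
A minimal sketch of that check, assuming the official openai Python package (v1+) with an API key in the environment (the exact value can vary slightly):

    import numpy as np
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def embed(text: str) -> np.ndarray:
        resp = client.embeddings.create(
            model="text-embedding-3-large",
            input=text,
            dimensions=1024,  # the 1024-dimension variant mentioned above
        )
        return np.array(resp.data[0].embedding)

    def cosine(a: np.ndarray, b: np.ndarray) -> float:
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    # "Opposite" words are still fairly similar: ~0.46, nowhere near -1.
    print(cosine(embed("burning"), embed("freezing")))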

  • magicalhippo 4 days ago

    This might be a dumb question but... if I get the embeddings of words with a common theme like "burning", "warm", "cool", "freezing", would I be able to fit an arc (or line) through them reasonably well? So that if I interpolate along that arc/line, I get vectors close to "hot" and "cold"?

    • authorfly 4 days ago

      This was the original argument of the King-Queen-Man-Woman Word2Vec paper - and it turns out the answer is yes to a degree, but not beyond basic categories. All embeddings are trained based on what the creator decides they want them to do: to represent semantic (meaningful) similarity - similar word use - or topics or domains, or level of language use, or indeed to work multilingually and clump together embeddings in one language, etc.

      Different models will give you different results - many are trained for search/retrieval, for which MTEB is a good benchmark. But those won't generally "excel" at what you propose; they'll just land in the same general area.
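
      A quick way to poke at this yourself - a minimal sketch using gensim's pretrained GloVe vectors (illustrative; the interpolation result is not guaranteed):

          import gensim.downloader as api

          # Pretrained 100-dimensional GloVe vectors (downloaded on first use).
          vectors = api.load("glove-wiki-gigaword-100")

          # The classic analogy: king - man + woman ~= queen.
          print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))

          # Interpolate halfway between "freezing" and "burning" and inspect the
          # neighbourhood - often temperature-related words, but not reliably
          # "warm"/"cool", matching the "yes to a degree" answer above.
          midpoint = (vectors["freezing"] + vectors["burning"]) / 2
          print(vectors.similar_by_vector(midpoint, topn=5))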

  • authorfly 4 days ago

    You are missing the forest for the trees in my point. LLM-based (especially RLHF) embeddings allow you to do much more and encode greater context than either "this thing is being used as a potent adjective" or "this thing is a noun similar to that other [abstraction] noun" <-- Word2Vec, or "this thing is similar in terms of the whole sentence when doing retrieval tasks" <-- SBERT.

    If you can't see why it is useful that neither Word2Vec nor SBERT can put "positive charge" and "negative charge" in very different, opposite embedding space while LLM- and RLHF-based embeddings can, you don't understand the full utilization possible with embeddings.

    Firstly, you can choose what you embed the word with, such as "Article Topic:" or "Temperature:", to adjust the output of the embedding and the results of cosine similarity to be relevant for your use case (if you use a word-based embedding, which captures much less than a sentence for search, retrieval, and many other tasks like categorising).
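
    As a minimal sketch of those mechanics (sentence-transformers with a generic model, purely illustrative - whether the framing separates the pair, or instead pulls them together as argued downthread, depends heavily on the model):

        from sentence_transformers import SentenceTransformer, util

        # Illustrative model choice; RLHF/instruct-tuned embedding models are
        # the ones claimed to respond meaningfully to this kind of framing.
        model = SentenceTransformer("all-MiniLM-L6-v2")

        plain = model.encode(["Burning", "Freezing"])
        themed = model.encode(["Temperature: Burning", "Temperature: Freezing"])

        print(util.cos_sim(plain[0], plain[1]))    # bare words
        print(util.cos_sim(themed[0], themed[1]))  # with "Temperature:" framing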

    Secondly, by default, these models are not in as "dumb" a state as the original slew of Word2Vec and GloVe models, which, yes, would score words like "loved" and "hated" as very similar because of their similar use as adjectives - which caused issues for things like semantic classification of reviews. These newer models encode so much more that they see the difference between "loved" and "hated" as much bigger than that between, say, "loved" and "walk". *This is already a useful default step up, but almost anyone using RLHF embeddings is embedding sentences to get the best use out of them.*

    Your understanding of embeddings is rather flawed if you focus on "they're both English words, they're both words that can be a verb, a noun and an adjective (not that many such words)". Why do embeddings in different languages with the same semantic meaning land closer in space than two unrelated English words? The model has no focus on part-of-speech type, and is ideally suited to embedding sentences, where with every additional token it can produce a more useful embedding.

    "Being spelled correctly" suggests a miscomprehension that these systems are a look-up. Yes, they are for one word - if you spelled that one word wrong (or one token, which can represent multiple words), you'd get a different, and very wrong, place in embedding space. However, when you have multiple tokens, a misspelling moves the embedding very little, because the model becomes adept early on at comprehending misspellings, slang, and other "translation"-like tasks, and at making their effects irrelevant for downstream tasks unless they are useful to keep around. Effective resolution of spelling mistakes is in any case possible with models as small as 2-5GB, as T5 showed back in 2019, and I'd posit even some sentence-similarity-trained models (e.g. based on BERT, whose training set contained some spelling errors) treat spelling mistakes essentially the same way.
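
    As a minimal sketch of the misspelling point (illustrative model; exact similarities vary by model):

        from sentence_transformers import SentenceTransformer, util

        model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative choice

        clean = model.encode("The restaurant was wonderful and the staff were friendly.")
        typos = model.encode("The restuarant was wonderfull and the staff were freindly.")

        # With multi-token input, a few typos typically move the embedding very
        # little: cosine similarity stays close to 1.0.
        print(util.cos_sim(clean, typos))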

    I am aware of the options from OpenAI for embeddings, as I have used them for a long, long time. The original options were each based on the early released models, especially ada and babbage, and though the naming convention isn't clear any more, the more recent models are based on RLHF models, like ChatGPT - hence I mention ChatGPT, to make it clear to cursory readers that I am not referring to OpenAI's older tier of embedding models based on non-RLHF models.

    • curl-up 4 days ago

      The tone of your post is really strange and condescending; not sure why. You made a statement that I, in my work, very often see people make when they first start learning about embeddings (expecting words that we humans see as "opposite" to actually have opposite embeddings), and I corrected it, as it might help other people reading this thread.

      > Firstly, you can choose what you embed the word with, such as "Article Topic:" or "Temperature:" to adjust the output of the embedding and results of cosine similarity to be relevant for your use case

      As far as LLM-based embeddings go, unless you train the model for this type of format, this is not true at all. In fact, the opposite is true - adding such qualifiers before your text only increases the similarity, as those two texts are, in fact, more similar after such additions. I am aware that instruct-embedding models work this way, but their performance and flexibility are, in my experience, very limited.

      As for the rest of your post, I really don't see why you are trying to convince me that LLM-based embeddings have so much more to them than previous models. I am very well aware of this - my work revolves around such new models. I simply corrected a common misconception that you stated, and I don't really care whether you really think that, or whether you know the truth and just wrote it as an off-hand remark.

      • authorfly 4 days ago

        Saying "perfectly opposite" does not need to mean the mathematical cosine similarity would be -1. The point you implied by bringing up this irrelevant information was to be dismissive of the relevance of generative-model embeddings for different tasks (and 0.46 is less similar than you get from previous embedding models, which don't have the rich context of LLMs or RLHF models). This is why you got the snarky tone back: you took an unnecessarily literal interpretation, and revealed in your later paragraphs a dated attitude to embeddings that you tend to get from a surface-level understanding, i.e. that adjective, noun, or other PoS type or presence is more important for similarity (e.g. adjectives are closer to each other in Word2Vec but NOT consistently so in generative embeddings).

        Of course embeddings with a shared prefix will generally be closer. You misunderstand the use case and are looking at embeddings in an outdated way. The point is this:

        When I want to use embeddings to model newspaper articles, I put "Article:" in front of the topic as I embed it, and for that purpose they will suit my needs better. When I need to use embeddings for temperature or scientific-literature purposes, I might put "Temperature:" in front of them, and "Burning"/"Freezing" will be further apart. That is useful in a way that Word2Vec, GloVe, and even (to a lesser degree) SBERT cannot match.

        The misconception you claim is based on Word2Vec and GloVe, and it isn't true in general - words can have several senses (polysemy), as can phrases, so it's a difficult point to argue in the first place. Your statement that "words with opposite meanings have a lot in common" is only true of embeddings from Word2Vec, GloVe, and the early BERT era, which are quickly falling out of fashion because they are limited. Your understanding is dated, and you see a misconception because you have failed to adequately explore or understand the possible use cases and representations viable with these embeddings. There is so much more! You can embed across languages. You can embed conversations!

        As for your appeal to authority - I don't need to make such a claim. I'm sorry if you work in a job stuck in the past, trying to apply a pre-2020 understanding of NLP to 2024 models, but that sounds like your choice. To me, it sounds like you're assuming the past holds true and taking points absolutely; is that really wise in a fast-changing field? There have been several hackathons about embeddings. Try exploring the recent ones and look at what is really possible.