authorfly 4 days ago

Saying "Perfectly opposite" does not need to mean the mathematical cosine similarity would be -1. The point you implied by bringing up this irrelevant information is to be dismissive of the relevance of generative model embeddings for different tasks (and 0.41 is less similar than you get in previous embedding modes which don't have the rich context of LLMs or RLFF models). This is why you got the snarky tone back, you took an unnecessary literal interpretation, and revealed in your later paragraphs a dated attitude to embeddings that you tend to get from a surface level understanding i.e. that adjective, noun or other PoS type or presence is more important for similarity (e.g. adjectives are closer to each other in Word2Vec but NOT consistently so in generative embeddings).

Of course prefixed embeddings will generally be closer. You misunderstand the use case and are looking at embeddings in an outdated way. The point is this:

When I want to use embeddings to model newspaper articles, I put "Article:" in front of the topic as I embed it, and for that purpose the embeddings will suit my needs better. When I need to use embeddings for temperature or scientific literature purposes, I might put "Temperature:" in front of them, and "Burning"/"Freezing" will be further apart. That is useful in a way that Word2Vec, GloVe, and to a lesser degree SBERT cannot match.
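To make that concrete, here is a minimal sketch, assuming the sentence-transformers package and the all-MiniLM-L6-v2 model purely as an example (any recent embedding model with a Python API would do):

    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")

    def cos(a, b):
        # embeddings are L2-normalised, so the dot product is the cosine similarity
        return float(a @ b)

    plain = model.encode(["Burning", "Freezing"], normalize_embeddings=True)
    prefixed = model.encode(["Temperature: Burning", "Temperature: Freezing"],
                            normalize_embeddings=True)

    print("plain:   ", cos(plain[0], plain[1]))
    print("prefixed:", cos(prefixed[0], prefixed[1]))

How far apart the prefixed pair comes out depends on the model; the point is that the prefix gives you a lever that word-level embeddings simply do not have.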

The misconception you claim is based on Word2Vec and GloVe and is not true in general. Words can have several senses (polysemy), as can phrases, so it is a difficult point to argue for in the first place. Your statement that "words that have the opposite meaning will have opposite embeddings. Instead, words with opposite meanings have a lot in common" is only true of embeddings from Word2Vec, GloVe, and the early BART era, which are quickly falling out of fashion because they are limited. Your understanding is dated, and you see a misconception because you have failed to adequately explore or understand the possible use cases and representations viable with these embeddings. There is so much more! You can embed across languages. You can embed conversations!

As for your appeal to authority: I don't need to make such a claim. I'm sorry if you work in a job stuck in the past, trying to apply a pre-2020 understanding of NLP to 2024 models, but, well, that sounds like your choice. To me it sounds like you are assuming the past still holds and taking points absolutely; is that really wise in a fast-changing field? There have been several hackathons about embeddings. Try exploring the recent ones and look at what is really possible.

curl-up 4 days ago

Which part of "my work revolves around such new models" did you misunderstand? Claiming that:

> Of course prefixed embeddings will generally be closer.

and then later:

> When I need to use embeddings for temperature or scientific literature purposes, I might put "Temperature:" in front of them, and "Burning"/"Freezing" will be further apart.

is awesome. Have a good day!

  • authorfly 3 days ago

    “Of course prefixed embeddings will generally be closer” is true; they are closer in that “Temperature: Burning”, “Temperature: House”, and “Temperature: Freezing” will all be closer to each other than “Burning”, “House”, and “Freezing” are.

    You seem confused again, because you are limited by prior beliefs about embeddings. The relative increase in distance between “Temperature: Burning” and “Temperature: Freezing” is the value, and it is what we want from generative embeddings.

    It means that adding “Temperature:” allows us to differentiate more on temperature here, whether or not prefixing “Temperature:” to everything puts the embeddings in a closer region of the space. I mentioned that only to make clear that “prefixing brings the texts closer together” is an irrelevant counter-argument when the relative distance increases more.

    “Temperature: Burning”, “Temperature: House”, and “Temperature: Freezing” all being closer is irrelevant to that, because we work out the absolute cosine similarity between the examples we have and can use the extremes to cluster or do other work later on. The group as a whole being closer has no impact on the usefulness of the distance between “Temperature: Freezing” and “Temperature: Burning” for things like k-means clustering.
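    A minimal sketch of that clustering point, again assuming sentence-transformers and all-MiniLM-L6-v2 purely as examples, with scikit-learn for the k-means step:

        from sentence_transformers import SentenceTransformer
        from sklearn.cluster import KMeans

        model = SentenceTransformer("all-MiniLM-L6-v2")

        texts = ["Burning", "Scorching", "Freezing", "Icy", "House", "Table"]
        prefixed = ["Temperature: " + t for t in texts]

        # The prefixed group as a whole may sit in a tighter region of the space,
        # but k-means only uses the relative distances within this set.
        emb = model.encode(prefixed, normalize_embeddings=True)

        labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(emb)
        for text, label in zip(texts, labels):
            print(label, text)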

    Let me clarify one final time: my point was that more recently released embeddings let you do more interesting things and embed word (and sentence) senses, and other kinds of content, more meaningfully than the previously limited approaches of Word2Vec, GloVe, etc. You take it as given that adjectives appearing in similar sentence positions means they innately should have similar embeddings, and that I am being stupid for missing that this would be so; what you are not seeing is that this is an artifact of the limited contextual representation available to language models before the GPTs, and of the fact that they were not modelling text in the same way (i.e. learned positional embeddings and context lengths of ~1,000 tokens vs. single-sentence training samples with masked tokens). Putting adjectives with opposite semantic meanings in the same region of embedding space was a limitation of previous embedding regimes like Word2Vec, not a positive property, and one that NLU researchers often used as a criticism of the field.

    As for your point about the cosine similarity of “Burning” and “Freezing” not being “perfectly opposite”: I said “opposite”, not “perfectly opposite”, and I was not referring to -1, which you have misinterpreted. Perfectly opposite means something different from opposite. The claim that I want “Freezing” and “Burning” to be opposite is not a claim that their cosine similarity should be -1; that is obviously absurd for a transformer training regime, and fairly obvious to anyone who works with embeddings, who will notice that they tend to clump together. The claim is that, relative to alternative embeddings, the cosine similarity should be much lower than it would be with the older models, especially if you contextually prefix the embedding (such prefixing was not that effective, or at least not consistently effective, with BERT, and it is useless for a word-embedding regime for obvious reasons: you are providing the same word vector to both sides of the comparison, so the cosine distance won't change). Which it is: the cosine similarity of “Freezing” and “Burning” is lower than with BERT, and the cosine similarity of “great disadvantage” and “great advantage” is lower too, which rather contradicts your universal claim that “as adjectives, they are close together!”. No, that is just an artifact of older models, and it was more a limitation than a feature.

    It is obvious to a mathematical mind who works with similarity measures regularly that two embeddings trained in a multi-sample unsupervised learning system, especially a transformer network, can never have a cosine of -1, whether you train on a squared error over positive examples (very similar sentences) or with negative samples (pairs of dissimilar sentences, e.g. contrastive learning). You encounter this with PCA, you encounter it with the layers of neural networks, you encounter it with embeddings; it is a regular property of machine learning. So, I don't mean to be rude, but not being aware of this suggests you don't have a deep understanding of machine learning: with sufficient experience of similarity measures you never see an absolute extreme value, because of the pull of the other samples as you go through the training data. In fact, for many years (2008-2018) many practitioners struggled with related issues, such as nodes stuck at 0 with inert values, until new approaches overcame them. Of course it is possible to produce a -1 representation by simply manipulating matrices without training. The fact that you took the claim literally told me that you do not understand transformer embedding architectures and are probably unaware of how much embeddings have changed. Given you missed this in my original post, I hope it has now become clear to you.
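    You can check the clumping empirically; a minimal sketch, again assuming sentence-transformers and all-MiniLM-L6-v2 purely as an example model:

        from sentence_transformers import SentenceTransformer
        import numpy as np

        model = SentenceTransformer("all-MiniLM-L6-v2")
        texts = ["Burning", "Freezing", "great advantage", "great disadvantage"]

        # Normalised embeddings: the pairwise dot products are cosine similarities.
        emb = model.encode(texts, normalize_embeddings=True)
        print(np.round(emb @ emb.T, 2))

    The off-diagonal values will vary by model, but nothing sits anywhere near -1; trained sentence embeddings clump into a relatively narrow cone.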

    In the limit, the majority of the most recently encountered training samples were not the two samples in question, either because those two were not encountered at all or because the training set has at least as many samples as there are tokens, and there are 56k-100k tokens depending on the LLM; so of the last 5 samples (ignoring batches), at least 3 were not those two. Each sample has, on average, the same pull on the weights that shape the cosine space, and each training iteration necessarily affects all weights. The majority of iterations push the network weights in directions that may be similar to, but are not the same as, the direction that would pull our two given samples towards a cosine similarity of -1, and this effect grows with the size of the training set; therefore a cosine similarity of -1 is not reachable. So long as there are 5 unique pieces of text in a transformer training regime, and the random weights are not initialised to give exactly -1, there will never be a cosine similarity of -1. The same argument holds for PCA and for basically any training regime with distributed samples (a normal distribution is not even necessary) and any kind of weight-based learning of representations, much as in a geometric proof.
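    A toy numeric illustration of that intuition (this is not the training dynamics themselves, just the geometry): a cosine of exactly -1 needs the two vectors to be exactly antiparallel, and any small pull in another direction breaks it.

        import numpy as np

        rng = np.random.default_rng(0)
        u = rng.standard_normal(384)
        u /= np.linalg.norm(u)

        v = -u                                      # exactly antiparallel: cosine = -1
        w = v + 1e-3 * rng.standard_normal(384)     # a tiny pull from other samples
        w /= np.linalg.norm(w)

        print(u @ v)   # -1 (up to floating point)
        print(u @ w)   # strictly greater than -1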