Comment by jongjong 2 days ago

Interesting. All developers I know who have tinkered with embeddings and vector similarity scoring were instantly hooked. The efficiency of computing the embeddings once and then reusing them as many times as needed, comparing the vectors with a cheap <30-line function, is extremely appealing. Not to mention the indexing capabilities that make it work at scale.
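
For illustration, a minimal sketch of that cheap comparison function, assuming plain Python and embeddings that have already been computed and stored as lists of floats (the vectors and names below are made up):

  import math

  def cosine_similarity(a, b):
      # Assumes both vectors have the same length and non-zero norm.
      dot = sum(x * y for x, y in zip(a, b))
      norm_a = math.sqrt(sum(x * x for x in a))
      norm_b = math.sqrt(sum(x * x for x in b))
      return dot / (norm_a * norm_b)

  # Compute embeddings once (with whatever model you use), then reuse them:
  query_vec = [0.1, 0.3, 0.5]            # hypothetical precomputed embedding
  doc_vecs = {"doc1": [0.1, 0.2, 0.6],   # hypothetical document embeddings
              "doc2": [0.9, 0.1, 0.0]}
  best = max(doc_vecs, key=lambda d: cosine_similarity(query_vec, doc_vecs[d]))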

IMO vector embedding is the most important innovation in computing of the last decade. There's something magical about it. These people deserve some kind of prize. The idea that you can reduce almost any intricate concept, including whole paragraphs, to a fixed-size vector that encapsulates its meaning and its proximity to other concepts across a large number of dimensions is pure genius.

_jayhack_ 2 days ago

Vector embedding is not an invention of the last decade. Featurization in ML goes back to the 60s; even deep-learning-based featurization is decades old at a minimum. Like everything else in ML, this became much more useful with data and compute scale.

liampulles 2 days ago

If you take the embedding for king, subtract the embedding for male, add the embedding for female, and look up the closest embedding, you get queen.

The fact that simple vector addition and subtraction can encode concepts like royalty and gender (among all sorts of others) is kind of magic to me.
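
A sketch of how this is usually demonstrated, assuming gensim and one of its small downloadable GloVe models (the choice of model here is mine, not a claim about any particular embedding):

  import gensim.downloader as api

  # Small pretrained GloVe word vectors; downloaded on first use.
  model = api.load("glove-wiki-gigaword-50")

  # king - man + woman, then look up the nearest remaining word.
  result = model.most_similar(positive=["king", "woman"],
                              negative=["man"], topn=1)
  print(result)  # with this model the top hit is typically 'queen'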

  • puttycat 2 days ago

    This was actually shown to not really work in practice.

    • intelkishan 2 days ago

      I have seen this particular example work. You don't get an exact match, but the closest one is indeed queen.

      • godelski 2 days ago

        Yes, but it doesn't generalize very well, even on simple features like gender. If you go look at the embeddings, you'll find that man and woman are neighbors, just as king and queen are [0]. This is a better explanation for the result: you're just taking very small steps in the latent space.

        Here, play around[1]

          mother - parent + man = woman
          father - parent + woman = man
          father - parent + man = woman
          mother - parent + woman = man
          woman - human + man = girl
        
        Or some that should be trivial

          woman - man + man = girl
          man - man + man = woman
          woman - woman + woman = man
          
        Working in very high dimensions is funky stuff. Embedding high-dimensional data into low dimensions results in even funkier stuff.

        [0] https://projector.tensorflow.org/

        [1] https://www.cs.cmu.edu/~dst/WordEmbeddingDemo/
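
        One way to sanity-check the neighbor explanation, reusing the hypothetical gensim/GloVe setup sketched further up the thread, is to look at the raw similarities before doing any arithmetic:

          # Continuing the gensim/GloVe sketch from the earlier comment:
          print(model.similarity("man", "woman"))   # already near neighbors
          print(model.similarity("king", "queen"))  # likewise
          print(model.most_similar(positive=["king", "woman"],
                                   negative=["man"], topn=3))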

      • mirekrusin 2 days ago

        Shouldn't this itself be a part of training?

        Having a set of "king - male + female = queen"-like relations, including more complex phrases, to align the embeddings.

        It seems like a terse, lightweight, information-dense way to capture the essence of knowledge.

ekidd 2 days ago

Vector embeddings are slightly interesting because they come pre-trained with large amounts of data.

But similar ways to reduce huge numbers of dimensions to a much smaller set of "interesting" dimensions have been known for a long time.

Examples include principal component analysis/singular value decomposition, which was the first big breakthrough in face recognition (in the early 90s) and was also used in latent semantic indexing, the Netflix prize, and a large pile of other things. And the underlying technique was invented in 1901.

Dimensionality reduction is cool, and vector embedding is definitely an interesting way to do it (at significant computational cost).
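
For comparison, a minimal sketch of the classical approach, assuming scikit-learn and an arbitrary numeric feature matrix (the shapes below are made up):

  import numpy as np
  from sklearn.decomposition import PCA

  # Hypothetical data: 1000 samples, 512 raw features each.
  X = np.random.rand(1000, 512)

  # Project onto the 16 directions of greatest variance.
  pca = PCA(n_components=16)
  X_reduced = pca.fit_transform(X)             # shape (1000, 16)
  print(pca.explained_variance_ratio_.sum())   # fraction of variance retained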

CuriouslyC 2 days ago

Vector embeddings are so overhyped. They're decent as a secondary signal, but they're expensive to compute and fragile. BM25-based solutions are more robust and have WAY lower latency, at the cost of some accuracy loss versus hybrid solutions. You can get the majority of the lift of hybrid solutions with ingest-time semantic expansion / reverse-HyDE-style input annotation on top of a sparse-embedding BM25, at a fraction of the computational cost.
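
For reference, a minimal BM25 sketch, assuming the rank_bm25 package and naive whitespace tokenization; under this setup, the ingest-time semantic expansion would just add extra terms to each document before indexing:

  from rank_bm25 import BM25Okapi

  corpus = ["vector embeddings are overhyped",
            "bm25 is a robust lexical baseline",
            "hybrid retrieval combines sparse and dense signals"]
  tokenized = [doc.split() for doc in corpus]

  bm25 = BM25Okapi(tokenized)
  scores = bm25.get_scores("sparse retrieval baseline".split())
  print(scores)  # one lexical relevance score per document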

  • jongjong 2 days ago

    But it's much cheaper to compute than inference, and you only have to compute the embedding once for any piece of content and can then reuse it multiple times.

calf 2 days ago

The idea of reducing language to mere bits, in general, sounds like it would violate the Gödel/Turing theorems about computability.