Comment by simonw 9 hours ago

7 replies

It still amazes me that the binary trick works.

For anyone who hasn't seen it yet: it turns out many embedding vectors of e.g. 1024 floating point numbers can be reduced to a single bit per value that records if it's higher or lower than 0... and in this reduced form much of the embedding math still works!

This means you can e.g. filter to the top 100 using extremely memory efficient and fast bit vectors, then run a more expensive distance calculation against those top 100 with the full floating point vectors to pick the top 10.
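A minimal sketch of that two-stage search, in pure Python so it runs without dependencies. The corpus here is random synthetic data, cosine similarity is assumed as the "more expensive" distance, and the helper names (`binarize`, `hamming`, `two_stage_search`) are made up for illustration. A real system would more likely pack the bits into byte arrays and use a hardware popcount.

```python
import math
import random

def binarize(vec):
    """Pack one bit per dimension (1 if the value is > 0) into a Python int."""
    bits = 0
    for value in vec:
        bits = (bits << 1) | (1 if value > 0 else 0)
    return bits

def hamming(a, b):
    """Number of bit positions where two packed vectors differ."""
    return bin(a ^ b).count("1")

def cosine(u, v):
    """Cosine similarity on the full floating point vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

def two_stage_search(query, corpus, corpus_bits, shortlist=100, top_k=10):
    """Shortlist by Hamming distance on the bits, then rerank with full floats."""
    q_bits = binarize(query)
    by_hamming = sorted(range(len(corpus)),
                        key=lambda i: hamming(q_bits, corpus_bits[i]))
    candidates = by_hamming[:shortlist]
    reranked = sorted(candidates,
                      key=lambda i: cosine(query, corpus[i]), reverse=True)
    return reranked[:top_k]

# Synthetic stand-in for real embeddings (assumption: 500 vectors, 1024 dims).
random.seed(0)
DIMS = 1024
corpus = [[random.gauss(0, 1) for _ in range(DIMS)] for _ in range(500)]
corpus_bits = [binarize(v) for v in corpus]

# A query that is a slightly noisy copy of document 42.
query = [x + random.gauss(0, 0.1) for x in corpus[42]]
result = two_stage_search(query, corpus, corpus_bits)
```

Only the shortlist ever touches the float vectors, so those can live on slower storage while the bit vectors stay in memory.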

xfalcox 6 hours ago

I was taken aback when I saw what was basically zero recall loss in the real-world task of finding related topics, by doing the same thing you described: we over-capture with binary embeddings and only use the full (or half) precision vectors on that subset.

Making the index 32 times cheaper to store is what lets us offer this at scale without worrying too much about the overhead.

3abiton 4 hours ago

Now that you mention it, I wonder if LSH (locality-sensitive hashing) would perform better with a slightly higher memory footprint.
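For comparison, here is a rough sketch of one common LSH variant, random hyperplane hashing, where each bit is the sign of a dot product with a random direction. Sign-per-dimension binarization is the special case where the directions are the coordinate axes. The function names, seeds, and sizes are illustrative assumptions, and this says nothing about which approach gives better recall in practice.

```python
import random

def make_hyperplanes(n_bits, dims, seed=0):
    """Draw n_bits random Gaussian directions, one per signature bit."""
    rng = random.Random(seed)
    return [[rng.gauss(0, 1) for _ in range(dims)] for _ in range(n_bits)]

def lsh_signature(vec, hyperplanes):
    """One bit per hyperplane: 1 if vec lies on its positive side."""
    bits = 0
    for plane in hyperplanes:
        dot = sum(p * x for p, x in zip(plane, vec))
        bits = (bits << 1) | (1 if dot > 0 else 0)
    return bits

def hamming(a, b):
    """Number of differing signature bits."""
    return bin(a ^ b).count("1")

# Synthetic vectors: one base vector, a noisy copy, and an unrelated one.
rng = random.Random(1)
DIMS = 64
base = [rng.gauss(0, 1) for _ in range(DIMS)]
near = [x + rng.gauss(0, 0.1) for x in base]
far = [rng.gauss(0, 1) for _ in range(DIMS)]

# More bits than dimensions: the "slightly higher memory footprint" option.
planes = make_hyperplanes(n_bits=128, dims=DIMS)
sig_base, sig_near, sig_far = (lsh_signature(v, planes) for v in (base, near, far))
```

Here `hamming(sig_base, sig_near)` comes out much smaller than `hamming(sig_base, sig_far)`, and the number of bits can be tuned independently of the embedding size.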

FuckButtons 8 hours ago

Why is this amazing? It's just a 1-bit lossy compressed representation of the original information. If you have a vector in n-dimensional space, this effectively just records the sign of its component along each basis vector.

  • simonw 7 hours ago

    You can take 4096 bytes of information (1024 x 32-bit floats) and reduce that to 128 bytes (1024 bits, a 32x reduction in size!) and still get results that are about 95% as good.

    I find that cool and surprising.
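The size arithmetic for a 1024-dimensional embedding can be checked in a few lines. This is only a back-of-the-envelope calculation on raw vector payloads; it ignores any index or metadata overhead.

```python
DIMS = 1024

float32_bytes = DIMS * 4   # 32 bits = 4 bytes per value
float16_bytes = DIMS * 2   # half precision: 2 bytes per value
binary_bytes = DIMS // 8   # 1 bit per value, 8 values per byte

print(float32_bytes, float16_bytes, binary_bytes)  # 4096 2048 128
print(float32_bytes // binary_bytes)               # 32
print(float16_bytes // binary_bytes)               # 16
```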

    • computably 38 minutes ago

      1024 bits for a hash is pretty roomy. The embedding "just" has to be well-distributed across enough of the dimensions.

    • sa-code 7 hours ago

      I'm with you, it's very satisfying to see a simple technique work well. It's impressive.