Comment by xfalcox
I was taken aback when I saw what was basically zero recall loss in the real-world task of finding related topics, by doing the same thing you described: over-capturing with binary embeddings and only using full (or half) precision on that subset.
Making the storage cost of the index 32 times smaller is what makes it possible to offer this at scale without worrying too much about the overhead.
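For anyone wanting to see the shape of the over-capture-then-rescore idea, here is a minimal NumPy sketch. It is not the actual implementation being discussed (no pgvector, made-up dataset sizes and candidate counts); it just shows binarizing embeddings, taking a generous candidate set by Hamming distance, and re-ranking only that subset with full-precision cosine similarity. Packing the bits into bytes is also where the ~32x storage reduction versus float32 comes from.

```python
import numpy as np

def binarize(vecs: np.ndarray) -> np.ndarray:
    """Collapse each float32 dimension to one bit and pack 8 bits per byte.
    128 bytes per 1024-dim vector instead of 4096 bytes: the ~32x saving."""
    return np.packbits(vecs > 0, axis=-1)

def hamming_candidates(query_bits, index_bits, n_candidates):
    """Over-capture: keep the n_candidates nearest documents by Hamming distance."""
    xor = np.bitwise_xor(index_bits, query_bits)
    dists = np.unpackbits(xor, axis=1).sum(axis=1)
    return np.argsort(dists)[:n_candidates]

def rescore(query_full, index_full, candidate_ids, k):
    """Re-rank only the captured subset with full-precision cosine similarity."""
    cands = index_full[candidate_ids]
    sims = cands @ query_full / (
        np.linalg.norm(cands, axis=1) * np.linalg.norm(query_full) + 1e-12
    )
    return candidate_ids[np.argsort(-sims)[:k]]

# Toy data: 10k documents with 1024-dim embeddings.
rng = np.random.default_rng(0)
docs_full = rng.standard_normal((10_000, 1024)).astype(np.float32)
docs_bits = binarize(docs_full)

query_full = rng.standard_normal(1024).astype(np.float32)
query_bits = binarize(query_full)

candidates = hamming_candidates(query_bits, docs_bits, n_candidates=200)  # over-capture
top10 = rescore(query_full, docs_full, candidates, k=10)                  # final ranking
print(top10)
```

Only the packed bit vectors need to live in the index; the full-precision vectors are touched for just the small candidate subset.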
> I was taken aback when I saw what was basically zero recall loss in the real-world task of finding related topics
By reducing the values to a single bit you're lumping together things that were different before, so documents become harder to tell apart, not harder to find; I wouldn't expect recall loss from that.
Also: even if your vector is only 100-dimensional, there are already 2^100 distinct bit vectors. That's over 10^30.
If your dataset isn't gigantic and its documents are even moderately dispersed in that space, the likelihood of many of them sharing the same bit vector is small.
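A quick sanity check on those numbers (a toy calculation, not from the comment; the uniform-independent-bits assumption is optimistic, since bits of real embeddings are correlated, but it illustrates how roomy 2^100 is):

```python
from math import comb

dims = 100                 # bits per vector
space = 2 ** dims          # number of distinct bit vectors
print(f"{space:.3e}")      # ~1.268e+30, i.e. over 10^30

# Birthday-style estimate, assuming every bit were an independent coin flip:
# expected number of colliding pairs among n documents.
n = 10_000_000
print(comb(n, 2) / space)  # ~3.9e-17, essentially zero even for 10M docs
```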