Comment by xfalcox
I was taken aback when I saw what was basically zero recall loss in the real-world task of finding related topics, by doing the same thing you described: over-capturing with binary embeddings and only using full (or half) precision on that subset.
Making the storage cost of the index 32 times smaller is what makes it possible to offer this at scale without worrying too much about the overhead.
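For anyone wanting to see the shape of the over-capture-then-rescore idea, here is a minimal NumPy sketch. It is not the actual implementation being discussed (no pgvector, made-up dataset sizes and candidate counts); it just shows binarizing embeddings, taking a generous candidate set by Hamming distance, and re-ranking only that subset with full-precision cosine similarity. Packing the bits into bytes is also where the ~32x storage reduction versus float32 comes from.

```python
import numpy as np

def binarize(vecs: np.ndarray) -> np.ndarray:
    """Collapse each float32 dimension to one bit and pack 8 bits per byte.
    128 bytes per 1024-dim vector instead of 4096 bytes: the ~32x saving."""
    return np.packbits(vecs > 0, axis=-1)

def hamming_candidates(query_bits, index_bits, n_candidates):
    """Over-capture: keep the n_candidates nearest documents by Hamming distance."""
    xor = np.bitwise_xor(index_bits, query_bits)
    dists = np.unpackbits(xor, axis=1).sum(axis=1)
    return np.argsort(dists)[:n_candidates]

def rescore(query_full, index_full, candidate_ids, k):
    """Re-rank only the captured subset with full-precision cosine similarity."""
    cands = index_full[candidate_ids]
    sims = cands @ query_full / (
        np.linalg.norm(cands, axis=1) * np.linalg.norm(query_full) + 1e-12
    )
    return candidate_ids[np.argsort(-sims)[:k]]

# Toy data: 10k documents with 1024-dim embeddings.
rng = np.random.default_rng(0)
docs_full = rng.standard_normal((10_000, 1024)).astype(np.float32)
docs_bits = binarize(docs_full)

query_full = rng.standard_normal(1024).astype(np.float32)
query_bits = binarize(query_full)

candidates = hamming_candidates(query_bits, docs_bits, n_candidates=200)  # over-capture
top10 = rescore(query_full, docs_full, candidates, k=10)                  # final ranking
print(top10)
```

Only the packed bit vectors need to live in the index; the full-precision vectors are touched for just the small candidate subset.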
> I was taken aback when I saw what was basically zero recall loss in the real-world task of finding related topics
By reducing the values to a single bit you're lumping together things that were different before, so documents become harder to tell apart, not harder to find; I wouldn't expect recall loss from that.
Also: even if your vector is only 100-dimensional, there are already 2^100 distinct bit vectors. That's over 10^30.
If your dataset isn't gigantic and its documents are even moderately dispersed in that space, the likelihood of many of them sharing the same bit vector is small.
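A quick sanity check on those numbers (a toy calculation, not from the comment; the uniform-independent-bits assumption is optimistic, since bits of real embeddings are correlated, but it illustrates how roomy 2^100 is):

```python
from math import comb

dims = 100                 # bits per vector
space = 2 ** dims          # number of distinct bit vectors
print(f"{space:.3e}")      # ~1.268e+30, i.e. over 10^30

# Birthday-style estimate, assuming every bit were an independent coin flip:
# expected number of colliding pairs among n documents.
n = 10_000_000
print(comb(n, 2) / space)  # ~3.9e-17, essentially zero even for 10M docs
```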