Comment by visarga
My chunk rewriting method is to use an LLM to generate a title, summary, keyword list, topic, parent topic, and grandparent topic. Then I embed the concatenation of all of them instead of just the original chunk. This helps a lot.
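A minimal sketch of what that expansion step could look like, assuming placeholder `llm(prompt)` and `embed(text)` helpers rather than any particular API:

```python
# Sketch of pre-embedding chunk expansion.
# llm(prompt) -> str and embed(text) -> list[float] are placeholders.

EXPANSION_PROMPT = """For the text below, write:
Title:
Summary:
Keywords:
Topic:
Parent topic:
Grandparent topic:

Text:
{chunk}"""

def expand_and_embed(chunk: str, llm, embed):
    # Ask the LLM to draw out the chunk's implicit semantic information.
    expansion = llm(EXPANSION_PROMPT.format(chunk=chunk))
    # Embed the concatenated expansion (optionally together with the
    # original chunk) rather than the raw chunk alone.
    return embed(expansion + "\n\n" + chunk)
```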
One fundamental problem with cosine similarity is that it works at the surface level. For example, "5+5" won't embed close to "10", and "The 5th word of this phrase" won't be similar to "this".
Any implicit knowledge won't be captured by simple cosine similarity, which is why we need to draw out those implicit deductions before embedding. Hence my approach of pre-embedding expansion of the chunk's semantic information.
I basically treat text like code, and have to "run the code" to get its meaning unpacked.
If you ask, "Is '5+5' similar to '10'?", the answer depends on which notion of similarity you have in mind: the two strings differ in several ways (different symbols, one is an expression, the other is just a number). But if you ask, "Does '5+5' evaluate to the same number as '10'?", you will likely get what you are looking for.
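A small illustration of the two notions of similarity (the embedding model name is just an example; any sentence embedder would do):

```python
# "5+5" vs "10": surface-level embedding similarity vs. evaluating
# both expressions to see if they yield the same number.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # example model choice

a, b = "5+5", "10"

# Surface similarity: the embeddings compare the strings as text, not
# their values, so this score may be low despite the "equality".
emb = model.encode([a, b])
print("cosine:", util.cos_sim(emb[0], emb[1]).item())

# Evaluation equivalence: "run the code" to unpack the meaning.
print("same value:", eval(a) == eval(b))  # True
```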