Comment by visarga
My chunk rewriting method is to use an LLM to generate a title, summary, keyword list, topic, parent topic, and grandparent topic. Then I embed the concatenation of all of them instead of just the original chunk. This helps a lot.
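A minimal sketch of what that expansion step could look like, assuming placeholder `llm(prompt)` and `embed(text)` helpers rather than any particular API:

```python
# Sketch of pre-embedding chunk expansion.
# llm(prompt) -> str and embed(text) -> list[float] are placeholders.

EXPANSION_PROMPT = """For the text below, write:
Title:
Summary:
Keywords:
Topic:
Parent topic:
Grandparent topic:

Text:
{chunk}"""

def expand_and_embed(chunk: str, llm, embed):
    # Ask the LLM to draw out the chunk's implicit semantic information.
    expansion = llm(EXPANSION_PROMPT.format(chunk=chunk))
    # Embed the concatenated expansion (optionally together with the
    # original chunk) rather than the raw chunk alone.
    return embed(expansion + "\n\n" + chunk)
```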
One fundamental problem with cosine similarity is that it works at the surface level. For example, "5+5" won't embed close to "10", and "The 5th word of this phrase" won't be similar to "this".
Any implicit knowledge won't be captured by simple cosine similarity, which is why we need to draw out those implicit deductions before embedding. Hence my approach of pre-embedding expansion of the chunk's semantic information.
I basically treat text like code, and have to "run the code" to get its meaning unpacked.
If you ask, "Is '5+5' similar to '10'?", the answer depends on which notion of similarity you have in mind: the two strings differ in several ways (different symbols, one is an expression, the other is just a number). But if you ask, "Does '5+5' evaluate to the same number as '10'?", you will likely get what you are looking for.
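A small illustration of the two notions of similarity (the embedding model name is just an example; any sentence embedder would do):

```python
# "5+5" vs "10": surface-level embedding similarity vs. evaluating
# both expressions to see if they yield the same number.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # example model choice

a, b = "5+5", "10"

# Surface similarity: the embeddings compare the strings as text, not
# their values, so this score may be low despite the "equality".
emb = model.encode([a, b])
print("cosine:", util.cos_sim(emb[0], emb[1]).item())

# Evaluation equivalence: "run the code" to unpack the meaning.
print("same value:", eval(a) == eval(b))  # True
```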