Comment by pbhjpbhj

Comment by pbhjpbhj a year ago

That sounds like a problem for the embedding, would you need to renormalise so that low signal inputs could be well represented. A white square and a red square shouldn't be different levels of details. Depending on the purpose of the vector embedding, there should be a difference between images of mostly white pixels and partial images.

Disclaimer, I don't know shit.

pamelafox a year ago

I should clarify that I experienced these issues with text-embedding-ada-002 and the Azure AI vision model (based on Florence). I have not tested many other embedding models to see if they'd have the same issue.

Reply View 3 replies

refulgentis a year ago

FWIW I think you're right, we have very different stacks, and I've observed the same thing, with a much clunkier description thank your elegant way of putting it.
I do embeddings on arbitrary websites at runtime, and had a persistent problem with the last chunk of a web page matching more things. In retrospect, its obvious that the smaller the chunk was, the more it was matching everything
Full details: MSMARCO MiniLM L6V3 inferenced using ONNX on iOS/web/android/macos/windows/linux

Reply View | 0 replies
mattvr a year ago

You could also work around this by adding a scaling transformation that normalizes and centers (e.g. sklearn StandardScaler) in between the raw embeddings — based on some example data points from your data set. Might introduce some bias, but I’ve found this helpful in some cases with off the shelf embeddings.

Reply View | 0 replies
OutOfHere a year ago

Use horrible quality embeddings and get horrible results. No surprise there. ada is obsolete - I would never want to use it.

Reply View | 0 replies