Comment by keeeba
I don’t have the experiments to prove this, but from my experience it’s highly variable between embedding models.
Larger, more capable embedding models are better able to separate the different uses of a given word in the embedding space; smaller models are not.
I've been thinking about this a fair bit lately. We have all sorts of benchmarks that measure many factors in detail, but they are all very abstract, and they don't seem to map clearly onto well-observed behaviors. I think we need a different way to characterize them.
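
As a rough sketch of how the sense-separation point could be checked directly, one option is to compare how similar a word's embeddings are within one sense versus across senses. Everything below is an assumption for illustration: the function names are hypothetical, and the 2-D vectors are made-up toy numbers, not output from any real model. In practice the vectors would come from embedding the same word in different sentences with each model being compared.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


def sense_separation(sense_a, sense_b):
    """Mean within-sense similarity minus mean cross-sense similarity.

    Each argument is a list of embeddings for one sense of a word and
    must contain at least two vectors. A higher score means the two
    senses sit further apart in the embedding space.
    """
    within = [
        cosine(u, v)
        for group in (sense_a, sense_b)
        for i, u in enumerate(group)
        for v in group[i + 1:]
    ]
    across = [cosine(u, v) for u in sense_a for v in sense_b]
    return sum(within) / len(within) - sum(across) / len(across)


# Toy, hand-written vectors (NOT real model output) for "bank" used in a
# river sense and in a money sense.
# Hypothetical model that keeps the two senses apart:
separated_river = [[1.0, 0.0], [0.9, 0.1]]
separated_money = [[0.0, 1.0], [0.1, 0.9]]

# Hypothetical model that maps every use of the word to nearly the same point:
collapsed_river = [[1.0, 0.2], [0.9, 0.3]]
collapsed_money = [[0.95, 0.25], [1.0, 0.3]]

separated_score = sense_separation(separated_river, separated_money)
collapsed_score = sense_separation(collapsed_river, collapsed_money)

print(f"separated senses: {separated_score:.3f}")
print(f"collapsed senses: {collapsed_score:.3f}")
```

A score like this is only one possible behavior-level measure, but it is closer to "does the model tell these uses apart" than an aggregate benchmark number.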