Comment by gsuuon

I'm a little suspicious of the Isaac Newton example. The values for the better answer are very close, so I wonder whether the ordering holds up against small rewordings of the prompt.

Another approach, if you're working with a local model, is to ask for a one-word summary and then work with the resulting logits (wish I could find the article/paper that introduced this). You could compare similarity just by counting how many words the top 500 of two queries share, for example.
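To make the top-500 idea concrete, here's a rough sketch of the overlap count. It assumes you already have the next-token logits for each query as a plain list of floats over the same vocabulary; how you get those out of your local model isn't shown. The names `top_k_ids` and `top_k_overlap` are made up for illustration, and this is only my reading of the idea, not the method from the article I can't find.

```python
import heapq


def top_k_ids(logits, k):
    """Return the set of token ids with the k largest logits."""
    return set(heapq.nlargest(k, range(len(logits)), key=logits.__getitem__))


def top_k_overlap(logits_a, logits_b, k=500):
    """Fraction (0.0 to 1.0) of the top-k token ids shared by two logit vectors.

    Assumption: both vectors are indexed by the same vocabulary.
    """
    if len(logits_a) != len(logits_b):
        raise ValueError("logit vectors must cover the same vocabulary")
    k = min(k, len(logits_a))  # clamp k for tiny vocabularies
    if k == 0:
        return 0.0
    shared = top_k_ids(logits_a, k) & top_k_ids(logits_b, k)
    return len(shared) / k


# Toy example with a 6-token vocabulary and k=3.
query_a = [5.0, 4.0, 3.0, 0.1, 0.2, 0.3]  # top 3 ids: 0, 1, 2
query_b = [4.5, 0.2, 3.5, 0.1, 4.0, 0.0]  # top 3 ids: 0, 4, 2
print(top_k_overlap(query_a, query_b, k=3))  # ids 0 and 2 are shared, so 2/3
```

This ignores rank and logit magnitude entirely and only looks at set membership, so it's a blunt measure. It would probably also be worth running it on reworded prompts to see how stable the overlap is.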