Comment by yodon 4 days ago
This looks super valuable!
That said, it's concerning to see that the reported probability of getting a 4 on a die roll is 65%.
Hopefully OpenAI isn't that biased at generating die rolls, so is that number actually giving us information about the accuracy of the probability assessments?
Producing fair dice rolls is not an objective that cloud LLMs are optimized for. You should assume that LLMs cannot perform this task.
This is a problem when people naively use "give an answer on a scale of 1-10" in their prompts. LLMs are biased towards particular numbers (like humans!) and cannot linearly map an answer to a scale.
It's extremely concerning when teams do this in a context like medicine. Asking an LLM "how severe is this condition" on a numeric scale is fraudulent and dangerous.
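If you want to check the number-bias claim for a particular model rather than take it on faith, one rough approach is to collect many "roll a die" answers and compare the counts against a uniform distribution with a chi-square statistic. The sketch below is only an illustration: `chi_square_uniform` is a hypothetical helper, the sample data is made up to mirror the 65% figure mentioned above, and the step that actually queries a model is left out.

```python
from collections import Counter

# Chi-square critical value for 5 degrees of freedom at alpha = 0.05.
CHI2_CRITICAL_DF5_P05 = 11.070


def chi_square_uniform(rolls, faces=6):
    """Chi-square statistic of observed roll counts against a fair die."""
    counts = Counter(rolls)
    expected = len(rolls) / faces
    return sum(
        (counts.get(face, 0) - expected) ** 2 / expected
        for face in range(1, faces + 1)
    )


# Made-up samples (assumption): in practice these would be parsed model outputs.
biased_rolls = [4] * 65 + [1, 2, 3, 5, 6] * 7   # 100 rolls, 4 appears 65 times
fair_rolls = [1, 2, 3, 4, 5, 6] * 20            # 120 rolls, perfectly even

for name, rolls in [("biased", biased_rolls), ("fair", fair_rolls)]:
    stat = chi_square_uniform(rolls)
    verdict = "looks biased" if stat > CHI2_CRITICAL_DF5_P05 else "consistent with fair"
    print(f"{name}: chi2 = {stat:.2f} ({verdict})")
```

The same check works for a 1-10 rating scale by passing `faces=10` and swapping in the critical value for 9 degrees of freedom, although there a non-uniform result only shows clustering on certain numbers, not that any individual rating is wrong.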