tsoukase 4 days ago

If an LLM hallucinates on 1% of occasions and gives subpar output on another 5%, that kills its usefulness as a replacement for anyone. Imagine a support agent on the other end of the phone speaking gibberish 10 times a day (at a couple hundred exchanges per shift, a ~5% failure rate is roughly ten bad answers). Now imagine a doctor. Those people will never lose their jobs.

Bratmon 4 days ago

> Imagine a support agent on the other end of the phone speaking gibberish 10 times a day.

A massive improvement?

simianwords 4 days ago

But LLMs don't speak gibberish 10 times a day even now. In my usage, ChatGPT has not said one obviously strange thing since o3 came out.

  • HEmanZ 4 days ago

    What are you working on that they are so knowledgeable? Even the best models absolutely make stuff up, even to this day. I literally spend all day every day working with them (all the latest ChatGPT models) and it's still 10-15% BS.

    Earlier today I had ChatGPT 5.2 Thinking straight up make up an API after I pasted the full API spec to it, and build its whole response around a public API that did not exist. And the Claude CLI with Sonnet 4.5 made up the craziest reason why my curl command wasn't working (that curl itself was bugged, not the obvious one: it couldn't resolve the domain name it tried to use) and almost went down the path of installing a bunch of garbage tools.

    These are not ready to be unsupervised. Yet.

    • falkensmaize 4 days ago

      Just today I had Claude Opus 4.5 try to write to a fictional Mac user account on my computer during a coding session. It was pretty weird: the name was specific and unique enough that it was clearly bleed-through from training data. It wasn't like "John Smith" or something.

      That’s the kind of thing that on a large scale could be catastrophic.

    • simianwords 4 days ago

      For coding, if you haven't hooked your workflow up to a test -> code feedback loop, you're doing it wrong. I agree it doesn't get things right all the time, but this loop is what corrects it.
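
      A minimal sketch of that loop, assuming a pytest project (`ask_llm` is a hypothetical stand-in for whatever model call your tooling actually makes):

      ```python
      import subprocess

      def ask_llm(prompt: str) -> str:
          """Hypothetical stand-in: call whatever model or CLI you actually use."""
          raise NotImplementedError

      def fix_until_green(source_path: str, max_rounds: int = 5) -> bool:
          """Run the tests; on failure, feed the failures and the current
          code back to the model, write out its fix, and try again."""
          for _ in range(max_rounds):
              result = subprocess.run(
                  ["pytest", "-x", "--tb=short"],
                  capture_output=True, text=True,
              )
              if result.returncode == 0:
                  return True  # tests pass: the loop caught and fixed the errors
              with open(source_path) as f:
                  code = f.read()
              patched = ask_llm(
                  "These tests failed:\n" + result.stdout
                  + "\nCurrent code:\n" + code
                  + "\nReturn the corrected file, and nothing else."
              )
              with open(source_path, "w") as f:
                  f.write(patched)
          return False  # still red after max_rounds: escalate to a human
      ```

      The point is that the model never gets the last word; the test run does.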

      For other things, like ordinary question answering in the ChatGPT window, it hasn't really said anything incorrect. Very, very few instances.

    • HEmanZ 4 days ago

      But maybe your point is that it isn't gibberish, it's "seems correct but isn't," which is honestly more dangerous.

      • simianwords 4 days ago

        You are incorrect. "Seems correct but isn't" is fine as long as, the rest of the time, it is accurate at a high enough rate.

        "Seems correct but isn't" is also the most common way humans get things wrong.