Comment by langcss 10 months ago

You need a lot more power. I found gpt4o struggles with basic OCR of printed text because it hallucinates a lot, while the Tesseract engine (old school) gets it perfect. You need the model to be powerful enough to do everything.

You can work around this, by the way, by sending the output through a checking stage.

So: picture -> gpt4o -> out1; picture -> tesseract -> out2; out1, out2 -> llm.

Might work for sound too.
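
A rough sketch of that checking stage, in case it helps. Everything here is an assumption for illustration: `vision_ocr`, `classic_ocr` and `reconcile` are hypothetical callables standing in for the gpt4o call, the Tesseract call and the final LLM call. The word-level diff with `difflib` is an extra step not described above; it just lets the last stage focus on the spots where out1 and out2 disagree.

```python
import difflib
from typing import Any, Callable, List, Tuple

Diff = Tuple[str, str]


def disagreements(out1: str, out2: str) -> List[Diff]:
    """Return (out1_words, out2_words) pairs for every span where the two differ."""
    a, b = out1.split(), out2.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    diffs: List[Diff] = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            diffs.append((" ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return diffs


def checked_ocr(
    image: Any,
    vision_ocr: Callable[[Any], str],    # hypothetical: picture -> gpt4o -> out1
    classic_ocr: Callable[[Any], str],   # hypothetical: picture -> tesseract -> out2
    reconcile: Callable[[str, str, List[Diff]], str],  # hypothetical: out1, out2 -> llm
) -> str:
    """Run both OCR paths and only call the reconciler when they disagree."""
    out1 = vision_ocr(image)
    out2 = classic_ocr(image)
    # Compare word by word so whitespace differences do not count as errors.
    if out1.split() == out2.split():
        return out2
    return reconcile(out1, out2, disagreements(out1, out2))
```

If both engines agree, the reconciler is skipped entirely, which saves one model call on the easy pages. The same shape should carry over to sound by swapping in two speech-to-text engines.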

falcor84 10 months ago

Interesting, I've actually been using gpt4o extensively for OCR and didn't encounter any significant issues - could I ask you to please give an example of an image of (otherwise legible) text that it hallucinates on?


langcss 10 months ago

I'll send you an email.

schrodinger 10 months ago

Same, it's perfect at OCR. Generating an image with text in it however… nope!


killerstorm 10 months ago

Speech is inherently easier to represent as a sequence of tokens than a high-resolution image.

The best speech-to-text systems are already transformer-based neural networks anyway, so in theory a combined model should only be better.
