Comment by souvik3333 14 hours ago

Hi, author of the model here.

We have a benchmark for evaluating VLMs on document understanding tasks: https://idp-leaderboard.org/ . Unfortunately, it does not include image-to-markdown as a task. The problem with evaluating image-to-markdown is that output whose blocks come in a different order can still be correct. E.g., if the image has seller info and buyer info side by side, one model can extract the seller info first and another can extract the buyer info first. Both models are correct, but depending on the ground truth annotation, fuzzy matching will give one of them a higher accuracy than the other.

Typically, a company trains and tests on datasets annotated with one consistent block order (either left block first or right block first), so every other model can score poorly on that benchmark simply because it was trained on the opposite annotation order. The sketch below shows how an order-sensitive fuzzy match penalizes a reordered but otherwise identical extraction.
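As a minimal illustration (not the benchmark's actual scoring code; the block strings and scoring function are made up, and Python's standard-library difflib stands in for whatever fuzzy matcher a benchmark might use):

```python
# Two extractions with identical content but different block order get very
# different scores when fuzzy-matched against the ground truth as one string.
from difflib import SequenceMatcher

seller = "Seller: Acme GmbH, Berlin"   # hypothetical block contents
buyer = "Buyer: Foo Ltd, London"

ground_truth = seller + "\n" + buyer   # annotation happens to list the seller first
model_a = seller + "\n" + buyer        # extracts the left block first
model_b = buyer + "\n" + seller        # extracts the right block first -- also correct

def fuzzy_score(pred: str, ref: str) -> float:
    """Order-sensitive fuzzy match over the whole document."""
    return SequenceMatcher(None, pred, ref).ratio()

print(fuzzy_score(model_a, ground_truth))  # ~1.0
print(fuzzy_score(model_b, ground_truth))  # roughly 0.5, despite identical content
```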

tensor 5 hours ago

The more important thing to me with any VLM is base OCR performance and hallucinations. It's not too hard to get improved average accuracy on very low-quality scans using language models. Unfortunately, these also typically produce large numbers of hallucinations, which are a deal breaker if you are trying to extract values for financial or legal purposes.

OCR that has lower accuracy, but where the inaccurate parts are left blank or flagged, is far superior. Mistral OCR also suffers from this problem.

If your model produced bounding boxes for every text line and ran a traditional OCR engine on those regions, this could alleviate the problem. Or at the very least, bounding boxes let users cross-correlate the output with that of a traditional OCR engine, along the lines of the sketch below.
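To be concrete about the cross-correlation idea (purely a sketch, not anyone's actual pipeline; the helper names and threshold are invented, and difflib stands in for any string-similarity measure): any VLM line with no close match in the traditional OCR text gets flagged for review instead of being trusted.

```python
from difflib import SequenceMatcher

def best_match(line: str, ocr_lines: list[str]) -> float:
    """Highest similarity between one VLM line and any traditional-OCR line."""
    return max((SequenceMatcher(None, line, o).ratio() for o in ocr_lines), default=0.0)

def flag_suspect_lines(vlm_lines: list[str], ocr_lines: list[str], threshold: float = 0.8):
    """Return (line, score) pairs for VLM lines unsupported by traditional OCR."""
    return [(l, s) for l in vlm_lines if (s := best_match(l, ocr_lines)) < threshold]

# vlm_lines would come from the VLM's markdown output split into lines;
# ocr_lines from a conventional engine such as Tesseract run on the same page.
```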

Also, a small note: it's probably best not to say your product beats Mistral when it hasn't even been tested against it. Having more features doesn't make a product better if the accuracy on those features isn't better.

I don't mean to be discouraging; this is an important space, and it looks like you have a very feature-rich model. I'd like to see a good solution get developed!

krapht 6 hours ago

If this is the only issue, can't it be addressed by normalizing the post-processed data before scoring (that is, if it really is just a matter of block ordering)? Something like the sketch below.
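For example, a rough order-invariant scorer (my own sketch, assuming blocks are separated by blank lines and using a greedy best-match assignment; it only helps if block order really is the only difference):

```python
from difflib import SequenceMatcher

def blocks(md: str) -> list[str]:
    """Split a markdown document into blocks separated by blank lines."""
    return [b.strip() for b in md.split("\n\n") if b.strip()]

def order_invariant_score(pred: str, ref: str) -> float:
    """Greedily pair each ground-truth block with its best-matching predicted
    block, so that block order no longer affects the score."""
    pred_blocks, ref_blocks = blocks(pred), blocks(ref)
    remaining = list(pred_blocks)
    scores = []
    for r in ref_blocks:
        if not remaining:
            scores.append(0.0)
            continue
        best = max(remaining, key=lambda p: SequenceMatcher(None, p, r).ratio())
        scores.append(SequenceMatcher(None, best, r).ratio())
        remaining.remove(best)
    return sum(scores) / max(len(ref_blocks), 1)
```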