Comment by tensor
The more important thing to me with any VLM is base OCR performance and hallucination rate. It's not hard to get improved average accuracy on very low-quality scans using language models. Unfortunately, these also typically produce large numbers of hallucinations, which are a deal-breaker if you're trying to extract values for financial or legal purposes.
OCR that has lower accuracy, but where the inaccurate parts are left blank or flagged, is far superior. Mistral OCR also suffers from this problem.
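To illustrate the flag-don't-guess approach, here's a rough sketch using Tesseract's per-word confidence scores via pytesseract (the threshold value is arbitrary and would need tuning):

```python
# Sketch: flag words below a confidence threshold instead of guessing,
# using Tesseract's per-word confidences via pytesseract.
import pytesseract
from PIL import Image

CONF_THRESHOLD = 60  # Tesseract confidences run 0-100; 60 is arbitrary

def ocr_with_flags(image_path: str) -> str:
    data = pytesseract.image_to_data(
        Image.open(image_path), output_type=pytesseract.Output.DICT
    )
    words = []
    for text, conf in zip(data["text"], data["conf"]):
        if not text.strip():
            continue  # skip empty/structural boxes
        if float(conf) < CONF_THRESHOLD:
            words.append("[?]")  # flag for human review rather than hallucinate
        else:
            words.append(text)
    return " ".join(words)
```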
If your OCR produced bounding boxes for every text line and ran a traditional OCR engine on each line, that could alleviate the problem. At the very least, bounding boxes would let users cross-correlate your output with a traditional OCR engine's.
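A rough sketch of that cross-correlation, assuming the VLM returns per-line boxes and text in a hypothetical {"bbox": ..., "text": ...} format, and using a cheap difflib similarity ratio as the agreement score:

```python
# Sketch: cross-check VLM line transcriptions against a traditional OCR
# engine (Tesseract here) using the VLM's line bounding boxes.
# The vlm_lines format is hypothetical, not any particular product's API.
from difflib import SequenceMatcher
import pytesseract
from PIL import Image

AGREEMENT_THRESHOLD = 0.8  # arbitrary; tune per document type

def flag_disagreements(image_path: str, vlm_lines: list[dict]) -> list[dict]:
    page = Image.open(image_path)
    results = []
    for line in vlm_lines:
        # bbox assumed to be (left, top, right, bottom) in pixels
        crop = page.crop(line["bbox"])
        tess_text = pytesseract.image_to_string(crop).strip()
        similarity = SequenceMatcher(None, line["text"], tess_text).ratio()
        results.append({
            **line,
            "tesseract_text": tess_text,
            "suspect": similarity < AGREEMENT_THRESHOLD,
        })
    return results
```

Lines where the two engines disagree get surfaced for review instead of silently passed through.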
Also, a small note: it's probably best not to say your product beats Mistral when you haven't actually tested against it. Having more features doesn't make a product better if the accuracy on those features isn't better.
I don't mean to be discouraging; this is an important space, and it looks like you have a very feature-rich model. I'd like to see a good solution be developed!