Comment by barrell 20 hours ago

39 replies

I use large language models in http://phrasing.app to format retrieved data in a consistent, skimmable way. I switched to mistral-3-medium-0525 a few months back after struggling to get gpt-5 to stop producing gibberish. It's been insanely fast, cheap, and reliable, and it follows formatting instructions to the letter. I was (and still am) super impressed. Even if it doesn't hold up in benchmarks, it has outperformed in practice.

I'm not sure how these new models compare to the biggest and baddest models, but if price, speed, and reliability are a concern for your use cases I cannot recommend Mistral enough.

Very excited to try out these new models! To be fair, mistral-3-medium-0525 still occasionally produces gibberish in ~0.1% of my use cases (vs gpt-5's 15% failure rate). Will report back if that goes up or down with these new models.

mrtksn 20 hours ago

Some time ago I canceled all my paid subscriptions to chatbots because they are interchangeable, so I just rotate between Grok, ChatGPT, Gemini, DeepSeek, and Mistral.

On the API side of things, my experience is that the model behaving as expected is the greatest feature.

There I also switched to OpenRouter instead of paying directly, so I can use whatever model fits best.

The recent buzz about ad-based chatbot services is probably because the companies no longer have an edge despite what the benchmarks say; users are noticing it and canceling paid plans. Just today OpenAI offered me a 1-month free trial, as if I wasn't using it two months ago. I guess they hope I forget to cancel.

  • barrell 20 hours ago

    Yep, I spent 3 days optimizing my prompt trying to get gpt-5 to work. Tried a bunch of different models (some on Azure, some on OpenRouter) and got a better success rate with several of them without any tailoring of the prompt.

    It was really plug and play. There are still small nuances to each one, but compared to a year ago, prompts are much more portable.

  • barbazoo 20 hours ago

    > I guess they hope I forget to cancel.

    Business model of most subscription based services.

    • viking123 5 hours ago

      For me it's just that I'm too lazy to switch away from my GPT subscription. I use it with Codex and it's very good for my use case, and the price, at least here in Asia, is not expensive at all for the Plus tier. The token allowance is so generous that I usually can't even spend the weekly quota, although I use context smartly and know my codebase, so I can always point it to the right place right away.

      I feel like, at least for normies who are familiar with ChatGPT, it might be hard to make them switch, especially if they are already subscribed.

  • acuozzo 18 hours ago

    > because they are interchangeable

    What is your use-case?

    Mine is: I use "Pro"/"Max"/"DeepThink" models to iterate on novel cross-domain applications of existing mathematics.

    My interaction is: I craft a detailed prompt in my editor, hand it off, come back 20-30 minutes later, review the reply, and then repeat if necessary.

    My experience is that they're all very, very different from one another.

    • mrtksn 18 hours ago

      My use case is Google replacement: things that I can do myself so I can verify, and things that are not important so I don't have to verify.

      Sure, they produce different output, so sometimes I will run the same thing on a few different models when I'm not sure or not happy, but I don't actually delegate the thinking part; I always give a direction in my prompts. I don't see myself running 30-minute queries, because I would never trust the output and would have to do all the work myself anyway. Instead, I like to go step by step together.

  • giancarlostoro 18 hours ago

    Maybe give Perplexity a shot? It has Grok, ChatGPT, Gemini, and Kimi K2; I don't think it has Mistral, unfortunately.

    • mrtksn 18 hours ago

      I actually like Perplexity but haven't used it in some time. Maybe I should give it a go :)

      • ecommerceguy 13 hours ago

        I use their browser, Comet, for finance-related research. Very nice. I use pretty much all of the main AIs - chat, deep, gem, claude - and for each I've found a little niche use case that I'm sure will rotate at some point in an upgrade cycle. There are so many AIs that I don't see the point in paying for one. I'm convinced they will need ads to survive.

        Excited to add Mistral to the rotation!

        • giancarlostoro 10 hours ago

          Oh man, I use Comet nearly daily. I tried setting Perplexity as my new tab page on other browsers, and for some reason it's not the same. I mostly use it that boring way too.

druskacik 19 hours ago

This is my experience as well. Mistral models may not be the best according to benchmarks, and I don't use them for personal chats or coding, but for simple tasks with a pre-defined scope (such as categorization, summarization, etc.) they are the option I choose. I use mistral-small with the batch API and it's probably the most cost-efficient option out there.
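
To give a flavor of the kind of fixed-scope task I mean, here's a rough sketch using the mistralai Python client's chat endpoint (the category list, prompt, and example input are made up; the batch API takes the same kind of requests in bulk):

    # Sketch of a fixed-scope categorization task with mistral-small.
    # Category list, prompt, and example input are made up; adapt to your data.
    import os
    from mistralai import Mistral

    client = Mistral(api_key=os.environ["MISTRAL_API_KEY"])

    CATEGORIES = ["billing", "bug report", "feature request", "other"]

    def categorize(text: str) -> str:
        resp = client.chat.complete(
            model="mistral-small-latest",
            messages=[
                {"role": "system",
                 "content": "Classify the message into exactly one of: "
                            + ", ".join(CATEGORIES)
                            + ". Reply with the category name only."},
                {"role": "user", "content": text},
            ],
        )
        return resp.choices[0].message.content.strip()

    print(categorize("I was charged twice this month."))  # expected: "billing"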

  • leobg 4 hours ago

    Did you compare it to gemini-2.0-flash-lite?

    • leobg an hour ago

      Answering my own question:

      Artificial Analysis ranks them close in terms of price (both 0.3 USD/1M tokens) and intelligence (27 / 29 for gemini/mistral), but ranks gemini-2.0-flash-lite higher in terms of speed (189 tokens/s vs. 130).

      So they should be interchangeable. Looking forward to testing this.

      [0] https://artificialanalysis.ai/?models=o3%2Cgemini-2-5-pro%2C...

mbowcut2 19 hours ago

It makes me wonder about the gaps in evaluating LLMs by benchmarks. There is almost certainly overfitting happening, which could degrade other use cases. "In practice" evaluation is what inspired Chatbot Arena, right? But then people realized that Chatbot Arena over-prioritizes formatting, and maybe sycophancy(?). Makes you wonder what the best evaluation would be. We probably need lots more task-specific models; that approach has seemed fruitful for coding.

  • pants2 18 hours ago

    The best benchmark is one that you build for your use case. I finally did that for a project, and I was not expecting the results. Frontier models are generally "good enough" for most use cases, but if you have something specific you're optimizing for, there's probably a more obscure model that just does a better job.

    • airstrike 18 hours ago

      If you and others have any insights to share on structuring that benchmark, I'm all ears.

      There's a new model seemingly every week, so finding a way to evaluate them repeatedly would be nice.

      The answer may be that it's so bespoke you have to hand-roll it every time, but my gut says there's a set of best practices that are generally applicable.

      • pants2 17 hours ago

        Generally, the easiest:

        1. Sample a set of prompts / answers from historical usage.

        2. Run that through various frontier models again and if they don't agree on some answers, hand-pick what you're looking for.

        3. Test different models using OpenRouter and score each along cost / speed / accuracy dimensions against your test set.

        4. Analyze the results and pick the best, then prompt-optimize to make it even better. Repeat as needed.
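
        For step 3, here's a rough sketch of the kind of harness I mean, assuming the openai Python client pointed at OpenRouter (the model names, test-set file, and exact-match scoring rule are placeholders, not recommendations):

          # Score a saved test set against a few candidate models via OpenRouter.
          import json, time
          from openai import OpenAI  # OpenRouter exposes an OpenAI-compatible API

          client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="sk-or-...")

          MODELS = ["mistralai/mistral-medium-3", "openai/gpt-5", "x-ai/grok-4"]  # placeholders
          cases = [json.loads(line) for line in open("testset.jsonl") if line.strip()]  # {"prompt": ..., "expected": ...}

          for model in MODELS:
              correct, elapsed = 0, 0.0
              for case in cases:
                  start = time.time()
                  resp = client.chat.completions.create(
                      model=model,
                      messages=[{"role": "user", "content": case["prompt"]}],
                  )
                  elapsed += time.time() - start
                  # resp.usage has per-call token counts if you also want a cost dimension
                  answer = resp.choices[0].message.content.strip()
                  correct += int(answer == case["expected"])  # swap in your own scoring rule
              print(f"{model}: {correct}/{len(cases)} correct, {elapsed / len(cases):.1f}s avg")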

  • Legend2440 16 hours ago

    I don’t think benchmark overfitting is as common as people think. Benchmark scores are highly correlated with the subjective “intelligence” of the model. So is pretraining loss.

    The only exception I can think of is models trained on synthetic data like Phi.

  • pembrook 17 hours ago

    If the models from the big US labs are being overfit to benchmarks, then we also need to account for HN commenters overfitting positive evaluations to Chinese or European models based on their political biases (US big tech = default bad, anything European = default good).

    Also, we should be aware of people cynically playing into that bias to advertise their app, like OP, who has managed to spam a link in the first line of a top comment on this popular front-page article by telling the audience exactly what they want to hear ;)

mentalgear 19 hours ago

Thanks for sharing your use case for the Mistral models, which are indeed top-notch! I had a look at phrasing.app, and while it's a nice website, I found the copy "Hand-crafted. Phrasing was designed & developed by humans, for humans." somewhat of a false virtue given your statements here about advanced LLM usage.

  • barrell 19 hours ago

    I don't see the contradiction. I do not use LLMs in the design, development, copywriting, marketing, blogging, or any other aspect of the crafting of the application.

    I labor over every word, every button, every line of code, every blog post. I would say it is as hand-crafted as something digital can be.

    • basilgohar 19 hours ago

      I admire and respect this stance. I have been very AI-hesitant, and while I'm using it more and more, there are spaces that I definitely want to keep human-only, as this is my preference. I'm glad to hear I'm not the only one like this.

      • barrell 18 hours ago

        Thank you :) and you're definitely not the only one.

        Full transparency, the first backend version of phrasing was 'vibe-coded' (long before vibe coding was a thing). I didn't like the results, I didn't like the experience, I didn't feel good ethically, and I didn't like my own development.

        I rewrote the application (completely, from scratch: new repo, new language, new framework) and all of a sudden I liked the results, I loved the process, I had no moral qualms, and I improved by leaps and bounds in all areas I worked on.

        Automation has some amazing use cases (I am building an automation product at the end of the day) but so does doing hard things yourself.

        Although the most important thing is just to enjoy what you do; or perhaps to do something you can be proud of.

metadat 20 hours ago

Are you saying gpt-5 produces gibberish 15% of the time? Or are you comparing Mistral's gibberish production rate to gpt-5.1's complex-task failure rate?

Does Mistral even have a tool-use model? It would be awesome to have a new coder entrant beyond OpenAI, Anthropic, Grok, and Qwen.

  • barrell 20 hours ago

    Yes. I spent about 3 days trying to optimize the prompt to get gpt-5 to not produce gibberish, to no avail. Completions took several minutes, had an above-50% timeout rate (with a 6-minute timeout, mind you), and after retrying they would still return gibberish about 15% of the time (12% on one task, 20% on another).

    I then tried multiple models, and they all failed in spectacular ways. Only Grok and Mistral had an acceptable success rate, although Grok did not follow the formatting instructions as well as Mistral.

    Phrasing is a language learning application, so the formatting is very complicated, with multiple languages and multiple scripts intertwined with markdown formatting. I do include dozens of examples in the prompts, but it's something many models struggle with.

    This was a few months ago, so to be fair, it's possible gpt-5.1 or gemini-3 or the new deepseek model may have caught up. I have not had the time or need to compare, as Mistral has been sufficient for my use cases.

    I mean, I'd love to get that 0.1% error rate down, but there have always been more pressing issues XD

    • data-ottawa 19 hours ago

      With gpt-5, did you try adjusting the reasoning level to "minimal"?

      I tried using it for a very small, quick summarization task that needed low latency, and any level above minimal took several seconds to return a response. Using minimal brought that down significantly.

      Weirdly, gpt-5's reasoning levels don't map to the reasoning effort levels in the OpenAI API.
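
      For reference, the call I mean is roughly this (a sketch assuming the Responses API's reasoning.effort parameter; the prompt is made up):

        # Quick summarization with reasoning effort turned down to cut latency.
        from openai import OpenAI

        client = OpenAI()
        resp = client.responses.create(
            model="gpt-5",
            reasoning={"effort": "minimal"},  # vs. "low" / "medium" / "high"
            input="Summarize this support ticket in one sentence: ...",
        )
        print(resp.output_text)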

      • barrell 18 hours ago

        Reasoning was set to minimal and low (and I think I tried medium at some point). I do not believe the timeouts were due to the reasoning taking too long, although I never streamed the results. I think the model just fails often: it stops producing tokens and eventually the request times out.

    • barbazoo 20 hours ago

      Hard to gauge what gibberish is without an example of the data and what you prompted the LLM with.

      • barrell 20 hours ago

        If you wanted examples, you needed only ask :)

        These are screenshots from that week: https://x.com/barrelltech/status/1995900100174880806

        I'm not going to share the prompt because (1) it's very long, (2) there were dozens of variations, and (3) it seems like poor business practice to share the most indefensible part of your business online XD

acuozzo 18 hours ago

I have a need to remove loose "signature" lines from the last 10% of a tremendous e-mail dataset. Based on your experience, how do you think mistral-3-medium-0525 would do?

  • barrell 18 hours ago

    What's your acceptable error rate? Honestly, Ministral would probably be sufficient if you can tolerate a small failure rate; I feel like Medium would be overkill.

    But I'm no expert. I can't say I've used Mistral much outside of my own domain.

    • acuozzo 17 hours ago

      I'd prefer the error rate to be as close to 0% as possible, under the strict requirement of having to use a local model. I have access to nodes with 8xH200s, but I'd prefer not to tie those up with this task. Instead, I'd prefer to use a model I can run on an M2 Ultra.

      • barrell 17 hours ago

        If I cannot tolerate a failure rate, I do not use LLMs (or any ML models).

        But in that case, the larger the better. If Mistral Medium can run on your M2 Ultra, then it should be up to the task. It should edge out Ministral and be just shy of the biggest frontier models.

        But I wouldn't even trust GPT-5 or Claude Opus or Gemini 3 Pro to get close to a zero percent error rate, and for a task such as this I would not expect Mistral Medium to outperform the big boys.