Comment by metadat 20 hours ago

Are you saying gpt-5 produces gibberish 15% of the time? Or are you comparing Mistral gibberish production rate to gpt-5.1's complex task failure rate?

Does Mistral even have a Tool Use model? That would be awesome to have a new coder entrant beyond OpenAI, Anthropic, Grok, and Qwen.

barrell 20 hours ago

Yes. I spent about three days trying to optimize the prompt to get gpt-5 to stop producing gibberish, to no avail. Completions took several minutes, had an above-50% timeout rate (with a 6-minute timeout, mind you), and after retrying they would still return gibberish about 15% of the time (12% on one task, 20% on another).

I then tried multiple models, and they all failed in spectacular ways. Only Grok and Mistral had an acceptable success rate, although Grok did not follow the formatting instructions as well as Mistral.

Phrasing is a language learning application, so the formatting is very complicated, with multiple languages and multiple scripts intertwined with markdown formatting. I do include dozens of examples in the prompts, but it's something many models struggle with.

This was a few months ago, so to be fair, it's possible gpt-5.1 or gemini-3 or the new deepseek model may have caught up. I have not had the time or need to compare, as Mistral has been sufficient for my use cases.

I mean, I'd love to get that 0.1% error rate down, but there have always been more pressing issues XD

  • data-ottawa 19 hours ago

    With gpt5 did you try adjusting the reasoning level to "minimal"?

    I tried using it for a very small and quick summarization task that needed low latency and any level above that took several seconds to get a response. Using minimal brought that down significantly.

    Weirdly, gpt-5's reasoning levels don't map to the reasoning effort levels in the OpenAI API.
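For reference, a minimal sketch of what setting that looks like against the OpenAI API (assuming the Responses API request shape; the exact parameter names may differ by SDK version, and the input string here is a made-up placeholder):

```python
# Hypothetical request payload for a low-latency summarization call.
# "minimal" keeps reasoning overhead as small as possible on gpt-5;
# the other documented levels are "low", "medium", and "high".
payload = {
    "model": "gpt-5",
    "reasoning": {"effort": "minimal"},
    "input": "Summarize the following text: ...",
}

# With the official SDK this payload would be passed as
# client.responses.create(**payload).
print(payload["reasoning"]["effort"])
```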

    • barrell 18 hours ago

      Reasoning was set to minimal and low (and I think I tried medium at some point). I do not believe the timeouts were due to the reasoning taking too long, although I never streamed the results. I think the model just fails often: it stops producing tokens and eventually the request times out.
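The retry-on-timeout pattern described in this thread can be sketched generically; `call_model` here is a hypothetical stand-in for the actual API request, not any real SDK function:

```python
import time


def call_with_retries(call_model, max_attempts=3, timeout_s=360, backoff_s=1.0):
    """Retry a flaky completion call.

    call_model(timeout_s) is a hypothetical stand-in for the real API
    request; it should raise TimeoutError when the request stalls out.
    """
    last_err = None
    for attempt in range(max_attempts):
        try:
            return call_model(timeout_s)
        except TimeoutError as err:
            last_err = err
            time.sleep(backoff_s * (2 ** attempt))  # simple exponential backoff
    raise last_err
```

Note this only papers over the timeouts; a response that comes back as gibberish still succeeds from the wrapper's point of view, which is why the thread treats the gibberish rate separately from the timeout rate.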

  • barbazoo 20 hours ago

    Hard to gauge what gibberish is without an example of the data and what you prompted the LLM with.

    • barrell 19 hours ago

      If you wanted examples, you needed only ask :)

      These are screenshots from that week: https://x.com/barrelltech/status/1995900100174880806

      I'm not going to share the prompt because (1) it's very long, (2) there were dozens of variations, and (3) it seems like poor business practice to share the most indefensible part of your business online XD

      • barbazoo 19 hours ago

        Surely reads like someone's brain transformed into a tree :)

        Impressive. I haven't seen that myself yet; I've only used 5 conversationally, not via the API.

        • barrell 18 hours ago

          Heh, it's a quote from Archer on FX (and admittedly a poor machine translation; it's a very old expression of mine).

          And yes, this only happens when I ask it to apply my formatting rules. If you let GPT format its output itself, I would be surprised if this ever happened.