Comment by jpalomaki
Can you give a concrete example of a programming task that GPT fails to solve?
I'm interested because I've been getting pretty good results with different tasks using Codex.
Try asking it to write some GLSL shaders. Just describe what you want to see, then try to run the shaders it outputs. It can get a UV map or a simple gradient right, but anything more complex will most of the time fail to compile or run properly: sometimes it mixes GLSL versions, sometimes it just makes things up that don't work or don't produce what you asked for.
I posted this example before, but: academic papers on algorithms often contain pseudocode but no actual code.
I thought it would be handy to use AI to turn the paper into code, so a few months ago I tried Claude (not GPT, because I only have access to Claude) to write C++ implementing the algorithms in this paper, as practice in using LLMs. It didn't go well.
I just tried it with GPT-5.1-Codex. The compression ratio is not amazing, so I'm not sure it really worked, but at least it ran without errors.
A few ideas for how to make it work for you:
1. You gave a link to a PDF, but you did not describe how you provided the PDF's content to the model. It might have only read the text with something like pdftotext, which for this PDF results in a garbled mess. It is safer to convert the pages to PNGs (e.g. with pdftoppm, as in the sketch after this list) and let the model read the pages as images. A prompt like "Transcribe these pages as markdown." should be sufficient. If you cannot see what the model actually read, there is a chance it made things up.
2. You used C++, but Python is much easier to write. You can tell the model to translate the code to C++ once it works in Python.
3. Tell the model to write unit tests to verify that the individual components work as intended (see the round-trip test sketch after this list).
4. Use agent mode and tell the model to print intermediate results and judge whether the output is sensible, so it can debug the code itself.
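To make point 1 concrete, here is a minimal Python sketch of the page-to-image step. It assumes pdftoppm from poppler-utils is installed; "paper.pdf" and the "page" output prefix are placeholder names:

    import subprocess

    # Render every page of the PDF to a PNG (page-1.png, page-2.png, ...)
    # so the model reads the real layout instead of garbled extracted text.
    # Requires pdftoppm from poppler-utils; "paper.pdf" is a placeholder.
    subprocess.run(
        ["pdftoppm", "-png", "-r", "150", "paper.pdf", "page"],
        check=True,
    )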
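For point 3, a round-trip test is usually the quickest sanity check for a compression algorithm. This sketch uses zlib as a runnable stand-in; swap in the compress/decompress functions the model actually generated (those names are my assumption, not the paper's):

    import zlib

    # Stand-in so the tests run as-is; replace these with the functions
    # the model generated from the paper (the names are assumptions).
    compress, decompress = zlib.compress, zlib.decompress

    def test_round_trip():
        data = b"abracadabra" * 1000
        # Decompressing the compressed stream must give back the input.
        assert decompress(compress(data)) == data

    def test_actually_compresses():
        data = b"abracadabra" * 1000
        # Highly repetitive input should shrink substantially.
        assert len(compress(data)) < len(data) // 2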
Completely failed for me when running the code it changed in a Docker container I keep running. Claude did it flawlessly. It absolutely rocks at code reviews, but it's terrible in comparison at generating code.
Library/API conflicts are usually the biggest pain point for me, especially breaking changes. RLlib (currently 2.41.0) and Gymnasium (currently 0.29.0+) have sent me in circles many times because they tend to be out of sync for multi-agent environments. My go-to test now is a simple hello-world-type card game like War: competitive multi-agent with RLlib and Gymnasium (PettingZoo tends to cause even more issues).
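For a sense of scale, roughly the kind of minimal environment I mean, sketched against the Gymnasium-style MultiAgentEnv API in recent RLlib. The exact attribute names and reset/step signatures are precisely what drifts between releases, so treat the details as assumptions for RLlib 2.x rather than a known-good implementation:

    import random
    import gymnasium as gym
    from ray.rllib.env.multi_agent_env import MultiAgentEnv

    class WarEnv(MultiAgentEnv):
        # One-round "war": each player is dealt a card, higher rank wins.
        # Just enough game to exercise the multi-agent plumbing.
        def __init__(self, config=None):
            super().__init__()
            self.agents = self.possible_agents = ["p0", "p1"]
            self.observation_space = gym.spaces.Discrete(13)  # own card rank
            self.action_space = gym.spaces.Discrete(1)        # forced "play"

        def reset(self, *, seed=None, options=None):
            random.seed(seed)
            self.cards = {a: random.randrange(13) for a in self.agents}
            return dict(self.cards), {a: {} for a in self.agents}

        def step(self, action_dict):
            diff = self.cards["p0"] - self.cards["p1"]
            rewards = {"p0": float(diff > 0) - float(diff < 0),
                       "p1": float(diff < 0) - float(diff > 0)}
            terminateds = {"p0": True, "p1": True, "__all__": True}
            truncateds = {"p0": False, "p1": False, "__all__": False}
            infos = {a: {} for a in self.agents}
            return dict(self.cards), rewards, terminateds, truncateds, infos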
Claude Sonnet 4.5 was able to figure out a way to resolve it eventually (around 7 fixes), and I had it create an rllib.md with all the fixes and pitfalls. I'm curious whether feeding this file to the next experiment will lead to a one-shot. GPT-5 struggled more, but I haven't tried Codex on this yet, so it's not exactly a fair comparison.
All done with Copilot in agent mode, just prompting, no specs or anything.