Comment by rpdillon
Is it? By far the majority of code the LLMs are trained on comes from Git repositories. So the idea that Stack Overflow question-and-answer threads with buggy code dominate the training sets seems unlikely. Perhaps I'm misunderstanding?
> Perhaps I'm misunderstanding?
The post wasn't saying that Stack Overflow Q&A threads with buggy code dominate the training sets. The point is that however much code comes from Git repositories, the *process* of generating and debugging code isn't captured in the static code sitting in GitHub repos; that process is instead encoded in the conversations on SO, GitHub issues, various forums, etc. So if you want the model to go from buggy code to correct code the way it was trained to, you do that by simulating the back-and-forth of an SO thread, so that when the LLM is asked for the next step, it can rely on its training.