Comment by rpdillon 4 hours ago

Is it? By far the majority of the code the LLMs are trained on is going to come from Git repositories. So the idea that Stack Overflow question-and-answer threads with buggy code dominate the training sets seems unlikely. Perhaps I'm misunderstanding?

ModernMech 4 hours ago

> Perhaps I'm misunderstanding?

The post wasn't saying that Stack Overflow Q&A sections with buggy code dominate the training sets. They're saying that however much code comes from Git repositories, the process of generating and debugging code can't be found in the static code that sits in GitHub repos; that process is instead encoded in the conversations on SO, GitHub issues, various forums, etc. So if you want to go from buggy code to correct code in the way the LLM was trained, you do it by simulating the back and forth found in an SO question, so that when the LLM is asked for the next step, it can rely on its training.

  • rpdillon 2 hours ago

    Thanks! Okay, I agree it's an interesting concept. I'm not sure whether it's actually true, but I can see why it might be. I appreciate your clarification!

    I took the GP to be a complaint that you have to sort of go through this buggy-code loop over and over because of how the LLM was trained. Maybe I read sarcasm at the end of the post when there was none.