Comment by Szpadel 21 hours ago

With RL it's hard to define a score function for many categories. This is especially visible in current coding capabilities. LLMs will very often produce sloppy solutions because those solutions score well in RL. Hardcoding API keys? Ignoring errors? Disabling lints? Those pass automated evaluation and are therefore reinforced in training. Are they good solutions? Of course not.

It's very hard to define (in a way that could be turned into lints) what makes code readable and maintainable. Using another LLM for this task could let the original model game the system by exploiting weaknesses in the judge model.

For other tasks, how do you even evaluate things like user experience or app design? How do you properly evaluate a pelican riding a bicycle?

esperent 17 hours ago

> hardcoding API keys? ignoring errors? disabling lints?

These kinds of "rookie mistakes" are not things that any modern LLM is likely to make. Indeed, I had to argue quite strongly with Gemini recently when I was learning a new tool (so basically just playing around with a fully local setup) and I hardcoded an API key and then tried to commit it. The LLM did NOT like that! I had to carefully explain that this was a toy repo.

The argument against this (by Gemini) was that toy repos often grow into production tools so it's best to follow basic security rules from the start. Which, to be fair, is a good argument. I still committed the key though (and deleted the repo a day or so later).

CuriouslyC 19 hours ago

You can project them onto a linear space by gathering enough pairwise evaluations. PelicanElo.
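A rough sketch of what that could look like in practice: run standard Elo updates over pairwise "which output is better?" judgments to get a single scalar score per model. The data, the K-factor, and the model names below are illustrative assumptions, not anything from the thread.

```python
from collections import defaultdict

def elo_from_pairwise(comparisons, k=32, initial=1000.0):
    """comparisons: iterable of (winner, loser) model-name pairs."""
    ratings = defaultdict(lambda: initial)
    for winner, loser in comparisons:
        # Expected score of the winner under the current ratings.
        expected = 1.0 / (1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400))
        delta = k * (1.0 - expected)
        ratings[winner] += delta
        ratings[loser] -= delta
    return dict(ratings)

# Hypothetical example: judged pelican-on-a-bicycle drawings from three models.
print(elo_from_pairwise([("model_a", "model_b"),
                         ("model_a", "model_c"),
                         ("model_b", "model_c")]))
```

With enough comparisons the ratings converge toward a one-dimensional ordering, which is the "projection onto a linear space" the comment is gesturing at.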