Comment by bccdee

Comment by bccdee 10 months ago

The creation of a model which is "co-state-of-the-art" (assuming it wasn't trained on the benchmarks directly) is not a win for scaling laws. I could just as easily claim out that xAI's failure to significantly outperform existing models despite "throwing more compute at Grok 3 than even OpenAI could" is further evidence that hyper-scaling is a dead end which will only yield incremental improvements.

Obviously more computing power makes the computer better. That is a completely banal observation. The rest of this 2000-word article is groping around for a way to take an insight based on the difference between '70s symbolic AI and the neural networks of the 2010s and apply it to the difference between GPT-4 and Grok 3 off the back of a single set of benchmarks. It's a bad article.

starspangled 10 months ago

> The creation of a model which is "co-state-of-the-art" (assuming it wasn't trained on the benchmarks directly) is not a win for scaling laws.

Just based on the comparisons linked in the article, it's not "co-state-of-the-art", it's the clear leader. You might argue those numbers are wrong or not representative, but you can't accept them then claim it's not outperforming existing models.

Reply View 5 replies

bccdee 10 months ago

The leader, perhaps, but not by a large margin, and only on these sample benchmarks. "Co-state-of-the-art" is the term used in the article, and I'm going to take that at face value.

Reply View | 4 replies
- starspangled 10 months ago
  
  It significantly outperformed competitors on those benchmarks. Around as much as the deltas between some others, which are considered significant.
  
  Reply View | 3 replies
  
  bccdee 10 months ago
  
  The deltas between the others are mostly not significant either. They're all about equally good. There's no categorical difference between GPT-4 and Claude 3.5.
  
  Reply View | 2 replies

horsawlarway 10 months ago

I agree.

There's a lot of attention being paid to metrics that often don't align all that well with actual production use-cases, and frankly the metrics are good but hardly breath-taking.

They have an absolutely insane outlay of additional compute, which appears to have given them a relatively paltry increase in capabilities.

15 times the compute for 5-15% better performance is basically the exact opposite of the bitter lesson.

Hell - it genuinely seems like the author didn't even read the actual bitter lesson.

The lesson is not "scale always wins" the lesson was "We have to learn the bitter lesson that building in how we think we think does not work in the long run."

And somewhat ironically - the latest advances seem to genuinely undermine the lesson. It turns out that building in reasoning/thinking (a heuristic that copies human behavior) is the biggest performance jump we've seen in the last year.

Does that mean we won't scale out of the current capabilities? No, we definitely might. But we also definitely might not.

The diminishing returns we're seeing for scale hint strongly that just throwing more compute at the problem is not enough by itself. Possibly still required, but definitely not sufficient.

Reply View 0 replies