Comment by rahen 2 days ago

5 replies

You're missing the point. No one is claiming that a 300K-param model on a Pentium II matches GPT-4. The point is that it works: it parses input, generates plausible syntax, and does so using algorithms and compute budgets that were entirely feasible decades ago. The claim is that we could have explored and deployed narrow AI use cases decades earlier, had the conceptual focus been there.

Even at that small scale, you can already do useful things like basic code or text autocompletion, and with a few million parameters on a machine like a Cray Y-MP, you could reasonably attempt tasks like summarizing structured or technical documentation. It's constrained in scope, granted, but it's a solid proof of concept.

The fact that a functioning language model runs at all on a Pentium II, with resources not far off from a 1982 Cray X-MP, is the whole point: we weren’t held back by hardware, we were held back by ideas.

alganet 2 days ago

> we weren’t held back by hardware

Llama 3 8B took 1.3M GPU hours to train on H100-80GB hardware.

Of course, it didn't take 1.3M hours (~150 years) of wall-clock time. So, many machines with 80GB each were used.

Let's do some napkin math. 150 machines with a total of 12TB VRAM for a year.

So, what would be needed to train a 300K parameter model that runs on 128MB RAM? Definitely more, much more than 128MB RAM.

Llama 3 runs on 16GB VRAM. Let's imagine that's our Pentium II of today. You need at least 750 times what is needed to run it in order to train it. So, you would have needed ~100GB RAM back then, running for a full year, to get that 300K model.

How many computers with 100GB+ RAM do you think existed in 1997?

Also, I only did RAM. You also need raw processing power and massive amounts of training data.
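The RAM napkin math above can be reproduced in a few lines. This is only a sketch of the comment's own reasoning: every input is a figure quoted in the comment, the function and variable names are made up for illustration, and the step that scales the train-to-run memory ratio linearly down to a 128MB machine is the comment's assumption, not an established rule.

```python
HOURS_PER_YEAR = 24 * 365  # 8760


def napkin_math(gpu_hours=1.3e6,       # quoted Llama 3 8B training cost
                gpu_vram_gb=80,        # H100-80GB
                machines=150,          # the comment's rounded machine count
                inference_vram_gb=16,  # VRAM assumed to run Llama 3 8B
                pentium_ram_mb=128):   # RAM of the Pentium II in question
    # Wall-clock years if a single GPU did all the work.
    single_gpu_years = gpu_hours / HOURS_PER_YEAR
    # Total VRAM across the machines assumed to run for one year.
    training_vram_gb = machines * gpu_vram_gb
    # How much more memory training used than inference needs.
    train_to_run_ratio = training_vram_gb / inference_vram_gb
    # ASSUMPTION (from the comment): the same ratio applies to the
    # 300K-parameter model, so scale the Pentium II's RAM by it.
    scaled_ram_gb = pentium_ram_mb * train_to_run_ratio / 1000
    return {
        "single_gpu_years": single_gpu_years,
        "training_vram_gb": training_vram_gb,
        "train_to_run_ratio": train_to_run_ratio,
        "scaled_ram_gb": scaled_ram_gb,
    }


if __name__ == "__main__":
    for name, value in napkin_math().items():
        print(f"{name}: {value:,.1f}")
```

Changing `machines` or `inference_vram_gb` shows how sensitive the ~100GB figure is to those two inputs, since the final number is just their ratio applied to 128MB.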

  • rahen 2 days ago

    You’re basically arguing that because A380s need millions of liters of fuel and a 4km runway, the Wright Flyer was impossible in 1903. That logic just doesn’t hold. Different goals, different scales, different assumptions. The 300K model shows that even in the 80s, it was both possible and sufficient for narrow but genuinely useful tasks.

    We simply weren’t looking, blinded by symbolic programming and expert systems. This could have been a wake-up call, steering AI research in a completely different direction and accelerating progress by decades. That’s the whole point.

    • alganet 2 days ago

      "I mean, today we can build jet engines in garage shops. Why would they have needed a catapult system? They could have used this simple jet engine. Look, here is the proof: there's a YouTuber who built a tiny jet engine in his garage. They were held back by ideas, not aerodynamics and tooling precision."

      See how silly it is?

      Now, focus on the simple question. How would you train the 300K model in 1997? To run it, you need someone to train it first.

      • rahen 2 days ago

        Reductio ad absurdum. A 300K-param model was small enough to be trained offline, on curated datasets, with CPUs and RAM capacities that absolutely existed at the time, especially in research centers.

        Backprop was known. Data was available. Narrow tasks (completion, summarization, categorization) were relevant. The model that runs on a Pentium II could have been trained on a Cray, or over a longer stretch of time on any reasonably powerful 90s workstation. That’s not fantasy: LeNet-5, with its 65K weights, was trained on a mere Sun workstation in the early 90s.

        The limiting factor wasn’t compute, it was the conceptual framing as well as the datasets. No one seriously tried, because the field was dominated by symbolic logic and rule-based AI. That’s the core of the argument.

        • alganet 2 days ago

          > Reductio ad absurdum.

          My dude, you came up with the Wright brothers comparison, not me. If you don't like fallacies, don't use them.

          > on any reasonably powerful 90s workstation

          https://hal.science/hal-03926082/document

          Quoting the paper now:

          > In 1989 a recognizer as complex as LeNet-5 would have required several weeks’ training and more data than were available and was therefore not even considered.

          Their own words seem to match my assessment.

          Training time and data availability determined how much this whole thing could advance, and researchers were aware of those limits.