Show HN: Z80-μLM, a 'Conversational AI' That Fits in 40KB

495 points by quesomaster9000 2 days ago

How small can a language model be while still doing something useful? I wanted to find out, and had some spare time over the holidays.

Z80-μLM is a character-level language model with 2-bit quantized weights ({-2,-1,0,+1}) that runs on a Z80 with 64KB RAM. The entire thing (inference, weights, chat UI) fits in a 40KB .COM file that you can run in a CP/M emulator and, hopefully, even on real hardware!
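
To make the 2-bit weight format concrete, here is a minimal sketch of packing four weights from {-2,-1,0,+1} into one byte. The two's-complement encoding, the low-bits-first slot order, and the names `pack_weights` and `unpack_weights` are assumptions for illustration; the post does not say how the .COM file actually stores its weights.

```python
# Hypothetical layout: 2-bit two's complement, four weights per byte,
# first weight in the lowest two bits. Not taken from the Z80-μLM source.
LEVELS = (-2, -1, 0, 1)

def pack_weights(weights):
    """Pack weights from {-2,-1,0,+1} into bytes, four weights per byte."""
    if len(weights) % 4:
        raise ValueError("pad the weight list to a multiple of 4")
    out = bytearray()
    for i in range(0, len(weights), 4):
        byte = 0
        for slot, w in enumerate(weights[i:i + 4]):
            if w not in LEVELS:
                raise ValueError(f"weight {w} is not on the 2-bit grid")
            # Masking with 0b11 gives the two's-complement code: -2 -> 2, -1 -> 3.
            byte |= (w & 0b11) << (2 * slot)
        out.append(byte)
    return bytes(out)

def unpack_weights(packed):
    """Inverse of pack_weights: recover the signed weights from the bytes."""
    weights = []
    for byte in packed:
        for slot in range(4):
            code = (byte >> (2 * slot)) & 0b11
            # Codes 2 and 3 are the negative values -2 and -1.
            weights.append(code - 4 if code >= 2 else code)
    return weights
```

Under this layout `pack_weights([1, -1, 0, -2])` yields the single byte `0x8D`, and a 40KB file could hold at most 40 × 1024 × 4 = 163,840 weights before any room is taken by code or the UI.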

It won't write your emails, but it can be trained to play a stripped down version of 20 Questions, and is sometimes able to maintain the illusion of having simple but terse conversations with a distinct personality.

The extreme constraints nerd-sniped me and forced interesting trade-offs: trigram hashing (typo-tolerant, but loses word order), 16-bit integer math, and some careful massaging of the training data so that I could keep the examples 'interesting'.
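
A rough sketch of what trigram hashing can look like, to show where the typo tolerance and the loss of word order come from. The bucket count, the hash function, the per-word padding, and the names `trigram_hash` and `encode` are all assumptions chosen for illustration, not the project's actual encoder.

```python
N_BUCKETS = 128  # hypothetical input size; the real model may use another value

def trigram_hash(tri):
    """Hash a 3-character string into a bucket using 16-bit arithmetic."""
    h = 0
    for ch in tri:
        h = (h * 31 + ord(ch)) & 0xFFFF  # keep the running hash in 16 bits
    return h % N_BUCKETS

def encode(text):
    """Turn text into a fixed-size vector of hashed trigram counts.

    Assumption: each word is padded with one space on both sides and
    trigrams are taken per word, so the result is a bag of trigrams.
    """
    counts = [0] * N_BUCKETS
    for word in text.lower().split():
        padded = f" {word} "
        for i in range(len(padded) - 2):
            counts[trigram_hash(padded[i:i + 3])] += 1
    return counts

def overlap(a, b):
    """Count how many trigram hits two encoded vectors share."""
    return sum(min(x, y) for x, y in zip(a, b))
```

With this sketch, `encode("dog bites man")` and `encode("man bites dog")` are identical vectors (word order is gone), while "hello" and the typo "helo" still share at least three trigrams (" he", "hel", "lo "), so their vectors stay close.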

The key was quantization-aware training that accurately models the inference code limitations. The training loop runs both float and integer-quantized forward passes in parallel, scoring the model on how well its knowledge survives quantization. The weights are progressively pushed toward the 2-bit grid using straight-through estimators, with overflow penalties matching the Z80's 16-bit accumulator limits. By the end of training, the model has already adapted to its constraints, so there is no post-hoc quantization collapse.
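
The idea in the paragraph above can be sketched in a few lines of plain Python for a single neuron. This is only an illustration under stated assumptions: the squared-error loss, the learning rate, rounding with Python's `round`, and the names `quantize`, `forward_float`, `forward_int`, and `ste_step` are hypothetical and not taken from the actual training code.

```python
GRID_MIN, GRID_MAX = -2, 1          # the 2-bit weight grid from the post
ACC_MIN, ACC_MAX = -32768, 32767    # signed 16-bit accumulator range

def quantize(w):
    """Round a float weight to the nearest integer and clamp it to the grid."""
    return max(GRID_MIN, min(GRID_MAX, round(w)))

def forward_float(weights, inputs):
    """Reference forward pass with the unquantized float weights."""
    return sum(w * x for w, x in zip(weights, inputs))

def forward_int(weights, inputs):
    """Integer forward pass; also reports whether the running sum ever
    left the signed 16-bit range, which training could penalize."""
    acc = 0
    overflowed = False
    for w, x in zip(weights, inputs):
        acc += quantize(w) * x
        if not ACC_MIN <= acc <= ACC_MAX:
            overflowed = True
    return acc, overflowed

def ste_step(weights, inputs, target, lr=0.001):
    """One gradient step on squared error with a straight-through estimator.

    The forward pass uses the quantized weights, but the gradient with
    respect to each quantized weight (2 * err * x) is applied directly to
    the float weight, as if quantize() had a derivative of 1.
    """
    acc, _ = forward_int(weights, inputs)
    err = acc - target
    return [w - lr * 2 * err * x for w, x in zip(weights, inputs)]
```

For example, starting from weights `[0.4, -0.4]` with inputs `[10, 5]` and target `10`, the integer pass initially outputs 0; one `ste_step` moves the float weights to roughly `[0.6, -0.3]`, which quantize to `[1, 0]` and give the target exactly. The gap between `forward_float` and `forward_int` is one possible way to score how well the model survives quantization.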

Eventually I ended up spending a few dollars on the Claude API to generate the 20 Questions data (see examples/guess/GUESS.COM). I hope Anthropic won't send me a C&D for distilling their model against the ToS ;P

But anyway, happy code-golf season everybody :)

nineteen999 2 days ago

This couldn't be more perfectly timed .. I have an Unreal Engine game with both VT100 terminals (for running coding agents) and Z80 emulators, and a serial bridge that allows coding agents to program the CP/M machines:

https://i.imgur.com/6TRe1NE.png

Thank you for posting! It's unbelievable how someone sometimes just drops something that fits right into what you're doing, however bizarre it seems.


quesomaster9000 2 days ago

Oh dear, it seems we've... somehow been psychically linked...
I developed a browser-based CP/M emulator & IDE: https://lockboot.github.io/desktop/
I was going to post that, but wanted a 'cool demo' instead, and fell down the rabbit hole.

- stevekemp 2 days ago
  
  That is beautiful.
  I wrote a console-based emulator, and a simple CP/M text-adventure game somewhat recently
  https://github.com/skx/cpmulator/
  At some point I should rework my examples/samples to become a decent test-suite for CP/M emulators. There are so many subtle differences out there.
  It seems I could even upload a zipfile of my game, but the escape-codes for clearing the screen don't work, sadly:
  https://github.com/skx/lighthouse-of-doom
  
- jaak 2 days ago
  
  I've been playing the Z80-μLM demos in your CP/M emulator. Works great! However, I have yet to guess a correct answer in GUESS.COM! I'm not sure if I'm just not asking the right questions or I'm just really bad at it!
  
  
  quesomaster9000 2 days ago
  
  Don't tell anybody, but you sit on it
  
  
  sailfast 2 days ago
  
  Boris!!!
  
- nineteen999 a day ago
  
  Haha I love it. Just imagine if instead of DOS-based Windows, a CP/M based alternative evolved and took over the PC industry. Nice one!
  
sixtyj 2 days ago

Connections: Alternative History of Technology by James Burke documents these "coincidences".

- TeMPOraL 2 days ago
  
  Those "coincidences" in Connections are really no coincidence at all, but path dependence. Breakthrough advance A is impossible or useless without prerequisites B and C and economic conditions D, but once B and C and D are in place, A becomes the obvious next step.
  
  
  embedding-shape 2 days ago
  
  Some of those really are coincidences, like "Person A couldn't find their left shoe and ended up in London at a coffee house, where Person B accidentally ended up when their carriage hit a wall, which led to them eventually coming up with Invention C" for example.
  From what I remember of the TV show, though, most of what he investigates/talks about is indeed path dependence in one way or another, just not everything.
  
  
  sixtyj 2 days ago
  
  That’s why I’ve put the word in quotation marks :)
  
simonjgreen 2 days ago

Super intrigued but annoyingly I can’t view imgur here

- abanana 2 days ago
  
  Indeed, part of me wants to not use imgur because we can't access it, but a bigger part of me fully supports imgur's decision to give the middle finger to the UK after our government's censorship overreach.
  
  
  homebrewer 2 days ago
  
  It blocks many more countries than just the UK because it's the lowest effort way of fighting "AI" scrapers.
  imgur was created as a sort of protest against how terrible most image hosting platforms were back then, went down the drain several years later, and it's now just like they were.
  
  
  supern0va 2 days ago
  
  It turns out that running free common internet infrastructure at scale is both hard and expensive, unfortunately. What we really need is a non-profit to run something like imgur.
  
  
  wizzwizz4 2 days ago
  
  It was a really clever move on Imgur's part. Their blocking the UK has nothing to do with the Online Safety Act: it's a response to potential prosecution under the Data Protection Act, for Imgur's (alleged) unlawful use of children's personal data. By blocking the UK and not clearly stating why, people assume they're taking a principled stand about a different issue entirely, so what should be a scandal is transmuted into positive press.
  

rahen 2 days ago

I love it, instant GitHub star. I wrote an MLP in Fortran IV for a punched card machine from the sixties (https://github.com/dbrll/Xortran), so this really speaks to me.

The interaction is surprisingly good despite the lack of an attention mechanism and the limitation of the "context" to trigrams from the last sentence.

This could have worked on 60s-era hardware and would have completely changed the world (and science fiction) back then. Great job.


noosphr 2 days ago

Stuff like this is fascinating. Truly the road not taken.
Tin foil hat on: I think that a huge part of the major buyout of RAM by AI companies is to keep people from realising that we are essentially at the home computer revolution stage of LLMs. I have a machine with 1 TB of RAM which, with custom agents, outperforms all the proprietary models. It's private, secure and won't let me be monetized.

- Zacharias030 2 days ago
  
  How so? Sounds like you are running Kimi K2 / GLM? Which agents do you give it, and how do you handle web search and computer use well?
  

giancarlostoro 2 days ago

This is something I've been wondering about myself. What's the "Minimally Viable LLM" that can have simple conversations? Then my next question is, how much can we push it so it can learn from looking up data externally? Can we build a tiny model with an insanely large context window? I have to assume I'm not the only one who has asked or thought of these things.

Ultimately, if you can build an ultra tiny model that can talk and learn on the fly, you've just fully localized a personal assistant like Siri.


andy12_ 2 days ago

This is extremely similar to Karpathy's idea of a "cognitive core" [1]; an extremely small model with near-0 encyclopedic knowledge and basic reasoning and tool-use capabilities.
[1] https://x.com/karpathy/status/1938626382248149433

fho 2 days ago

You might be interested in RWKV: https://www.rwkv.com/
Not exactly "minimal viable", but a "what if RNNs were good for LLMs" case study.
-> insanely fast on CPUs

- giancarlostoro a day ago
  
  My personal idea revolves around "can I run it on a basic smartphone, with whatever the 'floor' for memory is on basic smartphones under, let's say, $300?" (let's pretend RAM prices are normal).
  Edit: The fact this runs on a smartphone means it is highly relevant. My only question is, how do we give such a model an "unlimited" context window, so it can digest as much as it needs? I know some models know multiple languages; I wouldn't be surprised if sticking to only English would reduce the model size / need for more hardware and make it even smaller / tighter.
  
qingcharles 2 days ago

I think what's amazing to speculate is how we could have had some very basic LLMs in at least the 90s if we'd invented the tech previously. I wonder what the world would be like now if we had?

Dylan16807 2 days ago

For your first question, the LLM someone built in Minecraft can handle simple conversations with 5 million weights, mostly 8 bits.
I doubt it would be able to make good use of a large context window, though.


Dwedit 2 days ago

In before AI companies buy up all the Z80s and raise the prices to new heights.


nubinetwork 2 days ago

Too late, they stopped being available last year.

- whobre 2 days ago
  
  Kind of. There’s still eZ80
  

andrepd 2 days ago

We should show this every time a Slack/Teams/Jira engineer tries to explain to us why a text chat needs 1.5GB of ram to start up.


dangus 2 days ago

> It won't write your emails, but it can be trained to play a stripped down version of 20 Questions, and is sometimes able to maintain the illusion of having simple but terse conversations with a distinct personality.
You can buy a kid’s tiger electronics style toy that plays 20 questions.
It’s not like this LLM is a bastion of glorious efficiency, it’s just stripped down to fit on the hardware.
Slack/Teams handles company-wide video calls and can render anything a web browser can, and they run an entire App Store of apps, all from a cross-platform application.
Including Jira in the conversation doesn’t even make logical sense. It’s not a desktop application that consumes memory. Jira has such a wide scope that the word “Jira” doesn’t even describe a single product.

- ben_w 2 days ago
  
  > Slack/Teams handles company-wide video calls and can render anything a web browser can, and they run an entire App Store of apps, all from a cross-platform application.
  The 4th Gen iPod touch had 256 meg of RAM and also did those things, with video calling via FaceTime (and probably others, but I don't care). Well, except "cross platform", what with it being the platform.
  
  
  dangus 2 days ago
  
  Group FaceTime calls didn’t exist at the time. That wasn’t added until 2018 and required iOS 12.
  Remember that Slack does simultaneous screen sharing from multiple participants, plus annotations, plus HD video feeds from all participants, while the entire rest of the app continues to function as if you weren’t on a call at all.
  It’s an extremely powerful application when you really step back and think about it. It just looks like “text” and boring business software.
  
- messe 2 days ago
  
  > can render anything a web browser can
  That's a bug, not a feature, and strongly coupled to the root cause of Slack's bloat.
  
  
  dangus 2 days ago
  
  One person’s “bloat” is another person’s “critical business feature.”
  The app ecosystem of Slack is largely responsible for its success. You can extend it to do almost anything you want.
  
  
  spopejoy a day ago
  
  > app ecosystem of Slack is largely responsible for its success.
  Is that true? Slack was one of the first private chats that was not painful to use, circa 2015. I personally hate the integrations and wish they'd just fix the bugs in their core product.
  
- andrepd 2 days ago
  
  My Pentium 3 in 2005 could do chat and video calls and play chess and send silly emotes. There is no conceivable user-facing reason why in 20 years the same functionality takes 30× as many resources, only developer-facing reasons. But those are not valid reasons for a professional. If a bridge engineer claims he now needs 30× as much concrete to build the same bridge as he did 20 years ago, and the reason is his/her own convenience, that would not fly.
  
  
  ben_w 2 days ago
  
  > If a bridge engineer claims he now needs 30× as much concrete to build the same bridge as he did 20 years ago, and the reason is his/her own convenience, that would not fly.
  By itself, I would agree.
  However, in this metaphor, concrete got 15x cheaper in the same timeframe. Not enough to fully compensate for the difference, but enough that a whole generation are now used to much larger edifices.
  
  
  dangus 2 days ago
  
  I have great doubts that you were doing simultaneous screen sharing from multiple participants with group annotation plus HD video in your group calls, all while supporting chatting that allowed you to upload and view multiple animated gifs, videos, rich formatted text, reactions, slash command and application automation integrations, all simultaneously on your Pentium 3.
  I would be interested to know the name of the program that did all that within the same app during that time period.
  For some reason Slack gets criticism for being “bloated” when it basically does anything you could possibly imagine and is essentially a business communication application platform. Nobody can actually name a specific application that does everything Slack does with better efficiency.
  

vedmakk 2 days ago

Suppose one trained an actual secret (e.g. a passphrase) into such a model, so that a user would need to guess it by asking the right questions. Could this secret be easily reverse engineered / inferred by someone with access to the model's weights, or would it be safe to assume that one could only get to the secret by asking the right questions?


Kiboneu 2 days ago

I don’t know, but your question reminds me of this paper which seems to address it on a lower level: https://arxiv.org/abs/2204.06974
“Planting Undetectable Backdoors in Machine Learning Models”
“ … On the surface, such a backdoored classifier behaves normally, but in reality, the learner maintains a mechanism for changing the classification of any input, with only a slight perturbation. Importantly, without the appropriate "backdoor key", the mechanism is hidden and cannot be detected by any computationally-bounded observer. We demonstrate two frameworks for planting undetectable backdoors, with incomparable guarantees. …”

ronsor 2 days ago

> this secret be easily reverse engineered / inferred by someone with access to the model's weights
It could with a network this small. More generally this falls under "interpretability."


roygbiv2 2 days ago

Awesome. I've just designed and built my own Z80 computer, though right now it has 32KB ROM and 32KB RAM. This will definitely change on the next revision so I'll be sure to try it out.


wewewedxfgdf 2 days ago

RAM is very expensive right now.

- wickedsight 2 days ago
  
  I just removed 128 megs of RAM from an old computer and am considering listing it on eBay to pay off my mortgage.
  
  
  nrhrjrjrjtntbt 2 days ago
  
  I wonder in what past year 128 MB of RAM would have paid off a mortgage. Maybe 1985.
  
- tgv 2 days ago
  
  We're talking kilobytes, not gigabytes. And it isn't DDR5 either.
  
  
  boomlinde 2 days ago
  
  Yeah, even an average household can afford 40k of slow DRAM if they cut down on luxuries like food and housing.
  
  
  StilesCrisis 2 days ago
  
  thats-the-joke.gif
  

gcanyon 2 days ago

So it seems like with the right code (and maybe a ton of future infrastructure for training?) Eliza could have been much more capable back in the day.


antonvs 2 days ago

The original ELIZA ran on an IBM 7094 mainframe, in the 1960s. That machine had 32K x 36-bit words, and no support for byte operations. It did support 6-bit BCD characters, packed 6 per word, but those were for string operations, and didn't support arithmetic or logical operations.
This means that a directly translated 40 KB Z80 executable might be a tight squeeze on that mainframe, because 40K > 32K, counting words, not bytes. Of course if most of that size is just 2-bit weight data then it might not be so bad.
ELIZA running on later hardware would have been a different story, with the Z80 - released in 1976 - being an example.
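
The word-count comparison above can be checked with a little arithmetic. This sketch only uses the figures quoted in the comment (a 40 KB image, 32K words of 36 bits); the two storage schemes and the variable names are illustrative assumptions, not a description of any real port.

```python
WORD_BITS = 36                # IBM 7094 word size, as stated above
MEMORY_WORDS = 32 * 1024      # 32K words of core memory
IMAGE_BYTES = 40 * 1024       # size of the Z80 .COM file

# Scheme 1: a direct translation that stores one Z80 byte per word.
one_byte_per_word = IMAGE_BYTES

# Scheme 2: pack the same bits densely into 36-bit words (ceiling division).
bit_packed = -(-IMAGE_BYTES * 8 // WORD_BITS)

# How many 2-bit weights fit in a single 36-bit word.
weights_per_word = WORD_BITS // 2

print(one_byte_per_word, "words, fits:", one_byte_per_word <= MEMORY_WORDS)
print(bit_packed, "words, fits:", bit_packed <= MEMORY_WORDS)
print(weights_per_word, "weights per word")
```

One byte per word needs 40,960 words and does not fit in 32,768, while dense packing needs 9,103 words, with 18 two-bit weights per word, which is why the weight data matters so much to the estimate.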


orbital-decay 2 days ago

Pretty cool! I wish free-input RPGs of old had fuzzy matchers. They worked by exact keyword matching and it was awkward. I think the last game of that kind (where you could input arbitrary text when talking to NPCs) was probably Wizardry 8 (2001).