Comment by ndai 19 hours ago

I’m curious where you got your training data? I’ll look myself, but I saw this and thought I’d ask. I have a CPU-first, no-backprop architecture that works very well on classification datasets. It can do single-example incremental updates, which might be useful for continual learning. I made a toy demo that trains on tiny.txt and can predict next characters, but I’ve never tried to make an LLM before. I think my architecture might work well as an on-device assistant or for on-premises needs, but I want to work with it more before I embarrass myself. Any open-source LLM training datasets you would recommend?
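
To make "single-example updates" concrete, here's a minimal sketch of the general idea in Python. This is not my actual architecture, just the simplest backprop-free scheme I can write down: a nearest-centroid classifier whose per-class prototype is a running mean, updated one example at a time. All the names here are illustrative.

    # Toy backprop-free classifier: one running-mean prototype per class.
    # Illustrative sketch only, not the architecture described above.
    import numpy as np

    class CentroidClassifier:
        def __init__(self, n_classes, dim):
            self.protos = np.zeros((n_classes, dim))
            self.counts = np.zeros(n_classes)

        def update(self, x, y):
            # Single-example incremental update: shift the class prototype
            # toward x by the incremental-mean rule. No gradients involved,
            # O(dim) work per example.
            self.counts[y] += 1
            self.protos[y] += (np.asarray(x) - self.protos[y]) / self.counts[y]

        def predict(self, x):
            # Nearest prototype by Euclidean distance.
            dists = np.linalg.norm(self.protos - np.asarray(x), axis=1)
            return int(np.argmin(dists))

    clf = CentroidClassifier(n_classes=2, dim=3)
    clf.update([1.0, 0.0, 0.0], 0)
    clf.update([0.0, 1.0, 0.0], 1)
    print(clf.predict([0.9, 0.1, 0.0]))  # -> 0

The appeal of this family of update rules is exactly what I mentioned: each example is absorbed in one cheap pass, so the model can keep learning after deployment.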

kachapopopow 19 hours ago

Hugging Face has plenty of OpenAI and Anthropic user-to-assistant chains. Beware, there be dragons (hallucinations), but they're good enough for instruction training. I'd actually recommend distilling Kimi K2 instead for instruction-following capabilities.
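
If you want a concrete starting point, pulling one of those conversation datasets off the Hub is a couple of lines with the `datasets` library. The dataset name below (OpenAssistant's oasst1) is just one example of an open user-to-assistant corpus; swap in whatever you find:

    # Minimal example of loading an open conversation dataset from the
    # Hugging Face Hub. "OpenAssistant/oasst1" is just one example.
    from datasets import load_dataset

    ds = load_dataset("OpenAssistant/oasst1", split="train")
    print(ds[0]["text"])  # inspect one message from a conversation tree

Distillation works the same way on the training side, except you generate the assistant turns yourself by prompting the teacher model (e.g. Kimi K2) and saving the prompt/response pairs.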

hadlock 12 hours ago

The training data is contained inside main.rs; it looks like about 50 statements of general facts, probably to keep training time down. If you go off-script, things fall apart pretty quickly:

    Enter prompt: hello
    Model output: What are eclipses ? Assistant : Eclipses occur when one celestial body moves into the shadow of another </s>

    Enter prompt: what are facts
    Model output: eclipses Assistant : Eclipses Eclipses What What was when What through you ? through you meet through using of What was What tall ? of What thank help you explain using , and do you a patterns you you a patterns through air in tall you help metal metal </s>

    Enter prompt: how are mountains formed?
    Model output: I ? ' I ' : Of : in happy Hi wind in yeast altering it it </s>