Comment by johnsmith1840

Comment by johnsmith1840 3 days ago

Cool research!

I found an effect that explains this.

LLM memory isn't linearly lost or updated.

As a model is trained previously hidden memories sporadically return. Essentially a model's memory is time dependent to when you sample.

Study was: 1. Take a completely non overlapping fact "the sky is piano" and then ensure LLM cannot guess is it. 2. Train it one or more shots on this 3. Continue training on c4 without this fact. 4. The effect is that the random fact is forgotten but not linerally. Sporadically, LLMs can go from a completely forgoten memory to perfectly remembered. A type of internal self reinforcement without training data.

A rare but reproducible effect (1/15 training runs self reinforce). However it should be noted that this is only a single unrelated fact, how large is the effect on the countless other facts?

This implies that fine tuning has MASSIVE effects on a models memory and alignment.

Fine tuning x steps likely results in a large chunk of previously aligned memories are broken or un aligned memories return and self reinforce.

Memory is a facinating and very misunderstoof part of AI.

sigmoid10 3 days ago

>A rare but reproducible effect (1/15 training runs self reinforce)

How did you measure this? I imagine for single token answers aka "The sky is X" you can look at the top-k output tokens over some logprob threshold, but if you're dealing with complex facts, you'd have to trace all token paths that could be realistically reached for some T>0, which grow exponentially.

Reply View 3 replies

johnsmith1840 2 days ago

Take multiple statements like: "the sky is piano"
Inference 10k times for each find a base line guess rate (for most less than 0.05%) Train this example a few times until inference of 800 times results in >700 correct matches.
Then continue training on a dataset I used C4 and CR3 datasets. Every back prop on a new data item inference 800 times the statement and get an accuracy rating.
The effect is so interesting because: 1. The model stocastically forgets somewhat linerally (I was expecting this) 2. Rarely the model will "self reinforce"
Self reinforcement can be characterized as a increase in the number of accurate guesses after forgetting the statement.
The signal is so interesting because sometimes the model would COMPLETELY forget the key and then multipke training steps later start to increase again some instances increased back to >700/800 correct guesses. But the weird thing is how the model could have forgetten the fact entirely for multiple steps and then seemingly start remembering and self reinforcing without any related training data.
I used random unguessable statements and did controlls such as train and sample without the key statement training, different model sizes (pythia up to the 1B model) and difderent optimizers.

Reply View | 0 replies
bopjesvla 3 days ago

Seconding this, also, how much increase in the probability is considered self-reinforcement? Small changes could be attributed to random variation. Interesting if true though

Reply View | 1 reply
- johnsmith1840 2 days ago
  
  From 0/800 guesses to over 700/800 without retraining on the key.
  
  Reply View | 0 replies

moffkalast 3 days ago

That would partially explain why abliteration usually results in major performance loss, as trying to force the model to forget a specific type of reply probably causes a cascading effect with catastrophic forgetting all the way down.

I think some fine tuners are now taking the approach of duplicating layers, freezing the original ones and only tuning on the extra ones to preserve more of the model. Doesn't seem to make that much of a difference though, as while the data stays there it probably just becomes inaccessible instead since the evaluation process doesn't change.

Reply View 1 reply

johnsmith1840 2 days ago

It's all the same really I tried all sorts of fine tuning methods once you've tried a bunch you realize how similar they all are.
None really "solve" memory

Reply View | 0 replies

rokkamokka 3 days ago

Does this mean that an initial fine-tuning could also accidentally restore memories that were "there" already but not accessible? Like the reverse effect

Reply View 1 reply

johnsmith1840 2 days ago

Supposedly, this was a side study of mine. It would require a pretty serious comp budget to fully flesh it out.
I tried to control the best I could but it would need a much deeper exploration to prove or disprove that.

Reply View | 0 replies

orderone_ai 3 days ago

Man, that is truly fascinating. Do you have ideas on how to expand the study to capture broader analysis like that...?

Reply View 2 replies

johnsmith1840 2 days ago

I was trying to solve AGI at the time this was just a side study I did to better understand how models forget the effect was not what I was looking for.
It could be expanded to better understand alignment.
But the resolution makes that cost prohibitive.
I did ~100 runs on different sizes but inferencing 100s of thousands of times made it computationally prohibitive. The key random statement is what allowed accurate measurements of the model.
The equivalent would be for every fine tuning data you train on run the entire evaluation dataset through it.

Reply View | 0 replies
victor22 3 days ago

Yeah I didnt understand shit either

Reply View | 0 replies