Comment by Alifatisk
What did they do to make the loss drop so much in phase 3?
Also, why are they comparing with Llama 4 Maverick? Wasn’t it a flop?
you can't directly compare losses because they changed the data distribution for each phase (I think; it's 100% guaranteed they change the data distribution after the 10 trillion token mark, since that's when they start adding instruction-following data, but I don't know for sure whether the other phase changes also involve data distribution changes)
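To illustrate the point (toy numbers, nothing from the report): the headline loss is roughly a mixture-weighted average of per-source losses, so shifting the mix alone can move the number even if the model didn't get better on any individual source.

```
# Toy illustration only: per-source losses and mixture weights are made up.
per_source_loss = {"web": 2.6, "code": 1.9, "instruction": 1.4}

def mixture_loss(weights):
    """Expected per-token loss under a given data mixture (toy numbers)."""
    return sum(w * per_source_loss[s] for s, w in weights.items())

phase2_mix = {"web": 0.8, "code": 0.2, "instruction": 0.0}  # hypothetical
phase3_mix = {"web": 0.6, "code": 0.2, "instruction": 0.2}  # hypothetical

print(round(mixture_loss(phase2_mix), 3))  # ~2.46
print(round(mixture_loss(phase3_mix), 3))  # ~2.22, lower purely because the mix changed
```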
Comparing to Maverick is probably largely about comparing to the only other North American model that comes close to its size.
Considering this is a preview of the instruct model and it's within spitting distance of Maverick, it's likely meant to showcase "look what we can do with limited funds, imagine what we can do with more."
```
During development of the RSDB, we noted significant enough performance gains from it that we decided to integrate it during phase 3 of the Trinity Large training run instead of waiting for a later training run. While the data distributions between phase 2 and phase 3 make direct comparison difficult, the overall effect was notable: BatchHet reduced by a factor of 4.23x, and step-to-step variance reduced by a factor of 2.4x (see Figure 1), a significant improvement when compared to the default packing strategy. We note that training runs without the RSDB exhibit much higher values in the higher-order moments of the running loss distribution, which we believe to correlate with network instability during training.
```
Page 9 of the technical report has more details, but it looks like they found some data prep methods as well as some other optimizations that overall worked out really well. I don't think it was any one particular thing.
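For anyone wondering what "higher-order moments of the running loss distribution" means in practice, my reading (not code from the report; the windowing and statistics here are my assumptions) is something like tracking variance, skewness, and kurtosis of recent step losses, since occasional loss spikes show up in the tails long before they move the mean:

```
import numpy as np

def running_moments(step_losses, window=200):
    """Mean, variance, skewness, and excess kurtosis over a trailing window."""
    x = np.asarray(step_losses, dtype=np.float64)[-window:]
    mu = x.mean()
    c = x - mu
    var = c.var()
    std = var ** 0.5 + 1e-12
    skew = (c ** 3).mean() / std ** 3
    kurt = (c ** 4).mean() / std ** 4 - 3.0  # excess kurtosis
    return mu, var, skew, kurt

# Example: a smooth run vs. one with occasional loss spikes.
rng = np.random.default_rng(0)
stable = rng.normal(loc=2.2, scale=0.02, size=500)
spiky = stable.copy()
spiky[::50] += 0.5  # occasional instability spikes
print(running_moments(stable))  # tails close to Gaussian
print(running_moments(spiky))   # mean barely moves, skew/kurtosis blow up
```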
As far as Llama 4 goes, it was only referenced as a similarly sized model; they called it one of their model "peers". I don't think they intended any sort of quality comparison. Llama 4 was notable for its sparsity, and despite its poor performance and reception, some of what they achieved technically was solid, useful research.