Comment by liuliu 3 days ago

One weakness of this method is that you have to store the U and V factors decomposed from W. My linear algebra is rusty, but that seems to be required if you want to scale within the subspace that U projects onto, which doubles your weight memory footprint (that said, U / V should be easier to quantize from an information-theory perspective). I also think MoE is more principled if you want expert activations. But I understand that Sakana's research focuses mostly on adapting existing pretrained models rather than training them from scratch.
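
To make the memory point concrete, here is a rough NumPy sketch. It assumes the decomposition in question is a plain SVD, W = U diag(s) V^T, and that adaptation rescales the singular values with a vector z. The layer width `n`, the vector `z`, and all variable names are hypothetical and only for illustration; this is not taken from Sakana's code.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 64  # hypothetical width of a square weight matrix
W = rng.standard_normal((n, n)).astype(np.float32)

# Decompose W = U @ diag(s) @ Vt (assumed to be a plain SVD).
U, s, Vt = np.linalg.svd(W, full_matrices=False)

# Hypothetical scaling vector, one entry per singular value.
z = rng.uniform(0.5, 1.5, size=s.shape).astype(np.float32)

# Adapted weight: rescale each singular direction, then recompose.
W_adapted = (U * (s * z)) @ Vt

# Parameter counts: dense W versus the stored factors U, s, Vt.
dense_params = W.size
factored_params = U.size + s.size + Vt.size
ratio = factored_params / dense_params

print(f"dense: {dense_params}, factored: {factored_params}, ratio: {ratio:.3f}")
```

For a square n x n matrix the factors take 2n^2 + n values against n^2 for W itself, so the ratio is 2 + 1/n, which is where the "double" comes from. With z set to all ones the recomposed matrix is just W again, up to floating-point error.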