Comment by czhu12
The MAMBA [1] model gained some traction as a potential successor. It's basically an RNN without the non-linearity applied across hidden states, so the recurrence is linear and the hidden states can be computed with a parallelizable prefix scan [2] in O(log n) parallel steps rather than O(n) sequential ones (a minimal sketch follows the links below).
It promises much faster inference with much lower compute costs, and I think at up to 7B params it performs on par with transformers. I've yet to see a 40B+ model trained.
The researchers behind MAMBA went on to start a company called Cartesia [3], which applies MAMBA-style models to voice.
[1] https://jackcook.com/2024/02/23/mamba.html
[2] https://www.csd.uwo.ca/~mmorenom/HPC-Slides/Parallel_prefix_... <- Pulled up a random example from google, but Stanford CS149 has an entire lecture devoted to parallel scan.
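To make [2] concrete, here's a minimal sketch (in JAX, not Mamba's actual implementation; function names are just illustrative) of computing the first-order linear recurrence h[t] = a[t]*h[t-1] + b[t] with an associative scan. The key point is that composing two such updates is itself an update of the same form, so the whole sequence can be evaluated in O(log n) depth instead of a strictly sequential O(n) loop:

    # Sketch only: parallel scan over a linear recurrence, assuming h[-1] = 0.
    import jax
    import jax.numpy as jnp

    def combine(left, right):
        # Composing h -> a1*h + b1 then h -> a2*h + b2
        # gives h -> (a2*a1)*h + (a2*b1 + b2), so the op is associative.
        a1, b1 = left
        a2, b2 = right
        return a2 * a1, a2 * b1 + b2

    def scan_states(a, b):
        # All hidden states via a parallel prefix scan (O(log n) depth).
        _, h = jax.lax.associative_scan(combine, (a, b))
        return h

    def loop_states(a, b):
        # Reference sequential implementation (O(n) steps) for comparison.
        h, out = 0.0, []
        for at, bt in zip(a, b):
            h = at * h + bt
            out.append(h)
        return jnp.stack(out)

    a = jax.random.uniform(jax.random.PRNGKey(0), (8,))
    b = jax.random.normal(jax.random.PRNGKey(1), (8,))
    print(jnp.allclose(scan_states(a, b), loop_states(a, b)))  # True

With a non-linearity between steps (as in a classic RNN) the composition trick breaks, which is why you're stuck with the sequential loop there.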
Jamba 1.5 Large is 398B params (94B active) and weights are available.
https://arxiv.org/abs/2408.12570
Credit to https://news.ycombinator.com/user?id=sanxiyn for making me aware of this.