Comment by czhu12
The MAMBA [1] model gained some traction as a potential successor. It's basically an RNN without the non-linearity applied across hidden states, so the recurrence is linear and the hidden states can be computed with a parallelizable prefix scan [2] in O(log n) parallel steps rather than O(n) sequential ones (a minimal sketch follows the links below).
It promises much faster inference with much lower compute costs, and I think at up to 7B params it performs on par with transformers. I've yet to see a 40B+ model trained.
The researchers behind MAMBA went on to start a company called Cartesia [3], which applies MAMBA-style models to voice.
[1] https://jackcook.com/2024/02/23/mamba.html
[2] https://www.csd.uwo.ca/~mmorenom/HPC-Slides/Parallel_prefix_... <- Pulled up a random example from google, but Stanford CS149 has an entire lecture devoted to parallel scan.
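To make [2] concrete, here's a minimal sketch (in JAX, not Mamba's actual implementation; function names are just illustrative) of computing the first-order linear recurrence h[t] = a[t]*h[t-1] + b[t] with an associative scan. The key point is that composing two such updates is itself an update of the same form, so the whole sequence can be evaluated in O(log n) depth instead of a strictly sequential O(n) loop:

    # Sketch only: parallel scan over a linear recurrence, assuming h[-1] = 0.
    import jax
    import jax.numpy as jnp

    def combine(left, right):
        # Composing h -> a1*h + b1 then h -> a2*h + b2
        # gives h -> (a2*a1)*h + (a2*b1 + b2), so the op is associative.
        a1, b1 = left
        a2, b2 = right
        return a2 * a1, a2 * b1 + b2

    def scan_states(a, b):
        # All hidden states via a parallel prefix scan (O(log n) depth).
        _, h = jax.lax.associative_scan(combine, (a, b))
        return h

    def loop_states(a, b):
        # Reference sequential implementation (O(n) steps) for comparison.
        h, out = 0.0, []
        for at, bt in zip(a, b):
            h = at * h + bt
            out.append(h)
        return jnp.stack(out)

    a = jax.random.uniform(jax.random.PRNGKey(0), (8,))
    b = jax.random.normal(jax.random.PRNGKey(1), (8,))
    print(jnp.allclose(scan_states(a, b), loop_states(a, b)))  # True

With a non-linearity between steps (as in a classic RNN) the composition trick breaks, which is why you're stuck with the sequential loop there.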
Jamba 1.5 Large is 398B params (94B active) and weights are available.
https://arxiv.org/abs/2408.12570
Credit to https://news.ycombinator.com/user?id=sanxiyn for making me aware of this.