Ask HN: Is anybody building an alternative transformer?
147 points by taiboku256 7 days ago
Curious if anybody out there is trying to build a new model/architecture that would succeed the transformer?
I geek out on this subject in my spare time. Curious if anybody else is doing the same, and whether you're willing to share ideas?
The Mamba [1] model gained some traction as a potential successor. It's basically an RNN without the nonlinearity applied across hidden states: because each state update is linear, the whole sequence can be computed with a parallel scan [2] in logarithmic depth (instead of a linear chain of sequential steps), and generation runs in constant time per token, since the state is fixed-size rather than a growing KV cache.
It promises much faster inference at much lower compute cost, and I think at up to 7B params it performs on par with transformers. I've yet to see a 40B+ model trained.
The researchers behind Mamba went on to start a company called Cartesia [3], which applies Mamba to voice models.
[1] https://jackcook.com/2024/02/23/mamba.html
[2] https://www.csd.uwo.ca/~mmorenom/HPC-Slides/Parallel_prefix_... <- a random example pulled up from Google, but Stanford's CS149 has an entire lecture devoted to parallel scan.
[3] https://cartesia.ai/