Comment by RossBencina 7 days ago

One claim from that podcast was that xLSTM's recurrent memory mechanism (its stand-in for attention) is, in practical implementations, more efficient than transformer flash attention, and therefore promises to significantly reduce the time/cost of test-time compute.
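
For intuition only (my sketch, not from the podcast or the xLSTM codebase): the usual argument is that an mLSTM-style recurrence keeps a fixed-size matrix memory, so generating each new token costs O(d^2) regardless of context length, whereas attention decode has to scan a KV cache that grows with every token. Flash attention reduces memory traffic but still reads the whole cache per token. All names and numbers below are illustrative, not the paper's kernels.

```python
# Toy sketch: fixed-size recurrent matrix memory vs. a growing attention KV cache.
import numpy as np

d = 64  # head dimension (illustrative)

def mlstm_style_step(C, n, q, k, v, f=0.97, i=1.0):
    """One recurrent decode step with a fixed d x d matrix memory.
    Per-token cost is O(d^2), independent of how many tokens came before."""
    C = f * C + i * np.outer(v, k)      # update matrix memory
    n = f * n + i * k                   # update normalizer state
    h = C @ q / max(abs(n @ q), 1.0)    # read out for the current query
    return C, n, h

def attention_decode_step(K_cache, V_cache, q, k, v):
    """One attention decode step: the KV cache grows by one row per token,
    so per-token cost and state are both O(t * d)."""
    K_cache = np.vstack([K_cache, k])
    V_cache = np.vstack([V_cache, v])
    scores = K_cache @ q / np.sqrt(d)
    w = np.exp(scores - scores.max())
    w /= w.sum()
    h = w @ V_cache
    return K_cache, V_cache, h

# Drive both for a few tokens to show the differing state growth.
rng = np.random.default_rng(0)
C, n = np.zeros((d, d)), np.zeros(d)
K_cache, V_cache = np.empty((0, d)), np.empty((0, d))
for t in range(1, 6):
    q, k, v = rng.standard_normal((3, d))
    C, n, _ = mlstm_style_step(C, n, q, k, v)
    K_cache, V_cache, _ = attention_decode_step(K_cache, V_cache, q, k, v)
    print(f"t={t}: recurrent state {C.size + n.size} floats, "
          f"KV cache {K_cache.size + V_cache.size} floats")
```

Whether that asymptotic picture actually beats a tuned flash-attention kernel at realistic context lengths is exactly the practical claim being made, and it would need benchmarks to verify.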