Comment by cgearhart
So DSA means a lightweight indexing model evaluated over the entire context window, plus full attention over only the top-k tokens it selects. There's no softmax in the indexing model, so it can run blazingly fast in parallel.
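For anyone else trying to picture the two stages, here's a toy sketch of how I understand them fitting together (single head, single query token; the projection names and the ReLU scoring are my assumptions, not necessarily the paper's exact formulation):

```python
import torch
import torch.nn.functional as F

def sparse_attention_via_indexer(q, k_cache, v_cache, wq_idx, wk_idx, top_k):
    """Index-then-attend sparse attention, roughly as I read DSA.

    q:        (d,)        query for the current token
    k_cache:  (T, d)      cached keys over the full context
    v_cache:  (T, d)      cached values
    wq_idx:   (d, d_idx)  projection into a small indexing space (hypothetical)
    wk_idx:   (d, d_idx)  same for keys (hypothetical)
    top_k:    number of tokens the full attention actually sees
    """
    # Stage 1 -- lightweight indexer: cheap scores over the *entire* context.
    # No softmax here (a plain ReLU in this sketch), so every position
    # can be scored independently and in parallel.
    iq = q @ wq_idx                      # (d_idx,)
    ik = k_cache @ wk_idx                # (T, d_idx)
    index_scores = F.relu(ik @ iq)       # (T,)

    # Stage 2 -- keep only the top-k positions for the expensive attention.
    top_k = min(top_k, k_cache.shape[0])
    idx = index_scores.topk(top_k).indices

    # Full softmax attention, but over just the k selected tokens.
    d = q.shape[-1]
    attn = F.softmax((k_cache[idx] @ q) / d**0.5, dim=-1)   # (top_k,)
    return attn @ v_cache[idx]                              # (d,)
```

The point being: the O(T) work is all in the cheap dot products, and the O(k) softmax attention never sees the full window.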
I’m surprised that a fixed-size k doesn’t degrade in long context windows, though. That’s a _lot_ of responsibility to push into that indexing function. How can such a simple model achieve high enough precision and recall with a fixed-size k as the context grows?