Comment by pizza a day ago

There definitely is precedent: any parallelizably decodable, CABAC-derived neural compression algorithm has a flavor of this idea at its heart. You intersperse statistical state throughout the token stream so the decoder can pick up fresh state on the fly instead of depending on everything that came before it.

Taken to its extreme, where the 'memory' is descriptive enough to deterministically control the decoding on its own, you get parallelism over the sequence for free as a consequence of associativity.
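As a minimal sketch of that idea (a toy delta code with periodic "keyframe" snapshots, not any specific codec; the interval and helper names are made up): each segment that begins with a snapshot carries all the state it needs, so segments can be decoded concurrently.

```python
# Toy sketch: delta-encoded integers with periodic absolute "keyframe" values.
# Segments that start at a keyframe are independently decodable, so they can
# be handed to parallel workers.
from concurrent.futures import ThreadPoolExecutor
from itertools import accumulate

KEYFRAME_EVERY = 4  # hypothetical interval; real codecs choose this adaptively

def encode(values):
    """Emit (is_keyframe, payload): absolute value at keyframes, delta otherwise."""
    stream, prev = [], 0
    for i, v in enumerate(values):
        if i % KEYFRAME_EVERY == 0:
            stream.append((True, v))          # self-contained state snapshot
        else:
            stream.append((False, v - prev))  # depends on running state
        prev = v
    return stream

def decode_segment(segment):
    """Decode one segment; its first token carries all the state it needs."""
    start = segment[0][1]
    deltas = [payload for _, payload in segment[1:]]
    return list(accumulate([start] + deltas))

def decode_parallel(stream):
    # Split at keyframes, then decode each chunk independently.
    segments, current = [], []
    for tok in stream:
        if tok[0] and current:
            segments.append(current)
            current = []
        current.append(tok)
    segments.append(current)
    with ThreadPoolExecutor() as pool:
        decoded = pool.map(decode_segment, segments)
    return [v for seg in decoded for v in seg]

if __name__ == "__main__":
    data = [3, 5, 9, 9, 12, 15, 15, 20, 21]
    assert decode_parallel(encode(data)) == data
```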

Similar techniques are used to make video codecs robust enough for low-latency reconnection when streaming over poor or changing network conditions, or to decompress JPEGs at over 1 GB/s in parallel by exploiting restart (RST) markers that delimit independently decodable substreams.
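A hedged sketch of the JPEG case: within the entropy-coded scan data, the markers 0xFFD0 through 0xFFD7 (RST0..RST7) reset the DC predictor and bit alignment, so the byte ranges between them can be decoded independently. This only locates the split points; a real decoder also parses the headers, the DRI interval, and so on (function name is made up).

```python
# Locate RST0..RST7 markers in entropy-coded JPEG scan data and split it into
# chunks that a parallel decoder could process independently.
RST_MARKERS = {bytes([0xFF, 0xD0 + i]) for i in range(8)}

def split_scan_at_restarts(scan_bytes: bytes) -> list[bytes]:
    """Split entropy-coded scan data at restart markers."""
    chunks, start, i = [], 0, 0
    while i < len(scan_bytes) - 1:
        pair = scan_bytes[i:i + 2]
        if pair[0] == 0xFF and pair in RST_MARKERS:
            chunks.append(scan_bytes[start:i])  # chunk ends before the marker
            start = i + 2                       # next chunk starts after it
            i += 2
        else:
            i += 1  # 0xFF 0x00 byte stuffing is simply passed over
    chunks.append(scan_bytes[start:])
    return chunks

if __name__ == "__main__":
    fake_scan = b"\x12\x34\xff\xd0\x56\x78\xff\xd1\x9a"
    print(split_scan_at_restarts(fake_scan))  # three independent chunks
```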

That said, I do agree that this is a great paper and a real contribution to language models!