Comment by 7thpower 6 days ago

They are limited in how much they can output, and there is generally an inverse relationship between the number of tokens you send and response quality once you go past the first 20-30 thousand tokens.

smallnix 5 days ago

Are there papers on this effect? I mean that the quality of responses diminishes with very large inputs. I've observed the same thing.

  • Breza 6 hours ago

    I've experienced this problem, but I haven't come across papers about it. For this context, it would be interesting to compare the accuracy of transcribing one page at a time against batches of n pages.

  • HarHarVeryFunny 5 days ago

    I think these models all "cheat" to some extent with their long context lengths.

    The original transformer had dense attention, where every token attends to every other token, so the computational cost grows quadratically with context length. There are other attention patterns that can be used though, such as only attending to recent tokens (sliding window attention), or only having a few global tokens that attend to all the others, or even attending to random tokens, or using combinations of these (e.g. Google's "Big Bird" attention from the ELMo/BERT muppet era).
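
    To make the cost difference concrete, here is a minimal sketch (not any particular model's implementation; numpy and the window size of 4 are just illustrative choices) comparing a dense attention mask to a sliding-window mask:

        import numpy as np

        def dense_mask(n):
            # every token may attend to every other token,
            # so the mask has n*n allowed pairs (quadratic in n)
            return np.ones((n, n), dtype=bool)

        def sliding_window_mask(n, window=4):
            # each token attends only to itself and the previous `window` tokens,
            # so the number of allowed pairs grows roughly linearly with n
            diff = np.arange(n)[:, None] - np.arange(n)[None, :]
            return (diff >= 0) & (diff <= window)

        n = 8
        print(dense_mask(n).sum())           # 64 allowed pairs
        print(sliding_window_mask(n).sum())  # 30 allowed pairs

    Scaling n up makes the gap dramatic: at 100,000 tokens the dense mask has 10^10 pairs, while the sliding-window mask stays at roughly n * (window + 1).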

    I don't know what types of attention the SOTA closed-source models are using, and they may well be using different techniques, but it wouldn't be surprising if there were "less attention" paid to tokens far back in the context. It's not obvious why this would affect a task like page-by-page OCR on a long PDF, though, since there it's only the most recent page that needs attending to.