Comment by shawntan
Theoretical results exist that try to quantify the number of CoT tokens needed to reach different levels of computational expressibility: https://arxiv.org/pdf/2310.07923
TL;DR: Getting to Turing completeness can require polynomial CoT tokens, wrt the input problem size. For a field that constantly harps on parallelism and compute efficiency, this requirement seems prohibitive.
We really need to get away from constant depth architectures.
> Getting to Turing completeness can require polynomial CoT tokens, wrt the input problem size.
So, as stated, this is impossible since it violates the Time Hierarchy Theorem.
The actual result of the paper is that any poly-time computable function can be computed with poly-many tokens. Which is... not a particularly impressive bound? Any non-trivial fixed neural network can, for instance, compute the NAND of two inputs. And any polynomial computable function can be computed with a polynomial number of NAND gates.