Comment by robot-wrangler 3 days ago
Chain-of-code is better than chain-of-thought because it's more grounded, more specific, and achieves a lot of useful compression. But my bet is that the proposed program-of-thought is too specific. Moving all the way from "very fuzzy specification" to "very concrete code" skips all of the space in the middle, and now there's no room to iterate without a) burning lots of tokens and b) getting bogged down in finding and fixing whatever new errors are introduced in the translated representations. IOW, when there's an error, will it be in the code itself or in the scenario that code was supposed to be representing?
I think the intuition that lots of people jumped to early about how "specs are the new code" was always correct, but at the same time it was absolutely nuts to think that specs can be represented well with natural language and bullet lists in markdown. We need chain-of-spec that leverages something semi-formal and then iterates on that representation, probably with feedback from other layers. Natural language provides constraints, and guess-and-check code generation sits at the implementation level, but neither is actually the specification, which is the heart of the issue. A perfect intermediate language will probably end up being something pretty familiar that leverages and/or combines existing formal methods from model checkers, logic, games, discrete simulations, graphs, UML, etc. Why? It's just very hard to beat this stuff for compression, and this is what all the "context compaction" things are really groping towards anyway. See also the wisdom about "programming is theory building" and so on.
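To make the "semi-formal intermediate" idea concrete, here is a minimal sketch (Python, standard library only; `Spec` and `check` are invented names, not an existing tool): a spec expressed as a transition system with named invariants, checked by brute-force state exploration. The point isn't the toy checker itself; it's that this kind of representation is compact, machine-checkable, and something a loop can actually iterate on, unlike markdown bullet lists:

```python
from collections import deque

class Spec:
    """A spec as an explicit transition system: initial states, named
    actions (state -> successor states), and named invariants."""
    def __init__(self, init, actions, invariants):
        self.init = set(init)
        self.actions = actions        # list of (name, fn(state) -> iterable of states)
        self.invariants = invariants  # list of (name, predicate(state) -> bool)

def check(spec, max_states=100_000):
    """Breadth-first reachability: report the first invariant violation,
    or None if nothing breaks within the explored bound."""
    seen, frontier = set(spec.init), deque(spec.init)
    while frontier and len(seen) < max_states:
        state = frontier.popleft()
        for name, holds in spec.invariants:
            if not holds(state):
                return f"invariant {name!r} violated at state {state!r}"
        for _, step in spec.actions:
            for nxt in step(state):
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append(nxt)
    return None

# Toy spec: a counter that may keep incrementing but must never exceed 3.
spec = Spec(
    init={0},
    actions=[("incr", lambda s: [s + 1])],
    invariants=[("bounded", lambda s: s <= 3)],
)
print(check(spec))  # -> "invariant 'bounded' violated at state 4"
```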
I think if/when something like that starts getting really useful, you probably won't hear much about it, and there won't be a lot of talk about the success of hybrid systems and LLMs+symbolics. Industry giants would have a huge vested interest in keeping the useful intermediate representations/languages a secret sauce. Why? Well, they can pretend they are still doing something semi-magical with scale and sufficiently deep chain-of-thought, and bill for the extra tokens. That would tend to preserve the appearance of a big-data and big-compute moat for training and inference even as it gradually dries up.
> But my bet is that the proposed program-of-thought is too specific
This is my impression as well, having worked with this type of stuff for the past two years. It works great for very well-defined use cases, as long as user queries don't stray too far from what you optimized your framework/system prompt/agent for. Once you move beyond that, it quickly breaks down.
Nevertheless, as this problem has been bugging me for a while, I still haven't given up (although I probably should ;-). My latest attempt is a Prolog-based DSL (http://github.com/deepclause/deepclause.ai) that allows part of the logic to be handled by LLMs again, so that it retains some of the features of pure LLM-based systems. As a side effect, this gives additional features such as graceful failure, auditability, and increased (but not full) reproducibility.
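For what it's worth, the general pattern is easy to sketch even outside Prolog. This is not the DeepClause API, just an illustration (Python, with a stubbed `call_model` standing in for a real LLM client) of symbolic control flow that delegates one fuzzy predicate to an LLM, logs every call for auditability, and maps failures to "unknown" instead of crashing:

```python
import json, logging

logging.basicConfig(level=logging.INFO)
audit = logging.getLogger("audit")

def call_model(question: str, context: str) -> str:
    # Stub standing in for a real LLM client; swap in an actual API call.
    raise NotImplementedError("plug in your LLM client here")

def llm_predicate(question: str, context: str):
    """Delegate one fuzzy yes/no question to the LLM. Every call is logged
    (auditability); any failure yields None rather than an exception."""
    try:
        answer = call_model(question, context)
        audit.info(json.dumps({"q": question, "a": answer}))
        return answer.strip().lower().startswith("yes")
    except Exception as exc:
        audit.warning(json.dumps({"q": question, "error": str(exc)}))
        return None  # graceful failure: caller decides what "unknown" means

def refund_allowed(order: dict) -> bool:
    # The hard rules stay symbolic and fully reproducible...
    if order["days_since_purchase"] > 30:
        return False
    # ...while one fuzzy sub-question is handed to the model.
    damaged = llm_predicate("Does this complaint describe a damaged item?",
                            order["complaint_text"])
    return bool(damaged)  # None (unknown) conservatively maps to False

order = {"days_since_purchase": 5,
         "complaint_text": "Arrived with a cracked screen."}
print(refund_allowed(order))  # False with the stub; model-dependent otherwise
```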