Comment by bob1029 2 days ago
> Deep learning models produce their outputs using a series of transformations distributed across many computational units (artificial “neurons”). The field of mechanistic interpretability seeks to describe these transformations in human-understandable language.
This is the central theme behind why I find techniques like genetic programming to be so compelling. You get interpretability by default. The second-order effect of this seems to be that you can generalize using substantially less training data. The humans developing the model can look inside the box and set breakpoints, inspect memory, snapshot/restore state, follow the rabbit, etc.
The biggest tradeoff here is that the search space over computer programs tends to be substantially more rugged. You can't use math tricks to cheat the computation. You have to run every damn program end-to-end and measure the performance of each directly. However, you can execute linear program tapes very, very quickly on modern x86 CPUs. You can search through a billion programs with a high degree of statistical certainty in a few minutes. I believe we are at a point where some of the ideas from the 20th century are viable again.
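To make "linear program tapes" concrete, here is a minimal sketch of what such a search loop can look like, assuming a toy register-machine encoding and a pure random search over tapes (the encoding, register count, and the x*x + 1 target are all illustrative assumptions, not anything from the comment above). The point it illustrates is the "run every program end-to-end" cost model: fitness is measured only by executing each candidate tape.

```c
/* Sketch: linear program tapes over a tiny register file, scored by
 * running each tape end-to-end on a toy symbolic-regression target. */
#include <stdio.h>
#include <stdlib.h>

#define TAPE_LEN  16
#define NUM_REGS  4
#define NUM_CASES 32

typedef enum { OP_ADD, OP_SUB, OP_MUL, OP_MOV, NUM_OPS } Op;

typedef struct { unsigned char op, dst, a, b; } Instr;

/* Execute one tape: r[0] holds the input and is read back as the output. */
static double run_tape(const Instr *tape, double x) {
    double r[NUM_REGS] = { x, 1.0, 0.0, 0.0 };
    for (int i = 0; i < TAPE_LEN; i++) {
        double a = r[tape[i].a], b = r[tape[i].b];
        switch (tape[i].op) {
            case OP_ADD: r[tape[i].dst] = a + b; break;
            case OP_SUB: r[tape[i].dst] = a - b; break;
            case OP_MUL: r[tape[i].dst] = a * b; break;
            case OP_MOV: r[tape[i].dst] = a;     break;
            default: break;
        }
    }
    return r[0];
}

/* Fitness = sum of squared error against the toy target f(x) = x*x + 1. */
static double fitness(const Instr *tape) {
    double err = 0.0;
    for (int i = 0; i < NUM_CASES; i++) {
        double x = (double)i / 4.0;
        double diff = run_tape(tape, x) - (x * x + 1.0);
        err += diff * diff;
    }
    return err;
}

static void random_tape(Instr *tape) {
    for (int i = 0; i < TAPE_LEN; i++) {
        tape[i].op  = (unsigned char)(rand() % NUM_OPS);
        tape[i].dst = (unsigned char)(rand() % NUM_REGS);
        tape[i].a   = (unsigned char)(rand() % NUM_REGS);
        tape[i].b   = (unsigned char)(rand() % NUM_REGS);
    }
}

int main(void) {
    srand(42);
    Instr best[TAPE_LEN], cand[TAPE_LEN];
    random_tape(best);
    double best_err = fitness(best);

    /* Pure random search for brevity; a real GP run would mutate and
     * crossover the survivors instead of sampling fresh tapes. */
    for (long n = 0; n < 1000000; n++) {
        random_tape(cand);
        double e = fitness(cand);
        if (e < best_err) {
            best_err = e;
            for (int i = 0; i < TAPE_LEN; i++) best[i] = cand[i];
        }
    }
    printf("best squared error after 1M random tapes: %g\n", best_err);
    return 0;
}
```

Since each tape is just a short array of integer-coded instructions, the inner loop vectorizes and branch-predicts well, which is why throughput on a modern CPU can reach the scale the comment describes. The winning tape is also directly readable: you can print its instructions, step through them, or delete them one at a time to see which ones matter.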
For a complex enough problem (like next-word prediction on arbitrary text), I really have my doubts that any such method will result in an "interpretable" solution. More likely you end up with a giant stack of indecipherable if statements, gotos, and random multiplications. And that's assuming no matrices are involved; introduce those and you've just got a non-differentiable, non-parallelizable neural network.