Is that where you approximate a partial derivative as a difference in loss over a small difference in a single parameter's value?
Seems like a great way to verify results, but it has the same downsides as forward-mode automatic differentiation, since it likewise needs a separate evaluation per parameter.
Yes, the purpose is to verify the gradient computations, which are typically incorrect on the first try for things like self-attention and softmax. It is very slow.
It wouldn't be necessary with auto-differentiation, but this project doesn't use that.
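For reference, here's a minimal sketch of that kind of gradient check, assuming a scalar `loss` function over a flat NumPy parameter vector (the names `loss`, `params`, and `analytic_grad` are hypothetical, not from the project being discussed):

```python
import numpy as np

def numerical_grad(loss, params, eps=1e-5):
    """Central-difference estimate of d(loss)/d(params), one parameter at a time."""
    grad = np.zeros_like(params)
    for i in range(params.size):
        orig = params.flat[i]
        params.flat[i] = orig + eps   # nudge one parameter up
        loss_plus = loss(params)
        params.flat[i] = orig - eps   # and down
        loss_minus = loss(params)
        params.flat[i] = orig         # restore
        grad.flat[i] = (loss_plus - loss_minus) / (2 * eps)
    return grad

# Spot-check an analytic/backprop gradient against the numerical one:
# assert np.allclose(numerical_grad(loss, params), analytic_grad, atol=1e-4)
```

Each check costs two full loss evaluations per parameter, which is why it's only practical as an occasional sanity check on small inputs rather than something you'd run during training.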