Comment by lambdaone
He's partly right. There's certainly a law of diminishing returns with respect to model size, compute, dataset size, and so on, if all we do is more of what we're already doing.
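(To make the diminishing-returns point concrete: the empirical scaling law fitted by Hoffmann et al. in the 2022 "Chinchilla" paper, whose functional form and exponents are theirs rather than mine, models pretraining loss as

$$ L(N, D) \;=\; E \;+\; \frac{A}{N^{\alpha}} \;+\; \frac{B}{D^{\beta}}, \qquad \alpha \approx 0.34,\ \beta \approx 0.28, $$

where N is parameter count and D is training tokens. Because both correction terms decay as small fractional powers, each additional order of magnitude of parameters or data buys a progressively smaller reduction in loss.)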
But Marcus seems to assume that fundamental theoretical improvements in the field are impossible. I see the reverse: the insights gained from brute-force models have already produced a lot of promising research.
Transformers are not the be-all and end-all of models, nor are current training methods the best that can ever be achieved. Discounting any possibility of further theoretical developments seems a bold position to take.