Comment by rao-v
I wonder whether, in the long term, with compute being cheap and parameter count being the constraint, it will make sense to train models to be robust to different activation functions that look like ReLU (i.e. Swish, GELU, etc.).
You might even be able to do an ugly version of this (akin to dropout) where you randomly swap activation functions during training (with adjusted scaling factors so they mostly yield output shapes similar to ReLU for most inputs). The point is that we mostly know what a ReLU-like activation function is supposed to do, so why should we care about the edge cases of the analytical limits of any specific one?
The advantage would be that you’d probably get useful gradients out of one of them (for training), and could swap to the computationally cheapest one during inference.
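A minimal sketch of what I mean, in PyTorch: a drop-in module that picks a ReLU-like activation at random on each training forward pass and pins a fixed (presumably cheapest) one at inference. The candidate list and the rescaling constants here are placeholder assumptions, not tuned values.

```python
import random
import torch.nn as nn
import torch.nn.functional as F

class RandomReLULike(nn.Module):
    """Randomly swaps among ReLU-like activations during training (akin to dropout)."""

    def __init__(self):
        super().__init__()
        # (activation, rough rescaling so outputs roughly match ReLU's magnitude);
        # the scale factors are hand-waved assumptions.
        self.candidates = [
            (F.relu, 1.0),
            (F.gelu, 1.0),   # close to ReLU for large positive inputs
            (F.silu, 1.1),   # Swish; slightly smaller outputs near zero
        ]
        self.inference_choice = 0  # index of the activation to use at inference (e.g. plain ReLU)

    def forward(self, x):
        if self.training:
            fn, scale = random.choice(self.candidates)
        else:
            fn, scale = self.candidates[self.inference_choice]
        return fn(x) * scale
```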