Comment by nycdatasci

You seem to be suggesting that current frontier models are trained only on text and not on "sensor data". Multimodal models are trained on much of the public internet plus vast amounts of synthetic data, and images and video are key inputs. Camera sensors can capture much more "sensor data" than the human eye. Neural networks are the worst way to model intelligence, except for all the others.

You may find this talk enlightening: https://simons.berkeley.edu/talks/ilya-sutskever-openai-2023...