ChatGPT is the language center of the brain, so to speak. If we are trying to model how we believe we are doing things, you'd expect a separate model optimized for images running side by side, with the two interacting (much like Sydney is GPT interacting with search).
And maybe even multiple of each, so that if you give the chatbot a high-level task, it will break it down into subtasks - like it already can - and then hand each task off to a separate instance.
Me: Give me the entire hexadecimal format of an example PNG. Only give me the hexadecimal format.
GPT-3:
89504E470D0A1A0A0000000D49484452000000640000006408020000000065238226000000014944415478DAECFD07780D44204C60F81EADAEF777F7E7E62F1BDE7DEBDED710EC15C7AC81CEEC17069C59B99A1698BEE7A484D68FDE782A7C41A8A0E7D2A2C9B00A99F32FBCED