I think this entire thread of discussion would benefit from remembering that multimodal models exist. In other words, pictures are worth a thousand words and have their own place in thought. The existence of a way to translate between modalities doesn't make any of them superior overall--they each have their roles to play.