
Isn't GPT-4o multimodal? Shouldn't I be able to just feed in an image of the rendered HTML, instead of doing work to strip tags out?



It is theoretically possible, but both the results and the bandwidth would be worse: an image large enough to keep the text legible takes far longer to send than the equivalent text.
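For comparison, stripping tags down to visible text is cheap with the standard library alone. A minimal sketch using `html.parser` (the sample HTML and class name are invented for illustration):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0  # >0 while inside script/style

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())

def strip_tags(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.parts)

sample = "<html><head><style>body{}</style></head><body><h1>Title</h1><p>Hello world</p></body></html>"
print(strip_tags(sample))  # Title Hello world
```

The resulting text is a few kilobytes for a typical page, versus hundreds of kilobytes for a screenshot rendered at a resolution where the text stays readable.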


This. Images passed to LLMs are typically downsampled to something like 512x512, which is perfectly good for general feature extraction. Reading text reliably would require much larger inputs so the glyphs stay legible after downsampling.
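A back-of-envelope calculation shows why downsampling hurts text. All the numbers below are assumptions for illustration, not any model's actual spec:

```python
# Assumed rendered page width and line density (not from any model's spec).
page_width_px = 1280      # typical rendered page width
chars_per_line = 100      # a fairly dense line of text

glyph_px = page_width_px / chars_per_line   # ~12.8 px per glyph as rendered

model_input_px = 512                        # assumed downsample target
scale = model_input_px / page_width_px      # 0.4

glyph_after = glyph_px * scale              # glyph width after downsampling
print(round(glyph_after, 1))                # ~5.1 px, well below comfortable legibility
```

At around 5 px per character, OCR-style reading becomes unreliable, so you would need to send a much larger image (or many tiles) to preserve the text.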


Images are much less reliable than text, unfortunately.



