Hey everyone,
I built a simple React/Python app that takes screenshots of websites and converts them to clean HTML/Tailwind code.
It uses GPT-4 Vision to generate the code, and DALL-E 3 to create placeholder images.
To run it, all you need is an OpenAI key with GPT vision access.
I’m quite pleased with how well it works most of the time. Sometimes, the image generations can be hilariously off. See here for a replica of Taylor Swift’s Instagram page: https://streamable.com/70gow1 I initially had a hard time getting it to work on full page screenshots. GPT4 would code up the first couple of sections and then, get lazy and output placeholder comments for the rest of the page. With some prompt engineering, full page screenshots work a whole lot better now. It’s great for landing pages.
Lots of ideas of where to go from here! Let me know if you have feedback and you find this useful :)
1. I learned that NNs are universal function approximators - and the way I understand this is that, at a very high level, they model a set of functions that map inputs to outputs for a particular domain. I certainly get how this works, conceptually, for say MNIST. But for the stuff described here... I'm kind of baffled.
So is GPT's generic training really causing it to implement/embody a value mapping from pixel intensities to HTML+Tailwind text tokens, such that a browser's subsequent interpretation and rendering of those tokens approximates the input image? Is that (at a high level) what's going on? If it is, GPT in modelling not just the pixels->html/css transform but also has a model of how html/css is rendered by the browser back box. I can kind of accept that such a mapping must necessarily exist, but for GPT to have derived it (while also being able to write essays on a billion other diverse subjects) blows my mind. Is the way I'm thinking about this useful? Or even valid?
2. Rather more practically, can this type of tool be thought of as a diagram compiler? Can we see this eventually being part of a build pipeline that ingests Sketch/Figma/etc artefacts and spits-out html/css/js?