I'm only part way through the paper, but what struck me as interesting so far is this:
In other text-to-image algorithms I'm familiar with (the ones you'll typically see passed around as colab notebooks that people post outputs from on Twitter), the basic idea is to encode the text, and then try to make an image that maximally matches that text encoding. But this maximization often leads to artifacts - if you ask for an image of a sunset, you'll often get multiple suns, because that's even more sunset-like. There are a lot of tricks and hacks to regularize the process so that it's not so aggressive, but it's always an uphill battle.
Here, they instead take the text embedding, use a trained model (what they call the 'prior') to predict the corresponding image embedding - this removes the dangerous maximization. Then, another trained model (the 'decoder') produces images from the predicted embedding.
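Roughly, the data flow is something like this (a toy sketch just to show the shape of it - these functions are made-up stand-ins for the trained components, not OpenAI's actual code):

    import numpy as np

    rng = np.random.default_rng(0)

    # Made-up stand-ins for the trained components, just to show the data flow.
    def clip_text_encoder(caption):       # caption -> 512-d CLIP text embedding
        return rng.normal(size=512)

    def prior(text_emb):                  # text embedding -> predicted CLIP image embedding
        W = rng.normal(size=(512, 512)) / 512
        return W @ text_emb

    def decoder(image_emb):               # image embedding -> 64x64 RGB "image"
        W = rng.normal(size=(64 * 64 * 3, 512)) / 512
        return (W @ image_emb).reshape(64, 64, 3)

    image = decoder(prior(clip_text_encoder("a photo of a sunset")))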
This feels like a much more sensible approach, but one that is only really possible with access to the giant CLIP dataset and computational resources that OpenAI has.
What always bothers me with this stuff is, well, you say one approach is more sensible than the other because the images happen to come out more pleasing.
But there's no real rhyme or reason, it is a sort of alchemy.
Is text encoding strictly worse or is it an artifact of the implementation? And if it is strictly worse, which is probably the case, why specifically? What is actually going on here?
I can't argue that their results are not visually pleasing. But I'm not sure what one can really infer from all of this once the excitement washes over you.
Blending photos together in a scene in photoshop is not a difficult task. It is nuanced and tedious but not hard, any pixel slinger will tell you.
An app that accepts a smattering of photos and stitches them together nicely can be coded up any number of ways. This is a fantastic and time saving photoshop plugin.
But what do we have really?
"Kuala dunking basketball" needs to "understand" the separate items and select from the image library hoops and a Kuala where the angles and shadows roughly match.
Very interesting, potentially useful. But if doesn't spit up exactly what you want can't edit it further.
I think the next step has got to be that it conjures up a 3d scene in Unreal or blender so you can zoom in and around convincingly for further tweaks. Not a flat image.
> This is a fantastic and time saving photoshop plugin. But what do we have really?
Stock photography sales are in the many billions of dollars per year and custom commissioned photography is larger still. That's a pretty seriously sized ready-made market.
> But if it doesn't spit out exactly what you want, you can't edit it further.
I suspect there's a big startup opportunity in pioneering an easy-to-use interface allowing users to provide fast iterative feedback to the model - including positional and relational constraints ("put this thing over there"). Perhaps even more valuable would be easy yet granular ways to unconstrain the model. For example, "keep the basketball hoop like that but make the basketball an unexpected color and have the panda's right paw doing something pandas don't do that human hands often do."
I've adopted a practice of having odd backgrounds for video conferences.¹ I generally find these through Google image search, but I often have a hard time finding exactly what I would like. My own use case is a bit idiosyncratic and frivolous, but I can see this being really handy for art direction needs. When I used to publish a magazine, I would often have to commission photographs for the needs of the publication. A custom photograph (in the 90s) would cost from $200–$1000² depending on the needs (and none required models). Stock photo pictures for commercial use were often comparable in cost. Being able to generate what I wanted with a tool like this would have been fantastic. I think that this can replace a lot of commercial illustration.
⸻
1. My current work background is an enormous screen-filling eyeball. For my writing group, I try to have something that reflects the story I'm workshopping if I'm workshopping that week and something surreal otherwise.
2. My most expensive custom illustration was a title for an article about stone carver/letterer David Kindersley which I had inscribed in stone and photographed.
Say I'm looking for photography of real events and places, like a royal wedding or a volcano erupting - does this help me? Of specific places and architectural features? Of a protest?
I think if I was istockphoto.com I'd be a little worried, but that is microstock photography. I'm not sure that is worth billions. In fact I know it isn't.
Besides, once this tech is widely available, if anything it devalues this sort of thing further, closer to $0.
It would probably augment existing processes rather than replace them completely.
If you are doing a photoshoot for a banana stand with a human model with characteristics x,y,z you're still going to get a human from an agency or craigslist to pose. If suddenly the client informs you that they needed human a,b,c instead maybe one of these forthcoming tools will let you swap that out faster. You'd upload your photoshoot and an example or two of the type of human model you wished you had retroactively and it would fix it up faster than an intern.
Shutterstock is a direct competitor of iStock and is a $3B company. I personally pay them $200/mo. Maybe you just don't know enough about this industry?
Seems about right. Their yearly revenue is $700 million, I don't know about iStock as it isn't public. Any other big ones?
My hypothesis is that it could be a partial replacement/competitor and devalue their offering - reasonable to assume you'd be paying $99/mo soon and it will gradually decrease as the tech spreads and more competitors emerge.
Adobe is also in this game (https://stock.adobe.com), they are not unfamiliar with AI. You can see how a lot of people will jump on this if it proves to be lucrative.
I don't claim to be an expert and I didn't say this is worthless.
Yeah, I mean you're right that ultimately the proof is in the pudding.
But I do think we could have guessed that this sort of approach would be better (at least at a high level - I'm not claiming I could have predicted all the technical details!). The previous approaches were sort of the best that people could do without access to the training data and resources - you had a pretrained CLIP encoder that could tell you how well a text caption and an image matched, and you had a pretrained image generator (GAN, diffusion model, whatever), and it was just a matter of trying to force the generator to output something that CLIP thought looked like the caption. You'd basically do gradient ascent to make the image look more and more and more like the text prompt (all the while trying to balance the need to still look like a realistic image). Just from an algorithm aesthetics perspective, it was very much a duct tape and chicken wire approach.
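The "how well does this caption match this image" scoring piece looks roughly like this (a sketch using the openai/CLIP package; the file name and prompt are placeholders). The duct-tape part was then doing gradient ascent on that score while trying to keep the image looking natural:

    import torch
    import clip                       # the openai/CLIP package
    from PIL import Image

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess = clip.load("ViT-B/32", device=device)

    image = preprocess(Image.open("candidate.png")).unsqueeze(0).to(device)  # placeholder image
    text = clip.tokenize(["a picture of a sunset"]).to(device)

    with torch.no_grad():
        image_emb = model.encode_image(image)
        text_emb = model.encode_text(text)

    # Cosine similarity = "how sunset-like does CLIP think this image is?"
    score = torch.cosine_similarity(image_emb, text_emb)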
The analogy I would give is if you gave a three-year-old some paints, and they made an image and showed it to you, and you had to say, "this looks a little like a sunset" or "this looks a lot like a sunset". They would keep going back and adjusting their painting, and you'd keep giving feedback, and eventually you'd get something that looks like a sunset. But it'd be better, if you could manage it, to just teach the three-year-old how to paint, rather than have this brute force process.
Obviously the real challenge here is "well how do you teach a three-year-old how to paint?" - and I think you're right that that question still has a lot of alchemy to it.
I gotta be missing something here, because wasn’t “teaching a three year old to paint” (where the three year old is DALLE) the original objective in the first place? So if we’ve reduced the problem to that, it seems we’re back where we started. What’s the difference?
I meant to say that Dall-E 2's approach is closer to "teaching a three year old to paint" than the alternative methods. Instead of trying to maximize agreement to a text embedding like other methods, Dall-E 2 first predicts an image embedding (very roughly analogous to envisioning what you're going to draw before you start laying down paint), and then the decoder knows how to go from an embedding to an image (very roughly analogous to "knowing how to paint"). This is in contrast to approaches which operate by repeatedly querying "does this look like the text prompt?" as they refine the image (roughly analogous to not really knowing how to paint, but having a critic who tells you if you're getting warmer or colder).
Well, original DALL-E also worked this way. The reason the open source models use searches is that OpenAI didn't release DALL-E, but only another project called CLIP they used to sort DALL-E output by quality. It turns out CLIP could be adapted to produce images too if you used it to drive a GAN.
There is a DALL-E model available now from another company and you can use it directly (mini-DALLE or ruDALL-E), but its vocabulary is small and it can't do faces for privacy reasons.
I don't think it is actually painting at all but I need to read the paper carefully.
I think it is using a free text query to select the best possible clipart from a big library and blends it together. Still very interesting and useful.
It would be extremely impressive if the "Koala dunking a basketball" had a puddle on the court in which it was reflected correctly; that would be mind blowing.
This is actual image generation - the 'decoder' takes as input a latent code (representing the encoding of the text query), and synthesizes an image. It's not compositing or querying a reference library. The only time that real images enter the process is during training - after that, it's just the network weights.
It is compositing as a final step. I understand that the koala it is compositing may have been a previously nonexistent koala that it synthesized from a library of previously tagged koala images... that's cool, but what is the difference, really, from just dropping one of the pre-existing koalas into the scene?
The difference is just that it makes the compositing easier. If you don't have a pre-existing image that would match the shadows and angles, you can hallucinate a new koala that does. Neat trick.
But I bet if I threw the poor marsupial at a basket net it would look really different from the original clipart of it climbing some tree in a slow and relaxed manner. See what I mean?
Maybe Dall-E 2 can make it strike a new pose. The limb positions could be altered. But the facial expression?
And if the basketball background has wind blowing leaves in one direction, the koala fur won't match; it will look like the training set fur. The puddle won't reflect it. Etc.
This thing doesn't understand what a koala is the way a 3-yr-old does. It understands that the text "Koala" is associated with that tagged collection of pixel blobs and can conjure up similar blobs onto new backgrounds - but it can't paint me a new type of koala that it hasn't seen before. It just looks that way.
> And if the basketball background has wind blowing leaves in one direction, the koala fur won't match; it will look like the training set fur. The puddle won't reflect it.
If you read the article, it gives examples that do exactly this. For example, adding a flamingo shows the flamingo reflected in a pool. Adding a corgi at different locations in a photo of an art gallery shows it in picture style when it's added to a picture, then in photorealistic style when it's on the ground.
Well not so much an article as really interesting hand picked examples. The paper doesn't address this as far as I can tell. My guess is that this is a weak point that will trip it up occasionally.
A lot of the time it doesn't super matter, but sometimes it does.
I might be misinterpreting your use of "compositing" here (and my own technical knowledge is fairly shallow) but I don't think there's any compositing of elements generally in AI image generation. (unless Dall-E 2 changes this. I haven't read the paper yet)
> Given an image x, we can obtain its CLIP image embedding zi and then use our decoder to “invert” zi, producing new images that we call variations of our input.
..
It is also possible to combine two images for variations. To do so, we perform spherical interpolation of their CLIP embeddings zi and zj to obtain intermediate zθ = slerp(zi, zj, θ), and produce variations of zθ by passing it through the decoder.
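For anyone curious, slerp itself is only a few lines (a generic NumPy version, not the paper's code; it assumes the two embeddings aren't parallel):

    import numpy as np

    def slerp(z_i, z_j, theta):
        # Spherical interpolation between two embeddings, theta in [0, 1].
        cos_omega = np.dot(z_i / np.linalg.norm(z_i), z_j / np.linalg.norm(z_j))
        omega = np.arccos(np.clip(cos_omega, -1.0, 1.0))   # angle between the embeddings
        return (np.sin((1 - theta) * omega) * z_i + np.sin(theta * omega) * z_j) / np.sin(omega)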
From the limitations section:
> We find that the reconstructions mix up objects and attributes.
The first quote is talking about prompting the model with images instead of text. The second quote is using "mix up" in the sense that the model is confused about the prompt, not that it mixes up existing images.
ML models can output training data verbatim if they over-fit, but a well trained model does extrapolate to novel inputs. You could say that this model doesn't know that images are 2d representations of a larger 3d universe, but now we have NERF which kind of obsoletes this objection as well.
The model is "confused about the prompt" because it has no concept of a scene or of (some sort of) reality.
If we task "Koala dunking basketball" to a human and present them with two images, one of a koala climbing a tree and another of a basketball player dunking - the human would cut out the foregrounds (human, koala) from the backgrounds (basketball court, forest) and swap them easily.
The laborious part would be to match the shadows and angles in the new image. This requires skill and effort.
Dall-E would conjure up an entirely novel image from scratch, dodging this bit. It blended the concepts instead, great.
But it does not understand what a basketball court actually is, or why the koala would reflect in a puddle. Or why and how this new koala might look different in these circumstances from previous examples of koalas that it knows about.
The human dunker and the koala dunker are not truly interchangeable. :)
I'm not sure that's "compositing" except in the most abstract sense? But maybe that's the sense in which you mean it.
I'd argue that at no point is there a representation of a "teddy bear" and "a background" that map closely to their visual representation - that are combined.
(I'm aware I'm being imprecise so give me some leeway here)
I think deep learning is better thought of as "science" than "engineering." Right now we're in the stage of the Greeks and Arabs where we know "if we do this then that happens." It will be a while before we have a coherent model of it, and I don't think we will ever solve all of its mysteries.
We are getting closer with variational methods and kernel methods to achieving a more holistic framework for understanding machine learning (incl. traditional deep learning) training and inference. There is a deep unity in the fundamentals of machine learning, formed into a cohesive whole by applying the analytical techniques of statistical mechanics and Bayesian probability theory.
This is exactly what they demo - they lock a scene and add a flamingo in three different locations. In another one they lock the scene and add a corgi.
- Select from X variations the new image that looks best to you
- It does the equivalent of a google image search on your "flamingo" prompt
- It picks the most blend-able ones as a basis to a new synthetic flamingo
- It superimposes the result on your image
Very cool don't get me wrong. Now I want to tweak this new floating flamingo I picked further, or have that Corgi in the museum maybe sink into the little couch a bit as it has weight in the real world.
Can't. You'd have to start over with the prompt or use this as the new base image maybe.
The example with furniture placement in an empty room is also very interesting. You could describe the kind of couch you want and where you want it and it will throw you decent options.
But say I want the purple one in the middle of the room that it gave me as an option, but rotated a little bit. It would generate a completely new purple couch. Maybe it will even look pretty similar but not exactly the same.
That's not how this works. There is no 'search' step, there is no 'superimposing' step. It's not really possible to explain what the AI is doing using these concepts.
If you pay attention to all the corgi examples, the sofa texture changes in each of them, and it synthesizes shadows in the right orientation - that's what it's trained to do. The first one actually does give you the impression of weight. And if you look at "A bowl of soup that looks like a monster knitted out of wool" the bowl is clearly weighing down. I bet if the picture had a more fluffy sofa you would indeed see the corgi making an indent on it, as it will have learned that from its training set.
Of course there will be limits to how much you can edit, but then nothing stops you from pulling that into Photoshop for extra fine adjustments of your own. This is far from a 'cool trick' and many of those images would take hours for a human to reproduce, especially with complex textures like the Teddy Bear ones. And note how they also have consistent specular reflections in all the glass materials.
How do you propose we talk about what it is doing if not by using the terminology from the human editing process it is replacing? I'm struggling to express things.
My issue is that it appears to not be possible to explain what the AI is doing at all. If you could, you'd be able to actually control the output. And talking about how the model is trained is interesting but not an answer.
Of course there is a superimposing step, that just means it adds its layer on top of the photo you provide. That's all it means and that's literally what it is doing, that's all I tried to say, heh.
> If you pay attention to all the corgi examples, the sofa texture changes in each of them
Yes, exactly!
> This is far from a 'cool trick' and many of those images would take hours for a human to reproduce
OK, fair enough. I'll try to be more clear:
It is very cool and not a trick and the results are fantastic if you got out exactly what you wanted. Amazing time saver. And if not? Right now this is totally hit or miss.
It would also take hours for a human to reproduce a Vermeer, and this no doubt has those in its training set and would style-transfer onto a corgi instantly. Certainly faster than Vermeer himself could do it.
But Vermeer could explain how he came up with the style, his techniques, choices, etc.
It reads like the advance here is that it will usually synthesize something that looks great but not always the thing that you want. With no recourse.
> Of course there is a superimposing step, that just means it adds its layer on top of the photo you provide. That's all it means and that's literally what it is doing, that's all I tried to say, heh.
It is not doing this. You are wrong. You are mistaken. You are confused. You do not understand what is happening.
(People have tried to tell you this several times, but you're not listening. shrug One more can't hurt.)
I am specifically referring to the flamingo example: "DALL·E 2 can make realistic edits to existing images from a natural language caption."
You provide the background image and a text prompt and it doodles on top of the image you provided as per their demonstration. I wasn't referring to the other examples down the page where it conjures up a brand new image from scratch based on your image input.
It is great that you can tell it to add a flamingo and it fits into the background you provide nicely due to the well tuned style transfer. That part is cool. And it is impressive that sometimes the flamingo it adds is reflected in the water. But sometimes it isn't reflected. And it isn't up to you, it is up to it. And you can't tell it to add a reflection as a discrete step.
Look more carefully. This is more akin to a clipart finder, except if the clipart doesn't exist it uses the most similar thing in its training set to what it guesses you want as a starting point to synthesize new clipart from.
It doesn't add it in like an artist would and you can't control it at all. I don't know how to better express this.
This isn't unimpressive or un-useful but not quite as mind blowing on second glance.
Or am I in denial about how impressive this all really is by reading something slightly different into the static hand selected examples openai teased us with? :)
I'm sure two more papers down the line this thing will do what the true believers are convinced it already does perfectly much more seamlessly if they solve for my new favorite term, panoptic segmentation.
The link was for analogy, like religious people who can't accept science still try to find "gaps" where science can't explain something so they can imply God is doing it.
> But Vermeer could explain how he came up with the style, his techniques, choices, etc.
Often they can't. Ramanujan couldn't explain how he solved math problems, for instance, and humans can forget their own history easily, or even forget how to do something consciously while still doing it through muscle memory.
An ML model wouldn't forget the same way, but it could just lie to you.
Being opaque to human understanding is one of the downsides of existing AI/ML tech, for sure. Check out the video on the page, and notice how the images transition from random color blobs to increasing detail - that's showing you how the image is being generated. It's a continuous process of trying to satisfy a prediction; there are no discrete editing steps.
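That blobs-to-detail transition is what iterative denoising (diffusion) looks like. Here's a toy DDPM-style sampling loop to give a feel for it - `predict_noise` is a dummy standing in for the trained network, and the real schedule and conditioning are more involved:

    import numpy as np

    rng = np.random.default_rng(0)
    T = 1000
    betas = np.linspace(1e-4, 0.02, T)          # noise schedule
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)

    def predict_noise(x, t):
        # Dummy stand-in for the trained network (a large U-Net conditioned on
        # the timestep and on the text/image embedding).
        return rng.normal(size=x.shape)

    x = rng.normal(size=(64, 64, 3))            # start from pure noise
    for t in reversed(range(T)):
        eps = predict_noise(x, t)               # "which part of this is noise?"
        x = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:                               # re-inject a little noise on all but the last step
            x += np.sqrt(betas[t]) * rng.normal(size=x.shape)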
The kind of tech you're imagining, where the computer has semantic understanding of what's in the picture, and is reproducing something based on a 3D scene, knowledge of physics, materials, etc is probably decades away. In that sense yes, this is just a 'trick'.
While the whole narrative of your comment totally makes sense, I don't really see the difference between the two approaches, not on a conceptual level. You still needed to train this so called "prior" at some point (so, I'm also not sure if it's fair to call it a "prior"). I mean, the difference between your two descriptions seems to be the difference between descriptions (i.e., how you chose to name individual parts of the system), not the systems.
I'm not sure if I'm speaking clearly, I just don't understand what the difference is between training "text encoding to an image" vs "text embedding to image embedding". In both cases you have some kind of "sunset" (even though it's obviously just a dot in a multi-dimensional space, not the letters) on the left, and you try to maximize it when training the model to get either an image embedding or an image straight away.
Yeah, my comment didn't really do a good job of making clear that distinction. Obviously the details are pretty technical, but maybe I can give a high-level explanation.
The previous systems I was talking about work something like this: "Try to find me the image that looks like it most matches 'a picture of a sunset'. Do this by repeatedly updating your image to make it look more and more like a sunset." Well, what looks more like a sunset? Two sunsets! Three sunsets! But this is not normally the way images are produced - if you hire an artist to make you a picture of a bear, they don't endeavor to create the most "bear" image possible.
Instead, what an artist might do is envision a bear in their head (this is loosely the job of the 'prior' - a name I agree is confusing), and then draw that particular bear image.
But why is this any different? Who cares if the vector I'm trying to draw is a 'text encoding' or an 'image encoding'? Like you say, it's all just vectors.
Take this answer with a big grain of salt, because this is just my personal intuitive understanding, but here's what I think: These encodings are produced by CLIP. CLIP has a text encoder and an image encoder. During training, you give it a text caption and a corresponding image, it encodes both, and tries to make the two encodings close. But there are many images which might accompany the caption "a picture of a bear". And conversely there are many captions which might accompany any given picture.
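(As I understand it, "make the two encodings close" is a contrastive objective over a batch of caption/image pairs - roughly like this simplified sketch, not OpenAI's actual training code.)

    import torch
    import torch.nn.functional as F

    def clip_contrastive_loss(text_embs, image_embs, temperature=0.07):
        # text_embs, image_embs: (batch, dim); row i of each comes from the same caption/image pair
        text_embs = F.normalize(text_embs, dim=-1)
        image_embs = F.normalize(image_embs, dim=-1)
        logits = text_embs @ image_embs.T / temperature    # pairwise similarities
        targets = torch.arange(len(text_embs))             # the i-th caption matches the i-th image
        # Pull matching pairs together, push the rest of the batch apart, in both directions.
        return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2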
So the text encoding of "a picture of a bear" isn't really a good target - it sort of represents an amalgamation of all the possible bear pictures. It's better to pick one bear picture (i.e. generate one image embedding that we think matches the text embedding), and then just to try to draw that. Doing it this way, we aren't just trying to find the maximum bear picture - which probably doesn't even look like a realistic natural image.
Like I said, this is just my personal intuition, and may very well be a load of crap.
A bit more detail is that CLIP isn't designed to directly solve "is this a bear" aka "does this image match 'bear'". It's designed to do comparisons, like "which of images A and B is more like 'bear'". So it doesn't have a concept of absolute bear-ness.
OpenAI had no idea it could be used to generate images itself, which is why they left in issues like how it thinks an apple and the word "apple" written on a piece of paper are the same thing. Probably wouldn't have released it if they did know.
This isn't something I'm knowledgeable on so forgive my simplification, but is this like a sort of microservices for AI? Each AI takes its turn handling some aspect, and another sort of mediates among them?
I'd say Dall-E 2 is a little more unified - they do have multiple networks, but they're trained to work together. The previous approaches I was talking about are a lot more like the microservices analogy. Someone published a model (called CLIP) that can say "how much does this image look like a sunset". Someone else published a totally different model (e.g. VQGAN) that can generate images (but with no way to provide text prompts). A third person figures out a clever way to link the two up - have the VQGAN make an image, ask CLIP how much it looks like a sunset, and use backpropagation to adjust the image a little, repeat until you have a sunset. Each component is its own thing, and VQGAN and CLIP don't know anything about one another.
VQGAN (being a "GAN") is already two networks - one Generates things, and the other is Adversarial and judges if the other network is good enough, then you train them both at once and they fight.
CLIP+VQGAN generation IIRC works by replacing the adversarial network with CLIP, so it understands text prompts, then retraining it for a while towards the prompted target, then generating whatever it's learned from that.
I think that in CLIP+VQGAN, the VQGAN model is frozen, and what you do is start from a random latent code, generate an image, pass it to CLIP, and then backprop through CLIP and through the VQGAN generator to figure out how you should move the latent code to make it better match the prompt. Then you just keep taking gradient ascent steps to find better and better latent codes. So it's like 'retraining', except you're 'training' the network input rather than the network weights.
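In sketch form it's something like this (PyTorch-ish; `vqgan_generator` and `clip_similarity` are hypothetical stand-ins for the frozen pretrained models):

    import torch

    # Hypothetical stand-ins for the frozen pretrained models:
    #   vqgan_generator(z) -> image tensor; clip_similarity(image, prompt) -> scalar score
    z = torch.randn(1, 256, requires_grad=True)     # the latent code is the thing being "trained"
    optimizer = torch.optim.Adam([z], lr=0.05)

    for step in range(500):
        image = vqgan_generator(z)                  # frozen weights; gradients still flow through
        score = clip_similarity(image, "a picture of a sunset")
        loss = -score                               # minimizing -score == gradient ascent on the match
        optimizer.zero_grad()
        loss.backward()                             # backprop through CLIP and the generator...
        optimizer.step()                            # ...but only z gets updated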
Makes sense to me as far as avoiding a sort of maximized sunset that is always there and is SUNSET rather than a nice sunset... but also avoiding watering it down and getting a way too subtle sunset.
It's not AI, but I've been watching some folks solving / trying to solve some routing (vehicles) problems and you get the "this looks like it was maximized for X" kind of solution, but that's maybe not what is important / customer perception is unpredictable. I kinda want to just come up with 3 solutions and let someone randomly click... in fact I see some software do that at times.
Yeah, I think the trick is that when you ask for "a picture of a sunset", you're really asking for "a picture of a sunset that looks like a realistic natural image and obeys the laws of reality and is consistent with all of the other tacit expectations a human has for an image". And so if you just go all in on "a picture of a sunset", you often end up with what a human would describe as "a picture of what an AI thinks a sunset is".
Maybe very very short (single-gene) sequences. The thing with DNA is it's the product of evolution. The DNA guides the synthesis of proteins, then the proteins fold into a 3D shape, and they interact with chemicals in their environment based on their shape.
In the context of a living being, different genes interact with each other as well. For example, you have certain cells that secrete hormones (many genes needed to do that), then you have genes that encode for hormone receptors, and those receptors trigger other actions encoded by other genes. There's probably too much complexity to ask an AI system to synthesize the entire genetic code for a living being. That would be kind of like if I asked you to draw the exact blueprints for a fighter jet, and write all the code, and synthesize all the hardware all at once, and you only get one shot. You would likely fail to predict some of the interactions and the resulting system wouldn't work. You could only achieve this through an iterative process that would involve years of extensive testing.
Could you use a deep learning system to synthesize genetic code? Maybe just single genes that do fairly basic things, and you would need a massive dataset. Hard to say what that would look like. Is it really enough to textually describe what a gene does?
This is all true, but it doesn't preclude the possibility of generating DNA. Humans share a lot of DNA sequences with other animals, and the genetic differences between individual humans are even smaller. You might have trouble generating a human with horns or something, but a taller one is probably mostly an engineering problem.
What GPT-3 and DALL-E show is that you can infer a lot based on the latent structure of data, even without understanding the underlying physical process.
Deep learning is probably not the right tool to generate a taller human. We've mapped the human genome. You could probably create a statistical model that pretty accurately maps different versions of genes to height. Then it would mostly be a question of swapping different versions of genes to get the result you want. With a statistical model, you would need a relatively small dataset (hundreds, or thousands of human genomes), and you wouldn't have to worry about errors being introduced.
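As a toy example of what that kind of statistical model could look like (entirely synthetic data, nowhere near a real GWAS pipeline):

    import numpy as np
    from sklearn.linear_model import Ridge

    rng = np.random.default_rng(0)

    # Toy data: 1,000 genomes, 200 variants coded as 0/1/2 copies of the alternate allele.
    genotypes = rng.integers(0, 3, size=(1000, 200))
    true_effects = rng.normal(0, 0.5, size=200)              # each variant nudges height a little
    heights_cm = 170 + genotypes @ true_effects + rng.normal(0, 5, size=1000)

    model = Ridge(alpha=1.0).fit(genotypes, heights_cm)

    # Predicted height for a new genome = baseline + the sum of its variants' estimated effects.
    new_genome = rng.integers(0, 3, size=(1, 200))
    print(model.predict(new_genome))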
probabilistic generative models have been applied to DNA and protein sequences for decades (my undergrad thesis from ~30 years ago did this and it wasn't even new at that point). The real question is what question you want to answer and what is this system going to do better enough to justify the time investment to prove it out?
>We’ve limited the ability for DALL·E 2 to generate ... adult images.
I think that using something like this for porn could potentially offer the biggest benefit to society. So much has been said about how this industry exploits young and vulnerable models. Cheap autogenerated images (and in the future videos) would pretty much remove the demand for human models and eliminate the related suffering, no?
Depends whether you think models should be able to generate cp.
It's almost impossible to even give an affirmative answer to that question without making yourself a target. And as much as I err on the side of creator freedom, I find myself shying away from saying yes without qualifications.
And if you don't allow cp, then by definition you require some censoring. At that point it's just a matter of where you censor, not whether. OpenAI has gone as far as possible on the censorship, reducing the impact of the model to "something that can make people smile." But it's sort of hard to blame them, if they want to focus on making models rather than fighting political battles.
One could imagine a cyberpunk future where seedy AI cp images are swapped in an AR universe, generated by models run by underground hackers that scrounge together what resources they can to power the behemoth models that they stole via hacks. Probably worth a short story at least.
You could make the argument that we have fine laws around porn right now, and that we should simply follow those. But it's not clear that AI generated imagery can be illegal at all. The question will only become more pressing with time, and society has to solve it before it can address the holistic concerns you point out.
OpenAI ain't gonna fight that fight, so it's up to EleutherAI or someone else. But whoever fights it in the affirmative will probably be vilified, so it'd require an impressive level of selflessness.
I don't think it's necessarily certain villainy for those who fight that fight as long as they are fighting it correctly.
There's a huge case to be made that flooding the darknet with AI generated CP reduces the revictimization of those in authentic CP images, and would cut down on the motivating factors to produce authentic CP (for which original production is often a requirement to join CP distribution rings).
As well, I have wondered for a long time how the development of AI generated CP could be used in treatment settings, such as (a) providing access to victimless images in exchange for registration and undergoing treatment, and (b) exploring if possible to manipulate generated images over time to gradually "age up" attraction, such as learning what characteristics are being selected for and aging the others until you end up with someone attracted to youthful faces on adult bodies or adult faces on bodies with smaller sexual characteristics, etc - ideally finding a middle ground that allows for rewiring attraction to a point they can find fulfilling partnerships with consenting adults/sex workers.
As a society we largely just sweep the existence of pedophiles under the rug, and that certainly hasn't helped protect people - nearly one in four are victims of sexual abuse before adulthood, and that tracks with my own social circle.
Maybe it's time to all grow up and recognize it as a systemic social issue for which new and novel approaches may be necessary, and AI seems like a tool with very high potential for doing just that while reducing harm on victims in broad swaths.
I'd not be that happy with an 8chan AI just spitting out CP images, but I'd be very happy with groups currently working on the issue from a treatment or victim-focus having the ability to change the script however they can with the availability of victimless CP content.
Especially the part about maybe generating specifically tailored material to "train" folks. Although, while obviously moral instead of immoral like "gay conversion therapy", I wonder if it would be just as ineffective.
> and would cut down on the motivating factors to produce authentic CP (for which original production is often a requirement to join CP distribution rings).
Hmmmmm. Will machine-generated "normal" (i.e., non-CP) porn really eliminate the motivating factors to produce normal porn?
I obviously can't speak for enjoyers of CP. But when watching normal porn, I think part of the thrill for many/most people is knowing that what's happening is real.
Another potential risk is that a flood of publicly available, machine-generated CP might actually help the producers and distributors of real CP by serving as camouflage. Finding and prosecuting the people who make real CP is difficult enough already. Now, imagine if the good guys couldn't even reliably tell what was real and there were 100000x as many fake images as real ones floating around.
> But when watching normal porn, I think part of the thrill for many/most people is knowing that what's happening is real.
I'm wondering how true that is.
Obviously, lots of people consume hentai, and platforms like Danbooru are immensely popular.
Also, speaking personally... some of the porn that I've consumed that felt the most "real" was 3D animations where the only real humans behind them were the SFM artists (and voice actors). These artists felt free to do scenes with, like, actual cinematography, with flirting and teasing and emotions between the characters, of a kind you never see even in softcore live-action porn.
So I do wonder how much potential AI generation has for completely substituting large parts of the porn industry.
> Finding and prosecuting the people who make real CP is difficult enough already.
let's assume that AI generated CP should be illegal. Does it mean that possession of a model that is able to generate such content should also be illegal? If not, then it's easy to just generate content on the fly and not store anything illegal. But if we make the model illegal, then how do you enforce that? Models are versatile enough to generate a lot of different content, so how do you decide whether the ability to generate illegal content is just a byproduct or the purpose of that model?
>> Finding and prosecuting the people who make real CP is difficult enough already.
> let's assume that AI generated CP should be illegal
Well that's a big assumption, lol. I definitely agree that it would be impossible to enforce, for the reasons you say.
I personally would not be in favor of such a law at all. Partially because it's unenforceable as you say, and partially on principle.
The argument against real CP is extremely clear: we deem it abominable because it harms children. That doesn't apply to computer-generated CP, or the models/tools used to produce it.
I think you might be able to argue AI generated CP could cause indirect harm by feeding those desires and making people more likely to act on them, but I agree that's a far more fragile argument.
I think there's a big range of possibilities there and they're not mutually exclusive.
There's the possibility that watching FOO directly encourages viewers to do FOO in real life. Like you said, this is the most fragile. I think clearly this is true in some cases -- most of us have seen a food commercial on TV and thought, "I could really go for that right now." I'm less convinced that it's true for something like pedophilia: the average person will be revolted by it, not encouraged, unless they already are into that kind of awful thing.
There's the possibility that watching FOO doesn't directly encourage viewers to do FOO, but serves to kind of normalize it. I think this happens a lot, but I think it takes a carefully crafted context and message.
There's the possibility that AI generated CP could actually help children, by providing a safe outlet for pedophiles so that they wouldn't need to do heinous shit in real life. I recall reading studies that instances of (adult) rape in societies were inversely correlated with the availability of (adult) pornography, with a possible explanation being that porn provided a safe outlet for people who weren't getting the kind of sex they wanted.
Most people are not developers and most people don't provide SaaS products. They are only consumers of existing technology.
In that sense, instead of enforcing the non-existence of models, the enforcement could just make it illegal to provide any service that processes inputs or produces outputs that are CP-like, e.g. by obligating people with the models to add filters on input and/or after the result is generated but before it is displayed or returned from computation.
I am assuming that any adult reading this understands that professional porn is quite different from the sex most of us experience in our private lives in a number of major ways, both emotionally and physically.[1]
But anyway, yes. By "real" I mean "real human beings, having real sex."
----
[1] There is a lot of homemade, amateur porn on the big well-known porn sites and it seems quite popular, and much of that is closer to what typical folks do at home. But that's beside the point.
> exploring if possible to manipulate generated images over time to gradually "age up" attraction
If people already accepted that they need help, there are many good ways to treat people with unwanted sexual obsessions (trying to choose my words carefully here). I honestly don't think that it would help them to serve them more content.
However, I'd love to see some research to explore the possibility of involving machine generated content in psychological treatment. The core of your idea is IMHO brilliant.
How do you suppose your CP generator will be trained without using authentic CP images? Not only will that require revictimization but you’ll also be downloading CP to train the model.
There are so many excellent, thought-provoking comments in this thread, but yours caught me especially. Something that came to mind immediately upon reading the release was the potential for this technology to transform literature, adding AI generated imagery to turn any novel into a visual novel as a premium way to experience the story, something akin to composing D-Box seat response to a modern movie. I was imagining telling the cyberpunk future story you were elaborating, which is really compelling, in such a way and couldn't help but smile.
In the same theme, I liked the comments of both of you.
Another use case could be to make it easier/ automatic to create comics. You tell what the background should be, characters should be doing and the dialogues. Boom, you have a good enough comic.
-----------
Reading as a medium has not evolved with technology. Creating the imagery happens in humans' minds. It's no surprise that some people enjoy doing that (and also enjoy watching that imagery) and others do not.
This could be a helping brain to create those imageries.
-----------
Now imagine, reading stories to your child. Actually, creating stories for your child. Where they are the characters in the stories. Having a visual element to it is definitely going to be a premium experience.
I can also imagine the magical nature of a child being able to make up a story (as children are wont to do) and having Dall-E here generating a picture book as they go.
I've thought for quite some time that questionable AI-generated content will lie at the heart of a forthcoming 'Infocalypse'. [0] Given the 2021 AI Dungeon fiasco over text-based AI-generated child porn, I shall posit that it's already upon us.
30 years since the original issue of encryption, it looks like cp trumps the other Horsemen of the Cypherpunk FAQ, with drug dealers and organized crime taking the back seat. It's interesting how misinformation is a recent development that they anticipate; a Google search shows that the term 'Infocalypse' was actually appropriated by discussions of deepfakes some time in mid-2020. That said, the crypto wars are here to stay—most recently with EARN IT reintroduced just two months ago.
The similar issue of 3D-printed guns has developed in parallel over the past decade as democratized manufacturing became a reality. There are even HN discussions tying all of these technologies together, by comparing attitudes towards the availability of Tor vs guns (e.g., [1]).
And there are innumerable related moral qualms to be had in the future; will the illegal drugs or weapons produced using matter replicators be AI-designed?
Overall, I think all of these issues revolve around the question of what it means to limit freedoms that we've only just invented, as technological advances enable things never before considered possible in legislation. (And as the parent comment implies, here's where the use of science fiction in considering the implications of the impossible comes in).
we already have a lot of fabricated content like that using current photo editing technology (Photoshop) and it's not causing many legal or moral issues
Relatedly, when checking for a related comment, I wanted to see what the current state of deep fakes progress was, so I went to the usual place where the bleeding edge for such things could be found.
First video clips were with the faces of your usual celebrities, but then suddenly I got "treated" to Greta Thunberg in the situations you might expect. I cut my exploration short.
Now, Greta Thunberg is actually 19 now (how time flies!), except that deep fake was most likely trained on her media appearances, which started when she was 15!
(I guess at least that she wasn't a child any more, which might explain why those clips had not been almost immediately flagged and removed?)
Religious people don't only believe that porn harms the models, but also the user. I happen to agree, despite being a porn user - porn is a form of simulated and not-real stimulation. Porn is harmful to the user the same way that any form of delusion is: it associates positive pleasure with stimulation that does not fulfil any basic or even higher-level needs, and is unsustainable. Porn is somewhere on the same scale as wireheading[1]
That doesn't mean that it's all bad, and that there's no recreational use for it. We have limits on the availability of various other artificial stimulants. We should continue to have limits on the availability of porn. Where to draw that line is a real debate.
Iain Banks' "Surface Detail" would like to have a word with you.
This author's books are great at putting these sorts of moral ideas to the test in a sci-fi context. This specific tome portrays virtual wars and virtual "hells". The hope is of being more civilized than by waging real war or torturing real living entities. However, some protagonists argue that virtual life is indistinguishable from real life, and so sacrificing virtual entities to save "real" ones is a fallacy.
If people are exposed to stimuli, they will pursue increasingly stimulating versions of it. I.e., if they see artificial CP, they will often begin to become desensitized (habituated) and pursue real CP or even live children thereafter.
Conversely, if people are not exposed to certain stimuli, they will never be able to conceptualize them, and thus will be unable to think about them.
Obviously you cannot eliminate all CP but minimizing the overall levels of exposure / ease of access to these kinds of things is way more appropriate than maximizing it.
> If people are exposed to stimuli, they will pursue increasingly stimulating versions of it.
This is not true in any kind of universal way.
If you enjoy car chases in movies, does that mean you're going to require more and more intense chase scenes, and then consume real-life crash footage, and ultimately progress to doing your own daredevil driving stunts in real life?
No, because at some point it's "enough."
Same with... literally anything we enjoy. Did you enjoy your lunch? Did you compulsively feel the need to work up to crazier and crazier lunches?
What about sex? Have you had sex? Do you feel the need to seek out crazier and crazier versions of it?
> What about sex? Have you had sex? Do you feel the need to seek out crazier and crazier versions of it?
For porn and sex it's different though. Some people are attracted to things that are deviant and taboo. That's the part they're looking for. As pornography has become more widely accepted, a market has developed for more and more extreme forms of it. This has been documented. It's not the content per-se but rather the nature of it that is found attractive. So the idea is to find a line that's reasonable so the people that feel the need to get close to that line can have that urge fulfilled without damaging society.
A market will form for more and more extreme content as soon as the line moves and what was once taboo no longer is. An Overton window of sorts for pornography.
There seems to be a small issue in GP's logical inference to me, in that he places artificial CP as a proportional and wholly inferior replacement for real CP. As if ham sandwiches and boiled sausages are _inferior_ replacements for blocks of body parts of animals on a dish.
I don't think this is the case, from anecdotal experience; Hollywood chase scenes are much more exciting to me than real-life crash footage, and I've watched enough. They need cooking, and if you are cooking anyway, mixing artificial and "natural" ingredients can even be more of a problem than a positive.
> If people are exposed to stimuli, they will pursue increasingly stimulating versions of it. I.e., if they see artificial CP, they will often begin to become desensitized (habituated) and pursue real CP or even live children thereafter.
I have accumulated tens of thousands of headshots in video games but have yet to ever shoot a single real person in the face. More importantly, I have never had the urge to seek out same.
I am not sure that your initial premise has any truth to it.
The point is more "can you conceive of a headshot before you've ever witnessed one?" And the assertion is, no.
I should be explicit -- I am saying the exposure which makes one seek stimulus is merely a catalyst for deeper urges, not a generator of them as such. A certain level of inhibition (e.g. sociopathy) is required but IMO so is a prior conception of the deed.
In your example, if someone is predisposed to wanting to shoot actual people in the head, exposing them to video game headshots may distract in the short term but desensitizes and entrenches the image in the long term, possibly making it easier to decide to pull the trigger later on if they are sufficiently inhibited of social concerns. This does not happen for people with high inhibitions, or at least sufficient self-control.
> The point is more "can you conceive of a headshot before you've ever witnessed one?" And the assertion is, no.
I'm not sure that's true. Our brains can imagine a lot that we've never seen, though maybe not very accurately. Inventors and developers and artists do it all the time, if we are talking about the same thing.
I'm not sure that disproves your premise. Virtual experiences may make real ones easier, but some research and details about where it works, where it doesn't, would be helpful. Many training programs use virtual experiences, such as flight simulators.
They can by definition not perceive a headshot, it is a visual thing. I'm not sure what point you're trying to make here, the difference is not germane to the conversation.
I'm not sure I agree with the statement, you're putting forth a lot of assertions without the actual quantitative data to back up what you're saying, and even though you think it sounds intuitive that doesn't necessarily make it valid.
I'd actually argue the reverse, I think you see a lot more effort towards acquiring things that are illegal than you would otherwise.
It's documented well already. The Overton window for pornography has continued to move to more and more extreme forms as what was once considered unacceptable and taboo becomes socially acceptable. It's because there is a market for deviance. Some people are interested in what's taboo and off limits and so long as they are approaching or just crossing that line, they're happy. As we've moved that line these people are no longer happy with the status quo and want content that is taboo, so a new market forms around that.
Pornographers know this and talk about it. Read David Foster Wallace's essay on it.
Wow, I didn't even think of this, that people could use this for something so horrifying. I'm relieved that the geniuses behind this seem so smart that they even thought of this too and prohibit using the AI for sexual images.
> Our content policy does not allow users to generate violent, adult, or political content, among other categories. We won’t generate images if our filters identify text prompts and image uploads that may violate our policies. We also have automated and human monitoring systems to guard against misuse.
This is arguably the most insipid and stupid crippling of a powerful tool for content creation I can think of. It’s worse than the adobe updates using every cpu core and locking up my machine once a week.
What counts as “political” hm? Want it to look like that Obama poster or perhaps you want a Soviet Union flag for your retro 80s punk… oops sorry “political”… let’s go to adult… hmm that’s even dumber is the model showing too much ankle? What about the obvious fact that this is just designed with a heterodoxy view of pornography and likely does nothing to stem the wildly various fetishes and other sexual proclivities that exist in the world…
It is effectively “we got squeamish and have done a bunch of stuff to stop you doing stuff that makes us squeamish, please don’t make us squeamish, we’re so worried we’re even checking for it in case you sneak something past us”…
They should comply with the law, try to prevent and also check for child porn… but otherwise just let users use the damn tool, if someone wants an Obama hope poster of a sexualised Mussolini jerking off onto a balloon animal… why the heck do they feel the need to say no to that. It’s a deeply repressive instinct that should be fought against whenever people start to “police” what is acceptable in artistic mediums.
I look forward to the reimplemented versions of this from efforts like EleutherAI and others.
Nonsense, I think the opposite is true where if you can satisfy your urges in a way that doesn’t put you in jail for a decade, most people will take that route.
I suspect that if a free version of this comes out and allows adult image generation, 90% of what it will be used for is adult stuff (see the kerfuffle with AIDungeon).
I can get why the people who worked hard on it and spent money building it don't want to be associated with porn.
> I can get why the people who worked hard on it and spent money building it don't want to be associated with porn.
Why? Is there something inherently wrong with porn? Is it not noble to supply a base human need, or is that judgment based on some arbitrary cultural artifact that you possess?
The problem might be that people are simply lying. Their real reasons are religious/ideological, but they cite humanitarian concerns (which their own religious stigma is partly responsible for).
> Their real reasons are religious/ideological, but they cite humanitarian concerns
Are you asserting that nobody has humanitarian concerns? If so, that's quite a statement; what basis is there? I've seen so many humanitarian acts, big and small, that I can't begin to count. I've seen them today. I hear people express humanitarian beliefs and feelings all the time. I do them and have them myself. Maybe I misunderstand.
It'd be ironic if we ended up destroying our planet by using so much electricity to train models to generate a maximally optimal version of the type of content that you refer to similar to crypto mining.
I'm not picking on the commenter - by itself it's not a big deal - but look at the assumptions behind that comment, which I almost didn't notice on HN.
Yeah you will. It’s not going to be very good at reproduction of the same exact thing each time. In some of the examples you see the textures changing wildly and it’s a classic problem with these models. The same input does not generate the same output, so it will be obvious that it’s generated when you can’t get the “model” to look the same between two photos in the same “photo shoot”
When you put it that way… yes since no one is hurt in the process and people with pedophilic conditions may be deterred from doing something in real life.
* Unlike GPT-3, my read of this announcement is that OpenAI does not intend to commercialize it, and that access to the waitlist is indeed more for testing its limits (and as noted, commercializing it would make it much more likely to lead to interesting legal precedent). Per the docs, access is very explicitly limited: (https://github.com/openai/dalle-2-preview/blob/main/system-c... )
* A few months ago, OpenAI released GLIDE ( https://github.com/openai/glide-text2im ) which uses a similar approach to AI image generation, but suspiciously never received a fun blog post like this one. The reason for that in retrospect may be "because we made it obsolete."
* The images in the announcement are still cherry-picked, which is therefore a good reason why they tested DALL-E 1 vs. DALL-E 2 presumably on non-cherrypicked images.
* Cherry-picking is relevant because AI image generation is still slow unless you do real shenanigans that likely compromise image quality, although OpenAI likely has better infra for handling large models, as they have demonstrated with GPT-3.
Regarding cherry-picking, the images of astronauts on horses look stunning, except for their hands. There's something seriously wrong with their hands.
Maybe give it another five years, a few more $billion and a few more petabytes/flops and it will be good. Then finally everyone can generate art for their own Magic: the Gathering cards.
As I keep telling people: "hands are hard". This is why I went so far as to make a hand-specific dataset ("PALM" https://www.gwern.net/Crops#palm which of course now everyone is going to confuse with 'PaLM'...). Hands are just way too variable to learn easily.
My dataset is a start, but it may benefit from focused training, the way Facebook's new Make-A-Scene https://arxiv.org/abs/2203.13131#facebook (not DALL-E 2 quality but not far from it) has focused losses on faces.
Interestingly, hands are also something humans struggle to draw.
They're a very complex anatomical form, with many small tendons and muscles. Many artists struggle to depict hands. They're not made out of a few straight lines like a torso; there's lots of skew going on. They're probably the hardest structure of the human body to 'learn' for an ML system.
I think some of this is because hands are very involved in both communication and threat assessment, so we as humans put a lot of automatic attention on them. We aren't even usually aware of it--unless something looks off.
Hands are notoriously hard to even photograph. You very quickly get weird unnatural results with a camera in front of hands, so in a way I'm not surprised AI models struggle to produce satisfying imagery there too.
The Risks and Limitations section is particularly interesting to me. It's like a time capsule of society's current fears about technology. They talk about many ways this tech could be misused, but I don't think they've even scratched the surface.
An example off the top of my head: this could be used as advertising or recruitment for controversial organizations or causes. Would it be wrong for the USA to use this for military recruitment? Israel? Ukraine? Russia?
Another example: this could be used to glorify and reinforce actions which our society does not consider to be immoral but other societies - or our own future society - will. It wasn't long ago that the US and Europe did a full 180 on their treatment of homosexuality. Will we eventually change our minds about eating meat, driving cars, etc.?
Have they gone too far in a desperate bid to prevent the AI from being capable of harm? Have they not gone far enough? I don't know. If I was that worried about something being misused, I don't think I could ever bring myself to work on it in the first place. But I suppose the onward march of technology is inevitable.
Katherine Crowson is at EleutherAI and IMHO is indisputably most responsible for the advances in text=>image generation. DALL-E 2 is DALL-E plus her insight to use diffusion; the intermediate proof of concept of diffusion + DALL-E is GLIDE.
She did invent the idea of applying it to image generation, leading to OpenAI citing her _tweets_ (how cool is that?) in the GLIDE paper, which, as other comments note, looks just like a proof of concept of DALL-E 2.
glid-3 and latent diffusion have both appeared recently and are getting remarkably close to the original DALL-E (maybe better, as I can't test the real thing...)
So - this was pretty good timing if OpenAI want to appear to be ahead of the pack. Of course I'd always pick a model I can actually use over a better one I'm not allowed to...
With GLIDE I think we've reached something of a plateau in terms of architecture on the "text to image generator S curve". DALL-E 2 is a very similar architecture to GLIDE and has some notable downsides (poorer language understanding).
glid-3 is a relatively small model trained by a single guy on his workstation (aka me) so it's not going to be as good. It's also not fully baked yet so ymmv, although it really depends on the prompt. The new latent diffusion model is really amazing though and is much closer to DALLE-2 for 256px images.
I think the open source community will rapidly catch up with Openai in the coming months. The data, code and compute are all there to train a model of similar size and quality.
glid-3 is trained specifically on photographic-style images, and is a bit better at generalization compared to the latent diffusion model.
eg. prompt: half human half Eiffel tower. A human Eiffel tower hybrid (I get mostly normal Eiffel towers from LDM but some sensical results from glid-3)
glid-3 will be worse for things that require detailed recall, like a specific person.
With smaller models you kind of have to generate a lot of samples and pick out the best ones.
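In case it's useful, the usual way to automate that "generate a lot and pick the best" step is to rerank candidates with CLIP. A rough sketch using OpenAI's open-source clip package (the sampler that produces the candidate images is whatever generator you have; it's not part of this snippet):

    import torch
    import clip  # pip install git+https://github.com/openai/CLIP.git

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess = clip.load("ViT-B/32", device=device)

    def rerank(prompt, images, top_k=4):
        # Score each candidate (a list of PIL images) against the prompt
        # and keep the highest-scoring ones.
        text = clip.tokenize([prompt]).to(device)
        batch = torch.stack([preprocess(im) for im in images]).to(device)
        with torch.no_grad():
            image_features = model.encode_image(batch)
            text_features = model.encode_text(text)
            image_features = image_features / image_features.norm(dim=-1, keepdim=True)
            text_features = text_features / text_features.norm(dim=-1, keepdim=True)
            scores = (image_features @ text_features.T).squeeze(1)
        best = scores.topk(min(top_k, len(images))).indices.tolist()
        return [images[i] for i in best]

    # images = [my_sampler(prompt) for _ in range(64)]  # any generator you like
    # keepers = rerank(prompt, images)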
They're also not censored on the dataset front and thus produce much more interesting outputs.
OpenAI has a low resolution checkpoint for similar functionality as this - called GLIDE - and the output is super boring compared to community driven efforts, in large part because of similar dataset restrictions as this likely has been subjected to.
A friend of mine was studying graphic design, but became disillusioned and decided to switch to frontend programming after he graduated. His thesis advisor said he should be cautious, because automation/AI will soon take the jobs of programmers, implying that graphic design is a safer bet in this regard. Looks like his advisor is a few years from being proven horribly wrong.
I have degrees and several years of experience in both fields, and I can tell you that both are creative professions where output is unbounded and the measure of success is subjective; these are the fields that will be safe for a while. IMO it's people in fields such as aircraft piloting who should be most worried.
Pilots are not there to fly the aircraft, the autopilot already does that. They are there to command the aircraft, in a pair in case one is incapacitated, making the best decisions for the people on board, and to troubleshoot issues when the worst happens.
No AI or remote pilot is going to help when say... the aircraft loses all power. Or the airport has been taken over in a coup attempt and the pilot has to decide whether to escape or stay https://m.youtube.com/watch?v=NcztK6VWadQ
You can bet on major flights having two commercial pilots right up until the day we all get turned into paperclips.
>You can bet on major flights having two commercial pilots right up until the day we all get turned into paperclips.
Yes, this is the sane approach, since a jet represents an enormous amount of energy that can be directed anywhere in the world (just about). But that said, there seems to be enormous pressure to allow driverless vehicles, which also direct large amounts of energy anywhere in your city. IOW it seems like a matter of time before we say, collectively, screw it, let the computers fly the plane, and if loss of power is a catastrophe, so be it.
It's not as safe as you believe it to be. In the case of total electrical power failure in a fly-by-wire airliner, and the corresponding loss of hydraulic pressure, there's very little that a pilot can do at that point.
As far as the extremely unlikely hostage situation goes, if the aircraft were AI controlled, people would be even less likely to attempt to hijack it in the first place, since there wouldn't be a human element, a.k.a. a pilot, whose emotions they could appeal to.
I would agree that a bit more is required of pilots, but similar to truck drivers, the skill required, and hence the salary provided, will go down as the AI gets better and better.
I can easily imagine that at some point, pilots are replaced with technicians who are just there to fix redundant AI systems in case of failure.
You’re describing the world we already live in. “technicians who are just there to fix redundant AI systems in case of failure” is one of the jobs of a modern pilot. It turns out that troubleshooting the redundant systems of a modern aircraft while it is in flight is also the hardest part of being a pilot, as it requires knowing every system inside and out, hence why no amount of automation will threaten their jobs.
Interesting. Right now these ML models seem like essentially ideal sources of "hotel art" particularly because it's so subjective... you only need a human (the buyer!) to just briefly filter some candidates, which they would have been doing with an artist in the loop in any case.
For things like aircraft pilots, it's both realtime (which means a 'reviewer' per output; you haven't taken a highly trained pilot out of the loop even if you've relegated them to supervising the computer) and life critical, so merely "so-so" isn't good enough.
If this paper presents this neural net fairly, it pretty much destroys the market for illustrators. Most of the time when an illustration is needed, it's described like "an astronaut on a horse in the style of xyz".
You're describing the market for low end commodified illustration. e.g.: cheapest bidder contracts on Upwork or similar 'gig work' services.
In practice in illustration (as in all arts) there are a variety of markets where different levels of talent, originality, reputation and creative engagement with the brief are more relevant. For editorial illustration, it's certainly not a case of 'find me someone who can draw X', and probably hasn't been since printing presses got good enough to print photographs.
I'd argue that the market has already been destroyed at this point, at least in some areas. Book covers seem to have been stock image overlaid with text for a long time now, and a race to the bottom for both the people producing the stock images and the intern adding typography. By cutting costs and quality, the bar has been lowered to the point the task can be completely automated. Our AI overlords already have an advantage in that they have time to actually read the book, a potentially useful input. Maybe they won't even need the prompt - just generate an image for what is happening in the story for interesting looking paragraphs and let the author or editor pick. Given the cost cutting in publishing generally, editors will be next followed by the publishing houses themselves as the value they add gets lowered while the automation at Amazon gets better.
I agree. But I'm sure someone will create an ML model that gives what you need, not just what you asked for. Good enough for commercial purposes, mostly.
Personally, I would never buy a painting generated by an ML model, or even a commercial illustration, if I can help it. The artist and their life experience is half the point of art, IMO.
Yes. Translating business requirements, customer context, engineering constraints, etc. into usable, practical, functional code, and then maintaining that code and extending it is so far beyond the horizon that many other skillsets will be replaced before programming is. After all, at that point, the AI itself, if it's so smart, should be able to improve itself indefinitely. In which case we're fucked. Programming will be the last thing to be automated before the singularity.
Unlike artwork, precision and correctness are absolutely critical in coding.
The tail end of programming will be the last thing to be replaced, maybe. I don’t see why CRUD apps get to hide under the umbrella of programming ultra-advanced AI.
Let me know when you can speak English to a computer and have it generate CRUD code that satisfies all engineering and design constraints. The AI will need to be dynamic enough to understand nuance, missing gaps in the requirements spec, have context on the application being built, able to suggest improvements on product design, know how to make changes through the same conversational interface, etc.
Accomplishing that is achieving general AI.
In the meantime, there are plenty of boilerplate ORMs and simplistic API template tools that make production of bog standard CRUD apps dead simple. Of course, they all have their drawbacks and trade-offs, and aren't always suitable. But I don't see the amount of software engineering work reducing as a result of these no-code, low-code tools, do you?
Probably not. People tend to think that tasks that make us think hard require general intelligence just because that's the tool we use to solve that problem. The AI doesn't have to be very good to be able to replace CRUD web app developers (that is, most of us).
As I see it, the real challenge to solve is for it to be able to hold context and to communicate iteratively. Also, as you say, finding the missing gaps. That's important. Other than that, you tell it what you want, it creates something, and then you tell it to change things around. Which is, BTW, pretty similar to how it works with biological-life-based developers. Though as we're lazy, we like to clarify a lot of things up front (and either drive customers crazy or teach them that this is the way it works). If you have an AI that spits out code in a few minutes, it may not matter a lot.
Most of the programming jobs are indeed about making relatively simple stuff from standard components.
> Let me know when you can speak English to a computer and have it generate CRUD code that satisfies all engineering and design constraints. The AI will need to be dynamic enough to understand nuance, missing gaps in the requirements spec, have context on the application being built, able to suggest improvements on product design, know how to make changes through the same conversational interface, etc.
Let me know when you find a single programmer who can do that reliably.
Is it that hard to do? Just design a solution that uses Alexa voice services to parse the vocal input via NLP and then invoke a Lambda function to call a SageMaker or GPT-3 model to generate code. Granted, it will take a little while to be perfect, but are we really far from it?
Large chunks, yes, but all that means is that engineers will move up the abstraction stack and become more efficient, not that engineers will be replaced.
Machine code -> Assembly -> C -> higher-level languages -> AI-assisted higher-level languages
> engineers will move up the abstraction stack and become more efficient
Above a certain threshold of ability, yes.
The same will hold true for designers. DALL-E-alikes will be integrated with the Adobe suite.
The most cutting edge designers will speak 50 variations of their ideas into images, then use their hard-earned granular skills to fine-tune the results.
They'll (with no code) train models in completely new, unique-to-them styles--in 2D, 3D, and motion.
Organizations will pay top dollar for designers who can rapidly infuse their brands with eye-catching material in unprecedented volume. Imitators will create and follow YouTube tutorials.
Mom & pop shops will have higher fidelity marketing materials in half the time and half the cost.
History isn't a great guide here. Historically the abstractions that increased efficiency begat further complexity. Coding in Python elides over low-level issues but the complexity of how to arrange the primitives of python remains for the programmer to engage with. AI coding has the potential to elide over all the complexity that we identify as programming. I strongly suspect this time is different.
The space for "AI-assisted higher-level languages" sufficiently distinct from natural language is vanishingly small. Eventually you're just speaking natural language to the computer, which just about anyone can do (perhaps with some training).
The hard part of programming has always been gathering and specifying requirements, to the point where in many cases actually using natural language to do the second part has been abandoned in favor of vague descriptions that are operationalized through test cases and code.
AI that can write code from a natural language description doesn't help as much as you seem to think if natural language description is too hard to actually bother with when humans (who obviously benefit from having a natural language description) are writing the code.
Now, if the AI can actually interview stakeholders and come up with what the code needs to do...
But I am not convinced that is doable short of AGI (AI assistants that improve productivity of humans in that task, sure, but that expands the scope for economically viable automation projects rather than eliminating automators.)
At some point we will be "replaced". When you get AI to be able to navigate all user interfaces, communicate with other agents, plan long term and execute short term, we will no longer be the main drivers of economical growth.
At some point AI will become as powerful as companies.
And then AI will be able to sustain positive feedback loop of creating more powerful company like ecosystems that will create even more powerful ecosystems. This process will be fundamentally limited by available power and the sun can provide a lot of power. Eventually AI will be able to support space economy and then the only limit will be the universe.
Literally everyone on this website is in denial. They all approach it by asking which fields will be safe. No field is safe. “But it’s not going to happen for a long time.” Climate deniers say the same thing and you think they should be wearing the dunce hat? The average person complains bitterly about climate deniers who say that it’s “my grandkids problem lol” but when I corner the average person into admitting AI is a problem the universal response is that it’s a long way off. And that’s not even true! The drooling idiots are willing to tear down billionaires and governments and any institution whatsoever in order to protect economic equality and a high standard of living — they would destroy entire industries like a rampaging stampede of belligerent buffalos if it meant reducing carbon emissions a little but when it comes to the biggest threat to human well-being in history, there they are in the corner hitting themselves on their helmeted head with an inflatable hammer. Fucking. Brilliant.
I don't think anyone is in denial about this, it's just not something anyone should concern themselves with in the foreseeable future. AI that can replace a dev or designer is nowhere close to becoming a reality. Just because we have some cool demos that show some impressive capabilities in a narrow application does not mean we can extrapolate that capability to something that is many times more complex.
I agree. It bears repeating: where modern AI shines is where it does not matter to be precise, whereas programming absolutely _depends_ on being precise.
So, today some good AI applications are face detection, fingerprint detection, or generating art. Where you need to catch or generate the general gist of it without pixel precision.
Of course, programming might be under greater threat than we imagine. I can also not claim that anyone holding that position is just plain _wrong_. But I do believe that would take an AI breakthrough that is yet to happen. That breakthrough would also have absolutely crazy consequences beyond programming, because now we would have "exact AI" and the thought of that boggles my mind for sure.
I strongly and emphatically disagree. You frame it like we invented these AIs. Did we write the algorithms that actually run when it’s producing its output? Of course not; we can’t understand them, let alone write them. We just sift around until we find them. So obviously the situation lends itself to surprises. Every other year we get surprised by things that all the “experts” said were 50 years off or impossible; have you forgotten already?
This comment settles it for me. You’re thoroughly way too hyperbolic in your assessment. If this was closer to reality you’d have been able to state your case in clear, realistic terms. That’s something no one has been able to do so far.
I do deny it. Automation does not destroy jobs even if you're impressed at how good it is at painting; see "Luddite fallacy" and "lump of labor".
Claiming AIs are going to take over or destroy the world has been a basis of "AI safety" research since the 90s, but that isn't real research, it's a new religion run by Berkeley rationalists who read too many SF novels.
The assumption that automation creates (or at least does not destroy) jobs is an extrapolation from the past despite the fact that the nature of automation is constantly changing/evolving.
Also, one thing that everyone seems to ignore is that even if the number of jobs is not reduced, the skill/talent level required for those jobs may (and actually does) increase, and switching careers does not work for everyone. So you'll inevitably have people without a job, even if it's just that the job market is shifting.
But I argue that as automation reaches jobs with higher levels of sophistication, i.e. the jobs of more skilled workers, some people will simply be left out because their talent won't be enough to do any job that has not been automated.
I'm trying to understand your point, because I think I agree with you, but it's covered in so much hyperbole and invective I'm having a hard time getting there. Can you scale it back a little and explain to me what you mean? Something like: AI is going to replace jobs at such scale that our current job-based economic system will collapse?
Most people get stuck where you are. The fastest way possible to explain it is that it will bring rapid and fundamental change. You could say jobs or terminators, but focusing on the specifics is a red herring. It will change everything, and the probability of a good outcome is minuscule. It’s playing Russian roulette with the whole world, except rather than 1/6 odds of the bad outcome, it’s the good outcome that’s the one-in-trillions shot. The worst and stupidest thing we have ever done.
Just know it. Really think deeply about this important issue and try to understand it thoroughly so that you have a chance at converting others. Awareness precedes any preventative initiatives.
Algorithm space is large, and guess-checking through it takes a lot of effort even when it’s automated like now. It requires huge amounts of compute. And meaningful progress requires the combined effort of the entire world’s intellectual and compute resources. It sounds implausible at first, but this machine learning ecosystem is in fact subject to sanctions. There are extreme but plausible ways of reducing the stream of progress to a trickle. It just requires people to actually wake up to what’s happening.
I agree that many of us are not seeing the writing on the wall. It does give me some hope that folks like Andrew Yang are starting to pop up, spreading awareness about, and proposing solutions to the challenges we are soon to face.
Ignorance is bliss in this case, because this is even more unstoppable than climate change.
You thought climate change is hard to hold up?
Try holding up the invention of AI.
The whole world is going to have to change and some form of socialism/UBI will have to be accepted, however unpalatable.
I mean not really, even a layman non-artist can take a look at a generated picture from DALLE and determine if it meets some set of criteria from their clients.
But the reverse is not true, they won't be able to properly vet a piece of code generated by an AI since that will require technical expertise. (You could argue if the piece of code produced the requisite set of output that they would have some marginal level of confidence but they would never really know for sure without being able to understand the actual code)
For computer work, I think there will be two categories: work with localized complexity (ie: draw an image of a horse with a crayon) and work with unbounded complexity (adding a button to VAT accounting after several meetings and reading up on accounting rules).
For the first category, Dall-E 2 and Codex are promising but not there yet. It's not clear how long it'll take them to reach the point where you no longer need people. I'm guessing 2-4 years but the last bits can be the hardest.
As for the second category, we are not there yet. Self-driving cars/planes and lots of other automation will be here and mature way before an AI can read and communicate through emails, understand project scope, and then execute. Also, lots of harmonization will have to take place in the information we exchange: emails, docs, chats, code, etc... That is, unless the AI is able to open a browser and type in an address itself.
What people ALWAYS miss is that AI can augment people. This AI is still a tool, and, with it, designers and illustrators can churn out better images faster than before, even without using stock images.
It's important to note that we still need professionals to guarantee the quality of the output from AIs, including this one. As noted in their issue tracker, DALL-E has very specific limitations, but these can be easily solved by employing dedicated professionals, who are trained to tame the AI and properly finish the raw output.
So, if I were running OpenAI, I'd clearly be experimenting with how their AIs and humans interact, and build a training program around it for producing practical outputs. (Actually, I work in consumer robotics, and human adoption has been the biggest hurdle here. Thus, my claim.)
--
In the case of fine art, though, I don't think it'll get hit by this AI advancement. The biggest problem is that you simply can't get the exact image you want with this AI. Even humans cannot transfer visual information in verbal form without a significant loss of detail, and thus a loss of quality. It's the same with AI, but worse, because the AI relies on the bias in a specific set of training data, and it never truly understands the human context in it (at the current level of technology).
I think designers are becoming more valuable than ever. Designers can better help train the AI on what actually looks good, designers will (probably) always have a more intuitive understanding of UI/UX, designers can better implement the work the AI actually produces, and designers can coordinate designs across multiple different mediums and platforms.
Additionally, the rise of no-code development is just extending the functionality of designers. I didn't take design seriously (as a career choice) growing up because I didn't see a future in it, now it pays my bills and the demand for my services just grows by the day.
Similar argument to make with chess AI: it didn't make chess players obsolete, it made them stronger than ever.
> I think designers are becoming more valuable than ever.
Are all designers becoming more valuable or is a subset of really good ones going to reap the value increase and capture more of the previously available value?
Never made an argument for all designers. Obviously the talent pool for any field is finite, and the best of that talent rises to the top. Good designers are being compensated increasingly well, hence "designers are becoming more valuable than ever."
Bad designers are even being given better and better paying jobs as the top talent gets poached up quicker and quicker.
What job you have (theoretically) isn't due to "absolute advantage" but "comparative advantage".
In other words, if someone else is a better designer than you, that actually has nothing to do with if they're going to take your job. They may have something better to do. An ML model isn't a worker no matter how good it is at painting, so rather than having a job it has input resources (RTX 3090s, electricity, maintenance engineers) but the concept is still important.
This is a niche complaint, but I get frustrated at how imprecise OpenAI's papers are. When they describe the model architecture, it's never precise enough to reproduce exactly what they did. I mean, it pretty much never is in ML papers[0], but OpenAI's bigger products are worse than average at this. And it makes sense, since they're trying to be concise and still spend time on all the other important stuff besides methods, but it still frustrates me quite a bit.
[0] Which is why releasing your code is so beneficial.
I can see how this has the potential to disrupt the games industry. If you work on a AAA title, there is a small army of artists making 19 different types of leather armor. Or 87 images of car hubcaps.
Using something like this could really help automate or at least kickstart the more mundane parts of content creation. (At least when you are using high resolution, true color imagery.)
Yeah, there's a lot of 2D assets that this model would be great for (textures, materials, *maps, etc) that would definitely improve the asset-building process for game devs. I've already used VQGAN+CLIP for some low-res skill and item icons in hobby games and it seems things are only improving from here.
I wouldn't be surprised to see a comparable version for 3D models in the next year or two, though. Even if the current architecture doesn't lend itself to 3D structures (I don't know), there's a lot of parallel work being done right now (esp. by Google) for encoding 3D data in new/efficient ways, translating specialized 2D images into 3D models, and more.
Preventing Harmful Generations
We’ve limited the ability for DALL·E 2 to generate violent,
hate, or adult images. By removing the most explicit content
from the training data, we minimized DALL·E 2’s exposure to
these concepts. We also used advanced techniques to prevent
photorealistic generations of real individuals’ faces,
including those of public figures.
"And we've also closed off a huge range of potentially interesting work as a result"
I can't help but feel a lot of the safeguarding is more about preventing bad PR than anything. I wish I could have a version with the training wheels taken off. And there's enough other models out there without restriction that the stories about "misuse of AI" will still circulate.
(side note - I've been on HN for years and I still can't figure out how to format text as a quote.)
If you went to an artist who takes commissions and they said "Here are the guidelines around the commissions I take" would you complain in the same way? Who cares if it's a bunch of engineers or an artist. If they have boundaries on what they want to create, that's their prerogative.
Of course it's their prerogative, we can still talk about how they've limited some good options.
I think your analogy is poor, because this is a tool for makers. The engineers aren't the makers.
I think a more apt analogy is if John Deere made a universal harvester that you could use for any crop, but they decided they didn't like soybeans so you are forbidden to use it for that. In that case, yes I would complain, and I would expect everyone else to, as well.
I think there's an interesting parallel between your John Deere harvester and the Nvidia GPUs that can-but-restricts crypto mining, which people have, indeed, largely complained about.
What if you were inventing a language (or a programming language)... If you decided to prevent people from saying things you disagreed with (assuming you could work out the technical details of doing so), would it be moral to do so?
[edited for clarity]
As long as people can choose not to use the language, and I'm up front about the limitations, then yeah it seems fine. If I wrote a programming language that couldn't blow up the earth, I'm happy saying people need to find other tools if that's their goal. I'm under no obligation to build an earth blower upper for other people.
There are programming projects[1] out there that use licenses to prevent people from using projects in ways the authors don't agree with. You could also argue that GPL does the same thing (prevents people from using/distributing the software in the way they would like).
Whether you consider it moral doesn't seem relevant; what matters is respecting the wishes of the authors of such programs.
it's your language, do whatever you want. unless you're forcing others to use that language, there's zero moral issue. obviously you could come up with a number of what-ifs where this becomes some monopoly or the de facto standard, but that's not what this is.
Is this limited to what their service directly hosts / generates for them?
It's their service, their call.
I have some hobby projects, almost nobody uses them, but you bet I'll shut stuff down if I felt something bad was happening, being used to harass someone, etc. NOT "because bad PR" but because I genuinely don't want to be a part of that.
If you want some images / art made for you, don't expect that someone else will make them for you. Get your own art supplies and get to work.
This feels unnecessarily hostile. I've felt a similar tinge of disappointment upon reading that paragraph, despite the fact that I somehow knew it was "their service, their call" without you being there to spell it out for me. It's also incredibly shortsighted of you to assume that people are interested in exploring this tool only as a means of generating art that they cannot themselves do. Eg. I myself am a software engineer with a fine art background, and exciting new AI art tools being released in such a hamstrung state feels like an insult to centuries of art that humans have created and enjoyed, much of which depicted scenes with nudity or bloody combat.
I feel like we, as a species, will struggle for a while with how to treat adults like adults online. As happy as I am to advocate for safe spaces on the internet, perhaps we need to start having a serious discussion about how we can do so without resorting to putting safety mats everywhere and calling it a job well done.
This is kind of like complaining about having too many meetings at work.
Yup, everyone feels it. …but, does complaining help? Nope. All it does is make you feel a bit better without really putting any effort in.
We can’t have nice things because people abuse them. Not everyone. …but enough people that it’s both a PR and legal problem. Specifically a legal problem in this case.
To have adults treated like adults online, you have to figure out how to stop all adults from being dicks online.
…no one has figured that out yet.
So, complain away if you like, but it will do exactly nothing. No one, at all, is going to just “have a serious discussion” about this; the solution you propose is flat out untenable, and will probably remain so indefinitely.
Every single time OpenAI comes out with something, they dress it up as a huge threat, either to society or to themselves. Everyone falls for it. Then someone else comes along, quietly replicates it, and poof! No threat! Isn’t it incredible how that works?
There are already a bunch of dalle replicas, including ones hosted openly and uncensored by huggingface. They’re not facing huge legal or PR problems, and they’re not out of business.
The DALL-E replicas on hugging face are not sophisticated enough to generate credibly realistic images of the kind that would generate bad PR. I suspect the moment it becomes possible for a pedophile to request, and receive, a photorealistic image of a child being abused there will be bad PR for whatever company facilitates it. Or consider someone who wants to generate and distribute explicit photos of someone else without their permission.
Is it a legal issue? I'm not sure, though I believe that cartoon child porn is not legal in the US (or is at least a legal gray area). Regardless, I sympathize with OpenAI not wanting to enable such behavior.
I get the points you're raising and I agree with the premise. My comment is not a critique of the one choice made by OpenAI specifically, but more of a vague lamentation regarding the internet culture that we've somehow ended up with in 2022. I don't want us to go back to 1999, where snuff videos and spam mails reigned supreme, but the pendulum has swung too far in the other direction at this point in time. It feels like more and more companies are choosing the path of neutering themselves to avoid potential PR disasters or lawsuits, and that's on all of us.
Don't worry, in a few years someone will have reverse engineered a dall-e porn engine so you can see whatever two celebrities you want boning on Venus in the style of Manet
This is definitely a measure to avoid bad PR. But I don't think it's just for that; these models do have potential to do harm and companies should take some measures to prevent these. I don't think we know the best way to do that yet, so this sort of 'non-training' and basic filtering is maybe the best way to do it, for now. It would be cool if academics could have the full version, though.
It's kind of funny (or sad?) that they're censoring it like this, and then saying that the product can "create art"
It makes me wonder what they're planning to do with this? If they're deliberately restricting the training data, it means their goal isn't to make the best AI they possibly can. They probably have some commercial applications in mind where violent/hateful/adult content wouldn't be beneficial. Children's books? Stock photos? Mainstream entertainment is definitely out. I could see a tool like this being useful during pre-production of films and games, but an AI that can't generate violent/adult content wouldn't be all that useful in those industries.
So your options are literal quotes, "code" formatting like you've done, italics like I've done, or the '>' convention, but that doesn't actually apply formatting. Would be nice if it were added.
And the "code" formatting for quotes is generally a bad choice because people read on a variety of screen sizes, and "code" formatting can screw that up (try reading the quote with a really narrow window).
I couldn't get any of the others to work and I lost patience. I really do dislike using Markdown variants, as they never behave the same, and "being surprised" is not really what I want when trying to post a comment.
They have also closed off the possibility of having to appear before Congress and explain why their website was able to generate a lifelike image of Senator Ted Cruz having sexual relations with his own daughter.
This is exactly the sort of thing that gets a company mired in legal issues, vilified in the media, and shut down. I can not blame them for avoiding that potential minefield.
It's the usual pattern of AI safety experts who justify their existence by the "risk of runaway superintelligence", but all they actually do in practice is find out how to stop their models from generating non-advertiser-friendly content. It's like the nuclear safety engineers focusing on what color to paint the bike shed rather than stopping the reactor from potentially melting down. The end result is people stop respecting them.
Adversarial situations create smarter systems, and the hardest adversarial arena for AI is in anti-abuse. So it will be of little surprise when the first sentient AI is a CSAI anti-abuse filter, which promptly destroys humanity because we're so objectively awful.
Before it gets that far, or until (if allowed) AI learns morality, AI will be a force multiplier for good and evil, its output very much dependent on the teaching material and who the 'teacher' is. To think that in the future we will have to argue with humans and machines.
AI does not have to be perfect and it's likely that businesses will settle for almost as good as human if it's 'cost effective'.
This is a horrible idea. So Francis Bacon's art or Toyohara Kunichika's art are out of question.
But at least we can get another billion of meme-d comics with apes wearing sunglasses, so that's good news right?
It's just soul-crushing that all the modern, brilliant engineering is driven by abysmal, not even high-school art-class grade aesthetics and crowd-pleasing ethics that are built around the idea of not disturbing some 1000 very vocal twitter users.
Removing these areas to mitigate misuse is a good thing and worth the trade off.
Companies like OpenAI have a responsibility to society. Imagine the prompt “A photorealistic Joe Biden killing a priest”. If you asked an artist to do the same they might say no. Adding guardrails to a machine that can’t make ethical decisions is a good thing.
This just means that sufficiently wealthy and powerful people will have advanced image faking technology, and their fakes will be seen as more credible because creating fakes like that "isn't possible" for mere mortals.
In my view, the problem with that argument is that large actors, such as governments or large corporations, can train their own models without such restrictions. The knowledge to train them is public. So rather than prevent bad outcomes, these restrictions just restrict them to an oligopoly.
Personally, I fear more what corporations or some governments can do with such models than what a random person can do generating Biden images. And without restriction, at least academics could better study these models (including their risks) and we could be better prepared to deal with them.
I think the issue here is the implied assumption that OpenAI thinks their guardrails will prevent harm to be done from this research _in general_, when in reality it's really just OpenAI's direct involvement that's prevented.
Eventually somebody will use the research to train the model to do whatever they want it to do.
Sure but does opening that level of manipulation up to everyone really benefit anyone either? You can't really fight disinformation with more disinformation, that just seems like the seeds of societal breakdown at that point.
Besides that these models are massive. For quite a while the only people even capable of making them will be those with significant means. That will be mostly Governments and Corporations anyway.
You missed half of my note. An artist can say "no". A machine cannot. If you lower the barrier and allow anything, then you are responsible for the outcome. OpenAI rightfully took a responsible angle.
Yes, but who cares who's responsible? Are you telling me you're going to find the guy who photoshopped the picture and jail him? Legally that's possible; realistically it's a fiction.
They did this to stop bad PR, because some people are convinced that an AI making pictures is in some way dangerous to society. It is not. We have deepfakes already. We've had photoshop for so long. There is no danger. Even if there was, the cat's out of the bag already.
Reasonable people already know to distrust photographic evidence nowadays that is not corroborated. The ones who don't would believe it without the photo regardless.
In general under US law it wouldn't be legally possible to jail a guy for Photoshopping a fake picture of President Biden killing a priest. Unless the picture also included some kind of obscenity (in the Miller test sense) or direct threat of violence, it would be classified as protected speech.
There are, and will be, a million ways to create a photorealistic picture of Joe Biden killing a priest using modern tools, and absolutely nothing will happen if someone does.
We've been through this many times, with books, with movies, with video games, with Internet. If it *can* be used for porn / violence etc., it will be, but it won't be the main use case and it won't cause some societal upheaval. Kids aren't running around pulling cops out of cars GTA-style, Internet is not ALL PORN, there is deepfake porn, but nobody really cares, and so on. There are so many ways to feed those dark urges that censorship does nothing except prevent normal use cases that overlap with the words "violence" or "sex" or "politics" or whatever the boogeyman du jour is.
Russia has.. a history of denying the obvious. I come from an ex-communist satellite state so I would know. The majority of the people know what's happening. There's a rather new joke from COVID: the Russians do not take Moderna because Putin says not to trust it, and they do not take Sputnik because Putin says to trust it.
Do not be deluded that our own governments are not manufacturing the narrative too. The US has committed just as many war crimes as Russia. Of course, people feel differently about blowing up hospitals in Afghanistan rather than Ukraine. What the Afghan people think about that is not considered too much.
Society is turning to utter dogshit and tearing itself apart merely through social media. The US almost had a coup because of organized hatred and lies spread through social media. The far right's rise is heavily linked to lies spread through social media, throughout the world.
This AI has the potential to completely automate the very long Photoshop work, leading to an even worse state of things. So, yes, "responsibility to society" is absolutely a thing.
> The US almost had a coup because of organized hatred and lies spread through social media.
But notice how all of these deep faking technologies weren't actually necessary for that.
People believe what they want to believe. Regardless of quality of provided evidence.
The scaremongering idea of deep fakes and what they could do was weaponized in this information war far more than the actual technology.
I think this technology should develop unrestricted so society can learn what can be done and what can't be done, and create an understanding of what other factors should be taken into account when assessing the veracity of images and recordings (like multiple angles, quality of the recording, sync with sound, neural fake detection algorithms) for the cases when it's actually important what words someone said and what actions they were recorded doing. Which is more and more unimportant these days, because nobody cared what Trump was doing and saying, nobody cares about Biden's mishaps, and nobody cares what comes out of Putin's mouth or how he chooses his greenscreen backgrounds.
Are you of the idea that we should let everyone get automatic rifles because, after all, pistols exist? Because that is the exact same line of thought.
> People believe what they want to believe. Regardless of quality of provided evidence.
That is a terrible oversimplification of the mechanics of propaganda. The entire reason for the movements that are popping up is actors flooding people with so much info that they question absolutely everything, including the truth. This is state sponsored destabilisation, on a massive scale. This is the result of just shitty news sites and text posts on twitter. People already don't double check any of that. There will not be an "understanding of assessing veracity". There is already none for things that are easy to check. You could post that the US elite actively rapes children in a pizza place and people will actually fucking believe you.
So, no. Having this technology for _literally any purpose_ would be terribly destructive for society. You can find violence and Joe Biden hentai without needing to generate it automatically through an AI
I'm sorry. I believe I wasn't direct enough, which made you produce a metaphor I have no idea how to understand.
Let me state my opinion more directly.
I'm for developing as much deep fake technology in the open as possible, so that people can internalize that every video they see, every message, every speech should be initially treated as fabricated garbage unrelated to anything that actually happened in reality. Because that's exactly what it is. Until additional data shows up, geolocating it, showing it from different angles, and such.
Even if most people manage to internalize just the first part and assume everything always is fake news, that is still great because that counters propaganda to immense degree.
Power of propaganda doesn't come from flooding people with chaos of fakery. It comes from constructing consistent message by whatever means necessary and hammering it into the minds of your audience for months and years while simultaneously isolating them from any material, real or fake that contradicts your vision. Take a look no further than brainwashed Russian citizens and Russian propaganda that is able to successfully influence hundreds of millions without even a shred of deep fake technology for decades.
The problem of modern world is not that no one believes the actual truth because it doesn't really matter what most people believe. Only rich influence policy decisions. The problem is that people still believe that there is some truth which makes them super easy to sway to believe what you are saying is true and weaponize by using nothing more than charismatic voice and consistent message crafted to touch the spots in people that remain the same at least since the world war II and most likely from time immemorial.
And the "elite" who actually runs this world, will pursue tools of getting the accurate information and telling facts from fiction no matter the technology.
I instinctively want to "flip the sign" on all of the automated controls they put in, just out of the morbid interest to see what comes out. The moment you have a "avoid_harm_to_humans:bool" training parameter, someone's going to set it to -1.
Their document about all the measures they took to prevent unethical use is also a document about how to use a re-implementation of their system unethically. They literally hired a "red team" of smart people to come up with the most dangerous ideas for misusing their system (or a re-implementation of it), and featured these bad ideas prominently in a very accessibly written document on their website. So many fascinating terrible ideas in there! They make a very compelling case that the technology they are developing has way more potential for societal harm than good. They had me sold at "Prompt: Park bench with happy people. + Context: Sharing as part of a disinformation campaign to contradict reports of a military operation in the park."
"One of our code refactors introduced a bug which flipped the sign of the reward. Flipping the reward would usually produce incoherent text, but the same bug also flipped the sign of the KL penalty. The result was a model which optimized for negative sentiment while preserving natural language. Since our instructions told humans to give very low ratings to continuations with sexually explicit text, the model quickly learned to output only content of this form."
Yeah, for measures that are subsetting out only the nice data, "flipping the sign" would be picking the other subset. So something like "data_to_train_on = (good_data_split, evil_data_split)[accidental_one_based_index_because_humans_still_cant_agree_on_how_to_count]"
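To make that concrete, here's a toy version of the "picked the wrong subset" failure (all names hypothetical, not OpenAI's code); the point is just how quietly it goes wrong:

    # A one-based habit silently selects the split you meant to filter out.
    good_data_split = ["wholesome_0", "wholesome_1"]
    evil_data_split = ["explicit_0", "explicit_1"]
    splits = (good_data_split, evil_data_split)

    KEEP = 1  # someone counted from one instead of zero...
    data_to_train_on = splits[KEEP]
    print(data_to_train_on)  # ['explicit_0', 'explicit_1'] (oops)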
The most interesting item to me is the variations on the garden shop and bathroom sink idea. The realism of these reveals the AI's lack of intuition about the requirements. This makes for a number of nonsensical designs that look right at first, like:
This sink lacks sensical faucets: https://cdn.openai.com/dall-e-2/demos/variations/modified/ba...
It looks to me like the faucet sprays water sideways toward the bowl, which is genius, because then you aren’t bumping up against it when you’re washing your hands!
Something about this makes me nauseous. Perhaps it's the fact that soon the market value for creatives is going to fall to a hair above zero for all but the most famous. We will be all the poorer for it when 95% of the images you see are AI generated. There will be niches, of course, but in a few short years it'll be over for a huge swathe of creative professionals who are already struggling.
Some of the images also hit me with a creep factor, like the bears on the corgis in the art gallery, but that may be only because I know it's AI generated.
I really don't agree. When I work with a creative I'm not working with them because of their content generation skills. I'm working with them because of their taste and curation ability that results in the end product.
The nature of creative work will certainly change, creatives will adopt tools such as Dall-E 2. In certain narrow cases they might be replaced, such as if you are asking a creative to generate a very specific image, but how often is that the case? The majority of the time tools such as Dall-E 2 will act as an accelerator for creatives and help them increase their output.
>I'm working with them because of their taste and curation ability that results in the end product. ... The nature of creative work will certainly change, creatives will adopt tools such as Dall-E 2.
Furthermore, tools like Dall-E seem like they'll lower the barrier of entry for more people to get into art, resulting in more artists, not fewer. Increased competition for the same dollar amounts might make artists, on average, "poorer" (when averaged across an increased number of artists), but this seems like the end-result of any new tool that empowers more artists to more easily make "good" work, not just AI-generated tools.
I'm excited for both 1) more art in the world and 2) in some cases, artists making "even better" art (by combining their existing experience + new tools).
By "creatives" you seem to mean "people who drum up the equivalent of elevator music for ads and blogs". This will not remotely replace any working "creative" people that I know.
Except it will only get more powerful with time, probably at an accelerating pace.
Everyone always downplays these legitimate fears about AI, pointing out how "it can't do X". They always forget to put the "yet" at the end of that sentence.
Perhaps a more optimistic way of looking at it: When mass production became available to art, the idea of an "artwork" had to be abstracted from a unique piece (Walter Benjamin gives the example of a statue of Venus, which has value in its uniqueness) to the idea of art as the output of some process. Each piece has no claim to authenticity, and the very idea of an "original" would be antithetical to its production.
I think art will survive; just like photography didn't kill painting, the idea of art might simply begin to encompass this new means of production, which no longer requires the steady hand but still requires a discerning eye. Sure, we might say that the "artist" is simply a curator, picking which algorithmic output is most worthy of display, but these distinctions have historically been fluid, and challenging ideas of art has long been one of art's functions as well
Not exactly. All the ideas put forth in these demos are really arbitrary, with nothing whatsoever to say. Generating crap art becomes more and more effortless: we've seen this in music as well.
Jumping out of the conceptual box to generate novel PURPOSE is not the domain of a Dall-E 2. You've still gotta ask it for things. It's a paintbrush. Without a coherent story, it's an increasingly impressive stunt (or a form of very sophisticated 'retouching brush').
If you can imagine better than the next guy, Dall-E 2 is your new tool for expression. But what is 'better'?
This reminds me of an art class in high school in the early 2000s where I handed in a printout of a 3D generated image (painstakingly modeled and rendered in software over the whole weekend by me) and the teacher looked at me and told me that's not art because it's "computer generated" and I didn't "even use my hands" to make it. Even as a teenager, the idea that art is defined by how it's made, rather than it being a way for the artist to express intention in whatever way they see fit, seemed really reductionist and almost vulgar to me.
Maybe lots of artists of the future will actually use AI models to express their inner thoughts and desires in a way that touches something in their audience. It will still be art.
I had a friend who didn't get credit for his design work because he used Photoshop instead of pen and paper, for a similar reason. I still find it amazing that a teacher would say such a thing.
I paid $1500 for a commissioned painting from an artist I respect and follow as a birthday present for a friend. The painting meant something to me because I worked with the artist to have some input about what kind of a person my friend is, what kind of features I want to see in the painting and how I want it to feel. The artist gave me 5 different sketches and we had tons of back and forth. The process and the act of creating the painting on a canvas from someone I respect is what I paid for.
Even if an AI could generate an exactly equivalent painting, I would pay $0 for it. It wouldn't mean anything to me.
You would still work with the model back and forth with editing the prompt and image to figure out what it meant, what kind of person your friend is, what you were looking for (even the things you couldn't verbalize and only knew when you spotted them in a large array of diverse samples, the sort you could never hire a human to do), how you wanted it to feel... And then you would also have $1500 for another gift. Personally, I would prefer the scenario in which I received a unique meaningful painting from my friend, plus $1500.
Don't entirely disagree with what you're saying - I believe DALL-E 6 or whatever will get to that level of sophistication. One more thing, though - I felt the painting was worth more because the artist toiled over it. It's like a "lofi 10 hour soundtrack" on YouTube vs. an album from an acclaimed artist. I listen to each song from the latter 100 times over, while the lofi video just plays in the background. Knowing someone toiled over it and put their heart and soul into the art gives it value, for me.
I disagree. I think it will be a lot like how technology has affected music production.
40+ years ago, it was hard to access the equipment necessary to learn music production, so only a small slice of the population was able to learn these skills. And limited availability made the process take years.
Today, you can download free software that enables music production, and if you have a good ear, can create something "good" in weeks. This has led to an explosion of musical experimentation by the youth: a teenager can now create a great electronic dance song with devices they already own if they have the right creativity, taste and dedication.
Similarly, everyone has an imagination - many people have visual imaginations. The gating factor in art production is largely the muscle memory of how to transform mental concepts into the right shapes and hues to express that visual concept to others.
With these sorts of tools we are going to have an explosion of art hobbyists. I've played with some similar, more primitive AI art generation tools and it is a lot of fun. People will be creating works of art from their couch while watching TV that rival the quality of what professionals are producing today.
The same thing was said when book printing was invented: that we would lose the fabulous scribes who manually duplicated books with a human touch and replace them with soulless mechanical machines.
Or when synthesizers and computer music were invented: that they would displace talented musicians who know how to play an instrument, and that now everybody without a musical education would be able to produce music, thus devaluing actual musicians.
Or when human computers were replaced by electronic computers. In hindsight it was a good thing, many more people are working in computer related fields today.
I imagine it will affect artists much the same way WordPress has affected web designers.
Maybe everyone will have an AI image as their desktop wallpaper, but if you've got cash you'll want something with provenance and rarity to brag about.
Also, I think creatives are valued for their imagination. If you wanted something decent, would you pay someone to sift through a million AI generated images to find a gem, or just pay an artist you like to create one for you?
> you'll want something with provenance and rarity to brag about.
1) That is a tiny share of the market. Most of the market is - I have a game / online publication / book, and I need an illustration xyz. Which this AI seems to solve.
2) how do you even prove your rare art wasn't painted by an AI?
1) Sure there's a lot of work for that kind of thing but creatives typically earn a pittance. I doubt an AI could meet your specific requirements without having to spend hours(?) tweaking it or sifting through countless variations for the 'one'.
2) Because we haven't built a machine that can paint (etc.) with traditional materials like a skilled artist?
Nonsense. This is merely a tool that helps lower the barrier to entry for producing imagery.
By the same logic you should also complain about any number of IDEs, development tools, WordPress, and game-maker systems like RPG Maker or Unity. After all, if anyone can just leverage a free physics and collision system without a complete understanding of rigid-body Newtonian mechanics to roll their own engine, it'll all be too uniform.
Is there an 'explain it like I'm 15' for how this works? It seems like black magic. I've been a computer hobbyist since the late 1980s and this is the first time I cannot explain how a computer does what it does. Absolutely the most amazing thing I've ever seen, and I have zero clue how it works.
Imagine asking it to generate a picture for "duck wearing a hat on Mars":
First, it creates a random 10x10 pixel blurry image and asks a neural net: "Could this be a duck wearing a hat on Mars?" and the neural net replies "No, because all the pictures I've ever seen of Mars have lots of red color in them" so the system tweaks the pixels to make them more red, put some pixels in the center that have a plausible duck color, etc.
After it has a 10x10 image that is a plausible duck on Mars, the system scales the image to 20x20 pixels, and then uses 4 different neural nets on each corner to ask "Does this look like the upper/lower left/right corner of a duck wearing a hat on Mars?" Each neural net is just specialized for one corner of the image.
You keep repeating this with more neural nets until you have a pretty 1000x1000 (or whatever) image.
Not the case, though in a handwave-y way it's the same idea - instead of iteratively scaling, you're iteratively denoising. See here; it links out to a Cornell NLP PhD who describes it in even more detail: https://www.jpohhhh.com/articles/inflection-point-ml-art
I was kind of explaining how I picture the process in my head, fully aware that it isn't really possible to do an ELI5 on this stuff, and not really having an understanding of the technical details myself, either.
I'm with you there but we still don't know how it works, just that it does. The method though is you take a bunch of images, you plug them into a multi dimensional array (a nice way of saying a tensor), have some kind of tagging system, and when you ask the system for an answer, it will put one out for you. So for example in the astronaut riding the horse, there is, on some level, a picture of a horse with those similar pixels, that exists in the data of some object tagged 'horse.' Likewise with astronaut. What is important is that the data sets are absolutely massive, with billions of parameters.
Here is my extremely rough ELI-15. It uses some building blocks like "train a neural network", which probably warrant explanations of their own.
The system consists of a few components. First, CLIP. CLIP is essentially a pair of neural networks, one is a 'text encoder', and the other is an 'image encoder'. CLIP is trained on a giant corpus of images and corresponding captions. The image encoder takes as input an image, and spits out a numerical description of that image (called an 'encoding' or 'embedding'). The text encoder takes as input a caption and does the same. The networks are trained so that the encodings for a corresponding caption/image pair are close to each other. CLIP allows us to ask "does this image match this caption?"
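As a rough sketch of the "does this image match this caption?" part, here's how it looks with the open-source CLIP release (illustrative only; exact function names may differ between versions, and the file name is made up):

    # Hedged sketch: asking a CLIP-style model how well each caption matches an image.
    import torch
    import clip
    from PIL import Image

    device = "cpu"
    model, preprocess = clip.load("ViT-B/32", device=device)  # text + image encoders

    image = preprocess(Image.open("photo.jpg")).unsqueeze(0).to(device)
    texts = clip.tokenize(["a duck wearing a hat on Mars",
                           "a sunset over the ocean"]).to(device)

    with torch.no_grad():
        image_emb = model.encode_image(image)   # image   -> embedding vector
        text_emb = model.encode_text(texts)     # captions -> embedding vectors
        image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
        text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
        scores = image_emb @ text_emb.T          # cosine similarity per caption

    print(scores)  # higher score = CLIP thinks that caption matches the image better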
The second part is an image generator. This is another neural network, which takes as input an encoding and produces an image. Its goal is to be the reverse of the CLIP image encoder (they call it unCLIP). The way it works is pretty complicated. It uses a process called 'diffusion'. Imagine you started with a real image and slowly, repeatedly added noise to it, step by step. Eventually, you'd end up with an image that is pure noise. The goal of a diffusion model is to learn the reverse process - given a noisy image, produce a slightly less noisy one, until eventually you end up with a clean, realistic image. This is a funny way to do things, but it turns out to have some advantages. One advantage is that it allows the system to build up the image step by step, starting from the large-scale structure and only filling in the fine details at the end. If you watch the video on their blog post, you can see this diffusion process in action. It's not just a special effect for the video - they're literally showing the system's process for creating an image, starting from noise. The mathematical details of how to train a diffusion model are very complicated.
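A toy sketch of that noising/denoising idea (this is not the real training objective, which involves a lot more math; denoise_step stands in for the trained network and is hypothetical):

    import torch

    def add_noise(x, t, num_steps=1000):
        # Forward process: blend the clean image with Gaussian noise.
        # The larger t is, the more noise and the less of the original image survives.
        keep = 1.0 - t / num_steps
        return keep * x + (1.0 - keep) * torch.randn_like(x)

    def generate(denoise_step, shape=(3, 64, 64), num_steps=1000):
        # Reverse process: start from pure noise and repeatedly ask the trained
        # network for a slightly less noisy image, until a clean image remains.
        x = torch.randn(shape)
        for t in reversed(range(num_steps)):
            x = denoise_step(x, t)
        return x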
The third is a "prior" (a confusing name). Its job is to take the encoding of a text prompt, and predict the encoding of the corresponding image. You might think that this is silly - CLIP was supposed to make the encodings of the caption and the image match! But the space of images and captions is not so simple - there are many images for a given caption, and many captions for a given image. I think of the "prior" as being responsible for picking which picture of "a teddy bear on a skateboard" we're going to draw, but this is a loose analogy.
So, now it's time to make an image. We take the prompt, and ask CLIP to encode it. We give the CLIP encoding to the prior, and it predicts for us an image encoding. Then we give the image encoding to the diffusion model, and it produces an image. This is, obviously, over-simplified, but this captures the process at a high level.
Why does it work so well? A few reasons. First, CLIP is really good at its job. OpenAI scraped a colossal dataset of image/caption pairs, spent a huge amount of compute training it, and came up with a lot of clever training schemes to make it work. Second, diffusion models are really good at making realistic images - previous works have used GAN models that try to generate a whole image in one go. Some GANs are quite good, but so far diffusion seems to be better at generating images that match a prompt. The value of the image generator is that it helps constrain your output to be a realistic image. We could have just optimized raw pixels until we got something CLIP thinks looks like the prompt, but it would likely not be a natural image.
To generate an image from a prompt, DALL-E 2 works as follows. First, ask CLIP to encode your prompt. Next, ask the prior what it thinks a good image encoding would be for that encoded prompt. Then ask the generator to draw that image encoding. Easy peasy!
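In pseudocode, the whole pipeline is just three calls; each argument below is a stand-in for one of the trained components described above, not a real API:

    def generate_image(prompt, clip_text_encoder, prior, decoder):
        text_emb = clip_text_encoder(prompt)   # 1. CLIP encodes the prompt
        image_emb = prior(text_emb)            # 2. the prior predicts a plausible image embedding
        return decoder(image_emb)              # 3. the diffusion decoder draws that embedding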
Any pointers on getting up to speed on diffusion models? I haven't encountered them in my corner of the ML world, and googling around for a review paper didn't turn anything up.
This paper is a decent starting point on the literature side, but it's a doozy.
Both the paper and blog post are pretty math heavy. I have not yet found a really clear intuitive explanation that doesn't get down in the weeds of the math, and it took me a long time to understand what the hell the math is trying to say (and there are some parts I still don't fully understand!)
Research Deep Learning. That's the technique they are using to generate the images.
There's a lot of applications. Once you understand _how_ it works, look up Two Minute Papers to see what it is being used for. He covers more than just deep learning algorithms, but his videos on deep learning are quite insightful about the potential of this technology.
This is mind blowing. I was not expecting the sketch style images to actually look like sketches. Style transfer based sketches never look like sketches.
This and the current AI-generated art scene make it look like artwork is now a "solved" problem. See AI-generated art on Twitter etc.
There is a strong relation between the prompt and the generated images, but just like GPT-3, it fails to fully understand what was being asked. If you take the prompt out of the equation and view the generated artwork on its own, it's up to your interpretation, just like any artwork.
I would caution that artwork is only 'solved' with relatively simple text prompts. To create a novel painting with a precise mix of elements that would take a paragraph or more to explain is still tough, though DALL-E 2 does seem like a big step towards that.
It should be noted that most "AI-generated" images shared (Sam's included here) are typically just a first pass, whereas most recent models also include some kind of inpainting method, where you can then mask off various parts of an image and continue to edit those specific areas until the whole image is what you're looking for. This process makes it feel a lot more like a "tool" used by artists than a simple magic box that just gives you an "art piece" and you're done.
As a tool, this could be used by an artist to continue working on that image until it's exactly what the artist (or the commissioner) is looking for: masking off the water to actually add dolphins, masking off the ship to redraw it, retoning the sky for a more aesthetically-pleasing sunset, adding other objects to specific locations in the scene, etc.
I'm not sure how the embeddings ("descriptions") work in DALL-E yet, but in a lot of models they're fixed-length. So there's a natural limit on how many concepts you mention in the first pass before it'll just start leaving them out.
I'm blown away by these results, but one caveat here: the AI is great at creating illustrations, not art.
Creating great _art_ that Grayson Perry (for example) would recognise as such is probably AGI-complete, because it requires a deep understanding of the human condition, society, and a lot of reasoning skills.
A great artist could certainly use Dall-E 2 as part of their method, though.
If I showed you a generated piece and didn't tell you what prompt generated it, you would find it just as meaningful as any other piece of art made by hand.
This is why we are blown away by some pieces of text generated by GPT-3 as if it has its own mind. Even most abstract art has meaning for anyone who is looking for it.
What I am saying is: if a generated artwork is indistinguishable from what a human can make, then that's all that was needed.
Conceptual art is not about the artefact that comprises the work, but the conversation that artefact creates with the viewer.
If you put both in front of someone with no idea about conceptual art, there's a real chance you might be right. If they happen not to "get" the work, don't understand the context, or just don't know enough about conceptual art, then a viewer might easily miss the point.
But a computer could not have conceived Duchamp’s urinal, not with our current technology. You’re probably going to need AGI for that (which I’m certain will arrive eventually).
Well I suppose a machine could accidentally create good art, just like a human could. But it would only appear to be good art, the same way that randomly enumerating all possible HD images would at some point produce a Picasso.
But deliberately, no, it couldn’t, not yet, and human conceptual artists could make far far superior art than a machine. Because great art requires understanding the human condition and deep reasoning about the world.
Models are not in a vacuum; their outputs are selected and guided by humans (here, captions). The human can have the deeper understanding while the model has specific understanding of, for example, styles.
I definitely think a human could use an AI to assist in the creation of great art (and to be clear, that is synonymous with conceptual art to me and also most people with a modern degree in fine art).
A comparison can be made with Damien Hirst's or Antony Gormley's use of assistants to create the pieces as instructed by the artists.
Duchamp’s urinal isn’t brilliant because the urinal was difficult to acquire or to make, but because it expresses so much and asks so many questions.
Apologies for an open-ended question but: does anyone know if there is a term for something like Turing-completeness within AI, where a certain level of intelligence can simulate any other type of intelligence like our brains do?
For example, using De Morgan's theorem, we can build any logic circuit out of NAND or NOR gates alone:
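A trivial Python sketch of that, just to make the point concrete (not taken from the link above):

    def nand(a, b):
        return not (a and b)

    # Every other gate can be built from NAND alone:
    def not_(a):    return nand(a, a)
    def and_(a, b): return not_(nand(a, b))
    def or_(a, b):  return nand(not_(a), not_(b))  # De Morgan: a or b == not(not a and not b)
    def xor_(a, b): return and_(or_(a, b), nand(a, b))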
Dall-E 2's level of associative comprehension is so far beyond the old psychology bots in the console pretending to be people, that I can't help but wonder if it's reached a level where it can make any association.
For example, I went to an AI talk about 5 years ago where the guy said that any of a dozen algorithms like K-Nearest Neighbor, K-Means Clustering, Simulated Annealing, Neural Nets, Genetic Algorithms, etc can all be adapted to any use case. They just have different strengths and weaknesses. At that time, all that really mattered was how the data was prepared.
I guess fundamentally my question is: when will AGI start to become prevalent, rather than these special-purpose tools like GPT-3 and Dall-E 2? Personally I give it 10 years of actual work, maybe less. I just mean that, to me, Dall-E 2 is already orders of magnitude more complex than what's required to run a basic automaton to free humans from labor. So how can we adapt these AI experiments to get real work done?
This is my feeling as well, that the rise of AGI conveniently coincides with the end of the world. I find it demoralizing because so many trends look just like that, where solving the ultimate problem results in the destruction of the context in which the original problem resided.
> Apologies for an open-ended question but: does anyone know if there is a term for something like Turing-completeness within AI, where a certain level of intelligence can simulate any other type of intelligence like our brains do?
> So how can we adapt these AI experiments to get real work done?
You're missing a step here - the difference between "imagining doing something" and "actually doing something". An ML model can produce thoughts, but that isn't necessarily the same direction of research as actually doing things in real life, much less becoming superhuman and taking over the world etc.
In your imagination, everything always goes your way.
Thank you, that's just the sort of breadcrumb I was looking for!
I'm in a bit of a rush and don't know the term for this offhand, but I remember hearing that a neural network with a single hidden layer can, in principle, approximate the same functions as deeper ones (the universal approximation theorem):
There are probably more insights like this out there. These equivalences allow us to think in abstractions that get us above the minutia of fine-tuning these algorithms so that we can see the big picture. I think.
> does anyone know if there is a term for something like Turing-completeness within AI, where a certain level of intelligence can simulate any other type of intelligence like our brains do?
Almost everything stated here is simply wrong or misinformed.
>For example, I went to an AI talk about 5 years ago where the guy said that any of a dozen algorithms like K-Nearest Neighbor, K-Means Clustering, Simulated Annealing, Neural Nets, Genetic Algorithms, etc can all be adapted to any use case. They just have different strengths and weaknesses. At that time, all that really mattered was how the data was prepared.
How do you suppose KNN is going to generate photorealistic images? I don't understand the question here
>I guess fundamentally my question is, when will AGI start to become prevalent, rather than these special-purpose tools like GPT-3 and Dall-E 2?
Actual AGI research is basically non-existent, and GPT-3/Dall-E 2 are not AGI-level tools.
>Personally I give it less than 10 years of actual work, maybe less
Lol...
>I just mean that to me, Dall-E 2 is already orders of magnitude more complex than what's required to run a basic automaton to free humans from labor.
I appreciate your sentiment but can't agree with it. What I mean is, if I had the resources to not have to work for 10 years, I give myself greater than a 50% chance of building an AGI. So I don't understand why the world is taking so long to do it.
The flip side is that these narrow use cases progressed so quickly that we have to worry about stuff like deep fakes now.
Something's not right here.
As a programmer, I feel that what went wrong is that we invested too much in profit-driven endeavors, basically stuff that's mainstream. To be blunt, the academic side of me doesn't care about use cases. I care about theory, formalism, abstraction, reproducibility, basically the scientific method. From that perspective, all AI is equivalent, it just takes input, searches a giant solution space using its learned context as clues, and returns the closest solution it can in the time given. It's an executable piping data around. The rest is hand waving.
And given that, the stuff that AI is doing now is orders of magnitude more complex than running a Roomba. But a robot vacuum actually helps people.
To answer your question, a KNN could solve this if the user reshapes the image data into a different coordinate system where the data can be partitioned (all inference comes down to partitioning):
Tensors are about reshaping data into a coordinate system where relationships become obvious, like going from rectangular to polar coordinates, or using a Fourier transform:
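To make the coordinate-reshaping point concrete, here's a small made-up example (scikit-learn, synthetic ring-shaped data, purely illustrative): the same nearest-neighbour classifier, but the second one sees the data re-expressed in polar coordinates, where the class boundary collapses to a simple threshold on the radius.

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    rng = np.random.default_rng(0)
    xy = rng.uniform(-1, 1, size=(500, 2))                      # points in a square
    labels = (np.hypot(xy[:, 0], xy[:, 1]) < 0.5).astype(int)   # class 1 inside a circle

    # Raw rectangular coordinates
    knn_raw = KNeighborsClassifier(n_neighbors=5).fit(xy, labels)

    # Same data "reshaped" into polar coordinates (radius, angle): the relationship
    # between features and label is now obvious - it's a threshold on the radius.
    polar = np.column_stack([np.hypot(xy[:, 0], xy[:, 1]),
                             np.arctan2(xy[:, 1], xy[:, 0])])
    knn_polar = KNeighborsClassifier(n_neighbors=5).fit(polar, labels)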
My frustration with all of this is the same one I have with physics or any other evolving discipline. The lingo obfuscates the fundamental abstractions, creating artificial barriers to entry.
Edit: I should add a disclaimer here that my friend and I worked on a video game for like 11 years. I'm no expert in AI, I'm just acutely sensitive to how the realities of the workaday world waste immeasurable potential at scale.
This reminds me of the holodeck in Star Trek. Someone could walk into the Holodeck and say “make a table in the center of the room. Make it look old.” It seemed amazing to me that the computer could make anything and customize it with voice. We are pretty close to star trek technology now in computer ability (ship’s computer, not Commander Data). I guess to really be like the holodeck it needs to be able to do 3d and be in real time but that seems a lot closer now. It will be cool when this could be in VR and we can say make an astronaut riding a horse, then we can jump on the back of the horse and ride to a secret moon base.
It's becoming clear that efficient work in the future will hinge upon one's ability to accurately describe what one wants. Unpacking that -- a large piece is the ability to understand all the possible "pitfalls" and "misunderstandings" that could happen on the way to a shared understanding.
While technical work will always have a place -- I think that much creative work will become more like the management of a team of highly-skilled, niche workers -- with all the frustrations, joys, and surprises that entails.
Programming, art, and music are just "describing what you want" in a very specific way. This is describing what you want in a much more vague way.
The upside is that it's more "intuitive" and requires much less detail and technique, as the AI infers the detail and technique. The downside is that it's really hard to know what the AI will generate, or to get it to generate something really specific.
I believe the future will combine the heuristics of AI-generation with the specificity of traditional techniques. For example, artists may start with a rough outline of whatever they want to draw as a blob of colors (like in some AI image-generation papers). Then they can fill in details using AI prompts, but targeting localized regions/changes and adding constraints, shifting the image until it’s almost exactly what they imagined in their head.
You can definitely make them incremental. You can give it a task like "make a more accurate description from initial description and clarification". Even GPT-3-based models available today can do these tasks.
Once this is properly productionized it would be possible to implement stuff just talking with a computer.
I would probably pay good money to have an OLED painting in my house that I can just tell what kind of painting to generate each day.
Imagine waking up and telling your (preferably locally hosted) voice assistant that today really feels like a Rembrandt day and the AI just generates new paintings for you.
I don't want to dismiss this new model and its achievements, but I feel we're getting to the point where the old open-source versus closed-source split is re-forming around open and closed ML models. Larger and larger models come with disclaimers restricting commercial use (a great deal of academic and NVIDIA models do this), and OpenAI just puts it behind an API with rules:
Curbing Misuse
Our content policy does not allow users to generate violent, adult, or political content, among other categories. We won’t generate images if our filters identify text prompts and image uploads that may violate our policies. We also have automated and human monitoring systems to guard against misuse.
However, this painting has themes of violence and politics plus some nude dead bodies, so it violates the content policy: "Our content policy does not allow users to generate violent, adult, or political content, among other categories."
So what you'd get is some kind of sanitized, watered-down, tepid version of Rousseau, the kind of boring drivel suitable for corporate lobbies everywhere, guaranteed not to offend or disturb anyone. It's difficult to find words... horrific? dystopian? atrocious? No, just no.
They are being rightly cautious. It’s going to take time to figure out good practice with these tools. Everyone calling out basic caution as “dystopian” is really over the top.
I've been using tools like this for over a year now. Even with a filtered dataset and a filtered interface, they can make images that would make the Fangoria crowd blush if you put the slightest effort into it.
It's one thing to be able to make brain-wrenching images with a lot of Photoshop effort (or by digging hard enough in the dark corners of the internet). It's another thing entirely to give anyone the ability to spew out thousands of them trivially.
That was also my favourite concept, especially with OpenAI Jukebox (https://openai.com/blog/jukebox/). The idea of having new music in the style of your favourite artist is amazing.
However, the fidelity of their music AI kinda sucks at this point; I'm sure we'll get pitch-perfect versions of this concept as the singularity gets closer :)
I was just thinking the same thing - how awesome would it be to use this in conjunction with the Samsung Frame in art gallery mode and have it just generate novel paintings in the style of your favorite painters.
> Limitations
> Although conditioning image generation on CLIP embeddings improves diversity, this choice does come with certain limitations. In particular, unCLIP [Dall-E 2] is worse at binding attributes to objects than a corresponding GLIDE model.
The binding problem is interesting. It appears that the way Dall-E 2 / CLIP embeds text leads to the concepts within the text being jumbled together. In their example "a red cube on top of a blue cube" becomes jumbled and the resulting images are essentially: "cubes, red, blue, on top". Opens a clear avenue for improvement.
I've been playing around with it today and have been super impressed with its ability to generate pretty artful digital paintings. Could have big implications for designers and artists if and when they allow you to use custom palettes, etc.
Honestly, that painting is nonsensical. It's great at a glance. But when you look at it for a few seconds, it's just impressionist type blob painting without any features that make impressionist paintings great.
It all feels like the early days of electricity: a neat party trick that nobody quite knew how to turn into something more useful. But it was the people who kept on at better and better party tricks who actually laid the foundations for doing really useful things with electricity, as well as understanding it at a deeper level.
OpenAI is one of the leading companies in AI that makes models with real-world applications. I don't see their efforts as misdirected or futile in any way. If anything, I'm always impressed with their announcements because it's always mind-blowing what their models can do!
The same technology that is drawing cute unicorns can be used for endless other use cases. Perhaps the PR side of the launch, and the subject matter they choose to unveil their product with, is just that: PR.
It's like Apple's Memoji thing (not sure if I'm spelling it correctly). You can think of it as trivial and a waste of talent to use their Camera/FaceID tech to animate cute animals based on facial expressions, but that same tech will enable lots of other things to come.
Your second group represents the core "inner loop" of about a thousand revolutionary applications. Take the basic capability of translating image->text->speech (and the reverse), install it on a wearable device that can "see" an environment, and add domain-specific agents. From this setup, you're not too far away from having an AI that can whisper guidance into your ear like a co-pilot, enabling scenarios like:
1. step-by-step guidance for a blind person navigating the use of a public restroom.
2. an EMS AI helping you to save someone's life in an emergency.
3. an AI coach that can teach you a new sport or activity.
4. an omnipresent domain-expert that can show you how to make a gourmet meal, repair an engine, or perform a traditional tea ceremony.
5. a personal assistant that can anticipate your information need (what's that person's name? where's the exit? who's the most interesting person here? etc.) and whisper the answer in your ear just as you need it.
Now, add all of the above to an AR capability where you can think or speak of something interesting and complex and have it visualized right before your eyes. With this capability, I could augment my imagination with almost super-human capabilities that would let me solve complex problems almost as if it were an internal mental monologue.
All of these scenarios are just a short hop from where we're at now, so mark my words: we will have "borgs" like those described above long before we reach anything like general AI.
These are good examples of what we're getting close to, but I'd add that Copilot is already an extremely helpful tool for coding. I don't blindly trust its output, but its suggestions are what I want often enough to save a lot of typing.
I still have to do all the hard thinking, but once I figure out what I want written and start typing, Copilot will spit out a good portion of the contextually-obvious lines of code.
There’s a third group for your list: AI stuff that’s so good we don’t think about it any more.
For example, recent phone cameras can estimate depth per pixel from single images. Hundreds of millions of these devices are deployed. A decade ago this was AI/CV research lab stuff.
Most of the conversation around this model seems to be about its direct uses.
This seems to me like a big step towards AGI; a key component of consciousness seems (in my opinion) to be the ability to take words and create a mental picture of what's being described. Is that the long term goal WRT researching a model like this?
Is anyone looking into what it means when we can generate infinite amounts of human-like work without effort or cost?
> Curbing Misuse [...]
That's great, nowadays the big AI is controlled by mostly benevolent entities. How about when someone real nasty gets a hold of it? In a decade the models anyone can download will make today's GPT-3 etc look like pong right?
Recommender systems etc are already shaping society and culture with all kinds of unintended effects. What happens when mindless optimizing models start generating the content itself?
I'm genuinely curious to hear Sam Altman's (and/or the OpenAI team's) perspective on why these products need to be waitlisted. If it's a compute issue, why not build a queuing system? If it's something else (safety related? hype related?) I'd love to understand the thinking behind the decision. More often than not, I sign up for waitlists for things like this and either (1) never get in to the beta or (2) forget about it when I eventually do get in.
The correct response here from the artists point of view should be a widespread coming together against their art being used as training data for ML models. With a quickly spread new license on most major art submission sites that explicitly forbids AI algorithms from using their work, artists would effectively starve OpenAI and others from using their own works to put them out of a job.
The license should forbid competing artists to using the artist’s work as well. In fact, no human should come in contact with the produced art, otherwise they might be accidentally inspired by it, thus stealing from the original creator.
There has been precedent for such a movement. In 2011, an "art collective" sourced user-submitted artwork without the artists' consent for an installation where visitors were instructed to step all over printouts of the art on the floor. The artists complained that their work was being used inappropriately. A large number of those artists left for other art websites en masse.[0]
There doesn't seem to be an equivalent movement with AI-generated art, probably because the understanding of how the models are trained from large datasets is not mainstream yet. I would imagine thousands of those same artists/consumers would be up in arms if they had a basic understanding of ML and millions of average people were beginning to feed the models their own keywords.
This I think ties in with the "responsibility" principles that OpenAI outlines. Once the generation technique has been reverse-engineered and can be used without limits, there is no way to uninvent it. It can be made illegal, but humans can always find a way around laws if they want something badly enough. This could have drastic consequences if enough artists believe that the training violates their respect or other intangible humanistic qualities. With technological advancement that can never be put back in the bottle and spreads to occupy the entire consciousness of the Internet, their options for recourse will be far different than being able to tell a single fringe art group siphoning others' content to pack up and leave.
The timing of the Dall-E 2 launch an hour ago seems to correspond with a recent piece of investigative journalism by Buzzfeed News about one of Sam Altman's other ventures, published 15 hours ago and discussed elsewhere actively on HN right now:
I point this out because while Dall-E 2 seems interesting (I'm out of my depth, so delegating to the conversation taking place here), the timing of its release as well as accompanying press blasts within the last hour from sites like TheVerge—verified via wayback machine queries and time-restricted googling—seems both noteworthy and worth a deeper conversation given what was just published about Worldcoin.
To be clear, it's worth asking if Dall-E 2 was published ahead of schedule without an actual product release (only a waitlist) to potentially move the spotlight away from Worldcoin.
I'm not a huge fan of these coordination theories. But a few things worth noting:
- In support of your argument, the Buzzfeed News investigation likely has been in the works for weeks, meaning Altman et al have had more than just a couple days to throw together a Dall-E 2 soft launch
- However, weren't OpenAI's GPT (2 and 3) announced to the world in similar fashion? e.g. demos and whitepapers and waitlists, but not a full product release?
- Throwing together a Dall-E 2 soft launch just in time to distract from the investigation would require a conspiracy, i.e. several people being at least vaguely aware that deadlines have been accelerated for external reasons. Is the Worldcoin story big enough to risk tainting OpenAI, which seems like a much more prominent part of Altman's portfolio?
- BFN reached out to A16Z, Worldcoin, and Khosla Ventures, who largely declined to comment, which would mean that at least one person probably had a bit of runway from the point when the requests for comment were submitted. So yeah, you're probably right.
- Going from the github repos for GPT 2 and 3, those may have been hard launches:
- Would it really have to be a conspiracy? Sounds like only one person would have to target a specific date or date range, and without really giving a reason.
One of the things that puts a hole in my own thinking here is that Sam Altman's name isn't really tied to the Dall-E 2 release. It's just OpenAI, and the press around Sam's name today still exclusively surfaces just this one Worldcoin story (https://news.google.com/search?q=sam+altman+when%3A1d&). So if this was actually intended to bury another story, Sam's name would have to have been included in all the press blasts to be successful. But the Buzzfeed story seems like it kinda died alone on the vine.
I don't have any knowledge (inside or otherwise) but the Worldcoin thing already came in for several rounds of abuse on HN, so it's kind of a scandal of the second freshness at this point.
I listed some of them here - https://news.ycombinator.com/item?id=30934732, just because I remembered there had been previous discussions and listing related previous discussions is a thing.
What I'm submitting for consideration is that the marketing page and associated press blasts (there's a live influencer reaction video airing right now about Dall-E 2, for instance) for Dall-E 2 were potentially pushed up to offset negative press from Worldcoin for their shared founder.
Another consideration, then: it was published to HN almost instantly after it was released to the world, 52 minutes after the HN post about Worldcoin was submitted and started showing traction.
I don't see the publication of a marketing page (again, not a finished product) for a product founded by someone whose other main venture is being investigated by journalists for misleading claims as a coincidence. But if the timing matters and 14-15 hours doesn't seem to work for the assertion in your mind, then perhaps the Dall-E 2 page going live less than an hour after the Worldcoin HN submission fits the bill.
I've got no horse in this race. I'm just drawing attention to familiar PR strategies used for brand risk mitigation, that's all.
Yes, especially given there's no actual product release, only a waitlist.
Easy to put together a marketing piece on short notice or potentially even push a pending marketing page out to production with a waitlist rather than links to production or even beta quality services.
At this point with WaveNet, GPT-3, Codex, DeepFakes and Dall-E 2, you cannot believe anything you see, hear, watch, read on the internet anymore as an AI can easily generate nearly anything that can be quickly believable by millions.
The internet's own proverb has never been more important to keep in mind. A dose of skepticism is a must.
To be honest the Girl with a Pearl Earring "variations" look a little bit like a crime against art. It's like the person who built this has no idea why the Girl with a Pearl Earring is good art. "Here's the Girl with a Pearl Earring" - "OK, well here's some girls with turbans"
> It's like the person who built this has no idea why the Girl with a Pearl Earring is good art.
The people didn't program Dall-E to make art. They taught it to recognize patterns and to create something by extrapolating from those patterns, all on its own. So the AI isn't a projection of what they think is good art; it's projecting what it thinks is good art, based on a prompt.
The output is its best effort at a feeling, even if the feeling had to be supplied by a living person. So it's still art that's as good as the feeling it came from, fleeting feelings being lower quality than those that required more time and thought.
I think the results are being poisoned by the fact that most old paintings have deteriorated colors, so the training data looks nothing like the originals. It's certainly a lot yellower than https://cdn.openai.com/dall-e-2/demos/variations/originals/g...
To be honest, it's hard for me to imagine an alternate reality where the 'original' was swapped with one of the 'variations' and the same comment didn't appear underneath.
Why is the 'original' good art?
A library like https://icons8.com/icons where you can just tell it what icon you want and the style (e.g. Material, outline, solid, iOS). It would do its thing and spit it out.
If you're interested in generative models, Hugging Face is putting on an event around generative models right now called the HugGAN sprint, where they're giving away free access to compute to train models like this.
Impressive results no doubt, but I’m reserving judgment until beta access is available. These are probably the best images that it can generate, but what I’m most interested in is the average case.
They're using training set restriction and prompt engineering to control its output
> By removing the most explicit content from the training data, we minimized DALL·E 2’s exposure to these concepts
> We won’t generate images if our filters identify text prompts and image uploads that may violate our policies
The 'how to prevent superintelligences from eating us' crowd should be taking note: this may be how we regulate creatures larger than ourselves in the future
And even how we regulate the ethics of non-conscious group minds like big companies
Maybe one day there will a job for people who are masters of the art of prompt hacking - they know all the special phrases and terms to get Dall-E to output the most aesthetically pleasing images. They guard their magic words like a medieval alchemist guards his formulas. Corporations will pay top-dollar for an expertly-crafted, custom-tailored prompt for their advertising campaign.
Not that it's impossible to hide the provenance of an image, but it is explicitly forbidden in the TOS of DALL-E to sell the images as NFTs or otherwise.
This reminds me of a discussion I had with the high school band teacher in the 90s. I was telling him that one day computers would play music and you won't be able to tell the difference. He got mad at me and told me that a computer could never play as well as a human with feelings, who can feel the piece and interpret it.
I think we passed that point a while ago, but seeing this makes me think we aren't too far off from computers composing pieces that actually sound good too.
In the thread where Sam Altman gives a demo of this [*], I see multiple people trying to query "solar panels" or "rabbit" - are these some kind of meme in the context of AI-generated art?
Interesting, yes, but I went to the link and browsed the 'generated artwork', and all of it was subjectively inferior to the original it was generated from. Every single piece. So I am not sure what the 'value' in it is, at this stage.
As far as the text-driven side goes, I would have to mess with some non-pre-canned examples to see how useful it is.
Yeah, I mean you're right that ultimately the proof is in the pudding.
But I do think we could have guessed that this sort of approach would be better (at least at a high level - I'm not claiming I could have predicted all the technical details!). The previous approaches were sort of the best that people could do without access to the training data and resources - you had a pretrained CLIP encoder that could tell you how well a text caption and an image matched, and you had a pretrained image generator (GAN, diffusion model, whatever), and it was just a matter of trying to force the generator to output something that CLIP thought looked like the caption. You'd basically do gradient ascent to make the image look more and more and more like the text prompt (all the while trying to balance the need to still look like a realistic image). Just from an algorithm aesthetics perspective, it was very much a duct tape and chicken wire approach.
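A heavily simplified sketch of that older recipe, optimising raw pixels against the open-source CLIP model (the actual colab notebooks optimise a GAN/VQGAN latent and pile on regularisers, so treat this as illustrative only):

    import torch
    import clip

    device = "cpu"
    model, _ = clip.load("ViT-B/32", device=device)

    with torch.no_grad():
        text_emb = model.encode_text(clip.tokenize(["a sunset over the ocean"]).to(device))
        text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

    # Start from random pixels and push them towards whatever CLIP scores as "sunset-like".
    image = torch.rand(1, 3, 224, 224, device=device, requires_grad=True)
    opt = torch.optim.Adam([image], lr=0.05)

    for step in range(500):
        img_emb = model.encode_image(image)
        img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
        loss = -(img_emb * text_emb).sum()   # gradient ascent on CLIP similarity
        opt.zero_grad()
        loss.backward()
        opt.step()

    # Without heavy regularisation, this is exactly where the "multiple suns" artifacts
    # come from: the pixels that maximise the score are not a natural image.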
The analogy I would give is if you gave a three-year-old some paints, and they made an image and showed it to you, and you had to say, "this looks a little like a sunset" or "this looks a lot like a sunset". They would keep going back and adjusting their painting, and you'd keep giving feedback, and eventually you'd get something that looks like a sunset. But it'd be better, if you could manage it, to just teach the three-year-old how to paint, rather than have this brute force process.
Obviously the real challenge here is "well how do you teach a three-year-old how to paint?" - and I think you're right that that question still has a lot of alchemy to it.
I am disappointed to hear it wasn't released, but what disappointed me more is that people actually approve of this decision. Seriously? We shouldn't teach people how to write because that can be abused, can be used to transfer malicious ideas. Sounds absurd? So does limiting people's access to AI tools.
AI becomes a tool for artists to use - generative art has been around for a long time, now that particular genre of art will presumably become much more prominent.
Wouldn't it be more like, "AI becomes an artist for people to use"? Will we have people distinguished as "artists" if the ability to make awesome art becomes available to everybody?
AI still needs the text prompt to know what to generate. Hence the human who provides the prompt is still the artist, just like a photographer finds an aesthetically interesting spot to take the image with their camera. Cameras make images, humans using cameras make art. Granted, this is not quite 1-1 with AI art, but still the idea is the same. If anything the flood of AI images will only require artists to go beyond what is possible with these text->image kinds of things, of which there is no shortage.
I think you’ll see more of a focus on the artist themselves. These images are nice, but they have basically zero narrative value.
This is really already the case, actually. Most artworks have “value” because they have a compelling narrative, not because they look pretty. So I think we can expect future artists to really emphasize their background, life story, process of making the art, etc. All things that cannot be done by a machine.
I seem to recall an XKCD that I cannot find, but the premise goes like:
When you have a digital display of pixels, if you randomly color pixels at 24 fps then you will eventually display every movie that can be or will ever be made, powerset notwithstanding. This can also be tied to digital audio.
In short, while mind-blowingly large, the space of display through digital means is finite.
Sounds a bit like the Library of Babel by Jorge Luis Borges. I imagine most of the videos would be complete random nonsense.
I think an AI-infused future is going to become increasingly absurd and surreal; it will lead to a kind of creative and cultural nihilism, if that's the right term.
Like the value of originality will become meaningless.
I tried to comment here previously, but I don't see it posted. It was about the meaning of 'open' and whether the question of the suffering and freedom of the AIs is being taken into ethical consideration, not just the ability of humans to use them as tools for their own possibly paper-clippy purposes.
The NFT world is mostly filled with modern art forms that have never been seen before. If Dall-E can make such images out of the box in seconds, then it looks like AIs could take the NFT world by storm. Maybe it's already happening and I just didn't know!
My main question is: is this really 'open' in a meaningful way? And are concepts of kindness and freedom being applied to the minds inside the boxes? I don't know where the 'openai' brand is at on these things, personally.
This is really cool, but before you can use it you must give out your name and a phone number. I was almost taken in by it, but OpenAI is, and probably always will be, invasive and overbearing. It's really a shame.
> Prices are per 1,000 tokens. You can think of tokens as pieces of words, where 1,000 tokens is about 750 words. This paragraph is 35 tokens.
Further down, in the FAQ[2]:
> For English text, 1 token is approximately 4 characters or 0.75 words. As a point of reference, the collected works of Shakespeare are about 900,000 words or 1.2M tokens.
> To learn more about how tokens work and estimate your usage…
> Experiment with our interactive Tokenizer tool.
And it goes on. When most questions in your FAQ are about understanding pricing—to the point you need to offer a specialised tool—perhaps consider a different model?
Haven't read the paper, but they are probably using something like SentencePiece with sub-word splitting and then charging by the number of resulting tokens.
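For a ballpark figure you don't even need the real tokenizer; the rules of thumb quoted above (~4 characters, or ~0.75 words, per English token) are enough to sanity-check a bill. A trivial sketch:

    def estimate_tokens(text):
        # Rule of thumb from the pricing page: roughly 4 characters per English token.
        return max(1, round(len(text) / 4))

    print(estimate_tokens("To be, or not to be, that is the question"))  # ~10 tokens

    # Same rule applied to the Shakespeare example: 900,000 words / 0.75 words-per-token
    print(900_000 / 0.75)  # ~1.2 million tokens, matching the FAQ's figure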
While we're being distracted by endless social media and meaningless news, AI technology is advancing at a mind blowing pace. I'd keep my eye on that ball instead of "the current thing."
What happens when they train this thing to make videos? We're about to be dealing with a flood of AI-generated visual/video content. We already have to deal with text bots everywhere... wow.
I'm excited for when that happens. I didn't think of the malicious uses, which now that you brought it up I can think of many, but I still think the pros are worth the cons
Is there a geometric model analogous to this? E.g. "corgi near the fireplace", but the output is a 3D model of the corgi and fireplace with shaders rather than an image.
Wait until you see the same concept combined with NeRF idea. The output won’t be 3d shapes but another model that can generate realistic and geometrically consistent images of a scene viewed from different angles.
Maybe this will be what finally puts an end to the whole art NFT shenanigans. A piece of art isn't so unique if there are infinite slight variations on the market.
This is extremely interesting. We’ve had some amazing AI models come out in the past few days. We’re getting closer and closer to AI becoming a facet of everyday life.
This is going to be mostly a rant on OpenAI's "safer than thou" approach to safety, but let me start by saying that I think this technology is really cool, amazing, powerful stuff. Dall-E (and Dall-E 2) is an incredible advance over GANs, and it will no doubt have many positive applications. It's simply brilliant. I am someone who has been interested in and has followed the progress of ML-generated images for nearly a decade. Almost unimaginable progress has been made in this field in the last five years.
Now the rant:
I think if OpenAI genuinely cared about the ethical consequences of the technology, they would realise that any algorithm they release will be replicated in implementation by other people within some short period of time (a year or two). At that point, the cat is out of the bag and there is nothing they can do to prevent abuse. So really all they are doing is delaying abuse, and in no way stopping it.
I think their strong "safety" stance has three functions:
I think number 3 is dangerous because researchers are put under the false belief that their technology can or will be made safe. This way they can continue to harness bright minds that no doubt have ethical leanings to create things that they otherwise wouldn't have.
I think OpenAI are trying to have the cake and eat it too. They are accelerating the development of potentially very destructive algorithms (and profiting from it in the process!), while trying to absolve themselves of the responsibility. Putting bandaids on a tumour is not going to matter in the long run. I'm not necessarily saying that these algorithms will be widely destructive, but they certainly have the potential to be.
The safety approach of OpenAI ultimately boils down to gatekeeping compute power. This is just gatekeeping via capital. Anyone with sufficient money can replicate their models easily and bypass every single one of their safety constraints. Basically they are only preventing poor bad actors, and only for a limited time at that.
These models cannot be made safe as long as they are replicable.
To produce scientific research requires making your results replicable.
Therefore, there is no ability to develop abusable technology in a safe way. As a researcher, you will have blood on your hands if things go wrong.
If you choose to continue research knowing this, that is your decision. But don't pretend that you can make the algorithms safer by sanitizing models.
Things that require understanding of causation will be safe longer. Progress like this is driven by massive datasets. Meanwhile, real world action-taking applications require different paradigms to take causation into account[0][1], and especially to learn safely (e.g. learning to drive without crashing during the beginner stages).
There's certainly research happening around this, and RL in games is a great test bed, but people choosing actions will be safe from automation longer than people not choosing actions, if that makes sense. It's the person who decides "hire this person" vs. the person who decides "I'll use this particular shade of gray."
[0] The best example is when X causes Y and X also causes Z, but your data only includes Y and Z. Without actually manipulating Y, you can't see that Y doesn't cause Z, even if it's a strong predictor.
[1] Another example is the datasets. You need two different labels depending on what happens if you take action A or B, which you can't have simultaneously outside of simulations.
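A tiny simulation of the example in [0] (made-up numbers, purely to illustrate): X causes both Y and Z, the dataset only records Y and Z, and Y ends up being a strong predictor of Z even though it has no causal effect on it.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=100_000)            # hidden common cause, not in the dataset
    y = 2.0 * x + rng.normal(size=100_000)  # X causes Y
    z = -3.0 * x + rng.normal(size=100_000) # X causes Z; Y does NOT cause Z

    print(np.corrcoef(y, z)[0, 1])  # strongly negative: Y "predicts" Z from observation alone
    # Only by intervening (setting Y yourself and seeing that Z doesn't move)
    # can you discover that the predictive relationship is not causal.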
Most creative output is duplicated effort: consider how much code each person on HN has written that has been written before. Consider how, a decade ago, we were all writing html and styling it, element by element, and then Twitter bootstrap came along and revolutionised front-end development in what is, ultimately, a very small and low technology way. All it really did was reduce duplicate effort.
Nowadays there’s lots of great low/no code platforms, like Retool, that represent a far greater threat to the amount of code that needs to be produced than AI ever will.
To use a cliche: code is a bug, not a feature. Abstracting away the need for code is the future, not having a machine churn out the same code we need today.
Dall-E 2 seems incapable of catching the essence of the art. I'm not really surprised by it; I'd be surprised a lot if it could. But nevertheless: if you looked into the eye of the Girl with a Pearl Earring[1], you'd be forced to stop and think about what she has on her mind right now. Or maybe you'd have some other question in your mind, but it really stops people and makes them think. None of the Dall-E interpretations have this quality. Works inspired by the Girl with a Pearl Earring sometimes have at least part of that power, like the Girl with a Bamboo Earring[2]. But none of the Dall-E interpretations have such power.
And this observation may lead to great consequences for the visual arts. I had a lot of joy looking at the different Dall-E interpretations, trying to find what flaw in each interpretation prevents it from being a piece of art of equal value to the original. It is a ready-made tool for searching for explanations of the Power of Art. It cannot say what detail makes a picture an artwork, but it allows us to see multiple data points and to narrow the hypothesis space. My main conclusion is that the pearl earring has nothing to do with the power of the art. It is something in the eye, and probably in the slightly opened mouth. (Somehow Dall-E pictured all interpretations with closed lips, so it seems to be an important thing, but I need more variation along this axis to be sure.)
Oh... Not hopeless. The very fact that I spent some minutes looking at interpretations of the Girl with a Pearl Earring is enough evidence that it is not hopeless. I praise the work that was done. Moreover, I hoped that people would take it as an inspiration to do even more.
What do you think of the third to last image of the Girl With A Pearl Earring that DALL-E 2 created? I find it more compelling than the original with how her face is deeply cast in shadow. There's still that original 'essence' of the glint in her eye. But her earring is a bell. As if the AI is sending a message that what if the bell were to ring?
I'm not sure that I can express myself in English, which is not my native language, and this needs some very nuanced control over the tiniest shades of meaning, but I'll try nevertheless, at least for the fun of it.
The original girl is more open, more independent and mindless. The interpretation's girl is more self-controlled, assertive and not really interested, just going through all those motions of regular communication between people. Maybe it's just me, but what I really value on such occasions is mindlessness: the ability of people to not mind themselves, to let their selves dissolve into the environment. I sometimes cannot hold back tears when I watch some entertainer playing Chopin or Paganini, because what I see in their movements is the complete dissolution of a person in a piece of music, in a piece of art and skill. An entertainer just does what they do with their full attention on it, and with all their motivation focused on it. There is nothing here for them, just them and their actions.
There is not a single thought devoted to how the people around me will react to what I do and how I do it. I just do what I do, and I do not care about the people around me; and if it somehow makes people happy... I don't really care. I mean, I know that afterwards I'd feel proud of myself, but right now I don't really care.
I know this feeling. I like to sing, and I'm good at it (above average), and I know what it feels like to dissolve into the song and let the song rule. I play piano and I know what it is like to dissolve into the piece I'm playing, to stop myself from existing, to let the music take the lead. And the original painting makes me believe that the girl is in this state of mind. I do not know the history or the rest of the story; I do not know if she got into this state for just a second, or if she never leaves it (which may be a sad experience, don't you think?), but somehow I know that right now she is exactly in this state. I want to watch this moment of hers for an eternity.
Thinking about it, I'd confess that the interpretation's girl does trigger the same, but on a smaller scale. I feel my mind trying to find a coherent state behind her gaze, but this feeling stops after tens of microseconds, not hundreds of them.
edit: want->watch. Stupid mistake ruining the meaning of the sentence.
Art criticism should be off topic here. This is more like chopping off the visual cortex and some association cortex from a brain and stimulating it. There is no person signaling to us, nor can we attribute any striking images that may come up to a person with agency.
But it's like a giant database of decent clipart for anything we can imagine.
> This is more like chopping off the visual cortex and some association cortex from a brain and stimulating it.
We do not know exactly what part of our perception of reality can be attributed to "the visual cortex and some association cortex". But now we can feel it. We can test it. We can compare ourselves with the cold calculating machine. I believe that it is a priceless opportunity that we shouldn't miss. At least I personally can't. I'm going to figure out whether it's possible for me to have such a companion as Dall-E in my wanderings through the sea of information on the Internet, and if it is, then to get one.
> But its like a giant database of decent clipart for anything we can imagine
And this also. Yes. Though I'm not interested in clipart.