Comparing Adobe Firefly, Dalle-2, and OpenJourney (usmanity.com)
231 points by muhammadusman on June 20, 2023 | 133 comments



For reference, here's what you can get with a properly tweaked Stable Diffusion, all running locally on my PC. It can be set up on almost any PC with a mid-range GPU in a few minutes if you know what you're doing. I didn't do any cherry picking; this is the first thing it generated, 4 images per prompt.

1st prompt: https://i.postimg.cc/T3nZ9bQy/1st.png

2nd prompt: https://i.postimg.cc/XNFm3dSs/2nd.png

3rd prompt: https://i.postimg.cc/c1bCyqWR/3rd.png


Can you elaborate on “properly tweaked”? When I use one of the Stable Diffusion and AUTOMATIC1111 templates on runpod.io, the results are absolutely worthless.

This is using some of the popular prompts you can find on sites like prompthero that show amazing examples.

It's been a serious expectation-vs.-reality disappointment for me, so I just pay the MidJourney or DALL-E fees.


> Can you elaborate on “properly tweaked”?

In a nutshell:

1. Use a good checkpoint. Vanilla Stable Diffusion is relatively bad. There are plenty of good ones on civitai. Here's mine: https://civitai.com/models/94176

2. Use a good negative prompt with good textual inversions. (e.g. "ng_deepnegative_v1_75t", "verybadimagenegative_v1.3", etc.; you can download those from civitai too) Even if you have a good checkpoint this is essential to get good results.

3. Use a better sampling method instead of the default one. (e.g. I like to use "DPM++ SDE Karras")

There are more tricks to get even better output (e.g. controlnet is amazing), but these are the basics.
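
For anyone who'd rather script this than click through the A1111 UI, here is a rough sketch of those three ingredients (custom checkpoint, negative textual-inversion embedding, DPM++ SDE Karras) using the Hugging Face diffusers library. The file paths and the prompt below are placeholders, and this is only an approximation of the same settings, not the commenter's exact setup.

    # Minimal sketch (placeholder paths/prompt): custom checkpoint +
    # negative embedding + DPM++ SDE Karras sampling, via diffusers.
    # Requires: pip install diffusers transformers torch torchsde safetensors
    import torch
    from diffusers import StableDiffusionPipeline, DPMSolverSDEScheduler

    # A .safetensors checkpoint downloaded from civitai (placeholder path)
    pipe = StableDiffusionPipeline.from_single_file(
        "my_checkpoint.safetensors", torch_dtype=torch.float16
    ).to("cuda")

    # Roughly equivalent to A1111's "DPM++ SDE Karras" sampling method
    pipe.scheduler = DPMSolverSDEScheduler.from_config(
        pipe.scheduler.config, use_karras_sigmas=True
    )

    # A negative textual-inversion embedding such as ng_deepnegative_v1_75t
    pipe.load_textual_inversion(
        "ng_deepnegative_v1_75t.pt", token="ng_deepnegative_v1_75t"
    )

    image = pipe(
        prompt="a cozy wooden cabin in a snowy forest, golden hour, highly detailed",
        negative_prompt="ng_deepnegative_v1_75t, lowres, blurry, bad anatomy",
        num_inference_steps=28,
        guidance_scale=7.0,
    ).images[0]
    image.save("out.png")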


Thank you. I assume there's some community somewhere where people discuss this stuff. Do you know where that is? Or did you just learn this from disparate sources?


> I assume there's some community somewhere where people discuss this stuff. Do you know where that is? Or did you just learn this from disparate sources?

I learned this mostly by experimenting + browsing civitai and seeing what works + googling as I go + watching a few tutorials on YouTube (e.g. inpainting or controlnet can be tricky as there are a lot of options and it's not really obvious how/when to use them, so it's nice to actually watch someone else use them effectively).

I don't really have any particular place I could recommend to discuss this stuff, but I suppose /r/StableDiffusion/ on Reddit is decent.


Pretty good Reddit community, lots of (N/SFW) models and content on CivitAI. Took me a weekend to get set up and generating images. I've been getting good results on my AMD 6750XT with A1111 (vladmandic's fork).


What kind of (and how much) data did you use to train your checkpoint?

I'd like to have a go at making one myself targeted towards single objects (be it a car, spaceship, dinner plate, apple, octopus, etc). Most checkpoints lean very heavily towards people and portraits.


I’m not the OP but I’ve made some of my daughter, wife, dog, niece, etc.

People generally suggest 30+ images. I’ve found - at least with people - the more the better. My wife’s model is trained on ~80 images of her.


Are you using txt2img with the vanilla model? SD's actual value is in the large array of higher-order input methods and tooling; as a tradeoff, it requires more knowledge. Similarly to 3D CGI, it's a highly technical area. You don't just enter the prompt with it.

You can finetune it on your own material, or choose one of the hundreds of public finetuned models. You can guide it in a precise manner with a sketch or by extracting a pose from a photo using controlnets or any other method. You can influence the colors. You can explicitly separate prompt parts so the tokens don't leak into each other. You can use it as a photobashing tool with a plugin to popular image editing software. Things like ComfyUI enable extremely complicated pipelines as well. etc etc etc
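
To make the pose-guidance part concrete, here is a hedged sketch of ControlNet via the diffusers library. The model IDs are public Hugging Face repos (an openpose ControlNet on top of SD 1.5), and "pose.png" is a placeholder for a pose map you've already extracted from a photo; this is one possible workflow, not the only one.

    # Hedged sketch of pose-guided generation with ControlNet via diffusers.
    import torch
    from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
    from diffusers.utils import load_image

    controlnet = ControlNetModel.from_pretrained(
        "lllyasviel/sd-controlnet-openpose", torch_dtype=torch.float16
    )
    pipe = StableDiffusionControlNetPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5",
        controlnet=controlnet,
        torch_dtype=torch.float16,
    ).to("cuda")

    # Pre-extracted pose map (e.g. from an openpose preprocessor); placeholder file
    pose = load_image("pose.png")

    image = pipe(
        "a knight in ornate armor standing in a courtyard",
        image=pose,  # the pose map constrains the composition
        num_inference_steps=25,
    ).images[0]
    image.save("knight.png")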


Is there a coherent resource (not a scattered 'just google it' series of guides from all over the place) that encapsulates some of the concepts and workflows you're describing? What would be the best learning site/resource for arriving at understanding how to integrate and manipulate SD with precision like that? Thanks


I have found http://stable-diffusion-art.com to be an absolutely invaluable (and coherent) resource. It's highly ranked on Google for most "how to do X with stable diffusion" style searches, too.


> What would be the best learning site/resource for arriving at understanding how to integrate and manipulate SD with precision like that?

Honestly? Probably YouTube tutorials.


Jaysus.

I'm going to sound like an entitled whiny old guy shouting at clouds, but - what the hell; with all the knowledge being either locked and churned on Discord, or released in the form of YouTube videos with no transcript and extremely low content density - how is anyone with a job supposed to keep up with this? Or is that a new form of gatekeeping - if you can't afford to burn a lot of time and attention as if in some kind of Proof of Work scheme, you're not allowed to play with the newest toys?

I mean, Discord I can sort of get - chit-chatting and shitposting is easier than writing articles or maintaining wikis, and it kind of grows organically from there. But YouTube? Surely making a video takes 10-100x the effort and cost, compared to writing an article with some screenshots, while also being 10x more costly to consume (in terms of wasted time and strained attention). How does that even work?


I've been playing with SD for a few months now and have only watched 20-30m of YT videos about it. There's only a few worth spending any time watching, and they're on specific workflows or techniques.

Best just to dive in if you're interested IMO. Otherwise you'll get lost in all the new jargon and ideas. A great place to start is the A1111 repo; lots of community resources available and batteries included.


How does anyone keep up with anything? It's a visual thing. A lot of people are learning drawing, modeling, animation etc in the exact same way - by watching YouTube (a bit) and experimenting (a lot).


Picking images from generated sets is a visual thing. Tweaking ControlNet might be too (IDK, I've never got a chance to use it - partly because of what I'm whining about here). However, writing prompts, fine-tuning models, assembling pipelines, renting GPUs, figuring out which software to use for what, where to get the weights, etc. - none of this is visual. It's pretty much programming and devops.

I can't see how covering this on YouTube, instead of (vs. in addition to) writing text + some screenshots and diagrams, makes any kind of sense.


This isn't for Stable Diffusion, but I wanted to provide a supplemental to my comment: https://kaiokendev.github.io/til

This is the level we're generally working at - first or second party to the authors of the research papers illustrating implementations of concepts, struggling with the Gradio interface, things going straight from commit to production.

It's way less frustrating to follow all of the authors in the citations of the projects you're interested in than wasting your attention sorting through blogspam, SEO, and YT trash just to find out they don't really understand anything, either.


Thank you. I was reluctant to chase after and track first-party research directly, or work directly derived from it, as my limited prior experience told me it's not the most efficient thing unless I want to go into that field of research myself. You're changing my mind about this; from now on, I'll try sticking close to the source.


There's a relatively thin layer between the papers and implementations, which is another way of saying this stuff is still for researchers and assumes a requisite level of background with them. It sounds like you'd benefit from seeking out the first party sources.

This is where video demonstrations come in handy. Since many concepts are novel, it's uncommon to find anyone who deeply understands them, but it's very easy to find people who have picked up on some tricks of the interfaces, which they're happy to click through. I think gradio/automatic1111 makes learning harder than it needs to be by hiding what it's doing behind its UI, while e.g. ComfyUI has a higher initial learning curve but provides a more representational view of process and pipelines.


Take a moment and go scroll through the examples at civitai.com. Do most of them strike you as something made by people with jobs? Most of them are pretty juvenile, with pretty women and various anime girls.


Are you under the impression that people with jobs don't like pretty women and anime girls?


Of course not, but it looks like a teenage boy's room.


An operative word here is people.... the set "people with jobs" contains a far higher fraction of folks who like attractive men than is represented here....


The difference being that YouTube videos can make more money for the author. Anyway, it's all open source, so feel free to make a wiki.


I would if I could keep up with the videos :).


It would've been convenient for me as well if the AI tool that has access to YouTube videos could answer queries like this. But it takes 5 minutes to reply, and I forgot its name. It was on the front page recently.


I mostly agree, but in this case it can be genuinely useful to actually see the process of someone using the tool effectively.


ComfyUI is a nice complement to A1111, the node-based editor is great for prototyping and saving workflows.


You're not going to get even close to Midjourney or even Bing quality on SD without finetuning. It's that simple. When you do finetune, it will be restricted to that aesthetic and you won't get the same prompt understanding or adherence.

For all the promise of control and customization SD boasts, Midjourney beats it hands down in sheer quality. There's a reason like 99% of ai art comic creators stick to Midjourney despite the control handicap.


Yet you are posting this in a thread where GP provided actual examples of the opposite. Look for another comment above/below; there are MJ-generated samples which are comparable but also less coherent than the results from a much smaller SD model. And in MJ's case, hallucinations cannot be fixed. MJ is good but it isn't magic; it just provides quick results with little experience required. Prompt understanding is still poor, and will stay poor until it's paired with a good LLM.

None of the existing models gives truly passable production-quality results, be it MJ or SD or whatever else. It will be quite some time until they get out of the uncanny valley.

> There's a reason like 99% of ai art comic creators stick to Midjourney

They aren't. MJ is mostly used by people without experience; think of a journalist who needs a picture for an article. Which is great, and it's what makes them good money.

As a matter of fact (I work with artists), for all the surface-visible hate AI art gets in the artist community, many actual artists are using it more and more to automate certain mundane parts of their job to save time, and this is not MJ or Dall-E.


There's a distinction to be made here. Everything that makes SD a powerful tool is the result of being open source. The actual models are significantly worse than Midjourney. If an MJ level model had the tooling SD does it would produce far better results.


> If an MJ level model had the tooling SD does it would produce far better results

And vice versa, which is the exciting part to me - only a matter of time!


Midjourney output all has the same look to it.

If you're OK with basic aesthetics it'll work, but if you want something a bit less cringe, or that will stand out in marketing, it won't cut it.


It only has the same look if it's not given any style keywords. I've been impressed with the output diversity once it's told what to do. It can handle a wide range of art styles.


Then we need to give style keywords to the other networks too, and suddenly the gap shortens.

Default Midjourney is one thing and that’s mid…


>Yet you are posting this in a thread where GP provided actual examples of the opposite.

Opposite of what? OP posts results from a tuned model.


Opposite of this:

>For all the promise of control and customization SD boasts, Midjourney beats it hands down in sheer quality.

The results are comparable, but MJ in this comment https://news.ycombinator.com/item?id=36409043 hallucinates more (look at the roofs in the second picture). And it cannot be fixed, maybe except for an upscale making it a bit more coherent. Until MJ obtains better tooling (which it might in the next iteration), it won't be as powerful. I'm not even starting on complex compositions, which it simply cannot do.

>OP posts results from a tuned model.

Yes, which is the first step you should do with SD, as it's a much smaller and less capable model.


Of course it's a tuned model. Why would anyone use stock SD these days?


I feel like people shouldn't talk in definitives if their message is just going to demonstrate they have no idea what they're talking about.


I know what I'm talking about, lol. I tuned a custom SD model that's downloaded thousands of times a month. I'm speaking from experience more than anything. Don't know why some SD users get so defensive.


You load a model and have 6 sliders instead of one… it’s not exactly “fine tuning”.

If you want the power, it’s there. But nearly bone stock SD in auto1111 is going to get to any of these examples easily.

Show me the civitai equivalent for MJ or Dalle2. It doesn’t exist.


>You load a model and have 6 sliders instead of one… it’s not exactly “fine tuning”.

Ok...? Read what I wrote carefully. Your 6 sliders won't produce better images than Midjourney for your prompt on the base SD model.


Midjourney has a ridiculously restrictive keyword filter. You should have mentioned that.

Also I see nothing wrong with using different models for different purposes.


First off are you using a custom model or the default SD model? The default model is not the greatest. Have you tried controlnet?

But yes SD can be a bit of a pain to use. Think of it like this. SD = Linux, Midjourney = Windows/MacOS. SD is more powerful and user controllable but that also means it has a steeper learning curve.


I am sure you're right, but "if you know what you're doing" does a lot of heavy lifting here.

We could just as easily say "hosting your own email can be set up in a few minutes if you know what you're doing". I could do that, but I couldn't get local SD to generate comparable images if my life depended on it.


If you have an Apple device, there is a free GUI for Stable Diffusion called "Draw Things". It is nice and it just works. https://apps.apple.com/us/app/6444050820

screenshot of the options interface: https://stash.cass.xyz/drawthings-1687292611.png


Wow. It's both amazing and for some reason horrifying that this can run on an iPhone 11 (non-pro), and at reasonable speeds!


Nice! Would you mind sharing which Stable Diffusion model you used / where you obtained it from?


I'm using my own custom trained model.

Here, I've uploaded it to civitai: https://civitai.com/models/94176

There are plenty of other good models too though.


Any tips or guides you followed on training your custom model? I've done a few LoRAs and TI but haven't gotten to my own models yet. Your results look great and I'd love a little insight into how you arrived there and what methods/tools you used.


I'm not an expert at this and there are probably better ways to do this/might not work for you/your mileage may vary, so please take this with a huge grain of salt, but roughly this worked for me:

1. Start with a good base model (or models) to train from.

2. Have a lot of diverse images.

3. Ideally train for only one epoch. (Having a lot of images helps here.)

4. If you get bad results lower the learning rate and try again.

5. After training, mix your finetuned model with the original one in steps of 10%, generate an X/Y plot of it, and pick the best result.

6. Repeat this process as long as you're getting an improvement.

For training I mostly used scripts from here: https://github.com/bmaltais/kohya_ss

The main problem here is that essentially during inference you're using a bag of tricks to make the output better (e.g. good negative embeddings), but when training you don't. (And I'm not entirely sure how you'd actually integrate those into the training process; it might be possible, but I didn't want to spend too much time on it.) So your fine-tuning as-is might improve the output of the model when no tricks are used, but it can also regress it when the tricks are used. Which is why I did the "mix and pick the best one" step.

But, again, I'm not an expert at this and just did this for fun. Ultimately there might be better ways to do it.
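
In case it helps, step 5 (mixing the finetune with the original in 10% steps) can be done with a plain weighted average of the two checkpoints' weights, which, as far as I understand, is essentially what A1111's checkpoint merger does in "weighted sum" mode. A rough sketch with placeholder file paths:

    # Rough sketch: naive weighted-sum merge of two SD checkpoints at several
    # ratios, so you can X/Y-plot the results and pick the best one.
    # Paths are placeholders; both checkpoints must share the same architecture.
    import torch
    from safetensors.torch import load_file, save_file

    base = load_file("base_model.safetensors")
    tuned = load_file("finetuned_model.safetensors")

    for step in range(1, 10):  # 10%, 20%, ..., 90% finetune
        alpha = step / 10.0
        merged = {
            k: ((1.0 - alpha) * base[k].float() + alpha * tuned[k].float()).half()
            for k in base
            if k in tuned and base[k].shape == tuned[k].shape
        }
        save_file(merged, f"merged_{step * 10:02d}.safetensors")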


Great tips, thank you! It feels like I'm right behind you in terms of where I'm at so your input is very much appreciated.

3. Train for only 1 epoch - interesting, any known rationale here?

5. I just read somewhere else that someone got good results from mixing their custom model with the original (60/40 in their case) - good to hear some more anecdotes that this is pretty effective. Especially the further training after merging, sounds promising!

I've also been using kohya_ss for training LoRAs, so great to hear it works for you for models as well. On your point about the inference tricks, definitely noted, but I did notice that you can feed some params (# of samples, negative embeddings, etc) to the sample images generated during training (check the textarea placeholder text). Still not going to have all the usual tricks but it'll get you a little closer.


Make sure that you have enough VRAM. I can train LoRAs with 8 GB easily, but when I tried to train a full model, it gave me an OOM error.


Hopefully my 12GB is enough!


Do you have any good tutorial links to setup Stable Diffusion locally?


Thanks for doing this, I would like to include these in the blog post as well. Can I use these and credit you for them? (let me know what you'd like linked)


Sure. No need to credit me.


thanks, updated the post with your results as well :)


Those are amazing, please consider writing a blog post of the steps you did to install and tweak Stable Diffusion to achieve these results. I'm sure many of us would love to read it.


"Just" use a "properly tweaked" something.


You got incorporated into the article! Nice.



Seems a lot better than some of the ones in the post


Since the author didn't have access to Midjourney, here's the first two prompts in MJ with default settings (not upscaled):

https://imgur.com/a/siQG06O

https://imgur.com/a/vp2oOHu


Thanks for sharing this, do you mind if I include this in the post? I will credit you, of course (let me know what you'd like linked to).

update: I've edited the post to include these results as well


something something AI generated cannot be copyrighted [/s]


Go for it! Happy to help. Let me know if you want upscales.


Amazing how quickly Dalle-2 went from among the best image generators to among the worst.


The stagnation has been very curious. They are part of a large & generally competent org, which otherwise has remained far ahead of the competition (e.g. with GPT-4). Except... for DALL-E 2, where it did not just stagnate for over a year (on top of its bizarre blindspots like garbage anime generation), but actually seemed to get worse. They have an experimental model of some sort that some people have access to, but even there, it's nothing to write home about compared to the best models like Parti or eDiff-I etc.


I suspect that they consider txt2img to be more of a curiosity now. Sure, it's transformative; it's going to upend whole markets (and make some people a lot of money in the process) - however, it's just producing images. Contrast with LLMs, which have already proven to be generally applicable in great many domains, and that if you squint, are probably capturing the basic mechanisms of thinking. OpenAI lost the lead in txt2img, but GPT-4 is still way ahead of every other LLM. It makes sense for them to focus pretty much 100% on that.


I find it curious because (a) if they don't care about text2image, why launch it as a service to begin with? (b) if they don't care now, why keep it up and let it keep consuming resources, human & GPU? (c) if they do still care, because as other models & services have demonstrated there's a ton of interest in text2image, why not invest the relatively minor amount of resources it would take to keep it competitive (look how few people work at Midjourney, or are authors on imagegen papers)? It may have cost >$100m to make GPT-4, but making a decent imagegen model costs a lot less than that! (Even now, you could probably create a SOTA model for <$10m, especially if you have GPT-3/4 available for the text encoding part.)

But launching it and then just letting it stagnate indefinitely and get worse every day compared to its increasingly popular competitors seems like the worst of all worlds, and I can't see what the OA strategy is there.


Maybe they keep it up just so that they have something in txt2img space? It may not be the best, or even good, but you don't know that until you try it, and until then, it just enhances the value of the OpenAI platform. E.g. if you're building something backed by OpenAI LLMs, and are thinking about future txt2img integration, the existence of Dall-E might stop you from "shopping around" txt2img services in advance.

The way I see it, they don't need txt2img at this moment - GPT-4 ensures they're the #1 name both in the industry and in AI-related news stories. But it doesn't mean they won't come back to it. A couple of observations:

- OpenAI isn't a "release early, release often" shop. They might be already working on something, but they'll release it only when it is a qualitative improvement over everyone else (or at least Dall-E).

- A bunch of hobbyists is doing all their work for free anyway. Stable Diffusion itself may not be SOTA, but the totality of hundreds of different fine-tunes on Civitai very much is. With all those models being shared in the open and relatively easy/cheap to recreate, it would make sense for OpenAI to just stand by and watch, and only invest resources once hobbyists hit a plateau.

- Looking at those Civitai models, it seems to me that OpenAI could beat txt2img SOTA easily, at any moment, by taking (or re-creating, depending on the license) the best five to ten SD derivatives, and put them behind GPT-4, or even GPT-3.5, fine-tuned to 1) choose the best SD derivative for user's prompt, and 2) transform user's prompt to set of parameters (positive & negative prompts, diffuser algo, numeric params) crafted with choice from 1) in mind. It's a black box. On the Internet, no one can tell you're an ensemble model.

- They could even be doing it as we speak - addition of function calls is aligned with this direction, fine-tuning for good prompt generation is mostly a txt2txt exercise, and again, hobbyists around the world are busy building a high-quality human-curated data set of {what I want}x{model + positive prompt + negative prompt + diffuser + other params} -> {is this any good?}. If I were them, I'd just mine this and not say anything.

- Overall, I think that in the txt2img space, currently the hard part isn't the "img" part, but the "txt" part. OpenAI has a huge advantage here, and as long as that's true, they're in a position to instantly overtake everyone else in this space. That is, they have an "Ultimate attack" charged and ready, and are patiently waiting for a good moment to trigger it.

- Didn't they hint that GPT-4 successor will be multimodal? That could end up being their comeback to txt2img. And img2txt. And a bunch of other modalities.

EDIT: As if on cue, the very thing I was speculating about above is being discussed wrt. LLMs right now:

- https://news.ycombinator.com/item?id=36413296 - GPT-4 is 8 GPTs in a trench coat

- https://news.ycombinator.com/item?id=36413768 - 3-4 orders of magnitude efficiency (size vs effect) improvement in code generation, if your training data isn't garbage

And in both threads, people bring up older papers and discuss the merits of combining smaller specialized models into a more generic whole.


Why do they need to have something in text2image? It in no way builds lock-in to the API or anything, especially with how gimped it is.

1. Yes, they are; look at the constant iterative rollouts of GPTs.

2. Most of which is useless to them, not that they have made any use of it.

3. The fact that it would be so easy to improve, and they haven't, only emphasizes my point.

4. Sure, that could be useful. Except there's zero integration or mention. (They haven't even opened up the vision part of GPT-4 yet.)

5. The fact that it would be so easy to improve, and they haven't, only emphasizes my point.

6. Why wait for GPT-5, possibly years from now?


> Why do they need to have something in text2image?

So they're "on the list". So whenever journalists and bloggers write articles about text2image, they're listed as a player in this space. For vast majority of such articles, neither the authors nor the audience will be able to tell that OpenAI's offering is far behind and that they're basically keeping a token presence in the space.

At least that's my hypothesis. I'm neither a domain expert nor a business expert - I just feel that, for OpenAI, having laymen view them as an industry leader in AI in general is worth the price of keeping Dall-E available. In fact, as more and more users realize there are better models available elsewhere, that price goes down, while the effect on the laymen audience stays the same.

(Note: the term "laymen", as I use it here, specifically includes most entrepreneurs, managers and investors, in tech or otherwise. If I'm being honest with myself, I belong to that category too; it's in fact this conversation and some recent threads that made me realize just how weak OpenAI is in the image generation space.)

> Look at the constant iterative rollouts of GPTs

You mean some unannounced ones, or the pinned models? Because AFAIK GPT-3.5 had two updates after release (the turbo model and the current one), and GPT-4 had one. I mean public releases; for example, how often they updated GPT-4 back before it was public, e.g. when Microsoft was building Bing Chat, is not relevant in this context.

Also compare that with how, going by HN submissions alone, every other day someone releases some improved LLaMA-derived LLM.

> 2. Most of which is useless to them, not that they have made any use of it 3. the fact that it would be so easy to improve, and they haven't, only emphasizes my point. 4. sure, that could be useful. Except there's zero integration or mention. (...) 5. the fact that it would be so easy to improve, and they haven't, only emphasizes my point.

There's little for them to gain by openly using all that work now. At the moment, they can just keep an eye on what's posted to Civitai, paying particular attention to how different model derivatives respond to prompts (think e.g. CyberRealistic vs. Deliberate) and why, and building up a training corpus of prompts and settings, helpfully provided by the community, complete with quality ratings. They can do that using a small fraction of the resources they have available - so that when the time comes, they can use their full resources to quickly train and deploy a model that blows everyone else out of the water.

Also, as an organization, they can focus only on so many things at a time. GPT-4 is buying them some space, and I believe they're currently focusing primarily on their cooperation with Microsoft, and/or other things involving LLMs. Given the relative usefulness and potential of LLMs vs. image generation, both short and long-term, doing more than bare minimum in image generation right now might be too much of a distraction for an organization this size.

> (They haven't even opened up the vision part of GPT-4 yet.)

They're in the lead. They're not in a hurry. They're likely giving Microsoft a head start.

> 6. why wait for GPT-5 possibly years from now?

Why do it earlier? What could they possibly gain by jumping back into the text2image space now? At this point, compared to LLMs, text2image seems neither profitable nor particularly relevant for x-risk, so whichever way you cut it, I can't see why they would want to prioritize it.


I think they just don't care very much about DALL-E.

Which is fair enough, when you are a (relatively) small company competing with the likes of Google and Meta you really need to focus.


Nobody is able to use Parti or eDiff. Compared to models you can use, the experimental Dall-e or Bing Image Creator is second only to midjourney in my experience.


Parti/eDiff show that it is relatively easy to do much better than the experimental model which presumably represents their best effort, never mind the hot garbage you see in OP from DALL-E 2. And it's not a calculated degree of low-quality enabled by those models being unreleased and having no competition, because competition like Stable Diffusion or Midjourney are beating the heck out of DALL-E 2 in popular usage.


I haven't tried those two, but I'd be surprised if they were better than Stable Diffusion. Which is free, runnable (and trainable!) locally, and already has a large ecosystem of frontends, tweaks and customized models.


Believe me, I know all about SD's possible customization and tweaks.

I would still easily put both ahead of the base models. You won't match the quality of those models without finetuning. When you do fine-tune, it'll be for a particular aesthetic and you won't match them in terms of prompt understanding and adherence.


I don't know, what I saw in there (particularly with the haunted house) was a far broader POTENTIAL RANGE of outputs. I get that they were cheesier outputs, but it seems to me that those outputs were just as capable of coming from the other 'AIs'… if you let them.

It's like each of these has a hidden giant pile of negative prompts, or additional positive prompts, that greatly narrow down the range of output. There are contexts where the Dall-E 'spoopy haunted house ooooo!' imagery would be exactly right… like 'show me halloweeny stock art'.

That haunted house prompt didn't explicitly SAY 'oh, also make it look like it's a photo out of a movie and make it look fantastic'. But something in the more 'competitive' AIs knew to go for that. So if you wanted to go for the spoopy cheesey 'collective unconscious' imagery, would you have to force the more sophisticated AIs to go against their hidden requirements?

Mind you if you added 'halloween postcard from out of a cheesey old store' and suddenly the other ones were doing that vibe six times better, I'd immediately concede they were in fact that much smarter. I've seen that before, too, in different Stable Diffusion models. I'm just saying that the consistency of output in the 'smarter' ones can also represent a thumb on the scale.

They've got to compete by looking sophisticated, so the 'Greg Rutkowskification' effect will kick in: you show off by picking a flashy style to depict rather than going for something equally valid, but less commercial.


It's not just about the haunted house. Just look at the DALLE-2 living room pictures closely. None of it makes any sense. And we're not even talking of subtle details, all of the first three pictures have a central object that the eye should be drawn to that's just a total mess. (The table that's being subsumed by a bunch of melting brown chairs in the first one, the i-don't-even-know-what that seems to be the second picture, and the whatever-this-is on the blue carpet.)


OpenAI screwed up that one by trying to control it. Stable Diffusion, on the other hand, gives me hope that AI can be high quality and open (not only in name).

Can't wait to have something like StableDiffusion but for LLMs.


Dall-e experimental is very good (Bing Image creator). I only prefer midjourney to it.


It might be a case of them seeing way more potential with LLMs compared to image generation.


It’s more that their moat got obliterated on image gen.

If stable diffusion didn’t launch Dall-e 2 would have been still valuable.


chatgpt next...


Dall-E 2 was almost immediately displaced by MidJourney. Nothing comes close to even GPT 3.5 at the moment.


Anthropic's models are better than GPT 3.5 in my opinion.


Why innovate when you can regulate?



Midjourney is still so far ahead it's no competition. I did a lot of testing today and Firefly generated so many errors with fingers and stuff; I haven't seen that since the original Stability release. Anyone know if the web Firefly and the Photoshop version are the same model?


It's worth noting the difference in how the training material is sourced, though: Midjourney is using indiscriminate web scrapes while Firefly is taking the conservative approach of only using images that Adobe holds a license for. Midjourney has the Sword of Damocles hanging over its head that, depending on how legal precedent shakes out, its output might end up being too tainted for commercial purposes, and Adobe is betting on being the safe alternative during the period of uncertainty and in case the hammer does come down on web-scraping models.


Would Midjourney be liable though? I mean, you can create copyright-infringing material using Photoshop too. (Even Paint!)

If I draw Mickey Mouse using Photoshop, would Adobe be liable for it?


I don't think it really matters whether or not Midjourney themselves are liable, the output of their model being legally radioactive would break their business model either way. They make money by charging users for commercial-use rights to their generations, but a judgement that the generations are uncopyrightable or outright infringing on others copyright would make it effectively useless for the kinds of users who want commercial-use rights.


I wouldn't lose sleep over this if I were working for Midjourney. The copyright lobby is powerful, and when bad actors like Microsoft, Disney, etc. jump onto the AI bandwagon and put their legal weight on their side of the lever, everything will turn out well (for them).


I'm presuming you're not including Stable Diffusion when you say this; the fact that SD and its variants are de facto extremely "free and open source" presently puts them way ahead of anything else, and is likely to do so for some time.


As far as I can tell, anyone who's creating images is using Midjourney. This is likely the same "Linux is open so it's way better" argument; tell that to the trillion-dollar companies that bet against that.


To be honest, most of the AI-generated images I find online are generated by Stable Diffusion; the fact that you can't generate NSFW images with MJ also makes a big difference.


This comment is breaking my brain. If you're not trolling, like, you do know what operating system the overwhelming vast majority of the "cloud" runs on, yes?


I’m perfectly aware of that. But you know what operating systems the overwhelming vast majority of PEOPLE use, yes?

Sure likely more machines run Linux on servers but that’s like saying your body has more bacteria than your own cells. Technically correct but actually bullshit.


Again, my brain is broken because you mentioned "Trillion dollar COMPANIES," who certainly know the value of Linux, even if a lot of people don't.


I share the same opinion, but also dislike these tests because each system benefits from a different approach to prompting. What I use to get a good result in MidJourney won't work in StableDiffusion, for example. Instead, when making these comparisons one needs to set an objective and have people who are familiar with each system produce their nicest images - since this is a better reflection of real-world usage. For example, ask each participant to read a chapter/page from a book with a lot of specific imagery and then use AI to create what they think that looks like.

Regarding image generation in Photoshop I can confirm two things:

- It is excellent for in and out painting with a few exceptions*

- It remains poor for generating a brand new image

*Photoshop's generative fill is very good at extending landscapes, it will match lighting and according to the release video can be smart enough to observe what a reflection should contain even if that is not specifically included in the image (in their launch demo they showed how a reflection pool captured the underside of a vehicle.)

Where generative fill falls apart: inserting new objects that are not well defined produces problems. Choosing something like a VW Beetle will produce a good result as it is well defined; choosing something like "boat", "dragon", or even "pirate's chest" will produce a range of images that do not necessarily fit the scene - likely because the source imagery for such objects is vague and prone to different representations.

1st note about Firefly: Anything that is likely to produce a spherical looking shape tends to be blocked - likely because it resembles certain human anatomy. This is problematic when doing small touch ups such as fixing fingers.

A special note about Photoshop versus other systems: Photoshop has the added problem of needing to match the resolution of the source material. Currently it achieves this by combining upscaling with resizing - this means that if one is extending an area with high detail, that detail cannot be maintained and instead is softer/blurrier than the original sections. It also means that if one extends directly from the border of an image, then a feathered edge becomes visible which must be corrected by hand.

I currently test the following AI generators, feel free to ask me about any of these: StableDiffusion (Automatic and InvokeAI), OpenAI's Dall-E 2, MidJourney, Stability AI's DreamStudio, and Adobe Firefly.


Not with typography though, haha. It can't spell. I had to draw the letters myself


None of these can do text well. There's a model that does do text and composition well, but the name escapes me. And the general quality is much lower overall, so it's a pretty heavy tradeoff.



I believe this is at least one solution, and one that the folks at stability themselves were pushing hard as a next step forward in the development of LLMs.


If midjourney could count fingers, I'd be thrilled!


I had done a similar comparison a couple months back but used Lexica instead of DALL-E.

Seems clear to me that Midjourney has by far the best "vibes" understanding. Most models get the items right but not the lighting. Firefly seems focused on realism which makes sense for a photography audience.

https://twitter.com/fanahova/status/1639325389955952640?s=46...


Kind of strange to me that they didn't test any prompts with people in them. In my experience that tends to show the limitations of various models pretty quickly.


Lighting also tends to be pretty bad in complex scenes. I find the unrealistic shadows tend to break the photorealism of scenes with few light sources.


Adobe Firefly is actually extremely competent, especially since it doesn't use copyrighted images in its training set. Using MidJourney (which is fantastic) commercially will be a quagmire for the unlucky company that draws a lawsuit.


All three of these are horrible, and running Stable Diffusion locally produces incredibly better results as seen in this comment section.


MidJourney produces more consistent and usable results. I am running SD and also pay for MJ. I've tried several checkpoints and LoRAs, but the output is often disappointing or doesn't follow the prompts correctly.


*Shameless Plug*

If you want to play around with OpenJourney (or any other fine-tuned Stable Diffusion model), I made my own UI with a free tier at https://happyaccidents.ai/.

It supports all open-sourced fine-tuned models & loras and I recently added ControlNet.


It should be compared using Bing Image Creator (a better version of Dall-E) rather than the Dalle-2 site.


Is it intentional that each of the prompts is given twice in that blockquote? It's done without a space, so e.g. in the 2nd example, the word "centeredvalley" appears because of the way the last/first words of the first/second repetition were mashed together. Does that indicate what was actually given to the engines, or was that a copy-paste issue made only while putting together the article? I could imagine that non-words like "cornera" in the last example could throw things off?


My result for prompt 2 using Dreamshaper Stable Diffusion model.

https://i.imgur.com/ipnf3f5.png


For those curious, I tried the same prompts with Kandinsky 2.1 [0]. In my experience it kind of blends the conceptual understanding of DALL-E with the higher quality image generation of Stable Diffusion. Like Midjourney, though, it kind of injects its own style and allows you to get "satisfying" results from short prompts.

The flaw with these comparisons is that you really shouldn't use the same prompt with different generators. If you want to get best results you do have to play with the prompts and do a bunch of iteration to kind of explore the latent space and find what you're looking for. The first super long prompt looks like it's tuned for stable diffusion for instance. Different generators also have different syntax (e.g. with stable diffusion you can surround a phrase with parens to give it extra emphasis).

[0]: https://iterate.world/s/clj4n19u20000jv08iqygiaqw
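
As a concrete illustration of that last point about syntax: in the A1111 web UI (and its forks), parentheses boost attention on a phrase, square brackets reduce it, and a colon gives an explicit weight. This is interpreted by the UI's prompt parser, not by the model itself; the prompt below is just a made-up example.

    # Illustration of A1111-style prompt emphasis syntax (example prompt is made up).
    # (phrase)      -> attention on that phrase multiplied by ~1.1
    # ((phrase))    -> ~1.21 (parentheses stack)
    # [phrase]      -> ~0.9 (de-emphasis)
    # (phrase:1.3)  -> explicit weight of 1.3
    prompt = "a watercolor landscape, (misty mountains:1.3), ((warm sunrise)), [distant village]"
    negative_prompt = "(blurry:1.4), lowres, watermark"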


Here is what the haunted house looks like with Dall-E ~3 (Bing Image Creator): https://www.bing.com/images/create/a-haunted-house-with-ghos...

Generally, this model is much better than Dall-E 2, and it beats Firefly in some areas (I didn't try Midjourney or Stable Diffusion). Firefly usually produces photos with significantly fewer visual mistakes (like the wrong number of fingers or messed up faces) than the Bing Dall-E. But the latter usually understands prompts much better and more often produces something that matches it well. Firefly also doesn't "know" a lot of pop culture or history things, e.g. Marilyn Monroe, or what Coca-Cola is.


Why didn't this person include Stable Diffusion?


OpenJourney is a fine-tuned SD model.


Can we appreciate how well that lightbox works on this site in a mobile browser, especially Safari? Also the gestures are smooth and do not cause any quirks like an unintended refresh gesture.


The analysis at the end seems to be lacking. From my perspective, PhotoShop and Midjourney come out on top in terms of aesthetic and accuracy, with kouteiheika's Stable Diffusion results[0] a close second. Dall-E falls far behind, which makes sense considering all the work that's gone in to the other systems to fine-tune and build ecosystems around them.

[0]: https://news.ycombinator.com/item?id=36408744


For comparison, these were generated using Stability.ai API: https://postimg.cc/gallery/MQfkgP7/ce388adf

I used stable-diffusion-xl-beta-v2-2-2 model, copypasted prompts from the blog post, one-shot for each prompt. I chose style presets that closely matched the prompt (added as suffixes in image filenames).


I like how simple Firefly’s images are, like something you’d want to work with in Photoshop. Dalle-2 looks terrible. Midjourney is still my favorite.


As someone who has spent hours playing with it in Photoshop (Beta) Firefly is actually pretty damned cool!


> small windows opening onto the garden

Literally all of the examples have floor to ceiling windows across the entire length of the wall…


I'm glad it's not just me getting unusable garbage out of Dall-E and glorious results from MidJourney.


Not sure this is a good comparison. Midjourney likes much shorter prompts, and honestly they're all absolutely terrible for anything that isn't 'photo' based. E.g. ask it to generate a word bubble of the most common programming languages and it will fail every time, no matter what you try. I love it for photo stuff, but for Photoshop you'd expect it to be able to do other things as well.


That’s not a fair comparison, as Midjourney is outstanding at a wide range of styles beyond photography.

Generating a “word bubble” is going to look terrible in every major diffusion model. Cohesive words and writing in image models is still highly specialised.


Curious, midjourney does great art and cartoon/comic styles too. Not just realistic images.

Most image AI tools are terrible with words.

I am curious, what images did you try generating with midjourney?


In my first few hours with DiffusionBee I made a couple of very credible semi-abstract portraits by mashing up the styles of unrelated artists. And some splashy watercolours. And some logo line art.

And the inevitable booby cheesy rendered forest fairy.

I don't think they're terrible at all. They absolutely can make original art with decent production values.

They can't write text yet, but I'm sure that's coming soon.


Author here: I updated the post to include the generated results from Stable Diffusion and Midjourney (thanks to kouteiheika and mdorazio).



