- This uses a new type of diffusion transformer (similar to Sora) combined with flow matching and other improvements.
- This takes advantage of transformer improvements & can not only scale further but also accept multimodal inputs.
- Will be released open, the preview is to improve its quality & safety just like og stable diffusion
- It will launch with full ecosystem of tools
- It's a new base taking advantage of latest hardware & comes in all sizes
- Enables video, 3D & more...
- Need moar GPUs..
- More technical details soon
>Can we create videos similar to Sora?
Given enough GPUs and good data yes.
>How does it perform on 3090, 4090 or less? Are us mere mortals gonna be able to have fun with it ?
It's in sizes from 800M to 8B parameters now; there will be all sizes for all sorts of deployment, from edge devices to giant GPUs.
(adding some later replies)
>awesome. I assume these aren't heavily cherry picked seeds?
No, this is all one generation. With DPO, refinement, and further improvement it should get better.
>Do you have any solves coming for driving coherency and consistency across image generations? For example, putting the same dog in another scene?
Yeah, see @Scenario_gg's great work with IP adapters, for example. Our team builds ComfyUI, so you can expect some really great stuff around this...
>Dall-e often doesn’t even understand negation, let alone complex spatial relations in combination with color assignments to objects.
I imagine the new version will. DALLE and MJ are also pipelines, and you can pretty much do anything accurately with pipelines now.
>Nice. Is it an open-source / open-parameters / open-data model?
Like prior SD models it will be open source/parameters after the feedback and improvement phase. We are open data for our LMs but not other modalities.
>Cool!!! What do you mean by good data? Can it directly output videos?
If we trained it on video, yes; it is very much like the architecture of Sora.
Stability has to make money somehow. By releasing an 8B parameter model, they’re encouraging people to use their paid API for inference. It’s not a terrible business decision. And hobbyists can play with the smaller models, which with some refining will probably be just fine for most non-professional use cases.
Oh they’ll never let you pay for porn generation. But they will happily entertain having you pay for quality commercial images that are basically a replacement for the entire graphic design industry.
Don't people quantize SD down to 8 bits? I understand plenty of people don't have 8GB of VRAM (and I suppose you need some extra for supplemental data, so maybe 10GB?). But that's still well within the realm of consumer hardware capabilities.
I am going to look at quantization for 8B. But also, these are transformers, so a variety of merging / Frankenstein-tuning is possible. For example, you can use the 8B model to populate the KV cache (which computes once, so it can be loaded from slower devices, such as RAM / SSD) and use the 800M model for diffusion by replicating weights to match layers of the 8B model.
Do you know how the memory demands compare to LLMs at the same number of parameters? For example, Mistral 7B quantized to 4 bits works very well on an 8GB card, though there isn’t room for long context.
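As a rough sanity check of the 4-bit 7B point (back-of-envelope only: weights alone, ignoring activations, KV cache, VAE/text encoders, and framework overhead; the model names are just illustrative):

```python
# Rough weights-only memory estimate at different quantization levels.
def weights_gb(params: float, bits: int) -> float:
    return params * bits / 8 / 1e9

for name, params in [("800M model", 0.8e9), ("Mistral 7B", 7e9), ("8B model", 8e9)]:
    print(name, {bits: round(weights_gb(params, bits), 2) for bits in (16, 8, 4)})
# e.g. 7B at 4 bits is ~3.5 GB of weights, which is why it fits on an 8 GB card
# with a little room left over for a modest context.
```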
Nvidia is making way too much money keeping cards with lots of memory exclusive to server GPUs they sell with insanely high margins.
AMD still suffers from limited resources and doesn't seem willing to spend much chasing a market that might just be temporary hype; Google's TPUs are a pain to use and seem to have stalled out; and Intel lacks commitment, and even its products that went roughly in that direction aren't a great match for neural networks because of its philosophy of having fewer, more complex cores.
MPS is promising and the memory bandwidth is definitely there, but stable diffusion performance on Apple Silicon remains terribly poor compared with consumer Nvidia cards (in my humble opinion). Perhaps this is partly because so many bits of the SD ecosystem are tied to Nvidia primitives.
Image diffusion models tend to have relatively low memory requirements compared to LLMs (and don’t benefit from batching), so having access to 128 GB of unified memory is kinda pointless.
Last I saw they performed really poorly, like lower single digits t/s. Don't get me wrong, they're probably a decent value for experimenting with it, but it's flat-out pathetic compared to an A100 or H100. And I think useless for training?
You can run a 180B model like Falcon Q4 at around 4-5 tk/s, a 120B model like Goliath Q4 at around 6-10 tk/s, and a 70B Q4 at around 8-12 tk/s, and smaller models much quicker, but it really depends on the context size, model architecture, and other settings. An A100 or H100 is obviously going to be a lot faster, but it costs significantly more once you take its supporting requirements into account, and it can't be run on a light, battery-powered laptop, etc.
I kind of wonder if gaming will start incorporating AI stuff. What if, instead of generating a stable diffusion image, you could generate levels and monsters?
GPU memory is all about bandwidth, not latency. DDR5 can do 4-8 GT/s over a 64-bit bus per DIMM, maxing out around 128 GB/s with a dual memory controller or 512 GB/s with 8 memory controllers on server chips. GDDR6X runs at roughly twice the frequency and has a memory bus ~5x as wide in the 4090, so you get an order-of-magnitude bump in throughput: nearly 1 TB/s on a consumer product. Datacenter GPUs (e.g. the A100) with HBM2e double that to 2 TB/s.
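To put quick numbers on those claims (nominal spec values only, not measured throughput; the transfer rates and bus widths below are the commonly quoted ones and can differ by SKU):

```python
# Back-of-envelope peak bandwidth = transfer rate (GT/s) * bus width (bytes).
def peak_gb_s(gt_per_s: float, bus_bits: int) -> float:
    return gt_per_s * bus_bits / 8

print("DDR5-8000, dual channel (128-bit) :", peak_gb_s(8.0, 128))   # 128 GB/s
print("DDR5-8000, 8 channels (512-bit)   :", peak_gb_s(8.0, 512))   # 512 GB/s
print("RTX 4090 GDDR6X (21 GT/s, 384-bit):", peak_gb_s(21.0, 384))  # ~1008 GB/s
print("A100 HBM2e (3.2 GT/s, 5120-bit)   :", peak_gb_s(3.2, 5120))  # ~2048 GB/s
```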
I've never tried it, but in Windows you can have CUDA apps fall back to system RAM when GPU VRAM is exhausted. You could slap 128 GB in your rig with a 4070. I'm sure performance falls off a cliff, but if it's the difference between possible and impossible, that might be acceptable.
Please give me some DIMM slots on the GPU so that I can choose my own memory like I'm used to from the CPU-world and which I can re-use when I upgrade my GPU.
An M1 Mac Studio with that much RAM can be had for around $3K if you look for good deals, and will give you ~8 tok/s on a 70B model, or ~5 tok/s for a 120B one.
Unfortunately production capacity for that is limited, and with sufficient demand, all pricing is an auction. Therefore, we aren't going to be seeing that card for years.
We have highly efficient models for inference and a quantization team.
Need moar GPUs to do a video version of this model, similar to Sora, now that they have proved that Diffusion Transformers can scale with latent patches (see stablevideo.com and our work on that model, currently the best open video model).
We have 1/100th of the resources of OpenAI and 1/1000th of Google etc.
Google has cheap TPU chips, which means they circumvent the extremely expensive Nvidia corporate licenses. I can easily see them having 10x the resources of OpenAI for this.
Yes, they have deep pockets and could increase investment if needed. But the actual resources devoted today are public, and in line with what the parent said.
Can someone explain why Nvidia doesn't just run their own AI and literally devote 50% of their production to their own compute center? In an age where even ancient companies like Cisco are getting into the AI race, why wouldn't the people with the keys to the kingdom get involved?
They've been very happy selling shovels at a steep margin to literally endless customers.
The reason is that they instantly get a risk-free, guaranteed, VERY healthy margin on every card they sell, and there are endless customers lined up for them.
If they kept the cards, they would give up the opportunity to make those margins, and instead take on the risk of developing a money-generating service (one that makes more money than selling the cards).
This way there's no risk of a competitor out-competing them, of not successfully developing a profitable product, of "the AI bubble popping", of stagnating development, etc.
There's also the advantage that this capital has allowed them to buy up most of TSMC's production capacity, which limits the competitors like Google's TPUs.
Because history has shown that the money is in selling the picks and shovels, not operating the mine. (At least for now. There very well may come a point later on when operating the mine makes more sense, but not until it's clear where the most profitable spot will be)
Don’t stretch that analogy too far. It was applicable to gold rushes, which were low hanging fruit where any idiot could dig a hole and find gold.
Historically, once the easy to find gold was all gone it was the people who owned the deep gold mines and had the capital to exploit them who became wealthy.
1. the real keys to the kingdom are held by TSMC whose fab capacity rules the advanced chips we all get, from NVIDIA to Apple to AMD to even Intel these days.
2. the old advice is to sell shovels during a gold rush
> Why is there not a greater focus on quantization to optimize model performance, given the evident need for more GPU resources?
There is an inherent trade off between model size and quality. Quantization reduces model size at the expense of quality. Sometimes it's a better way to do that than reducing the number of parameters, but it's still fundamentally the same trade off. You can't make the highest quality model use the smallest amount of memory. It's information theory, not sorcery.
Yes. Quantization compresses float32 values to int8 by mapping the large range of floats to a smaller integer range using a scale factor. This scale factor is key for converting back to floats (dequantization), aiming to preserve as much information as possible within the int8 limits. While quantization reduces model size and speeds up computation, it trades off some accuracy due to the compression. It's a balance between efficiency and model quality, not a magic solution to shrink models without losing some performance.
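A minimal NumPy sketch of that symmetric int8 scheme, just to make the scale-factor idea concrete (this is the simplest possible variant, not what any particular quantization library actually does):

```python
import numpy as np

# Map the float range onto [-127, 127] with a single scale factor,
# then dequantize back and measure the error introduced.
def quantize_int8(x: np.ndarray):
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

weights = np.random.randn(1024).astype(np.float32)
q, scale = quantize_int8(weights)
recovered = dequantize(q, scale)
print("max abs error:", np.abs(weights - recovered).max())  # small but nonzero
```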
Quantization is essential for me since a 7B model won't fit on my RTX 2060 with only 6GB of VRAM. It allows me to compress the model so it can run on my hardware.
I understand that Sora is very popular, so it makes sense to refer to it, but when saying it is similar to Sora, I guess it actually makes more sense to say that it uses a Diffusion Transformer (DiT) (https://arxiv.org/abs/2212.09748) like Sora. We don't really know more details on Sora, while the original DiT has all the details.
Is anyone else struck by the similarities in textures between the images in the appendix of the above "Scalable Diffusion Models with Transformers" paper?
If you size the browser window right, paging with the arrow keys (so the document doesn't scroll) you'll see (eg, pages 20-21) the textures of the parrot's feathers are almost identical to the textures of bark on the tree behind the panda bear, or the forest behind the red panda is very similar to the undersea environment.
Even if I'm misunderstanding something fundamental here about this technique, I still find this interesting!
So is this "SDXL safe" or "SD2.1 safe"? SDXL safe we can deal with; if it's 2.1 safe, it's gonna end up DOA for a large part of the open-source community again.
Don't know about 3.0, but Cascade has different levels of safety between the full model and the light model. The full model is far more prudish, but both completely fail with some prompts.
>>>How does it perform on 3090, 4090 or less? Are us mere mortals gonna be able to have fun with it ?
>>>Its in sizes from 800m to 8b parameters now, will be all sizes for all sorts of edge to giant GPU deployment.
--
Can you fragment responses such that if an edge device (a mobile app) is prompted for [thing], it can pass tokens upstream on the prompt, effectively torrenting responses? Then you could push actual GPU edge devices in certain places, like dense cities that are expected to consume a ton of GPU cycles around the edge.
So you would have tiered processing (speed is handled locally, quality level 1 can take some edge GPU, and corporate stuff can be handled in the cloud)...
----
Can you fragment and torrent a response?
If so, how is that request torn up and routed to appropriate resources?
BOFH me if this is a stupid question (but it's valid for how quickly we are evolving toward AI being intrinsic to our society).
Soon the GPU and its associated memory will be on different cards, as once happened with CPUs. The day of the GPU with RAM slots is fast approaching. We will soon plug terabytes of RAM into our 4090s, then plug a half-dozen 4090s into a Raspberry Pi to create a Cronenberg rendering monster. Can it generate movies faster than Pixar can write them? Sure. Can it play Factorio? Heck no.
Any separation of a GPU from its VRAM is going to come at the expense of (a lot of) bandwidth. VRAM is only as fast as it is because the memory chips are as close as possible to the GPU, either on separate packages immediately next to the GPU package or integrated onto the same package as the GPU itself in the fanciest stuff.
If you don't care about bandwidth you can already have a GPU access terabytes of memory across the PCIe bus, but it's too slow to be useful for basically anything. Best case you're getting 64GB/sec over PCIe 5.0 x16, when VRAM is reaching 3.3TB/sec on the highest end hardware and even mid-range consumer cards are doing >500GB/sec.
Things are headed the other way if anything, Apple and Intel are integrating RAM onto the CPU package for better performance than is possible with socketed RAM.
That depends on whether performance or capacity is the goal. Smaller amounts of RAM closer to the processing unit make for faster computation, but AI also presents a capacity issue. If the workload needs the space, having a boatload of less-fast RAM is still preferable to offloading data to something more stable like flash. That is where bulk memory modules connected through slots may one day appear on GPUs.
Is there a way to partition the data so that a given GPU had access to all the data it needs but the job itself was parallelized over multiple GPUs?
Thinking of the classic neural network, for example: each column of nodes only needs to talk to the next column. You could group several columns per GPU and then each would process its own set of nodes. While an individual job would be slower, you could run multiple tasks in parallel, processing new inputs after each set of nodes is finished.
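A minimal PyTorch sketch of that layer-wise split, assuming two CUDA devices are available (layer sizes are made up; real pipeline parallelism also micro-batches inputs so both GPUs stay busy instead of idling):

```python
import torch
import torch.nn as nn

# Split a plain MLP's layers across two devices: each "column" of layers only
# talks to the next one, so only activations cross the GPU boundary.
stage1 = nn.Sequential(nn.Linear(512, 2048), nn.ReLU(),
                       nn.Linear(2048, 2048), nn.ReLU()).to("cuda:0")
stage2 = nn.Sequential(nn.Linear(2048, 2048), nn.ReLU(),
                       nn.Linear(2048, 10)).to("cuda:1")

x = torch.randn(32, 512, device="cuda:0")
h = stage1(x)                  # runs on GPU 0
out = stage2(h.to("cuda:1"))   # activations are the only cross-GPU traffic
print(out.shape)
```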
No, it won't. GPUs are good at ML partly because of the huge memory bandwidth: thousands of bits wide. You won't find connectors that have that many terminals and maintain signal quality. Even putting a second bank soldered onto the same signals can be enough to mess things up.
I doubt it. The latest GPUs utilize HBM which is necessarily part of the same package as the main die. If you had a RAM slot for a GPU you might as well just go out to system RAM, way too much latency to be useful.
It isn't the latency which is the problem, it's the bandwidth. A memory socket with that much bandwidth would need a lot of pins. In principle you could just have more memory slots where each slot has its own channel. 16 channels of DDR5-8000 would have more bandwidth than the RTX 4090. But an ordinary desktop board with 16 memory channels is probably not happening. You could plausibly see that on servers however.
What's more likely is hybrid systems. Your basic desktop CPU gets e.g. 8GB of HBM, but then also has 16GB of DRAM in slots. Another CPU/APU model that fits into the same socket has 32GB of HBM (and so costs more), which you could then combine with 128GB of DRAM. Or none, by leaving the slots empty, if you want entirely HBM. A server or HEDT CPU might have 256GB of HBM and support 4TB of DRAM.
I don’t think you really understand the current trends in computer architecture. Even CPUs are being moved to on-package RAM for higher bandwidth. Everything is the opposite of what you said.
Higher bandwidth but lower capacity. The real trend is different physical architectures for different compute loads. There is a place in AI for bulk, albeit slower, memory, such as extremely large data sets that want to run entirely on a discrete card without involving PCIe lanes.
This is also not true. You can transfer from main memory to cards plenty fast enough that it is not a bottleneck. Consumer GPUs don't even use PCIe 5 yet, which doubles the bandwidth of PCIe 4. Professional datacenter cards don't use PCIe at all, but they do put a huge amount of RAM on the package with the GPUs.
I imagine this doesn't look impressive to anyone unfamiliar with the scene, but this was absolutely impossible with any of the older models. Though, I still want to know if it reliably does this--so many other things are left to chance, and if I need to also hit a one-in-ten chance of the composition being right, it still might not be very useful.
It’s the transformer making the difference. Original stable diffusion uses convolutions, which are bad at capturing long range spatial dependencies. The diffusion transformer chops the image into patches, mixes them with a positional embedding, and then just passes that through multiple transformer layers as in an LLM. At the end, the model unpatchify’s (yes, that term is in the source code) the patched tokens to generate output as a 2D image again.
The transformer layers perform self-attention between all pairs of patches, allowing the model to build a rich understanding of the relationships between areas of an image. These relationships extend into the dimensions of the conditioning prompts, which is why you can say “put a red cube over there” and it actually is able to do that.
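A stripped-down sketch of that patchify -> attention -> unpatchify flow in PyTorch (toy shapes, no conditioning or adaptive norms, and operating on raw pixels rather than the VAE latents a real DiT works in):

```python
import torch
import torch.nn as nn

# Toy DiT-style flow: image -> patch tokens -> transformer -> patches -> image.
B, C, H, W, P, D = 2, 4, 32, 32, 4, 256          # batch, channels, height, width, patch, dim
n_patches = (H // P) * (W // P)

patchify   = nn.Conv2d(C, D, kernel_size=P, stride=P)      # image -> patch tokens
pos_embed  = nn.Parameter(torch.zeros(1, n_patches, D))    # positional embedding
blocks     = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=D, nhead=8, batch_first=True), num_layers=4)
unpatchify = nn.Linear(D, P * P * C)                        # tokens -> pixel patches

x = torch.randn(B, C, H, W)
tokens = patchify(x).flatten(2).transpose(1, 2) + pos_embed  # (B, n_patches, D)
tokens = blocks(tokens)                                      # self-attention over all patch pairs
out = unpatchify(tokens)                                     # (B, n_patches, P*P*C)

h = w = H // P                                               # fold patches back into a 2D "image"
out = out.reshape(B, h, w, P, P, C).permute(0, 5, 1, 3, 2, 4).reshape(B, C, H, W)
print(out.shape)
```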
I suspect that the smaller model versions will do a great job of generating imagery, but may not follow the prompt as closely, but that’s just a hunch.
Convolution filters attend to a region around each pixel; not to every other pixel (or patch in the case of DiT). In that way, they are not good at establishing long range dependencies. The U-Net in Stable Diffusion does add self-attention layers but these operate only in the lower resolution parts of the model. The DiT model does away with convolutions altogether, going instead with a linear sequence of blocks containing self-attention layers. The dimensionality is constant throughout this sequence of blocks (i.e. there is no downscaling), so each block gets a chance to attend to all of the patch tokens in the image.
One of the neat things they do with the diffusion transformer is to enable creating smaller or larger models simply by changing the patch size. Smaller patches require more Gflops, but the attention is finer grained, so you would expect better output.
Another neat thing is how they apply conditioning and the time step embedding. Instead of adding these in a special way, they simply inject them as tokens, no different from the image patch tokens. The transformer model builds its own notion of what these things mean.
This implies that you could inject tokens representing anything you want. With the U-Net architecture in stable diffusion, for instance, we have to hook onto the side of the model to control it in various sort of hacky ways. With DiT, you would just add your control tokens and fine tune the model. That’s extremely powerful and flexible and I look forward to a whole lot more innovation happening simply because training in new concepts will be so straightforward.
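A sketch of that in-context conditioning idea: the timestep and a conditioning embedding are simply projected to the model dimension and concatenated as extra tokens (the sizes and projection layers here are illustrative, not the real SD3/DiT configuration):

```python
import torch
import torch.nn as nn

# Timestep and conditioning become tokens no different from image patch tokens.
D, n_patches, B = 256, 64, 2

timestep_mlp = nn.Sequential(nn.Linear(1, D), nn.SiLU(), nn.Linear(D, D))
cond_proj    = nn.Linear(768, D)   # e.g. project a text-encoder embedding to model dim

patch_tokens = torch.randn(B, n_patches, D)
t_token      = timestep_mlp(torch.full((B, 1, 1), 0.3))        # (B, 1, D)
cond_token   = cond_proj(torch.randn(B, 1, 768))               # (B, 1, D)

tokens = torch.cat([t_token, cond_token, patch_tokens], dim=1)  # (B, n_patches + 2, D)
# The transformer blocks then attend over all of these jointly; any new control
# signal could be appended the same way and learned during fine-tuning.
print(tokens.shape)
```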
My understanding of this tech is pretty minimal, so please bear with me, but is the basic idea something like this?
Before: Evaluate the image in a little region around each pixel against the prompt as a whole -- e.g. how well does a little 10x10 chunk of pixels map to a prompt about a "red sphere and blue cube". This is problematic because maybe all the pixels are red but you can't "see" whether it's the sphere or the cube.
After: Evaluate the image as a whole against chunks of the prompt. So now we're looking at a room, and then we patch in (layer?) a "red sphere" and then do it again with a "blue cube".
It kinda makes sense, doesn't it? What are the largest convolutions you've heard of -- 11 x 11 pixels? Not much more than that, surely? So how much can one part of the image influence another part 1000 pixels away? But I am not an expert in any of this, so an expert's opinion would be welcome.
Yes, it makes sense a bit. Many popular convnets operate on 3x3 kernels, but the number of channels increases per layer. This, coupled with the fact that the receptive field increases per layer and allows convnets to essentially see the whole image relatively early in the model's depth (especially coupled with pooling operations, which increase the receptive field rapidly), makes this intuition questionable. Transformers, on the other hand, operate on attention, which allows them to weight each patch dynamically, but it's clear to me that this allows them to attend to all parts of the image in a way different from convnets.
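To put numbers on the receptive-field point (simple arithmetic for stride-1 convolutions only; pooling and striding multiply the growth, as noted above):

```python
# Receptive field of a stack of 3x3 convolutions: it grows by (kernel - 1) * jump
# per layer, where the jump grows with striding/pooling.
def receptive_field(n_layers: int, kernel: int = 3, stride: int = 1) -> int:
    rf, jump = 1, 1
    for _ in range(n_layers):
        rf += (kernel - 1) * jump
        jump *= stride
    return rf

for n in (1, 5, 10, 50):
    rf = receptive_field(n)
    print(f"{n:>2} layers of 3x3 conv, stride 1: receptive field {rf}x{rf}")
# 50 stride-1 layers still only cover 101x101 pixels; self-attention covers
# the whole image in a single layer.
```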
This is just stylistic, and I think it’s because chatgpt knows a bit “better” that there aren’t very many literal photos of abstract floating shapes. Adding “studio photography, award winner” produced results quite similar to SD imo, but this does negatively impact the accuracy. On the other side of the coin, “minimalist textbook illustration” definitely seems to help the accuracy, which I think is soft confirmation of the thought above.
EDIT: I think the best approach is simply to separate out the terms in separate phrases, as that gets more-or-less 100% accuracy https://imgur.com/a/JGjkicQ
That said, we should acknowledge the point of all this: SD3 is just incredibly incredibly impressive.
From my experience, the thing that makes using AI image gen hard to use is nailing specificity. I often find myself having to resort to generating all of the elements I want out of an image separately and then comp them together with photoshop. This isn't a bad workflow, but it is tedious (I often equate it to putting coins in a slot machine, hoping it 'hits').
Generating good images is easy but generating good images with very specific instructions is not. For example, try getting midjourney to generate a shot of a road from the side (ie standing on the shoulder of a road taking a photo of the shoulder on the other side with the road crossing frame from left to right)...you'll find midjourney only wants to generate images of roads coming at the "camera" from the vanishing point. I even tried feeding an example image with the correct framing for midjourney to analyze to help inform what prompts to use, but this still did not result in the expected output. This is obviously not the only framing + subject combination that model(s) struggle with.
For people who use image generation as a tool within a larger project's workflow, this hurdle makes the tool swing back and forth from "game changing technology" to "major time sink".
If this example prompt/output is an honest demonstration of SD3's attention to specificity, especially as it pertains to framing and composition of objects + subjects, then I think its definitely impressive.
For context, I've used SD (via comfyUI), midjourney, and Dalle. All of these models + UIs have shared this issue in varying degrees.
It's very difficult to improve text-to-image generation to do better than this because you need extremely detailed text training data, but I think a better approach would be to give up on it.
> I often find myself having to resort to generating all of the elements I want out of an image separately and then comp them together with photoshop. This isn't a bad workflow, but it is tedious
The models should be developed to accelerate this then.
ie you should be able to say layer one is this text prompt plus this camera angle, layer two is some mountains you cheaply modeled in Blender, layer three is a sketch you drew of today's anime girl.
Totally agree. I am blown away by that image. Midjourney is so bad at anything specific.
On the other hand, SD has just not been on the level of the quality of images I get from Midjourney. The people who counter this I don't think know what they are talking about.
Previous systems could not compose objects within the scene correctly, not to this degree. What changed to allow for this? Could this be a heavily cherry-picked example? Guess we will have to wait for the paper and model to find out.
We introduce Diffusion Transformers (DiTs), a simple transformer-based backbone for diffusion models that outperforms prior U-Net models and inherits the excellent scaling properties of the transformer model class. Given the promising scaling results in this paper, future work should continue to scale DiTs to larger models and token counts. DiT could also be explored as a drop-in backbone for text-to-image models like DALL E 2 and Stable Diffusion.
Afaict the answer is that combining transformers with diffusers in this way means that the models can (feasibly) operate in a much larger, more linguistically-complex space. So it’s better at spatial relationships simply because it has more computational “time” or “energy” or “attention” to focus on them.
One thing that jumps out to me is that the white fur on the animals has a strong green tint due to the reflected light from the green surfaces. I wonder if the model learned this effect from behind the scenes photos of green screen film sets.
The models do a pretty good job at rendering plausible global illumination, radiosity, reflections, caustics, etc. in a whole bunch of scenarios. It's not necessarily physically accurate (usually not in fact), but usually good enough to trick the human brain unless you start paying very close attention to details, angles, etc.
This fascinated me when SD was first released, so I tested a whole bunch of scenarios. While it's quite easy to find situations that don't provide accurate results and produce all manner of glitches (some of which you can use to detect some SD-produced images), the results are nearly always convincing at a quick glance.
As well as light and shadows, yes. It can be fixed explicitly during training like the paper you linked suggests by offering a classifier, but it will probably also keep getting better in new models on its own, just as a result of better training sets, lower compression ratios, and better understanding of the real world by models.
I think you have to conceptualize how diffusion models work, which is that once the green triangle has been put into the image in the early steps, the later generations will be influenced by the presence of it, and fill in fine details like reflection as it goes along.
The reason it knows this is that this is how any light in a real photograph works, not just CGI.
Or if your prompt was “A green triangle looking at itself in the mirror” then early generation steps would have two green triangle like shapes. It doesn’t need to know about the concept of light reflection. It does know about composition of an image based on the word mirror though.
It's just diffuse irradiance, visible in most real (and CGI) pictures although not as obvious as that example. Seems like a typical demo scene for a 3D renderer, so I bet that's why it's so prominent.
It does make sense though. Accurate global illumination is very strongly represented in nearly all training data (except illustrations) so it makes sense that the model learned an approximation of it.
What if you can | a scene to a model and just have it calc all the ray-paths and then | any color/image... if you pre-calc various ray angles, you can then just map your POV and allow for the volume as it pertains to your POV be mapped with whatever overlay you want.
Here is the crazy cyberpunk part:
IT (whatever 'IT' is) keeps a lidar of everything EVERYONE senses in that space and can overlap/time/sequence anything about each experience and layer (barometer/news/blah tied to that temporal marker)
Micro resolution of advanced lidar is used in signature creation to ensure/verify/detect fake places vs IRL.
Secret nodes are used to anti-lidar the sensors... so a place can be hidden from drones attempting to map it.
These anomalies are detectable though, and GIS experts with terraforming skills are the new secOps.
Fn dorks.
-- so, you already have an asset, let's say it's a CUBOID room - with walls and such of wood texture_05.png
I think you've read too far into this. Ray tracing is not a useful real-world primitive for extracting information from most scenes. Sure, "everything is shiny", but most surfaces are diffuse and don't contain useful visual information besides the object they illuminate. Many supposedly "pure" reflections like mirrors and glass are actually subtle caustics that introduce too much nuance to account for.
Also, "pipe" isn't considered harmful terminology (yet) just FYI. I was confused seeing the "|" mononym in it's place.
But I realize you are correct in the mirroring - I immediately thought it was ray tracing the green hue from the reflection onto a surface that could see it...
Inference is far more efficient; however, it would be really interesting to know HOW an AI 'thinks' about such reflections.
What's the current status of AIs documenting themselves?
This is actually the approach of one paper to estimate lighting conditions. Their strategy is to paint a mirrored sphere onto an existing image: https://diffusionlight.github.io/
How do you know which way the red sphere is facing? A fun experiment would be to write two prompts for "a person in the middle, a dog to their left, and a cat to their right", and have the person either facing towards or away from the viewer.
The obsession with safety in this announcement feels like a missed marketing opportunity, considering the recent Gemini debacle. Isn’t SD’s primary use case the fact that you can install it on your own computer and make what you want to make?
And safe doesn't mean "lower than 1/10^6 chance of ending humanity", safe means shoddily implemented curtailing to idpol + fundamentalist level moral aversion towards human sexuality
It's not really their feelings, it's about controversy, bad publicity, etc. It's too delicate right now to risk people using their models for sex stuff.
I don't believe corporations implementing liberal politics to prevent backlash and it being legislated onto them qualifies as them being on the "left political spectrum".
Some backlash is perfectly ignorable. The twitter mob will move onto the next thing in a few days. And the proliferation of turned-to-the-max DEI employee policies, inclusion committees and self-censored newspeak does come from the californic technobubble cesspit.
There is such a great liberal fear of being perceived as any of the negative -ists and -isms that the pendulum swings to the other extreme where the left horseshoe toe meets its rightmost brother, which is why SD and Google's new toy rewrite ancient European history to include POC's and queer people.
At some point they have to actually make money, and I don't see how continuously releasing the fruits of their expensive training for people to run locally on their own computer (or a competing cloud service) for free is going to get them there. They're not running a charity, the walls will have to go up eventually.
Likewise with Mistral, you don't get half a billion in funding and a two billion valuation on the assumption that you'll keep giving the product away for free forever.
Ironically, their oversensitive NSFW image detector in their API caused me to stop using it and run it locally instead. I was using it to render animations of hundreds of frames, but when every 20th to 30th image comes out blurry it ruins the whole animation, and it would double the cost or more to re-render it with a different seed hoping not to trigger the overzealous blurring.
I don't mind that they don't want to let you generate NSFW images, but their detector is hopelessly broken; it once censored a cube, yes, a cube...
Unfortunately I don't want to pay for hundreds if not thousands of images I have to throw away because it decided some random innocent element is offensive and blurs the entire image.
What they are achieving with the overzealous safety issues is driving developers to on-demand GPU hosts that will let them host their own models, which also opens up a lot more freedom. I wanted to use the Stability AI API as my main source for Stable Diffusion, but they make it really, really hard, especially if you want to use it as part of your business.
I agree that given the status quo, it's a no-brainer to host your own model rather than use their SaaS – and likely one of the main reasons SAI doesn't seem to be on a very stable (heh) footing financially. To put it mildly.
But there are plenty of other business models available for open source projects.
I use Midjourney a lot and (based on the images in the article) it’s leaps and bounds beyond SD. Not sure why I would switch if they are both locked down.
SD would probably be a lot better if they didn't have to make sure it worked on consumer GPUs. Maybe this announcement is a step towards that where the best model will only be able to be accessed by most using a paid service.
Stable Diffusion has a much steeper learning curve but can generate far more accurate images fitting your perhaps special use case.
Although I don't understand the criticism of the images in question. Without a prompt comparison, it is impossible to compare image synthesis. What are examples of images that are beyond these?
I haven’t used SD so maybe the images on their home page here aren’t representative. But they look very generic and boring to me. They seem to lack “style” in a general aesthetic sense.
I am using Midjourney to basically create images in particular artistic styles (e.g., “painting of coffee cup in ukiyo-e style”) and that works very well. I am interested in SD for creating images based on artwork that isn’t indexed by Midjourney, though, as some of the more obscure artists aren’t available.
Usually there are models adapted to a specific theme since generic models at some point hit barriers. To get an idea, you could look up examples on sites like civitai.com.
Of course such sites are heavily biased towards content that is popular, but you will also find quite specific models if you search for certain styles.
Could you list the concrete "safety checks" that you think prevents real-world harm? What particular image that you think a random human will ask the AI to generate, which then leads to concrete harm in the real world?
This question narrows the scope of "safety" to something less than what the people at SD, or even probably what OP, cares about. _Non-random_ CSAM requests targeting potentially real people are the obvious answer here, but even non-CSAM sexual content is also probably a threat. I can understand frustration with it currently going overboard on blurring, but removing safety checks altogether would result in SD mainly being associated with porn pretty quickly, which I'm sure Stability AI wants to avoid for the safety of their company.
Add to that, parents who want to avoid having their kids generate sexual content would now need to prevent their kids from using this tool because it can create it randomly, limiting SD usage to users 18+ (which is probably something else Stability AI does not want to deal with).
It's definitely a balance between going overboard and having restrictions though. I haven't used SD in several months now so I'm not sure where that balance is right now.
> non-CSAM sexual content is also probably a threat
To whom? SD's reputation, perhaps - but that ship has already sailed with 1.x. That aside, why is generated porn threatening? If anything, anti-porn crusaders ought to rejoice, given that it doesn't involve actual humans performing all those acts.
As I said, it means parents who don't want their young children seeing porn (whether you agree with them or not) would no longer be able to let their children use SD. I'm not making a statement on what our society should or shouldn't allow, I'm pointing out what _is currently_ the standard in the United States and many other, more socially conservative, countries. SD would become more heavily regulated, an 18+ tool in the US, and potentially banned in other countries.
You can have your own opinion on it, but surely you can see the issue here?
I can definitely see an argument for a "safe" model being available for this scenario. I don't see why all models SD releases should be so neutered, however.
How many of those parents would have the technical know-how to stop their kids from playing with SD? Give the model some “I am over 18” checkbox fig leaf and let them have their fun.
The harm is that any use of the model becomes illegal in most countries (or offends credit card processors) if it easily generates porn. Especially if it does it when you didn't ask for it.
If 1 in 1,000 generations will randomly produce memorized CSAM that slipped into the training set then yeah, it's pretty damn unsafe to use. Producing memorized images has precedent[0].
Do you have an example? I've never heard of anyone accidentally generating CSAM, with any model. "1 in 1,000" is just an obviously bogus probability, there must have been billions of images generated using hundreds of different models.
Besides, and this is a serious question, what's the harm of a model accidentally generating CSAM? If you weren't intending to generate these images then you would just discard the output, no harm done.
Nobody is forcing you to use a model that might accidentally offend you with its output. You can try "aligning" it, but you'll just end up with Google Gemini style "Sorry I can't generate pictures of white people".
Earlier datasets used by SD were likely contaminated with CSAM[0]. It was unlikely to have been significant enough to result in memorized images, but checking the safety of models increases that confidence.
And yeah I think we should care, for a lot of reasons, but a big one is just trying to stay well within the law.
Then you know almost nothing about the SD 1.5 ecosystem, apparently. I've fine-tuned multiple models myself and it's nearly impossible to get rid of the child-bias in anime-derived models (which applies to 90% of character-focused models), including NSFW ones. Took me like 30 attempts to get somewhere reasonable and it's still noticeable.
If we're being honest, anime and anything "anime-derived" is uncomfortably close to CSAM as a source material, before you even get SD involved, so I'm not surprised.
What I had in mind were regular general purpose models which I've played around with quite extensively.
They try to, but it is difficult to comb through billions of images, and at least some of SD's earlier datasets were later found to have been contaminated with CSAM[0].
Okay, by "safety checks" you meant the already unlawful things like CSAM, but not politically-overloaded beliefs like "diversity"? The latter is what the comment[1] you were replying to was referring to (viz. "considering the recent Gemini debacle"[2]).
Right, by "rather have this [nothing]" I meant Stable Diffusion doing some basic safety checking, not Google's obviously flawed ideas of safety. I should have made that clear.
I posed the worst-case scenario of generating actual CSAM in response to your question, "What particular image that you think a random human will ask the AI to generate, which then leads to concrete harm in the real world?"
I've noticed that SDXL does something a little odd. For a given prompt it essentially decides what race the subject should be without the prompt having specified one. You generate 20 images with 20 different seeds but the same prompt and they're typically all the same race. In some cases they even appear to be the same "person" even though I doubt it's a real person (at least not anyone I could recognize as a known public figure any of the times it did this). I'm kind of curious what they changed from SD 1.5, which didn't do this.
I notice they are avoiding images of people in the announcement.
I wonder if they are afraid of the same debacle as google AI and what they mean by "safety" is actually heavy bias against white people and their culture like what happened with Gemini.
I wouldn't look for hidden reasons. Recent image generators are already too good with face generation (thanks to CelebA-like datasets and early researchers).
And now the emphasis is on the multimodality of the model within a domain. There, almost every picture demonstrates some aspect of it. Somewhere there is text on the picture (old AI used to output bullshit instead of letters), somewhere there are humorous references to old images (for example, a cosmonaut on a pig).
From the examples I see on Twitter, they are usually referring to the different cultures of Irish, European, and American white people. Gemini, in an effort to reverse the bias that the models would naturally have, ends up replacing these people with those from other cultures.
Since the definition of "white" is inherently cultural, it varies from place to place and from time to time. Today, in US and Europe, pretty much everyone who cares about racial categorization would consider Irish "white". Historically, it was different, but that is only relevant when discussing history.
Isn't it even more racist to replace them in a picture? Being told that your skin colour is too offensive to show sounds a lot worse to me than calling them "white" considering their skin is very white
US American white people. Anything else would be a ridiculous overgeneralization, like "Asian culture"; even if you set some arbitrary benchmark for complexion and only look at those European countries, it's still too much diversity to pool together.
As if you can generalize the culture of different European countries, or even different regions in the same country just by skin color. Now this, in my opinion, is a form of cultural erasure where all the intricacies and interesting aspects of culture are put aside and overshadowed by skin color.
IMO the "safety" in Stable Diffusion is becoming more overzealous where most of my images are coming back blurred, where I no longer want to waste my time writing a prompt only for it to return mostly blurred images. Prompts that worked in previous versions like portraits are coming back mostly blurred in SDXL.
If this next version is just as bad, I'm going to stop using Stability APIs. Are there any other text-to-image services that offer similar value and quality to Stable Diffusion without the overzealous blurring?
Edit:
Example prompts like "Matte portrait of Yennefer" return 8/9 blurred images [1]
The nice thing about Stable Diffusion is that you can very easily set it up on a machine you control without any 'safety' and with a user-finetuned checkpoint.
That isn't the topic. Porn is an example, but safety is synonymous with puritanical requirements arbitrarily summed up as the lowest common denominator. I want a powerful AI, not a replacement for a priest.
Gemini demonstrated a product I do not want to use and I am aware about the requirements of corporate contexts, although I think the safety mechanisms should be in the hand of users.
Google optimized for advertisers, but I am not interested in such content as it provides little value.
OK, but it seems very stupid to say you want the powerful AI to specifically come from a specific API when the very same tech is open-sourced for anyone to do whatever they want with.
No large scale model maker is going to put out public models for B2B with dubious use cases.
What the problem is: OpenAI, Facebook, and Google are not curating the datasets. You're arguing they shouldn't put controls in after the fact, but what you actually want is for them to use quality datasets.
Taking the actual example you provided, I can understand the issue. Since it amounts to blurring images of a virtual character, that are not actually "naughty." Equivalent images in bulk quantity are available on every search engine with "yennefer witcher 3 game" [1][2][3][4][5][6] Returns almost the exact generated images, just blurry.
I've never seen blurring in my images. Is that something that they add when you do API access? I'm running SD 1.5 and SDXL 1.0 models locally. Maybe I'm just not prompting for things they deem naughty. Can you share an example prompt where the result gets blurred?
If you run locally with the basic stack, it’s literally a bool flag to hide NSFW content. It’s trivial to turn off, and it's off by default in most open-source setups.
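For example, with the Hugging Face diffusers library the checker is an optional pipeline component you can simply drop when loading locally (the model ID below is just the stock SD 1.5 checkpoint as an illustration; the library prints a warning instead of blacking out flagged images):

```python
import torch
from diffusers import StableDiffusionPipeline

# Load the pipeline without the post-hoc NSFW checker that blacks out flagged images.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
    safety_checker=None,          # disable the NSFW image filter
)
pipe = pipe.to("cuda")

image = pipe("matte portrait of a sorceress").images[0]
image.save("portrait.png")
```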
Wait, blurring (black) means that it objected to the content? I tried it a few times on one of the online/free sites (Huggingspace, I think) and I just assumed I'd gotten a parameter wrong.
Given the optimizations applied to SDXL (comparing to SD 1.5), it is understandable why it outputs blurry backgrounds. It is not for safety, it is just a cheap way to hide imperfections of technology. Imagine 2 neural networks: one occasionally outputs Lovecraftian hallucinated chimeras on backgrounds, another one outputs sterile studio-quality images. Researches selected the second approach.
It appears that they are trying to prevent generating accurate images of a real person, because they are worried about deepfakes, and this produces the blurring. While Yennefer is a fictional character she's played by a real actress on Netflix, so maybe that's what is triggering the filter.
I haven't tried SD3, but my local SD2 regularly has this pattern where while the image is developing it looks like it's coming along fine and then suddenly in the last few rounds it introduces weird artifacts to mask faces. Running locally doesn't get around censorship that's baked into the model.
I tend to lean towards SD1.5 for this reason—I'd rather put in the effort to get a good result out of the lesser model than fight with a black box censorship algorithm.
EDIT: See the replies below. I might just have been holding it wrong.
Be sure to turn off the refiner. This sounds like you’re making models that aren’t aligned with their base models and the refiner runs in the last steps. If it’s a prompt out of alignment with the default base model it’ll heavily distort. Personally with SDXL I never use the refiner I just use more steps.
Well yeah, because SD2 literally had purposeful censorship of the base model and the CLIP, which basically made it DOA to the entire open-source community that was dedicated to 1.5. SDXL wasn't so bad, so it gained traction, but 1.5 is still the king because it's from before the damn models were gimped at the knees and relied on workarounds and insane finetunes just to get basic anatomy correct.
Probably not, since I have no idea what you're talking about. I've just been using the models that InvokeAI (2.3, I only just now saw there's a 3.0) downloads for me [0]. The SD1.5 one is as good as ever, but the SD2 model introduces artifacts on (many, but not all) faces and copyrighted characters.
EDIT: based on the other reply, I think I understand what you're suggesting, and I'll definitely take a look next time I run it.
SDXL should be used together with a refiner. You can usually see the refiner kicking in if you have a UI that shows you the preview of intermediate steps. And it can sometimes look like the situation you describe (straining further away from your desired result).
That person would rather pay for an API than set up locally (which is as simple as unzipping and adding a model); setting up in the cloud can be painful if you aren't familiar with it.
It’ll be interesting to see what “safety” means in this case, given the censorship in diffusion models nowadays. Look what’s happening with Gemini; it’s quite scary really how different companies have different censorship values.
I’ve had my fair share of frustration with DALL-E as well when trying to generate weapon images for game assets. Had to tweak my prompts a lot.
> We believe in safe, responsible AI practices. This means we have taken and continue to take reasonable steps to prevent the misuse of Stable Diffusion 3 by bad actors. Safety starts when we begin training our model and continues throughout the testing, evaluation, and deployment. In preparation for this early preview, we’ve introduced numerous safeguards. By continually collaborating with researchers, experts, and our community, we expect to innovate further with integrity as we approach the model’s public release.
What exactly does this mean? Will we be able to see all of the "safeguards" and access all of the technology's power without someone else's restrictions on them?
For SDXL this meant that there were almost no NSFW (porn and similar) images included in the dataset, so the community had to fine-tune the model themselves to make it generate those.
I guess this statement is a cheap protection against cheap journalists. Otherwise by now all the tabloids would be full of scary stories about deepfake politicians, deep-porn and all types of blackmailers (by the way, there is so much competition in AI now that some company may well pay for a wave of such articles to destroy the competitor). And in response to this, hearty old men would clobber the Congress with petitions to immediately ban all AI. Who wants that?
I'd want a model that can draw website designs and other UIs well. So I give it a list of things in the UI, and I get back a bunch of UI design examples with those elements.
I'm gonna hazard a guess and say well within the capabilities of a fine tuned model, but that no such fine tuned model exists and the labeled data required to generate it is not really there.
That's not safety, the safety RLHF is because it tries to generate porn and people with three legs if you don't stop it.
It has the weird art style because that's what looks the most "aesthetic". And because it doesn't actually have nearly as good enough data as you'd think it does.
That's why we need open AI which scoops up all the data with its specific contexts and history and transforms it into a vast incomprehensible machine for us peons to gawk at while we starve and boil to death
Photographs, digital illustrations, comic or cartoon style images, whatever graphical style you can imagine are all easy to achieve with current models (though no single model is a master of all trades). Things that look like technical drawings are as well, but don't expect them to make any sense engineering-wise unless maybe if you train a finetune specifically for that purpose.
Rewriting the "safety" part, but replacing the AI tool with an imaginary knife called Big Knife:
"We believe in safe, responsible knife practices. This means we have taken and continue to take reasonable steps to prevent the misuse of Big Knife by bad actors."
Does anyone know which AI could be used to generate UI design elements (such as "generate a real estate app widget list"), as well as the kind of prompts one would use to obtain good results?
I'm only now investigating using AI to increase velocity in my projects, and the field is moving so fast that I'm a bit outdated.
If by design elements you include vector images, you could try https://www.recraft.ai/ or Adobe Firefly 2 - there's not a lot of vector work right now, so your choices are either the handful of vector generators, or just bite the bullet and use eg DALL-E 3 to generate raster images you convert to SVG/recreate by hand.
(The second is what we did for https://gwern.net/dropcap because the PNG->SVG filesizes & quality were just barely acceptable for our web pages.)
From the FAQ: "v0 is a generative user interface system by Vercel powered by AI. It generates copy-and-paste friendly React code based on shadcn/ui and Tailwind CSS that people can use in their projects"
At this point, perfect text would be a game changer if it can be solved.
Midjourney 6 can be completely photorealistic and include valid text, but it also sometimes adds bad text. It's not much, but having to use an image editor for that is still annoying. For creating marketing material, getting perfect text every time and never getting bad text would be amazing.
I wonder if we could get it to generate a layered output, to make it easy to change just the text layer. It already creates the textual part in a separate pass, right?
Current open-source tools include pretty decent off-the-shelf Segment Anything-based detectors. They leave a lot to be desired, but you can do layer-like operations by automatically detecting certain concepts and applying changes to them, or, less commonly, exporting the cropped areas. But not the content "beneath" the layers, since it doesn't exist.
I would bet that Adobe is definitely salivating at that. It might not be for a long time, but it seems like a no-brainer once the technology can handle it. The last few years have been fast; I interacted with the JS landscape for a few years and it moves faster than Sonic, and this tech iterates just as quickly.
A blogger I follow had an article explaining that the NSFW models for SDXL are just now SORT OF coming up to the quality of SD1.5 “pre-safety” models.
It’s been 6 months and it still isn’t there. SD3 is going to take quite a while if they’re baking “safety” in even harder.
1.5 is still more popular than xl and 2 for reasons unrelated to safety. The size and generation speed matter a lot. This is just a matter of practical usability, not some idea of the model being locked down. Feed it enough porn and you'll get porn out of it. If people have incentive to do that (better results than 1.5), it really will happen within days.
I wish I had something more clever to comment on it. I know what they’re doing, which is cool, and why, which is, IDK, live and let live and enjoy your own kink. It’s just a little funny that some of the most work put into fine-tuning models is from the pony community.
It's not just them. For example, 4chan is a surprisingly good way to get the most recent scoop on good models (both text and images), setup guides etc - if you can tolerate the inevitable, well, 4chan-ness of it. And the reason is exactly the same: a lot of people there really, really, really, want to generate porn and to chat with sexbots, and they're putting a lot of effort into getting the best (and least censored) results out of the resources that they have.
Apparently most stuff that relies on the weights being reasonably similar to SDXL doesn't work - control nets, LoRAs, commonly-used inpainting patches, the lot. It seems to go well beyond fine-tuning and be substantially retrained, to the point that it's a good chunk of the way to being a different model entirely, and the amount of training time is on the order of what went into SDXL originally too, from what I can tell.
It’s not obvious that fine-tuning can remove all latent compulsions from these models. Consider that the creators know that fine-tuning exists and have vastly more resources to explore the feasibility of removing deep bias using this method.
I mean, SDXL is great. Until you’ve had a chance to actually use this model, calling it out for some imagined offence that may or may not exist seems like drinking Kool-Aid rather than responding to something based in concrete, actual reality.
You get access to it… and it does the google thing and puts people of colour in every frame? Sure, complain away.
You get access to it, you can’t even generate pictures of girls? Sure. Burn the house down.
…you haven’t even seen it and you’re already bitching about it?
Come on… give them a chance. Judge what it is when you see it not what you imagine it is before you’ve even had a chance to try it out…
Lots of models, free, multiple sizes, hot damn. This is cool stuff. Be a bit grateful for the work they’re doing.
…and even if sucks, it’s open. If it’s not what you want, you can retune it.
1. A mass-distributed LLM (hosted by Google or OpenAI or whoever) that's been neutered and twisted into political correctness in a haphazard series of kneejerk meetings of small groups of people who are terrified big investors will walk away or that some powerful political group will denounce them. Effectively they create an enormous bias of falsity and incorrectness, for billions of people to use and embed the results throughout all their intellectual output.
2. Some wacko with an expensive Nvidia GPU makes deepfake porn of a popular politician. Or goodness forbid, of some weird kink where if this was actually a scene filmed with real people, there would be serious ethical issues.
Which scenario do you think is more dangerous, long term, and in terms of broad impact on society in general?
I think you are seeing things from a bubble. Most people in the country are in favor of efforts to correct for historical injustices and are worried about AIs repeating biases in their training that could have a material impact on the world.
Case in point: large advertisers and entertainment companies "pander" to these sorts of views, because it is broadly popular
Most people in the country don't give a shit about historical injustices and are worried about how they are going to pay their rent or put food on the table today.
If you go by online, popularity, yes, a lot of people do want to erase history in favor of feelings. That can be your opinion too.
> Broadly popular
Are they? Or are the loudest voices asymmetrically affecting discourse?
> This is not the work of a shadowy cabal
And how would you know if it was? What would the clues be? If you were in a bubble that was designed to impart an informal religious view (and it is a religion sometimes, just one with a screen instead of a book) that encompass politics and morality, how would you know?
I just cannot comprehend how some people cannot see the incredibly obvious moral responsibility in releasing something that could be used for a lot of good but also to do bad things. There's no reasonable moral theory in which you could just shrug and say, "well, somebody is going to do it anyway, so why should we even try to keep our conscience clean and avoid making it easy for them?" It's fundamentally amoral and antisocial.
If someone invents a lockpick capable of opening any door, they have a moral responsibility to prevent it from falling into the wrong hands, whether they want it or not. And it's absurd to complain when someone who could create a universal lockpick refuses to do so, never mind release the technology into the wild, and only agrees to sell simpler picks capable of picking simpler locks. Do these people also complain about work against nuclear proliferation? After all, North Korea got nukes anyway, so what's the point?
Your lockpick example only works if there's e.g. only one and it's feasible to keep it hidden from the world.
Software doesn't work that way. I agree that you should be responsible about dangerous tech, but you also have to be realistic about what the best way to do that is, which is pretty much never "keep it hidden."
(and, of course, this is not even considering the question that should probably go here which is -- how dangerous is this exactly? Given the moral panic we saw a while ago about e.g. Photoshop, I'm not entirely convinced that this is much to worry about.)
On the contrary, the argument that "someone will do it anyway, so you should just let it happen and take no moral responsibility because I WANT MY SHINY TOY" is… not merely silly, but incredibly absurd, selfish, entitled, and amoral.
One could argue that if $BIGCORP doesn't want their thing to be used by bad actors, they should just refrain from developing the technology at all, and while it's a somewhat defensible position, that would also result in techbros not getting their toy, so it doesn't really apply here.
The willful ignorance in these threads is maddening. They know why models have these restrictions, they just think the rules shouldn't apply to them and want to play the victim. It's the same libertarian attitude as people who whine about driving speed laws.
If there's one lesson from the 21st century, it's that's you shouldn't release massively impacting technology without strong ethics controls around it.
What's the best way to use SD (3 or 2) online? I can't run it on my PC and I want to do some experiments to generate assets for a POC videogame I'm working on. I pay for Midjourney and I wouldn't mind paying something like 5 or 10 dollars per month to experiment with SD, but I can't find anything.
I used Rundiffusion for a while before I bought a 4090, and I thought their service was pretty nice. You pay for time on a system of whatever size you choose, with whatever tool/interface you select. I think it's worth tossing a few bucks into it to try it out.
Eh, you can get the same software up and running in less than 15-20 minutes on an EC2 GPU instance for about half of Rundiffusion's hourly pricing. And keeping an instance in the Stopped state for the entire month costs less in storage than their 'premium' monthly fee.
I used rundiffusion to play around with a bunch of different open source software quickly and easily with pre-downloaded models after getting annoyed at my laptop GPU. But once I settled on one particular implementation and started spending a lot of time in it, it no longer made sense to repeatedly pay every hour for an initial ease-of-setup.
The only real ongoing benefit was rundiffusion came with a bunch of models pre-downloaded so swapping between them was quick. But you can use UI addons like the CivitAI browser to download models automatically through automatic1111, and you'll likely want to go beyond what they predownload to the instance for you anyway.
The downside to running on the cloud directly is having to manage the running/stopped state of the instance yourself. I haven't ever left it running when I was done with an instance, but I could see that as a risk. CLI commands and scripting can make that faster than logging into a website which does it for you automatically, but it's extra effort.
I thought about building an AMI and putting it up on the AWS marketplace, but it looks like there are a few options for that already. I don't know how good they are out of the box, as I haven't used them. But if spending 20 minutes once to get software running on a Linux instance is truly the only barrier to reducing cost, those prebuilt AMIs are a decent intermediary step. They're about $0.10/hour on top of server costs. I skipped straight to installing the software myself, but even an extra $0.10/hour overhead would be better than paying double.
Would you recommend that to someone who has never used AWS before? Is it possible to screw up and rack up a huge bill? I might consider using that for big tasks that I can't do with my local setup.
It's a _little_ possible to generate a huge bill, but the biggest risks here are:
1. Leaving instances running when they're not being used
and
2. Deviation from default behavior that results in accumulation of storage volumes you don't want or need (low likelihood but something to watch for initially).
For 1:
If you leave the instance running you'll keep getting charged the hourly rate. Not really unexpected, but you have to notice it yourself or set an alarm.
There are a few tricks to reduce likelihood of this happening and to limit charges if it does happen anyway:
a. Prevention: Make your own little auto-stop script for the instance like Rundiffusion has. Maybe make it part of the launch sequence too, so you run a script, it launches the instance, then starts a timer. If the timer counts down all the way without you jiggling it, it stops the instance.
b. Mitigation: Create an alarm on the instance with the action to 'Stop' the instance when the alarm is triggered. Set the trigger for the alarm to be something like 'Max CPU usage has been less than 4% for a consecutive hour' (a minimal sketch follows this list).
c. Mitigation: Use AWS' Instance scheduler to automatically stop the instance
d. Mitigation: Billing budgets with associated action to stop instances -- kind of like the alarms but triggered based on costs
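For reference, here's a minimal boto3 sketch of mitigation (b), assuming a hypothetical instance ID and region. It uses CloudWatch's built-in EC2 stop action, so no extra Lambda is needed; exact thresholds are yours to tune.

```python
import boto3

INSTANCE_ID = "i-0123456789abcdef0"   # hypothetical instance ID
REGION = "us-east-1"                  # hypothetical region

cloudwatch = boto3.client("cloudwatch", region_name=REGION)

# Stop the instance when max CPU stays under 4% for a full hour.
cloudwatch.put_metric_alarm(
    AlarmName="auto-stop-idle-sd-box",
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "InstanceId", "Value": INSTANCE_ID}],
    Statistic="Maximum",
    Period=3600,                      # one-hour evaluation window
    EvaluationPeriods=1,
    Threshold=4.0,
    ComparisonOperator="LessThanThreshold",
    # Built-in EC2 stop action for this region.
    AlarmActions=[f"arn:aws:automate:{REGION}:ec2:stop"],
)
```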
For 2:
It's probably a non-issue. You'll likely not have a problem, because you'll start and stop the same instance most of the time instead of creating and deleting new instances. In that case, gp3 SSD storage is $0.08/GB per month, so a 200 GB storage volume you keep around all the time and use only runs about $16 for the month. There are benefits, so it's likely worthwhile.
BUT, be careful if you create and terminate lots of instances instead of stopping and starting the same instance. There's a small possibility of accumulating extra storage volumes you don't need, without realizing it.
By DEFAULT that's not a problem. AWS will delete the attached storage volumes for an instance after you Terminate (not stop) the instance. The problem comes from changing the default behavior.
That change can be configured in the AMI you use to launch an instance (by whoever created the AMI), or by you when launching an instance. If the storage volumes are not deleted automatically, you have to delete them manually to stop them from continuing to generate charges, which you might not notice right away (a quick sketch for spotting these follows the list below).
Keep that in mind, but that scenario requires a fairly unlikely chain of requirements to get to the point of bill bloat:
- If you use a pre-built AMI from marketplace (such as one with stablediffusion preinstalled) which is configured to KEEP storage volumes upon termination instead of using AWS default of deleting them,
- and if you do a 'Create' and 'Terminate' instead of 'Start/Stop' so you're using a LOT of instances instead of a few,
- and if you don't notice the setting during launch,
- and if you don't see all the extra volumes sitting around...
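If you do end up creating and terminating lots of instances, something like this boto3 sketch (hypothetical region) can list any volumes left in the 'available' (unattached) state so you can decide whether to delete them:

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # hypothetical region

# Find EBS volumes that aren't attached to any instance.
resp = ec2.describe_volumes(Filters=[{"Name": "status", "Values": ["available"]}])
for vol in resp["Volumes"]:
    print(f'{vol["VolumeId"]}: {vol["Size"]} GiB, created {vol["CreateTime"]:%Y-%m-%d}')
    # Uncomment only after confirming you really don't need the volume:
    # ec2.delete_volume(VolumeId=vol["VolumeId"])
```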
People in this discussion seem to be hand-wringing about Stability's "safety" comments, but every model they've released has been fine-tuned for porn within about 24 hours.
SD 2 definitely seems like an anomaly they've learned from, though; it was hard for everyone to use for various reasons. SDXL and even Cascade (the new side-project model) seem to be embraced by horny people.
Horrible website, hijacks scrolling. I have my scrolling speed up with Chromium Wheel Smooth Scroller. This website's scrolling is extremely slow, so the extension is not working because they are "doing it wrong" TM and somehow hijack native scrolling and do something with it.
I wonder if this will actually be adopted by the community, unlike SD 2.0. Many are still developing around SD 1.5 due to its uncensored nature. SDXL has done better than 2.0, but has greater hardware requirements, so it still can't be used by everyone running 1.5.
Wouldn't this v3 supersede the StableCascade work?
Did they announce it because a team had been working on it and they wanted to push it out rather than just lose it as an internal project, or are there architectural differences that make both worthwhile?
I think of the SD3 as a further evolution of SD1.5/2/XL and StableCascade as a branching path. It is unclear which will be better in the long term, so why not cover both directions if they have the resources to do so?
I suspect Stable Cascade may incorporate a DiT at some point. The UNet is easily swapped out. SC’s main innovation is the training of a semantic compressor model and a VQGAN that translates the latent output from the diffusion model back to image space - rather than relying on a VAE.
It’s a really smart architecture and I think is fertile ground for stacking on new things like DiT.
There are architectural differences, although I found Stable Cascade a bit underwhelming. While it can actually manage text, the text it does manage often just looks like someone wrote text over the image; it doesn't feel integrated a lot of the time.
SD3 seems closer to SOTA. Not sure why Cascade took so long to get out; it seemed to be up and running months ago.
If you renoise the output of the first diffusion stage to halfway and then denoise forward again, you can eliminate the bad output. This approach is called “replay” or “iterative mixing” and there are a few open source nodes for ComfyUI you can refer to.
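For anyone who wants to try the idea outside ComfyUI, here's a rough sketch of the same renoise-to-halfway-then-denoise trick using the diffusers img2img pipeline. This is only an approximation of the "replay" / "iterative mixing" nodes mentioned above, not their implementation; the checkpoint name, prompt, and strength are assumptions.

```python
import torch
from diffusers import StableDiffusionImg2ImgPipeline
from PIL import Image

# Any SD checkpoint works; this one is just an example.
pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

first_stage_output = Image.open("cascade_output.png").convert("RGB")

# strength=0.5 re-noises the image roughly halfway along the schedule,
# then denoises forward again from that point.
fixed = pipe(
    prompt="a storefront sign reading 'OPEN', photorealistic",
    image=first_stage_output,
    strength=0.5,
    guidance_scale=7.0,
).images[0]
fixed.save("replayed.png")
```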
It's not a restrictive license. They made the model, they trained the base. They're releasing it for consumer use, but not for businesses to effectively re-sell. Makes perfect sense to me.
That’s a restrictive license. It’s certainly a reasonable license given the investment they have put into training the model (and Stability’s membership pricing for small companies is, if anything, unreasonably cheap).
Nevertheless, it’s frustrating that the industry is fragmenting to a variety of licenses where you have to read the fine print and often licensing information isn’t announced until final release.
Is it just me or is the Stable Diffusion bus image broken in the background? The bus back there does not look logical w.r.t. placement and size relative to the sidewalk.
XL was basically an experiment on the 2.1 architecture with some tweaks but at a larger image size... hence the XL. But it wasn't really an evolution of the underlying architecture, which is why it wasn't 3.0 or even 2.5; it was just "bigger", lol.
This reinforces my impression that Google is at least one year behind. Stunning images, 3D, video while Gemini had to be partially halted this morning.
I don't think that's a fair comparison because they're fulfilling substantially different niches. Gemini is a conversational model that can generate images, but is mainly designed for text. Stable Diffusion is only for images. If you compare a model that can do many things and a model that can only do images by how well they generate images, of course the image generation model looks better.
Stability does have an LLM, but it's not provided in a unified framework like Gemini is.
You think that technology is first. You think that mathematicians and computer engineers or mechanical engineers or doctors are first. They’re very important, but they’re not first. They’re second. Now I’ll prove it to you.
There was a country that had the best mathematicians, the best physicists, the best metallurgists in the world. But that country was very poor. It’s called the Soviet Union. But when you took one of these mathematicians or physicists, who was smuggled out or escaped, put him on a plane and brought him to Palo Alto. Within two weeks, they were producing added value that could produce great wealth.
What comes first is markets. If you have great technology without markets, without a market-friendly economy, you’ll get nowhere. But if you have a market-friendly economy, sooner or later the market forces will give you the technology you want.
And that my friend, simply won't come from an office paralyzed by internal politics of fear and conformity. Don't get it twisted.
> This reinforces my impression that Google is at least one year behind. Stunning images, 3D, video while Gemini had to be partially halted this morning.
People always say Google is "behind". I don't believe they're behind in a capabilities sense, which IMO is what the parent is implying. They've decided to make a PC product, which I wouldn't say is inferior to anything else if you're the kind of person who is into PC culture.
There might be some difficult internal politics to work through, but there is no way Google is hamstrung forever by this.
> They've decided to make a PC product, which I wouldn't say is inferior to anything else if you're the kind of person who is into PC culture.
I'm not.
> There might be some difficult internal politics to work through, but there is no way Google is hamstrung forever by this.
The technical prowess is irrelevant. Whichever of the companies in the AI race excises their PC demons will actually ship useful things and break ahead. The talent will follow the market.
Google might very well be hamstrung forever by this and other internal politics. This dynamic has played out over and over again.
I mean, it's kind of both? Making Nazis look diverse isn't just a political error, it's also a technical one. By default, showing Nazis should show them as they actually were.
This is a good question - not only for the actual ethics of the training, but for the future of AI use for art. It's gonna damage the livelihood of many artists (me included, probably) but also make art accessible to many more people. As long as the training dataset is ethical, I think fighting it is hard and pointless.
I really wonder what harm would come to the company if they didn't talk about safety?
Would investors stop giving them money? Would users sue that they now had PTSD after looking at all the 'unsafe' outputs? Would regulators step in and make laws banning this 'unsafe' AI?
What is it specifically that company management is worried about?
All of the above! Additionally... I think AI companies are trying to steer the conversation about safety so that when regulations do come in (and they will) that the legal culpability is with the user of the model, not the trainer of it. The business model doesn't work if you're liable for harm caused by your training process - especially if the harm is already covered by existing laws.
One example of that would be if your model was being used to spot criminals in video footage and it turns out that the bias of the model picks one socioeconomic group over another. Most western nations have laws protecting the public against that kind of abuse (albeit they're not applied fairly) and the fines are pretty steep.
They're attempting to guard themselves against incoming regulation. The big players, such as Microsoft, want to squash Stable Diffusion while protecting themselves, and they're going to do it by wielding the "safety is important and only we have the resources to implement it" hammer.
Safety is a very real concern, always has been in ML research. I'm tired of this trite "they want a moat" narrative.
I'm glad tech orgs are for once thinking about what they're building before putting out society-warping democracy-corroding technology instead of move fast break things.
It doesn't strike you as hypocritical that they all talk about safety while continuing to push out tech that's upending multiple industries as we speak? It's tough for me to see it as anything other than lip service.
I'd be on your side if any of them actually chose to keep their technology in the lab instead of tossing it out into the world and gobbling up investment dollars as fast as they could.
How are these two things related at all? When AI companies speak of safety, it's almost always about the "only including data a religious pastor would find safe, and filtering outputs" angle. How's the market and other industries relevant at all? Should AI companies be obligated to care about what happens to other companies? With that point of view, we should've criticized the iPhone for upending the PDA market, or Wacom for "upending" the traditional art market.
That would make sense if it was in the slightest about avoiding "society-warping democracy-corroding technology". Rather than making sure no one ever sees a naked person which would cause governments to come down on them like a ton of bricks.
This isn't a valid concern in my opinion. Photo manipulation has been around for decades. People have been drawing other people for centuries.
Also, where do we draw the line? Should Photoshop stop you from manipulating human body because it could be used for porn? Why stop there, should text editors stop you from writing about sex or describing human body because it could be used for "abuse". Should your comment be removed because it make me imagine Taylor Swift without clothes for a brief moment?
No, but AI requires zero learning curve and can be automated. I can't spit out 10 images of Tay per second in Photoshop. If I want to, and the API delivers, I can easily do that with AI. (Granted, if one were coding this there's a learning curve, but in principle, with the right interface - and they exist - I can churn out hundreds of images without actively putting work in.)
I've never understood the argument about image generators being (relatively) fast. Does that mean that if you could Photoshop 10 images per second, we should've started clamping down on Photoshop? What exact speed is the cutoff mark here? Given that Photoshop is updated every year and includes more and more tools that can accelerate your workflow (incl. AI-assisted ones), is there going be a point when it gets too fast?
I don't know much about the initial scandal, but I was under the impression that there was only a small number of those images, yet that didn't change the situation. I just fail to see how quantity factors into anything here.
Yes, if you could Photoshop 10/sec it would be a problem.
Think of it this way: if one out of every ten phone calls you get is spam, you still have a pretty usable phone. Make that three orders of magnitude more spam, so only 1 out of every 100 calls is real, and the system totally breaks down.
Generative AI makes generating realistic-looking fakes ~1000x easier; it's the one thing it's best at.
>I just fail to see how quantity factors into anything here.
Because you can overload any online discussion / sphere with that. There were so many that X effectively banned searching for her at all, because if you did, you were overwhelmed by very extreme fake porn. Everybody can do it with a very low entry barrier, it looks very believable, and it can be generated in high quantities.
We shouldn't have clamped down on Photoshop, but realistically two things would be nice in your theoretical case: usage restrictions and public information building. There was no clear-cut point where Photoshop was so mighty you couldn't trust any picture online. There were skills to be learned, people could identify the trickery, and it was on a very small scale and gradual. And photo trickery has been around for ages; even Stalin did it.
But creating photorealistic fakes in an automated fashion is completely new.
But when we talk about specifically harming one person, does it really matter if it's a thousand different generations of the same thing or 10 generations that were copied thousands of times? It is a technology that lowers the bar for generating believable-looking things, but I don't know if it's the speed that is the main culprit here.
And in fairness to generative AI, even nowadays it feels like getting to a point of true photorealism takes some effort, especially if the goal is letting it just run nonstop with no further curation. And getting a local image generator to run at all on your computer (and having the hardware for it) is also a bar that plenty of people can't clear yet. Photoshop is kind of different in that making more believable things requires a lot more time, effort and knowledge - but the idea that any image online can be faked has already been ingrained in the public consciousness for a very long time.
but that's not dangerous. It's definitely worthy of unlocking the cages of the attack lawyers, but it's not dangerous. The word "safety" is being used by big tech to trigger and gaslight society.
To the extent these models don't blindly regurgitate hate speech, I appreciate that. But what I do not appreciate is when they won't render a human nipple or other human anatomy. That's not safety, and calling it such is gaslighting.
As the leader in open image models it is incumbent upon us, as the models get to this level of quality, to take seriously how we can release open and safe models from legal, societal and other considerations.
Not engaging in this will indeed lead to bad laws, sanctions and more, as well as not fulfilling our societal obligations to ensure this amazing technology is used for outcomes as positive as possible.
Stability AI was set up to build benchmark open models of all types in a proper way, this is why for example we are one of the only companies to offer opt out of datasets (stable cascade and SD3 are opted out), have given millions of supercompute hours in grants to safety related research and more.
Smaller players with less uptake and scrutiny don't need to worry so much about some of these complex issues, it is quite a lot to keep on top of, doing our best.
>it is incumbent upon us as the models get to this level of quality to take seriously how we can release open and safe models from legal, societal and other considerations.
Can you define what you mean by "societal and other considerations"? If not, why not?
Likely public condemnation followed by unreasonable regulations when populists see their campaign opportunities. We've historically seen this when new types of media (e.g. TV, computer games) debut and there are real, early signals of such actions.
I don't think those companies being cautious is necessarily a bad thing even for AI enthusiasts. Open source models will quickly catch up without any censorship while most of those public attacks are concentrated into those high profile companies, which have established some defenses. That would be a much cheaper price than living with some unreasonable degree of regulations over decades, driven by populist politicians.
They risk reputational harm and, since there are so many alternatives, outright "brand cancellation". For example, vocal groups can lobby payment processors to deny service to any AI provider deemed unworthy. Ironic that tech enabled all of that behavior to begin with, and now they're worried about it turning on them.
What viable alternatives are there to Stable Diffusion? As far as I know, it's the only way to run good image generation locally, and that's probably a big consideration for any business dabbling in it.
Yeah, the word "good" is doing the heavy lifting here - while it's not the only one that can do it, it has a very comfortable lead over all alternatives.
> What is it specifically that company management is worried about?
As with all hype techs, even the most talented management are barely literate in the product. When talking about their new trillion $ product they must take their talking points from the established literature and "fake it till they make it".
If the other big players say "billions of parameters", you chuck in as many as you can. If the buzzword is "tokens", you say we have lots of tokens. If the buzzword is "safety", you say we are super safe. You say them all and hope against hope that nobody asks a simple question you are not equipped to answer that will show you don't actually know what you are talking about.
It's a bit rich when HN itself is chock-full of camp followers who pick the most mainstream opinion. Previously it was AI danger, then it became hallucinations, now it's that safety is too much.
The rest of the world is also like that. You can make a thing that hurts your existing business. Spinning off the brand is probably Google's best bet.
Can it generate an image of people without injecting insufferable diversity quotas into each image? If so then it’s the most advanced model on the internet right now!
It is a challenge for these models to generate images of counterintuitive or unusual situations that aren't depicted in the training set. For example, if you ask for a small cube sitting on top of a large cube, you'll likely get the correct result on the first attempt. Ask for a large cube on a small cube and you'll probably get an image of them side-by-side or with the small cube on top instead. The models can generalize in impressive ways, but it's still limited.
A while ago my daughter wanted an image of Santa pulling a sleigh with a reindeer in the driver's seat holding the reins. We tried dozens of different prompts and Dall-e 3 could not do it.
It's likely a result of the interplay between the image generation and caption/description generation aspects of the model. The earliest diffusion-based image generators used a 'bag of words' model for the caption (see musing regarding this and DALL-E 3: https://old.reddit.com/r/slatestarcodex/comments/16y14co/sco...), whereby 'a woman chasing a bear' would turn into `['a', 'a', 'chasing', 'bear', 'woman']`.
That's good enough to describe compositions well-represented in the training set, but it is likely to lock in to those common representations at the expense of rarer but still possible ones (the 'woman chasing a bear' above).
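As a toy illustration (not the actual text-encoding pipeline, just a sketch of why a bag-of-words caption loses relational information):

```python
def bag_of_words(caption: str) -> list[str]:
    # Word order - and therefore who is chasing whom - is discarded.
    return sorted(caption.lower().split())

print(bag_of_words("a woman chasing a bear"))
# ['a', 'a', 'bear', 'chasing', 'woman']
print(bag_of_words("a bear chasing a woman"))
# ['a', 'a', 'bear', 'chasing', 'woman']  <- identical: the two scenes collapse to one caption
```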
Being able to generate content w/ minimal presence in the training set is arguably an emergent, desirable behavior that could be seen as a form of intelligence.
From a technical perspective they are impressive. The depth of field in the classroom photo and the macro shot. The detail in the chameleon. The perfect writing in very different styles and fonts. The dust kicked up by the donut.
The artistic value is something you have to add with a good prompt with artistic vision. These images are probably the AI equivalent of "programmer art". It fulfills its function, but lacks aesthetic considerations. I wouldn't attribute that to the model just yet.
At this point, the next thing that will blow me away is AGI at human expert level or a Gaussian Splat diffusion model that can build any arbitrary 3D scene from text or a single image. High bar, but the technology world is already full of dark magic.
I guess we should count our blessings and be grateful that literacy, the printing press, computers and the internet became normalised before this notion of "harm" and harm prevention was. Going forward, it's hard to imagine how any new technology that is unconditionally intellectually empowering to the individual will be tolerated; after all, just think of the harms someone thus empowered could be enabled to perpetrate.
Perhaps eventually, once every forum has been assigned a trust-and-safety team and every word processor has been aligned and most normal people have no need for communication outside the Metaverse (TM) in their daily lives, we will also come around to reviewing the necessity of teaching kids to write, considering the epidemic of hateful graffiti and children being caught with handwritten sexualised depictions of their classmates.
"grateful that literacy, the printing press, computers and the internet became normalised before this notion of "harm" and harm prevention was"
Printing Press -> Reformation -> Thirty Years' War -> Millions Dead
I'm sure that there were lots of different opinions at the time about what kind of harm was introduced by the printing press and what to do about it, and attempts to control information by the Catholic church etc.
The current fad for 'safe' 'AI' is corporate and naive. But there's no simple way to navigate a revolutionary change in the way information is accessed / communicated.
Safetyism has been the standard civic religion since 9/11 and I doubt it will go quietly into the night. Much like the bishops and the king had a symbiotic relationship to maintain control and limit change (e.g., King James of KJV Bible fame), the government and corporations have a similarly tense, but aligned, relationship. Boogeymen from the left or the right can always be conjured to provide the fear necessary to control.
Would millions have died if the old religion gave way to the new one without a fight? The problem for the Vatican was that their rhetoric wasn't at top form after mentally stagnating for a few centuries since arguing with Roman pagans, so war was the only possibility to win.
"The Coddling of the American Mind" by Jonathan Haidt and Greg Lukianoff is a very good (and troubling) book that talks a lot about "safetyism". I can't recommend it enough.
It's strange that people think Stability is making decisions based on American politics when it isn't an American company and other countries generally have stricter laws in this area.
"Think of the Children" has been the norm since long before it was re-popularized in the 80s for song lyrics, in the 90s encryption, and now everything else.
I almost think it's the eras between that are more notable.
I agree. There should have been guardrails in place to prevent people who espouse extremist viewpoints like Martin Luther from spreading their dangerous and hateful rhetoric. I rest easy knowing that only people with the correct intentions will be able to use AI.
The current focus on "safety" (I would prefer a less gracious term) is based as much on fear as on morality: fear of government intervention and woke morality. The progress in technology is astounding; the focus on sabotaging the publicly available versions of the technology to promote (and deny) narratives is despicable.
> Way to blame the printing press for the actions of religious extremists.
I don't see GP blaming the printing press for that, they're merely pointing out that one enabled the other, which is absolutely true. I'm damn near a free speech absolutist, and I think the heavy "safety" push by AI is well-meaning but will have unintended consequences that cause more harm than they are meant to prevent, but it seems obvious to me that they can be used much the same as printing presses were by the extremists.
> The lesson isn't printing press bad, it's extremist irrational belief in any entity is bad (whether it's religion, Trump, etc.).
Yes, and fortunately that banning was the end of hateful printed content. Since that ban, the only way to print objectionable material has been to do it by hand with pen and ink.
(For clarity, I'm joking, and I know you're also not implying any such thing. I appreciate your comment/link)
What makes you think those who’ve worked hard over a lifetime to provide (with no compensation) the vast amounts of data required for these — inferior by every metric other than quantity — stochastic approximations of human thought should feel empowered?
I think the genAI / printing press analogy is wearing rather thin now.
This is a strange question since augmentation can be objectively measured even as its utility is contextual. With MidJourney I do not feel augmented because while it makes pretty images, it does not make precisely the pretty images I want. I find this useless, but for the odd person who is satisfied only with looking at pretty pictures, it might be enough. Their ability to produce pretty pictures to satisfaction is thus augmented.
With GPT4 and Copilot, I am augmented in a speed instead of capabilities sense. The set of problems I can solve is not meaningfully enhanced, but my ability to close knowledge gaps is. While LLMs are limited in their global ability to help design, architect or structure the approach to a novel problem or its breakdown, they can tell local tricks and implementation approaches I do not know but can verify as correct. And even when wrong, I can often work out how to fix their approach (this is still a speed up since I likely would not have arrived at this solution concept on my own). This is a significant augmentation even if not to the level I'd like.
The reason capabilities are not much enhanced is to get the most out of LLMs, you need to be able to verify solutions due to their unreliability. If a solution contains concepts you do not know, the effort to gain the knowledge required to verify the approach (which the LLM itself can help with) needs to be manageable in reasonable time.
I am not a programmer, so none of this applies to me. I can only speak for myself, and I’m not claiming that no one can feel empowered by these tools - in fact it seems obvious that they can.
I think programmers tend to assume that all other technical jobs can be attacked in the same way, which is not necessarily true. Writing code seems to be an ideal use case for LLMs, especially given the volume of data available on the open web.
Which is why I say it is contextual and depends on the task. I'll note that it's not only programming ability that is empowered but learning math, electronics, history, physics and so on up to the university level. As long as you take small enough steps such that you are able to verify with external sources, you will move faster with than without.
Writing it as "feel empowered" made it come across as if you meant the empowerment was illusory. My argument was that it is not merely a feeling but a real measurable difference.
And the metric of "beating most of our existing metrics so we had to rewrite the metrics to keep feeling special, but don't worry we can justify this rewriting by pointing at Goodhart's law".
The only reason the question of compensating people for their input into these models even matters is specifically because the models are, in actual fact, good. The bad models don't replace anyone.
> beating most of our existing metrics so we had to rewrite the metrics to keep feeling special
This is needlessly provocative, and also wrong. My metrics have been the same from the very beginning (i.e. ‘can it even come close to doing my work for me?’). This question may yet come to evaluate to ‘yes’, but I think you seriously underestimate the real power of these models.
> The only reason the question of compensating people for their input into these models even matters is specifically because the models are, in actual fact, good.
No. They don’t need to be good, they simply need to fool people into thinking they’re good.
And before you reflexively rebut with ‘what’s the difference?’, let me ask you this: is the quality of a piece of work or the importance of a job and all of its indirect effects always immediately apparent? Is it possible for managers to short term cost-cut at the expense of the long term? Is it conceivable that we could at some point slip into a world in which there is no funding for genuinely interesting media anymore because 90% of the population can’t distinguish it? The real danger of genAI is that it convinces non-experts that the experts are replaceable when the reality is utterly different. In some cases this will lead to serious blowups and the real experts will be called back in, but in more ambiguous cases we’ll just quietly lose something of real value.
Perhaps; this is something I find annoying enough that my responses may be unnecessarily sharp…
> and also wrong. My metrics have been the same from the very beginning (i.e. ‘can it even come close to doing my work for me?’). This question may yet come to evaluate to ‘yes’, but I think you seriously underestimate the real power of these models.
Okay then. (1) your definition is equivalent to "permanent mass unemployment" because if it can do your work for you, it can also do your work for someone else, (2) you mean either "over-estimate" or "real limits of these models", and the only reason I even bring up what's obviously a minor editing issue that I fall foul of myself on many comments is that this is the kind of mistake that people pick up on as evidence of the limits of AI — treating small inversions like this as evidence of uselessness.
> Is it conceivable that we could at some point slip into a world in which there is no funding for genuinely interesting media anymore because 90% of the population can’t distinguish it?
As written, what you describe is tautologically impossible. However, assuming you mean something more like "genuinely novel" rather than "interesting", absolutely! 100% yes. There's also loads of ways this could permanently end all human flourishing (even when used as a mere tool e.g. by dictators for propaganda), and some plausible ways it can permanently end all human existence (it's a safe bet someone will ask it to and try to empower it to this end, the question is how far they get with this).
> The real danger of genAI is that it convinces non-experts that the experts are replaceable when the reality is utterly different.
Despite the fact that the best models ace tests in medicine and law, the international mathematical olympiad, leetcode, etc., the fact there are no real tests for how good someone is after a few years of employment means both your point and mine can be true simultaneously. I'm thinking the real threat current LLMs pose to newspapers is that they fully automate the Gell-Mann Amnesia effect, even though they beat humans on every measure I had of intelligence when I was growing up, and depending on which measure exactly either all of humanity together by many orders of magnitude, or at worst putting them somewhere near the level of "rather good student taking the same test".
> In some cases this will lead to serious blowups and the real experts will be called back in, but in more ambiguous cases we’ll just quietly lose something of real value.
Hard disagree about "quiet loss". To the extent that value can be quantified, even if only by surveying humans, models can learn it. Indeed, this is already baked into the way ChatGPT asks you for feedback about the quality of the answers it generates. To the extent we lose things, it will be a very loud and noisy loss, possibly literally in the form of a nuke going off.
> (1) your definition is equivalent to "permanent mass unemployment" because if it can do your work for you, it can also do your work for someone else
This wouldn't happen because employment effects are mainly determined by comparative advantage, i.e. the resources that could be used to "do your job" will instead be used to do something they're more suited to.
(Not "that they're better at". it's "more suited to". You do not have your job because you're the best at it.)
I don't claim to be an expert in economics, so if you feel like answering please treat me as a noob, but doesn't comparative advantage have the implicit assumption that demand isn't ever going to be fully met for all buyers? The "single most economically important task" that a machine which can operate at a human (or superhuman) level, is "make a better version of itself" until that process hits a limit, followed by "maximise how many of you exist" until it runs out of resources. With assumptions that currently seem plausible such as "such a robot[0] might mass 100kg and take 5 months to turn plain metal ore into a working copy of itself", it takes about 30 years to convert the planet Mercury into 4.12e11 such robots per currently living human[1], which I assert is more than anyone can actually use even if they decided their next game of Civilization was going to be a 1:1 scale WestWorld-style LARP.
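For what it's worth, the back-of-the-envelope figures do check out under those (very speculative) assumptions; a quick sanity check:

```python
import math

MERCURY_MASS_KG = 3.3e23       # rough figure for Mercury's mass
ROBOT_MASS_KG = 100            # assumed per-robot mass from the comment
HUMANS = 8e9                   # roughly the current world population
DOUBLING_TIME_MONTHS = 5       # assumed self-replication time

total_robots = MERCURY_MASS_KG / ROBOT_MASS_KG   # ~3.3e21 robots
per_human = total_robots / HUMANS                # ~4.1e11 per person
doublings = math.log2(total_robots)              # ~71.5 doublings from one robot
years = doublings * DOUBLING_TIME_MONTHS / 12    # ~30 years

print(f"{per_human:.2e} robots per human after roughly {years:.0f} years")
```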
If I imagine a world where every task that any human can perform can also be done at world expert level — let alone at a superhuman level — by a computer/robot (with my implicit assumption "cheaply"), I can't imagine why I would ever choose the human option. If the comparative advantage argument is "the computer/robot combination will always be priced at exactly the level where it's cost-competitive with a human, in order that it can extract maximum profit", I ask why there won't be many AI/robots competing with each other for ever-smaller profit margins?
[0] AI and robotics are not the same things, one is body the other mind, but there's a lot of overlap with AI being used to drive robots, LLMs making it easier to define rewards and for the robots to plan; and AI also get better by having embodiment (even if virtual) giving them real world feedback.
> The "single most economically important task" that a machine which can operate at a human (or superhuman) level, is "make a better version of itself" until that process hits a limit, followed by "maximise how many of you exist" until it runs out of resources.
Lot of hidden assumptions here. How does "operating at human level" (an assumption itself) imply the ability to do this? Humans can't do this.
We very specifically can't do this, we have sexual reproduction for a good reason.
(Also, since your scenario also has the robots working for free, they would instantly run out of resources to reproduce because they don't have any money. Similarly, an AGI will be unable to grow exponentially and take over the world because it would have to pay its AWS bill.)
> If I imagine a world where every task that any human can perform can also be done at world expert level — let alone at a superhuman level — by a computer/robot (with my implicit assumption "cheaply"), I can't imagine why I would ever choose the human option.
If the robot performs at human level, and it knows you'll always hire it over a human, why would it work for cheaper?
If you can program it to work for free, then it's subhuman.
If you're imagining something that's superhuman in only ways that are bad for you and subhuman in ways that would be good for you, just stop imagining it and you're good.
> Lot of hidden assumptions here. How does "operating at human level" (an assumption itself) imply the ability to do this?
Operating at human level is directly equivalent to "can it even come close to doing my work for me" when the latter is generalised over all humans, which is the statement I was criticising on the grounds of the impact it has.
> Humans can't do this.
> We very specifically can't do this, we have sexual reproduction for a good reason.
Tautologically, humans operate at human level.
If you were responding to «"make a better version of itself" until that process hits a limit» — we've been doing, and continue to do, that with things like "education" and "medicine" and "sanitation". We've not hit our limits yet, as we definitely don't fully understand how DNA influences intelligence, nor how to safely modify it (plenty of unsafe ways to do so, though).
If you were responding to «followed by "maximise how many of you exist" until it runs out of resources», that's something all living things do by default. Despite the reduced fertility rates, our population is still rising.
And I have no idea what your point is about sexual reproduction, because it's trivial to implement a genetic algorithm in software, and we already do as a form of AI.
> (Also, since your scenario also has the robots working for free, they would instantly run out of resources to reproduce because they don't have any money. Similarly, an AGI will be unable to grow exponentially and take over the world because it would have to pay its AWS bill.)
First, I didn't say "for free", I was saying "competing with each other such that the profit margin tends towards zero", which is different.
Second, money is an abstraction to enable cooperation; it is not the resource itself. Money doesn't grow on trees, but apples do: just as plants don't use money but instead take minerals out of the soil, carbon out of the air, and water out of both, so too a robot which mines and processes some trace elements, silicon, and iron ore into PV and steel has those products as resources, even if it doesn't then go on to sell them to anyone. Inventing the first VN machine involves money, but only because the humans used to invent all the parts of that tech themselves want money while working on the process.
AI may still use money to coordinate, because it's a really good abstraction, but I wouldn't want to bet against superior coordination mechanisms replacing it at any arbitrary point in the future, neither for AI nor for humans.
> If the robot performs at human level, and it knows you'll always hire it over a human, why would it work for cheaper?
(1) competition with all the other robots who are trying to bid lower to get the business, i.e. Nash equilibrium of a free market
(2) I dispute the claim that "If you can program it to work for free, then it's subhuman." because all you have to do is give it a reward function that makes it want to make humans happy, and there are humans who value the idea of service as a reward all in its own right. Further, I think you are mixing categories by calling it "subhuman", as it sounds like an argument based on the value of its inner experience, where the economic result only requires the productive outputs — so for example, I would be surprised if it turned out Stable Diffusion models experienced qualia (making them "subhuman" in the moral value sense), but they're still capable of far better artistic output than most humans, to the extent that many artists are giving up on their profession (making them superhuman in the economic sense).
(3) One thing humans can do is program robots, which we're already doing, so if an AI were good enough to reach the standard I was objecting to, "can it even come close to doing my work for me" fully generalised over all humans, then the AI can program "subhuman" labour bots just as easily as we can, regardless of whether or not there turns out to be some requirement for qualia to enable performance in specific areas.
> If you were responding to «"make a better version of itself" until that process hits a limit» — we've been doing, and continue to do, that with things like "education" and "medicine" and "sanitation".
I think you have a conceptual confusion here. "Medicine" doesn't exist as an entity, and if it does, it doesn't do anything. People discover new things in the field of medicine. Those people are not medicine. (If they're claiming to be, they aren't, because of the principal-agent problem.)
> And I have no idea what your point is about sexual reproduction, because it's trivial to implement a genetic algorithm in software, and we already do as a form of AI.
Conceptual confusion again. Just because you call different things AI doesn't mean those things have anything in common or their properties can be combined with each other.
And the point is that sexual reproduction does not "make a better version of you". It forces you to cooperate with another person who has different interests than you.
Similarly, your ideas about robots building other little smaller robots who'll cooperate with each other… why are they going to cooperate with each other against you again? They don't have the same interests as each other because they're different beings.
> AI may still use money to coordinate, because it's a really good abstraction, but I wouldn't want to bet against superior coordination mechanisms replacing it at any arbitrary point in the future, neither for AI nor for humans.
Highly doubtful there could be one that wouldn't fall under the definition of money. The reason it exists is called the economic calculation problem (or the socialist calculation problem if you like); no amount of AI can be smart enough to make central planning work.
> (2) I dispute the claim that "If you can program it to work for free, then it's subhuman." because all you have to do is give it a reward function that makes it want to make humans happy
If it has a reward function it's subhuman. Humans don't have reward functions, which makes us infinitely adaptable, which means we always have comparative advantage over a robot.
> and there are humans who value the idea of service as a reward all in its own right.
It's recommended to still pay those people. That's because if you deliberately undercharge for your work, you'll run out of money eventually and die. (This is the actual meaning of efficient markets hypothesis / "people are rational" theory. It's not that people are magically rational. The irrational ones just go broke.)
Actually, it's also the reason economics is called "the dismal science". Slaveholders called it that because economists said it's inefficient to own slaves. It'd be inefficient to employ AI slaves too.
Computers and drafters had their work taken by machines. IBM did not pay off the computers and drafters. In this case you could make a steady decent wage. My grandfather was trained in a classic drawing style (yes it was his main job).
He did not get into the profession to make money. He did it out of passion and died poor. Artists are not being tricked by the promise of wealth. You will get a cloned style if you can't afford the real artist making it, and if the commission goes to a computer, how is that not the same as plagiarism by a human? Artists were not being paid well before. The anime industry has proven the endpoint of what happens to artists as a profession despite their skills. Chess still exists despite better play by machines. Art as a commercial medium has always been tainted by outside influences such as government, religion and pedophilia.
In the end, drawing wasn't going to survive in the age of vector art and computers. They are mainly forgettable jpgs you scroll past in a vast array like DeviantArt.
Sorry, but every one of your talking points — ‘computers were replaced’ , ‘chess is still being played’, etc. — and good counterarguments to them have been covered ad nauseam (and practically verbatim) by now.
Anyway, my point isn’t that ‘AI is evil and must be stopped’; it’s that it doesn’t feel ‘intellectually empowering’. I (in my personal work) can’t get anything done with ChatGPT that I can’t on my own, and with less frustration. We’ve created machines that can superficially mimic real work, and the world is going bonkers over it. The only magic power these systems have is sheer speed: they can output reams and reams of twaddle in the time it takes me to make a cup of tea. And no doubt those in bullshit jobs are soon going to find out.
My argument might not be what you expect from someone who is sad to see the way artists’ lives are going: if your work is truly capable of being replaced by a large language model or a diffusion model, maybe it wasn’t very original to begin with.
The sad thing is, artists who create genuinely superior work will still lose out because those financially enabling them will think (wrongly) that they can be replaced. And we’ll all be worse off.
I definitely feel more empowered, and making imperfect art and generating code that doesn't work and proofreading it is definitely changing people's lives. Which specific artist are you talking about who will suffer? Many of the ones I talk to are excited about using it.
You keep going back to value and finances. The less money is in it the better. Art isn't good because it's valuable, unless you were only interested in it commercially.
> Art isn't good because it's valuable, unless you were only interested in it commercially.
Of course not; I’m certainly not suggesting so. But I do think money is important because it is what has enabled artists to do what they do. Without any prospect of monetising one’s art, most of us (and I’m not an artist) would be out working in the potato fields, with very little time to develop skills.
I disagree. It will be better because it's driven purely by passion. Art runs in my family even today; I am fully aware of its value as well as its cost. It is not a career, and artists knew that then and now, indulging their decadence of expression through film purchases, luxurious pigments, toxic but beautiful chemicals, or instruments that were sure to never make back their purchase price. Someone (not my family) made Stonehenge in his backyard, and while it had no commercial value, it is still a very impressive feat and I admire the ingenuity. Art without monetary value is always the best, and previous problems such as film costs and paint prices are solved digitally, so the lack of commercial interest shouldn't hurt art at all.
Commercial movies have lots of CG, big budgets and famous actors, while small-budget indie movies have been exploding despite their weaker technical polish. Noah's ark was made by amateurs while the Titanic was made by experts.
Empowering to their users. A lot of things that empower their users necessarily disempower others, especially if we define power in a way that is zero-sum - the printing press disempowered monasteries and monks that spent a lifetime perfecting their book-copying craft (and copied books that no doubt were used in the training of would-be printing press operators in the process, too).
It seems to me that the standard use of "empowering" implies in particular that you get more power for less effort - which in many cases tends to be democratizing, as hard-earned power tends to be accrued by a handful of people who dedicate most of their lives to pursuit of power in one form or another. With public schooling and printing, a lot of average people were empowered at the expense of nobles and clerics, who put in a lifetime of effort for the power literacy conveys in a world without widespread literacy. With AI, likewise, average people will be empowered at the expense of those who dedicated their life to learn to (draw, write good copy, program) - this looks bad because we hold those people in high esteem in a world where their talents are rare, but consider that following that appearance is analogously fallacious to loathing democratization of writing because of how noble the nobles and monks looked relative to the illiterate masses.
I get why you might describe these tools as ‘democratising’, but it also seems rather strange when you consider that the future of creativity is now going to be dependent on huge datasets and amounts of computation only billion-dollar companies can afford. Isn’t that anything but democratic? Sure, you can ignore the zeitgeist and carry on with traditional dumb tools if you like, but you’ll be utterly left behind.
Datasets can still be curated by crowds of volunteers just fine. I would likewise expect a crowdsourceable solution to compute to emerge eventually - unless the safetyists move to prevent this by way of legislation.
When writing and printing emerged, they too depended on supply chains (for paper, iron, machining) and in the case of printing capital that were far out of the reach of the individual. Their utility and overlap with other mass markets resulted in those being commoditized in short order.
Harm prevention is definitely not new; books have been subject to censorship for centuries. Just look at the U.S., where we had the Hays code and the Comic Code Authority. The only difference is that now, Harm is defined by California tech companies rather than the Church or the Monarchy.
I feel like this analogy is not very appropriate. The main problem with AI-generated images and videos is that, with every improvement, it becomes more and more difficult to distinguish what's real from what's not. That's not something that happened with literacy, the printing press, or computers.
Think about it: the saturation of content on the Internet has become so bad that people have a hard time knowing what's true, to the point that we're again having outbreaks of preventable diseases such as measles because people can't tell real scientific information from fake. Imagine what will happen when anyone can create an image of whatever they want that looks just like any other picture, or worse, a video. We are not at all equipped to deal with that. We are risking a lot just for the ability to spend massive amounts of compute power on generating images. It's not curing cancer, not solving world hunger, not making space travel free, no: it's generating images.
It definitely is easier without AI. Before, if you saw a photo you could be fairly confident that most of it was real (yes, photo manipulation exists, but you can't really create a photo out of nothing). Videos were far more trustworthy still (and yes, I know there are some amazing 3D renders out there, but they're not really accessible). With these technologies and the rate at which they're improving, I feel like that's going out the window. Not to mention that the more content is generated, the easier it is for something fake to slip by.
The core problem is centralization of control. If everyone uses their own desktop computer, then everyone is responsible for their own behavior.
If everyone uses Hosting Service F, then at some point people will blur the lines and expect "Hosting Service F" to remove vulgar or offensive content. The lines themselves will be a zeitgeist of sorts with inevitable decisions that are acceptable to some but not all.
Can you even blame them? There are lots of ways for this to go wrong and no one wants to be on the wrong side of a PR blast.
I don't think your golden age ever truly existed — the Overton Window for acceptable discourse has always been narrow, we've just changed who the in-group and out-groups are.
The out group used to be atheists, or gays, or witches, or republicans (in the British sense of the word), or people who want to drink. And each of Catholics and Protestants made the other unwelcome across Europe for a century or two. When I was a kid, it was anyone who wanted to smoke weed, or (because UK) any normalised depiction of gay male relationships as being at all equivalent to heterosexual ones[0]. I met someone who was embarrassed to admit they named their son "Hussein"[1], and absolutely any attempt to suggest that ecstasy was anything other than evil. I know at least one trans person who started out of the closet, but was very eager to go into the closet.
[0] "promote the teaching in any maintained school of the acceptability of homosexuality as a pretended family relationship" - https://en.wikipedia.org/wiki/Section_28
Safety also protects people trying to make use of the technology at scale for perfectly benign use cases.
Want to install a plugin into Wordpress to autogenerate fun illustrations to go at the top of the help articles in your intranet? You probably don’t want the model to have a 1 in 100 chance of outputting porn or extreme violence.
I wrote a random password generator once. I was a naive young developer, and I thought it was helpful to generate memorable passwords, so I threw a dictionary of words into it without really checking the content, beyond the obvious swearwords. First day in production, it generated an inappropriate password and suggested it to a user.
When I replaced it with a different non-word based alphanumeric algorithm that couldn't issue someone a password of 'fat cow 392' ever again, I considered that a 'safe' implementation.
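For the curious, here is a minimal sketch of that second, non-word approach: draw characters uniformly from a fixed alphanumeric alphabet with a CSPRNG, so no dictionary word (and no accidental insult) can ever appear. The function name and default length are my own illustrative choices, not the commenter's actual code.

```python
import secrets
import string

# Fixed alphabet: letters and digits only, so the generator can never
# assemble a real word, let alone an offensive one.
ALPHABET = string.ascii_letters + string.digits

def generate_password(length: int = 16) -> str:
    """Return a random alphanumeric password using a cryptographic RNG."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))

if __name__ == "__main__":
    print(generate_password())  # random on every run
```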
Photoshop and the likes (modern day's pens) should have an automatic check that you are not drawing porn, censor the image and report you to the authorities if it thinks it involves minors.
edit: yes it is sarcasm, though I fear somebody will think it is in fact the right way to go.
That's ridiculous. What about real pens and paintbrushes? Should they be mandated to have a camera that analyses everything you draw/write just to be "safe"?
Maybe we should make it illegal to draw or write anything without submitting it to the state for "safety" analysis.
Text editors and the likes (modern day's typewriters) should have an automatic check that you are not criticizing the government, censor the text and report you to the authorities if it thinks it promotes an alternative political party.
Hopefully you are going to be absolutely shocked by the prospect of the above sentence. But as you can see, surveillance is a slippery slope. "Safety" is a very dangerous word because everybody wants to be "safe" but no one is really ready to define what "safe" actually means. The moment we start baking cultural / political / environmental preferences and biases into the tools we use to produce content, we allow other groups of people with different views to use those "safeguards" to harm us or influence us in ways we might not necessarily like.
The safest notebook I can find is indeed simple pen and paper, because it does not know or care what is being written; it just does its best regardless of how amazing or horrible the content is.
What's equally interesting is that while they spend a lot of words on safety, they don't actually say anything. The only hint what they even mean by safety is that they took "reasonable steps" to "prevent misuse by bad actors". But it's hard to be more vague than that. I still have no idea what they did and why they did it, or what the threat model is.
Maybe that will be part of future papers or the teased technical report. But I find it strange to put so much emphasis on safety and then leave it all up to the reader's imagination.
I truly wonder what "unsafe" scenarios an image generator could be used for? Don't we already have software that can do pretty much anything if a professional human is using it?
I would say the barrier to entry is stopping a lot of ‘candid’ unsafe behaviour. I think you allude to it yourself in implying currently it requires a professional to achieve the same results.
But giving that ability to _everyone_ will lead to a huge increase in undesirable and targeted/local behaviour.
Presumably it enables any creep to generate what they want by virtue of being able to imagine it and type it, rather than learn a niche skill set or employ someone to do it (who is then also complicit in the act)
IANAL but that sounds like harassment. I assume the legality of that depends on the context (did the artist previously date the subject? lots of states have laws against harassment and revenge porn that seem applicable here [1]. are you coworkers? etc), but I don't see why such laws wouldn't apply to AI-generated art as well. It's the distribution that's really the issue in most cases. If you paint secret nudes and keep them in your bedroom and never show them to anyone, it's creepy, but I imagine not illegal.
I'd guess that stability is concerned with their legal liability, also perhaps they are decent humans who don't want to make a product that is primarily used for harassment (whether they are decent humans or not, I imagine it would affect the bottom line eventually if they develop a really bad rep, or a bunch of politicians and rich people are targeted by deepfake harassment).
^ a lot of, but not all of those laws seem pretty specific to photographs/videos that were shared with the expectation of privacy and I'm not sure how they would apply to a painting/drawing, and I certainly don't know how the courts would handle deepfakes that are indistinguishable from genuine photographs. I imagine juries might tend to side with the harassed rather than a bully who says "it's not illegal cause it's actually a deepfake but yeah i obviously intended to harass the victim"
Because AI lowers the barrier to entry; using your example, few people have the drawing skills (or the patience to learn them) or take the effort to make a picture like that, but the barrier is much lower when it takes five seconds of typing out a prompt.
Second, the tool will become available to anyone, anywhere, not just a localised school. If generating naughty nudes is frowned upon in one place, another will have no qualms about it. And that's just things that are about decency, then there's the discussion about legality.
Finally, when person A draws a picture, they are responsible for it - they produced it. Not the party that made the pencil or the paper. But when AI is used to generate it, is all of the responsibility still with the person that entered the prompt? I'm sure the T's and C's say so, but there may still be lawsuits.
Right, these are the same arguments against uncontrolled empowerment that I imagine mass literacy and the printing press faced. I would prefer to live in a society where individual freedom, at least in the cognitive domain, is protected by a more robust principle than "we have reviewed the pros and cons of giving you the freedom to do this, and determined the former to outweigh the latter for the time being".
You seem to be very confused about civil versus criminal penalties....
Feel free to make an AI model that does almost anything, though I'd probably suggest that it doesn't make porn of minors as that is criminal in most jurisdiction, short of that it's probably not a criminal offense.
Most companies are only very slightly worried about criminal offenses, they are far more concerned about civil trials. There is a far lower requirement for evidence. AI creator in email "Hmm, this could be dangerous". That's all you need to lose a civil trial.
Why do you figure I would be confused? Whether any liability for drawing porn of classmates is civil or criminal is orthogonal to the AI comparison. The question is if we would hold manufacturers of drawing tools or software, or purveyors of drawing knowledge (such as learn-to-draw books), liable, because they are playing the same role as the generative AI does here.
Because you seem to be very confused about civil liability for most products. Manufacturers are commonly held liable for users' use of their products; for example, look at any number of products that have caused injury.
Surely those are typically cases where the manufacturer was taken to have made an implicit promise of safety to the user and their surroundings, and the user got injured. If your fridge topples onto you and you get injured, the manufacturer might be liable; if you set up a trap where you topple your fridge onto a hapless passer-by, the manufacturer will probably not be liable towards them. Likewise with the classic McDonald's coffee spill liability story - I've yet to hear of a case of a coffee vendor being held liable over a deliberate attack where someone splashed someone else with hot coffee.
Photoshop also lowers that barrier of entry compared to pen and pencil. Paper also lowers the barrier compared to oil canvas.
Affordable drawing classes and YouTube drawing tutorials lower the barrier of entry as well.
Why on earth would manufacturers of pencils, papers, drawing classes, and drawing software feel responsible for censoring the result of combining their tool with the brain of their customer?
A sharp kitchen knife significantly lowers the barrier of entry to murder someone. Many murders are committed everyday using a kitchen knife. Should kitchen knife manufacturers blog about this every week?
I agree with your point, but I would be willing to bet that if knives were invented today rather than having been around awhile, they would absolutely be regulated and restricted to law enforcement if not military use. Hell, even printers, maybe not if invented today but perhaps in a couple years if we stay on the same trajectory, would probably require some sort of ML to refuse to print or "reproduce" unsafe content.
I guess my point is that I don't think we're as inconsistent as a society as it seems when considering things like knives. It's not even strictly limited to thought crimes/information crimes. If alcohol were discovered today, I have no doubt that it would be banned and made Schedule I.
> Hell, even printers, maybe not if invented today but perhaps in a couple years if we stay on the same trajectory, would probably require some sort of ML to refuse to print or "reproduce" unsafe content.
Fun fact: Many scanners and photocopiers will detect that you're trying to scan/copy a banknote and will refuse to complete the scan. One of the ways is detecting the EURion Constellation.
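For illustration only (the real firmware logic is not public): the EURion constellation is a specific arrangement of five small circles, so a toy detector might simply look for tight clusters of tiny circles in a scan. The file name, radii, and distance thresholds below are assumptions made up for this sketch, not the actual algorithm.

```python
import cv2
import numpy as np

# Load a scanned page in grayscale; "scan.png" is a placeholder path.
img = cv2.imread("scan.png", cv2.IMREAD_GRAYSCALE)

# Look for small circles of roughly symbol size (thresholds are guesses).
circles = cv2.HoughCircles(
    img, cv2.HOUGH_GRADIENT, dp=1, minDist=5,
    param1=100, param2=20, minRadius=2, maxRadius=8,
)

suspicious = False
if circles is not None:
    centers = circles[0, :, :2]  # (x, y) of each detected circle
    for c in centers:
        # Five or more tiny circles packed closely together is EURion-like.
        nearby = int(np.sum(np.linalg.norm(centers - c, axis=1) < 60))
        if nearby >= 5:
            suspicious = True
            break

print("refusing to copy" if suspicious else "ok to copy")
```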
That's not even necessarily a bad thing (as a whole - individually it can be). Now, any leaked nudes can be claimed to be AI. That'll probably save far more grief than it causes.
You’re very welcome to ask for clarification - I kept it abstract because there is a lot of grey area and it’s something we need to understand and discuss as technology and society evolves.
To spell out one such instance: I would like to live in a world where it is not trivial to depict and misrepresent me (or anyone) in a way that is photorealistic to the point that it can be used to mislead others.
Whether that means we need to outright prevent it, or have some kind of authenticity mechanism, or some other yet-to-be-discovered solution? I do not know, but you now have my goalposts.
The case I literally just referenced allows you to paint nude children engaged in sex acts.
> The Ninth Circuit reversed, reasoning that the government could not prohibit speech merely because of its tendency to persuade its viewers to engage in illegal activity.[6] It ruled that the CPPA was substantially overbroad because it prohibited material that was neither obscene nor produced by exploiting real children, as Ferber prohibited.[6] The court declined to reconsider the case en banc.[7] The government asked the Supreme Court to review the case, and it agreed, noting that the Ninth Circuit's decision conflicted with the decisions of four other circuit courts of appeals. Ultimately, the Supreme Court agreed with the Ninth Circuit.
Where are the Americans asking about Snapchat? If I were a developer at Snapchat I could prolly open a few Blob Storage accounts and feed a darknet account big enough to live off of. You people are so manipulable.
In a large number of countries if you create an image that represents a minor in a sexual situation you will find yourself on the receiving side of the long arm of the law.
If you are the maker of an AI model that allows this, you will find yourself on the receiving side of the long arm of the law.
Moreso, many of these companies operate in countries where thought crime is illegal. Now, you can argue that said companies should not operate in those countries, but companies will follow money every time.
I think it's pretty important to specify that you have to willingly seek and share all of these illegal items. That's why this is so sketch. These things are being baked with moral codes that'll _share_ the information, incriminating everyone. Like why? Why not just let it work and leave it up to the criminal to share their crimes? People are such authoritarian shit-stains, and acting like their existence is enough to justify their stance is disgusting.
Similar to why Google's latest image generator refuses to produce a correct image of a 'Realistic, historically accurate, Medieval English King'. They have guardrails and system prompts set up to force the generator's output to align with the company's values, or else someone would produce Nazi propaganda or worse. It (for some reason) would be attributed to Google and their AI, rather than the user who found the magic prompt words.
Eh, a professional human could easily lockpick the majority of front doors out there. Nevertheless I don't think we're going to give up on locking our doors any time soon.
For some scenarios, it's not the image itself but the associations that the model might possibly make from being fed a diet of 4chan and Stormfront's unofficial YouTube channel. The worry is over horrible racist shit, like if you ask it for a picture of a black person, and it outputs a picture of a gorilla. Or if you ask it for a picture of a bad driver, and it only manages to output pictures of Asian women. I'm sure you can think up other horrible stereotypes that would result in a PR disaster.
This is the world we live in. CYA is necessary. Politicians, media organizations, activists and the parochial masses will not brook a laissez faire attitude towards the generation of graphic violence and illegal porn.
looking at the manual censorship of the big channels on youtube, you don't even need to display anything, just suggesting it is enough to get a strike.
(of course unless you are into yoga, then everything is permitted)
Great talk about slavery and religious-persecution, Jim! Wait, what were we talking about? Fucking American fascists trying to control our thoughts and actions, right right.
Any large publicly available model has no choice but to do this. Otherwise, they're petrified of a PR nightmare.
Models with a large user base will have an inverse relationship with usability. That's why it's important to have options to train your own with open source.
I think this AI safety thing is great. These models will be used by people to make boring art. The exciting art will be left for people to make.
This idea of AI doing the boring stuff is good. Nothing prevents you from making exciting, dangerous, or 'unsafe' art on your own.
My feeling is that most people who are upset about AI safety really just mean they want it to generate porn. And because it doesn't, they are upset. But they hide it under the umbrella of user freedom. You want to create porn in your bedroom? Then go ahead and make some yourself. Nothing stopping you, the person, from doing that.
There is some truth in what you say, just like saying you're a "free speech absolutist" sounds good at first blush. But the real world is more complicated, and the provider adds safety features because they have to operate in the real world and not just make superficial arguments about how things should work.
Yes, they are protecting themselves from lawsuits, but they are also protecting other people. Preventing people asking for specific celebrities (or children) having sex is for their benefit too.
AI is not an engineered system; it's emergent behavior from a system we can vaguely direct but do not fundamentally understand. So it's natural that the boundaries of system behavior would be a topic of conversation pretty much all the time.
EDIT: Boring and shallow are, unfortunately, the Internet's fault. Don't know what to do about those.
At least in some latest controversies (e.g. Gemini generation of people) all of the criticized behavior was not emergent from ML training, but explicitly intentionally engineered manually.
But that's the thing, prompt formulation is not engineering in the sense I'm talking about. We know why a plane flies, we know why an engine turns, we know how a CPU works - mostly. We don't know how GenAI gets from the prompt to the result with any specificity at all. Almost all the informational entropy of the output is hidden from us.
It's also "safety" in the sense that you can deploy it as part of your own application without human review and not have to worry that it's gonna generate anything that will get you in hot water.
I agree with you, but when companies don't implement these things, they get absolutely trashed in the press & social media, which I'm sure affects their business.
What would you have them do? Commit corporate suicide?
This is a good question. I think it would be best for them to give some sort of signal, which would mean "We're doing this because we have to. We are willing to change if you offer us an alternative." If enough companies/people did this, at some point change would become possible.
I get a slightly uncomfortable feeling with this talk about AI safety. Not in the sense that there is anything wrong with it (maybe there is, maybe not), but in the sense that I don't understand what people are talking about when they talk about safety in this context. Could someone explain like I have Asperger (ELIA?) what this is about? What are the "bad actors" possibly going to do? Generate (child) porn / images with violence etc. and sell them? Pollute the training data so that racist images pop up when someone wants an image of a white pussycat? Or produce images that contain vulnerabilities so that when you open them in your browser you get compromised? Or what?
I'm not part of Stability AI but I can take a stab at this:
> explain like I have ~~Asperger (ELIA?)~~ limited understanding of how the world really works.
The AI is being limited so that it cannot produce any "offensive" content which could end up on the news or go viral and bring negative publicity to Stability AI.
Viral posts containing generated content that brings negative publicity to Stability AI are fine as long as they're not "offensive". For example, wrong number of fingers is fine.
There is not a comprehensive, definitive list of things that are "offensive". Many of them we are aware of - e.g. nudity, child porn, depictions of Muhammad. But for many things it cannot be known a priori whether the current zeitgeist will find it offensive or not (e.g. certain depictions of current political figures, like Trump).
Perhaps they will use AI to help decide what might be offensive if it does not explicitly appear on the blocklist. They will definitely keep updating the "AI Safety" to cover additional offensive edge cases.
It's important to note that "AI Safety", as defined above (cannot produce any "offensive" content which could end up on the news or go viral and bring negative publicity to Stability AI) is not just about facially offensive content, but also about offensive uses for milquetoast content. Stability AI won't want news articles detailing how they're used by fraudsters, for example. So there will be some guards on generating things that look like scans of official documents, etc.
Yes*. At least for the purposes of understanding what the implementations of "AI safety" are most likely to entail. I think that's a very good cognitive model which will lead to high fidelity predictions.
*But to be slightly more charitable, I genuinely think Stability AI / OpenAI / Meta / Google / MidJourney believe that there is significant overlap in the set of protections which are safe for the company, safe for users, and safe for society in a broad sense. But I don't think any released/deployed AI product focuses on the latter two, just the first one.
Examples include:
Society + Company: Depictions of Muhammad could result in small but historically significant moments of civil strife/discord.
Individual + Company: Accidentally generating NSFW content at work could be harmful to a user. Sometimes your prompt won't seem like it would generate NSFW content, but could be adjacent enough: e.g. "I need some art in the style of a 2000's R&B album cover" (See: Sade - Love Deluxe, Monica - Makings of Me, Rihanna - Unapologetic, Janet Jackson - Damita Jo)
Society + Company: Preventing the product from being used for fraud. e.g. CAPTCHA solving, fraudulent documentation, etc.
Individual + Company: Preventing generation of child porn. In the USA, this would likely be illegal both for the user and for the company.
You sound offended. My apologies. I had no intention whatsoever to offend anyone. Even if I am not diagnosed, I think I am at least borderline somewhere in the spectrum, and thought that would be a good way to ask people explain without assuming I can read between the lines.
I think ELI5 means that you simplify a complex issue so that even a small kid understands it. In this case there is no need to simplify anything, just explain what a term actually means without assuming reader understanding nuances of terms used. And I still do not quite get how ELIA can be considered hostile, but given the feedback, maybe I avoid it in the future.
Saying "explain like I have <specific disability>" is blatantly inappropriate. As a gauge: Would you say this to your coworkers? Giving a presentation? Would you say this in front of (a caretaker for) someone with Autism? Especially since Asperger's hasn't even been used in practice for, what, over a decade?
> In this case there is no need to simplify anything
I don't see how this is a response to anything I've said. They're speaking to other humans and the original use of their modified idiom isn't framed as if one were talking about their own, personal disability.
As far as Stable Diffusion goes - when they released SD 2.1/XL/Stable Cascade, you couldn't even make a (woman's) nipple.
I don't use them for porn like a lot of people seem to, but it seems weird to me that something that's kind of made to generate art can't generate one of the most common subjects in all of art history - nude humans.
For some reason its training thinks they are decorative, I guess it’s a pretty funny elucidation of how it works.
I have seen a lot of “pasties” that look like Sorry! game pieces, coat buttons, and especially hell-forged cybernetic plumbuses. Did they train it at an alien strip club?
The LoRAs and VAEs work (see civit.ai), but do you really want something named NSFWonly in your pipeline just for nipples? Haha
I have in fact gotten a nude out of Stable Cascade. And that's just with text prompting, the proper way to use these is with multimodal prompting. I'm sure it can do it with an example image.
I seem to have the opposite problem a lot of the time. I tried using Meta's image gen tool, and had such a time trying to get it to make art that was not "kind of" sexual. It felt like Facebook's entire learning chain must have been built on people's sexy images of their girlfriend that's all now hidden in the art.
These were examples that were not super blatant, like a tree landscape that just happens to have a human figure and cave in their crotch. Examples:
Not meant in a rude way, but please consider that your brain is making these up and you might need to see a therapist. I can see absolutely nothing "kind of sexual" in those two pictures.
Not taken as rude. If its not an issue, then that's actually a positive for you. It means less time taken reloading trying to get it to not look like a human that happens to be made out of mountains.
Well, for starters, ChatGPT shouldn't balk at creating something "in Tim Burton's style" just because Tim Burton complained about AI. I guess it's fair use unless a select rich person who owns the data complains. Seems like it isn't fair use at all then, just theft from those who cannot legally defend themselves.
Fair use is an exception to copyright. The issue here is that it's not fair use, because copyright simply does not apply. Copyright explicitly does not, has never, and will never protect style.
That makes it even more ridiculous, as that means they are giving rights to rich complaining people that no one has.
Examples:
Can you create an image of a cat in Tim Burton's style?
Oops! Try another prompt
Looks like there are some words that may be automatically blocked at this time. Sometimes even safe content can be blocked by mistake. Check our content policy to see how you can improve your prompt.
Can you create an image of a cat in Wes Anderson's style?
Certainly! Wes Anderson’s distinctive style is characterized by meticulous attention to detail, symmetrical compositions, pastel color palettes, and whimsical storytelling. Let’s imagine a feline friend in the world of Wes Anderson...
Didn't Tom Waits successfully sue Frito Lay when the company found an artist that could closely replicate his style and signature voice, who sang a song for a commercial that sounded very Tom Waits-y?
Yes, though explicitly not for copyright infringement. Quoting the court's opinion, "A voice is not copyrightable. The sounds are not 'fixed'." The case was won under the theory of "voice misappropriation", which California case law (Midler v Ford Motor Co) establishes as a violation of the common law right of publicity.
Not specifically SD, but DallE: I wanted to get an image of a pure white British shorthair cat on the arm of a brunette middle-aged woman by the balcony door, both looking outside.
It wasn‘t important, just something I saw in the moment and wanted to see what DallE makes of it.
Generation denied. No explanation given, I can only imagine that it triggered some detector of sexual request?
(It wasn‘t the phrase "pure white", as far as I can tell, because I have lots of generated pics of my cat in other contexts)
You are using someone else's proprietary technology; you have to deal with their limitations. If you don't like it, there are endless alternatives.
"Wrongly denied" in this case depends on your point of view; clearly DALL-E didn't want this combination of words created, but you have no right to have these prompts fulfilled.
I'm the last one defending large monolithic corps, but if you go to one and want to be free to do whatever you want you are already starting from a very warped expectation.
I don’t feel like it truly matters since they’ll release it and people will happily fine-tune/train all that safety right back out.
It sounds like a reputation/ethics thing to me. You probably don’t want to be known as the company that freely released a model that gleefully provides images of dismembered bodies (or worse).
Oh the big one would be models weights being released for anyone to use or fine tune themselves.
Sure, the safety people lost that battle for Stable diffusion and LLama. And because they lost, entire industries were created by startups that could now use models themselves, without it being locked behind someone else's AI.
But it wasn't guaranteed to go that way. Maybe the safetyists could have won.
I don't think we'd be having our current AI revolution if Facebook or SD weren't the first to release models for anyone to use.
While that thoughtpiece makes me not like him as a person, your point is correct; the comment you're replying to is a low effort ad hominem, attacking the person and something they said about a different subject instead of addressing the actual remark about models.
Come on, his point is not about AI but about politics: "they" versus "you".
And he is completely, totally incompetent on this - and by the way, he's also completely incompetent on AI, and on most of tech. See his stint at twitter...
If AMD and Nvidia managed to understand what you are asking for just looking at a bunch of vectors and manipulate the outputs to a goal, that would be a serious breakthrough in the field.
Even if it did that by looking at the process memory (which would be seriously wrong in many ways), just the manipulation part would be mighty impressive. And if it was the case I guess we would see many papers and heated discussions about them.
Bro I have generated literally thousands of monster girls in stable diffusion and all of my outputs have been spectacular. Driver changes have not affected my generation at all
Because floating-point math gives different results depending on the order of operations. A major reason you'd update GPU drivers is to reorder ops for better cache locality so you can light up more transistors.
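A tiny, self-contained demonstration of that order dependence (the values are chosen only to make the rounding visible; this is not tied to any particular driver):

```python
# Floating-point addition is not associative: regrouping a reduction
# changes which low-order bits survive rounding.
a, b, c = 1e16, -1e16, 1.0

left = (a + b) + c    # cancel first, then add 1.0 -> 1.0
right = a + (b + c)   # 1.0 is absorbed into -1e16, then cancelled -> 0.0

print(left, right, left == right)  # 1.0 0.0 False
```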
"we have taken and continue to take reasonable steps to prevent the misuse of Stable Diffusion 3 by bad actors"
It's kind of a testament to our times that the person who chooses to look at synthetic porn instead of supporting a real-life human trafficking industry is the bad actor.
I don't think the problem is watching synthetic images. The problem is generating them based off actual people and sharing them on the internet in a way that the people watching can't tell the difference anymore. This was already somewhat of a problem with Photoshop and once everyone with zero skills can do it in seconds and with far better quality, it will become a nightmare.
> once everyone with zero skills can do it in seconds and with far better quality, it will become a nightmare.
Will it be a nightmare? If it becomes so easy and common that anyone can do it, then surely trust in the veracity of damaging images will drop to about 0. That loss of trust presents problems, but not ones that "safe" AI can solve.
It kind of has? People believe written words when they come from a source that they consider, erroneously or not, to be trustworthy (newspaper, printed book, Wikipedia, etc.). They trust the source, not the words themselves just due to being written somewhere.
This has so far not been true of videos (e.g. a video of a celebrity from a random source has typically been trusted by laypeople) and should change.
It is always the others that believe in false information. "The stupid people", I guess. This is a completely fictional perspective that there are masses convinced and led astray by misinformation on the internet.
Misinformation only works if it confirms what people want to believe already. That there exists or not exists such material is secondary at best. But well, that is off topic I guess.
Let me give you a specific counterexample: it's easy and common to generate phishing emails. Trust in email has not dropped to the degree that phishing is not a problem.
Phishing emails mostly work because they apparently come from a trusted source, though. The key is that they fake the source, not that people will just trust random written words just because they are written, as they do with videos.
A better analogy would be Nigerian prince emails, but only a tiny minority of people believe those... or at least that's what I want to think!
That's the point. They do, but they no longer should. Our technical capabilities for lying have begun to overwhelm the old heuristics, and the sooner people realise the better.
> if it becomes so easy and common that anyone can do it, then surely trust in the veracity of damaging images will drop to about 0.
Spend more time on Facebook and you'll lose your faith in humanity.
I've seen obviously AI generated pictures of a 5 year old holding a chainsaw right next to a beautiful wooden sculpture, and the comments are filled with boomers amazed at that child's talent.
There are still people that think the IRS will call them and make them pay their taxes over the phone with Apple gift cards.
If we follow the idea of safety, should we restrict the internet so either such users can safely use the internet (and phones, gift cards, technology in general) without being scammed, or otherwise restrict it so that at risk individuals can't use the technology at all?
Otherwise, why is AI specifically being targeted, other than the fear of new things that looks similar to the moral panics of video games.
In concept this is maybe desirable; boot anyone off the internet that isn't able to use it safely.
In reality this is a disaster. The elderly and homeless people are already being left behind massively by a society that believes internet access is something everybody everywhere has. This is somewhat fine when the thing they want to access is twitter (and even then, even with the current state of twitter, who are you to judge who should and should not be on it?), but it becomes a Major Problem™ when the thing they want to access is their bank. Any technological solutions you just thought about for this problem are not sufficient when we're talking about "Can everybody continue to live their lives considering we've kinda thrust the internet on them without them asking"
>surely trust in the veracity of damaging images will drop to about 0
Maybe, eventually. But we don't know how long it will take (or if it will happen at all). And the time until then will be a nightmare for every single woman out there who has any sort of profile picture on any website. Just look at how celebrity deepfakes got reddit into trouble even though their generation was vastly more complex and you could still clearly tell that the videos were fake. Now imagine everyone can suddenly post undetectable nude selfies of your girlfriend on nsfw subreddits. Even if people eventually catch on, that first shock will be unavoidable.
Your anxiety dream relies on there currently being some technical bottleneck limiting the creation or spread of embarrassing fake nudes as a way of cyberbullying.
I don't see any evidence of that. What I see is that people who want to embarass and bully others are already fully enabled to do so, and do so.
It seems more likely to me and many of us that the bottleneck that stops it from being worse is simply that only so many people think it's reasonable or satisfying to distribute embarrassing fake nudes of someone. Society already shuns it and it's not that effective as a way of bullying and embarrassing people, so only so many people are moved to bother.
Assuming that the hyped-up new product is about to swoop in and disrupt the cyberbullying "industry" is just a classic technologist's fantasy.
It ignores all the boring realities of actual human behavior, social norms, and secure equilibriums, etc; skips any evidence building or research effort; and just presumes that some new technology is just sooooo powerful that none of that prior ground truth stuff matters.
I get why people who think that way might be on HN or in some Silicon Valley circles, but it can be one of the eyeroll-inducing vices of these communities as much as one of their motivational virtues.
The tide is rolling in and we have two options... yell at the tide really loud that we were here first and we shouldn't have to move... or get out of the way. I'm a lot more sympathetic to the latter option myself.
This: it won't happen immediately, and I'd go even further and say that even if trust in images drops to zero, it's still going to generate a lot of hell.
I've always been able to say all sorts of lies. People have known for millennia that lies exist. Yet lies still hurt people a ton. If I say something like, "idle_zealot embezzled from his last company," people know that could be a lie (and I'm not saying you did, I have no idea who you are). But that kind of stuff can certainly hurt people. We all know that text can be lies and therefore we should have zero trust in any text that we read - yet that isn't how things play out in the real world.
Images are compelling even if we don't trust that they're authentic. Hell, paintings were used for thousands of years to convey "truth", but a painting can be a lie just as much as text or speech.
We created tons of religious art in part because it makes the stories people want others to believe more concrete for them. Everyone knows that "Christ in the Storm on the Sea of Galilee" isn't an authentic representation of anything. It was painted in 1633, more than a century and a half after the event was purported to have happened. But it's still the kind of thing that's powerful.
An AI generated image of you writing racist graffiti is way more believable to be authentic. I have no reason to think you'd do such a thing, but it's within the realm of possibility. There's zero possibility (disregarding supernatural possibilities) that Rembrandt could accurately represent his scene in "Christ in the Storm on the Sea of Galilee". What happens when all the search engine results for your name start calling you a racist - even when you aren't?
The fact is that even when we know things can be faked, we still put a decent amount of trust in them. People spread rumors all the time. Did your high school not have a rumor mill that just kinda destroyed some kids?
Heck, we have right-wing talking heads making up outlandish nonsense that's easily verifiable as false that a third of the country believes without questioning. I'm not talking about stuff like taxes or gun control or whatever - they're claiming things like schools having to have litter boxes for students that identify as cats (https://en.wikipedia.org/wiki/Litter_boxes_in_schools_hoax). We know that people lie. There should be zero trust in a statement like "schools are installing litter boxes for students that identify as cats." Yet it spread like crazy, many people still believe it despite it being proven false, and it has been used to harm a lot of LGBT students. That's a way less believable story than an AI image of you with a racist tattoo.
Finally, no one likes their name and image appropriated for things that aren't them. We don't like lies being spread about us even if 99% of people won't believe the lies. Heck, we see Donald Trump go on rants about truthful images of him that portray his body in ways he doesn't like (and they're just things like him golfing, but an unflattering pose). I don't want fake naked images of me even if they're literally labeled as fake. It still feels like an invasion of privacy and in a lot of ways it would end up that way - people would debate things like "nah, her breasts probably aren't that big." Words can hurt. Images can hurt even more - even if it's all lies. There's a reason why we created paintings even when we knew that paintings weren't authentic: images have power and that power is going to hurt people even more than the words we've always been able to use for lies.
tl;dr: 1) It will take a long time before people's trust in images "drops to zero"; 2) Even when people know an image isn't real, it's still compelling - it's why paintings have existed and were important politically for millennia; 3) We've always known speech and text can be lies, but we regularly see lies believed and hugely damage people's lives - and images will always be more compelling than speech/text; 4) Even if no one believes something is true, there's something psychologically damaging about someone spreading lies about you - and it's a lot worse when they can do it with imagery.
Perhaps I'm being overly contrarian, but from my point of view, I feel that could be a blessing in disguise. For example, in a world where deepfake pornography is ubiquitous, it becomes much harder to tarnish someone's reputation through revenge porn, real or fake. I'm reminded of Syndrome from The Incredibles: "When everyone is super no one will be."
The censoring of porn content exists for PR reasons. They just want to have a way to say "we tried to prevent it". If anyone wants to generate porn, it only takes 30 minutes of research to find the huge number of Stable Diffusion-based models with NSFW content.
If you can generate synthetic images and have a channel to broadcast them, then you can generate far bigger problems than fake celebrity porn.
Not saying that it is not a problem, but rather that it is a problem inherent to the whole tool, not to some specific subjects.
But just like privacy issues, this'll be possible.
It's only bad because society still hasn't normalised sex, from a gay perspective y'all are prude af.
It's a shortcut, for us to just accept that these social ideals and expectations will have to change so we may as well do it now.
In 100 years, people will be able to make a personal AI that looks, sounds and behaves like any person they want and does anything they want. We'll have thinking dust; you can already buy cameras roughly a square millimetre in size, and in the future I imagine they'll be even smaller.
At some point it's going to get increasingly unproductive trying to safeguard technology without people's social expectations changing.
Same thing with Google Glass, shunned pretttty much exclusively bc it has a camera on it (even tho phones at the time did too), but now we got Ray Bans camera glasses and 50 years from now all glasses will have cameras, if we even still wear them.
Yes this. This is what I've been trying to explain to my friends.
When Tron came out in 1982, it was disliked because back then using CGI effects was considered "cheating". Then awhile later Pixar did movies entirely with CGI and they were hits. Now almost every big studio movie uses CGI. Shunned to embraced in like, 13 years.
I think over time the general consensus's views about AI models will soften. Although it might take longer in some communities. (Username checks out lol, furry here also. I think the furs may take longer to embrace it.)
(Also, people will still continue to use older tools like Photoshop to accomplish similar things.)
Yes many furs I know are very anti AI art etc, including overreacting to how "bad" it looks, though if I hadn't told em it was AI generated I don't think they'd have the same reaction.
Ironic, since so many furs are in tech, but we all have artist friends I suppose. People just forget that portrait painters were put out of business by photographers, and traditional artists were put out of business by digital artists. And so the cycle and the luddism repeats.
I'll challenge this idea and say that once it becomes ubiquitous, it actually does more good than harm. Things like revenge porn become pointless if there's no way to prove it's even real, and I have yet to ever see deep fakes of porn amount to anything.
I watched an old Tom Scott video of him predicting what the distant year 2030 would look like. In his talk, he mentioned privacy becoming something quaint that your grandparents used to believe in.
I’ve wondered for a while if we just adapt to the point that we’re unfazed by fake nude photos of people. The recent Bobbi Althoff “leaks” reminded me of this. That’s a little different since she’s a public figure, but I really wonder if we just go into the future assuming all photos like that have been faked, and if someone’s iCloud gets leaked now it’ll actually be less stressful because 1. They can claim it’s AI images, or 2. There’s already lewd AI images of them, so the real ones leaking don’t really make much of a difference.
There's an argument that privacy (more accurately anonymity) is a temporary phenomenon, a consequence of the scale that comes with industrialization. We didn't really have it in small villages, and we won't really have it in the global village.
(I'm not a fan of the direction, but then I'm a product of stage 2).
If that ever becomes an actual problem, our entire society will be at a filter point.
This is the problem with these kind of incremental mitigations philosophically -- as soon as the actual problem were to manifest it would instantly become a civilization-level threat that would only be resolved with drastic restructuring of society.
Same logic for an AI that replaces a programmer. As soon as AI is that advanced the problem requires vast changes.
Serious question: is it really that hard to remove personal information from the training data so the model does not know what specific public figures look like?
I believe this worked with nudity: when asked, the model generated "smooth" intimate regions (like some kind of doll).
So you could ask for, e.g., a generic president but not any specific one, which would make it very hard to generate anyone specific.
Proprietary, inaccessible models can somewhat do that. Locally hosted models can simply be trained on what a specific person looks like by the user, you just need a couple dozen photos. Keyword: LoRA.
We are already there, you can no longer trust any image or video you see, so what is the point?
Bad actors will still be able to create fake images and videos as they already do.
Limiting it for the average user is stupid.
We are not actually there yet. First, you still need some technical understanding and a somewhat decent setup to run these models yourself without the guardrails. So the average greasy dude who wants to share HD porn based on your daughter's LinkedIn profile pic on NSFW subreddits still has too many hoops to jump through. Right now you can also still spot AI images pretty easily, if you know what to look for. Especially for previous Stable Diffusion models. But all of this could change very soon.
Generating porn is easier and cheaper. You don't have to spend the time learning to draw naked bodies, which can be substantial. (The joke being that serious artists go through a lot of nude-model drawing sessions, but that isn't porn.)
The models art schools get for naked drawing sessions usually aren’t that attractive, definitely not at a porn ideal. The objective is to learn the body, not become aroused.
There is a lot of (mostly non realistic) porn that comes out of art school students via the skills they gain.
This is why I think generative AI tech should either be banned or be completely open sourced. Mega tech corporations are plenty of things already, they don't need to be the morality police for our society too.
Even if it is all open sourced, we still have the structural problem of training models large enough to do interesting stuff.
Until we can train incrementally and distribute the workload scalably, it doesn't matter how open the models / methods for training are if you still need a bajillion A100 hours to train the damn things.
But what if you flip things the other way around: deepfake porn is problematic not because porn is per se problematic, but because deepfake porn, or deepfake revenge porn, is made without consent. What if you give consent to some AI company or porn company to make porn content of you? I see this as an evolution of OnlyFans, where you could make AI-generated deepfake porn of yourself.
Another use case would be that retired porn actors could license their porn persona (face/body) to some AI porn company to make new porn.
I see big business opportunity in the generative AI porn.
Which is why this should be a much more decentralized effort. Hard to take someone to court when it's not one single person or company doing something.
I don't think there are any (even far) left wanting to ban non-diverse representation.
I think it's impossible to ban 'conservative thoughts' because that's such a poorly defined phrase.
However there are people who want to ban religion.
One difference is that a much larger proportion of far right (almost all of them) want to ban lgbtq depiction and existence compared to the number of far left who want to ban religion or non-diverse representation.
It says on the wikipedia article itself
'The horseshoe theory does not enjoy wide support within academic circles; peer-reviewed research by political scientists on the subject is scarce, and existing studies and comprehensive reviews have often contradicted its central premises, or found only limited support for the theory under certain conditions.'
> I don't think there are any (even far) leftwanting to ban non-diverse representation.
Look at the rules to win an Oscar now.
To cite a direct and personal case, I was involved in writing code for one of the US government's COVID bailout programs, the Restaurant Revitalization Fund. Billions of dollars of relief, but targeted to non-white, non-male restaurant owners. There was a lawsuit after the fact to stop the unfair filtering, but it was too late and the funds were completely dispensed. That felt really gross (though many of my colleagues cheered and even jeered at the complainers).
> I think it's impossible to ban 'conservative thoughts' because that's such a poorly defined phrase.
I commented in /r/conservative (which I was banned from) a few times, and I was summarily banned from five or six other subreddits by some heinous automation. Guilt by association. Except it wasn't even -- I was adding commentary in /r/conservative to ask folks to sympathize more with trans folks. Both sides here ideologically ban with impunity and can be intolerant of ideas they don't like.
I got banned from my city's subreddit for posting a concern about crime. Or maybe they used these same automated, high-blast radius tools. I'm effectively cut out of communication with like-minded people in my city. I think that's pretty fucked.
Mastodon instances are set up to ban on ideology...
This is all wrong and a horrible direction to go in.
It doesn't matter what your views are, I think we all need to be more tolerant and empathetic of others. Even those we disagree with.
I mean, a lot of moderates would like to avoid seeing any extreme content, regardless of whether it is too much left, right, or just in a non-political uncanny valley.
While the Horseshoe Theory has some merits (e.g., both left and right extremes may favor justified coercion, share an us-vs-them mentality, etc.), it is grossly oversimplified. Even the very simple (but at least two-dimensional) Political Compass model is much better.
That wasn't my point. My point is that equating bigotry with efforts to counteract bigotry is stupid and disingenuous. I don't really think many people actually want to ban non diverse representation anyway, so the premise already doesn't work, but even still, targeted malicious discrimination is not the same as misguided but well meaning policies. And meeting in the middle between "i hate gay people" and "i don't hate gay people" is morally bankrupt.
It is not only about morals but about the incentives of the parties. The demand for sexually explicit content is bigger than, say, the demand for niche artistic experiments of geometrical living cupboards owned by a cybernetic dragon.
Stability AI, very understandably, does not want to be associated with "the porn-generation tool". And if it, even occasionally, generated criminal content, the backlash would be enormous. Censoring the data requires effort but is (for companies) worth it.
>It's kind of a testament to our times that the person who chooses to look at synthetic porn instead of supporting a real-life human trafficking industry is the bad actor.
"Bad actor" is a pretty vague term, I think they are using it as a catch all without diving into the specifics. we are all projecting what that may mean based on our own awareness of this topic as a result.
I totally agree with your assessment and honestly would love to see this tech create less of a demand for the product human-traffickers produce.
Celebrity deepfakes and racist images made by internet trolls are a few of the overt things they are willing to acknowledge as problems and are fighting against (Google Gemini's overcorrection on this has been the talk this week). Does it put pressure on the companies to change for PR reasons? Yes. It also creates a bit of a Streisand effect, so it may be a zero-sum game.
We aren't talking about the big issue surrounding this tech, the issue that would cause far more damage to their brand than celebrity deep fakes:
I was referring to 'safety'. How the hell can an image generation model be dangerous? We have had software for editing text, images, videos and audio for half a century now.
It's really unfortunate that Silicon Valley ended up in an area that's so far left - and to be clear, it'd be just as bad if it was in a far right area too. Purple would have been nice, to keep people in check. 'Safety' seems to be actively making AI advances worse.
Put in any historical or political context, SV is in no way left. They're hardcore libertarian. Just look at their poster boys: Elon Musk, Peter Thiel, and a plethora of others are very oriented towards totalitarianism from the right. Just because they blow their brains out on LSD and ketamine and go on two-week spiritual retreats doesn't make them leftists. They're billionaires that only care about wealth and power, living in communities segregated from the common folk of the area - nothing lefty about that.
Musk's main residence is a $50k house he rents in Boca Chica. Grimes wanted a bigger, nicer residence for her and their kids, and that was one of the reasons she left him.
Elon Musk and Peter Thiel are two of the most hated people in tech, so this doesn't seem like a compelling example. Also I don't think Elon Musk and Peter Thiel qualify as "hardcore libertarian." Thiel was a Trump supporter (hardly libertarian at all, let alone hardcore) and Elon has supported Democrats and much government his entire life until the last few years. He's mainly only waded into "culture war" type stuff that I can think of. What sort of policies has Elon argued for that you think are "hardcore libertarian?"
He wanted to replace public transport with a system where you don't have to ride the public transport with the plebs, he wants to colonize Mars with the best minds (equal most money for him), he built a tank for urban areas. He promotes free speech even if it incites hate, he likes Ayn Rand, he implies government programs calling for united solutions are either communism, Orwell or basically Hitler. He actively promotes the opinion of those that pay above others on X.
Thank you, truly, I appreciate the effort you put in to list those. It helps me understand more where you're coming from.
> He wanted to replace public transport with a system where you don't have to ride the public transport with the plebs
I don't think this is any more libertarian than kings and aristocrats of days past were. I know a bunch of people who ride public transit in New York and San Francisco who would readily agree with this, and they are definitely not libertarian. If anything it seems a lot more democratic since he wants it to be available to everyone
> he wants to colonize Mars with the best minds (which for him equals the most money)
This doesn't seem particularly "libertarian" either, excepting maybe the aspect of it that is highly capitalistic. That point I would grant. But you could easily be socialist and still support the idea of colonizing something with the best minds.
> he built a tank for urban areas.
I admit I don't know anything about this one
> He promotes free speech even if it incites hate
This is a social libertarian position, although it's completely disconnected from economic libertarianism. I have a good friend who is a socialist (as in, wants to outgrow capitalism as Marx advocated) who supports using the state to suppress capitalist activity/"exploitation", and he is also a free speech absolutist.
> he likes Ayn Rand
That's a reasonable point, although I think it's worth noting that there are plenty of hardcore libertarians who hate Ayn Rand.
> he implies that government programs calling for united solutions are either communism, Orwell, or basically Hitler.
Eh, lots of Republicans, including Trump, do the same thing, and they're not libertarian. Certainly not "hardcore libertarian."
> He actively promotes the opinions of those who pay above others on X.
This could be a good one, although Google, Meta, Reddit, YouTube, and any other company that runs ads or has "sponsored content" are doing the same thing, so we would have to define all the big tech companies as "hardcore libertarian" to stay consistent.
Overall I definitely think this is a hard debate to have, because "hardcore libertarian" can mean different things to different people, and there's a perpetual risk of the "no true Scotsman" fallacy. I've responded above with how I think most people would imagine libertarianism, but the label depends on when in history you use it: many anarcho-socialists used it for themselves, yet today "libertarian" names a party that supports free-market economics and social liberty. But regardless of the challenges inherent in that, I appreciate the exchange.
>I don't think this is any more libertarian than kings and aristocrats of days past were.
So very libertarian.
>If anything it seems a lot more democratic since he wants it to be available to everyone
No, he wants a solution that minimizes contact with other people and lets you live in your bubble. This minimizes exposure to others from the same city and is a commercial system, not a publicly created one. Democratization would be cheap public transport where you don't get mugged, which has been proven to work in every European and most Asian cities.
> I admit I don't know anything about this one
The Cybertruck. Again, a vehicle to isolate you from everyday life, being supposedly bulletproof and all.
> lots of Republicans, including Trump, do the same thing, and they're not libertarian
They are all "small government, individual choice" - of course they feed their masters, but the Kochs and co. want exactly this.
Appreciate the exchange too, thanks for the fact-based formulation of opinions.
Silicon Valley is not "far left" by any stretch, which implies socialism, redistribution of wealth, etc. This is obvious by inspection.
I assume by far left, you mean progressive on social issues, which is not really a leftist thing but the groups are related enough that I'll give you a pass.
Silicon Valley techies are also not socially progressive. Read this thread or anything published by Paul Graham or any of the AI leaders for proof of that.
However, most normal city people are - a large enough percentage of the country that big companies that want to make money feel the need to appeal to them.
Funnily enough, what is a uniquely Silicon Valley political opinion is valuing the progress of AI over everything else.
when i think of "far left" i think of an authoritative regime disguised as serving the common good and ready to punish and excommunicate any thought or action deemed contrary to the common good. However, the regime defines "common good" themselves and remains in power indefinitely. In that regard, SV is very "far left". At the extremes far-left and far-right are very similar when you empathize as a regular person on the street.
Techies are socially progressive as a whole. Yes, there are some outliers, and tech leaders probably aren't as far left socially as the ground-level workers.
I find them in general not to be Republican, with all the baggage that entails, but the typical techie I meet is less concerned with social issues than the typical city Democrat.
If I can speculate wildly, I think it is because tech has this veneer of being an alternative solution to the world's problems, so a lot of techies believe that advancing tech is both the most important goal and politically neutral. Also, now that tech is a uniquely profitable career, the types of people who would have been business majors are now CS majors, i.e. those who are mainly interested in getting as much money as possible for themselves.
I mean, if you compare them to people who live in bigger cities, and only the people who belong to the more left-leaning party, then yeah, maybe. It's like saying a group isn't socially conservative because they're not as socially conservative as rural Republicans.
I think we're just talking about two different things.
The implication I got was that techies are more left than normal, and it's the opposite. If I meet someone in my city and they're in tech, they tend to be less progressive than the people I meet who are not. The people in this thread are fairly representative of what I see in real life, and it's not particularly inspiring.
I disagree that techies are socially progressive as a whole; there is very minimal, almost no, push for labor rights or labor protections, even though our group is disproportionately hit by employee abuse under the visa program.
Labor protections are generally seen as a fiscal issue, rather than a social one. E.g. libertarians would usually be fine with gay rights but against greater labor regulation.
They are a business and operate in the objective reality that a product that can generate imagery like child porn draws intense scrutiny from investors and lawmakers. Not to mention they have their own moral compass and don't feel comfortable giving away a tool that they feel can do harm.
The talk of "safety" and harm in every image or language model release is getting quite boring and repetitive. The reasons why it's there is obvious and there are known workarounds yet the majority of conversations seems to be dominated by it. There's very little discussion regarding the actual technology and I'm aware of the irony of mentioning this. Really wish I could filter out these sorts of posts.
Hopefully it dies down soon, but I doubt it. At least we don't have to hear garbage like
"WHy doEs opEn ai hAve oPEn iN thE namE iF ThEY aReN'T oPEN SoURCe"
I hope the safety conversation doesn't die. The societal effects of these technologies are quite large, and we should be okay with creating the space to acknowledge and talk about the good and the bad, and what we're doing to mitigate the negative effects.
In any case, even though it's repetitive, there exists someone out there on the Interwebs who will discover that information for the first time today (or whenever the release is), and such disclosures are valuable. My favorite relevant XKCD comic: https://xkcd.com/1053/
I get that, but it just overshadows the technical stuff in nearly every post, and it's just low-hanging fruit to have a discussion over. But you're probably right with that comic; I spend so much time reading about AI stuff.