Megaface

samwillis · on Jan 2, 2023

One of the difficulties with these training datasets is the currently understood rules around web scraping. The current legal precedent [0] is that web scraping is perfectly legal, despite what is in the websites terms of service, "licence" or robots.txt. If a human can navigate to it freely, you can scrape it using automated means.

What you can't do with scraped data is republish it verbatim. Doing a data analysis on scraped data is permitted by law, and you can publish your analysis of that data.

The question is, is an AI model trained on scraped data a derived analysis that is therefore legal? Or is it republishing of the original data? We need a test case to find out.

In the case of this dataset, I don't think the CC license applies to people using it. It "may" apply to redistribution of it for free. If the dataset was sold, that would be a violation. I suspect (after tested in court) a model trained on this dataset would be allowed despite the CC license on the photos.

Personally, in this case I think the ethics committee of the University should have put up barriers to the project. The morals of this are questionable at best.

0: https://techcrunch.com/2022/04/18/web-scraping-legal-court/

satvikpendem · on Jan 2, 2023

There was an updated ruling in November 2022 showing that HiQ was ruled against and that they reached a settlement with LinkedIn, so I'm not sure that web scraping is entirely legal.

https://www.natlawreview.com/article/hiq-and-linkedin-reach-...

EMIRELADERO · on Jan 2, 2023

That was because Linkedin added a no-scraping clause to their ToS and also put up a login wall for viewing profiles in the first place.

If you scraped from a web page without actually signing up for an account you wouldn't be accepting the terms and would thus be legally in the clear.

drusepth · on Jan 3, 2023

>One of the difficulties with these training datasets is the currently understood rules around web scraping. The current legal precedent [0] is that web scraping is perfectly legal, despite what is in the websites terms of service, "licence" or robots.txt. If a human can navigate to it freely, you can scrape it using automated means.

So... the HN crowd seems perfect to ask, since everyone's always clamoring for a Goodreads alternative: where is the line between "data" and "content" (on public pages)? AFAICT, Goodreads has been entrenched for so long because their ToS doesn't allow anyone else to use their book/reviews data, which makes quite a large moat. This comment would make me think it's legal to scrape that moat.

pbhjpbhj · on Jan 2, 2023

>Or is it republishing of the original data?

If it's publishing _data_ then you're fine under regular copyright as it only protects artistic works and not things like data. You might fall shy of other IP legislation but not copyright.

YMMV, this is not legal advice and represents my personal opinion unrelated to my employment.

cmeacham98 · on Jan 2, 2023

The "data" here is photographs, which all jurisdictions I'm aware of treat as coprightable.

pbhjpbhj · on Jan 3, 2023

FWIW in UK it's possible for works to be too generic to attract copyright, or to be slavish reproductions (eg as photos).

ID images have their parameters dictated by technical needs - no smiling, plain background, even lighting, no eyewear, head only, face-on - and so leave no room for artistry.

An ID photo might lack copyright.

I know of no caselaw here (on copyright in ID shots) and am projecting from eg the "red bus" case (Temple Island v New English Tea).

Mtinie · on Jan 2, 2023

Which makes this case even more interesting to me. Some percentage of those photos’ copywrites’ are owned by corporations rather than the pictured individuals.

If it was simply a large group of selfies, I don’t expect much legal challenge from the allegedly aggrieved. But when companies with legal counsel get involved…

messe · on Jan 3, 2023

Yep, in the EU you can run afoul of database rights. This is separate from copyright.

fragmede · on Jan 2, 2023

The CFAA would be the thing to look out for.

langsoul-com · on Jan 3, 2023

There was an article talking about how big companies fund researchers to harvest datasets and train models. Then those companies use those models.

Even though, in the tos, it's not for commercial use.

KRAKRISMOTT · on Jan 2, 2023

You can do the scraping in a jurisdiction where it is legal.

pbhjpbhj · on Jan 2, 2023

Importing (in the geographical sense) the data would still be infringing, you've just scraped it in a convoluted way -- legal systems in my limited experience take account of such things.

traceroute66 · on Jan 2, 2023

> You can do the scraping in a jurisdiction where it is legal.

No such thing with GDPR.

Why do you think so many US websites take the lazy-ass approach and block EU visitors to their websites ?

Simple, its because either you comply with GDPR or you don't process the information of citizens of GDPR covered countries. End of story.

laingc · on Jan 2, 2023

Well, no, only if you’re under the jurisdiction of the EU courts. They can rule against you as much as they like, but it’s not enforceable outside of the EU or a jurisdiction that chooses to enforce EU judgements.

traceroute66 · on Jan 2, 2023

> Well, no, only if you’re under the jurisdiction of the EU courts.

That is an awfully naïve argument my friend.

If it were that simple then there would, for example, be no need for a 68 page document entitled "The Sedona Conference Commentary on the Enforceability in U.S. Courts of Orders and Judgments Entered under GDPR"[1].

Allow me to quote from the Conclusion on page 68:

"As the Commentary shows, the enforceability of GDPR orders and judgments in a U.S. court will depend on several factors, including the nature of the relief sought through the order or judgment, the nature of the underlying violation and the process through which the order or judgment was initially obtained in the EU, and the U.S. organization’s contacts with the EU."

I would say that makes it pretty darn clear that it's far from being a simple argument about the jurisdiction in which the defendant is based.

[1]https://www.dorsey.com/~/media/files/newsresources/publicati...

int_19h · on Jan 3, 2023

It's not about jurisdiction in which the defendant is based, of course, but rather in jurisdiction where it has presence.

If foreign laws were enforceable against actors who never operated in those other countries, we'd have to enforce Saudi laws against atheism and Russian laws against "gay propaganda".

TedDoesntTalk · on Jan 3, 2023

There are literally millions of US companies with no EU presence, but who have online ordering of digital products. To the parent: Good luck enforcing GDPR on them. They won’t even abide by an EU subpoena that asks “what PII are you storing?” And right to delete? Haha, good luck.

TylerE · on Jan 3, 2023

Why should they? What happens when, say, an Arab state passes a law that selling pornography gets a body part chopped off.

TedDoesntTalk · on Jan 3, 2023

Good example. In the US, We thank our circumstances that we were not born in Saudi Arabia. And that’s about it. We buy porn and keep our body parts. The parent traceroute66 has no idea what he’s talking about.

pixl97 · on Jan 2, 2023

If I'm in China and I scrape/collect data I don't think the GDPR is going to do anything to me. This really only affects businesses that some the EU has some means of reaching.

JumpCrisscross · on Jan 2, 2023

Beyond copyright, how would these requirements work with Illinois’ biometrics law?

the_duke · on Jan 2, 2023

It's not as easy as that.

Pictures are clearly personally identifiable data, so storing them violates the GDPR if you don't have permission to do so.

Some "data analysis company" got fined a hefty sum for doing so with EU citizens.

I forgot the name, but they were recently in the news for helping Ukraine identify Russian soldiers by picture.

Of course they were also aggregating other data including names, so just pictures might be a more complicated case, but as a company with EU exposure I wouldn't do it. It's pretty clearly against the law.

samwillis · on Jan 2, 2023

You are quite right, forgot that one.

Point is though, we need a test case to go through the courts to clarify all of this. There are companies betting billions on the outcome that they are ok to do what they are doing.

voakbasda · on Jan 3, 2023

They are betting they can make billions of dollars while the courts screw around trying to decide the legality. They don’t care if it is legal; they only care about making money today. This is the standard playbook for big money these days: profit until the courts decide you can’t.

fleddr · on Jan 2, 2023

"Pictures are clearly personally identifiable data, so storing them violates the GDPR if you don't have permission to do so."

Wouldn't a Creative Commons license express this permission?

kixiQu · on Jan 2, 2023

IANAL, but I believe no; the CC license handles the rights that a photographer can hand out, but doesn't come with any kind of model release guarantees.

fleddr · on Jan 2, 2023

Model release is a good point but in many situations where people are photographed it does not apply. When you make photos of yourself or your family or even of people in public spaces, you do not require a model release. And I imagine this to be the main input of this training set.

When you hire a model, photograph the person and then use these photos for promotion or commercial activities, you do require a model release. But in that case it would be absurdly weird to publish such commercial material as CC NC on Flickr, makes no sense.

kixiQu · on Jan 3, 2023

I recommend https://ipo.blog.gov.uk/2019/06/11/copyright-and-gdpr-for-ph... and https://www.mondaq.com/germany/data-protection/714306/does-t... here for a sense of how the PII consent framework here is different than commercial licensing.

charcircuit · on Jan 2, 2023

A ML model would be considered transformational.

janalsncm · on Jan 2, 2023

Depends? What about Naive Bayes?

contravariant · on Jan 3, 2023

Why would that be less transformational than other models?

sschueller · on Jan 2, 2023

This datasets usage and creation violates Swiss law [1]. Any person in Switzerland has the right to their face in any picture taken now and any time in the future even if taken by someone else. Without the explicit consent of a person, their face may not be used or published in anyway or form. There are only a few exceptions like for public figures and celebrities but even then they also have a right to privacy.

SRF once did a segment about face recognition and public photos from social media. Under strict supervision and journalism protection they created a data set and showed what was possible. The dataset and code was then destroyed. [2]

Similar laws exist in EU states as well.

[1] https://www.edoeb.admin.ch/edoeb/de/home/datenschutz/Interne...

[2] https://www.srf.ch/news/schweiz/automatische-gesichtserkennu...

p-e-w · on Jan 3, 2023

> This datasets usage and creation violates Swiss law

And drinking alcohol violates Saudi Arabian law.

Unless the institutions involved are subject to Swiss jurisdiction (which doesn't appear to be the case), this doesn't mean much. If anything, the blame lies with the Swiss government, which allowed a foreign company (Flickr) to operate in Switzerland without adequate guarantees that this company would protect Swiss users' data from unlawful applications.

But like most other governments, they were happy to let US tech giants bait their own citizens into giving them boatloads of personal data, which ended up outside of the regulatory control of said governments. Any complaints now are just posturing. The time to act was 15 years ago.

smaudet · on Jan 3, 2023

Can't agree, if an assassin illegaly enters a country and commits a crime, the blame falls on the miscreant for their behavior not the country for failing to notice the deceptive miscreant.

You would have to demonstrate the Swiss govt was made aware of Flickr before they were a thing there and then failed to ban them, for this line of reasoning to make any sense...

p-e-w · on Jan 3, 2023

Is that supposed to be a joke? Privacy advocates around the world have been rallying against Big Tech's practices for almost two decades. There have been hundreds of well-documented incidents where technology corporations violated privacy laws, either with complete impunity or with "punishments" that can be filed away as "cost of doing business".

Yet not a single government thinks that maybe, companies that repeatedly break the law shouldn't be allowed to do business in their jurisdiction at all.

This is 100% on the respective governments, who have been made aware countless times but have chosen to all but ignore the problem.

bawolff · on Jan 3, 2023

I don't think that is comparable. Your example is literally happening on the countries soil.

I think a better physical analogy would be handling the sale of imported foreign products that dont meet safety standards.

gardenhedge · on Jan 2, 2023

It strikes me as sad that people's photos have been taken a used to train a technology for a corporate profit. People just wanted to share their wedding photos.

ironmagma · on Jan 2, 2023

It's no worry, someday that data will be so ubiquitous and well-studied that it won't even be profitable, it will just be trivial to construct or deconstruct any face.

fleddr · on Jan 2, 2023

I agree, but why would one share wedding photos using an open license like Creative Commons?

jefftk · on Jan 2, 2023

In this case, because Flickr used to wave hosting fees for people who chose a Creative Commons license.

tomrod · on Jan 2, 2023

Because regulations are generally unknown to folks who don't spend their time solving tech problems.

People simply assumed they could share it easily with friends and family.

fleddr · on Jan 2, 2023

This doesn't match my experience, and I run a photo community myself.

You're absolutely right that people are generally fairly clueless about licenses, especially in the amateur domain. And the main implication of that is that they don't bother with it at all and leave it at whatever the default is, which typically is "copyrighted, all rights reserved".

Those explicitly tinkering with licenses, which is a purposeful action, tend to actually know (somewhat) what they are doing.

Further, if you leave a photo's license to its default, copyrighted, absolutely nothing stops you from sharing it with friends and family. What would happen? You share it with them and then sue yourself?

Similarly, somebody you don't even know could use your copyrighted image and post it on social media. Again, nothing happens, as this is widespread behavior and called "fair use", which it legally absolutely isn't. But nobody cares, as nobody will sue over it unless there is a case of vast commercial usage.

sowbug · on Jan 3, 2023

Do ad blockers also make you sad?

teolandon · on Jan 3, 2023

How is that comparable in any way?

sowbug · on Jan 3, 2023

Both cases use web content in ways other than their publishers intended.

pksebben · on Jan 3, 2023

calling the urls in a Blocklist is kinda stretching the definition of 'content' though, innit?

sacrosancty · on Jan 2, 2023

Replace "corporate profit" with "social good", which is what it generally comes from, and then is it still sad?

You seem to imply there's something wrong with corporate profit. We as society want and encourage corporate profit because we want the social good that corporations provide and the profit incentivizes them to do it. Profit is a rough measure of how much good they do for people.

Profit is like salary for investors. Salary is fine for doctors and teachers, isn't it? It's also fine for investors which do the useful and difficult job of deciding which companies are doing the most good, then encouraging them to do more of it by investing money.

mikeiz404 · on Jan 3, 2023

Corporate profit does not always lead to social good. There are lots of cases, especially when companies scale significantly and over a long enough period of time, where the profit motive leads to a decline in social good which is often seen in the form of negative externalities.

> Profit is a rough measure of how much good they do for people.

While this is true to a certain degree in many situations it does not capture the distribution of the “good” across people (is “good” given to 100 people equivalent to 100x that “good” enjoyed by one person?) and it does not take into account negative externalities.

I’m not saying all profit is bad but I also don’t think, in the current system we have, that profit in and of itself is inherently good. Sometimes profit and social good are aligned and sometimes they are not.

gardenhedge · on Jan 3, 2023

Ah yes, the "social good" that is literally destroying our planet.

rsync · on Jan 2, 2023

What's going to happen when (not if) it becomes cheap and simple to mock up your own head and you "present" that in multiple locations, simultaneously ?

It's interesting to think about how these systems (and their human operators) will react when their system recognizes, with certainty, that X is in two places (or 15) at once ...

... or if X is recorded somewhere (Zurich) and then two hours later at an impossible distance (San Francisco) ...

In a way, it's the opposite of the "Sigil" plot device in Gibsons _Zero History_ wherein the wearer was invisible to security camera networks.[1] Instead, the operator of this network of clones aspires to be on as many cameras as possible.

[1] https://en.wikipedia.org/wiki/Zero_History

smaudet · on Jan 3, 2023

Indeed - the (security/social/economical) issues these systems will pose are nearly boundless, something as ill considered and will in hindsight be as obviously shortsighted and the early naivete of the first internet users, who saw no issue with unencrypted communications, centralized unscaling systems, and complete lack of security credentials...

cshimmin · on Jan 2, 2023

Hmmm...

    June 11, 2020: MegaFace dataset is now decommissioned. University of Washington has ceased distributing the MegaFace dataset citing the challenge has concluded and that maintenance of their platform would be too burdensome.

colesantiago · on Jan 2, 2023

"All photos included a Creative Commons licenses, but most were not licensed for commercial use."

I wonder what the implications are for Stable Diffusion, DALLE and Midjourney since that art images on the internet are copyrighted by default.

Even with a fair use argument, there are examples in cases where AI was generating art that included the signatures of artists.

https://nwn.blogs.com/nwn/2022/12/lensa-ai-art-images-withou...

gwern · on Jan 2, 2023

This is apples and oranges. SD et al are defended on the grounds of being transformative use (https://en.wikipedia.org/wiki/Transformative_use): they do not distribute (ie copy) the original training images, and they are not a derivative work due to transformativeness, so the license of the original images is completely irrelevant. (Details like 'signatures' are also irrelevant: if I write a style parody of William Shakespeare and add a '--Willy Shakespeare' at the end to round it off, have I revealed that I have secretly copied his work? Of course not. It's just plausible that there would be a name there, so I came up with a name.)

The criticism here is that distributing (copying) the original image violates the non-commercial clause of the original images because someone, somewhere, might somehow have made money in some way because the dataset exists; but as they somewhat lamely acknowledge later, what counts as 'commercial' has never been clearly defined, and it probably can't be defined (because for most people 'commercial' seems to be defined by 'ewww'), and this is why CC-NC licenses are heavily discouraged by WMF and other FLOSS groups and weren't part of FLOSS from the beginning even though Stallman was in large part attacking commercial exploitation.

pbhjpbhj · on Jan 2, 2023

>if I write a style parody of William Shakespeare and add a '--Willy Shakespeare' at the end to round it off, have I revealed that I have secretly copied his work? //

I doubt you're suggesting SD, Dall-E, etc., are producing parodies so bringing in parody considerations muddies the water a lot. Also, Shakespeare's works are out of copyright.

If you sell a painting signed with a [facsimile] signature of Dali then it's pretty hard to say you didn't copy the signature, as a minimum. Thats likely to be a trademark violation too. Now, suppose you include aspects in the image specifically associated with the artist, and a signature, ... there's no way to genuinely deny that is a derivative.

theptip · on Jan 2, 2023

> a painting signed with a [facsimile] signature of Dali

That's not what's happening here though.

If you look at the original tweet (https://twitter.com/LaurynIpsum/status/1599953586699767808) it seems that the complaint is about the "mangled remains of an artist’s signature". I don't see any examples where it's actually copying the signature of a specific artist.

(Please do share an example of that if there is one.)

return_to_monke · on Jan 2, 2023

I do respect artist's concerns. I have a hard time getting this one though. The ai learned that humans usually put squiggly lines in the corners, and it does, too. What is wrong with this?

Filligree · on Jan 3, 2023

It's reasoning based on a false assumption. If you start from the assumption that AI imagery is based on specific source images, which had signatures, then it can make sense to assume that the squiggly lines in the corner are evidence that it copied-and-mangled some such specific pictures.

That's not what it did, of course. But if you start from the assumption...

colesantiago · on Jan 2, 2023

Nobody is complaining about William Shakespeare or public domain works that are out of copyright. The issue is that there is clearly copyrighted works in the model that living artists have not consented to being in the model.

> SD et al are defended on the grounds of being transformative use they do not distribute (ie copy) the original training images, and they are not a derivative work due to transformativeness, so the license of the original images is completely irrelevant.

If this is irrelevant, why is Stability.AI creating an "opt out" system for artists? It's almost like they know they are copying digital artists works without their explicit consent. This wouldn't be an issue if they only used images in the public domain.

https://www.technologyreview.com/2022/12/16/1065247/artists-...

And we both know that Stable Diffusion have Dreambooth which you have to pay to use the platform, so this is clearly not fair use.

https://platform.stability.ai/docs/getting-started/credits-a...

Filligree · on Jan 3, 2023

> If this is irrelevant, why is Stability.AI creating an "opt out" system for artists? It's almost like they know they are copying digital artists works without their explicit consent. This wouldn't be an issue if they only used images in the public domain.

It might also be because there's a huge backlash against the concept of stable diffusion in certain circles. Whether or not that backlash matches the legalities, or if their logic is sound at all, has absolutely nothing to do with it.

gwern · on Jan 3, 2023

> Nobody is complaining about William Shakespeare or public domain works that are out of copyright.

They definitely are, and I deliberately chose a public domain writer for my example to get away from the narrow procedural grounds and focus on the issue: does generating a 'signature' or 'watermark' prove copying? No, of course not, and that's true whether you are in or out of the public domain.

> If this is irrelevant, why is Stability.AI creating an "opt out" system for artists?

What is prudent or nice or moral or good PR != legal. (And vice-versa, of course.)

schemescape · on Jan 2, 2023

Does anyone know if attempts have been made to trick these ML models into reproducing original copyrighted inputs verbatim (edit: or close enough)?

Edit: Asking about verbatim copies wasn't really a great question. I should have asked about producing things that are "close enough to cause legal trouble" (whether that be due to copyright, trademark, or something else).

gwern · on Jan 2, 2023

There's been a lot of work on memorization, yes, and you can also do nearest-neighbor lookups in the original data to gutcheck 'memorization'. As usual, the answer is "it's complicated" but for most practical purposes, the answer is 'no': you will get the Mona Lisa if you ask for it, absolutely, but the odds of a randomly generated image being a doppelganger is near-zero. (If you've seen stuff on social media to the contrary, then you may have been misled by various people peddling img2img or 'variation' functions, or prompting for it, or other ways of lying/ignorance.)

But you certainly can get things like watermarks without any real memorization. Watermarks have been a nuisance in GANS, for example - the StyleGAN CATS model was filled with watermarks or attempted meme text captions, even though the cats were so nightmarish that they obviously weren't 'plagiarized' so nobody made a big deal about it back then and they understood the GAN had simply learned that watermarks were a thing in many real images and it would try to imitate them where plausible in a sample.

int_19h · on Jan 3, 2023

If there's a specific prompt that makes the model produce such an image (without img2img), why shouldn't it count? The question isn't whether the model produces such things randomly, but rather whether it's capable of producing them in principle, even if it requires a very elaborate prompt.

gwern · on Jan 3, 2023

> If there's a specific prompt that makes the model produce such an image (without img2img), why shouldn't it count?

There's a specific prompt which makes a human artist produce such an image without img2img too: "Please draw the Mona Lisa". There - you just did it in your head while reading this comment!

int_19h · on Jan 3, 2023

So? Humans aren't copyrightable works. But a bunch of weights that constitutes the model is just data, and that data can very well contain copyrighted works. The question is whether it does in any meaningful sense. And if it can reproduce them verbatim with the right prompt, I don't see how the answer could be "no", anymore so than a password-protected archive of a copyrighted JPEG.

polygamous_bat · on Jan 2, 2023

One, the diffusion model's possible output space contains every RGB image ever. But two, it cannot ever possibly contain the original inputs verbatim, because (the size of the model)/(the size of the training set) comes out to be something like 0.2 KB per image. Unless it's an incredible compression algorithm, diffusion necessarily have learned something from the input rather than copy-pasting things, as claimed upthread.

TeMPOraL · on Jan 2, 2023

> Unless it's an incredible compression algorithm, diffusion necessarily have learned something from the input rather than copy-pasting things, as claimed upthread.

Arguably, "learning" and "compression" are the same thing.

In this sense, you can view SD as a compression algorithm where the decoder is the model, the compressed file is the prompt + tweakable params, and there aren't any error checks made, so you can feed random data into the decoder and get something out.

int_19h · on Jan 3, 2023

Or maybe the compressed file is the model, and prompt + tweakable params is a "path" inside that compressed file?

schemescape · on Jan 2, 2023

I edited my post a while ago, but I shouldn't have ask about "verbatim" copies. See reply to sibling for a more interesting question.

sdenton4 · on Jan 2, 2023

That's not really how memorization in neutral networks works. For classifiers, memorization is more like learning a hash function and a lookup table; no need to store the full image at all. Even for very large models, the weights are a tiny fraction of the size of the original data.

It's probably helpful to think of embeddings for generative models in a similar way; it's a very specific embedding function, like a locality sensitive hash, which doesn't require actually storing the data.

schemescape · on Jan 2, 2023

Thanks. Yes, I shouldn't have asked about "verbatim" copies -- I should have asked about something more like "close enough to cause legal trouble". Obviously copying verbatim is a violation of copyright, but there must be some threshold of "close enough" that is still problematic. E.g. compressed MP3s of copyrighted songs aren't a verbatim reproduction, but as far as I'm aware they're still covered by copyright.

Trademarks are even broader.

makapuf · on Jan 2, 2023

A band playing publicly a reinterpretation of a famous band will get sued even if its not verbatim.

schemescape · on Jan 3, 2023

That’s an even better example.

Scaevolus · on Jan 2, 2023

That's a known failure mode called "overfitting" or "memorization", where a specific input text is very accurately reproduced.

I'm not aware of it occurring for any copyrighted inputs, but it occurs for many famous artworks-- it's nearly impossible to convince Stable Diffusion to style "Mona Lisa" at all.

theptip · on Jan 2, 2023

> distributing the original image violates the non-commercial clause of the original images because someone, somewhere, might somehow have made money

I agree with the rest of your post, but this point seems a bit uncharitable.

I think the claims would be:

1. It's a breach of copyright for Megaface to share the images in any case without attribution & replicating the CC-NC license. It would (presumably) be OK assuming Megaface were to correctly apply the CC-NC licenses to the dataset.

2. It's a breach of copyright for anyone consuming Megaface (e.g. Google) to use those images for commercial purposes.

And your argument for SD applies to 2. that regardless of license, it's OK to create a transformative work. But it still doesn't get Megaface off the hook for 1. - distributing those images without the license.

polygamous_bat · on Jan 2, 2023

> Even with a fair use argument, there are examples in cases where AI was generating art that included the signatures of artists.

I went through the post, and I am not sure whether I agree with the analysis of the examples. Diffusion models are conceptual parrots, and it is possible that "25% images contain a scribble in the bottom right corner, so the model will make a scribble in the corner" is what is being construed as a signature in this post.

I think a large part of outrage from the artists about diffusion model "stealing" art comes from a place of disbelief that machines can be this good without "stealing", and it's perfectly natural. In fact, it's unnatural to me how good machines have gotten in image generation, and it is a field I've been following for five years now. However, because I understand the model and can implement it myself, I can convince myself it doesn't need to steal, just needs to be able to model correlations at some ungodly level.

zowie_vd · on Jan 3, 2023

>I think a large part of outrage from the artists about diffusion model "stealing" art comes from a place of disbelief that machines can be this good without "stealing"

I think when you make machines that automate away some people's passion and purpose in life, of course they're going to be upset. When, on top of that, the machine automating their work is a "conceptual parrot" that is parroting the concepts they invented without their permission, of course they're going to be pissed off.

Besides, whilst AI image generators don't steal exact elements from the training data, they do basically steal the artstyles and subject matter of illustrators, which takes them many years to foster. Imagine an illustrator of fantasy book covers hardly being able to find work anymore because some publishers figured out that, instead of hiring him, they could cheaply hire an unskilled person from a third world country to type the illustrator's name into an AI image generator so that it imitates his works, along with a few keywords for the book in question, until eventually something good pops out. That is currently considered fair use in US copyright law, but in my opinion it is nonetheless so unjust that it justifies calling it "stealing".

ilikehurdles · on Jan 2, 2023

> Stability AI is happy to follow copyright laws for their music model, because they know that music labels will hold them accountable. So this seems like a good time to point out to larger companies like @WaltDisneyCo that their copyrighted material is being stolen and used too

I mean this is a pretty good point. If they're so sure this is legal, then train on copyrighted audio+video media as they already do with copyrighted visual media.

zarzavat · on Jan 2, 2023

Avoiding doing something because you don't want to get sued and subjected to a lengthy court battle is completely rational and it doesn't mean that doing that thing is illegal.

For example, for decades many TV shows came up with their own lyrics for Happy Birthday song, even though it was well known that the song wasn't copyrighted, because nobody wanted to get sued and fight that battle. Easier to just change a few words in the script.

int_19h · on Jan 3, 2023

It's not just legal issues that'll lead to self-regulation. It's also stuff like this:

https://imgur.com/a/vAqCUP7

devalidating · on Jan 3, 2023

Simply put, there is less information in music. Copyright infringement cases have been won on as little as a handful of notes. So the likelihood of a copyrightable element from the underlying showing up in the model's output is that much higher. A single melody represents a similar portion of all possible melodies as does a color all possible colors and certainly less than the fraction represented by a single color palette. Yet you cannot copyright a color palette but you can a melody.

Copyright law regarding audio media is entire different beast than visual media; the case law and fair use technicalities do not always line up as simply as "if its legal on visual media it must also be legal on audio."

georgeglue1 · on Jan 2, 2023

Are there any licenses that are generally permissive, but prohibit certain programmatic, law enforcement, government, etc. usecases?

It'd be interesting legal territory if someone has tried this already.

542458 · on Jan 2, 2023

IANAL.

I don’t think you can prevent scraping or use in ML corpuses in this way. Copyright prevents the creation of non-transformative copies of a work other than some protected use cases (parody, education, etc). All OSS licenses do is provide a right to copy a work provided certain conditions (attribution, copy left) are met. But the general legal consensus as far as I know is that most ML models meet the threshold for being a new transformative work, so copyright doesn’t apply. Accordingly, you can’t use copyright to prevent something from being part of a ML corpus.

That said, I if your question is broader than the article… if you’re just talking about non-transformative uses (I.e., just using open source software) I don’t see any reason why you couldn’t create a license that doesn’t allow software to be deployed into certain environments. Some examples:

https://www.cs.ucdavis.edu/~rogaway/ocb/license2.pdf

https://www.linux.com/news/open-source-project-adds-no-milit...

No idea how these would do in court though.

AlexandrB · on Jan 2, 2023

> But the general legal consensus as far as I know is that most ML models meet the threshold for being a new transformative work, so copyright doesn’t apply.

Has this been tested in court yet?

jefftk · on Jan 2, 2023

It hasn't yet. I think this is the central claim of the GitHub co-pilot suit.

There's a prediction market on whether the suit will be successful, which is currently at 43%: https://manifold.markets/JeffKaufman/will-the-github-copilot...

dragonwriter · on Jan 2, 2023

> Copyright prevents the creation of non-transformative copies of a work

It also prevents transformative derivatives.

Both nontransformative copies and transformative derivative works may meet (in the US) the exception for fair use, which is the usual argument for nonlicensed use in ML training.

Tepix · on Jan 3, 2023

I think many (if not most) people are opposed to their images being used for facial recognition related purposes (training, evaluation, services, …)

I hope creative commons will create a new license like that explicitly prohibits the use of the protected works for these purposes. Of course, in this particular case a university gathered the images (violating the citation requirement) and then companies used them commercially, violating another aspect of the license. I guess UWash is at fault for not stating that Megaface cannot be used for commercial purposes?

Are the researchers at UWash personally liable? Is the university? If we get a couple of deterring judgements there's a chance this will stop future violantions.

re-lre-l · on Jan 3, 2023

I agree. But one has to understand that if something has been uploaded into internet it's going to be public by default - this is how I treat the internet. And no license will solve it.

underlines · on Jan 3, 2023

Challenge this:

If I am legally allowed to look at a million pictures to learn how to draw in different art styles, or even how to imitate a very certain art style, I basically train weights and biases in my brain.

If StabilityAi does the same with pictures available online, and release the weights and biases as Stable Diffusion, how is this different from humans learning from that data?

dandelany · on Jan 3, 2023

Law does not work by analogy. The difference is obvious - one of them is an automated computer system, the other is a human. As mentioned above it is not clear whether use/redistribution of this data is legal. But generally speaking, drawing an analogy between a piece of software and a human being is not an argument that will hold any weight in court.

p-e-w · on Jan 3, 2023

This is the correct answer to any question such as the above. The law is an instrument of power. At the most basic level, its purpose is to uphold the power of a select group of individuals and institutions. The law's purpose is not to bend to logical arguments. If it did, those with the most complete grasp of logic would have the most power, undermining what the law is ultimately designed to do.

Imnimo · on Jan 2, 2023

If I understand correctly, this dataset isn't even being used to train commercial facial recognition models, it's just being used to benchmark them? The implication seems to be that it should be illegal to even apply an algorithm (of any sort) to an image that you don't have a commercial license for?

bawolff · on Jan 3, 2023

I think that's part of the problem with non-commercial licenses. Its very confusing what it means.

It clearly means that you can't directly sell it. What about using non-commercial work on a page with ads? What about selling web hosting to someone using NC works?

The last one sounds rediculous, but what really is the difference from a strict logical perspective? You are making money off the non-comercial work after all as without it there would be no web hosting to sell.

MuffinFlavored · on Jan 2, 2023

Slighty unrelated, how many "distinguishable/unique" faces/facial styles are there in terms of like "broad categories?"

Obviously hard to define but as somebody who moved around a lot growing up, I would catch my brain thinking I'd recognize somebody (quite often) only to remember I was in a totally different state than where the person I thought I was recognizing lived.

scotty79 · on Jan 3, 2023

That's interesting. If someone creates work A and doesn't allow commercial use, but you create your work B derived from A and you publish it without commercial use restriction then somebody uses your work B commercially, were the conditions that creator of A set violated? And if so, by whom?

Does it depend on how transformative was your work B?

Jerry2 · on Jan 2, 2023

It's unfortunate they removed it. Is there a public mirror/torrent of it by any chance?

hk__2 · on Jan 2, 2023

(2021)

> Last updated: Jan 23, 2021

jonplackett · on Jan 2, 2023

Anyone know have any stats on the ethnicities and genera of the people in this dataset?

Is this still widely used to test face recognition?

wongarsu · on Jan 2, 2023

There is the DiveFace dataset/metadata [1], which is a subset of the Megaface dataset with six equally sized groups: three ethnicities times two genders.

1: https://github.com/BiDAlab/DiveFace

williamscales · on Jan 2, 2023

How can I find out if one of my photos was used against its license? And then I suppose I can sue them, right?

hrkucuk · on Jan 2, 2023

Are all these faces white?

Zamicol · on Jan 2, 2023