Laion-400M: open-source dataset of 400M image-text pairs (laion.ai)
172 points by lnyan on Sept 13, 2021 | 20 comments



"are from random web pages crawled between 2014 and 2021" makes me wonder if someone could copyright an AI based on this, or if the AI will be considered a derived work, meaning that you'll need to pay royalties to the owners of those 400mio images


It seems to be an untested legal question, but it's definitely presumptuous of AI researchers to assume that it's fine. Maybe using the dataset to train an AI is fine, but I'd say it's pretty clear that distributing the training dataset itself isn't.

Nobody has sued yet though.


>Maybe using the dataset to train an AI is fine

If this is true, how much longer until we start purposefully training AI models to be overfit and start returning inputs verbatim?


I think a reasonable first guess for a legal interpretation is "would it be legal if a human did that?"

A human who reads an encyclopedia and uses that knowledge to answer questions isn't violating copyright, so an AI doing that is probably fine. If you look at all of Picasso's paintings and paint something in his style, that isn't violating copyright either, so training an AI on real Picassos to make Picasso-like paintings is probably fine too.

However, looking at a painting and perfectly replicating it is a copyright violation, and the same goes for reading a text and then writing down the same text. Both of those are just copies, not new works. So intentionally overfitting an AI so that it returns its inputs with minimal changes probably makes the outputs subject to the copyright of the training data's owners.
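
To make that concrete, here's a minimal sketch of what intentional overfitting looks like (assuming PyTorch; the "images" and the model are toy placeholders): an over-parameterized autoencoder trained long enough on a small fixed set will reproduce its inputs nearly verbatim.

    import torch
    from torch import nn

    # Toy stand-in for copyrighted training data: 8 fixed "images".
    data = torch.rand(8, 64)

    # More parameters than data points: given enough steps, the model
    # memorizes the training set rather than learning anything general.
    model = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 64))
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)

    for step in range(5000):
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(data), data)
        loss.backward()
        opt.step()

    # Reconstruction error is now ~0: the outputs are near-verbatim
    # copies of the inputs, i.e. just copies of the training data.
    print(nn.functional.mse_loss(model(data), data).item())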


We're already there. Copilot already does this, as has been found with various GPL violation probes.

Whether or not it's illegal is still an open question. The FSF says it absolutely is, but I think it's ultimately going to end up in court (though I'm just speculating).


The demo index puts Google Images to shame with its search-by-content feature.


Google Images used to work very well. It seems it was hobbled at some point, either intentionally or unintentionally.


Another good image search engine is Yandex. It can do semantic search by image, as opposed to mere duplicate search.


You can look at https://same.energy/ for a demo by an ex-OpenAI researcher of what CLIP-like NNs can do for image search.

It's impressive, but if Google Images doesn't do it yet, that's almost certainly down to the realities of corporate policy and of scaling it up and deploying it at Google Images scale, rather than as a small toy demo on a static set of a few hundred million images (as opposed to countless trillions of adversarially changing ones). It's certainly not for lack of good NNs. Remember, Google already has much better NNs than CLIP! ALIGN was over half a year ago, and they have MUM and "Pathways" already superseding that.

As always, "the future is already here, it's just not evenly distributed"; it's the practical reality of lag - the same way that RNN machine-translation models were massively better for years before Google Translate could afford to switch away from its old n-gram approach, that speech-transcription NNs were great long before they ever showed up in YouTube auto-captions, and that BERT was kicking ass in NLP long before it became a major signal in Google Search.


Not only does it produce really good results, the interface's ability to start another search from either the description or the image of any result is great for complex searches.


Those "search by description" links reveal a strange problem with the index:

1. Search for "monitor": https://rom1504.github.io/clip-retrieval/?back=https%3A%2F%2...

2. Note that there are several results with AOC model E2260SWDN.

3. Click the search button beside one of them.

4. Note that none of the results contain "E2260SWDN". Searching for "E2260SWDN" by itself gives the same result.

The back button doesn't work in this interface either. The URL changes, but it doesn't actually show the previous search.


That's interesting indeed! Note that both the description search and the image search use a kNN index over embeddings, not exact search. That helps with finding semantically similar items, but for exact reference matches it might not be the best solution. Re-indexing the dataset with something like Elasticsearch would give the reference-search results you expect.
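
Roughly, the difference looks like this (a sketch assuming faiss and precomputed CLIP embeddings; the vectors here are random placeholders):

    import faiss
    import numpy as np

    d = 512  # CLIP ViT-B/32 embedding size
    embeddings = np.random.rand(100_000, d).astype("float32")  # placeholder
    faiss.normalize_L2(embeddings)

    # An approximate kNN index of the kind the demo uses: it ranks by
    # closeness in embedding space, not by the literal caption text.
    quantizer = faiss.IndexFlatIP(d)
    index = faiss.IndexIVFFlat(quantizer, d, 256, faiss.METRIC_INNER_PRODUCT)
    index.train(embeddings)
    index.add(embeddings)

    query = embeddings[:1]  # e.g. the embedding of the text "E2260SWDN"
    scores, ids = index.search(query, 10)

    # A model number carries little semantic signal, so its nearest
    # neighbours in embedding space need not contain that exact token.
    # Exact reference lookup wants a text index over the captions
    # (e.g. an Elasticsearch match query) instead.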


Quite useful, I appreciate the contribution!


entered "woman" and started to wonder where they crawled this from


Even leaving the query blank shows a few NSFW things that suggest they really did just look for any and all image-caption pairs they could find!


a safe mode is now in place
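
(For the curious, one simple way to build such a filter over CLIP embeddings, as a sketch of one possible approach rather than a description of what's actually deployed: flag any result whose image embedding sits too close to the embeddings of a list of unsafe text prompts.)

    import numpy as np

    def is_safe(image_emb, unsafe_text_embs, threshold=0.25):
        # Cosine similarity between the image and each unsafe prompt;
        # the threshold here is made up and would need tuning on
        # labelled examples.
        image_emb = image_emb / np.linalg.norm(image_emb)
        unsafe = unsafe_text_embs / np.linalg.norm(
            unsafe_text_embs, axis=1, keepdims=True)
        return float(np.max(unsafe @ image_emb)) < threshold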


I tried a few keywords that shouldn't return NSFW results, such as "thumb", but they did. Not cool. This needs a 'safe mode'.


Great dataset, thanks!


I wonder how useful these datasets really are. For example, I used their "clip retrieval" tool (https://rom1504.github.io/clip-retrieval/) and searched for the first random word that came to mind: "hug".

The results were all some sort of anime/cartoon drawings (perhaps just an issue with the Common Crawl results).

But the question remains: would a model trained on this data accurately identify a photo of two humans hugging?


If you try "two humans hugging", it gives pretty good results, it seems. I tried "bicycle in the wood"; same thing.

I think I'm going to use this engine for all my future presentations ^^
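
If you'd rather sanity-check outside the demo, here's a rough zero-shot sketch with OpenAI's CLIP, the model the index was built with ("hug.jpg" is a placeholder for whatever photo you want to test):

    import clip
    import torch
    from PIL import Image

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess = clip.load("ViT-B/32", device=device)

    image = preprocess(Image.open("hug.jpg")).unsqueeze(0).to(device)
    texts = clip.tokenize([
        "a photo of two humans hugging",
        "an anime drawing of a hug",
        "a bicycle in the woods",
    ]).to(device)

    with torch.no_grad():
        logits_per_image, _ = model(image, texts)
        probs = logits_per_image.softmax(dim=-1)

    # A real photo of a hug should put most of the probability mass
    # on the first prompt.
    print(probs)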



