"are from random web pages crawled between 2014 and 2021" makes me wonder if someone could copyright an AI based on this, or if the AI will be considered a derived work, meaning that you'll need to pay royalties to the owners of those 400mio images
Seems to be an untested legal question but it's definitely presumptuous on the part of AI researchers to assume that it is fine. Maybe using the dataset to train an AI is fine but I'd say it's pretty clear that distributing the training dataset isn't.
I think a reasonable first guess for a legal interpretation is "would it be legal if a human did that?"
If a human reads an encyclopedia and uses that knowledge to answer questions, they aren't violating copyright, so an AI doing that is probably fine. If you look at all of Picasso's paintings and paint something in his style, that isn't violating copyright, so training an AI to make Picasso-like paintings by training it on real Picassos is probably fine.
However, looking at a painting and perfectly replicating it is a copyright violation. Same for reading a text and then writing down the same text. Both of these are just copies, not new works. So intentionally overfitting an AI so that it returns its inputs with minimal changes probably makes the outputs subject to the copyright of the owners of the training data.
We're already there. Copilot already does this, as has been found with various GPL violation probes.
Whether or not it's illegal is still an ongoing debate. FSF says it absolutely is illegal, but I think it's ultimately going to end up in court (though I'm just speculating).
You can look at https://same.energy/ for an ex-OAer showing what CLIP-like NNs can do for image search.
It's impressive, but if Google Images doesn't do it yet, that's almost certainly more about the realities of corporate policy and about scaling it up and deploying it at GI scale, rather than running it as a small toy demo on a static set of a few hundred million images (as opposed to countless trillions of adversarially changing ones). It's certainly not for lack of good NNs. Remember, Google has much better NNs than CLIP already! ALIGN was over half a year ago, and they have MUM and "Pathways" already superseding that.
As always, "the future is already here, it's just unevenly distributed"; it's just practical realities about lag - same way that the RNN machine translation models were massively better for years before Google Translate could afford to switch away from its old n-grams approach, or the speech transcription NNs were great long before they ever showed up on Google YouTube auto-captions, or BERT was kicking ass in NLP long before it became a major signal in Google Search, etc.
Not only does it produce really good results, the interface, with its ability to start another search from either the description or the image of any result, is great for complex searches.
That's interesting indeed!
Note that the description search as well as the image search both use a kNN index on embeddings, not exact search. That helps for finding semantically close items, but for exact reference matches it's indeed not the best solution.
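For context, the kNN side looks roughly like this: a sketch assuming faiss and precomputed CLIP embeddings (the arrays here are random placeholders, not the real index).

```python
# Sketch of nearest-neighbour search over precomputed CLIP embeddings.
# Assumes `pip install faiss-cpu numpy`; embeddings below are random placeholders.
import numpy as np
import faiss

d = 512                                   # CLIP ViT-B/32 embedding dimension
image_embeddings = np.random.rand(100_000, d).astype("float32")
faiss.normalize_L2(image_embeddings)      # so inner product == cosine similarity

index = faiss.IndexFlatIP(d)              # flat index; IVF/HNSW variants for billions of items
index.add(image_embeddings)

query_embedding = np.random.rand(1, d).astype("float32")
faiss.normalize_L2(query_embedding)
scores, ids = index.search(query_embedding, 10)  # the 10 semantically closest images,
print(ids[0], scores[0])                         # not exact caption matches
```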
Re-indexing the dataset with something like Elasticsearch would give the reference search results you expect.
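Something along these lines, for example: a rough sketch assuming the official elasticsearch Python client and a hypothetical `laion-captions` index, which returns captions that actually contain the query term rather than images that are merely nearby in embedding space.

```python
# Sketch of indexing captions into Elasticsearch for exact/keyword matching,
# as a complement to embedding kNN. Index name and documents are hypothetical.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Index each (caption, url) pair as a document.
es.index(index="laion-captions",
         document={"caption": "a warm hug between two friends",
                   "url": "https://example.com/img1.jpg"})
es.indices.refresh(index="laion-captions")

# A match query returns documents whose caption contains the term itself.
hits = es.search(index="laion-captions",
                 query={"match": {"caption": "hug"}})
for hit in hits["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["url"])
```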
I wonder how useful these data sets really are. For example, I used their "clip retrieval" tool (https://rom1504.github.io/clip-retrieval/) and searched for the first random word that came to mind — "hug".
The results were all some sort of anime/cartoon drawings (perhaps just an issue with the Common Crawl results).
But the question remains — would a model trained on this data accurately identify a photo of two humans hugging?
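If anyone wants to reproduce that kind of check programmatically rather than through the web UI, the clip-retrieval package ships a client for the hosted index. A rough sketch, assuming the ClipClient API from its README; the backend URL and index name are assumptions and may have changed.

```python
# Rough sketch of querying the hosted LAION index via clip-retrieval.
# Assumes `pip install clip-retrieval`; URL and indice_name are assumptions.
from clip_retrieval.clip_client import ClipClient

client = ClipClient(url="https://knn.laion.ai/knn-service",
                    indice_name="laion5B-L-14",
                    num_images=20)
results = client.query(text="two people hugging")  # also accepts image= for image queries
for r in results[:5]:
    print(r["similarity"], r["caption"], r["url"])
```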