>Today, we introduce Open Images, a dataset consisting of ~9 million URLs ... having a Creative Commons Attribution license* .
Then the footnote below:
>* While we tried to identify images that are licensed under a Creative Commons Attribution license, we make no representations or warranties regarding the license status of each image and you should verify the license for each image yourself.
I think this might be the most blatant instance I've ever seen of, "We have to write this even though it's essentially impossible for you to actually follow our directions."
Lawyers aren't so funny when they sue you for trivial things.
There are so many instances where people try to take advantage of things. This is just protection. And it's not like they're providing nine images. It's nine million.
The funny part is the suggestion that you go through each and every one of the 9 million images and verify its license. That's basically impossible - and by suggesting we do the impossible, they've covered their butts.
Any image whose originating source is impossible to find, that has already been used thousands of times on Tumblr, blogs, and websites, and that is under a certain resolution and size (a thumbnail, for example) falls into the category of fair use.
Nope. Everything is copyrighted by default, and if you don’t have permission (“license”) to copy it, it’s illegal for you to copy it. There are the “fair use” exceptions, but those have nothing to do with what you wrote.
I clearly misunderstood what is meant by "Would it be considered impossible to obtain permission from the original source?" as a criterion for fair use.[0] I've seen numerous images on Wikipedia that cite this as the reason for fair use. Someone uploads an image to 4chan anonymously, for example; without a path to ask permission, it might be fair use to use that image. This is different from someone posting an image to Facebook, where there is a way to ask the author for permission. Please explain my misunderstanding here.
Interesting that the base data consists of URLs. I guess it makes sense given the copyright issues. Anybody know the ballpark expected half-life of such URLs?
This is the tactic used by the CNN / Daily Mail dataset[1] released by Google DeepMind. In that situation you want everyone to have the same, original file, but the contents of a URL may be updated and/or disappear. The original URLs are recorded, but the files are retrieved from the Wayback Machine.
Even with that, a few of the pages are still essentially lost - or at least I was never able to retrieve them. Kyunghyun Cho hosts a copy of the processed data on his site at NYU, which may be less likely to receive legal requests than Google or a similar commercial company hosting them.
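For anyone trying to reconstruct such a dataset, the Wayback Machine exposes a public availability API that returns the archived snapshot closest to a requested timestamp. A minimal sketch (the endpoint is real; the helper names are my own):

```python
import json
import urllib.parse
import urllib.request

WAYBACK_API = "https://archive.org/wayback/available"

def availability_api_url(original_url, timestamp="2016"):
    """Build a query against the Wayback Machine availability API,
    asking for the snapshot closest to the given timestamp."""
    query = urllib.parse.urlencode({"url": original_url, "timestamp": timestamp})
    return WAYBACK_API + "?" + query

def closest_snapshot(api_response):
    """Pull the snapshot URL out of the API's JSON response, or
    return None if the page was never archived."""
    snap = api_response.get("archived_snapshots", {}).get("closest")
    if snap and snap.get("available"):
        return snap["url"]
    return None

def fetch_snapshot_url(original_url, timestamp="2016"):
    """Network round trip: query the API and return the archived URL."""
    with urllib.request.urlopen(availability_api_url(original_url, timestamp)) as resp:
        return closest_snapshot(json.load(resp))
```

Dead original URLs then only cost you the pages the Archive never crawled, which matches the "a few pages are still lost" experience above.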
Distribution of datasets is going to be a continuing pain point for machine learning. No one wants to be the legal guinea pig for determining exactly what counts as fair use in ML.
Exact cryptographic hashes (like MD5 / SHA-1 / SHA-2) don't make much sense here, as the images from the less obvious sources might be thumbnails of different sizes - or the same size but re-encoded (not always a good idea, but it happens in practice).
Would a fuzzy hash, like the output from a known and fixed classifier work for that?
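One concrete fuzzy option is a perceptual hash, e.g. an "average hash": shrink the image to a tiny grayscale grid, then record one bit per pixel for whether it's brighter than the mean. Near-duplicates (re-encoded or lightly resized copies) land within a small Hamming distance, unlike MD5/SHA digests which change completely on any byte difference. A plain-Python sketch, assuming the resize-to-8x8-grayscale step has already been done with an imaging library:

```python
def average_hash(pixels):
    """Compute an 'average hash' of a small 2-D grid of grayscale
    values: one bit per pixel, set if the pixel is brighter than
    the mean brightness of the whole grid."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for p in flat:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits

def hamming_distance(h1, h2):
    """Number of differing bits between two hashes; small distance
    suggests visually near-identical images."""
    return bin(h1 ^ h2).count("1")
```

A fixed classifier's feature vector would also work as a fuzzy fingerprint, but a perceptual hash is cheaper and doesn't depend on shipping model weights with the dataset.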
Not necessarily - they're listed as CC-BY on Flickr, but that doesn't make it true. Specifically, the uploader may not have had the authority to grant that license in the first place (and I assume that in cases where the user "accidentally" set it to CC-BY, they can't retract it?).
Any guesses on how large the resulting dataset would be if you actually downloaded all the images? I imagine the URLs will get removed in a hurry as everybody starts automating the downloads.
First video, now images - wonder if speech and others are on the way?
It's nice that they're doing this - it helps advance the state of the art, I think. But it also puts a lot of smaller operations at universities sort of under the Google system, in that their work is best compared against Google's ML work and against others using these datasets. It's a small way of stacking the deck to make Google and DeepMind more embedded in the community.
That said, its utility for others surely outweighs the strategic advantage gained here, so I for one welcome these libraries. A lot of work goes into them. Hopefully others will release theirs as well.
I really want real-world speech data. I think speech-to-text could open the doors to lots of creative tech to help hard of hearing folks, but the dataset barrier is so massive :(
Have you seen the LibriSpeech ASR corpus[1]? It's large scale (1k hours) Creative Commons licensed English speech with the original book text.
The data is derived from read audiobooks from the LibriVox project and has been carefully segmented and aligned.
If you mean real world as in "with realistic noise", Baidu's DeepSpeech showed that adding noise to clean audio can make a resilient model.
Indeed, having clean audio is better, since you can add different types of noise to it and, in effect, create regularization.
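The augmentation being described can be as simple as mixing white noise into a clean waveform at a chosen signal-to-noise ratio. A NumPy sketch (the function name and SNR-in-dB convention are mine, not from any particular toolkit):

```python
import numpy as np

def add_noise(clean, snr_db, rng=None):
    """Mix white Gaussian noise into a clean waveform so the result
    has roughly the requested signal-to-noise ratio (in dB)."""
    if rng is None:
        rng = np.random.default_rng(0)  # seeded for reproducibility
    signal_power = np.mean(clean ** 2)
    # SNR(dB) = 10 * log10(signal_power / noise_power)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=clean.shape)
    return clean + noise
```

Training on several noisy copies of each clean utterance (at different SNRs, or with recorded background noise instead of Gaussian) is what gives the model its resilience.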
I wonder why semi-supervised learning hasn't taken off more. There is so much unlabelled data out there - e.g. in podcasts, YouTube videos, and television. A small amount of it is labelled, like captioned television programs and YouTube videos. You could use those captions to train a weak model, which could then provide labels for the unlabelled data.
You can correct many of its errors with simple language models. For instance, the phrase "wreck a nice beach" is much less probable than "recognize speech". So if the model isn't sure which one it is, you can assume it's the more probable one. Then train on that, and it will get even better at recognizing those words.
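The rescoring idea can be sketched with a toy unigram language model (everything here is illustrative, not from any particular toolkit; a real system would use an n-gram or neural LM):

```python
import math
from collections import Counter

def train_unigram_lm(corpus_words):
    """Return a log-probability function over words, estimated from a
    word list with add-one smoothing (one extra slot for unseen words)."""
    counts = Counter(corpus_words)
    total = sum(counts.values())
    vocab = len(counts)

    def logprob(word):
        return math.log((counts[word] + 1) / (total + vocab + 1))

    return logprob

def pick_transcription(hypotheses, logprob):
    """Among acoustically ambiguous hypotheses, choose the one the
    language model scores highest (sum of per-word log-probs)."""
    def score(sentence):
        return sum(logprob(w) for w in sentence.split())
    return max(hypotheses, key=score)
```

The chosen hypothesis then serves as a pseudo-label for retraining the acoustic model.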
>You can correct many of its errors with simple language models. For instance, the phrase "wreck a nice beach" is much less probable than "recognize speech". So if the model isn't sure which one it is, you can assume it's the more probable one. Then train on that, and it will get even better at recognizing those words.
Why not just build that heuristic into your initial supervised "weak" model? Training on data labeled by a model introduces no new information, so you're not gaining anything there.
It does introduce new information. This is how semi-supervised learning works. Many words may be missed by the weak model, but can be inferred correctly from their context. Then you have new labels to train the weak model on, to make it better at those words.
The way I'm describing it is probably not the optimal way to do it. I don't know if there is a better way. But the point is it must be possible to take advantage of the vast quantities of unlabelled data we have. Human brains somehow do something similar.
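The loop being described is classic self-training: fit on the labelled set, label the pool, absorb only the confident predictions, repeat. A toy sketch with a one-dimensional nearest-centroid "model" (all names and the confidence-as-margin heuristic are illustrative; only the loop structure matters):

```python
def centroid_fit(labeled):
    """labeled: list of (x, y) pairs with y in {0, 1}. Return the
    per-class mean, our entire 'model'."""
    sums, counts = {0: 0.0, 1: 0.0}, {0: 0, 1: 0}
    for x, y in labeled:
        sums[y] += x
        counts[y] += 1
    return {y: sums[y] / counts[y] for y in (0, 1)}

def centroid_predict(centroids, x):
    """Return (label, confidence); confidence is the distance margin
    between the two class centroids."""
    d0, d1 = abs(x - centroids[0]), abs(x - centroids[1])
    return (0 if d0 < d1 else 1), abs(d0 - d1)

def self_train(labeled, unlabeled, rounds=3, threshold=1.0):
    """Repeatedly label the unlabeled pool with the current model and
    absorb only predictions above the confidence threshold."""
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(rounds):
        centroids = centroid_fit(labeled)
        keep = []
        for x in pool:
            y, conf = centroid_predict(centroids, x)
            if conf >= threshold:
                labeled.append((x, y))  # confident pseudo-label
            else:
                keep.append(x)  # still ambiguous, try next round
        pool = keep
    return centroid_fit(labeled)
```

The confidence threshold is the part doing the work: absorbing low-confidence labels would just feed the model its own noise, which is the objection raised above.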
If you make your training set the output of Google's speech recognition algorithm on YouTube, there will be lots of label noise.
It's better to go with movies and audiobooks, because the text "transcription" is human-generated. Aligning known text to audio is much more accurate than speech recognition.
TV show audio tracks with subtitles? Not perfect, but a very, very large dataset is available if you feel like downloading enough dodgy torrents. A bonus would be that any model trained on it would also need to distinguish and separate speakers, which is something even the human brain struggles with.
Record people with a good microphone as they use their cell phone / landline (or other communication apps that compress heavily), and also record what comes out at the other end - which could be all sorts of places, including overseas. Train a huge neural network on that until it can extrapolate bad phone-quality audio into something superb.
Next up: x-ray glasses, with several modes, to turn people into the hottest or ugliest possible version that could hide under those clothes.