Introducing the Open Images Dataset

imh · on Oct 1, 2016

Lawyers are funny:

>Today, we introduce Open Images, a dataset consisting of ~9 million URLs ... having a Creative Commons Attribution license* .

Then the footnote below:

>* While we tried to identify images that are licensed under a Creative Commons Attribution license, we make no representations or warranties regarding the license status of each image and you should verify the license for each image yourself.

I think this might be the most blatant instance I've ever seen of, "We have to write this even though it's essentially impossible for you to actually follow our directions."

ktta · on Oct 1, 2016

Lawyers aren't so funny when they sue you for trivial things.

There are so many instances where people try to take advantage of things. This is just protection. And it's not like they're providing nine images. It's nine million.

imh · on Oct 2, 2016

The funny part is the suggestion that you go through each and every one of the 9 million images and verify their licenses. It's basically impossible. It's funny that, by suggesting we do the impossible, their butts are covered.

zappo2938 · on Oct 1, 2016

Any image where it is impossible to find the originating source, already used thousands of times on tumbler, blogs, and websites, and is under a certain resolution and size (thumbnail for example) falls into the category of fair use.

teddyh · on Oct 1, 2016

Nope. Everything is copyrighted by default, and if you don’t have permission (“license”) to copy it, it’s illegal for you to copy it. There are the “fair use” exceptions, but those have nothing to do with what you wrote.

zappo2938 · on Oct 1, 2016

I clearly misunderstood what is meant by "Would it be considered impossible to obtain permission from the original source?" as criteria for fair use.[0] I've seen numerous images on Wikipedia that use this as reason for fair use. Someone uploads an image to 4chan as anonymous, for example, without a path to ask permission it might be fair use to use that image.This is different from someone posting an image to facebook where there is a way to ask the author for permission. Please, explain my misunderstanding here.

[0]: https://i.kinja-img.com/gawker-media/image/upload/qxptplnxqt...

teddyh · on Oct 2, 2016

> Would it be considered impossible to obtain permission from the original source?

That question assumes that you know who the original source is. If you don’t know that, you can’t consider it impossible.

transcranial · on Sept 30, 2016

Interesting that the base data consists of URLs. I guess it makes sense given copyright issues. Anybody know what the ballpark expected half-life of such URLs?

joelthelion · on Sept 30, 2016

I suppose that will simply leave the job of downloading the images and making a torrent out of them to someone else less identifiable...

visarga · on Oct 1, 2016

Or shared by sneakernet.

DarkLinkXXXX · on Sept 30, 2016

I was wondering about that as well. I'd hope that all these images are cached by archive.org or the like.

Smerity · on Sept 30, 2016

This is the tactic used by the CNN / Dailymail dataset[1] released by Google Deepmind. In that situation you want everyone to have the same and original file. The contents of the URL may be updated and/or disappear. The original URLs are recorded but they're retrieved from the Wayback Machine.

Even with that a few of the pages are still essentially lost however - or at least I was never able to retrieve them. Kyunghyun Cho hosts a copy of the processed data on his site at NYU - who may be less likely to receive legal requests compared to Google or a similar commercial company hosting them.

Distribution of datasets is going to be a continued painpoint for machine learning in the future. No-one wants to be the legal guinea pig for exactly what entails fair use in ML.

[1]: https://github.com/deepmind/rc-data

DarkLinkXXXX · on Sept 30, 2016

Agreed. Do you think maybe these datasets should include checksums so that they might be recovered with more certainty, from less obvious sources?

P.S.: Also, archive.org retroactively removes (or delists?) site content when blocked by robots.txt, as domain parkers often do, so I could see that causing trouble down the line. https://archive.org/post/1019415/retroactive-robotstxt-remov...

krasin · on Oct 9, 2016

I have added OriginalSize and OriginalMD5 columns to images.csv in the dataset: https://github.com/openimages/dataset/commit/526555b43ab74b6...

still thinking about a soft hash.

krasin · on Sept 30, 2016

Strict hashsums (like MD5 / SHA-1 / SHA-2) don't make too much sense here, as the images from the 'less obvious sources' might be thumbnails of different sizes. Or, probably, the same size but reencoded (not always a good idea, but happens in the practice).

Would a fuzzy hash, like the output from a known and fixed classifier work for that?

ramshorns · on Oct 1, 2016

For these images, a lot of the legal issues would be simpler and fair use wouldn't be necessary, since they're all CC-BY.

Smerity · on Oct 1, 2016

Not necessarily - they're listed as CC-BY on Flickr but that doesn't necessarily make it true. Specifically the user may not have authority to give out such copyright (as I assume for any case where the user "accidentally" set it to CC-BY they can't retract?)

Google included specific legalese on this very issue. See https://news.ycombinator.com/item?id=12616012

diyseguy · on Oct 1, 2016

Any guesses on how large the resulting dataset would be if you actually downloaded all the images? I imagine the urls will get removed in a hurry as everybody starts automating it.

krasin · on Oct 1, 2016

(disclosure: I am one of the contributors to the dataset)

~1TB for 640x480 thumbnails, ~3TB for 1600x1200 thumbnails.

The originals are about 20TB, though.

devindotcom · on Sept 30, 2016

First video, now images - wonder if speech and others are on the way?

It's nice that they're doing this, helps advance the art I think. But it also puts a lot of smaller operations in unis sort of under the Google system in that they're best compared to Google's ML work and others using these datasets. It's a small way of stacking the deck to make Google and DeepMind more embedded in the community.

That said, its utility for others surely outweighs the strategic advantage gained here, so I for one welcome these libraries. A lot of work goes into them. Hopefully others will release theirs as well.

imh · on Oct 1, 2016

I really want real-world speech data. I think speech-to-text could open the doors to lots of creative tech to help hard of hearing folks, but the dataset barrier is so massive :(

Smerity · on Oct 1, 2016

Have you seen the LibriSpeech ASR corpus[1]? It's large scale (1k hours) Creative Commons licensed English speech with the original book text.

The data is derived from read audiobooks from the LibriVox project and has been carefully segmented and aligned.

If you mean real world as in "with realistic noise", Baidu's DeepSpeech showed that adding noise to clean audio can make a resilient model. Indeed, having clean audio is better as you can add different types of noise to clean data to, in effect, create regularization.

[1]: http://www.openslr.org/12/

Houshalter · on Oct 1, 2016

I wonder why semi supervised learning hasn't taken off more. There is so much unlabelled data out there. E.g. in podcasts and youtube videos and television. A small amount of it is labelled, like captioned television programs and youtube videos. You could use those captions to train a weak model which could provide labels for the unlabelled data.

You can correct many of it's errors with simple language models. For instance the phrase "wreck a nice beach" has much less probability than "recognize speech". So if the model isn't sure which one it is, you can assume it's the more probable one. Then train on that, and it will get even better at recognizing those words.

rm999 · on Oct 1, 2016

>You can correct many of it's errors with simple language models. For instance the phrase "wreck a nice beach" has much less probability than "recognize speech". So if the model isn't sure which one it is, you can assume it's the more probable one. Then train on that, and it will get even better at recognizing those words.

Why not just build that heuristic into your initial supervised "weak" model? Training on data labeled by a model introduces no new information, so you're not gaining anything there.

Houshalter · on Oct 1, 2016

It does introduce new information. This is how semi-supervised learning works. Many words may be missed by the weak model, but can be inferred correctly from their context. Then you have new labels to train the weak model on, to make it better at those words.

The way I'm describing it is probably not the optimal way to do it. I don't know if there is a better way. But the point is it must be possible to take advantage of the vast quantities of unlabelled data we have. Humans brains somehow do something similar.

Semi supervised learning is really cool. I saw one successful example where they labelled just a few emails as spam and not spam. Then they trained used their weak classifier to label thousands of unclassified emails, and then used those as training data for an even stronger model. It actually works: http://matpalm.com/semi_supervised_naive_bayes/semi_supervis... https://en.wikipedia.org/wiki/Semi-supervised_learning

visarga · on Oct 1, 2016

If you make your training set the output of Google's speech recognition algo from YT there will be lots of label noise.

It's better to go with movies and audiobooks because the text 'transcription' is human generated. Aligning known text to audio is much more accurate than speech recognition.

mynameislegion · on Oct 1, 2016

http://voxforge.org/

chillydawg · on Oct 1, 2016

TV show audio tracks with subtitles? Not perfect, but a very, very large available dataset if you feel like downloading enough dodgy torrents. Plus points would be that any model trained would also need to be able to distinguish and separate speakers, which is something even the human brain struggles with.

PavlovsCat · on Oct 1, 2016

Record people with a good microphone as they use their cell phone / landline (or other communication apps that compress heavily), and also record what comes out at the other end, the other end being all sorts of places, including overseas. Train a huge neural network with that, until it can extrapolate bad phone quality into something superb.

Next up: x-ray glasses, with several modes, to turn people into the hottest or ugliest possible version that could hide under those clothes.

zappo2938 · on Oct 1, 2016

I'm glad I'm getting a return on all the effort clicking street signs and store fronts on reCaptcha.

pilooch · on Oct 1, 2016

I've put an efficient downloader here for the interested crowd: https://github.com/beniz/openimages_downloader It's a fork of the one script I used to grab Imagenet.

dharma1 · on Sept 30, 2016

Is there a link to the trained model somewhere?

goddamnsteve · on Oct 1, 2016

Even I've been searching for some links for the trained models. May be its too early for now.

krasin · on Oct 1, 2016

Stay tuned: https://github.com/openimages/dataset/issues/3

rocky1138 · on Sept 30, 2016

Are there any other libraries that are similar?

Omnipresent · on Oct 1, 2016

Looking forward to someone trying tensorFlow CNN on this