The internet is full of data you can use. Just crawl it, like everyone does. There are thousands of open datasets, some of them gigantic. If you have sensors (camera, GPS, orientation, etc.) you can generate a shitload of data. If you can create a game that is related to the data you want to collect, then you can collect data for 'free'.
On the other hand, think about it: what do Google and FB have that we don't? Personal data. What they have is data that is useful to target ads. If your interest in AI goes beyond ads, then you don't need that data.
| On the other hand, think about it: what do Google and FB have that we don't? Personal data. What they have is data that is useful to target ads.
Yeah? Like the tons of photos they harvest from people. Most of the progress they have made in computer vision is based on that. Should I build a Facebook or a Google to get access to it?
What about language modeling? They have access to conversational data and billions of search queries, neither of which can be accessed from outside.
What about health? Well, if I'm not somehow working with some big pharma company, how could I access this kind of data?
I can go on and on. The point is, yes, I can crawl the web, but what "web" is there left? Everything is locked behind paywalls and private clouds. If the original vision of an open internet had been fulfilled, all data generated on it would indeed be open to crawling.
I'm not saying it's not possible to get data and use it. I'm saying you cannot get the kind of data only monopolies have and you will never be able to compete with them.
Photos: if you build a Facebook app, you can probably ask your app's users for permission to access their photos. Also, the open image datasets for machine learning, like COCO, are pretty big. Can you really handle a lot more than that? Even Hinton starts with MNIST for new ideas like capsules.
Language modeling: Hacker News, public mailing lists, Wikipedia, GitHub.
Health: you can usually get data if you work at a hospital as an MD or researcher. You just need a reasonable idea and IRB approval. If you want the pharmacy data, I imagine you could get at it by going to work as a researcher at a pharma company, an insurer, or a retailer.
AlphaGo was built using publicly available records of human Go games. AlphaGo Zero didn't depend on human game data at all; it learned from self-play.
For AI, the limiting factors are ideas, code, time, hardware.
AlphaGo and AG0 were built with ridiculous amounts of compute power that Google donated to the effort. To replicate their results would cost millions of dollars.
Unless your objective is to target ads, I'm really not sure why you'd think that Facebook's collection of people's holiday and wedding snaps and memes is a superior training set to, say, the entire world's surface photographed at regular intervals, or millions of more selectively uploaded tagged images on Flickr, or image sets especially designed for training like Open Images.
sp4ke sez> "I'm saying you cannot get the kind of data only monopolies have and you will never be able to compete with them."
More a political statement than a statement of relevance to the workplace.
You need not worry that "they" will hold you back. It is unlikely that analyzing monopolies' data will explain how early man built flint tools, how Joe the mechanic repairs his car, how fifth-grade Fred solves his geometry problems or how van Gogh painted. ML, including AutoML, appears to be a long way from solving most AI problems. There's no need to feel that "they" are holding you back by withholding data. And then remember:
"Be careful what you wish for, it might just come true."
- old saying