There are a few services popping up with the aim of providing data repositories for analysis/ML (Kaggle, data.world, /r/datasets).
As someone who likes making analyses from random datasets, I have a few issues with these types of services:
1) There is often no indication of the distribution rights of the data, or of whether the data was obtained ethically from the source (i.e. in compliance with the ToS). I made this mistake when I used an OKCupid dataset released on an open data repository; it turned out the data had been scraped with a logged-in account, and the dataset was later taken down via a DMCA request.
2) There is no indication of the quality of the data, and as a result, it may take an absurd amount of time cleaning the data for accuracy. Some datasets may not be salvageable.
3) Bandwidth. Good datasets have lots of data for better models, which these sites may not be able to support. (BigQuery public datasets solve this problem, however.)
Good points.
We are soon going to release a p2p system for sharing datasets, backed by Hadoop clusters. You install the Hadoop stack (on localhost or distributed), and then you can free-text search for datasets that have been made 'public' on any Hadoop cluster that participates in the 'ecosystem'. We expect it to be self-policing, but there will be a way to report illegal distribution of datasets.
The solution is based on a variant of BitTorrent in which files are downloaded in order (rather than in the effectively random order produced by BitTorrent's rarest-piece-first selection; a sketch of the difference follows this comment). Files can be downloaded either to HDFS or to a Kafka topic. We will demo it in two weeks here:
https://fosdem.org/2017/schedule/event/democratizing_deep_le...
The system will be bootstrapped with lots of interesting big datasets: ImageNet, 10M images, YouTube-8M, Reddit comments, HN comments, etc. Our experience is that researchers need a central point where they can get easy access to open datasets without requiring an AWS or GCE account.
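To make the piece-selection difference concrete, here is a minimal, hypothetical sketch (the function and variable names are illustrative, not taken from the actual implementation) of rarest-first versus in-order selection; in-order is what lets a completed prefix be streamed straight into HDFS or a Kafka topic:

    # Hypothetical sketch; piece/peer bookkeeping is heavily simplified.

    def rarest_first(missing_pieces, peer_availability):
        # Standard BitTorrent: fetch the missing piece held by the fewest peers,
        # which keeps the swarm healthy but yields pieces in effectively random order.
        return min(missing_pieces, key=lambda piece: peer_availability[piece])

    def in_order(missing_pieces):
        # Streaming-friendly variant: always fetch the lowest-indexed missing piece,
        # so a completed prefix can be appended to HDFS or produced to Kafka
        # while the rest of the file is still downloading.
        return min(missing_pieces)

    # Example: with pieces 2 and 5 missing, rarest-first may pick 5, in-order picks 2.
    availability = {2: 7, 5: 1}
    print(rarest_first({2, 5}, availability))  # 5 (held by only one peer)
    print(in_order({2, 5}))                    # 2 (next piece in file order)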
Could you explain what problem you're trying to solve here? Are there really that many researchers who have access to modern (and expensive) GPU hardware but don't have bandwidth or disk space available? Or are there many researchers who put in lots of time assembling a dataset but don't have the bandwidth to distribute it?
It's more a case of providing a quick and easy way to share large datasets, backed by HDFS. Right now, researchers don't have a good way to share datasets (apart from AWS/GCE).
We work with climate science researchers who have multi-TB datasets and no efficient way to share them. The same goes for genomics researchers, who routinely pay lots of money for Aspera licenses just to download datasets faster than TCP allows. We are using a LEDBAT-based protocol tuned to give good bandwidth over high-latency links, but it only scavenges available bandwidth since it runs at lower priority than TCP.
For the machine learning researcher:
"I'd like to test this RNN on the Reddit comments dataset..." ...three days later, after finding a poor-quality torrent... "oh, now I can do it."
On our system: search, find, click to download. We will move towards downloading (random) samples of very large datasets, even straight into Kafka, from where they can be processed as they are downloaded.
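To make the "process it as it downloads" part concrete, here is a minimal consumer-side sketch, assuming the sample ends up in a Kafka topic and using kafka-python; the topic name, broker address, and handle_comment function are placeholders, not details of the system described above:

    # Hypothetical consumer sketch (kafka-python); topic and broker are placeholders.
    from kafka import KafkaConsumer

    def handle_comment(text):
        # Stand-in for real processing (tokenizing, feeding an RNN, etc.).
        print(len(text))

    consumer = KafkaConsumer(
        "reddit-comments-sample",            # placeholder topic name
        bootstrap_servers="localhost:9092",  # placeholder broker
        value_deserializer=lambda v: v.decode("utf-8"),
    )

    # Each record is handled as it streams in, before the full dataset exists locally.
    for record in consumer:
        handle_comment(record.value)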
Sounds nice. Could you consider making it more general than sharing datasets for ML? I mean, it sounds like a really generic solution that anyone could benefit from, not just researchers.
Just say yours uses the lower bound of the Wilson score confidence interval for a Bernoulli parameter [0] (a minimal sketch is included below); it has been used for sorting shopping items by rating for years.
If a person figured out a way to apply it usefully to some other area (which doesn't seem hard at all; job-skill ranking? :D), MS is the kind of company that would attempt to collect $$$ from it regardless. :(
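For reference, the lower bound of the Wilson score interval is only a few lines to implement; here is a minimal sketch (z = 1.96 corresponds to a 95% confidence level):

    import math

    def wilson_lower_bound(positive, total, z=1.96):
        # Lower bound of the Wilson score interval for a Bernoulli parameter.
        if total == 0:
            return 0.0
        phat = positive / total
        denom = 1 + z * z / total
        centre = phat + z * z / (2 * total)
        margin = z * math.sqrt((phat * (1 - phat) + z * z / (4 * total)) / total)
        return (centre - margin) / denom

    # 90 positive ratings out of 100 ranks below 9,000 out of 10,000,
    # even though both have the same raw average.
    print(wilson_lower_bound(90, 100))      # ~0.826
    print(wilson_lower_bound(9000, 10000))  # ~0.894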
That's why we're working on a ranking/feedback system (a la The Pirate Bay) so that users can see other users' comments on the datasets.
Regarding decentralization: we are starting with centralized search, and download times for popular datasets are much better because data is downloaded in parallel from many peers. We are also using a congestion control protocol (LEDBAT) that runs at lower priority than TCP, so you will not notice it using your bandwidth to share data while you are downloading/uploading over TCP.
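For anyone curious why LEDBAT yields to TCP: it adjusts its congestion window in proportion to how far the measured queuing delay is from a fixed target (100 ms in RFC 6817), so when a competing TCP flow starts filling the queue, LEDBAT backs off first. A rough sketch of the window update, with the delay measurement and framing omitted and the constants illustrative:

    TARGET_DELAY = 0.100   # allowed queuing delay in seconds (RFC 6817 default)
    GAIN = 1.0             # window gain per round trip
    MSS = 1452             # segment size in bytes (illustrative)

    def update_cwnd(cwnd, queuing_delay, bytes_acked):
        # Below-target delay grows the window; above-target delay shrinks it.
        # TCP keeps pushing until the queue (and hence the delay) grows, so
        # off_target goes negative and LEDBAT backs off -- "lower priority".
        off_target = (TARGET_DELAY - queuing_delay) / TARGET_DELAY
        cwnd += GAIN * off_target * bytes_acked * MSS / cwnd
        return max(cwnd, MSS)  # never shrink below one segment

    print(update_cwnd(cwnd=10 * MSS, queuing_delay=0.02, bytes_acked=MSS))  # grows
    print(update_cwnd(cwnd=10 * MSS, queuing_delay=0.12, bytes_acked=MSS))  # shrinks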
I'm one of the cofounders at data.world and you definitely make good points. We encourage all users to post a license with their datasets but, like open source, not everyone will maintain or honor these. This is definitely something that we feel we can help the open data community with and encourage even more. (As an aside, you should note that a lot of data scraped from websites is something you'll need to be careful with.)
When it comes to data quality, the world is a messy place and the data that comes from it is messy too. Most professional data scientists spend an inordinate amount of time cleaning datasets, doing feature engineering, etc.; it's part of the job description. We're trying to eliminate some of that repetitive work by making sure that people can comment on, contribute to, and give some signal back on the quality of a dataset. We also think a dataset is more than just the data: on data.world you can upload code, notebooks, images, etc., anything that helps add context to the data.
Finally, when it comes to size, ML definitely needs it. However, there's a lot of interesting data out there that's still very complicated and very useful but not that big. Most datasets in the world are well under a terabyte in size (or even under 100s of GB). We're rapidly expanding the size of datasets we support because we want that stuff too, but we really want to help people understand all the data in the world!
There's probably some utility to it: a lot of problems involve hacking together datasets, sometimes in dubious ways. There's also value, especially for startups looking to build simple neural net applications (e.g., identifying plates of food from different restaurants), which are very data-dependent. Researchers may also want to recoup the cost of assembling datasets (e.g., MTurk, processing power) and open up datasets that may never have been open before.
My general sense on this, though, is that I'd like there to be more of an incentive for people to open up their datasets to the larger public. Maybe I'm being idealistic, but imagine a crowdsourcing-type function where you pay for dataset X together with other users and it's then released under MIT, forever free, etc.
As others have mentioned, that'll probably bump up against usage-rights issues, a larger problem you'll have to deal with independent of your need to sell or distribute the datasets in question.
Hi everyone,
While working on some of my projects involving machine learning algorithms and deep neural networks, I have found that there is a lack of training datasets in many areas. Also, many of them are scattered throughout the web, and some are far too huge for an individual to process. So I thought of this idea of having a marketplace for data, structured for the machine learning community. It could be a one-stop place for researchers, scientists, students, data analysts, etc. Looking for some valuable opinions.
I don't think you can make this work because even a fairly small cost for data produces all sorts of problems.
You would have to distribute all these datasets with a license, and such a thing would restrict users much more than you'd realize; sharing one's results would become a trickier matter, for one. Deciding whether the data is good would be another problem, since being able to sell things creates an incentive for uploading garbage.
Just as much, huge datasets are effectively going to be the product of many people's data. If some entity decides it's going to profit from that, how do the profits get divided, especially if there are no protocols for that division?
For example, should someone be able to sell my chest X-rays if I haven't agreed to it? A lot of large companies with big datasets would face far more issues if they started selling them. And the considerations go on and on.
The free-wheeling, free-sharing quality of present-day machine learning has been a driving force in its progress, and it would be a shame for that to go away. As can be seen with the Internet, freely available data can be an incredible force, and hopefully data will remain freely available.
Yeah. I was thinking of a crowdsourced marketplace platform that follows a common standard so every dataset supports the major ML platforms, with API support in common programming languages for preprocessing huge datasets, etc. People can prepare and sell/buy datasets. This could bring people from various fields together to solve really major problems using the data.
I think the idea is great, but you should think about this sentence: "Buy and sell your data like eBay" next to an image of connected people. It looks like you're a shady user-profile dealer. To be successful, it's crucial that you draw a clear line there.
"""
Datapie offers data analysis without downloading the data. This means you need not download the massive data. No need to have massive distributed systems to process it.
"""
Where is your company trying to fit into the market? Is this a http://zerotoonebook.com/ opportunity, or is this space already commoditized?
IBM offers many of the same datasets, paired with your company's private data, both annotated by Watson. IBM also does not learn from, or improve its own models with, your company's proprietary data.
I swear this is what Infochimps used to be, but now I don't really see a reference to it on their website, except for a 404 when I click on "Resources" -> "Data Marketplace" [1]. I'm guessing that means they moved away from that business. It looks like they now focus on tools, not the data itself.
What I like more is the concept of a job agency for AIs. Basically, the agency is a broker between people who have data and need an algorithm, and people who have an algorithm but no data. The broker can then work as a matchmaker, but also provide protection against data/algorithm theft by managing the hardware itself.
For those unfamiliar with Reddit, you can choose in a dropdown what time period to filter top posts by; here's a direct link to the all-time top dataset posts there: https://www.reddit.com/r/datasets/top/?sort=top&t=all
Would there be much of an opportunity for someone who's into photo-realistic 3D rendering to create some of these datasets?
For starters, I was thinking of making something like the Make3D dataset (http://make3d.cs.cornell.edu/data.html) for some of my own experiments.