There are a few services popping up with the aim of providing data repositories for analysis/ML (Kaggle, data.world, /r/datasets).
As someone who likes making analyses from random datasets, I have a few issues with these types of services:
1) There is often no indication of the distribution rights of the data, or of whether the data was obtained ethically from the source (i.e. in compliance with the ToS). I made this mistake when I used an OKCupid dataset released on an open data repository; it turned out the data had been scraped with a logged-in account, and the dataset was later taken down via a DMCA request.
2) There is no indication of the quality of the data, and as a result, it may take an absurd amount of time cleaning the data for accuracy. Some datasets may not be salvageable.
3) Bandwidth. Good datasets have lots of data for better models, which these sites may not be able to support. (BigQuery public datasets solve this problem, however.)
Good points.
We are soon going to release a p2p system for sharing datasets, backed by Hadoop clusters. You install the Hadoop stack (on localhost or distributed), and then you can free-text search for datasets that have been made 'public' on any Hadoop cluster that participates in the 'ecosystem'. We expect it to be self-policing, but there will be a way to report illegal distribution of datasets.
The solution is based on a variant of BitTorrent in which files are downloaded in order (rather than in the effectively random order produced by BitTorrent's rarest-piece-first selection; a sketch of the difference follows this comment). Files can be downloaded either to HDFS or to a Kafka topic. We will demo it in two weeks here:
https://fosdem.org/2017/schedule/event/democratizing_deep_le...
The system will be bootstrapped with lots of interesting big datasets: ImageNet, 10M images, YouTube-8M, Reddit comments, HN comments, etc. Our experience is that researchers need a central point where they can get easy access to open datasets without requiring an AWS or GCE account.
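To make the piece-selection difference concrete, here is a minimal, hypothetical sketch (the function and variable names are illustrative, not taken from the actual implementation) of rarest-first versus in-order selection; in-order is what lets a completed prefix be streamed straight into HDFS or a Kafka topic:

    # Hypothetical sketch; piece/peer bookkeeping is heavily simplified.

    def rarest_first(missing_pieces, peer_availability):
        # Standard BitTorrent: fetch the missing piece held by the fewest peers,
        # which keeps the swarm healthy but yields pieces in effectively random order.
        return min(missing_pieces, key=lambda piece: peer_availability[piece])

    def in_order(missing_pieces):
        # Streaming-friendly variant: always fetch the lowest-indexed missing piece,
        # so a completed prefix can be appended to HDFS or produced to Kafka
        # while the rest of the file is still downloading.
        return min(missing_pieces)

    # Example: with pieces 2 and 5 missing, rarest-first may pick 5, in-order picks 2.
    availability = {2: 7, 5: 1}
    print(rarest_first({2, 5}, availability))  # 5 (held by only one peer)
    print(in_order({2, 5}))                    # 2 (next piece in file order)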
Could you explain what problem you're trying to solve here? Are there really that many researchers who have access to modern (and expensive) GPU hardware but don't have bandwidth or disk space available? Or are there many researchers who put in lots of time assembling a dataset but don't have the bandwidth to distribute it?
It's more a case of providing a quick and easy way to share large datasets, backed by HDFS. Right now, researchers don't have a good way to share datasets (apart from AWS/GCE).
We work with climate science researchers who have multi-TB datasets and no efficient way to share them. The same goes for genomics researchers, who routinely pay lots of money for Aspera licenses just to download datasets faster than TCP allows. We are using a LEDBAT-based protocol tuned to give good bandwidth over high-latency links, but it only scavenges available bandwidth since it runs at lower priority than TCP.
For the machine learning researcher:
"I'd like to test this RNN on the Reddit comments dataset..." ...three days later, after finding a poor-quality torrent... "oh, now I can do it."
On our system: search, find, click to download. We will move towards downloading (random) samples of very large datasets, even straight into Kafka, from where they can be processed as they are downloaded.
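To make the "process it as it downloads" part concrete, here is a minimal consumer-side sketch, assuming the sample ends up in a Kafka topic and using kafka-python; the topic name, broker address, and handle_comment function are placeholders, not details of the system described above:

    # Hypothetical consumer sketch (kafka-python); topic and broker are placeholders.
    from kafka import KafkaConsumer

    def handle_comment(text):
        # Stand-in for real processing (tokenizing, feeding an RNN, etc.).
        print(len(text))

    consumer = KafkaConsumer(
        "reddit-comments-sample",            # placeholder topic name
        bootstrap_servers="localhost:9092",  # placeholder broker
        value_deserializer=lambda v: v.decode("utf-8"),
    )

    # Each record is handled as it streams in, before the full dataset exists locally.
    for record in consumer:
        handle_comment(record.value)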
Sounds nice. Could you consider making it more general than sharing datasets for ML? I mean, it sounds like a really generic solution that anyone could benefit from, not just researchers.
Just say yours uses the lower bound of the Wilson score confidence interval for a Bernoulli parameter [0] (a minimal sketch is included below); it has been used for sorting shopping items by rating for years.
If a person figured out a way to apply it usefully to some other area (which doesn't seem hard at all; job-skill ranking? :D), MS is the kind of company that would attempt to collect $$$ from it regardless. :(
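For reference, the lower bound of the Wilson score interval is only a few lines to implement; here is a minimal sketch (z = 1.96 corresponds to a 95% confidence level):

    import math

    def wilson_lower_bound(positive, total, z=1.96):
        # Lower bound of the Wilson score interval for a Bernoulli parameter.
        if total == 0:
            return 0.0
        phat = positive / total
        denom = 1 + z * z / total
        centre = phat + z * z / (2 * total)
        margin = z * math.sqrt((phat * (1 - phat) + z * z / (4 * total)) / total)
        return (centre - margin) / denom

    # 90 positive ratings out of 100 ranks below 9,000 out of 10,000,
    # even though both have the same raw average.
    print(wilson_lower_bound(90, 100))      # ~0.826
    print(wilson_lower_bound(9000, 10000))  # ~0.894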
That's why we're working on a ranking/feedback system (a la The Pirate Bay) so that users can see other users' comments on the datasets.
Regarding decentralization: we are starting with centralized search, and download times for popular datasets are much better because data is downloaded in parallel from many peers. We are also using a congestion control protocol (LEDBAT) that runs at lower priority than TCP, so you will not notice it using your bandwidth to share data while you are downloading/uploading over TCP.
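For anyone curious why LEDBAT yields to TCP: it adjusts its congestion window in proportion to how far the measured queuing delay is from a fixed target (100 ms in RFC 6817), so when a competing TCP flow starts filling the queue, LEDBAT backs off first. A rough sketch of the window update, with the delay measurement and framing omitted and the constants illustrative:

    TARGET_DELAY = 0.100   # allowed queuing delay in seconds (RFC 6817 default)
    GAIN = 1.0             # window gain per round trip
    MSS = 1452             # segment size in bytes (illustrative)

    def update_cwnd(cwnd, queuing_delay, bytes_acked):
        # Below-target delay grows the window; above-target delay shrinks it.
        # TCP keeps pushing until the queue (and hence the delay) grows, so
        # off_target goes negative and LEDBAT backs off -- "lower priority".
        off_target = (TARGET_DELAY - queuing_delay) / TARGET_DELAY
        cwnd += GAIN * off_target * bytes_acked * MSS / cwnd
        return max(cwnd, MSS)  # never shrink below one segment

    print(update_cwnd(cwnd=10 * MSS, queuing_delay=0.02, bytes_acked=MSS))  # grows
    print(update_cwnd(cwnd=10 * MSS, queuing_delay=0.12, bytes_acked=MSS))  # shrinks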
I'm one of the cofounders at data.world and you definitely make good points. We encourage all users to post a license with their datasets but, like open source, not everyone will maintain or honor these. This is definitely something that we feel we can help the open data community with and encourage even more. (As an aside, you should note that a lot of data scraped from websites is something you'll need to be careful with.)
When it comes to data quality, the world is a messy place and the data that comes from it is messy too. Most professional data scientists spend an inordinate amount of time cleaning datasets, doing feature engineering, etc.; it's part of the job description. We're trying to eliminate some of that repetitive work by making sure that people can comment on, contribute to, and give some signal back on the quality of a dataset. We also think a dataset is more than just the data: on data.world you can upload code, notebooks, images, etc., anything that helps add context to the data.
Finally, when it comes to size, ML definitely needs it. However, there's a lot of interesting data out there that's still very complicated and very useful but not that big. Most datasets in the world are well under a terabyte in size (or even under 100s of GB). We're rapidly expanding the size of datasets we support because we want that stuff too, but we really want to help people understand all the data in the world!
There's probably some utility to it: a lot of problems involve hacking together datasets, sometimes in dubious ways. There's also value, especially for startups looking to build simple neural net applications (e.g., identifying plates of food from different restaurants), which are very data-dependent. Researchers may also want to recoup the cost of assembling datasets (e.g., MTurk, processing power) and open up datasets that may never have been open before.
My general sense on this, though, is that I'd like there to be more of an incentive for people to open up their datasets to the larger public. Maybe I'm being idealistic, but imagine a crowdsourcing-type function where you pay for dataset X together with other users and it's then released under MIT, forever free, etc.
As others have mentioned, that'll probably bump up against usage-rights issues, a larger problem you'll have to deal with independent of your need to sell or distribute the datasets in question.
Hi everyone,
While working on some of my projects involving machine learning algorithms and deep neural networks, I have found that there is a lack of training datasets in many areas. Also, many of them are scattered throughout the web, and some are far too huge for an individual to process. So I thought of this idea of having a marketplace for data, structured for the machine learning community. It could be a one-stop place for researchers, scientists, students, data analysts, etc. Looking for some valuable opinions.
I don't think you can make this work because even a fairly small cost for data produces all sorts of problems.
You would have to distribute all these datasets with a license, and such a thing would restrict users much more than you'd realize; sharing one's results would become a trickier matter, for one. Deciding whether the data is good would be another problem, since being able to sell things creates an incentive for uploading garbage.
Just as much, huge datasets are effectively going to be the product of many people's data. If some entity decides it's going to profit from that, how do the profits get divided, especially if there are no protocols for that division?
For example, should someone be able to sell my chest X-rays if I haven't agreed to it? A lot of large companies with big datasets would face far more issues if they started selling them. And the considerations go on and on.
The free-wheeling, free-sharing quality of present-day machine learning has been a driving force in its progress, and it would be a shame for that to go away. As can be seen with the Internet, freely available data can be an incredible force, and hopefully data will remain freely available.
Yeah. I was thinking of a crowdsourced marketplace platform that follows a common standard so every dataset supports the major ML platforms, with API support in common programming languages for preprocessing huge datasets, etc. People can prepare and sell/buy datasets. This could bring people from various fields together to solve really major problems using the data.
I think the idea is great, but you should think about this sentence: "Buy and sell your data like eBay" next to an image of connected people. It looks like you're a shady user-profile dealer. To be successful, it's crucial that you draw a clear line there.
"""
Datapie offers data analysis without downloading the data. This means you need not download the massive data. No need to have massive distributed systems to process it.
"""
Where is your company trying to fit into the market? Is this a http://zerotoonebook.com/ opportunity, or is this space already commoditized?
IBM offers many of the same datasets, paired with your company's private data, both annotated by Watson. IBM also does not learn from, or improve its own models with, your company's proprietary data.
I swear this is what Infochimps used to be, but now I don't really see a reference to it on their website, except for a 404 when I click on "Resources" -> "Data Marketplace" [1]. I'm guessing that means they moved away from that business. It looks like they now focus on tools, not the data itself.
What I like more is the concept of a job agency for AIs. Basically, the agency is a broker between people who have data and need an algorithm, and people who have an algorithm but no data. The broker can then work as a matchmaker, but also provide protection against data/algorithm theft by managing the hardware itself.
For those unfamiliar with Reddit, you can choose in a dropdown what time period to filter top posts by; here's a direct link to the all-time top dataset posts there: https://www.reddit.com/r/datasets/top/?sort=top&t=all
Would there be much of an opportunity for someone who's into photo-realistic 3D rendering to create some of these datasets?
For starters, I was thinking of making something like the Make3D dataset (http://make3d.cs.cornell.edu/data.html) for some of my own experiments.