AWS Public Datasets (amazon.com)
390 points by darshanrai on March 25, 2018 | 45 comments



Google Cloud has something similar, https://cloud.google.com/public-datasets/

I think it would show a kind of gilded age maturity if all the cloud providers cooperated on their public datasets, because they are for the public good.


Hacker News is a member of this dataset, but it's only updated daily. If you want to get meta about this discussion, I guess you'll need to wait until tomorrow.

https://cloud.google.com/bigquery/public-data/hacker-news
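
For instance, a minimal sketch with the google-cloud-bigquery Python client (the `bigquery-public-data.hacker_news.full` table and its fields are my assumption of how the public dataset is laid out):

    # pip install google-cloud-bigquery; requires a GCP project for quota/billing
    from google.cloud import bigquery

    client = bigquery.Client()

    # Assumed table/field names for the public Hacker News dataset.
    query = """
        SELECT title, score
        FROM `bigquery-public-data.hacker_news.full`
        WHERE type = 'story' AND score IS NOT NULL
        ORDER BY score DESC
        LIMIT 10
    """
    for row in client.query(query).result():
        print(row.score, row.title)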


And the first terabyte of analysis for free in BigQuery!!

(Work at g)


That's all fine and dandy, but I assume I could also pull down the entire dataset and run unlimited analysis locally for free?


Yes, although data transfer fees may apply if you go over the free quota.


Ah, I see. These "public" datasets are not actually public, as they are only available if you log in (and, I assume, enter your payment details first).

Shame on me for thinking "Public Datasets" meant I could just download them via rsync/http.

If you're interested in truly public datasets, see http://academictorrents.com/ which hosts datasets as torrents.


This seems critical. As Google is a corporation, I wonder where this expectation comes from? Obviously we have certain expectations of a company whose tagline reads "don't be evil," but does that extend to criticizing them for not providing free transfers of terabytes of data, which they pay to host and serve?


> criticizing them for not providing free transfers of terabytes of data

My criticism is not that they don't provide free transfers of data. My criticism is that they say "these datasets are freely hosted and accessible" and public, while they are only available when logged in to Google, which I don't think qualifies as public.

I would not have any problem if it were just called "Google-hosted Datasets" or "Google-only Public Datasets"; I just think the current naming is misleading.

Edit: compare this to the AWS Public Datasets, which are actually available without an AWS account. Just go to https://landsat-pds.s3.amazonaws.com/c1/L8/139/045/LC08_L1TP... for example, which is part of the Landsat dataset.
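
As a rough sketch of what "actually public" means here, plain anonymous HTTPS works against the bucket (the listing parameters below are just an illustration and assume the bucket permits anonymous listing):

    # No AWS account needed: the landsat-pds bucket answers plain HTTPS requests.
    import requests

    # List a handful of objects under the prefix from the URL above,
    # using the S3 ListObjectsV2 REST API (assumes anonymous listing is allowed).
    resp = requests.get(
        "https://landsat-pds.s3.amazonaws.com/",
        params={"list-type": "2", "prefix": "c1/L8/139/045/", "max-keys": "5"},
    )
    print(resp.status_code)
    print(resp.text[:500])  # raw XML listing; any listed key can then be fetched by GET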


A public library still requires a library card...

The term public means it's available to the public. Signing up for Google, like getting a library card, doesn't negate the fact that any member of the public can access the data with a reasonable and insignificant barrier.


Sure, but the difference between getting a library card and signing up for a Google account is that the first usually has a user agreement about 10 lines long in plain language, while the second has a user agreement 10+ pages long in lawyer-speak that probably includes permission to sell your data. How is it reasonable to expect a normal person to understand that kind of user agreement?

And if that's your definition of public, isn't everything public, provided you have the right amount of money, know the right people, and can get access to the right places?


You only pay because you are using someone else's system, which costs money. Public libraries are public, but you also pay taxes for them, because running them as a service still has a cost.

Google gives you 1 TB of data to query each month. That is a lot of data; if you need more for free, collect the data over time.

But you still need a library card, and if the facility gets overused they ask for a small tax to cover the operating costs.

I fail to see how this isn't considered public. Public literally just means accessible to the general public. That's all. A public event can still charge a fee for entry; as long as the fee doesn't create a barrier for most of the public, it's still accessible to the general public.

Also, if a library sold their data, would it no longer be considered public? I believe it would still be considered public. User data typically isn't sold by name or email, but by activity. If a library compiled a list of books checked out and their frequency, the number of people entering every day, etc., that would be the equivalent of most user data being sold. It's very rare for a company to sell your actual personal data, and when they do, they disassociate your personal information from it.

So if the above doesn't disqualify a library from being public, then neither does it disqualify this dataset. If you really disagree with that, you are just being pedantic at that point.


If anybody is interested in AI/bioinformatics projects on AWS, I'm currently involved in a project to harmonize _all_ of the publicly available RNA data (many petabytes) into easily sliced, AI/ML-ready datasets for anybody to use:

Website: http://www.refine.bio/

Source: https://github.com/AlexsLemonade/refinebio


Would this allow rapid realignment of the data against a given reference (graph?) model? Or is the alignment somehow baked in?


We do the alignment too, yes.


What reference model do you align the RNAseq against? That would seem to have a huge effect on results no?


How in the world do you have "petabytes" of RNA data?


An RNA sequencing run generates on the order of 10 GB of data, a typical study requires many runs (treatments, controls, replication of results, etc.), and posting the raw data is required by most biology journals. I'm not surprised that there is over 1 PB of data available to curate.
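
A rough back-of-the-envelope, with the per-run size from above and the other counts as purely illustrative assumptions:

    # Order-of-magnitude estimate of public RNA-seq volume (illustrative numbers only).
    gb_per_run = 10           # ~10 GB of raw reads per sequencing run
    runs_per_study = 30       # treatments, controls, replicates (assumed)
    public_studies = 10_000   # assumed count of studies with deposited raw data

    total_pb = gb_per_run * runs_per_study * public_studies / 1_000_000
    print(f"~{total_pb:.0f} PB of raw reads")  # ~3 PB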


Oh, you mean BAM files? Get yourself a retention policy; you don't need to keep RNA BAM files that long.

I thought you meant derived data.


I'm talking about the raw reads, which is important if you want to try a different alignment or base-calling method. You can debate how important it is to be able to do that, but I'm not trying to argue that the data should be kept, I was just explaining why the total size of publicly available RNA-seq data (the sum total of which the parent is attempting to organize) runs in the petabytes.


So, do you or the original poster actually have a materialized petabyte of RNA data? Otherwise, you're just describing a million files spread over a million file servers, not being used for science or processed in any way.


Very interested. Looking forward to following this!


"Anybody to use": who is paying for the bandwidth here?


Over 13,000 community-uploaded public datasets: https://www.kaggle.com/datasets

(I work at Kaggle)


I'll take the chance to ask you: is there a command-line utility, similar to pip, for importing datasets from Kaggle? Something like:

`kik pull titanic`



A lot of these datasets have been available for some time: https://aws.amazon.com/blogs/aws/new-aws-public-data-sets-tc.... Perhaps what is most surprising is that the list hasn't grown a ton since then.


As someone who has spent a considerable amount of time on data that has ended up on this page, I think the fact that the list hasn't grown says more about the priorities of other companies than about AWS's. Amazon doesn't (yet) have the time to build and maintain these datasets themselves: they work with others to build and maintain them and then fund the storage and transmission fees.

I helped build the Terrain Tiles dataset as part of Mapzen, which recently shut down. The OpenStreetMap data exists on the AWS Public Datasets page because it's useful to the Humanitarian OpenStreetMap Team. If you're able to convince your company to generate and work with a public dataset, consider reaching out to the AWS and Google public datasets teams to get it hosted and publicized.


In my experience, the list that AWS keeps on their public page isn't completely up to date. There are two major datasets in the neuroimaging community that are hosted in S3 (see my comment elsewhere on this page), but AWS hasn't widely publicized this fact for some reason.


Interesting: they used to have Wikipedia data and now it's gone from the list. Anyone know why?


Also curious...


How much does it cost to export/process this data from Amazon? Unlike GCP/BigQuery, which has free cloud processing built in, downloading/analyzing these GB/TB datasets for personal analysis can’t be cheap.

The descriptions note “Educators, researchers and students can apply for free promotional credits to take advantage of Public Datasets on AWS,” which is not a good sign.


It's the normal data cost. E.g., the Landsat data is stored in S3, so it's free to EC2 (in the same region) and somewhere around $0.09/GB to the public internet.

However, the idea is not that you download it all (there are probably cheaper ways to acquire Landsat data); the idea is that if you want to do your analysis on AWS, they've already got it neatly ingested for you.
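
For a sense of scale, a quick sketch using that rate (the dataset size is an assumed figure, not the actual Landsat archive size):

    # Rough S3 egress cost for pulling a large dataset to the public internet.
    price_per_gb = 0.09   # USD/GB, the rate mentioned above
    dataset_tb = 100      # assumed size for illustration

    egress_cost = dataset_tb * 1000 * price_per_gb
    print(f"~${egress_cost:,.0f} to download {dataset_tb} TB")  # ~$9,000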


You can use Athena; it's cheap and easy to use.


I guess they want you to run your analyses on AWS directly.


AWS has a free tier; in this case, 1500 free hours of a dc2.large node on Redshift.


The importance of data for machine learning algorithms can’t be stressed enough. I remember talking to a friend on the Google Translate team. They had a good algorithm, but they were struggling to get quality translation data to train their service. The problem was more severe for languages that were not very popular; the translation set was next to nothing for, say, Turkish, Hindi, Latvian, etc.

They finally solved this by using meeting notes from the UN General Assembly, which were translated by the best translators. That access to meeting transcriptions was an (unfair?) advantage Google had over other tools. Was it wrong? I don’t think so. Should those meeting notes have been public? Yes.


When Google Translate was new in my language, Finnish, their translation quality was atrocious.

I tried this sentence: "Hei äiti, puhun suomea". The expected translation would be "Hi mom, I speak Finnish".

Instead Google's result was: "July's mother, I speak English".

Obviously the engine had been trained on unvetted data sets where the word "English" occurred in translations in a position where the original had the word "Finnish", and no context was provided to avoid this kind of mistake.

The word "July" came about because "hei" is also used as an abbreviation for "heinäkuu" (July). It was sobering that a supposed world-class AI couldn't distinguish between these two usages. Machine learning needs a lot of old-fashioned handtuned human-made heuristics.


Sobering perhaps to learn what the state of the art is for "world class" AI indeed ;)

Cortana when??


DeepL (formerly Linguee), the German ML translation company founded by ex-Googlers that currently provides the best machine translation service, apparently got a lot of its data from (I assume public) EU documents, which had to be translated into all member languages.


Adding one more source of open datasets to the thread: manually tagged, high-quality datasets:

https://dataturks.com/projects/trending


Seems to be down. What do they have?


This doesn't appear to be publicized on the linked-to page, but two of the major open-access fMRI dataset repositories (FCP-INDI [1] and OpenfMRI [2]) are hosted in S3.

[1] http://fcon_1000.projects.nitrc.org/ [2] https://openfmri.org/


I wonder if anyone has processed Common Crawl's dataset for a list of all the existing domains (I don't care if the data are old).


You can get the URL data (and parse out the domains yourself) at http://index.commoncrawl.org/. Be aware that Common Crawl, while huge, only crawls a fraction of the web, so you won't get all of the "existing domains" (i.e. all domains that exist) if that is what you are after.
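
A minimal sketch of querying the CDX index API (the collection name below is an assumption; the index page lists the crawls actually available):

    # Ask the Common Crawl CDX index which captures exist for a domain.
    import json
    import requests

    index = "http://index.commoncrawl.org/CC-MAIN-2018-13-index"  # assumed crawl name
    resp = requests.get(index, params={"url": "example.com/*", "output": "json"})

    # The response is newline-delimited JSON, one record per capture.
    for line in resp.text.splitlines():
        record = json.loads(line)
        print(record.get("url"))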


I wish they'd centralize WHOIS data and clean it up... that is some messy stuff there.



