If anybody is interested in AI/bioinformatics projects on AWS, I'm currently inv...

eggie · on March 26, 2018

Would this allow rapid realignment of the data against a given reference (graph?) model? Or is the alignment somehow baked in?

Mizza · on March 26, 2018

We do the alignment too, yes.

eggie · on March 27, 2018

What reference model do you align the RNA-seq data against? That would seem to have a huge effect on the results, no?

dekhn · on March 26, 2018

How in the world do you have "petabytes" of RNA data?

vamin · on March 26, 2018

An RNA sequencing run generates on the order of 10 GB of data, a typical study requires many runs (treatments, controls, replication of results, etc.), and posting the raw data is required by most biology journals. I'm not surprised that there is over 1 PB of data available to curate.
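As a rough sanity check on those numbers, here is a back-of-the-envelope sketch. It assumes decimal units and a flat 10 GB per run, which is only the order-of-magnitude figure quoted above; the helper name `runs_needed` is made up for illustration.

```python
# Back-of-the-envelope: how many ~10 GB runs add up to 1 PB?
# Assumptions: decimal units (1 GB = 10**9 bytes, 1 PB = 10**15 bytes)
# and a uniform run size, which real datasets will not have.
GB = 10**9
PB = 10**15


def runs_needed(total_bytes: int, bytes_per_run: int) -> int:
    """Smallest number of runs whose combined size reaches total_bytes."""
    return -(-total_bytes // bytes_per_run)  # ceiling division on integers


print(runs_needed(1 * PB, 10 * GB))  # 100000
```

So on the order of a hundred thousand runs would already reach a petabyte under these assumptions.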

dekhn · on March 26, 2018

Oh, you mean BAM files? Get yourself a retention policy; you don't need to keep RNA BAM files that long.

I thought you meant derived data.

vamin · on March 26, 2018

I'm talking about the raw reads, which are important if you want to try a different alignment or base-calling method. You can debate how important it is to be able to do that, but I'm not trying to argue that the data should be kept. I was just explaining why the total size of publicly available RNA-seq data (the sum total of which the parent is attempting to organize) runs into the petabytes.

dekhn · on March 27, 2018

So, do you or the original poster actually have a materialized petabyte of RNA data? Otherwise, you're just describing a million files spread over a million file servers, not being used for science or processed in any way.

Nashooo · on March 26, 2018

Very interested. Looking forward to following this!

yinyang_in · on March 26, 2018

If this is for anybody to use, who is paying for the bandwidth here?