Hacker News

If anybody is interested in AI/bioinformatics projects on AWS, I'm currently involved in a project to harmonize _all_ of the publicly available RNA data (many petabytes) into easily sliced, AI/ML-ready datasets for anybody to use:

Website: http://www.refine.bio/

Source: https://github.com/AlexsLemonade/refinebio




Would this allow rapid realignment of the data against a given reference (graph?) model? Or is the alignment somehow baked in?


We do the alignment too, yes.


What reference model do you align the RNA-seq data against? That would seem to have a huge effect on the results, no?


How in the world do you have "petabytes" of RNA data?


An RNA sequencing run generates on the order of 10 GB of data, a typical study requires many runs (treatments, controls, replication of results, etc.), and most biology journals require that the raw data be posted. I'm not surprised that there is over 1 PB of data available to curate.
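As a rough back-of-envelope check on that estimate, here is a minimal sketch. The ~10 GB per run figure comes from the comment above; the 20 runs per study is a purely hypothetical number chosen for illustration, and decimal units (1 PB = 10^15 bytes) are assumed.

```python
GB = 10**9   # decimal gigabyte, an assumption about units
PB = 10**15  # decimal petabyte

def runs_needed(total_bytes, bytes_per_run=10 * GB):
    """Number of runs of the given size needed to reach total_bytes (rounded up)."""
    return -(-total_bytes // bytes_per_run)  # ceiling division on integers

def studies_needed(total_bytes, runs_per_study, bytes_per_run=10 * GB):
    """Number of studies needed, given a hypothetical runs-per-study count."""
    bytes_per_study = runs_per_study * bytes_per_run
    return -(-total_bytes // bytes_per_study)

# ~10 GB per run is from the comment; 20 runs per study is hypothetical.
print(runs_needed(1 * PB))                        # 100000
print(studies_needed(1 * PB, runs_per_study=20))  # 5000
```

So under these assumptions, about a hundred thousand archived runs, or a few thousand studies, would already add up to a petabyte of raw data.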


Oh, you mean BAM files? Get yourself a retention policy; you don't need to keep RNA BAM files that long.

I thought you meant derived data.


I'm talking about the raw reads, which are important if you want to try a different alignment or base-calling method. You can debate how important it is to be able to do that, but I'm not trying to argue that the data should be kept. I was just explaining why the total size of publicly available RNA-seq data (the sum total of which the parent is attempting to organize) runs into the petabytes.


So, do you or the original poster actually have a materialized petabyte of RNA data? Otherwise, you're just describing a million files spread over a million file servers, not being used for science or processed in any way.


Very interested. Looking forward to following this!


"Anybody to use"? Who is paying for the bandwidth here?



