If anybody is interested in AI/bioinformatics projects on AWS, I'm currently involved in a project to harmonize _all_ of the publicly available RNA data (many petabytes) into easily sliced, AI/ML-ready datasets for anybody to use:
An RNA sequencing run generates on the order of 10 GB of data, a typical study requires many runs (treatments, controls, replication of results, etc.), and most biology journals require posting the raw data. I'm not surprised that there is over 1 PB of data available to curate.
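For a rough sense of scale, here is a back-of-envelope sketch of that arithmetic. The 10 GB per run figure is the order-of-magnitude estimate from above; the 20 runs per study is purely a hypothetical number for illustration, not a measured average.

```python
# Back-of-envelope: how many sequencing runs add up to a petabyte?
GB = 10**9   # decimal gigabyte, in bytes
PB = 10**15  # decimal petabyte, in bytes


def runs_needed(total_bytes, bytes_per_run=10 * GB):
    """Number of runs required to reach total_bytes (rounded up)."""
    return -(-total_bytes // bytes_per_run)  # ceiling division on integers


def studies_needed(total_bytes, runs_per_study, bytes_per_run=10 * GB):
    """Number of studies required, given a hypothetical runs-per-study count."""
    return -(-runs_needed(total_bytes, bytes_per_run) // runs_per_study)


print(runs_needed(1 * PB))                        # 100000
print(studies_needed(1 * PB, runs_per_study=20))  # 5000 (20 is an assumed value)
```

So at roughly 10 GB per run, about 100,000 runs is already a petabyte, which is not many once you count every study that has deposited raw data.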
I'm talking about the raw reads, which are important if you want to try a different alignment or base-calling method. You can debate how important it is to be able to do that, but I'm not arguing that the data should be kept; I was just explaining why the total size of publicly available RNA-seq data (all of which the parent is attempting to organize) runs into the petabytes.
So, do you or the original poster actually have a materialized petabyte of RNA data? Otherwise, you're just describing a million files spread over a million file servers, not being used for science or processed in any way.
Website: http://www.refine.bio/
Source: https://github.com/AlexsLemonade/refinebio