How a Theoretical Claim About the NSA Magically Transformed Into Factual Reality (thedailybanter.com)
57 points by brown9-2 on July 19, 2013 | 30 comments



> In reality, it would take 18,918 days or 51 years for 550 analysts to listen to just one day’s worth of fiber optic data gathered for Tempora.

This sort of statement is so misleading it hurts. It's kind of sad that people are so ignorant of what computers and modern data processing can do that someone can go around saying this sort of thing and not be laughed at.

The analysis of the citation chain is pertinent, though.


The dumb mistake of this article is assuming that all captured data has to be analyzed hot off the wire. Obviously the data would be warehoused and indexed for specific searches later.


Or filtered by software.


This article is full of holes itself.

The actual headline he's decrying is 'Millions Of Gigabytes Collected Daily'. According to the estimates and information he includes in the article, that is accurate. It is also suitably vague given the uncertainty over the estimates - it could mean anything from 2 million gigabytes up to 21 million (the theoretical limit). We just don't know. What we do know is that the ambition is to collect every communication, and they are nearing that ambition in reality, if only for a few full-take days at GCHQ. I'm not sure of the NSA's storage capacity in terms of internets per day - does anyone know what their current capacity is?

Put another way, no! NSA and GCHQ are absolutely not gathering and/or analyzing that much data per day.

How does this follow? What has analysing got to do with collecting?

It’s an inconceivably big number meant to frighten readers.

2 petabytes per day would be just as frightening, and yet is well below the theoretical limit. I don't think the exact numbers here matter; this is an estimate meant to show just how much information is being collected.

But he saves the most bizarre assumptions for last:

As a group, and barring some sort of vortex, 550 analysts could only listen to 1,100 gigabytes of phone conversations per day, and that’s if they worked 24 hours per day and listened constantly. In reality, it would take 18,918 days or 51 years for 550 analysts to listen to just one day’s worth of fiber optic data gathered for Tempora.

Is he seriously contending that most of this data is just telephone conversations, that collection without immediate analysis by a human being is not equally dangerous, or that humans actually analyse each piece of information individually, without using filtering algorithms?

This comparison is far more misleading and inaccurate than the headline he decries.
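
The kind of machine filtering he ignores takes only a few lines to sketch - something like this (Python; the selectors and record fields are invented purely for illustration):

    # Hypothetical illustration: reduce a raw intercept stream to the tiny
    # fraction that would ever reach a human analyst.
    SELECTORS = {"target@example.com", "+44 20 7946 0000"}   # made-up selectors

    def filter_stream(records):
        # records: iterable of dicts with 'sender' and 'recipient' keys (invented format)
        for rec in records:
            if rec["sender"] in SELECTORS or rec["recipient"] in SELECTORS:
                yield rec

    # A day's full take goes in; only matches come out for human review.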


There are a number of misrepresentations in this article.

He talks of the impossibility of the 550 (where has this number come from?) GCHQ and NSA analysts reading all this data. As if people are claiming that humans are individually watching it themselves. Even if they were, he completely ignores the obvious method - that machines under those agencies' control are the ones collecting this data - in favour of a hand-waving 'it's impossible'.

His statement that it's illegal for the NSA analysts to read these communications nicely sidesteps the strong evidence that they are doing it.

Even the focus of the article, the 21 million GB claim, tries to stoke up something much madder than what actually happened. The Guardian's reporting of the potential volume of data on the cables is not 'wild' in any sense. That the telephone game translated that potential figure into an actual one is not desirable, but also not particularly surprising. Notably, he doesn't take the opportunity to find out the actual average data rate on those cables and correct what he's criticising.

Of course, he also takes the opportunity to ladle on some criticism of Greenwald and the Guardian, and play up the 'it's a complex issue' angle which discourages any kind of public outcry.


I've always thought this is interesting, if not directly related.

Apache Accumulo:

Like Google, the [NSA] needed a way of storing and retrieving massive amounts of data across an army of servers, but it also needed extra tools for protecting all that data from prying eyes. They added “cell level” software controls that could separate various classifications of data, ensuring that each user could only access the information they were authorized to access.

http://www.wired.com/wiredenterprise/2012/07/nsa-accumulo-go...

Then:

How will graph applications adapt to Big Data at petabyte scale?

Brain scale...: 2.84 PB adjacency list, 2.84 PB edge list

Largest supercomputer installations do not have enough memory to process the Brain Graph (3 PB)!

[Accumulo gives] linear performance from 1 trillion to 70 trillion edges [on the 1 Pb Graph 500 benchmark]

From "An NSA Big Graph Experiment": http://www.pdl.cmu.edu/SDI/2013/slides/big_graph_nsa_rd_2013...


An interesting example of how journalists play telephone, but let's not let quibbling over the volume of data they intercept distract us from the problematic fact that they do this at all.

What would worry me is a chain of citations where the pertinent facts changed (e.g. "celebrity X has the potential for Y" -> "X did Y"). This happens too, as you sometimes find in issues of Private Eye.


Deduplication. 753,785 people watching the same video. The video can be stored once, or even not at all. Storing 753,785 requests for the video is easy and takes 0 analysts.

In fact, by storing only what users send rather than what they receive, you get everything that really matters.
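
A minimal sketch of the idea in Python (purely illustrative, assuming nothing about their actual systems):

    import hashlib

    store = {}      # content hash -> payload, stored once
    requests = []   # (user, hash) per request - tiny

    def record(user, payload):
        digest = hashlib.sha256(payload).hexdigest()
        store.setdefault(digest, payload)   # 753,785 viewers, one stored copy
        requests.append((user, digest))     # the part that "really matters"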


Is there any doubt that if the NSA could store 22 petabytes per day, they would do so? That they are limited by practicality to merely snooping through or scanning 22 petabytes per day, rather than storing all of it, is, I think, a really minor quibble.


According to the NSA[1], their Utah center is "measured in zettabytes".

Let's break this down:

>1,000 gigabytes is a terabyte.

>1,000 terabytes is a petabyte.

>1,000 petabytes is an exabyte.

>1,000 exabytes is a zettabyte.

>1,000 zettabytes is a yottabyte.

Assuming they count in decimal units (1,000, not 1,024, bytes to a kilobyte), let's assume they have 1 yottabyte of data storage. That's 1,000 zettabytes. That's 1,000,000 exabytes. And, finally, that's 1,000,000,000 petabytes.

At a rate of 22 petabytes per day, it would take approximately 45,454,545 days, or about 124,533 years, to fill the Utah data center alone. (Thank you, Wolfram Alpha [2])
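
The same arithmetic as a quick sanity check (Python, decimal units as above):

    PB, YB = 10**15, 10**24

    days  = YB / (22 * PB)    # ~45,454,545 days
    years = days / 365        # ~124,533 years
    print(days, years)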

So the scary answer to your rhetorical question is that "Yes, the NSA has the storage capacity to easily, if not comfortably store 22 petabytes of data per day, or even 500 times that without breaking a sweat."

----------

[1]http://nsa.gov1.info/utah-data-center/

[2]http://www.wolframalpha.com/input/?i=%281%C3%9710%5E9%2F22%2...


Read this: http://nsa.gov1.info/about/about.html

Total HDD shipments are ~135 million per quarter[1]. If every one of those drives were 4TB, the entire world's production would be about 2 zettabytes yearly. One yottabyte would take roughly 500 years of current HDD production.

http://www.storagenewsletter.com/news/marketreport/trendfocu...
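
Rough arithmetic behind that, in Python (the 4TB-per-drive figure is the generous assumption above):

    TB, ZB, YB = 10**12, 10**21, 10**24

    drives_per_year = 135e6 * 4                  # ~135 million drives per quarter
    yearly_capacity = drives_per_year * 4 * TB   # if every drive were 4 TB
    print(yearly_capacity / ZB)                  # ~2.2 zettabytes per year
    print(YB / yearly_capacity)                  # ~463 years of production for 1 YB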


lol. They got me there. I missed that it was a parody site XD.

The analysis is still good though, assuming that the data center does have a one yottabyte capacity.

However, let's redo the analysis with a 2 zettabyte assumption, keeping in mind that the full data capacity of the center is probably not purchased and set up in a one-year period:

We have 2,000 exabytes, which is 2,000,000 petabytes. That works out to 90,909.1 days, or 249.1 years. There we go, that sounds better! They could now comfortably store only a little over 10x the 22 petabytes per day.

http://www.wolframalpha.com/input/?i=2000000%2F22+days+to+ye...


In most interviews, William Binney estimates Utah will have one zettabyte of storage. I may come back with a source later. He worked very closely with these systems in their infancy, and estimates they can store 500 years of current data at maximum... but that they will most likely not use one large volume, and in actuality can store 100 years of data written numerous times so they can parallel-process it for crypto cracking.


Close.

The NSA's Utah Data Center will be able to handle and process five zettabytes of data, according to William Binney, a former NSA technical director turned whistleblower.

http://www.npr.org/2013/06/10/190160772/amid-data-controvers...

However, from the same article:

Despite its capacity, the Utah center does not satisfy NSA's data demands. Last month, the agency broke ground on its next data farm at its headquarters at Ft. Meade, Md. But that facility will be only two-thirds the size of the mega-complex in Utah.


> is "measured in zettabytes"

Yah right. That would require the ENTIRE world's production of hard disks for several years.

At 3TB per hard disk, you need 350 million of them for just one zettabyte. Total world production is about 600 million a year - but the vast majority are much smaller than 3TB.

And that would require a building 1/5 of a mile on each side - just to hold the hard disks, never mind power, cooling, computers or network.

It would require 2.5GW to power just the hard disks - never mind cooling them, or powering the computers and routers.
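
Back of the envelope in Python (the ~7.5 W per spinning drive is my own assumption, not a quoted spec):

    TB, ZB = 10**12, 10**21

    drives = ZB / (3 * TB)            # ~333 million 3 TB drives for one zettabyte
    watts  = drives * 7.5             # assuming ~7.5 W per active drive
    print(drives / 1e6, watts / 1e9)  # ~333 million drives, ~2.5 GW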

So: Yah Right.


It's not as simple as that.

Large-scale commercially available SANs advertise de-duplication on the order of 50% [1].

Commercially available backup software advertises de-duplication rates that reduce storage requirements by up to 99% [2].

I don't know how they measure "storage", but the Utah data center requires 65 MW of power [3], which is a non-trivial amount.

[1] http://www.netapp.com/us/products/platform-os/dedupe.aspx

[2] http://thebackupblog.typepad.com/thebackupblog/2011/07/insid...

[3] http://www.npr.org/2013/06/10/190160772/amid-data-controvers...


How many 4TB drives to store 1 YB?

Assuming 80% fill factor for error correction & redundancy, it's 3.1 x 10^11

http://www.wolframalpha.com/input/?i=%281+yottabyte+%2F+4+te...


So how big a data center would it take to store, cool, and operate more than 300 billion 4TB hard drives?


Looking around on the EMC website, I see they have a product that can put 1.9 PB in a 40U rack. So that's 5.26 x 10^8 racks. Each rack consumes about 18 sq ft. of floor space (including room for aisles, servicing, etc.), so doing the math, it comes out to 339 square miles. Assuming just one floor.

http://www.wolframalpha.com/input/?i=%285.26+x+10%5E8%29+*+1...

What does this tell me? Either the NSA has storage that is 2 or more orders of magnitude denser than current industry parts, or a lot of their data is not in online storage - it's been moved to tape for warehouse storage (but LTO-6 is not as dense as a hard drive - only 2.5TB in a larger device - so the warehouse would still be huge...)

Or, most likely, the NSA just doesn't have 1 YB in their data center. Think about the disruption there would be in the hard drive market from someone buying 310 billion hard drives, even if they spaced out their purchases - especially coming so soon after the Thailand factory floods a few years ago and the shortage that followed.

So what if they used data compression on disk? Assume they got 90% compression (English text, voice, etc. are pretty compressible); that means they only need 1/10th the number of drives. Wolfram Alpha says their data center space requirements go down to 34 square miles. OK, a big savings. But it's still an area bigger than Manhattan.
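
The same numbers in Python (1.9 PB per rack and 18 sq ft per rack are the figures above; the 10:1 ratio is the 90% compression assumption):

    PB, YB = 10**15, 10**24
    SQFT_PER_SQMI = 5280 ** 2

    racks    = YB / (1.9 * PB)             # ~5.26e8 racks at 1.9 PB per 40U rack
    sq_miles = racks * 18 / SQFT_PER_SQMI  # 18 sq ft per rack -> ~340 square miles
    print(sq_miles, sq_miles / 10)         # ~340 sq mi, ~34 sq mi at 10:1 compression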


Relevant: "Dear NSA, I Don’t Think You Meant Yottabytes" http://xato.net/privacy/dear-nsa-meant-yottabytes/


The NSA and GCHQ have way more than 550 employees, and I guarantee that the rest are not all just janitors, secretarial staff, and management.

This article and most others have failed to mention the technological force-multipliers (the bread and butter of HN) that their developers, sysadmins, and other technologists are constantly working on.

With the right search capabilities and filtering, that 21 million GB gets a lot smaller.


Bamford put it at ~70,000 in the seventies in one of his books. Wikipedia estimates 30,000. 550 is a comically low guess; we could get a much larger estimate merely by counting parking spaces at Ft. Meade, and that wouldn't include the many employees who don't work at Ft. Meade.


The LHC grid (as of 2010) had a storage capacity of 150 petabytes, so yes, it is feasible. Efficient compression could probably increase that capacity 5-10x if you're storing primarily textual data.


Hey, they're wrong about the amount of data, therefore we've got nothing to worry about!


This should be called "How a Trivial Fact Check Became a Hyperbolic Article".


550 security guards, each in a little stadium of CCTVs

always watching


Yes, the worst part about the security state/NSA story is listening to and reading those who are most upset about it. They get on an emotional tear and play fast and loose with facts and sources. This hurts much more than it helps, because it begins to paint anybody concerned about what we've created as a crank. I'm much more concerned about the ranters than I am about those who are apathetic.

This is an issue to be intellectually passionate about, not emotionally passionate. It's something I have to keep reminding myself of daily.

News outlets and bloggers are all too willing to play telephone and translate emotional energy into pageviews. That's not a good thing for the discussion.


As soon as you boil it down to boring work, it will be forgotten by the proles.


The problem with this strategy is that all you get is a mindless angry mob. This is the Egypt scenario: everybody knows things are broken, but nobody can agree on what it's supposed to look like.

Contrast this with the Federalist Papers, where an intellectual debate was held to vet solutions.

You don't want "any change" to take the place of "fix what's broken". By dumbing this down, you turn what could be a productive movement into something much more dangerous.


As a rule of thumb, conservatives base their public policy agenda on theories that have no basis in reality, whereas liberals base their public policy agenda on statistics that have no basis in reality. This is a good example of the latter.



