Phone call metadata does betray sensitive details about your life (theguardian.com)
138 points by Libertatea on March 13, 2014 | 62 comments



If it didn't, the NSA would not be so interested in collecting it. Paranoid people who believe they might be being listened in on are unlikely to reveal much directly in the conversation itself anyway, so in those cases the metadata is more important. Also, the metadata can reveal anomalous behavior, which they look for mainly because it's easy to find, but also because it can reveal important information assuming the targets are correctly selected.

Anyway, the only reason they aren't collecting the calls themselves is that the storage required is not yet available, so their begging off with "but we aren't storing the content" is disingenuous. They can, at any moment, capture any call going over the long-distance network (which would include pretty much all cell phone calls), so the only thing they're unable to do is retroactively listen in on calls. If you are flagged for whatever reason (you know someone who knows someone who knows someone they suspect of sending money to a bad charity), you may well be being monitored.


> If it didn't, the NSA would not be so interested in collecting it.

I'm going to quote Schneier.

Stop calling it metadata. Call it what cryptographers and security professionals have called it since forever: traffic analysis.

Traffic analysis is a powerful tool by itself. Combined with practically unlimited access to source material, and the ability to unmask almost every communication node, you don't even NEED to care about the contents. Timing, frequency and direction of communication is more than enough.


The storage required to keep all call recordings (GSM, VoIP, land-line) is currently available. For example, Speex [1] can compress voice down to 2 kbps. So storing everything at, e.g., 8 kbps, you can fit 916,259 hours (104 years) of voice on just one 3 TB disk.

[1] http://speex.org/


Let's take the US. 317 million people, assuming they call for an average of 10 minutes a day (based on nothing whatsoever, btw), gives approx. (10 × 317 million) / 60 = 53 million hours of phone conversations a day.

Given 916,259 hours = 3 TB, 53M / 916,259 ≈ 57 disks × 3 TB ≈ 172 TB of data /a day/. And that's just the US. Even if you adjust the average to just one minute a day, you're still looking at 17 TB / day, which should be sorta manageable I reckon.

But it's not just the US. Let's assume they want to track all voice communications globally, rounding it to an even 7 billion, ten minutes a day. I'm counting 3819 TB/day. That's a lotta 3 TB hard drives.

tl;dr: big data is big. disclaimer: I suck at basic arithmetic, I probably made a miscalculation.
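For anyone who wants to re-run the arithmetic above, here is a minimal sketch. It assumes the same inputs as the comments (8 kbps audio, 10 minutes per person per day) and additionally assumes a "3 TB" disk means 3 * 2**40 bytes, which is what reproduces the 916,259 hours figure. The function names are made up for illustration.

```python
# Back-of-the-envelope check of the figures in this thread.
# Assumptions: 8 kbps audio, one "3 TB" disk = 3 * 2**40 bytes,
# and the (admittedly arbitrary) 10 minutes of calls per person per day.

BITRATE_BPS = 8_000        # 8 kbps, i.e. 1000 bytes per second
DISK_TB = 3
DISK_BYTES = DISK_TB * 2**40


def hours_per_disk(disk_bytes=DISK_BYTES, bitrate_bps=BITRATE_BPS):
    """Hours of compressed voice that fit on one disk."""
    bytes_per_hour = bitrate_bps / 8 * 3600
    return disk_bytes / bytes_per_hour


def tb_per_day(people, minutes_per_person):
    """Daily storage in TB if everyone's calls are recorded."""
    call_hours = people * minutes_per_person / 60
    disks = call_hours / hours_per_disk()
    return disks * DISK_TB


if __name__ == "__main__":
    print(f"hours per disk: {hours_per_disk():,.0f}")
    print(f"US, 10 min/day: {tb_per_day(317e6, 10):,.1f} TB/day")
    print(f"US, 1 min/day:  {tb_per_day(317e6, 1):,.1f} TB/day")
    print(f"World, 10 min:  {tb_per_day(7e9, 10):,.1f} TB/day")
```

Under those assumptions the results land at roughly 173, 17 and 3,820 TB/day, so the figures above hold up to rounding.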


~200TB/day is entirely within the room the NSA's budget allows.

You don't need to record all phones in the world if you have the metadata of all calls in the world. The NSA or another spy agency could record only calls that match a given pattern: place from which the call originates, who is being called, time of call, previous calls made or received by the line, when the line or phone was purchased, whether this phone's call patterns resemble another phone's call patterns, etc.

They wouldn't achieve 100% coverage, but efficacy would probably be 99%+.

I would guess the problem with this kind of semi-targeted collection is processing power to decide who is a target and schedule the line taps.


Or, record all of it that you can into a daily or weekly cache, and then keep in indefinite (expensive) storage the things which are Statistically Interesting but outside our current budget/capabilities to store forever.


Yes, good point. You don't have to store everything forever, you can have tiers of interestingness — capture everything, read it looking for certain patterns, store anything that matches those patterns forever, store stuff that matches [secondary, less important but still useful pattern] for a few years, and store all other calls for a few months or a year.
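A rough sketch of what such tiers could look like, purely as illustration. The tier names, the retention periods and the idea of matching on transcript keywords are all assumptions of mine, not a description of any real system.

```python
from dataclasses import dataclass

# Hypothetical retention tiers, in days; None means "keep indefinitely".
RETENTION_DAYS = {"primary": None, "secondary": 3 * 365, "default": 180}


@dataclass
class Call:
    caller: str
    callee: str
    transcript: str


def classify(call, primary_terms, secondary_terms):
    """Pick a tier by checking the transcript against two keyword sets."""
    words = set(call.transcript.lower().split())
    if words & primary_terms:
        return "primary"
    if words & secondary_terms:
        return "secondary"
    return "default"


def retention_days(call, primary_terms, secondary_terms):
    """How long a call would be kept under the hypothetical tiers."""
    return RETENTION_DAYS[classify(call, primary_terms, secondary_terms)]
```

Real pattern matching would obviously be far more involved than a keyword intersection; the point is only that the retention decision is a cheap function applied once per call.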

Phone calls are quite low quality audio, but I don't expect the NSA to be limited to consumer grade speech-to-text technology, so at least for calls in some languages, they could store the transcripts forever.

EDIT: Apart from processing power, another expensive problem with such a setup is memory to store the firehose temporarily.

EDIT 2: If you were wondering, 200TB/day would run at $7500 for 50 4TB external hard drives at $150 each, assuming you wanted to use a Backblaze-like setup. In a year, that's $2.7 million. (This doesn't account for redundancy.)
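The drive math in EDIT 2 as a small sketch, using only the numbers quoted there (4 TB drives at $150, no redundancy). The function name and defaults are illustrative.

```python
import math


def drive_cost(tb_per_day, drive_tb=4, drive_price=150.0, days=365):
    """Return (drives per day, cost per day, cost per year), no redundancy."""
    drives_per_day = math.ceil(tb_per_day / drive_tb)
    daily_cost = drives_per_day * drive_price
    return drives_per_day, daily_cost, daily_cost * days


if __name__ == "__main__":
    drives, daily, yearly = drive_cost(200)
    print(f"{drives} drives/day, ${daily:,.0f}/day, ${yearly:,.0f}/year")
```

Doubling or tripling the result for redundancy still leaves it in the single-digit millions per year.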


I think they can handle it in Utah:

"An article by Forbes estimates the storage capacity as between 3 and 12 exabytes in the near term" [1]

[1] https://en.wikipedia.org/wiki/Utah_Data_Center


Which gives you between 2-10 years of storage if it was recording everyone in the world. (Based on the assumptions above).
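A quick check of that range, assuming decimal units (1 EB = 1,000,000 TB) and the 3819 TB/day worldwide figure from upthread:

```python
TB_PER_EB = 1_000_000  # decimal units, an assumption


def years_of_capacity(capacity_eb, tb_per_day):
    """Years until a store of the given size fills at a constant daily rate."""
    return capacity_eb * TB_PER_EB / tb_per_day / 365


if __name__ == "__main__":
    low = years_of_capacity(3, 3819)
    high = years_of_capacity(12, 3819)
    print(f"{low:.1f} to {high:.1f} years")
```

That comes out at a bit over 2 years for 3 EB and a bit under 9 years for 12 EB.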


I think you need to divide that by 2 since conversations happen in pairs, and most people in the US are talking to someone else in the US.


You rate 17 TB / day as "sorta manageable I reckon" for the federal government? Your local Best Buy could meet that demand.


I estimate an upper bound on the cost to be about $11 million/year:

https://docs.google.com/spreadsheet/ccc?key=0AqWtA_3af-R0dE5...


Well, for starters, if most calls are internal to the US, half of those ten minutes will be shared with another phone (unless you're just counting outgoing calls, not ingoing calls, for the "magic" 10 minute number). So if you're talking average talk time, you need to divide by two. The effect of conference calls is probably(?) negligible however.

Now, let's take a number, say 4PB/day, or 120 PB/month, say that we put each call to S3, and that each call lasts 2 minutes (on average), so 5x317x30x10^6~ 1.5 Billion puts/month, and let's say 1.5 Million gets, and a full 120PB in/out internal transfer -- that comes to just about 19M USD/Month on S3. That's certainly within the NSA's budget -- and Amazon can provide that with a profit margin (assuming they can, in fact, provide that service).


I think you made an arithmetic mistake. Look at it another way:

10 minutes per person per day * 8kbps = 600 kB per person per day. 600 kB / person / day * 365 days /year = 214 MB / year.

That's nothing. Consumer flash media is something like $0.30/GB. Let's mark that up 100x because the three letter agency doesn't care about costs and has an inefficient procurement process, so $30/GB.

0.209 GB / year * $30 / GB = $6 per person per year.

There are 300 million people in the US, but phone calls are between at least two parties, so:

300 million people * 0.5 * $6 per person per year = $900 million per year.

You can't even build a mile of highway for that little. Hell, some big cities in the US have a bigger annual deficit than that.
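The same per-person estimate as a sketch, using the inputs from this comment (10 minutes/day, 8 kbps, $30/GB after the 100x markup, half of 300 million people). I use decimal gigabytes here, which is my own choice, so the result comes out slightly above the rounded figures in the comment but in the same ballpark.

```python
def gb_per_person_per_year(minutes_per_day=10, bitrate_bps=8_000):
    """Storage per person per year, in decimal GB."""
    bytes_per_day = minutes_per_day * 60 * bitrate_bps / 8
    return bytes_per_day * 365 / 1e9


def national_cost_per_year(people=300e6, dollars_per_gb=30.0, pair_factor=0.5):
    """Yearly cost; pair_factor halves it because each call has two parties."""
    return people * pair_factor * gb_per_person_per_year() * dollars_per_gb


if __name__ == "__main__":
    print(f"{gb_per_person_per_year():.3f} GB per person per year")
    print(f"${national_cost_per_year() / 1e6:,.0f} million per year")
```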


Count me among those who think the storage is doable. The transmission may be a bottleneck: effectively you're doubling the bandwidth requirements for phone traffic.

With local tapping and storage facilities, plus some mechanism for cache-and-forward (including the enduring favorite, a station wagon full of tapes, or perhaps a panel van full of flash drives), this remains within the realm of possibility.


You can actually do much better than 2 kbps. MELP-E (unfortunately patented) can actually do 600 bps (and is fairly robust to noise, I might add). With the advent of great speech-to-text ML, you could reduce this even further, to the 100-200 bps range.


They are already storing the phone calls...

* http://www.pbs.org/newshour/bb/government_programs-july-dec1...

* http://www.theguardian.com/commentisfree/2013/may/04/telepho...

I'll quote the slip up of Tim Clemente, a former FBI agent:

> All digital communications are uh uh... There's a way to look at digital communications in the past. And I can't go into detail of how that's done or what's done but I can tell you that no digital communication is secure.


And even if they are not storing the audio recordings, wouldn't it be reasonable to expect that automatic transcription is run against the audio for search indexing and long-term storage purposes? Text transcriptions like that, even if they are imperfect, can preserve conversations with minimal storage requirements almost indefinitely.


I think you need to pay a little more attention:

http://www.youtube.com/watch?feature=player_detailpage&v=vIL...

They ARE storing everything. 15 years' worth.


Are you quite sure they aren't collecting content? The storage is feasible, and several remarks (slips?) by people in a position to know suggest that they are in fact doing this, if not to everyone, then to hundreds of thousands or even millions of people.

http://blog.rubbingalcoholic.com/post/52913031241/its-not-ju...

https://www.schneier.com/blog/archives/2013/06/evidence_that...


> Are you quite sure they aren't collecting content?

It seems to depend entirely on the program in question, and the legal authorities for each program.

For foreign intelligence actually collected overseas under EO 12333 they have programs that collect content (e.g. SMS), which have existed in one form or another since the beginnings of the Cold War.

For foreign intelligence conducted on U.S. soil they have to fall within the boundaries of Fourth Amendment reasonable search (and related legal rulings like Smith v. Maryland) so they wouldn't capture content. But on the other hand they can still capture content in targeted fashion against non-U.S. persons in certain scenarios, and that content might itself involve conversations involving a U.S. person and still be a legal search.


> It seems to depend entirely on the program in question

Accessibility does, but it has nothing to do with the agency. "Does the government store the content of US calls?" can be answered in the aggregate without reference to specific programs. If the content is there, an agency only has to acquire the authorities required to access what is already there.

So, the question remains: is the content there to access, given proper authorities?


There are multiple levels of authorities though. Collecting the calls at all would itself require some legal authority. Searching the calls you collected would then need yet another legal authority.


Of course! So, is the content stored?


While this is certainly true, sentences like this one show that The Guardian doesn't understand the nature of the data/metadata distinction: "The researchers cite statements like that of President Obama, that the NSA was 'not looking at content,' and ask whether the legal distinction between metadata and content is matched by harm reduction in the real world."

The distinction between data and metadata in U.S. law isn't about "metadata" being supposedly less harmful than "data." That has absolutely nothing to do with it. The distinction is based on the 4th Amendment's requirement that people have an expectation of privacy in the information in order for it to be protected. The idea is that if you have to expose certain information to a company in order for the communication to work at all, then you can't expect that information to be private. If you expose it, if you have to expose it, it's not private.

The legal distinction reflects the underlying technical distinctions. You can encrypt a phone conversation, but you can't encrypt the signaling information and still have the system work. You can encrypt the contents of a packet, but you can't do the same for the IP headers. You have to expose this "metadata" in order for modern telephone and IP networks to work as they are currently designed, and it's this exposure that creates the distinction for 4th amendment purposes.
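To make the technical point concrete, here is a toy sketch of a forwarding decision. The packet layout, the string-prefix "routing table" and the hop names are all invented for illustration; real routers do longest-prefix matching on binary addresses. What it shows is that the forwarding function reads only the header fields and never touches the payload, which could just as well be ciphertext.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    src: str          # exposed: needed to route replies
    dst: str          # exposed: needed to route the packet
    payload: bytes    # may be ciphertext; the router never parses it


# Hypothetical forwarding table: destination prefix -> next hop.
ROUTES = {"10.1.": "router-a", "10.2.": "router-b"}


def next_hop(packet, routes=ROUTES, default="upstream"):
    """Choose the next hop from the destination header alone."""
    for prefix, hop in routes.items():
        if packet.dst.startswith(prefix):
            return hop
    return default


if __name__ == "__main__":
    encrypted = Packet("10.1.0.5", "10.2.0.9", b"\x8f\x13\xa0\x42")
    print(next_hop(encrypted))
```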


"The right of the people to be secure in their persons, houses, papers, and effects, against unreasonable searches and seizures, shall not be violated, and no Warrants shall issue, but upon probable cause, supported by Oath or affirmation, and particularly describing the place to be searched, and the persons or things to be seized."

There is nothing in the 4th Amendment that says anything about an expectation of privacy. Besides, phone metadata IS private. It is only shared with the person placing the call, the phone company, and the recipient of the call. It is not publicly available. Just because I share something with someone else does not mean it is no longer private, just that it is private between myself and that other person.


> There is nothing in the 4th Amendment that says anything about an expectation of privacy.

Ignoring the 4th amendment jurisprudence is a double-edged sword, because it allows interpretations of the 4th amendment that are a lot more conservative than the modern "expectation of privacy" formulation. The original intent of the 4th was just to prevent customs searches of your home. Extending it to communications at all was an invention of SCOTUS. Moreover, the interpretation of the word "their" is a challenge. How can Bob and Marie say that AT&T's records about them are "their . . . papers, and effects . . ." when they didn't record that information, they aren't storing that information, and they don't even have access to that information?

No, if you're a privacy advocate, you're far better starting from the "expectation of privacy" springboard than going back to the literal text.

> It is only shared with the person placing the call, the phone company, and the recipient of the call.

The difference between "private" and "public" is a spectrum. On one end of the spectrum are the thoughts in your head. On the other end are thoughts you publish in the NYT. The way the 4th amendment jurisprudence uses the word "private" it means things that are close in the spectrum to the former extreme, not everything that isn't at the latter extreme. Sharing something by publishing it in the NYT means it's not private. So does sharing something with potentially thousands of people at AT&T or Google.


"How can Bob and Marie say that AT&T's records about them are "their . . . papers, and effects . . ." when they didn't record that information, they aren't storing that information, and they don't even have access to that information?"

I agree with you on this, but I don't agree that means it is no longer protected by the 4th Amendment. That information is AT&T's "papers and effects", so collecting it without a warrant is a violation of AT&T's 4th Amendment rights, not necessarily Bob and Marie's.


You don't need a warrant if the party consents, and as far as I know, AT&T hands that information over voluntarily or in response to valid subpoenas.


"The original intent of the 4th was just to prevent customs searches of your home."

How did you reach that conclusion? I would have thought that if the Founders intended to limit the 4th to "just customs searches" they would have said so?

Given that at the time, there was no electronic communication, it isn't really surprising that the 4th doesn't mention it explicitly, and only mentions "papers".


> How did you reach that conclusion? I would have thought that if the Founders intended to limit the 4th to "just customs searches" they would have said so?

The Bill of Rights were born of the same environment that gave a specific protection regarding the right to bear arms and regarding the quartering of soldiers in private homes.

They were in many ways a reaction to British actions prior to the American Revolutionary War and not a simple enumeration of rights. Indeed, the Federalists felt the Bill of Rights was self-evident and did not need to be duplicated in that fashion, and they added in a separate Amendment making clear that the fact that these rights were enumerated should not be construed to mean that the list of rights was exhaustive.

Getting back to your question then, the Fourth Amendment was a reaction to British "writs of assistance" that allowed customs inspectors wide-ranging powers of search and seizure to stop New England merchants from smuggling goods.

Indeed, if the writers of the Fourth Amendment had meant to completely foreclose the possibility of government warrantless searches then they would not have inserted the weasel word "reasonable" into its text.

So the good news is that reasonable can change over time to become more restrictive on government's ability to search. The bad news is that until this change actually happens in the courts then it's hard to say specifically that the government is "oppressing the people" just because their lawyers don't agree with your interpretation of the Fourth Amendment.


Thanks. I understand the historical background, but I don't agree that it was meant to be limited to just what the early Americans had recently experienced. You are correct that the word "reasonable" can be interpreted and re-interpreted by the courts over time.


Except that the NSA has, by its own admission, been gathering metadata without warrant, for some time, so it's not private, by any reasonable standard of the concept. There are degrees of privacy, but arguably once the data has been removed from the context you shared it in, it's not private - the "moment of exposure" is a crucial concept here.


So they just need to articulate that they see this as a problem with the way the 4th Amendment is written and interpreted? Along with much of the associated law of course.

I guess a right to conduct day to day affairs without 3rd and 4th party institutional intervention is harder to reason about.


Yes, the correct debate is about how the 4th amendment is written and interpreted. That said, there is a tactical reason why opponents of surveillance take the somewhat disingenuous approach of banging on the 4th amendment while raising issues that are not relevant to analysis under the 4th amendment. That tactical reason is that if you can make a government policy out to be a Constitutional violation, you need only convince the courts the policy is wrong, instead of convincing the public and the Congress. It's a lot easier.

> I guess a right to conduct day to day affairs without 3rd and 4th party institutional intervention is harder to reason about.

What I find puzzling about this sentence is that we conduct our day to day affairs in a way where 3rd parties are almost always involved, whether you're talking about AT&T, Google, etc. Most people who oppose surveillance want to set up a system where it's okay for AT&T, Google, etc, to be able to track you using metadata, but not okay for the government to do the same.

That proposal might very well be a good idea, but there's simply no way to express that distinction in Constitutional terms. "Private" in the Constitution doesn't mean information is just between you, who you're talking to, and your hundred closest friends working at Google or Facebook.

If people want the interpretation of the 4th amendment to change, it will be a lot easier if these services were actually private: i.e. encrypted content and obfuscated signaling so that third party service providers like AT&T and Google can't use the information for commercial purposes. Then you can make the "expectation of privacy" argument with a straight face.


Trying to explain intervention there, say I have an agreement with AT&T that they maintain records of my calls strictly for billing and then dispose of them, then doing something else with the records is intervening (in the sense that undisclosed attention is somehow being directed at records of my calls). It was not a good way to say it.


That's precisely what I'm getting at. If you had such agreements with AT&T, Google, etc, that they'd only use your call/e-mail records for routing and billing and dispose of them immediately, you'd have a far stronger argument that you have an expectation of privacy in that information.


> The idea is that if you have to expose certain information to a company in order for the communication to work at all, then you can't expect that information to be private.

I don't see how this can work as a distinction between a phone call's metadata and its actual audio content. When I pick up the phone, my phone number and my voice are both conducted over the phone company's network. I have to expose my voice to the network to make the system work.


The cell network doesn't need to inspect the audio content. It could be encrypted, for example (and carriers do offer encrypted voice calls). But the signaling information has to be exposed in a way that allows the network to know that phone A is calling phone B.


Where does that sort of hair-splitting get you when you consider a land line, and does that logical endpoint still make any sense at all?


It's not hair splitting, it's a deep truth about the nature of communications networks as they're currently designed: they can't route communications without inspecting metadata that discloses the endpoints. They can, however, perfectly happily route content that they can't inspect.


I don't see how you can talk about "inspection" without some sort of intentionality, which the machines don't have.

Digital systems in particular transport copies of bits, and the process by which the copies are made is "inspection" as surely as anything else is; the useful function of the system depends on correct interpretation of the payload bits as much as it does on any of the other bits.

It is your revision from "exposure" to "inspection" that I feel is hair-splitting. It's not even a distinction without a difference. It's a distinction applied to a machine but based on human assumptions about intent. It's anthropomorphizing.

I don't buy the legal justification for the collection of metadata that you describe, and I think that a really honest, thorough application of that line of reasoning would lead us to conclude that no electronically assisted communications have "an expectation of privacy."

I guess that's where we find ourselves, though.


did this number talk to this number or anything four hops in its graph. that is the extent of the "metadata" level of anonymity that is admitted to. are the folks running this query system willing to remove numbers everybody has to call at one time or another? it is not and never has been metadata, it is data. it is not anonymous, it is not data about data and it can be cross referenced. it would be utterly idiotic to contemplate that it works any other way. if you were in charge of it, how would you have it work? leave all decency aside and answer in the "get the job done" frame of mind.
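For what it's worth, the "anything within N hops" query is just a breadth-first search over the call graph. A minimal sketch, assuming call records are plain (caller, callee) pairs and treating the graph as undirected; the function names are made up. It also shows why the question about numbers everybody calls matters: a single shared hub pulls unrelated people into each other's neighborhoods.

```python
from collections import deque


def build_graph(call_records):
    """Undirected call graph from (caller, callee) pairs."""
    graph = {}
    for a, b in call_records:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def within_hops(graph, start, max_hops):
    """Return {number: hop distance} for every number within max_hops of start."""
    seen = {start: 0}
    queue = deque([start])
    while queue:
        number = queue.popleft()
        if seen[number] == max_hops:
            continue  # do not expand past the hop limit
        for contact in graph.get(number, ()):
            if contact not in seen:
                seen[contact] = seen[number] + 1
                queue.append(contact)
    return seen


if __name__ == "__main__":
    records = [("alice", "pizza"), ("bob", "pizza"), ("bob", "carol")]
    # alice and carol never spoke, yet carol is 3 hops from alice via the pizza place.
    print(within_hops(build_graph(records), "alice", 4))
```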


A land line makes an actual circuit to the other party. The signaling information is used to arrange the necessary relays/switches (for old analog gear) or virtual circuit routing tables (for newer stuff).

Once this circuit is set up, there's no requirement to actually monitor the electrical signaling on the wire (physically or virtually) until the next out-of-band signal event (e.g. hang-up), just as there's no requirement for UPS to open your box while it's in their possession. Even for virtual circuits you don't need to inspect the data portion of the segment, you need only inspect enough of the header to send it on to the right endpoint.

But either way, even that argument has a bit more nuance; the call metadata was a business record retained by the business itself, not the person. And the person knew the business had these records about them, because the phone company mailed them that metadata every month along with the bill.

So (the argument went) there couldn't be a reasonable expectation of privacy regarding the phone numbers dialed and the call lengths, since the phone company made it quite clear every month that this "metadata" wasn't actually private (to the person) at all.


Wouldn't this strictly only apply to IP destination headers?


> Researchers ... successfully identified a cannabis cultivator, multiple sclerosis sufferer and a visitor to an abortion clinic using nothing more than the timing and destination of their phone calls.

Unfortunately, the political actors who are the biggest cheerleaders/defenders of total surveillance are also the ones most likely to be in favor of unconstitutionally severe pursuit and punishment of drug growers, in favor of ejecting sick people from the health care system (preferring that they instead die quickly and cheaply), and in favor of publicly shaming abortion patients.

In other words, a result like this is particularly likely to reinforce existing biases more than anything else. Plenty of Americans will respond to such news by suggesting such data should be used more, not collected less.


As someone said recently "we know you have called a phone sex line 10 times in the last month, and a divorce lawyer last week, but we don't know what you talked about."


I'm not doubting that it's entirely possible to identify a lot about people from their phone metadata. I don't think it takes a huge amount of creative thinking to establish that someone who calls a hydroponics dealer and a headshop is fairly likely to have some connection to drugs, and that tracking their phone calls would allow you to spot that.
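The kind of inference described here needs very little machinery. A toy sketch, assuming you have some directory that maps numbers to business categories; the directory contents and the numbers are invented, and nothing here says how the researchers actually did it.

```python
from collections import Counter

# Hypothetical directory: phone number -> business category.
DIRECTORY = {
    "555-0101": "hydroponics supplier",
    "555-0102": "head shop",
    "555-0103": "locksmith",
}


def category_profile(called_numbers, directory=DIRECTORY):
    """Count how often each known category appears in a call log."""
    return Counter(directory[n] for n in called_numbers if n in directory)


if __name__ == "__main__":
    log = ["555-0101", "555-0199", "555-0102", "555-0101"]
    print(category_profile(log))
```

As the comment says, the output is a guess, not a confirmation: the profile only shows which categories were called, not why.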

But I'm not sure that this article really adds a lot to the discussion with statements like "Owing to the sensitivity of these matters, Mayer explains that the researchers elected not to contact the three participants for confirmation that their inferences were correct"

If they could have correctly identified a drug dealer from a pattern of seemingly innocuous phone calls (and actually validated that they are correct) then this could have been at least vaguely interesting. As it is, the story is "if you've phoned somewhere that deals with MS relapses, we can make a guess that you could have MS". Well thanks, Sherlock.


Just for the record -- unless somebody is also storing the contents of the calls, which the NSA strenuously claims isn't happening, it's NOT metadata. It's just data.

Why? Because metadata is "data about data", so something is only metadata if there's other data for it to be data about.

http://www.dbms2.com/2014/02/23/confusion-about-metadata/


Meh. The call itself, and the data about the call, traverse the telephone network. The data about the call, the metadata, is being logged and stored. Yes, by the time it gets to an NSA storage device, it's no longer "meta," but at the time the human made a call and generated the metadata, it was metadata. This is the tag we've chosen to stick on it. That the tag maybe should change doesn't really change the privacy implications, and would only serve to confuse the conversation to a populace that already doesn't seem to care.


Only if you define data as "something that the NSA has". I appreciate that they've overstepped their boundaries but this is probably taking things a bit far.

The existence of metadata doesn't imply the continued existence of the original data. Just that it has existed.


Let's be perfectly clear here:

> Metadata is "data about data"

Just shorten that to "Metadata is data." That's what it is, and always has been. This weird idea that it is some sort of second-class citizen of data is the problem and allows the government to hide behind "it's just metadata": it's still data, and should be treated as such. I wouldn't care if it were an empty file: it's still data, and you still need a warrant or probable cause, not a terrorist witch hunt.


Well, yeah? The first thing I thought when the NSA story broke was "Just metadata? Okay Mr. President and members of Congress: how about we release a list of everyone you talked to and when? It's just metadata, right?"


How about we perform a basic statistical analysis and see what locations you frequent. From that we can tell what religious, financial, and social organizations you are a member of, and can go beyond to what stores you patronize. And what routes you take to and fro. Metadata can be as harmful as recording the calls, if not more so.
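"Basic statistical analysis" can be as simple as counting. A sketch, assuming each record is a (cell tower id, hour of day) pair; that record format and the night-hours heuristic for guessing a home location are my assumptions.

```python
from collections import Counter


def frequent_locations(records, top=3):
    """Most frequently seen towers, as (tower, count) pairs."""
    return Counter(tower for tower, _hour in records).most_common(top)


def likely_home(records, night_hours=range(0, 6)):
    """Guess 'home' as the tower seen most often during night hours."""
    nights = Counter(tower for tower, hour in records if hour in night_hours)
    return nights.most_common(1)[0][0] if nights else None


if __name__ == "__main__":
    records = [("T1", 1), ("T1", 2), ("T1", 3), ("T2", 10),
               ("T2", 11), ("T3", 14), ("T1", 23)]
    print(frequent_locations(records, top=2))
    print(likely_home(records))
```

Swap the towers for known addresses (a church, a bank, a clinic) and the counts start saying things about the person.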

Betcha they collect more than just the raw call metadata, and collect metadata on data connections. From that they can more easily determine the above, since you don't even need to be on a call for a service to connect to the 'net.


If it didn't, it would be even more troubling that the US uses phone metadata to assign drone assassination targets, as if it wasn't troubling enough.

https://firstlook.org/theintercept/article/2014/02/10/the-ns...


Don't mean to be rude, but literally thought this was common knowledge by now. Clearly, that is my own failing for not understanding that The People != Me.

I think people are starting to get it, though, especially thanks to how creepy targeted advertising is getting on Facebook and the like.

And that's kind of the thing, with all of the data that Facebook had on me, they couldn't figure out to not send me advertisements for skeezy dating services? That's the only solace I take from this story: it sucks that the security services are collecting all of this data, but if I can't do anything about it, it's good to know that they are drinking from a firehose.


Loads of "truthy" political bullshit that goes around on Facebook and email chain letters is regarded as fact (by the test of gut feeling) by plenty of voting adults.

There are different kinds of "common knowledge" among different groups of people. What's common knowledge to a person immersed in cyberculture for decades is going to be a world apart from what's common knowledge to a sales guy.


This is a good slide that demonstrates the power of metadata. http://pbs.twimg.com/media/BeMm9qJCEAEHu9Y.jpg


Once again, if people are worried about NSA capturing cell phone meta-data, please, go check out what the CFPB is doing with credit card data. We are talking every transaction from millions of Americans.

http://www.usnews.com/opinion/economic-intelligence/2014/02/...


Protip: the government scans and keeps the exterior of all mail that is sent. So, put the return address on the inside of the envelope so that it is harder to work out who is sending mail to whom.


I think it is very naive (bordering on denial) to think that the NSA doesn't record the contents of every phone call it can get its hands on, not just meta-data.


Sure, but I think it's useful to show that, even taking them at their word, metadata alone is dangerous.



