You can infer some amazing things from simple metadata. I spent six months in an R&D team at a large mobile telco, with the task of trying to infer as much as possible from anonymous customer data just like this.
Figuring out where you live and work, to a reasonable accuracy, is quite easy; you simply look at where the most outgoing calls/SMS originate from at certain hours of the day over an extended period.
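To make the home/work heuristic concrete, here is a minimal sketch. The record layout, the subscriber and cell identifiers, and the hour windows are all assumptions for illustration, not the method the team actually used.

```python
from collections import Counter
from datetime import datetime

# Hypothetical records of outgoing calls/SMS: (subscriber_id, ISO timestamp, cell_id).
records = [
    ("a1", "2013-06-03T23:10:00", "cell_17"),
    ("a1", "2013-06-04T07:05:00", "cell_17"),
    ("a1", "2013-06-04T10:30:00", "cell_42"),
    ("a1", "2013-06-04T13:00:00", "cell_08"),
    ("a1", "2013-06-04T15:45:00", "cell_42"),
    ("a1", "2013-06-05T11:15:00", "cell_42"),
    ("a1", "2013-06-05T22:40:00", "cell_17"),
]

def likely_home_and_work(records, subscriber):
    """Guess home and work cells as the most frequent cell per time window."""
    night, day = Counter(), Counter()
    for sub, ts, cell in records:
        if sub != subscriber:
            continue
        t = datetime.fromisoformat(ts)
        if t.hour >= 21 or t.hour < 8:
            # Assumed "at home" window: 21:00 to 08:00.
            night[cell] += 1
        elif t.weekday() < 5 and 9 <= t.hour < 17:
            # Assumed "at work" window: weekdays, 09:00 to 17:00.
            day[cell] += 1
    home = night.most_common(1)[0][0] if night else None
    work = day.most_common(1)[0][0] if day else None
    return home, work

print(likely_home_and_work(records, "a1"))
```

With weeks or months of data instead of a handful of rows, the counts become dominated by routine, which is why the guess gets reasonably accurate.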
We built up our own social graph. You treat calls and text messages as directed edges and phone numbers as nodes. These were fascinating to look at.
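A rough sketch of that graph construction, assuming each call or SMS record has been reduced to a (caller, callee) pair. The identifiers and helper names are made up for the example.

```python
from collections import defaultdict

# Hypothetical call/SMS events as (caller_id, callee_id) pairs.
events = [("A", "B"), ("A", "B"), ("B", "A"), ("A", "C"), ("D", "A")]

def build_graph(events):
    """Directed multigraph stored as graph[src][dst] = number of contacts."""
    graph = defaultdict(lambda: defaultdict(int))
    for src, dst in events:
        graph[src][dst] += 1
    return graph

def degrees(graph):
    """Return (out_degree, in_degree) dicts counting distinct contacts."""
    out_deg = {node: len(nbrs) for node, nbrs in graph.items()}
    in_deg = defaultdict(int)
    for src, nbrs in graph.items():
        for dst in nbrs:
            in_deg[dst] += 1
    return out_deg, dict(in_deg)

graph = build_graph(events)
out_deg, in_deg = degrees(graph)
print(out_deg, in_deg)
```

Edge weights (how often A contacts B) and the asymmetry between in-degree and out-degree are already enough to separate, say, a call centre from a private subscriber.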
You can even try to guess when someone gets off a plane. When a plane lands you'll suddenly see lots of incoming undelivered text messages as people turn their phones back on. If a node was last seen in a far away cell, but then reappears in this group, you can cross-correlate with arrival times and make a reasonable guess.
Interestingly, what you describe is probably not legal under EU privacy laws. People are horrified by the NSA just collecting this data. And yet you calmly describe this process.
Your opinions are not given in your post - you're not saying whether it's good or bad to do this - but it's clear that the company you worked for didn't see doing this as evil.
I find it fascinating that this kind of data mining has been going on for years and that opposition has been so quiet.
(Please, this post is not any judgement about you!)
All the telcos collect this data as far as I know. They're allowed to for the purposes of improving and maintaining their network. A few crunch it for marketing purposes but this has to be opt-in (not that customers would have any idea what that might entail, even if the privacy policy describes it broadly). I can't comment on the legality of the project I worked on, but I assume it was checked out by legal counsel.
I personally wouldn't want my data mined in this way. I don't retain any brand loyalty, let's put it that way.
It may well be legal, if part of the stated purpose of collecting the data, as agreed with the customer in the T's and C's that they thoroughly read through and agreed to, was to collect data for research, network development, service development and other woolly terms that cover this kind of R&D.
What many companies do is anonymise the data: remove the actual phone numbers and account details and replace them with dummy numbers. While not ideal (backwards matching is possible due to the clues the data "gives up"), it's probably safer than it sounds.
There's also the matter of who's doing it. It is, imo, one thing for the company whose services I use to collect data on my use of their equipment (for network performance data, troubleshooting and billing) that they use themselves and don't hand over to other parties.
It's another thing to have government agencies snooping in such data for entirely different purposes.
If anonymisation is reversible (it seems clear that it is reversible, if it was anonymised at all), the data likely again falls under the Data Protection Directive in the EU.
There, personal data legally requires explicit permission for each specific purpose it's used for and cannot be stored any longer than is necessary for that purpose.
Improving cell coverage / planning for growth is a specific purpose. You can then argue how long to keep the metadata for - any reasonable argument starts in years.
I think we need to accept that metadata and all digital comms are communications in public. And that we need social conventions backed by law to make certain things politely not read unless a warrant is served.
>Improving cell coverage / planning for growth is a specific purpose.
Agree, so you just need contractual permission (not hard to get). You can't decide later that you want to use it for some other apparently-innocuous reason.
>any reasonable argument starts in years.
Cell coverage data from years ago is relevant to today's growth? Although that might be enough to get you out of a legal hole, I find that highly dubious.
The law is there, but it's not understood, not clear enough, nor enforceable enough for commerce to fall in line.
What, your company doesn't make year-over-year comparisons? Many telcos, at least in the US, don't own every tower; some they rent. If they can see that a tower is being underutilized, they need to know whether that is a downward trend and whether to renew the contract for that cell tower location. Then there are the planning aspects of anticipating heavy use patterns for major events, concerts, festivals, etc. You can't compare how your system is handling the added demand as an event grows if you don't have the data.
This is exactly what the NSA have been doing, according to an earlier whistle blower, see eg my submission to HN here with his keynote from Hope 9 (2012):
I went to a presentation by DONG Energy [1] where they were discussing how, with their in-progress upgrade to remote-reporting, per-residence electric meters, they can soon infer all sorts of things about people's apartments and daily habits from the distinctive patterns of electric usage. Not just aggregate usage like inferring when someone's awake or asleep, but in much more detail based on the distinctive patterns different devices make.
They do seem sensitive to privacy fears (perhaps partly because the regulatory climate forces them to be), but the level of detail they were able to get out of the electric data in a prototype system was quite eye-opening. They had some ideas about using it for consumer self-education, e.g. feeding it back into a small display near the meter that would make energy-saving suggestions. But even that could get creepy, because it could make suggestions about specific devices you owned, when you never told the energy company that you owned them!
[1] An unfortunate acronym, from Danish Oil and Natural Gas
Could you just clarify something? You state this data is anonymous, but that you use phone numbers as nodes? Do you mean some sort of ID number representing phone numbers, or actual phone numbers? I ask because I wouldn't consider phone numbers anonymous.
The input space is too small for SHA1 to effectively anonymize. The NANP, for example, has fewer than 10^10 possible numbers; it would be a very simple task to create a rainbow table mapping every possible phone number to its corresponding SHA1 hash.
For the same reason, you can't just use a simple cryptographic hash to "anonymize" data such as birthdates, zip codes, SSNs, or PINs.
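A small sketch of why this fails. To keep it fast it enumerates only one hypothetical exchange (10,000 numbers) and uses a plain lookup table rather than a true rainbow table; the prefix and the sample number are made up. Covering the whole numbering plan is the same loop with a larger range.

```python
import hashlib

def sha1_hex(number: str) -> str:
    """Naive 'anonymization': unsalted SHA-1 of the phone number."""
    return hashlib.sha1(number.encode("ascii")).hexdigest()

def build_lookup(prefix: str, digits: int) -> dict:
    """Map digest -> number for every number starting with the given prefix."""
    table = {}
    for i in range(10 ** digits):
        number = f"{prefix}{i:0{digits}d}"
        table[sha1_hex(number)] = number
    return table

# Hypothetical exchange "212555" followed by 4 subscriber digits.
lookup = build_lookup("212555", 4)

pseudonym = sha1_hex("2125550142")  # what the "anonymized" data set would contain
recovered = lookup[pseudonym]       # one dictionary lookup reverses it
print(recovered)
```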
Using a key derivation function with a very high cost factor can mitigate this to some extent (e.g. making it take 5 seconds on an average CPU to generate the hash from a phone number), but it by no means makes for secure anonymization; eventually computing power will catch up.
Encrypting the number with a secret key (or using an HMAC), and destroying the key after the anonymization takes place might be a reasonably secure way of doing this, however.
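A minimal sketch of the keyed approach, assuming HMAC-SHA256 with a fresh random key per batch. The function name and record layout are hypothetical. The key is never returned or stored, so once the function exits nobody can rebuild the lookup table, though Python itself gives no guarantee that the key bytes are wiped from memory.

```python
import hashlib
import hmac
import secrets

def pseudonymize_batch(records):
    """Replace phone numbers in (caller, callee) pairs with keyed pseudonyms.

    A random key is generated per batch and discarded on return, so the
    mapping is consistent within the batch but not reproducible afterwards.
    """
    key = secrets.token_bytes(32)

    def tag(number: str) -> str:
        return hmac.new(key, number.encode("ascii"), hashlib.sha256).hexdigest()

    return [(tag(src), tag(dst)) for src, dst in records]

batch = pseudonymize_batch([("2125550142", "2125550199"),
                            ("2125550199", "2125550142")])
print(batch[0][0] == batch[1][1])  # same number, same pseudonym within a batch
```

Note that this only protects the identifiers. The graph structure survives intact, which is exactly what the re-identification concerns elsewhere in this thread are about.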
The argument isn't that meta-data can't be used to get a lot of information about someone. The argument is that in the U.S., meta-data isn't protected information. Call meta-data is not your information, but information the telephone company keeps about you. In the U.S., the 4th amendment does not protect those sorts of records: http://en.wikipedia.org/wiki/Smith_v._Maryland. Your cell phone, which you use voluntarily, gives the phone company tremendous information about you, and under U.S. law nothing keeps the government from getting that information from the phone company.
Does call meta-data give the government a lot of information? Yes. Does it give the government too much information? Quite possibly. But arguing shrilly about how collecting call meta-data is "illegal" is counter-productive. Maybe it should be illegal, but you can't start the process of making it so by proceeding from an incorrect premise. And you can't dismiss the goal of making it illegal, by arguing that the government is already ignoring the law, with reference to activity where the government is clearly attempting to stay within the law, even if it is pushing the boundaries as much as it can.
One is what the general public thinks about the importance of metadata; the OP shows that it is a bit more than some may think, so arguing about the legal aspects is a bit beside the point.
That said, if you want to argue the legality, it isn't that clear-cut, either. The problem is that while Smith v. Maryland may resolve the question of the constitutionality of the collection, it does not answer the question of its legality. In addition to a search or seizure being constitutional, there generally also needs to be a statute authorizing the search or seizure.
Unfortunately, whether there is such a statute is highly dubious. The leaked court order used an extremely suspect interpretation of section 215 of the Patriot Act to justify seizure of the phone records; which is what Senators Ron Wyden and Mark Udall kept pointing out. Seizure under the electronic surveillance provisions of 50 USC §1801 etc. does not solve the problem, either, because 50 USC §1801(n) defines "contents" for the purpose of electronic surveillance to include metadata. And seizure under the Pen Register Act would require the government to certify that the information obtained is likely to be relevant for an ongoing criminal investigation.
This means that while Congress could have authorized the NSA to collect the connection data in such a fashion without such a law being unconstitutional, it is at the very least questionable whether Congress actually did such a thing.
To put the link into perspective: That article stems from a debate around the German "Vorratsdatenspeicherung", an attempt to put law into place which forces telecommunication service providers to store metadata (of telephone calls and - technically curious - emails) for six months. Law enforcement would then be able to access metadata for such a timespan. Law enforcement could query for all available data before that law was discussed, but the data wasn't necessarily available.
FWIW, the law was put into place and revoked by the German constitutional court, the "Bundesverfassungsgericht".
Of course, German data privacy law works a bit differently, too. Metadata is covered, as long as it points (possibly indirectly) to natural persons. Once the data is no longer needed for any purpose covered by the business it stems from (e.g. once the dispute deadline for the telephone bill has passed), it has to be deleted.
The whole debate is about 3-5 years old here in Germany.
I don't think the argument is whether metadata is "protected."
I think that the argument is that the US is collecting whatever they want even on lawful citizens, not being forthcoming, and arguing it's all legal whether it is or isn't.
Also - your notion that something not explicitly "protected" is fair game scares me.
Totally agreed. The law should reflect our morals. But we judge the legality of an action by the law, not our morals.
Moreover, the problem for privacy advocates is that the moral debate is even less clear than the legal one. Being able to declare the NSA's actions straight-up unconstitutional under the 4th amendment would avoid the mess of resorting to the democratic process to determine what the people, as a whole, really thought about surveillance.
"Metadata doesn't matter" to me seems to be a really poor strawman. Maybe a small minority of people think that, but I'm pretty sure most people are smart enough to realize that if it "didn't matter" the NSA wouldn't be collecting it to begin with.
Also, I don't believe that it has been shown that location information has been collected. That claim is conjecture only. We've seen a lot of conjecture related to these leaks that has been taken for fact. Sometimes it is hard to tell them apart.
And that's just from the phone metadata. Imagine how much more they can do with all your online info from all the services you're using, all the blogs you're commenting on, and so on.
The same person being talked about above wrote this article in NYTimes yesterday:
Thanks, that is a great post and well worth reading! Especially the part that describes possible consequences of trading privacy for security (Nazis, Communists).
What a remarkable visualisation - this is a clear demonstration of just how intrusive these metadata records can be. If they're not controlled by law, they should be.
I would encourage anybody who hasn't watched this to do so. It's a very interesting video, especially for younger people who didn't grow up during that time period.
Let's not forget that combined metadata from millions of people allow much greater detail than this (who you meet, talk to regularly, share interests with, are likely to run into ...).
I'm afraid the actual definition of "meta-data" is up to interpretation in the context of IP communication.
What if the NSA considers not only IP source & destination as "metadata" but also anything down to the application layer that is not strictly content? Like the HTTP GET line or HTTP headers.
I think you can take that as a given. If you look at the GCHQ leak - they're basically just recording everything (including content) for 3 days, and keeping headers for 30 (shared with the NSA of course). That would give them most websites visited by an IP (which would take hardly any space to store, but are still really intrusive).
The only things preventing this from being a total capture of all information (to be sifted through later) are technical issues with storage, not moral or legal ones.
What do you think graph databases with trillions of connections are used for? The real fun will start after someone leaks a couple of terabytes of tracking data.
Well, if location data is considered part of this "metadata", then I don't see how anyone could argue against the dangers of this.
My physical location in the real world I consider way more private in matters of wide scale tracking than what I write or say.
For instance, I hardly ever let my browser determine my location and send it to some site, it's none of their business where I am, and if I want the local weather they can get the name of the city I'm at.
But I was hoping this article would be about another, way more dangerous, because way more information-rich type of "metadata": Social graphs and contact lists. The problem with this is, humans underestimate the depth of this kind of data because we're not really well-equipped to reason about them.
If you have a table that consists of (time, location) records, it's pretty easy to envision what sort of information could be extracted from this data. Add a few more fields, and it becomes harder, maybe you need some creativity and statistics, but it's all basic detective work.
A free form directed graph (such as a social graph or collection of contact lists) doesn't look like a table at all (well, you can represent it as a table, but that won't make you much wiser). It's in fact a very high-dimensional object.
The older generation out here may remember when they first encountered the WWW, when you could only navigate it by clicking links. I got this sense of vastness, perhaps even helplessness. They don't call it hypertext for nothing. The sense of vastness comes because clicking and navigating those links gives an idea of moving through a space. Except this space is in some sense "larger" than our usual 3D space. Every door (link) can open into every room, regardless of whether it would be possible in a physical space.
This is why those "graph of (part of) the Internet" pictures you sometimes see are generally always a tangled clutter of strings, usually vaguely ball-shaped. This is because there is no sensible representation of this type of inter-connected data. You can't make a hierarchy or a map, at least, not in the general case (and the thing you want to reason about is the general case, most of those graphs are exponential small-world graphs, highly inter-connected).
Same thing for social / contact list graphs. Except they usually don't have web-rings or directories (you can sometimes make them like FB does, but they aren't generally available, again the general case).
So okay, we're not really good at keeping large graph networks of "friends of friends of friends" and other relationships in our heads and reasoning about them. We're really not. What you think you can reason about those graphs is just scratching the surface.
Computers, however, and Big Data Machine Learning algorithms in particular, have no problems at all with this type of data. An algorithm never lived in a 3D space, it doesn't care if a dataset makes no sense as a physical configuration of nodes, in order to navigate it and extract information from it.
Another important distinction is, people tend to think of these social graphs as labeled nodes with edges between them. Which is correct, in a sense. But it gives the impression that the labels are more important than they actually are. This may sound weird, in the building/room analogy, if you have millions of rooms, and every room is directly connected to 50-200 other rooms, somehow the shape of the paths between the nodes and way they are connected becomes a vastly more information-rich data source than the actual values of the labels of the nodes themselves.
They don't need your name or your photo, the local shape of your social graph is a highly unique fingerprint of whoever you are.
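As a toy illustration of a label-free fingerprint, the sketch below describes each node only by its own degree and the sorted degrees of its neighbours, then looks for the same signature in a second network with different labels. The networks, names and the signature itself are assumptions for the example. Real matching would use far richer structural features over far larger graphs.

```python
def local_signature(graph, node):
    """Label-free signature: own degree plus sorted degrees of the neighbours."""
    nbrs = graph[node]
    return (len(nbrs), tuple(sorted(len(graph[n]) for n in nbrs)))

def match(graph_a, node, graph_b):
    """Nodes in graph_b whose local shape equals that of `node` in graph_a."""
    sig = local_signature(graph_a, node)
    return [n for n in graph_b if local_signature(graph_b, n) == sig]

# Hypothetical undirected friend graphs as adjacency sets.
net1 = {
    "alice": {"bob", "carol", "dave"},
    "bob": {"alice", "carol"},
    "carol": {"alice", "bob"},
    "dave": {"alice", "erin"},
    "erin": {"dave"},
}
# The same people on another network, under unrelated account names.
net2 = {
    "u1": {"u2", "u3", "u4"},
    "u2": {"u1", "u3"},
    "u3": {"u1", "u2"},
    "u4": {"u1", "u5"},
    "u5": {"u4"},
}

print(match(net1, "dave", net2))
```

Even this crude signature singles out some nodes uniquely; nodes that remain ambiguous (here "bob" and "carol") would need a wider neighbourhood to tell apart.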
And you can delete Facebook, but on the next social network you sign up for (or any of the other social graphs you're generating, email/IM contact lists, etc), this fingerprint will echo, and in many cases be similar enough to clearly indicate this is the exact same person. No names necessary. (this may be a bit harder if you have a strictly separate business persona and social persona, but there are still some unexpected artifacts to pick up for a ML algo even in these cases) If you're not on a network at all, your presence can be extrapolated from the "hole" in the graph you left (all your friends are there, with their particular local graph shapes, but one node is missing), that is even if you have nothing to hide, you will be leaking info about those who do.
Thanks for this. It is a highly informative comment, especially regarding big data algos.
Extending the 'hole' analogy, do you think the watchers / algorithms could complete reasonable extrapolation on you if your group of closest acquaintances all decided to disappear from the network?
Perhaps even this more extreme measure would be fruitless as each of your friends has a fingerprint that they 'remove' from their respective unique graphs. Your group's disappearance would be a larger void, but each member's tendrils would carve out unique telltale gaps.
> Well, if location data is considered part of this "metadata", then I don't see how anyone could argue against the dangers of this
I remember a "scandal" that occurred in my country's Parliament in the early 2000s (2002 or 2003), when one of the local mobile carriers decided to display the GSM cell towers' names on the mobile phones' small screens (close to the "battery still left" icon). Some of the MPs thought that as being way too obtrusive, but nobody cared because they're seen as being corrupt by definition, the mobile company ended up by not displaying the info anymore (but still collecting it, of course) and everything was fine.
There was of course that other thing that happened to the same company (one of the 3 largest global brands in the industry) a couple of years later. One of the mobile company's office people (a lady) was jealous of her boyfriend and asked some guys "in the IT department" whether there was a way for them to check said boyfriend's messages and calls, all this "as a small favor from colleague to colleague". Of course there was a way to do that. I can't remember if the boyfriend was cheating or not.
If you heard about it, you can bet he was. Nevertheless, power and abuses go hand-in-hand. I don't know what it is about human nature that causes us to give those in power the benefit of the doubt. Hell, in America at least, people knew 250 years ago that power begot abuse, and wrote "release valves" into the constitution to prevent that abuse from becoming overwhelming. I wonder why they didn't think people would become overly apathetic in the meantime.
Did they ever write any essays or letters on why they didn't make voting compulsory? Was it a feeling that such compulsion impinged on freedoms, or that it wouldn't help fix the problem of apathy? Or did they just think it would be absurd if people voluntarily turned down their chance to pick their representation in government?
Given the remarkable intel that can be gathered, I'm surprised the NSA/CIA/FBI aren't giving away smartphones to targets as anonymous presents or under the pretense of winning a contest.
Eventually, all the social and location graphs will be mapped for all of humankind - and we shall find out that everyone, on the whole planet, is exactly 42 feet from Kevin Bacon.
If some agency like NSA etc wants to know about you in great detail, clearly they have the data, and will be able to very quickly put it all together.
The other side of this coin is that commercial parties like Facebook etc have the same potential detail and insight about anyone.
There is also very high probability that similar data is being put together by entities somewhere between the NSA and Facebook, for purposes that are much more starkly not in your best interests eg fraud.
Bottom line: anyone is an open book on the internet.
Is this common? This is the first I've ever heard of such a feature, ever. Considering that we're talking about the OS being completely shut down, I'm skeptical of this existing in smartphones.
I believe it is normally a separate microcontroller. You generally need something to power up the main phone and to deal with battery charging (that's not usually the main CPU, although my Android phone does display an animated icon on screen when powered off and charging, so it's unclear what's driving this).
Most computers have a number of extra microcontrollers. You would have to do a teardown to see how they might be wired up.
It is not ultra-paranoid. The government's own rules regarding cell-phones in certain highly classified areas (SCIFs) require that they be left outside precisely because you can't guarantee that off means off.
People might think that (apart from GPS) signals to one tower only are unlocalizable. Add the variable of signal strength (with fairly uniform xmit pwr) to that single vector and it gets more interesting.
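One way to see the point about signal strength: under a textbook log-distance path-loss model, a received power reading maps to a rough range from the tower. The reference power, reference distance and path-loss exponent below are illustrative assumptions, not values from any real network, and real radio environments are much noisier than this.

```python
def estimate_distance_m(rssi_dbm, ref_dbm=-40.0, ref_distance_m=1.0, exponent=3.0):
    """Invert the log-distance path-loss model to get an approximate range.

    Model: rssi = ref_dbm - 10 * exponent * log10(d / ref_distance_m)
    All default parameters are assumed values for illustration only.
    """
    return ref_distance_m * 10 ** ((ref_dbm - rssi_dbm) / (10 * exponent))

# A weaker signal implies a larger ring around the single tower.
for rssi in (-70.0, -100.0):
    print(rssi, round(estimate_distance_m(rssi), 1))
```

A single tower then gives a ring rather than a point, and a sectorised antenna narrows that ring to an arc.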