You can infer some amazing things from simple metadata. I spent six months in an R&D team at a large mobile telco, with the task of trying to infer as much as possible from anonymous customer data just like this.
Figuring out where you live and work, to a reasonable accuracy, is quite easy; you simply look at where the most outgoing calls/SMS originate from at certain hours of the day over an extended period.
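To make that concrete, here is a minimal sketch of the "most common cell at certain hours" idea. It is not the code from that project; the record layout (timestamp, cell ID), the hour ranges, and the function names are all assumptions made up for illustration.

```python
from collections import Counter
from datetime import datetime

# Assumed hour ranges; a real analysis would tune these per population.
NIGHT_HOURS = set(range(0, 6)) | {22, 23}
OFFICE_HOURS = set(range(9, 17))

def most_common_cell(events, hours):
    """Return the cell seen most often among events whose hour is in `hours`."""
    counts = Counter(cell for ts, cell in events if ts.hour in hours)
    return counts.most_common(1)[0][0] if counts else None

def guess_home_and_work(events):
    """events: list of (datetime, cell_id) for one subscriber's outgoing calls/SMS."""
    weekday_events = [(ts, cell) for ts, cell in events if ts.weekday() < 5]
    home = most_common_cell(events, NIGHT_HOURS)
    work = most_common_cell(weekday_events, OFFICE_HOURS)
    return home, work

# Hypothetical sample: a few outgoing events for one pseudonymous subscriber.
events = [
    (datetime(2013, 6, 3, 23, 15), "cell_A"),
    (datetime(2013, 6, 4, 7, 40), "cell_A"),
    (datetime(2013, 6, 4, 10, 5), "cell_B"),
    (datetime(2013, 6, 4, 14, 30), "cell_B"),
    (datetime(2013, 6, 4, 22, 50), "cell_A"),
    (datetime(2013, 6, 5, 11, 20), "cell_B"),
    (datetime(2013, 6, 5, 12, 45), "cell_C"),
]
home, work = guess_home_and_work(events)
print(home, work)
```

With months of data instead of a handful of rows, the counts become stable enough that the top cell in each time band is a decent guess.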
We built up our own social graph. You treat calls and text messages as directed edges and phone numbers as nodes. These were fascinating to look at.
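A rough sketch of that structure, assuming call records arrive as (caller, callee) pairs of pseudonymous IDs. The helper names are hypothetical, and a real graph at telco scale would need something far more compact than nested dicts.

```python
from collections import defaultdict

def build_graph(records):
    """Build a weighted directed graph: graph[src][dst] = number of calls/SMS."""
    graph = defaultdict(lambda: defaultdict(int))
    for src, dst in records:
        graph[src][dst] += 1
    return graph

def mutual_contacts(graph, node):
    """Contacts that `node` has reached and that have also reached `node` back."""
    return {dst for dst in graph.get(node, {}) if node in graph.get(dst, {})}

# Hypothetical records: each tuple is one call or text, caller first.
records = [
    ("id_1", "id_2"),
    ("id_1", "id_2"),
    ("id_2", "id_1"),
    ("id_1", "id_3"),
    ("id_4", "id_1"),
]
graph = build_graph(records)
print(dict(graph["id_1"]), mutual_contacts(graph, "id_1"))
```

Reciprocated edges are a simple first filter for real relationships, since one-way traffic is often call centres or automated senders.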
You can even try to guess when someone gets off a plane. When a plane lands you'll suddenly see lots of incoming undelivered text messages as people turn their phones back on. If a node was last seen in a far away cell, but then reappears in this group, you can cross-correlate with arrival times and make a reasonable guess.
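Something like the following captures the gist, as a sketch only. The record layout, the ten-minute window, the minimum burst size, and the `distant_pairs` lookup are all invented for the example, and the final step of matching bursts against published arrival times is left out.

```python
from datetime import datetime, timedelta

def find_bursts(deliveries, window=timedelta(minutes=10), min_size=3):
    """Group delayed-SMS deliveries per cell into bursts: runs in which
    consecutive events are at most `window` apart, keeping runs of at
    least `min_size` events."""
    by_cell = {}
    for ts, subscriber, cell in sorted(deliveries):
        by_cell.setdefault(cell, []).append((ts, subscriber))
    bursts = []
    for cell, events in by_cell.items():
        run = [events[0]]
        for prev, cur in zip(events, events[1:]):
            if cur[0] - prev[0] <= window:
                run.append(cur)
            else:
                if len(run) >= min_size:
                    bursts.append((cell, run))
                run = [cur]
        if len(run) >= min_size:
            bursts.append((cell, run))
    return bursts

def likely_arrivals(bursts, last_seen_cell, distant_pairs):
    """Subscribers in a burst whose previous cell is listed as far from the burst cell."""
    return {
        sub
        for cell, run in bursts
        for _, sub in run
        if (last_seen_cell.get(sub), cell) in distant_pairs
    }

# Hypothetical data: three phones come back on within minutes, one much later.
t0 = datetime(2013, 6, 4, 18, 0)
deliveries = [
    (t0, "id_1", "airport"),
    (t0 + timedelta(minutes=2), "id_2", "airport"),
    (t0 + timedelta(minutes=3), "id_3", "airport"),
    (t0 + timedelta(minutes=90), "id_4", "airport"),
]
last_seen_cell = {"id_1": "far_city", "id_2": "airport", "id_3": "far_city", "id_4": "far_city"}
distant_pairs = {("far_city", "airport")}

bursts = find_bursts(deliveries)
arrivals = likely_arrivals(bursts, last_seen_cell, distant_pairs)
print(arrivals)
```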
Interestingly, what you describe is probably not legal under EU privacy laws. People are horrified by the NSA just collecting this data, and yet you calmly describe this process.
Your opinions are not given in your post - you're not saying whether it's good or bad to do this - but it's clear that the company you worked for didn't see doing this as evil.
I find it fascinating that this kind of data mining has been going on for years and that opposition has been so quiet.
(Please, this post is not any judgement about you!)
All the telcos collect this data as far as I know. They're allowed to for the purposes of improving and maintaining their network. A few crunch it for marketing purposes but this has to be opt-in (not that customers would have any idea what that might entail, even if the privacy policy describes it broadly). I can't comment on the legality of the project I worked on, but I assume it was checked out by legal counsel.
I personally wouldn't want my data mined in this way. I don't retain any brand loyalty, let's put it that way.
It may well be legal, if part of the stated purpose of collecting the data, as agreed with the customer in the T's and C's that they thoroughly read through and agreed to, was to collect data for research, network development, service development and other woolly terms that cover this kind of R&D.
What many companies do is anonymise the data: remove the actual phone numbers and account details and replace them with dummy numbers. While not ideal (backwards matching is possible due to the clues the data "gives up"), it's probably safer than it sounds.
There's also the matter of who's doing it. It is, imo, one thing for the company whose services I use to collect data on my use of their equipment (for network performance data, troubleshooting, and billing) that they use themselves and don't hand over to other parties.
It's another thing to have government agencies snooping in such data for entirely different purposes.
If anonymisation is reversible (it seems clear that it is reversible, if it was anonymised at all), the data likely again falls under the Data Protection Directive in the EU.
There, personal data legally requires explicit permission for each specific purpose it's used for and cannot be stored any longer than is necessary for that purpose.
Improving cell coverage / planning for growth is a specific purpose. You can then argue how long to keep the metadata for - any reasonable argument starts in years.
I think we need to accept that metadata and all digital comms are communications in public, and that we need social conventions backed by law to make certain things politely not read unless a warrant is served.
>Improving cell coverage / planning for growth is a specific purpose.
Agree, so you just need contractual permission (not hard to get). You can't decide later that you want to use it for some other apparently-innocuous reason.
>any reasonable argument starts in years.
Cell coverage data from years ago is relevant to today's growth? Although that might be enough to get you out of a legal hole, I find that highly dubious.
The law is there, but it's not understood, not clear enough, nor enforceable enough for commerce to fall in line.
What, your company doesn't make year-over-year comparisons? Many telcos, at least in the US, don't own every tower they use; they rent some of them. If they can see that a tower is being underutilized, they can ask: is this a downward trend? Should we not renew our contract for this cell tower location? Then there are the planning aspects of anticipating heavy use patterns for major events, concerts, festivals, etc. You can't compare how your system is handling the added demand as an event grows if you don't have the data.
This is exactly what the NSA have been doing, according to an earlier whistle blower, see eg my submission to HN here with his keynote from Hope 9 (2012):
I went to a presentation by DONG Energy [1] where they were discussing how, with their in-progress upgrade to remote-reporting, per-residence electric meters, they can soon infer all sorts of things about people's apartments and daily habits from the distinctive patterns of electric usage. Not just aggregate usage like inferring when someone's awake or asleep, but in much more detail based on the distinctive patterns different devices make.
They do seem sensitive to privacy fears (perhaps partly because the regulatory climate forces them to be), but the level of detail they were able to get out of the electric data in a prototype system was quite eye-opening. They had some ideas about using it for consumer self-education, e.g. feeding it back into a small display near the meter that would make energy-saving suggestions. But even that could get creepy, because it could make suggestions about specific devices you owned, when you never told the energy company that you owned them!
[1] An unfortunate acronym, from Danish Oil and Natural Gas
Could you just clarify something? You state this data is anonymous, but that you use phone numbers as nodes? Do you mean some sort of ID number representing phone numbers, or actual phone numbers? I ask because I wouldn't consider phone numbers anonymous.
The input space is too small for SHA1 to effectively anonymize. The NANP, for example, uses ten-digit numbers, so it has fewer than 10^10 possible numbers; it would be a very simple task to create a rainbow table mapping every possible phone number to its corresponding SHA1 hash.
For the same reason, you can't just use a simple cryptographic hash to "anonymize" data such as birthdates, zip codes, SSNs, or PINs.
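As a small illustration of why an unkeyed hash fails here, the sketch below enumerates every number under one made-up six-digit prefix and inverts a "leaked" SHA-1 digest by dictionary lookup. It uses a plain lookup table rather than a true rainbow table (which trades lookup time for storage), and the prefix and number are invented for the example.

```python
import hashlib

def sha1_hex(number: str) -> str:
    """Unsalted SHA-1 of a phone number written as ASCII digits."""
    return hashlib.sha1(number.encode("ascii")).hexdigest()

def build_lookup(prefix: str, line_digits: int = 4) -> dict:
    """Map SHA-1 digest -> number for every number under one prefix."""
    table = {}
    for n in range(10 ** line_digits):
        number = f"{prefix}{n:0{line_digits}d}"
        table[sha1_hex(number)] = number
    return table

# Hypothetical prefix: 10,000 candidates, built in a fraction of a second.
table = build_lookup("212555")
leaked = sha1_hex("2125550123")   # what an "anonymized" record would contain
recovered = table[leaked]
print(recovered)
```

Scaling the same loop to every ten-digit number is only a matter of storage and a few hours of hashing, which is the whole problem.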
Using a key derivation function with a very high cost factor can mitigate this to some extent (e.g. making it take 5 seconds on an average CPU to generate the hash from a phone number), but it by no means makes for secure anonymization; eventually computing power will catch up.
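A back-of-the-envelope calculation shows how far a high cost factor actually gets you. This assumes 10^10 candidates as a generous upper bound for ten-digit numbers, the 5 seconds per hash from the example above, and perfectly parallel hardware; all three figures are assumptions for the sake of the arithmetic.

```python
SECONDS_PER_DAY = 24 * 3600
SECONDS_PER_YEAR = 365.25 * SECONDS_PER_DAY

def brute_force_cost(candidates, seconds_per_hash, cores=1):
    """Return (years, days) of wall-clock time to hash every candidate once."""
    total_seconds = candidates * seconds_per_hash / cores
    return total_seconds / SECONDS_PER_YEAR, total_seconds / SECONDS_PER_DAY

# Assumed figures: 10^10 candidate numbers, 5 s per hash.
years_one_core, _ = brute_force_cost(10**10, 5)
_, days_10k_cores = brute_force_cost(10**10, 5, cores=10_000)
print(f"{years_one_core:.0f} years on one core, {days_10k_cores:.0f} days on 10,000 cores")
```

That works out to roughly 1,600 CPU-years, which sounds like a lot until you spread it over 10,000 cores and it drops to about two months.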
Encrypting the number with a secret key (or using an HMAC), and destroying the key after the anonymization takes place might be a reasonably secure way of doing this, however.
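A minimal sketch of the keyed approach, using Python's standard `hmac` and `secrets` modules. The `make_pseudonymizer` helper is hypothetical, and "destroying the key" here just means the key never leaves the closure; Python gives no guarantee that the bytes are wiped from memory.

```python
import hashlib
import hmac
import secrets

def make_pseudonymizer():
    """Return a function mapping phone numbers to stable pseudonyms.

    The random key exists only inside this closure. Once the returned
    function is discarded, nothing remains to recompute the mapping
    (assumption: the key is never logged or written to disk)."""
    key = secrets.token_bytes(32)

    def pseudonymize(number: str) -> str:
        return hmac.new(key, number.encode("ascii"), hashlib.sha256).hexdigest()

    return pseudonymize

pseudonymize = make_pseudonymizer()
a1 = pseudonymize("2125550123")
a2 = pseudonymize("2125550123")
b = pseudonymize("2125550124")
other_run = make_pseudonymizer()("2125550123")
print(a1 == a2, a1 != b, a1 != other_run)
```

Note that the mapping is still consistent within one run, so the call graph and location patterns survive intact; this only stops someone from recomputing hashes of known numbers, not from matching people by their behaviour.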