The Difficulty of Private Contact Discovery (whispersystems.org)
134 points by matt_xyz on March 15, 2016 | 78 comments



The problem isn't that social networks can't implement this, it is that they don't want to.

Facebook and LinkedIn, for example, want to know who all our contacts are, because that data is so valuable to them.

This needs to be built into the mobile OS. The OS should only allow third parties to access some kind of protected contacts, nothing more.

Furthermore, we need fine-grained control in the mobile OS, to optionally block apps from accessing private information.

Apps should still work if certain privacy features are enabled by choice, rather than the default today, which appears to be "install our app and accept our defaults for 'all the things' or don't use our app".

Bravo to Whisper Systems for calling this out, and for trying to work towards a technical solution.


You might want to reread the post: "As far as we can determine, practical privacy preserving contact discovery remains an unsolved problem."


But there is some degree of success mentioned in the article, even if it doesn't work for huge networks.

That being said, some degree of privacy is better than none at all. Currently Facebook et al upload every single piece of contact information we have on our devices. That's insane from a privacy perspective.


It's really not, though; see the comments below. There are current approaches which send on the order of as much data as is in one party's address book.


> This needs to be built into the mobile OS. The OS should only be allow third parties to access some kind of protected contacts, nothing more.

In the meantime, some Android versions support providing a blank contact list when untrusted applications request access. The choice between "all contacts" and "no contacts" works at least 90% of the time.


As far as I know, "some" is just 6.0, which is currently at 2.3% adoption (I have it on my Moto X and like it. About a third of the apps I had before the upgrade fail to work if not given contacts or location, so I just uninstall them.) iOS has the same feature but much greater adoption.

What I would really like is a "some contacts" option. I still have people I haven't called in years in my phone, but if I could give an app a list of 20 family, friends, and coworkers that would make things simpler.


I've never had a compatibility problem with CyanogenMod's "privacy guard"; I'm not sure how the implementations differ.


Where the article enumerates the reasons one might not want to share one's entire address book, it leaves out the one most important to me. If I allow an app access to my entire address book, it can see attributes on an entry that I feel should not leak, such as birthdate and home address. Address books need a finer-grained way to express such access control. Even when sharing a contact with a third person and not a service, I might not want to share the entire address book entry for a person.


In iOS, you can present a contact picker and the app will only get access to the user's final selection (a contact or a contact property).

https://developer.apple.com/library/prerelease/ios/documenta...


Interesting. I was aware that this new API exists. I've not seen an app use it yet. I was not aware it could narrow the selection down to just a single contact property. The documentation does not make that clear.

Nevertheless, apps like Facebook aren't going to use this as they want the whole address book and not a single person.


IIRC Android has this too; it's just seldom used on both platforms because it is inconvenient (imagine having to select and search contacts individually in e.g. WhatsApp) and means the app cannot mine your data.


I often see people, including myself, using the "note" field in their address book for additional and often private data. Am I correct to assume this data is also available to apps accessing the address book?


I have moved to a git-tracked contacts folder with a text (.org) file for each contact. Works for bookmarks and todo lists too. Easy to sync, private by default.


Can you use that on your mobile device?

I assume Git is on your private server?


Yes, private server. Mobile apps / a website for myself are on the todo list. Someday...


My understanding of the API in iOS is that once the user has granted access to the "Address Book", the app can read every entry in there.


The use cases are interesting to me. I once crossed a border where my phone was taken from me, into another room. It doesn't take a rocket scientist to work out that an image was lifted from it before it was returned to me. Sending text messages from that same country had a typical latency of between one and four hours before arriving back home. There are not many people I want to communicate with privately, so I wrote my own secure app to send encrypted text to a web site located in a country I trust, from where it's forwarded as required. I know my contacts, and don't want those contacts known to others. There aren't many I feel the need to exchange messages with securely, and therefore don't want my contacts "discovered".

So it would be nice to find something some day that lets me confidently opt out of ANY kind of contact discovery. As it is I don't install any apps on my phone that require permission to access my address book. I realize that this probably puts me into a tiny subset of user profiles, so not holding my breath. My own solution is clunky, but it appears to remain my best option for the foreseeable future.


What do you think of Wire.com? Doesn't need access to address book (like WhatsApp didn't until a few years ago). See discussion elsewhere on this page for a few links.


Phil Zimmermann did come up with a protocol that avoids sending everything, which I implemented here:

https://github.com/SilentCircle/contact-discovery

Moxie said it wasn't "meaningfully private", if I recall correctly, without elaborating further.


As far as I can see this is quite similar to the 'sharded bloom filter' discussed in the submitted link.

This approach is dismissed in the document as

> In the end, this strategy seems to provide very little at the cost of fairly substantial complexity.

Since it seems you use it for silent circle, how does it work there? How many 'longer hashes' do clients get on average per submitted 'short hash'? How large is the 'false positive' rate you get from the long hashes? Since the argument against the 'sharded bloom filter' in the submitted link boils down to 'the tradeoffs don't work', I would be very interested to know how these tradeoffs work out for silent circle.

As far as I understand, you would still need some rate limiting in order to stop someone from collecting all registered phone numbers though, wouldn't you?


> As far as I can see this is quite similar to the 'sharded bloom filter' discussed in the submitted link.

It's pretty similar, but not exactly like that. The main difference is that the server doesn't need to transmit a bloom filter, it transmits partial hashes, but yes, there's a tradeoff.

> How many 'longer hashes' do clients get on average per submitted 'short hash'? How large is the 'false positive' rate you get from the long hashes?

I'm afraid I don't know the answer to these questions, as we don't keep logs... The false positive rate of the long hashes should be astronomically low (since they are full 256-bit hashes), but it doesn't much matter because the client only needs to be reasonably sure that the user exists so it can perform an actual query with the plaintext identifier.

The problem isn't really "we shouldn't know any of your contacts", it's "we shouldn't know any of your contacts that aren't using our service", since clients have to divulge some of their contacts in order to set up calls with them or send them messages. That means the client only has to know which of their contacts are probably on the server so they can request more information, which will, at that point, tell them with certainty whether a contact is using the service.
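
For anyone wanting to play with the idea, here's a rough Python sketch of the short-hash/long-hash exchange (the prefix length and all values here are made up for illustration, not SC's actual parameters):

    import hashlib

    PREFIX_BYTES = 2  # illustrative "short hash" length, not SC's real parameter

    def full_hash(identifier):
        # Full 256-bit hash of a contact identifier (e.g. a phone number).
        return hashlib.sha256(identifier.encode()).digest()

    # Server side: index full hashes of registered users by short prefix.
    registered = ["+15551230001", "+15551230002", "+15559990007"]
    index = {}
    for number in registered:
        h = full_hash(number)
        index.setdefault(h[:PREFIX_BYTES], []).append(h)

    def server_lookup(prefix):
        # The server sees only the prefix, which many numbers share.
        return index.get(prefix, [])

    # Client side: submit short hashes, compare the returned long hashes locally.
    for contact in ["+15551230001", "+15550000000"]:
        h = full_hash(contact)
        probably_registered = h in server_lookup(h[:PREFIX_BYTES])
        # On a hit, the client confirms with a plaintext query, as described above.
        print(contact, probably_registered)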

P.S. My opinions are my own, I'm not speaking for SC, etc.


I have to say that this sounds very reasonable (and certainly a lot better than sending your plain text contacts).

Does anyone know what kind of solution Signal is using nowadays? The linked article is not very recent and they might have changed their method in the meantime. I had a quick look at their source but was not successful in finding the relevant part for uploading the contacts.


I think the concern may be that the total space of hashes for valid email addresses, especially ranked by domain usage, is small enough that SHA hashing doesn't meaningfully obscure content.

From the article:

Hash It!

The first instinct is often just to hash the contact information before sending it to the server. If the server has the SHA256 hash of every registered user, it can just check to see if those match any of the SHA256 hashes of contacts transmitted by a client.

Unfortunately, this doesn’t work because the “preimage space” (the set of all possible hash inputs) is small enough to easily calculate a map of all possible hash inputs to hash outputs. There are only roughly 10^10 phone numbers, and while the set of all possible email addresses is less finite, it’s still not terribly great. Inverting these hashes is basically a straightforward dictionary attack. It’s not possible to “salt” the hashes, either (they always have to match), which makes building rainbow tables possible.
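
To make that concrete, a toy Python sketch of the inversion (scanning a deliberately narrow number range so it runs instantly):

    import hashlib

    def h(number):
        return hashlib.sha256(number.encode()).hexdigest()

    # Hash observed leaving a client, for the (made-up) number 555-123-4567:
    observed = h("5551234567")

    # The attacker hashes the whole preimage space. 10^10 numbers is trivial
    # for real hardware; we scan 10^4 here just to demonstrate the inversion.
    for n in range(5551230000, 5551240000):
        if h(str(n)) == observed:
            print("inverted:", n)  # -> 5551234567
            break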


The linked protocol takes care of that by sending so few bits that you can't tell what was hashed, though.


From 2014. Previous discussion: https://news.ycombinator.com/item?id=7007554


I was recently reading about some of the interesting work with invertible bloom filters/invertible bloom lookup tables, eg [0] [1].

There might be either a clever way to exploit IBLTs to make a low-information query of the data (like "Oblivious Selection from a Table" in [1]), or otherwise there may be a way to use them to more efficiently implement the bloom filter/sharded bloom filter options.

My information theory/coding theory is not strong enough to really know, but it might be worth looking into!

[0] https://www.ics.uci.edu/~eppstein/pubs/EppGooUye-SIGCOMM-11....

[1] http://arxiv.org/pdf/1101.2245.pdf


The average Android user has approximately 5000 contacts synced to their device.

This seems staggeringly large.


Yes, that number startled me as well - it is so far beyond Dunbar's number that it must represent the result of some automatic collection process. Perhaps Gmail harvests all the email addresses passing through your inbox and stuffs them into the contact list? I imagine that most people using Android phones are less persnickety than I am about allowing Google to manage their lives, so I can imagine that Google might be routinely shuffling around large numbers of "contacts" which the user of the device neither knows nor cares about.


I think it's the gmail "you've emailed them so now they are a contact" and other social services syncing their address books into your address book.


Well, here's another UX nit to pick with Signal. On Signal, I have multiple contacts that have been uploaded/updated somehow. They are grayed out in the address book, so I'm kinda sure they are not on Signal, just someone else that knows them.

What's not clear to me is why/how this is being determined.

On a side note, this is one reason why I would contend that the phone number (or a SHA256 equivalent) is not a good selector to use for Signal. It could easily be a privacy-preserving custom field (that is then hashed) without accidentally sharing PII. This would also have the benefit of being able to reset the field without too much pain. Having to reset and get a new number is much more painful.

One last thing, has anyone else noticed that Signal seems to use an inordinate amount of data when not using it? Once again, not clear to me why this much data transfer is needed. I just checked my phone and 359MB of data has been sent in the last 30 days. I use Signal sparingly (mostly because most of my contacts are not on it) and never a voice call over LTE. So I have no idea what is causing that much data to be transferred.


Signal for iOS desperately needs a new maintainer to replace Fred.

Here is the tracking issue for data usage.

https://github.com/WhisperSystems/Signal-iOS/issues/866


They don't pay that well relative to the local market, insist you be in SF, and only want full time. They could solve the problem by allowing remote work, having a few part-timers, or paying the SF-standard salary. If you try to do a simple GitHub pull request on the repo, it takes a very long time to get a code & design review, which discourages potential maintainers from contributing.

It's the old choice triangle: fast, cheap, good: pick two.


Totally agree with you.


Ugh. I don't want to be a downer but it sure seems that the pace of Signal updates is quite slow.

Maybe Apple or another tech company will step up and donate some resources to this effort.

Heck, I'd be for Apple to help with the UX of Signal leaving the OpenWhisper team to focus on the security tech...


> Once again, not clear to me why this much data transfer is needed. I just checked my phone and 359MB of data has been sent in the last 30 days.

Odd. On my system -in the last 30 days- Signal has transferred 1.15MB of data. I'm using it on Android. Are you on iOS or something?


Yes, I noticed that too. It was using such a significant portion of my monthly mobile data limit (roughly 60MB of 500) that I actually removed the "mobile data" permission. Which is highly inconvenient.


I use Signal daily with a couple of contacts and have used 3.77MB of data in the past 24 days of my billing cycle.

edit: I don't use voice


As Moxie notes, this problem is known as private set intersection: how can two parties P1 & P2 discover which elements they have in common, without revealing all their elements to each other or to a third party? His error is in describing scenarios where P1 is Open Whisper Systems and P2 is one of its users: yes, it would be insane, for bandwidth reasons, to use a protocol where OWS must transmit its entire user database (hashed, encrypted or permuted). But that's crazy anyway: OWS can see who is talking to whom on its servers already. Let's not do that!

The Right Thing is for individual users of Signal to privately compute set intersections between their own contacts. If Alice knows Bob, Charlie & Dan, and Bob knows Alice, Charlie & Eve, then when Alice & Bob add each other, they could each execute a protocol which would reveal that they both know Charlie.

In the last discussion of this, pbsd posted a link [1] to a great Microsoft paper covering private set intersection [2]. It works its way through a number of different protocols, but what it ends up with is this:

- Alice generates three sets of dummy data D1, D2 & D3, where the lengths of all three sets are equal to a security parameter t and the items in the sets are invalid phone numbers (perhaps elements of the form 'invalid:$RANDOM_DATA'), and sends them to Bob securely. These dummy sets will be used to detect whether the server is cheating.

- Bob checks that they are well-formed (i.e., of the right length and invalid phone numbers).

- Alice & Bob then generate a shared key K1 using a simulated coin-tossing protocol.

- Bob then generates a shared key K2 with the server (in this case, the OWS server) using the simulated coin-tossing protocol.

- Alice then sends to the server SHUFFLE{HMAC(K1, Charlie:1), HMAC(K1, Charlie:2), … HMAC(K1, Charlie:n), HMAC(K1, Dan:1), HMAC(K1, Dan:2), … HMAC(K1, Dan:n), HMAC(K1, D1_1), HMAC(K1, D1_2), … HMAC(K1, D1_t), HMAC(K1, D2_1), HMAC(K1, D2_2), … HMAC(K1, D2_t)}, where n is another security parameter, Charlie is Charlie's phone number and Dan is Dan's phone number. Since the server doesn't know K1 and it doesn't know how the set was shuffled, it has no idea which of the hashes is which phone number, nor which is a dummy.

- The server then generates a random SEED and sends to Alice KEYED_SHUFFLE(SEED, {HMAC(K2, HMAC(K1, Charlie:1)), HMAC(K2, HMAC(K1, Charlie:2)), … HMAC(K2, HMAC(K1, D2_t))}). Since Alice doesn't know K2, and since she doesn't know how the set was shuffled, she learns nothing yet.

- Bob sends SHUFFLE{HMAC(K2, HMAC(K1, Charlie:1)), HMAC(K2, HMAC(K1, Charlie:2)), … HMAC(K2, HMAC(K1, Charlie:n)), HMAC(K2, HMAC(K1, Eve:1)), HMAC(K2, HMAC(K1, Eve:2)), … HMAC(K2, HMAC(K1, Eve:n)), HMAC(K2, HMAC(K1, D2_1)), HMAC(K2, HMAC(K1, D2_2)), … HMAC(K2, HMAC(K1, D2_t)), HMAC(K2, HMAC(K1, D3_1)), HMAC(K2, HMAC(K1, D3_2)), … HMAC(K2, HMAC(K1, D3_t))} to Alice.

- Alice is now able to compute the intersection of the message she received from the server and the message she received from Bob. She does so, and sends the intersection {HMAC(K2, HMAC(K1, Charlie:1)), HMAC(K2, HMAC(K1, Charlie:2)), … HMAC(K2, HMAC(K1, Charlie:n)), HMAC(K2, HMAC(K1, D2_1)), … HMAC(K2, HMAC(K1, D2_t))} to Bob.

- Bob knows K1 & K2, and can calculate HMAC(K2, HMAC(K1, X)) for any X he knows (including the members of the three dummy sets D1, D2 & D3). He can now check that neither Alice nor the server has cheated. First, he finds the original X for each HMAC(K2, HMAC(K1, X)) in the intersection. Then, if any member of D2 is missing from the result, if any member of D1 or D3 is in the result, or if he doesn't see all of Charlie:1, Charlie:2 … Charlie:n, then someone has cheated. If all checks pass, Bob knows that neither Alice nor the server cheated, and now he knows that they share Charlie in common.

- Bob tells the server 'continue' and the server sends Alice the random seed used in the KEYED_SHUFFLE; Alice uses this to determine which item she sent the server corresponds to which item the server sent to her, and from that which item in the intersection she sent to Bob corresponds to Charlie:1, Charlie:2, … Charlie:n, Dan:1, Dan:2, … Dan:n and so forth. Alice, too, can see if either Bob or the server cheated.

At the end of this protocol, Alice & Bob both know that they share a contact in common with the same phone number. In place of phone numbers, of course, one might use pseudonyms, public keys &c. They've sent relatively small messages (just a few times the size of their contact lists), and the server doesn't know how many they have in common, or whether they have anyone in common.

So, problem solved.

Edit: I should note that there's no way for Alice to know that Dan also has an account if they don't know any other user in common, and of course Bob & Alice need to exchange identities offline in person first. But, once that exchange has happened, everything else Just Works™ and, due to the small-world phenomenon, as more users use it, more users discover one another using it. Users who know one another can even re-run the protocol every once in a while, when discovering new contacts.
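
A minimal Python sketch of the core double-HMAC flow may help (honest case only; the dummy sets D1-D3, the n-fold replication, and all the cheat detection are omitted for brevity, so this is a sketch of the idea rather than the paper's full protocol):

    import hashlib, hmac, os, random

    def mac(key, msg):
        return hmac.new(key, msg, hashlib.sha256).digest()

    # Keys from the two coin-tossing protocols described above.
    k1 = os.urandom(32)  # shared by Alice & Bob; unknown to the server
    k2 = os.urandom(32)  # shared by Bob & the server; unknown to Alice

    alice_contacts = [b"charlie", b"dan"]
    bob_contacts = [b"alice", b"charlie", b"eve"]

    # Alice -> server: shuffled single-HMACs of her contacts.
    alice_blind = [mac(k1, c) for c in alice_contacts]
    random.shuffle(alice_blind)

    # Server -> Alice: the same items re-HMACed under K2. random.shuffle here
    # stands in for KEYED_SHUFFLE(SEED, ...), whose seed is revealed later.
    from_server = [mac(k2, b) for b in alice_blind]
    random.shuffle(from_server)

    # Bob -> Alice: double-HMACs of his own contacts (he knows both keys).
    from_bob = [mac(k2, mac(k1, c)) for c in bob_contacts]
    random.shuffle(from_bob)

    # Alice intersects the two blinded sets; no plaintext identifier has
    # crossed the wire, and the server never saw Bob's set at all.
    intersection = set(from_server) & set(from_bob)

    # Bob, knowing K1 and K2, can invert the intersection back to names:
    lookup = {mac(k2, mac(k1, c)): c for c in bob_contacts}
    print([lookup[x] for x in intersection])  # -> [b'charlie']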

[1] https://news.ycombinator.com/item?id=7009674

[2] http://research.microsoft.com/pubs/194141/sapsi2.pdf


Despite this excellent explanation by wtbob, despite Moxie's blog posting, despite EFF, Snowden, Poitras, Schneier supporting it, I distrust Whisper Systems for insisting Signal must have access to the address book, refusing to work without it. I tried it several times over the years, hoping they would have reverted this and allowed usage without access to the address book. Every time, I deregistered again.

Today I will delete Signal from my devices, because earlier this month Wire.com included PFS [0] without requiring access to the address book. I'll start using Wire.com despite it not being fully open source. Until now I was using iMessage and FaceTime Audio, as they seem the most secure PFS solutions to date that don't require the address book to be uploaded. For some contacts I have no option but to use WhatsApp, which is quite a hassle without allowing WhatsApp access to the address book. From today I'll switch to Wire.com, as it seems more secure than Apple's offering since Wire.com allows signature verification.

Regretfully, Tom's Hardware [1] doesn't mention the address book in its comparison table, nor does it mention Apple's offerings. Neither does Wikipedia [2].

Edit: found a German video blog [3] that does mention the address book capabilities of Wire.com.

[0] http://www.reuters.com/article/us-dataprotection-messaging-w...

[1] http://www.tomshardware.com/news/wire-app-complete-end-to-en...

[2] https://en.wikipedia.org/wiki/Comparison_of_instant_messagin...

[3] https://www.youtube.com/watch?v=9kCsFjBuH0o&t=74


> Despite this excellent explanation by wtbob, despite Moxie's blog posting, despite EFF, Snowden, Poitras, Schneier supporting it, I distrust Whisper Systems for insisting Signal must have access to the address book, refusing to work without it.

The mechanism I explained would permit Signal to work without uploading one's address book to the Signal servers. It would, of course, still require access to the address book itself in order to perform contact discovery. I personally don't mind allowing access to the address book itself; what I mind is uploading its contents.


> what I mind is uploading its contents.

You're correct that uploading its contents is what it is all about for me. And why doesn't Signal use the new CNContactPickerViewController on iOS? But if I'm not able to verify the code, why does Signal require access? If that is their business model, that is fine for Moxie et al. but not for me.

@wtbob thanks for the clarification. I wonder what you think of Wire.com?


I would want to find out more info about Wire first. There just does not seem to be a whole lot of information available on the relative security of Wire, nor many people (whose opinion I would generally trust) weighing in on it.

BTW, I did notice that Wire added E2E after the fact and not out of the box, so it's not clear if there's some tradeoff there.

Also, Wire appears to use SRTP instead of ZRTP.


> ...I distrust Whisper Systems for insisting Signal must have access to the address book, refusing to work without it.

Signal works just fine when I tell CyanogenMod to deny it read and write access to my contacts. I can draft new messages by entering a phone number.


This does look like an interesting protocol, but Signal almost certainly wants offline contact discovery.


Why not share the contact list with the people on your contact list? Facebook demonstrated that most people don't have a problem with that and most people I want to contact are already contacts of some contacts that I have.


That's the goal. The problem is discovering it. If x has me on the list and I have x on the list, how can our two Signal apps discover that? Without telling the Signal server, and without telling random third parties?


> If x has me on the list and I have x on the list, how can our two Signal apps discover that? Without telling the Signal server, and without telling random third parties?

Take a look at the protocol documented in https://news.ycombinator.com/item?id=11289223

It's secure against an honest server and a malicious person, and against a dishonest server; it's insecure against a colluding server and person.


Fascinating algorithm. I wonder how many Signal users it would discover in the ~500 typical contacts, given one starting contact.


There needs to be a decentralized way to tell [some of] your contacts directly that you have app X installed (or that you support messaging protocol Y).


"Some of", you say.

Telling all of your contacts is simple: just send them an SMS. But you probably don't want that, right? Maybe you don't even want everyone to know what apps you use? So it's "some of", at which point you're back to the same problem: discovering which of the contacts to tell.

Like so many networking problems, this one is easy to solve provided that it's already solved by someone else. See RFC 1925 points 6 and 6a.


RFC 1925 has no points 6 and 6a. Did you mean 2.6 and 2.6a?

What you could do is publish the information which apps you have or which protocols you support using FOAF spheres [http://reagle.org/joseph/2003/09/foaf-spheres.html]. Encrypt the information using the public keys of your friends. Publish the information on your private web server so no central entity can see who retrieves it.
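
A sketch of the per-friend encryption step, using PyNaCl sealed boxes (the payload format, key handling and hosting details here are all simplified assumptions):

    # pip install pynacl
    from nacl.public import PrivateKey, SealedBox

    payload = b'{"apps": ["signal"], "protocols": ["xmpp+otr"]}'

    # Each friend has a keypair; you encrypt one copy of the payload per
    # friend under their public key and publish the blobs on your server.
    friend_key = PrivateKey.generate()  # normally only the friend holds this
    blob = SealedBox(friend_key.public_key).encrypt(payload)

    # Only that friend can open their copy:
    print(SealedBox(friend_key).decrypt(blob))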


I have United and other services in my address book. This is a non-starter, as I don't want to have to selectively determine on a case-by-case basis whom to send to and whom not to.


So, then, once everyone imagines they're safe in a totally private encapsulated world of secret hashes, along comes a peer-to-peer initiative targeting a brute-force attack on the entire keyspace, using strategies not unlike the bitcoin blockchain.


Can't you just make the hash more expensive?

Also, couldn't you regularly rotate salts?


As I noted in another comment, I don't think this is the right solution.

I would argue that things need to be tied to an identifier other than the phone number. Therefore, if something does happen, it's much lower friction to simply reset the identifier.

To deal with some of the UX pain, one could force a voice call requirement instead of just a fingerprint reverification.


Ten-character alphanumeric rainbow tables can be had for only a few thousand dollars.


There cannot be precomputed tables for every custom hash computation. Especially an expensive one. Even less so if you add a salt.

It could be made prohibitively costly for all but the most determined adversaries - and those could probably get your social graph much more easily from other sources.

The only problem I can see is that it would take a few minutes after the app is installed to find your contacts.


You can't use a salt, the contact you want to match has to be able to compute the same hash.

You can hash the contact pair instead of one side of it, but even then an attacker can rent X instances of some hardware that is Y times faster than your phone, so if you target a difficulty of ~1 second per contact on your phone, you get a difficulty of 2700 hours / XY for the attacker to recover all of your contacts

So if Y is 10 times faster than your phone, you only need X=10 to recover the contacts for a given number in a day. If the attacker is willing to set X=1000, Y isn't even relevant.


> You can't use a salt, the contact you want to match has to be able to compute the same hash.

I think the misunderstanding here is that I meant one salt for all users at a time, instead of one salt for each hash.

And as long as stored hashes can be recomputed (by active users), the 'global salt' can be rotated over time.

We can pre-compute hashes with future salts in case apps are not constantly online. Then we could rotate salts, say, once every 2 hours.

The point being to make rainbow tables more expensive.

> You can hash the contact pair instead of one side of it, but even then an attacker can rent X instances of some hardware that is Y times faster than your phone, so if you target a difficulty of ~1 second per contact on your phone, you get a difficulty of 2700 hours / XY for the attacker to recover all of your contacts

> So if Y is 10 times faster than your phone, you only need X=10 to recover the contacts for a given number in a day. If the attacker is willing to set X=1000, Y isn't even relevant.

I think this should be (10^10/3600 ~ 2.7 million hours) / XY. Also, let's say we can make it Y seconds (perhaps the hash is memory hard, etc). Then to compute a single table (per salt) we need 2.7e6 hours, at say $0.01 per hour, so $27k per table.
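
In code, the back-of-the-envelope (with the 1-second hash and $0.01 per compute-hour as assumptions):

    PHONE_SPACE = 10**10       # roughly all phone numbers
    HASH_SECONDS = 1           # assumed cost per hash
    DOLLARS_PER_HOUR = 0.01    # assumed compute rental price

    hours_per_table = PHONE_SPACE * HASH_SECONDS / 3600
    print("%.1e hours" % hours_per_table)                            # ~2.8e6 hours
    print("$%.0f per table" % (hours_per_table * DOLLARS_PER_HOUR))  # ~$27,778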

Taken together, that would mean that revealing the contacts of signups happening within a 2-hour window would cost quite a sum.

It's not secure, of course - but it does give some measure of privacy.

If you think users are willing to wait more than an hour, you could 10x the cost too. But that's probably not practical anymore!

Edit: Oh. I see I made a mistake here! If the server computes a single rainbow table it will be able to retroactively de-anonymize all users that are already active at that time! :( Sorry.


Yeah, somehow I missed a couple of zeros, it would be 2.7 million.

Such a salt wouldn't tighten up the window, though: for contact discovery to work, all the outputs have to be available to whatever is doing the matching.


OK hold on. Say the rotating salt thing doesn't work, for whatever reason.

But you yourself made a better suggestion! If we hash (phone #, contact) pairs, and every client publishes multiple hashes, one for each of its contacts, then a cheating server would need to compute (at most) (10^10 * 10^10)/3600 ~ 2.8e16 hours worth of hashes.

That would actually be secure, no?

Edit: messed up the math myself this time; I think this is correct now though
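
Concretely, the pair construction might look like this (the fixed (registrant, contact) ordering is my assumption; both sides need the same convention):

    import hashlib

    def pair_hash(registrant, contact):
        # Hashing the pair blows the preimage space up to ~10^10 * 10^10,
        # versus ~10^10 for a single number.
        return hashlib.sha256(("%s|%s" % (registrant, contact)).encode()).hexdigest()

    # Alice (+15551234567) registers and publishes one hash per contact:
    published = {pair_hash("+15551234567", c) for c in ["+15557654321"]}

    # Bob (+15557654321) checks which of his contacts have him in their list:
    for c in ["+15551234567", "+15550000000"]:
        print(c, pair_hash(c, "+15557654321") in published)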


For the contact discovery to be private, the phone has to calculate the hash. So the server cost also depends on the performance differential (many entities would likely still spend $50,000 to discover the contacts of a high value target).

Adding information increases the cost, but it also reduces the likelihood of a match.

You've edited your comment since I wrote the above. Anyway, if the attack is directed at a single user, there's only one 10^10 involved (and really, the phone number space isn't that big).


Really sorry about the edits! Thinking in real time :)

So about the size of the phone number space. Interestingly I guess you could at least limit it by country. But, unless you already know the mobile phone numbers of the target's acquaintances, (in which case there is no real danger of 'leaking' your social graph), you can't really limit it more, because mobile phone numbers are pretty random within each country's phone number space (I think).

So say the US has at least 0.5 billion assigned mobile numbers, and you don't know which are assigned, so the space could easily be 10^10 for just the US. (I can't find any quick numbers on Google right now).

But even if we go down to, say, 100 million, we would still get (10^8 * 10^8)/3600 ~ 2.8e12 hours worth of hashing to get all contact pairs. Which is still secure.

Now as far as limiting one end point to a single phone number - that would be a special case. I'm not sure if this isn't moving the goal post a bit.

As long as the server does not have access to your own phone number, it, or anyone who has hacked into the server, cannot really use that to limit their computations. Unless they are already targeting you through other means (possibly you are a person of interest to someone), and so know your phone #.

In that case yes, they could reveal (part of) your social graph for a (maybe) affordable cost. But in that very special case, I think we are dealing with nation-state adversaries.

And in that case, the telcos probably already have a list of your phone # contacts. So it would probably be easier to simply get it from them.

The only thing left to be revealed would be that you communicated with a specific acquaintance using that particular app, at some particular time.

So a targeted attack could leak that metadata.

It's not perfect.. but I think it might just be good enough for most users.

If we go off the deep end with an attack tailored to a single phone #, we also have to worry about a whole lot of other attack vectors that are probably more important.

Lastly, note that the user (if sufficiently paranoid) could simply opt out of the automatic contact search post-install.


As for mobile phone numbers being randomly placed inside the larger space of all phone numbers: while this makes some sense for the US/NANP, in most of the rest of the world mobile numbers have semi-well-defined prefixes (as in "a number that starts with 6 or 7 is certainly a mobile phone"; discerning the operator is slightly non-trivial), and even for NANP there is probably some publicly available database that says whether a given number is mobile or not.


I don't think it's moving the goalposts, the ability to focus an attack on an individual number is an important consideration for an individual evaluating the system. That only their contacts are revealed is not much consolation when their contacts are revealed.

The ability to move out to the phone network and observe metadata there does make the discussion somewhat academic.


The point is that the hash needs to be the same for all users, so salts are out of the question, and custom hashing schemes do not help you, as the assumed installed base is significant.


See my sibling comment about the salt.

As long as hashes are computed client-side, I don't see why the size of the installed base matters?


Size of the installed base matters because it is a perfect argument for why using some weird custom hashing does not solve anything: all clients have to use the same algorithm, and when your network is an interesting target, it is also big enough to account for whatever effort is required to break the hash by brute force or to generate useful rainbow tables.


> This would be a great problem to really solve well. If anyone develops any insight into a practical privacy preserving mechanism for achieving this type of contact discovery, or notices something we’ve missed, please let us know here, on twitter, or on our development mailing list.

> – Moxie Marlinspike, 03 January 2014

This idea is probably full of holes, but there might be something useful here...

Could you use a layered system of bloom filters over shards of the hash(telno) space?

(Figures are made up, but are roughly the right order of magnitude for millions of service users.)

The filters could be made smaller, at the expense of having more of them.

Layer 0 is a 1MB bloom filter with a 10% error rate, for all hashes of telnos

Layer 1 are 16 1MB bloom filters with 1% error rates, sharded by the last nibble of the telno's hash

Layer 2 are 256 1MB bloom filters with 0.1% error rates, sharded by the last byte of the telno's hash

Layer 3 are 4096 1MB bloom filters with 0.01% error rates, sharded by the last 12 bits of the telno's hash

...

A newly installed app uses the following algorithm to determine which of its contacts are already users of the service, without disclosing the phone's contact list to the service:

The app makes a copy of the list of contacts' telephone numbers (E164 format).

Request the L0 bloom filter. Using L0, 90% of non-member telnos are removed.

Request only the required L1 shards, for the remaining telnos. Using the necessary L1 shards, more non-member telnos are removed (99% are gone now).

Request only the required L2 shards, for the remaining telnos. Using the necessary L2 shards, more non-member telnos are removed (99.9% are gone now).

Repeat until required error rate is reached. i.e. The list only contains telnos of other service users.

The benefit of this method is that bloom filter shards are only requested for numbers which are probably service users, thus disclosing much less information to the service provider.

90% of contacts' numbers will never be used to request a Layer 1 bloom filter, and those that are will only disclose 4 bits of the hash to the service provider, narrowing them to 6% of the world's telnos.

99% of contacts' numbers will never be used to request a Layer 2 bloom filter, and those that are will only disclose 8 bits of the hash, narrowing them to 0.4% of the world's telnos.

99.9% of contacts' numbers will never be used to request a Layer 3 bloom filter, and those that are will only disclose 12 bits of the hash, narrowing them to 0.02% of the world's telnos. (About 2.4 million telnos, if there are 10 billion telnos, as per the webpage.)
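
A toy Python sketch of the first two layers (filter sizes, error rates and the shard transport are all placeholders):

    import hashlib

    class Bloom:
        def __init__(self, bits=2**20, hashes=4):  # toy parameters
            self.bits, self.hashes = bits, hashes
            self.bitmap = bytearray(bits // 8)
        def _positions(self, item):
            for i in range(self.hashes):
                d = hashlib.sha256(("%d:%s" % (i, item)).encode()).digest()
                yield int.from_bytes(d[:8], "big") % self.bits
        def add(self, item):
            for p in self._positions(item):
                self.bitmap[p // 8] |= 1 << (p % 8)
        def __contains__(self, item):
            return all(self.bitmap[p // 8] & (1 << (p % 8))
                       for p in self._positions(item))

    def tel_hash(telno):
        return hashlib.sha256(telno.encode()).hexdigest()

    # Server: one L0 filter, plus L1 filters sharded on the hash's last nibble.
    users = ["+15551230001", "+15551230002", "+15559990007"]
    l0 = Bloom()
    l1 = {nibble: Bloom() for nibble in "0123456789abcdef"}
    for u in users:
        h = tel_hash(u)
        l0.add(h)
        l1[h[-1]].add(h)

    # Client: filter contacts through L0, then fetch only the L1 shards
    # actually needed, disclosing just 4 bits of hash for the survivors.
    contacts = ["+15551230001", "+15550000000", "+15551112222"]
    survivors = [c for c in contacts if tel_hash(c) in l0]
    needed = {tel_hash(c)[-1] for c in survivors}  # all the server learns
    survivors = [c for c in survivors if tel_hash(c) in l1[tel_hash(c)[-1]]]
    print(survivors)  # descend further layers until the error rate is acceptable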


The server can cheat.

If I provide an L0 with numbers that begin with `1` and you don't download an L1, I learn you don't have any numbers that begin with `1`.

Facebook says they get 5 new profiles every second[1], so frequent re-querying may be justifiable.

[1]: https://zephoria.com/top-15-valuable-facebook-statistics/


Good catch. This is a weakness, but it may not be relevant to WhisperSystems' use case.

If you provide an L0 with numbers whose hashes end with '1' (hex), and I don't request L1/..1 shard, then you learn that I don't have any numbers whose hashes end with '1' (hex).

I have 1024 contact telno hashes.

Assuming I request only when the app is installed, and daily after that, it will take you:

16 days to learn which final nibbles my contact hashes have - probably all of them.

256 days to learn which final bytes my contact hashes have - probably all of them.

4096 days to learn which final 12 bits my contact hashes have - probably around a quarter of them.

Three years in, you know which 900-1024 of the 4096 12 bit combos match the last twelve bits of one or more of the telno hashes.

~950 * 16 = 15,200 days to learn the final two bytes of the hashes, of which there will probably be ~1000 unique values.

You now know ~1000 buckets of ~150,000 numbers each, which contain one or more numbers from my contacts.

If you update the app, over the air, to query every second, this would take just a few hours. But then, you could just update the app to grab all my contacts and send it instantly.

The "server cheating" is a valid attack, but it may not be important, as there are easier attacks if we assume a malicious service.

I assumed that WhisperSystems were trying to limit the information they had, as a good actor, rather than ensuring complete app safety vs a malicious service provider.

i.e. their motivation is to not have the information to turn over, rather than to be unable to get it when compelled to modify their app. This is an important legal distinction in some jurisdictions.


I think it's a bit worse than that. Suppose the server wants to learn whether you have a certain number in your contacts.

Then they simply put that number into your lower-level Bloom filters for L0, L1, L2, ... even if that number is not really a member of the service. To prevent a final match, they only need to keep the target number out of the final BF.

Then they passively watch which BF's you retrieve at L1, L2, ... etc. If you never retrieve the BF for Ln, despite the bits for the target number being present in the L(n-1) Bloom filter, then they know you don't have the target as a contact.

Does that make sense?

Of course, this attack and the previous one both go away if we assume the server is "honest but curious".


That makes sense, but any malicious service could pretend that a particular telno was a user of the service, in order to see which users had that contact.

I'm assuming the objective is to minimise the service's unnecessary knowledge of non-service-user contacts.


> I'm assuming the objective is to minimise the service's unnecessary knowledge of non-service-user contacts.

Right, so an honest-but-curious adversary model sounds like it's a good fit for the server. Then all these attacks where the server lies about something are out of scope.

Again, I like this idea so I think it's worth thinking about it some more. Not trying to nitpick here, just trying to help think through what you can do.

What about the clients? Do you want to limit what they can learn?

Given any telno, a client can quickly tell whether that telno belongs to the service, even if the client doesn't really have that number in its contacts. Does that matter? I guess we come to a sort of meta-point here. What does it mean for a number to be in someone's address book? Anything? A fundamental part of the problem seems to stem from trying to make this work with identifiers that come from a (cryptographically speaking) very small space.

Extending the attack above, really the client can get a pretty decent approximation of the entire list of telnos in the service. If he checks all 10 billion telnos against the BF for L0, he gets a list that includes all subscribers and several false positives -- each number in the list is only 90% likely to be a subscriber. If he wants to refine his estimate for the likelihood that any given number is a subscriber, he retrieves the corresponding BF for L1. So he starts with a decent estimate of the subscriber list, and with each BF that he downloads, his estimate improves. How bad is that? I'm really not sure...


The number of downloads required to get each additional nibble is 16 times that of the previous nibble.


Also -- I was too sloppy in my earlier analysis, and as a result, my numbers could be way off. The attacker might learn much less than I initially thought. It depends on how many users are subscribed to the service.

After the client downloads the BF for L0 and hashes the entire space of 10 billion telnos, he knows 90% of the non-subscribers for certain. All the telnos that matched in the Bloom filter are either (a) subscribers of the service or (b) false positives from the BF. Let's say there are n subscribers. Then the attacker gets a total of n + 0.10 * (10^10 - n) hits on L0. The likelihood that a matching telno is actually a subscriber is n / (n + 0.10 * (10^10 - n)).

So he learns something with each level of the Bloom filter, but getting to 90% certainty may require retrieving several levels of the BF.
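
In numbers, for a few assumed subscriber counts n:

    SPACE = 10**10   # total telno space
    FPR = 0.10       # L0 false-positive rate from the proposal

    for n in (10**6, 10**7, 10**8):
        hits = n + FPR * (SPACE - n)
        print("n=%.0e: P(L0 hit is a subscriber) = %.4f" % (n, n / hits))
    # n=1e+06: 0.0010, n=1e+07: 0.0099, n=1e+08: 0.0917 -- each further
    # layer multiplies the attacker's certainty, but from a low base.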


Thanks for your questions and analysis.


Your idea looks promising. Thank you for posting this.



