As far as I can see this is quite similar to the 'sharded bloom filter' discussed in the submitted link.
This approach is dismissed in the document as
> In the end, this strategy seems to provide very little at the cost of fairly substantial complexity.
Since it seems you use it for silent circle, how does it work there? How many 'longer hashes' do clients get on average per submitted 'short hash'? How large is the 'false positive' rate you get from the long hashes? Since the argument against the 'sharded bloom filter' in the submitted link boils down to 'the tradeoffs don't work', I would be very interested to know how these tradeoffs work out for silent circle.
As far as I understand, you would still need to have some rate limiting in order to stop someone to collect all registered phone numbers though, wouldn't you?
> As far as I can see this is quite similar to the 'sharded bloom filter' discussed in the submitted link.
It's pretty similar, but not exactly like that. The main difference is that the server doesn't need to transmit a bloom filter, it transmits partial hashes, but yes, there's a tradeoff.
> How many 'longer hashes' do clients get on average per submitted 'short hash'? How large is the 'false positive' rate you get from the long hashes?
I'm afraid I don't know the answer to these questions, as we don't keep logs... The false positive rate of the long hashes should be astronomically low (since they are full 256-bit hashes), but it doesn't much matter because the client only needs to be reasonably sure that the user exists so it can perform an actual query with the plaintext identifier.
The problem isn't really "we shouldn't know any of your contacts", it's "we shouldn't know any of your contacts that aren't using our service", since clients have to divulge some of their contacts in order to set up calls with them or send them messages. That means the client only has to know which of their contacts are probably on the server so they can request more information, which will, at that point, tell them with certainty whether a contact is using the service.
P.S. My opinions are my own, I'm not speaking for SC, etc.
I have to say that this sounds very reasonable (and certainly a lot better than sending your plain text contacts).
Does anyone know what kind of solution Signal is using nowadays? The linked article is not very recent and they might have changed their method in the meantime. I had a quick look at their source but was not successful in finding the relevant part for uploading the contacts.
I think the concern may be that the total space of hashes for valid email addresses, especially ranked by domain usage, is small enough that SHA hashing doesn't meaningfully obscure content.
From the article:
Hash It!
The first instinct is often just to hash the contact information before sending it to the server. If the server has the SHA256 hash of every registered user, it can just check to see if those match any of the SHA256 hashes of contacts transmitted by a client.
Unfortunately, this doesn’t work because the “preimage space” (the set of all possible hash inputs) is small enough to easily calculate a map of all possible hash inputs to hash outputs. There are only roughly 10^10 phone numbers, and while the set of all possible email addresses is less finite, it’s still not terribly great. Inverting these hashes is basically a straightforward dictionary attack. It’s not possible to “salt” the hashes, either (they always have to match), which makes building rainbow tables possible.
https://github.com/SilentCircle/contact-discovery
Moxie said it wasn't "meaningfully private", if I recall correctly, without elaborating further.