MongoDB Releases Queryable Encryption Preview (mongodb.com)
120 points by andrewbarba on June 7, 2022 | 67 comments



This is a really neat technology, but I don't understand its use case. I've worked in HealthTech and currently in the compliance space. I'm skeptical of Mongo's claims (and their familiarity with compliance laws). Kind of feels like a solution in search of a problem.

"In use" implies that you have a need to process that data. It doesn't matter if the end client is submitting queries in plain text (protected in transit) or this fancy encryption, the client (or server) still needs to be authorized to query that data. Translating from plain-text to encryption does not add additional protections from a compliance perspective.


This seems more applicable for the SaaS hosting model where the database service is managed by a 3rd party. So the use case is "I trust your SaaS service is compliant with my legal obligations to protect my customer data, but it'd be easier for everyone involved if your database service also has no way of seeing sensitive data fields. That would make it easier for me to pass my compliance audits, otherwise I need to audit you." So the data is encrypted client-side before it's sent over to the database service, and the database service is not able to decrypt it, but can still include the encrypted value in a query.


The problem is that situation doesn't really exist.

At an organizational level, it's extremely hard to control what information gets put into a SaaS. There are far too many ways in which data can be de-anonymized or inferred against (e.g. a field existing can have privacy implications).

It's far safer to use a SaaS provider that meets general control requirements than to try to shoe-horn encrypted data into them.


Not sure about any specific use cases for this Mongo feature, but security requirements are often just checkboxes without much thought put behind them, or not valid in certain contexts, like SaaS/cloud.

Classic example is on-prem enterprises requiring data encryption at rest when moving to a cloud vendor. I can explain to a client that encrypting an S3 bucket with an AWS-managed key doesn't really prevent anything beyond someone physically stealing a hard drive from the AWS data center, and that the cloud provider can still see all of their data because they control the encryption key... or I can just click the "encrypt data" flag on the S3 bucket, make their security and compliance officer happy, and be done with it.

So, you're totally right, but this might be a case they needed to satisfy where an enterprise security team or regulatory agency said that they couldn't put X data field in the cloud unless it was encrypted, but X data field was really important to the application team.


> that encrypting an S3 bucket with an AWS-managed key doesn't really prevent anything beyond someone physically stealing a hard drive from the AWS data center

It does, though. To get that data, you now need access to the bucket itself _and_ the KMS-managed encryption key. You might not be protecting the data from AWS, but one bucket misconfiguration doesn't lead to wholesale data loss now.

Is it perfect? No. You can misconfigure both. But misconfiguring KMS access is harder to do.


To be clear, I'm talking about using the default "AWS KMS key" as they call it now, not managing your own keys. Just click the box on S3 and it's encrypted at rest, but completely transparent. If a user has access to the S3 bucket, they have access to the data within it. This has been sufficient for every enterprise client I've worked with because it checks the "data encrypted at rest" box for SOC2, ISO-27001, etc.


Correct, a database feature can't prevent errant developers from posting customer credit card numbers on Twitter either, but compliance auditors won't accept "We don't bother encrypting data because it's pointless!" as an answer. The very real use case here is when private contracts between an end customer and a service provider say "You are in breach of this contract if you load my data to any 3rd party service that has the ability to see my data, regardless of whether they use that ability or not. You are allowed to use that other service if you extend your audit program to also audit that other service provider according to my standards and report back to me." I've personally seen this situation prevent the use of SaaS. An encryption control like this where the 3rd party doesn't have keys to decrypt the data is considered satisfactory.


> It doesn't matter if the end client is submitting queries in plain text (protected in transit) or this fancy encryption

It's not just the query that is encrypted in this case, but the data being queried. From MongoDB's description, the server never receives or stores plaintext data, and the query results can only be decrypted by a client who has the same key that was used to encrypt the data in the first place. From a compliance perspective, that's amazing if it works. It means the server is never storing or processing anything but ciphertext.


Yes, and in the context of Mongo-as-a-Service, it's amazing both to the client and also the service provider (less liability).


In one of the books about the general idea, _Translucent Databases_, the idea is to save the costs of securing the raw data. Someone might break into the database server (or listen on the wire) and find only encrypted values. This can make many different architectural use cases easier to deliver.

In the most extreme cases, the unencrypted values never leave the client. The database can concentrate on delivering storage and fast query answers without paying much attention to issues of security. Clients don't need to trust the database because they control the encryption.


I see this as more about fundamental trust.. confidentiality from the service providers, not compliance


To be fair: the Compliance Regime didn't invent any of the technologies they create frameworks around, and if we followed every compliance framework's recommendations to a T, the systems produced would by-and-large be insecure. They paint with an extremely large brush; and it's a toss-up whether the auditor has even been involved in anything technology-related beyond auditing for, well, decades. There's good ones and bad ones, but the integrity of many audit processes relies to a significant degree on the goodwill of the SMEs of the systems and processes being audited.

Just as a dumb example; an auditor says passwords need to be hashed with bcrypt. They find a code sample that says "store(bcrypt(password))". Awesome; complied to a T. But true security goes beyond that: are we using a library for bcrypt, or an internal implementation? Is the internal implementation well-implemented? Is the library free of CVEs (maybe they check that)? Did we trace that call to ensure the data generated is what is inserted to the db, or was it intercepted by some middleware? Did we name that function 'bcrypt' but it's actually just MD5?

My point is really not to assert that auditing is pointless, but rather that it's fundamentally limited in what kind of attestations it can make.

One great example I can pull from a few recent audits I've been through: serverless tech like Fargate. This oftentimes blows auditors away (or, rather, it used to; nowadays they've seen it so often that they just know). It checks so many boxes. They'll present multi-page forms about data center colos and operating system security and operator SSH access and we'll say "We use Fargate". "Oh nice, ok we can check all of these and carve out with AWS's attestation for (ComplianceFrameworkX)". It saves hours, days, of time.

That's, I think, where homomorphic encryption can go. That isn't what this is, but it's a step toward that. It's not about meeting today's compliance frameworks; it's about evolving the framework. And, in the interim, as advanced R&D teams meet these auditors, they'll educate-up how, yeah, you've got a lot of questions here, but it's not that we do or don't meet them: it's that they're fundamentally the wrong questions to ask; but we understand the spirit, here's how we meet the spirit, and here's how we're actually better than if we had just checked Yes on all of them.

Third example: years ago, with our team, it was the first time our auditor had ever seen LetsEncrypt and k8s certificate-manager (then it was called kube-lego). He wanted an attestation that TLS certificates were current and not near-expiration. We countered: they can't be near-expiration, because we have automated systems which renew them. He'd never seen anything like it; he was used to expensive certificates and operations runbooks for renewal; and we nerded out for ten minutes showing it all off. Instead of documenting a runbook for renewing certificates, he documented our runbook for maintaining this automated service and ensuring uptime. Win-win.

It's a slow process, and it's made even slower because there are tons of people in the industry who treat the frameworks as gospel. But, ultimately, we control the technology, not them. We decide what is secure; they just attest to it and double-check.


This feature is a result of MongoDB's acquisition of Aroki. It looks like a good product but we actually beat them to it with https://cipherstash.com/activestash

CipherStash works with any Database and also supports Range queries and sorting/ordering. We do it in the application layer. Only supports Ruby so far but C#, Java, Python, Rust are in the works.


What about Go, or even Tcl, and Ocaml? Do you have pointers to docs that'd help OSS efforts in this department?


Not yet but that's a good suggestion! The core client code is Rust so additional languages are (mostly) just native bindings to Rust. We will be releasing the Rust SDK publicly soon and welcome contributions!


Help me understand this...

It says it will support prefix search, substring search, and the like. Can anyone point me in the right direction on what the algorithm may be here? I don't get how you could do those things without making the encryption less secure and/or decrypting every record on the fly.

Another interesting use case I found that isn't mentioned here is sort. I've had customers ask me to be able to sort the results by PII and we tell them... no, we can't do that because the field is encrypted.


These things are indeed possible while maintaining fully semantically secure encryption. Recent, mostly theoretical work shows that this is possible using fully homomorphic encryption. The basic idea is, the client can encrypt its query, the server can process the encrypted query and produce an encrypted result, and send this back to the client. It sounds impossible, but it isn’t! Very cool stuff. There are actually also some practical implementations that work… so it’s gradually exiting the “theoretical only” stage.

MongoDB is very short on details, and I suspect they do something worse than homomorphic encryption, that does indeed make some kind of compromise between privacy and convenience.


Yeah, they contrast their method with homomorphic encryption, which makes me share your suspicion


Searchable encryption trades privacy for efficiency. However, the privacy loss can be tuned. For example, SE constructions will specify whether they leak search-pattern (how many of the same queries a client makes), access-pattern (the frequencies with which different items are accessed) or other things. Usually, a client can pay in storage/bandwidth to mitigate these leakages.


Yeah, I've been looking for more information and I can't really see any indication as to how they are planning on implementing it. The whole thing seems more like marketing than actual innovation: searching encrypted data isn't that complicated if you are always dealing with the entire ciphertext, it's just another string in that use case.


> searching encrypted data isn't that complicated if you are always dealing with the entire ciphertext, it's just another string in that use case.

This isn't really true because there are multiple ciphertexts that can decode to the same plaintext in any modern encryption algorithm. If you skip that property you weaken the encryption. (Chosen plaintext attacks)


it's not complicated if they are using deterministic encryption - which brings its own issues


It is less secure than your standard symmetric encryption. I guess they would use deterministic encryption in which 2 entries with same email address will have the same record string ( this leaks information to attacker ). Prefix search & sort can be achieved by using order preserving encryption. Not really sure about sub-string though.
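
A minimal sketch of the leak described here, assuming the deterministic scheme behaves like a keyed hash. HMAC-SHA256 is only a stand-in (the announcement doesn't say what MongoDB actually uses), and the key, helper name, and sample emails are made up for illustration:

```python
import hashlib
import hmac
from collections import Counter

# Hypothetical client-side key; the server never sees it.
KEY = b"client-side-secret"

def det_token(value: str) -> str:
    """Deterministic token: the same plaintext always yields the same output."""
    return hmac.new(KEY, value.encode(), hashlib.sha256).hexdigest()

emails = ["ann@example.com", "bob@example.com", "ann@example.com"]
stored = [det_token(e) for e in emails]  # what the server would hold

# Equality search works without the key: the server just compares tokens.
query = det_token("ann@example.com")
matches = [i for i, t in enumerate(stored) if t == query]

# The leak: the server also learns which rows are equal and how often
# each (unknown) value occurs.
frequencies = sorted(Counter(stored).values(), reverse=True)
```

Anyone holding `stored` can see that rows 0 and 2 share a value, and with a known distribution of emails or names that frequency information can be enough to guess plaintexts.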


I've researched order preserving encryption before but the tradeoffs (mainly that the attacker can tell the order and use that to narrow the search space) always seemed like high risk.


High risk compared to what? The alternative is absolutely no privacy (status quo) or no/limited functionality (not very useful). Seems like strictly better than having no privacy.


Depending on your compliance needs and the sensitivity of your data, "limited functionality" may be a reasonable tradeoff, though.


Using fake encryption is much riskier than no encryption, because if you think you are safe you will do unsafe things with your data. If you know you are unsafe then you will take appropriate precautions.


Related video explaining encryption schemes to make encrypted data in a DB queryable:

CryptDB: Processing Queries on an Encrypted Database

https://youtu.be/xsaXMUelOEA?t=807


I was under the impression that cryptdb "encryption" was thoroughly broken. Am I mistaken?

E.g. googling I found http://cs.brown.edu/people/seny/pubs/edb.pdf


Not broken according to the response to that paper:

the conclusions drawn by this paper with regard to CryptDB's guarantees for medical applications are incorrect: had the guidelines been followed, none of the claimed attacks would have been possible. [1]

[1] https://css.csail.mit.edu/cryptdb/response.html


This is really neat. Recently I explored similar functionality for relational databases and only got as far as implementing column-level encryption [0] in this Go library [1], but without support for querying the encrypted data. HashiCorp Vault's transit secrets engine supports Convergent Encryption [2] which provides limited ability to query the encrypted data, but I haven't yet experimented with it. If anyone is doing something like this in production, would love to hear about your experience.

[0]: https://en.wikipedia.org/wiki/Column_Level_Encryption

[1]: https://github.com/bincyber/go-sqlcrypter

[2]: https://www.vaultproject.io/docs/secrets/transit#convergent-...


The MuchPIR project (https://github.com/ReverseControl/MuchPIR) implements Information-Theoretic Private Information Retrieval (IT-PIR) in Postgresql; In addition to the demo there is a high performance version available for commercial use.


I didn't know this was a thing. The article mentions it can do equality, range, prefix, suffix and substring queries. Does this mean that the encryption scheme creates sortable 1:1 mapped results after encryption? Kind of like a shift cipher?


They mention this:

"Queryable Encryption was designed by MongoDB’s Advanced Cryptography Research Group, headed by Seny Kamara and Tarik Moataz"

Some related papers with those two as authors:

https://eprint.iacr.org/2016/453.pdf

https://cs.brown.edu/people/seny/pubs/sgx.pdf


The problem is: is the full query also encrypted, or just some values that are considered sensitive? I remember research from some years ago showing that an attacker who can still see the SQL code can recover the content of the database by looking at the queries and the responses and "putting the pieces together". Now, if the target was to get the exact values inside the database (think about employees' wages), it still required observing a very large number of queries, but if you were interested in getting a reasonable interval for each value, then the number of queries needed becomes small enough to be doable in practice.

Unfortunately I don't seem to be able to find this again, but a quick search turned up two papers that say that just encrypting your db isn't enough: [0], [1]. In particular, [1] doesn't seem to go into the details of how you could recover the data, but mentions that many operations as performed by "normal" databases leak information if performed over encrypted data. Maybe someone who is more familiar with Queryable Encryption can comment on this?

[0] https://www.cs.cornell.edu/~shmat/shmat_hotos17.pdf

[1] https://www.microsoft.com/en-us/research/wp-content/uploads/...


You're on the right track. I work in the data security space and while this is a cool release, it's not novel[0] and has been around for a while[1]. As a general rule of thumb, the first thing to check is if the provider is asking you to pass in your query in plain text AND without a local client (very important, because if you're sending data in plaintext, the threat model is now transitioning to an honest-but-curious model).

This is obviously not that. They're encrypting locally. However, Simon Oya & Dr. Kerschbaum's paper, https://arxiv.org/abs/2010.03465, demonstrates a fantastically efficient attack to recover keywords on most constructions without a lot of queries. It is yet to be seen how effective MongoDB's implementation will be.

This is a very interesting space but structural encryption is the right way to put the theory into good use.

Most of the other encryption mechanisms such as homomorphic, partially homomorphic, etc. are just too impractical or require very specific niche use cases to be useful.

There are other misnamed technologies I've seen in marketing such as "polymorphic encryption" or "vaultless" - but most of these haven't had real research or cryptanalysis behind them.

[0] https://info.ionic.com/hubfs/IonicDotCom/Resources/Assets/Se...

[1] https://eprint.iacr.org/2017/111.pdf


Thank you for the information. Just one question: what do you mean by "without a local client"?


At a very high level, there has to be some sort of client-side encryption before transmitting the query to the server. That's usually the basic premise, client-side encrypts then transmits to the server which operates and returns a result that (in a simple 2 party system), the client (and only the client) will be able to decrypt. That is usually how these types of encryption protocols work.

That's what I mean by a "local client," there has to be something on the client side and it cannot just be something that communicates over the internet to a server w/o some sort of local encryption first.


I like it. There is always a way to hack something. This is an additional layer of security that yes, can be also broken.


Neat. Did they fix their blog's pagination yet? If you hit next enough times you may or may not be able to take down the site, don't ask me how I know.

(their pagination is implemented just by increasing the limit parameter).


Is this actually possible? Couldn't you make many repeated queries and slowly decrypt the text by e.g. slowly narrowing the range?
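
The narrowing idea can be sketched as a binary search, assuming an attacker who is able to issue arbitrary range queries and only learns whether the record matched. `matches_range` is a hypothetical oracle made up for this illustration, not a MongoDB API:

```python
SECRET_SALARY = 73_250  # value the attacker never sees directly

def matches_range(low: int, high: int) -> bool:
    """Hypothetical oracle: does the hidden record fall in [low, high]?"""
    return low <= SECRET_SALARY <= high

def recover(low: int, high: int):
    """Binary search over [low, high]; returns (value, queries_used)."""
    queries = 0
    while low < high:
        mid = (low + high) // 2
        queries += 1
        if matches_range(low, mid):
            high = mid       # the value is in the lower half
        else:
            low = mid + 1    # the value is in the upper half
    return low, queries

value, queries = recover(0, 1_000_000)
```

Roughly 20 queries pin down one value in a range of a million, so a scheme like this has to depend on parties without the key being unable to construct valid queries.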


This is possible. The goal is that the server knows as little as possible, while the client has full information. It's order revealing encryption. The server side knows the ordering of the values, but doesn't know any specific value. When queried, it is always getting prefixes (or exact matches) following the same encryption scheme, so it can compare those to the corpus and select results since the query parameters fall into the same ordering. The server doesn't have access to the keys needed to generate query parameters, so in theory it would be difficult for the server to perform narrowing queries on its own. Over time the server could gather statistical results that may reveal more about the data it's holding. Also, these schemes may need to produce the same cipher text for the same input, so frequency distributions can be used to reveal information.
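
As a toy illustration of the ordering leak (not a real order-revealing scheme and not MongoDB's construction), assume the client holds a keyed, random, strictly increasing map from plaintexts to ciphertexts. `random.Random` is not cryptographically secure and is used here only to keep the sketch short; all names and sizes are invented:

```python
import random

DOMAIN = 1000        # plaintexts are integers in [0, DOMAIN)
RANGE = 1_000_000    # ciphertexts are integers in [0, RANGE)

def make_table(key: int) -> list:
    """Toy order-preserving map: a keyed, random, strictly increasing function."""
    rng = random.Random(key)
    return sorted(rng.sample(range(RANGE), DOMAIN))

table = make_table(key=1234)   # held by the client only

def encrypt(x: int) -> int:
    return table[x]

ages = [42, 17, 65, 30]
stored = [encrypt(a) for a in ages]   # what the server holds

# The client asks for ages in [18, 50] by sending only encrypted endpoints.
lo, hi = encrypt(18), encrypt(50)
hits = [i for i, c in enumerate(stored) if lo <= c <= hi]   # server-side filter

# The leak: the server can rank every row without any key.
rank_from_ciphertexts = sorted(range(len(stored)), key=lambda i: stored[i])
```

The server answers the range query correctly (rows 0 and 3) and never sees an age, but it does learn the full ordering of the rows, which is exactly the statistical foothold described above.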


Yeah the article is very thin on technical details. To make this work as they describe, it must not be possible for any client to "forge" queries, or else they could trivially decode the content by sending prefix queries of increasing length.

It's also difficult to see how this could work on the server side without exposing some information about the encrypted fields. For example, if all documents have a value that begins with "a", then there must exist a prefix query that matches all those documents. I would expect it to be possible to figure out whether such a query is possible or not, only given access to the encrypted data, but even if that's not possible, the simple fact that a prefix query was issued that matched all documents gives away that information.


You could have a larger range than domain and throw in some noise. Exact match queries would need to become range queries that are de-noised at decryption.


Yes. This is the fundamental problem with this.

For something like HIPAA, this adds very little value if fields are semi-known.


This looks really cool, albeit it feels like it is actually a feature implemented in the driver (client side), so my initial impression is that it is not a meaningful innovation on the server side. This can be implemented with any Database, even with current MongoDBs


Nope it’s implemented on the server side. I think that they are going to talk more about it at a session and maybe even in a keynote


> This can be implemented with any Database, even with current MongoDBs

Is it really all client side? How could they do things like substring matching without sending the entire index back and forth to the client? The graphic seems to show the query being executed solely on the server (although graphics often lie).


Perhaps encrypted trigrams (or some such thing) are sent during insert and search.

Then it's just a matter of counting matching trigrams/chunks. The server doesn't need to know how to read the trigrams.
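
A rough sketch of that guess, assuming each trigram is turned into an opaque token with a client-held HMAC key. This is speculation about one possible design, not MongoDB's documented algorithm, and the key, document ids, and function names are invented for the example:

```python
import hashlib
import hmac

KEY = b"client-side-secret"   # hypothetical; never sent to the server

def trigram_tokens(text: str) -> set:
    """HMAC each 3-character window so the server only sees opaque tokens."""
    text = text.lower()
    grams = {text[i:i + 3] for i in range(len(text) - 2)}
    return {hmac.new(KEY, g.encode(), hashlib.sha256).hexdigest() for g in grams}

# Insert: the client sends a token set alongside the separately encrypted record.
index = {
    "doc1": trigram_tokens("alice@example.com"),
    "doc2": trigram_tokens("bob@sample.org"),
}

def server_search(query_tokens: set) -> list:
    """Server side: return docs containing every query token; no key needed."""
    return sorted(d for d, toks in index.items() if query_tokens <= toks)

candidates = server_search(trigram_tokens("example"))
```

This version requires every query trigram to be present rather than counting partial overlaps. Either way the result is only a candidate list: trigrams can match without being contiguous, so the client still has to decrypt and verify, and the server learns which documents share trigrams.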


We use Mongoose; for sensitive data we have a wrapper around the pre('save') hook that encrypts it before sending data to the downstream db. Feels like MongoDB implemented that in more elegant, structured code.


I call bullshit.

So let me get this right - it's encrypted but you can search prefix and suffix?

So all the attacker has to do is do it one letter at a time, see if it starts with A, B, C, once they figure that out, go to the next letter and so on. (I presume that the DB is not supposed to be trusted since they make such a big fuss about only being decryptable on the client side)
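
The one-letter-at-a-time attack would look roughly like this, assuming the attacker can issue arbitrary prefix queries and observe whether anything matched. `prefix_matches` is a hypothetical oracle for illustration, and the alphabet is limited to lowercase letters to keep it short:

```python
import string

SECRET = "hunter"   # field value the attacker wants to learn

def prefix_matches(prefix: str) -> bool:
    """Hypothetical oracle: does the hidden field start with this prefix?"""
    return SECRET.startswith(prefix)

def recover(max_len: int = 32):
    """Extend the known prefix one letter at a time; returns (value, queries)."""
    known, queries = "", 0
    for _ in range(max_len):
        for ch in string.ascii_lowercase:
            queries += 1
            if prefix_matches(known + ch):
                known += ch
                break
        else:
            # No letter extends the prefix, so the whole value is known.
            return known, queries
    return known, queries

value, queries = recover()
```

The cost is at most 26 queries per character plus one final failed round, so the attack is cheap whenever forging queries is possible.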

Also there doesn't seem to be a whitepaper detailing algorithms or their threat model. Bitcoin scams try harder than this.


The use case you're outlining is someone already has access to the database. They can just do a find() in that case and get everything, no query required. You're basically describing an lz77 SSL hack that's like 20 years old, I'm pretty sure they would think of this.

The use case here is just "advanced encryption at rest". Encrypting at rest is one thing, but this means people are less likely to see PII by accident, for example.


That's not what their blog post says. To quote:

"Queryable Encryption implements a fast, searchable scheme that allows the server to process queries on fully encrypted data, without knowing anything about the data. The data and the query itself remain encrypted at all times on the server."

They are strongly implying that someone with access to the database should not be able to decrypt the data. According to their blog post that seems to be the entire value proposition compared to what they describe as traditional encryption at rest.


To me this is not what it means. To me it just means I can autocomplete emails etc while not storing the raw, unencrypted email value on the server.


It’s already been mentioned that “Queryable Encryption was designed by MongoDB’s Advanced Cryptography Research Group, headed by Seny Kamara and Tarik Moataz" - are you calling bullshit on their work? What are your qualifications?


So long as whatever system they designed has not been published and reviewed by independent experts, then yes. I don't have to be an expert in this space to recognize what the norms are for making new production-ready cryptosystems, and that this doesn't remotely meet them.

Designing secure cryptosystems is hard. Experts fail at it all the time. The lack of technical details is a major red flag.

Not to mention the distinct possibility that even if this group made a secure system, the mongodb marketing dept may very well be misrepresenting its security/limitations.


If it is going to the likes of aws kms every time, it will blow budgets


can this be done in postgres via client or via server? I found it really nice


seriously did not think we would see homomorphic encryption productized for a few more years. pretty impressive!


> Some of the existing tools, such as homomorphic encryption or secure enclaves have performance unsuited to scalable encrypted search, require proprietary hardware, or have uncertain security properties.

I don't think this is exactly homomorphic. I hope they put out a whitepaper so researchers can properly evaluate its security.


Nice catch, I was scanning for homomorphic encryption, but missed this. Have no idea how else they would implement this.


Homomorphic Encryption is available at large scale today for limited use cases.

See the MuchPIR project (https://github.com/ReverseControl/MuchPIR) which implements Information-Theoretic Private Information Retrieval (IT-PIR) in Postgresql; In addition to the demo there is a high performance version available for commercial use.


It's not homomorphic but "structural encryption". Less useful than HE but faster.


Correct. It's not homomorphic encryption, but rather more like TDE (Transparent Data Encryption) except that MongoDB service isn't decrypting the data. This is essentially client-side encryption (at the driver) and without server-side decryption.


Faster has a usefulness all its own


Absolutely! FHE isn't practical for most applications.


Homomorphic encryption allows you to modify the encrypted data without decrypting it or even knowing the content. I don’t think this is homomorphic encryption.

If they are able to do this without decrypting the data then I think you could describe this as a somewhat weak encryption that exposes some data attributes as queryable. You could not implement this with strong encryption without at least decrypting for indexing.



