IBM Fully Homomorphic Encryption Toolkit for Linux (github.com/ibm)
152 points by 0xedb on Aug 2, 2020 | 36 comments



Hi Hackers! This is Eli (one of the authors of the toolkit). I wanted to make sure you all know you can check out the code. It is freely available on GitHub, as linked by the OP. The press this weekend in places like Ars talks about trials, which we did and which were awesome, but the real story right now is that the toolkits are out there for anyone to get access to the tech. I love that it was shared here. Several posts on Ycombinator have come up as a result.

I just wanted to say that I think we packaged some cool demos in the toolkit. One is the privacy-preserving search we debuted in the macOS toolkit we put out a few weeks ago, and this one also has fully encrypted neural network inference over credit card fraud data. If you like encryption, or like the idea of encrypted machine learning, check it out! We built all the special dependencies for you, along with an integrated IDE setup, so you can run the examples trivially. The encrypted ML example also uses a brand new, fresh-out-of-the-IBM-research-kitchen encrypted machine learning library that makes it work.

This stuff is not fiction; it is real, and you can run it today if you want! Our toolkit is based on Docker and comes in Ubuntu, Fedora, and CentOS flavors. You can even pull the Docker images from Docker Hub. If you want to see more of this effort, show us some love on GitHub and Docker Hub by smashing that star button! Instructions are in the readme. Most people who know Docker can get up and running in less than 10 minutes. https://github.com/IBM/fhe-toolkit-linux/.

Monitoring the entirety of the internet for good questions and comments is not one of my superpowers. If anyone has questions, get in touch with us on Slack directly. The development team is there to help. Questions are great; we are trying to put together an FAQ. Hit us up on Slack here: https://app.slack.com/client/T0133ARBGBV#/. We want your feedback, questions, and ideas to help spread the word.

P.S. Thanks to users Darkstryder and throw0101a, who commented below! You gave a nice explanation for KaiserPro's question and shared some nice links with this community!


The story yesterday said you support addition, multiplication, and enough for Turing completeness.

Surely, that means you also support equality tests. With that, it’s easy to build a lookup table, and the whole thing devolves to a glorified Caesar cipher. (With a permutation function instead of a rotation.)

What security guarantees does this library provide? What’s the attacker model? I see nothing about this on the front page of your github repo, or in the press releases.

Edit: For instance, can it tolerate chosen plaintext attacks? In a naive scheme:

If an attacker can get the ciphertext for "1", then they can compute 1+1=2, giving them the ciphertext for 2, and then, inductively, all the natural numbers.


Not the author but hopefully I can shed some light on security aspects based on my current understanding.

The premise of FHE is that you send both data and code, encrypted, to a third-party system for execution, and you can assume that an attacker has control of that system. The attacker will see a series of seemingly random steps that mutate seemingly random data and send back an encrypted response that can only be verified by the client. You don't have the answer in advance, but you have a way of validating that the calculation was performed correctly as asked, so the answer can be trusted. A statistical attack can be avoided since the client has the option to rotate the encryption keys without the third party's knowledge, although I'm not an expert, so any mathematical comment is welcome.

My knowledge here is limited, but the short summary is that, if done correctly, FHE just looks like random execution on random data, and that's what eliminates most attacks.


Encryptions are randomized, not deterministic.
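To make that concrete, here is a tiny sketch using the third-party phe (python-paillier) package, which is only additively homomorphic (not FHE, and nothing to do with IBM's toolkit) but illustrates the same semantic-security property: encrypting the same value twice gives different ciphertexts, so the codebook attack described above has nothing stable to tabulate.

    # Sketch with the third-party `phe` (python-paillier) package: additively
    # homomorphic only, not FHE and not IBM's toolkit, used here just to show
    # that encryption is randomized.
    from phe import paillier

    public_key, private_key = paillier.generate_paillier_keypair()

    c1 = public_key.encrypt(1)
    c2 = public_key.encrypt(1)

    # Fresh randomness per encryption: the raw ciphertexts differ even though
    # both decrypt to 1, so a ciphertext-to-plaintext lookup table is useless.
    print(c1.ciphertext() != c2.ciphertext())                 # True
    print(private_key.decrypt(c1), private_key.decrypt(c2))   # 1 1

    # Homomorphic addition still works on the randomized ciphertexts.
    print(private_key.decrypt(c1 + c2))                       # 2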


How would you compare this toolkit to something like SEAL? Are there particular selling points or things that I should be aware of as a potential user?


Hi Eli, is your toolkit open source and free, so that I may use it and hack on it? Is it a service, or could I use it in an air-gapped network?


The toolkits are absolutely free to download, use, and modify. When I say free, I mean both gratis (no cost) and that the code IBM provided is MIT licensed. We would love to see community contributions. Once downloaded, you do not need network connectivity or anything else to use the demos or play with the code!


What kind of operations can one actually do on the encrypted data? I'm struggling to understand what the use cases are

(I saw the one about machine learning, but that's broad and vague.)


To give an answer a bit more technical than the others: theoretically, an encryption scheme is considered fully homomorphic if it can support an arbitrary number of additions and multiplications on its inputs (bits or integers). In a lot of fully homomorphic encryption (FHE) schemes, bits are represented by big matrices that are part random and part deterministic. You can add and multiply these matrices, and doing so will naturally add and multiply the encrypted bit inside as well, but you need the secret key to decrypt the matrix and learn whether it contains a one or a zero.

From these operations (add and mul) you can create a NAND gate using the formula 1 - A * B (assuming A and B are matrices containing encrypted bits) and build any arbitrary boolean circuit from there, as in the sketch below. These circuits work on the encrypted data without the need to decrypt it.

(There is an additional step called bootstrapping that most schemes require in order to allow circuits of unlimited depth, but I won't get into that in such a short answer.)
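For intuition, here is a minimal plaintext sketch of that construction, operating on raw 0/1 bits rather than the ciphertexts a real FHE scheme would use: only addition and multiplication appear, NAND is 1 - A*B, and every other gate falls out of NAND.

    # Minimal plaintext sketch: build gates from addition and multiplication
    # on 0/1 values. A real FHE scheme applies the same arithmetic to
    # ciphertexts (e.g. the matrices described above) instead of raw bits.

    def NAND(a, b):
        return 1 - a * b          # the formula from the comment above

    # NAND is universal, so add + mul suffice for arbitrary boolean circuits.
    def NOT(a):    return NAND(a, a)
    def AND(a, b): return NOT(NAND(a, b))
    def OR(a, b):  return NAND(NOT(a), NOT(b))
    def XOR(a, b): return OR(AND(a, NOT(b)), AND(NOT(a), b))

    for a in (0, 1):
        for b in (0, 1):
            print(a, b, "->", NAND(a, b), AND(a, b), OR(a, b), XOR(a, b))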


Excellent, thank you!

Seeing that one can NAND things, everything makes a lot more sense.


Perhaps:

> In this paper we present an electronic voting system based on homomorphic encryption to ensure privacy, confidentiality in the voting. Our proposal offers all the advantages of the multiplicatively homomorphic encryption cryptosystems. The proposed voting scheme is suitable for multi-candidate elections as well as for elections in which contains neutral votes.

* https://ieeexplore.ieee.org/document/7492759

Others:

> * Securing Data Stored in the Cloud. Using homomorphic encryption, you can secure the data that you store in the cloud while also retaining the ability to calculate and search ciphered information that you can later decrypt without compromising the integrity of the data as a whole.

> * Enabling Data Analytics in Regulated Industries. Homomorphic encryption allows data to be encrypted and outsourced to commercial cloud environments for research and data-sharing purposes while protecting user or patient data privacy. […]

* https://www.venafi.com/blog/homomorphic-encryption-what-it-a...

White paper from a Microsoft crypto conference:

* http://homomorphicencryption.org/white_papers/applications_h...


It has the potential to re-invent how we use compute services. A company can provide a service X that is guaranteed to be unable to see the data you give it.

Personally, I think it is most applicable to the corporate "giants" of the world: now they can feel more free to offer each other services that operate on data that would normally be considered too risky to share, like financial data or private customer data. When the data is secured through FHE, there's no risk in letting anyone else run computations on it in order to provide you with value.


Sure, but think of this: most people will say that listening in on a phone conversation is 'spying' and not right, and it turns out that listening is hard, due to poor connections, accents, and other noisy issues. BUT at scale, simply collecting and mapping the patterns of connection/duration over time is much more informative than most people imagine, and few object to that. Surprise.

Run the scenario you describe, with people being the intrusive, rent-collecting bullies that they can be, and do you think the sole-sourcing of "giants" will be a tame and orderly affair?

There are at least two directions to take that thought: one is to ask what previous groups did in past times that created positive results; the other is a tech game-theory approach to decentralization, mixed with special-purpose business entities and real checks and balances.

Personally, I assume that business cannot be trusted over time to keep the best outcome for the consumer of services in mind.


In theory, any decidable function (any function that is guaranteed to terminate on any input). You need the control flow to be independent of the inputs, otherwise it would not really be "encrypted" (input-dependent control flow would reveal something about the inputs), but it is always possible to transform a (decision) function into its "input oblivious" version without too much of a performance hit. In the case of FHE, the "instruction set" is just addition and multiplication, i.e. the computation is expressed as a polynomial.

Practicality is a problem. You cannot have random access without performing some kind of linear scan over an array (all of which is expressed as a polynomial), so most software will perform extremely poorly. The actual cryptosystems are reasonable in terms of performance, but still carry a significant overhead. Even for ML tasks you are probably looking at 100-1000x performance overhead.
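As a toy illustration of that "input oblivious" transformation (plaintext stand-ins only, not any particular FHE library), compare a branchy ReLU with a version whose control flow is fixed and whose data dependence lives entirely in arithmetic:

    # Toy sketch of the "input oblivious" transformation on plaintext
    # stand-ins. The branchy version has input-dependent control flow; the
    # oblivious version always performs the same operations and expresses the
    # selection as a polynomial, which is the shape an FHE scheme can evaluate.

    def relu_branchy(x):
        if x > 0:               # branch depends on the (secret) input
            return x
        return 0

    def relu_oblivious(x, is_positive):
        # is_positive is a 0/1 indicator; under FHE it would itself be an
        # encrypted bit produced by a comparison sub-circuit.
        return is_positive * x + (1 - is_positive) * 0

    for x in (-3, 0, 5):
        assert relu_branchy(x) == relu_oblivious(x, 1 if x > 0 else 0)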


Can the performance overhead be latency-masked down the line so you don't have as much overhead in practice? For example, compression of slow storage media like tape technically adds overhead, but because the slowest part of the chain is usually reading/writing the tape, in practice you either don't experience any actual overhead or even often see a small speed increase, unless you can massively parallelize media access with multiple heads, at which point the CPU/RAM becomes the bottleneck again.


It depends on the task.

Some tasks intrinsically require a lot of random access to a large memory, meaning a memory larger than the CPU's fast on-chip memory, and there is no way to buffer enough state inside the small memory of the CPU to make the random access non-random. (Note, random access here really means data-dependent addresses that don't follow patterns you can predict without doing the computation; they are not really random.)

For those tasks, address traffic from the CPU to the memory reveals information even if the data is encrypted, and this is the reason why it is said to require a linear scan of memory to perform random access while hiding the address of interest.

A more extreme version of this occurs with scanning large databases without revealing what you're looking for.

I'm not sure if the full linear scan is really essential, or if there's a way of representing data in memory (or data store) differently that relaxes the full scan requirement. E.g. by having the memory itself do some polynomial processing.


Thanks for helping the conversation along, jlokier! Since you mentioned database concepts in your response, you might be interested in checking out the database example in our toolkit. It is not really a literal database but rather an in-memory key-value search, as that is easier to code and demonstrate. One of the nuances of FHE programming is that a full table scan is needed for cases like our database search, because fundamentally we can't test and branch to bail out early when we find something we want. Key comparisons against an encrypted search key are obviously not literal matches: each key has its own polynomial representation with its own error added, to keep terms from encrypting to the same value every time, so there is no way to compare for short-circuit evaluation. We end up doing an operation on each value and effectively summing up those partial results in a way that lets us determine whether a match was found (sketched below). If anyone is interested, let us know on our public Slack channel and someone can answer more technical details there.
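To sketch that access pattern in plain Python on unencrypted values (this is not the toolkit's actual code): every row is visited, the key comparison becomes a 0/1 indicator, and the result is accumulated arithmetically, so nothing ever bails out early.

    # Plaintext sketch of the scan-and-sum pattern described above; not the
    # toolkit's actual code. Every row is touched, the key comparison becomes
    # a 0/1 indicator, and the answer is accumulated arithmetically, so there
    # is no early exit that could leak where the match was.

    def equals_indicator(a, b):
        # Stand-in for an equality sub-circuit; under FHE this would be built
        # from additions and multiplications over encrypted values.
        return 1 if a == b else 0

    def oblivious_lookup(table, query_key):
        result = 0
        for key, value in table:            # always scans the whole table
            result += equals_indicator(key, query_key) * value
        return result

    table = [("alice", 10), ("bob", 20), ("carol", 30)]
    print(oblivious_lookup(table, "bob"))      # 20
    print(oblivious_lookup(table, "mallory"))  # 0 (no match)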

You guys rock. Great questions and comments on this forum.


Thanks for the enthusiasm and insights - and the toolkit of course! :-)

> fundamentally we can't test and branch to bail out early when we find something we want

I think this is derived from assumptions that are not necessarily valid, with a suitable representation in memory.

In other words I think you can bail early without revealing too much, with a suitable data encoding.

But after writing out a sketch of a mechanism in an HN comment, I realised the idea may be patentable, and FHE is probably an attractor for that sort of thing.

(I also realise I may be full of nonsense, but you've got to start somewhere :-)

I'm pretty open about ideas. An open-source kind of person. But then I remembered: I posted something before and found someone else had built on the idea and patented it (citing my post), stopping me from developing the idea. Yes, that happened, albeit a long time ago :-/

Hmm, uncomfortable.


"In other words I think you can bail early without revealing too much, with a suitable data encoding."

Maybe, but whatever that is would not be encryption. An FHE scheme is still a public key encryption scheme, so it should not be possible to compute any predicate over the message without having access to the private key. The homomorphic property allows you to compute a new ciphertext that will decrypt to the result of a computation, but that new ciphertext will still completely hide the plaintext.

Basically the reason you cannot "bail out early" is that FHE does not allow you to see an indicator value to tell you when to do so -- any such value would itself be encrypted. If it were not, then it would reveal "something" about the inputs; coupled with the fact that the inputs can always be modified (by e.g. changing the function that is being computed), that would be a complete break.

The only good technique we have right now for addressing this problem is to use some kind of interaction between multiple parties, which allows sublinear random access and can allow branching (and thus early loop terminations etc.). There are cases where that works well and there are many real-world use-cases of those techniques (I work on this for a large tech company). The tradeoff is that you need to do a lot of communication, which is typically more expensive than computation, and if too many parties are offline the computation will not proceed (in the two party case nobody can be offline). You also need to trust the parties not to collude with each other, which is sometimes a concern in practice.


> Basically the reason you cannot "bail out early" is that FHE does not allow you to see an indicator value to tell you when to do so -- any such value would itself be encrypted. If it were not, then it would reveal "something" about the inputs

It depends what you're doing.

If the idea is to search for a value in an encrypted list to see if it's there, or lookup a value in a key-value store, that's similar to a database query. Unencrypted, that can be performed in a hard-bounded sublinear time, so encrypted, the steps can also be sublinear without revealing content to the processor, if there is a way to calculate the addresses in each step homomorphically, i.e. without the processor knowing the meaning of those addresses, just calculating them from data it sees.

That's where the encoding comes in, storing the data in such a way that knowing individual encrypted addresses doesn't reveal much about the data, and depending on what properties are important to you, also in such a way that access pattern structure such as repetition is also hidden because the addresses for encrypted access are not repetitive. This requires the encoding to do some level of data diffusion.

Branching is also not ruled out if the processor can't determine whether a branch has occurred. That means although the processor cannot determine when a search is short-circuited, the result of the search may still be used "early" by later stages in the homomorphic calculations, provided this does not look like a branch to the processor. Think of predicated instructions.

> The only good technique we have right now for addressing this problem is to use some kind of interaction between multiple parties,

That's not what I have in mind, but there's some similarity. I imagine each memory having some dedicated processing on board, just to reduce communication overhead during writes, which are heavy due to data diffusion.


> if there is a way to calculate the addresses in each step homomorphically

Calculating the addresses homomorphically means that you will have a ciphertext that decrypts to the address, not an actual address.

> storing the data in such a way that knowing individual encrypted addresses doesn't reveal much about the data

As I pointed out in my earlier reply, with FHE if you reveal anything then you are actually revealing everything, so what you really need to say there is that you do not reveal anything about the data.

Beyond that, what you are describing sounds like oblivious RAM (ORAM), which is a well-studied technique that cannot be applied to FHE. Basically an ORAM has two levels of addresses, the "inner" address for a flat array you are simulating, and the "outer" address that you are actually working with to traverse the data structure. The outer address reveals nothing about the inner address, but to ensure that this is true you have to update some kind of state after every access (otherwise you could use the outer address to determine if two consecutive read operations retrieved the same data). It is also a requirement that the outer addresses are available as plaintext, and as I said, FHE only allows you to compute more ciphertexts.

Fundamentally the challenge with FHE is that the processor is free to compute whatever, as many times as it wants, and so nothing can be revealed except for ciphertexts. Introducing interaction ensures that the computations are limited (by other parties that are not colluding with the attacker), which is why it is possible to have things like ORAM (because you can ensure that the ORAM state is properly updated etc.) and branching. Interaction also makes it possible to talk about "controlled leakage" since you can limit the number of operations that leak data. You are correct that an ORAM can reduce communication among the parties in an MPC protocol, but it is not a silver bullet -- ORAM overhead is non-trivial and asymptotically it will always be worse than computation on plaintexts.


I'm not thinking about ORAM - as you point out, it reveals access patterns.

Thus the reference to data diffusion, i.e. inner words are not stored at a single physical outer address; they are mixed with other data, and the outer addresses used to interact with memory state vary for the same inner address, without the RAM holding any secrets.

The outer addresses don't need to be plaintext, but they do need to satisfy relationships with each other that are algebraically useful yet not discernible without a secret.

> with FHE if you reveal anything then you are actually revealing everything

Is this not subject to cryptographic vs information-theoretic hardness arguments: If I reveal to you 2^-256 bits of data per operation (i.e. astronomically close to zero), we'll call that good enough because you can't do enough computation to matter?


"The outer addresses don't need to be plaintext"

Yes they do, because that is what you will actually be accessing on the physical system. Call it what you want, but at some point you need to have a physical address that you will read from for any computation to proceed.

"If I reveal to you 2^-256 bits of data per operation"

Sure, but that is not going to help you get around the fundamental need to simulate random access by performing linear scans when you are using FHE. Think of it this way: you need to read some physical addresses, somehow, regardless of how you encode your data. If you only revealed a vanishingly small amount of information about the physical addresses to access, then you would not have given the processor enough to do anything more efficient than a linear scan.

Now assume the processor could somehow access fewer locations than a linear scan accesses, which implies that it does not access all physical addresses. The processor can try to query all logical addresses (since it must store that much data anyway it should be able to enumerate them), and choose two such that one accessed a particular physical address and another did not (this must exist, because everything is being accessed -- if somehow such an address could not be found then all queries will be linear scans). The two logical addresses must differ in a particular bit position. What the processor can do is to take a given ciphertext, homomorphically compute its individual bits, and then insert each bit in the position where the two addresses differ -- and then use the access pattern to determine the value of each bit.


(Reminder: I am a maintainer of the toolkit and an IBMer; opinions are my own, etc.) This is a pretty new field, with people just starting to take it seriously for large problems. From what I have seen in experiments, there can be big tradeoffs between latency and throughput, and, like most software, decomposing the problem and efficiently mapping it onto the underlying data structures gets the big performance wins. It is still very easy to do something naively with poor performance. Also, while the performance overhead is "high", it is orders of magnitude better than it was years ago. The overhead in terms of resources is tractable: RAM is inexpensive, and so is CPU time compared to the cost of a single data breach, both financially and in terms of user impact. :)

My final comment on performance is that, the way the code is generally structured, it makes sense to do things in batches in parallel as opposed to doing individual atomic transactional data operations, but I think there's a lot of room to explore in the performance and optimization space.

I think better reference examples, and a common platform to run examples on, will foster a healthy exchange of ideas and will help advance the art a lot. One of my personal goals for the toolkit is to provide people with demos to run and (with community help, or a massive outpouring of interest and/or funds to my benevolent corporate overlord) to help explore these tradeoffs using demos of better-than-trivial complexity. Some computer scientist, software engineer, or applied mathematician might get interested in this subject and get hacking because we tried to make it easier to experiment.

We know we COULD do a lot with this stuff (tons of applications or ways to use FHE), but few ARE doing things with this to solve real problems.

Oh, and if anyone wants, IBM can of course be contacted if you have a business problem you want to discuss, or need corporate education and such on these things. For non-corporate folks, this toolkit gives EVERYONE access to some baseline reference technology. Stepping away from the corporate perspective and speaking as an individual and maintainer of the repo, I hope to put out some more baseline educational materials soon, if the community engages with us so we can understand your questions and better explain things. Reactions to this toolkit and inquiries from potential clients will really help us accelerate this work.

Thanks for the continued interest in the hacker community and I wish we had time to respond to every interesting comment here!


The other thread we had about this stated that arbitrary math is possible, but slow as heck (40 times slower or so). You can either tell it to operate on two encrypted items ("multiply value 2 and 4 from the data") or give either operand as-is ("add 5 to value 3 in the data").

So general-purpose, apparently. Just slow.


Is it really only 40 times slower? I thought it was millions of times slower. If it's only 40 times slower than a modern CPU, then this technology is a lot more interesting than I thought.

From a few minutes of Googling, I can't find any obvious benchmarks.


It depends on what you are computing. In some cases it will be as fast as the insecure version (not counting encryption/decryption time). On the other hand, random access will have to be simulated by a linear scan over an array, which is asymptotically slower. Complex control flow e.g. deeply nested loops with input-dependent side effects can cause the overhead to balloon.


This article[1] has IBM slides which suggest 40-50x computation penalty and 10-20x RAM penalty.

[1]: https://arstechnica.com/gadgets/2020/07/ibm-completes-succes...


Thanks for passing on that reference, @onepointsixC. You and the parent poster might be interested in a webinar we posted on YouTube that talks through those slides, featuring one of the paper authors (Flavio) and yours truly. That slide is about halfway through the video, but the whole thing is worth a watch. https://www.youtube.com/watch?v=W9G1s1t_d80&list=PL0VD16H1q5...

We put up a cryptography playlist on YouTube during the last week, since it felt like people didn't have enough resources to refer to easily. We hope it helps! It includes a video walkthrough of how to get the toolkit running and how to run some demos.


Per the discussion from Friday[1], any function that you can unroll into a (stateless) circuit.

Edit: which, AIUI, means any function where you can a) prove it halts, and b) convert recursion and loops into loops that execute for a constant number of iterations known at compile time (sketched below).

[1] https://news.ycombinator.com/item?id=24007566
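As a toy sketch of point (b), again on plaintext stand-ins rather than any real FHE API: a loop whose trip count depends on the input is rewritten with a constant bound known at compile time, and a 0/1 "active" flag does the work the early exit used to do.

    # Toy sketch of point (b): replace a data-dependent loop with one whose
    # trip count is a compile-time constant so it can be unrolled into a
    # circuit. A 0/1 "active" flag stands in for the early exit; under FHE it
    # would itself be an encrypted comparison result.

    MAX_ITERS = 16                        # bound fixed when building the circuit

    def countdown_steps(x):
        steps = 0
        while x > 0:                      # trip count depends on the input
            x -= 1
            steps += 1
        return steps

    def countdown_steps_circuit(x):
        steps = 0
        for _ in range(MAX_ITERS):        # always runs exactly MAX_ITERS times
            active = 1 if x > 0 else 0    # stand-in for a comparison sub-circuit
            x -= active                   # becomes a no-op once x reaches 0
            steps += active
        return steps

    for x in (0, 3, 10):
        assert countdown_steps(x) == countdown_steps_circuit(x)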


Multiplication is certainly possible (that's why ML works with it).


I don't have a great understanding of the algorithms here, but I have a question.

If I had some data encrypted with these tools, which I made public for multiple entities to process, is it possible for me to allow them to add data to the encrypted system? i.e., is there a way for multiple entities to add more data to the encrypted data (and do calculations with the encrypted contents) while only one party is able to extract the results of the whole resulting collection of data?


It is possible. Encrypting a piece of data requires the public key but not the secret key. Therefore, you can give your public key to all of these entities; they encrypt data and do computation on it, and only you can decrypt the result of the computation using your secret key.
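As a rough sketch of that workflow (using the third-party phe / python-paillier package, which is only additively homomorphic rather than fully homomorphic, and has nothing to do with IBM's toolkit): several parties encrypt under one public key, an untrusted party combines the ciphertexts, and only the secret-key holder can read the result.

    # Sketch of the workflow above with the third-party `phe` (python-paillier)
    # package: additively homomorphic only (not FHE, not IBM's toolkit), but
    # the public-key pattern is the same.
    from phe import paillier

    public_key, private_key = paillier.generate_paillier_keypair()  # you keep private_key

    # Multiple entities contribute encrypted data using only the public key.
    contributions = [public_key.encrypt(v) for v in (10, 25, 7)]

    # An untrusted aggregator combines ciphertexts without seeing any values.
    encrypted_total = contributions[0]
    for c in contributions[1:]:
        encrypted_total = encrypted_total + c

    # Only the holder of the secret key can extract the result.
    print(private_key.decrypt(encrypted_total))   # 42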

I have to say that a related question I have is how to get the opposite: making sure the third party only used the inputs I gave them and did not replace them with their own, and making sure they processed those inputs through the exact program I gave them and not an altered version of it.

I have rough ideas about how one could do something like this, but if anybody has a reference to literature on the topic, that would be great.


My only hope for homomorphic encryption is that it doesn't cause a new era of invasive DRM and anti-circumvention methods.


Would you please elaborate on how you believe this might happen? I can't see how this would enable a "stronger" DRM scheme, as the video/audio/whatever would still have to be decrypted.


That's true. I stand corrected. And having just entered the term 'homomorphic encryption' into Google Patents, it would appear the space is not very active at all.



