I 100% believe that federated learning is going to become the standard approach for many applications. Sending the model to the data instead of sending the data to the model (in the cloud) just makes so much more sense from a privacy and bandwidth perspective, plus you can use the user's computational power instead of your own.
The major problem with the federated model is that it assumes blind trust in the labels for the learning you're trying to do. I find this approach frankly lazy: trying to make up for not having good data (and, importantly, good informational content) by throwing a bigger N at the problem.
I'm not saying there aren't clever ways to do semi-supervised approaches, but if you never look at your data (and the places where your model fails), how can you ever debug your way out?
On one hand, updating ML models without seeing the data sounds like madness. It puts seemingly impossible constraints on QA: meaningful model improvements typically come through careful error analysis, discovering entire classes of errors that call for different approaches, bug fixes, new types of training data, preprocessing, and model architectures. Not from uptraining on random amounts of unknown data.
On the other hand, federated learning is something that's clearly desirable, a compelling pitch. So you could argue these "technical impossibilities" might be overcome with enough cleverness, eyeballs and complex enough ecosystems.
I don't think we're anywhere near that yet, but it's one of those paradigm shifts that has my ears perked up. I'm curious what'll come out of this, and I'm cheering on Andrew Trask and his OpenMined from the sidelines!
Re: not having data to QA the model: you can use the distribution of the generated labels to inform model updates. Alternatively, you can collect labels at the edge through a local interface and use those.
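To make the first suggestion concrete, here's a rough sketch of the kind of check I mean: compare the label distribution a deployed model produces on a client against a reference distribution before folding that client's update into the global model. Names and thresholds are illustrative, not from any particular library:

    import numpy as np

    def label_distribution(predictions, num_classes):
        """Normalized histogram of the class labels a model predicted."""
        counts = np.bincount(np.asarray(predictions), minlength=num_classes)
        return counts / max(counts.sum(), 1)

    def kl_divergence(p, q, eps=1e-8):
        """KL(p || q) with smoothing so empty bins don't blow up."""
        p = (p + eps) / (p + eps).sum()
        q = (q + eps) / (q + eps).sum()
        return float(np.sum(p * np.log(p / q)))

    # reference_dist: label distribution from centrally validated data
    # client_preds:   class indices the deployed model predicted on-device
    def looks_healthy(client_preds, reference_dist, threshold=0.1):
        client_dist = label_distribution(client_preds, len(reference_dist))
        return kl_divergence(client_dist, reference_dist) < threshold

It doesn't give you error analysis, but it at least flags clients whose generated labels drift far from what you validated centrally.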
Thinking about this some more, such federated training will require more careful differentiation between two training modes that have been traditionally clumped together:
1. Model development, improvements and tuning (choosing correct preprocessing, features, architecture, introspection, debugging…). Client-side unsuitable.
2. Training / updating of fixed, well-designed models on new "unseen-by-human-but-otherwise-well-behaved" data. Client-side suitable.
So far, the former has always been the bigger challenge, hence my QA concerns.
But perhaps once your problem is well understood, and the data is known to be identically distributed and well-behaved, the latter becomes increasingly useful. It's definitely an exciting new direction. The OP identifies personal / PII data as the flagship use case, which sounds about right.
I currently have a cohort model in PyTorch (well, Pyro), and we will gather data from 10+ centers during a trial. That sort of collection was difficult to get agreement for, and it will be interesting to see how this sort of thing could help. If the data could stay in the hospitals, I suspect we would get more of it.
I'm currently working with on-site hospital data and have been designing our own in-house federated library on top of Keras to train in aggregate while leaving all data on-site.
I wonder where you draw the line between your private data and "just a model update" (federated learning). E.g. if I analyzed all the model updates from an individual person, wouldn't I get a pretty good picture of exactly the private data you wanted to hide?
Indeed - this is where we will combine the Federated Learning implementation with tricks from Differential Privacy and Secure Aggregation, which can help give formal guarantees on the amount of information about any individual person present within a gradient.
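For the curious, the differential-privacy half of that combination boils down to something like the sketch below: clip each client's update to a fixed norm, then add Gaussian noise calibrated to that bound when aggregating. This is a simplified illustration (assuming flattened update vectors), not OpenMined's actual code, and it omits the privacy accounting that turns the noise scale into a formal guarantee:

    import numpy as np

    def clip_update(update, clip_norm=1.0):
        """Scale a (flattened) client update so its L2 norm is at most clip_norm."""
        norm = np.linalg.norm(update)
        return update * min(1.0, clip_norm / (norm + 1e-12))

    def dp_average(client_updates, clip_norm=1.0, noise_multiplier=1.1, rng=None):
        """Average clipped updates and add Gaussian noise scaled to the
        per-client sensitivity (clip_norm / number of clients)."""
        rng = rng or np.random.default_rng()
        clipped = [clip_update(u, clip_norm) for u in client_updates]
        mean = np.mean(clipped, axis=0)
        sigma = noise_multiplier * clip_norm / len(client_updates)
        return mean + rng.normal(0.0, sigma, size=mean.shape)

Secure aggregation then ensures the server only ever sees the combined result, never any individual client's clipped update.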
Federated learning would be amazing for medical data. It is hard to get data out of hospitals, so training models locally and merging them is a nice feature.
Of course, one has to make sure the training data cannot be inferred from the model.
PySyft will also (soon!) have features for Differential Privacy built in, which will help make sure the model doesn't memorize things (technically, it's an upper bound on memorization).
This is a super interesting idea. How does it work when it comes to creating a better model? It would make sense if you train the model in series, but is there a method to merge(?) the models when they're trained in parallel?
There are many ways to combine results during training in a federated workflow. One method I'm currently using on healthcare imaging data is to simply average each client's weights after every epoch. My server gathers all client updates, averages the weights, and sends the newly calculated weights back to the clients to continue training.
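In code terms, the averaging step looks roughly like this (simplified, assuming each client returns its weights as a list of per-layer NumPy arrays, the format Keras's get_weights()/set_weights() uses; the names here are illustrative, not our actual library's):

    import numpy as np

    def federated_average(client_weight_lists):
        """Element-wise average of each layer's weights across all clients."""
        num_layers = len(client_weight_lists[0])
        return [
            np.mean([weights[i] for weights in client_weight_lists], axis=0)
            for i in range(num_layers)
        ]

    # Server loop, schematically:
    # for epoch in range(num_epochs):
    #     updates = [client.train_locally_and_return_weights() for client in clients]
    #     averaged = federated_average(updates)
    #     for client in clients:
    #         client.load_weights(averaged)   # e.g. Keras model.set_weights(averaged)

This is essentially federated averaging: clients never ship their data, only their weights, and the server's only job is to aggregate and broadcast.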