Hacker News

The major problem with the federated model is that it assumes blind trust in the labels for the learning you're trying to do. Frankly, I find this approach lazy: it tries to make up for not having good data (and, importantly, good informational content) by throwing a bigger N at the problem.

I'm not saying there aren't clever ways to do semi-supervised approaches, but if you never look at your data (and at the places where your model fails), how can you ever debug your way out?




On one hand, updating ML models without seeing the data sounds like madness. It puts seemingly impossible constraints on QA: meaningful model improvements typically come from careful error analysis, discovering entire classes of errors that call for different approaches, bug fixes, new types of training data, preprocessing, and model architectures. Not from uptraining on random amounts of unknown data.

On the other hand, federated learning is something that's clearly desirable, a compelling pitch. So you could argue these "technical impossibilities" might be overcome with enough cleverness, eyeballs and complex enough ecosystems.

I don't think we're anywhere near, but it's one of those paradigm shifts that has my ears perked up. I'm curious what'll come of this, cheering on Andrew Trask and his OpenMined from the sidelines!


I agree - it's a problem worth solving.

Re: not having data to QA the model, you can use the distribution of the generated labels to inform model updates. Alternatively, you can collect labels at the edge using a local edge interface and then use those.
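To make the first idea concrete: here's a minimal sketch of what "using the distribution of generated labels" could look like, comparing the label frequencies in an incoming batch against a trusted baseline with KL divergence and gating the update on a drift cutoff. The label names, the threshold value, and the gating rule are all my own illustrative assumptions, not anything from a particular federated framework.

```python
import math
from collections import Counter

def label_distribution(labels):
    """Normalized frequency of each label."""
    counts = Counter(labels)
    total = len(labels)
    return {k: v / total for k, v in counts.items()}

def kl_divergence(p, q, eps=1e-9):
    """KL(p || q) over the union of label keys, with a small floor
    so missing labels don't divide by zero."""
    keys = set(p) | set(q)
    return sum(
        p.get(k, eps) * math.log(p.get(k, eps) / q.get(k, eps))
        for k in keys
    )

# Reference distribution from a trusted, human-inspected batch.
baseline = label_distribution(["cat", "dog", "cat", "bird", "dog", "cat"])
# Labels generated at the edge for a proposed update (skewed toward "cat").
incoming = label_distribution(["cat", "cat", "cat", "cat", "cat", "dog"])

drift = kl_divergence(incoming, baseline)
DRIFT_THRESHOLD = 0.25  # hypothetical cutoff; would be tuned per problem
accept = drift < DRIFT_THRESHOLD  # here the skewed batch is rejected
```

This obviously only catches distribution-level problems, not individually wrong labels, which is the weaker guarantee the parent comments are worried about.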


Thinking about this some more, such federated training will require more careful differentiation between two training modes that have been traditionally clumped together:

1. Model development, improvements and tuning (choosing correct preprocessing, features, architecture, introspection, debugging…). Client-side unsuitable.

2. Training / updating of fixed, well-designed models on new "unseen-by-human-but-otherwise-well-behaved" data. Client-side suitable.

So far, the former has always been the bigger challenge, hence my QA concerns.

But perhaps once your problem is well understood, and the data is known to be identically distributed and well-behaved, the latter becomes increasingly useful. It's definitely an exciting new direction. The OP identifies personal / PII data as the flagship use case, which sounds about right.
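For the second mode, the standard mechanics are federated averaging: each client trains locally on data the server never sees, and the server combines the resulting weights, weighted by how many examples each client holds. A minimal sketch (plain lists standing in for real weight tensors, which is my simplification):

```python
def fed_avg(client_updates):
    """FedAvg-style combination: average client weight vectors,
    weighted by each client's local example count."""
    total = sum(n for _, n in client_updates)
    dim = len(client_updates[0][0])
    avg = [0.0] * dim
    for weights, n in client_updates:
        for i, w in enumerate(weights):
            avg[i] += w * n / total
    return avg

# Each client contributes (locally trained weights, local example count).
updates = [([1.0, 2.0], 10), ([3.0, 4.0], 30)]
fed_avg(updates)  # -> [2.5, 3.5]
```

Note how this bakes in exactly the assumption discussed above: the weighted average is only sensible if the clients' data is roughly identically distributed and well-behaved.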


You should be able to test the model and reject updates that fall below a threshold.
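One way to sketch that gate: keep a server-side holdout set and only accept a candidate update if it doesn't regress against the current model. The function names, toy models, and margin parameter here are all hypothetical, just to show the shape of the check:

```python
def accuracy(model, holdout):
    """Fraction of holdout examples the model labels correctly."""
    correct = sum(1 for x, y in holdout if model(x) == y)
    return correct / len(holdout)

def gate_update(candidate_model, current_model, holdout, margin=0.0):
    """Accept a federated update only if it does not regress on the
    server-side holdout set (within an optional margin)."""
    return accuracy(candidate_model, holdout) >= accuracy(current_model, holdout) - margin

# Toy example: models are callables mapping input -> label.
holdout = [(0, "even"), (1, "odd"), (2, "even"), (3, "odd")]
current = lambda x: "even" if x % 2 == 0 else "odd"  # perfect on holdout
candidate = lambda x: "even"                         # regressed
gate_update(candidate, current, holdout)  # -> False
```

The catch, per the QA concerns upthread, is that this assumes you have a trustworthy holdout set in the first place, which is exactly what's scarce in the privacy-preserving setting.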




