
> Do you have a thumbnail of every photo client side

In the happy path the files/thumbnails are indexed before they are uploaded. But we are designing a framework that will pull files/thumbnails for indexing if they are unindexed or indexed by older models.
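
Roughly, that rule is just a version check (a minimal sketch; the names and version counter here are illustrative, not our actual schema):

    from dataclasses import dataclass
    from typing import Optional

    CURRENT_MODEL_VERSION = 3  # hypothetical version of the shipped model

    @dataclass
    class PhotoItem:
        thumbnail: bytes
        index_version: Optional[int] = None  # None = never indexed

    def needs_indexing(item: PhotoItem) -> bool:
        # Pull anything that was never indexed, or that was indexed
        # by an older model than the one currently shipped.
        return (item.index_version is None
                or item.index_version < CURRENT_MODEL_VERSION)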

> how do you do this in a privacy preserving way

Our accuracy will not match that offered by services that index your data on their servers. But there's a trade-off between user experience and privacy here, and we are hopeful that ente will be a viable option for an audience that is willing to sacrifice a bit of one for a lot of the other.




As someone who has worked on systems like these let me translate:

“Your stuff will be private, but in return the accuracy will be so bad that the UX is gonna suck!”

That’s the key piece people miss when they wanna do anything with ML… that it’s a different problem compared to writing code, because it’s not about the code anymore, it’s about having great training data!


Apple Photos seems to be using just Core ML[1] for on-device recognition and it does a pretty good job. As for Android, we plan to use tflite, but the accuracy is yet to be measured. And if customers do install our desktop app, we will be able to improve the indexes by re-indexing data with the extra bit of compute available.
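
To give a rough idea, on-device indexing boils down to one forward pass per thumbnail. A minimal sketch with tflite_runtime (the model file and input shape are placeholders, not what we ship):

    import numpy as np
    from tflite_runtime.interpreter import Interpreter  # on-device TFLite runtime

    # Placeholder model; assumes a 224x224 RGB float input and an embedding output.
    interp = Interpreter(model_path="mobilenet_embedder.tflite")
    interp.allocate_tensors()
    inp = interp.get_input_details()[0]
    out = interp.get_output_details()[0]

    def embed(image: np.ndarray) -> np.ndarray:
        # One forward pass per thumbnail; the search index stores this
        # embedding, never the photo itself.
        interp.set_tensor(inp["index"], image[np.newaxis].astype(np.float32))
        interp.invoke()
        return interp.get_tensor(out["index"])[0]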

We don't feel that the entire UX of a photo storage app will "suck" because of a reduced accuracy in search results, and we think that for some of us the reduced accuracy might not be a deal breaker.

[1]: https://developer.apple.com/documentation/coreml


Up until recently I’ve used Apple Photos happily, since it provided a good combination of convenience plus the privacy of on-device recognition. You have a compelling product if you can convince customers you are as reliable as Apple and more trustworthy. You do face the disadvantage of not being the default option for iOS/macOS, but that should be balanced by being available cross-platform on Android, Linux, and Windows.


Core ML and TFlite are just tools for running ML models. Generating the models is the hard part, and that is what encryption will make more difficult.


We will resort to models that are available in the public domain.


Bingo!


To be honest, that wasn't a concern with my question. I think most people on HN understand this aspect. My question was more about how you improve your models when you don't have the same feedback mechanisms as non-privacy-preserving apps. Google can look at your photos, see which photos fail, and collect statistics on those failures. In a privacy-preserving version you won't be able to do this. Sure, you can on an internal dataset, but then there are lots of questions about that dataset's bias and whether it is representative of the real world. I mean, how many people think ImageNet is representative of real-world images? A surprising number.


As someone else who works on systems like these, I agree training data is the whole problem. However, you can use some techniques like homomorphic encryption and gradient pooling to collect training data from client code while maintaining end-to-end encryption. It's hard, but it's not impossible.
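
A toy sketch of the masking idea behind that kind of gradient pooling (secure aggregation in the style of Bonawitz et al. 2017; key agreement and dropout handling are omitted, so treat this as an illustration, not a protocol):

    import hashlib

    P = 2**61 - 1  # arithmetic mod a large prime, so masks cancel exactly

    def mask_stream(seed: bytes, n: int):
        # Both peers derive the same pseudorandom values from a shared seed.
        return [int.from_bytes(hashlib.sha256(seed + i.to_bytes(4, "big")).digest()[:8], "big") % P
                for i in range(n)]

    def masked_gradient(my_id, grad_fixed_point, pairwise_seeds):
        # grad_fixed_point: the gradient, already quantized to integers.
        # pairwise_seeds: {peer_id: shared_seed}. The lower id adds each
        # mask and the higher id subtracts it, so the server's sum over
        # all clients leaves only sum(grad_i) mod P -- no single gradient.
        masked = [g % P for g in grad_fixed_point]
        for peer, seed in pairwise_seeds.items():
            sign = 1 if my_id < peer else -1
            for j, m in enumerate(mask_stream(seed, len(masked))):
                masked[j] = (masked[j] + sign * m) % P
        return masked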


Really? Have we had a revolution in homomorphic encryption such that it can be used for anything other than 1-million-times-slower proofs-of-concept?

I know IBM has released something lately, but given the source...

Does anyone use HE for the type of ML application you are describing?


So I guess there is more to the question that I'm asking.

> Our accuracy will not match that offered by services that index your data on their servers. But there's a trade-off between user experience and privacy here,

I think most people here understand that[0]. We are on Hacker News, after all, and not Reddit or a more general public place. The concern isn't that you are worse. The concern is that your product has to advance and get better over time. That mechanism is unclear and potentially concerning, and your answer to it is also your answer to how you ensure continued privacy.

You talk about the "pull files/thumbnails for indexing" framework, and this is what is most concerning to me and at the heart of my original question. How are you collecting those photos for _your_ training set? Obviously this isn't just ImageNet (dear god I hope not). Are you creating your own JFT-300M? Where are those photos being sourced from? What's the bias in that dataset? Obviously there are questions about the model too (CNNs and Transformers have different types of biases and see images differently). But that's a bigger question of training methods, and that gets complicated and nuanced fast. Obviously we know there is going to be some distillation going on.

There are a lot of concerns here, and questions that won't really get asked of people who aren't pushing privacy-based apps. But the biggest question is how you get feedback into your model and improve it. Non-privacy-preserving apps are easier in this respect because you know what (real-world) examples you're failing on. But privacy-preserving methods don't have this feedback mechanism. We know homomorphic encryption isn't there yet, and we know there are concerns with federated learning (images can be recreated from gradients). So the question is: how are you going to improve your model in a privacy-preserving way?

[0] I think people also understand that on-device NNs are going to be worse than server-side NNs, since there's a huge difference in parameter count and throughput between them, and phone hardware can only do so much.


> how are you going to improve your model in a privacy-preserving way

We will not improve our models with the help of user data and will use only pre-trained models that are available in the public domain.


This is one of your best replies in the whole thread.

Yes to this. Prove it as well.


Why is it such a great reply? They didn't really answer my question.


I liked the clarity of the response. Public models, not user data, seems a clear answer to your question?


Not really. In fact it might suggest something I'm specifically more worried about. Datasets that we use in research aren't really appropriate in production. They have a lot of biases that we don't much care about in research but that matter in production, where they can also get you into a lot of political and cultural trouble. So really, if they are going to just use public datasets and not create their own, then I expect substantially lower performance and potential trouble ahead, and I'm concerned about who is running their machine learning operations.


Appreciate the detail here. Given your relevant experience sounds like something that the devs need to address.


Being in the ML community I have a lot of criticisms of it. There are far too many people, especially in production, who think "just throw a deep neural net at it and it'll work." There is far more to it than that. We see a lot of this[0].

[0] https://news.ycombinator.com/item?id=28252634


Wow fascinating. What do you ideally want to see in terms of datasets enabled by user data?

Having vendors vacuum up my data is sub-optimal from a privacy/ownership standpoint. I'm curious how to enable models without giving away my data. Open source models owned by society? Numerai style training (that I don't understand) https://numer.ai/ ?


Datasets are actually pretty hard to create. You can see several papers specifically studying ImageNet[0], including some on fairness and how labels matter. There's also Google's famous private JFT-300M dataset[1]. JFT was specifically made with heavy tails in the distribution to help study these areas, which is exactly the problem we're interested in here and one that is not solved in ML. Even with more uniform datasets like CIFAR there are still many features that are noisy in the latent space.

This is often one of the issues with doing facial recognition, and why there are issues with people with darker skin. Even if you have the same number of dark-skinned people as light-skinned, you may be ignoring the fact that cameras often do not have high dynamic ranges, and so albedo and that dynamic range play a bigger role than simply "1M white people and 1M black people". There are tons of effects like this that add up quickly (this is just an easy-to-understand example and one closer to the public discourse). You can think back to how Google's image search at one point showed black people if you searched "gorilla". On one hand you can think "oh, got a dark-colored humanoid", or you can think "oh no... dear god...". That's not a mistake you want to make, even if we understand why the model made it. It is also hard to find these mistakes, especially because their specifics aren't shared universally across cultures; this mistake has to do with historical context.
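
As an illustration of why label balance alone isn't enough, here's a toy audit (my own sketch, using Pillow) that checks how much of each image lives in the crushed or clipped ends of the camera's range:

    import numpy as np
    from PIL import Image

    def exposure_stats(path):
        # Where limited dynamic range crushes shadows or blows highlights,
        # the pixels carry little discriminative detail, no matter how
        # balanced the label counts are.
        luma = np.asarray(Image.open(path).convert("L"), dtype=np.float32) / 255.0
        return {
            "mean_luma": float(luma.mean()),
            "pct_crushed": float((luma < 0.05).mean()),  # near-black pixels
            "pct_clipped": float((luma > 0.95).mean()),  # blown highlights
        }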

This is still an unsolved problem in ML. Not only do we have dataset biases (as discussed above) but models can also exaggerate these biases. So even if you get a perfectly distributed dataset your model can still introduce problems.

But in either case, we don't have the same concerns in research as we have in production. While there are people researching these topics, most of us are still just trying to get good at dealing with large data (and tails) in the first place. Right now the popular paradigm is "throw more data at the model." There are nuances and opinions about why this may not be the best strategy and why we should be focusing on other aspects (opinions being key here).

Either way, "using publicly available datasets" is an answer that suggests 1) they might not understand these issues and 2) the model is going to have a ton of bias, because they're just using off-the-shelf models. I want some confidence that these people actually understand ML instead of throwing a neural net at the problem and hitting go.

> I'm curious how to enable models without giving away my data.

Our best guess right now is homomorphic encryption, but it is really slow and not as accurate. There's federated learning, but this has issues too. Remember, we can often reconstruct training images from the gradients shared during training[2]. You'll see in this reference that while the reconstructions aren't perfect, they are more than satisfactory. So right now we should probably rule out federated learning.
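
To make "really slow" concrete, here's a toy of the additive flavor of HE using the python-paillier library (assuming it's installed); every ciphertext operation costs orders of magnitude more than the plaintext add it replaces:

    from phe import paillier  # python-paillier: additively homomorphic

    pub, priv = paillier.generate_paillier_keypair(n_length=2048)

    updates = [0.12, -0.40, 0.31]              # per-client values
    ciphertexts = [pub.encrypt(x) for x in updates]

    total = ciphertexts[0]
    for c in ciphertexts[1:]:
        total = total + c                      # addition on ciphertexts

    print(priv.decrypt(total))                 # ~0.03; server never saw the inputs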

> Open source models owned by society?

Actually, models aren't the big issue. Google and Facebook have no problem sharing their models because that isn't their secret sauce. The secret sauce is the data (like Google's proprietary JFT-300M) and the training methods (though most of the training methods are public as well, few are able to actually reproduce them without millions of dollars in compute).

I hope this accurately answers your questions and further expands on the reasoning behind my concerns (and specifically why I don't think the responses to me are sufficient).

[0] https://image-net.org/about.php

[1] https://arxiv.org/abs/1707.02968 (personally it bugs me that this dataset is proprietary and used in their research. Considering how datasets can allow for gaming the system I think this is harmful to the research space. We shouldn't have to just trust them. I don't think Google is being nefarious, but that's 300M images and mistakes are pretty easy to make).

[2] https://arxiv.org/abs/2003.14053


godelski, I really appreciate such a thoughtful response to my curiosity.

Looking at this with a better understanding of the problem, I wonder what features I really want for my own photo library. Thinking of iOS Photos: matching people together seems hard, but grouping photos by GPS location or date is trivial. So we have to get clear on what features are important for home photo libraries.

I can now see how the idea of "use public libraries = solution" falls short. It neither presents a viable solution nor demonstrates rigorous understanding.


Hey, that's what HN is about. You've got experts in very specific niches, and we should be able to talk to each other in detail, right? That's the advantage of this place as opposed to somewhere like Reddit. Though as we expand in size we face similar issues.

These are good points about GPS and other metadata. I didn't really think about that when thinking about this problem, but every album I create is pretty much a combination of GPS- and time-based grouping (though I create these with friends). But I think you're right in suggesting that there are likely _simple_ ways to group some things that aren't currently being done.
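
For what it's worth, the metadata-only grouping really is a few lines (a sketch that assumes timestamps and GPS have already been read out of EXIF):

    from collections import defaultdict

    # Hypothetical records: (path, datetime_taken, lat, lon). A day bucket
    # plus a coarse GPS grid (~1 km cells) yields rough "event" albums
    # with no ML at all.
    def group_events(photos, cell_deg=0.01):
        albums = defaultdict(list)
        for path, taken, lat, lon in photos:
            key = (taken.date(), round(lat / cell_deg), round(lon / cell_deg))
            albums[key].append(path)
        return albums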

> I can now see how the idea of "use public libraries = solution" falls short. It neither presents a viable solution nor demonstrates rigorous understanding.

ML is hard, but everyone sells it as easy. Then again, if it were easy, why would Google and Facebook pay such high rates for researchers? There are a lot of people in this space, and so it is noisy. But I think if you have a pretty strong math background you start to be able to pick out the signal from the noise and see that there is a lot more to the research than getting SOTA results on benchmark datasets.



