Hacker News

Why is it such a great reply? They didn't really answer my question.



I liked the clarity of the response. "Public models, not user data" seems like a clear answer to your question?


Not really. In fact, it might suggest something I'm specifically more worried about. Datasets that we use in research aren't really appropriate for production. They have a lot of biases that we don't much care about in research but that you do in production, and that can also get you into a lot of political and cultural trouble. So if they're just going to use public datasets and not create their own, then I expect substantially lower performance and potential trouble ahead, and I'm concerned about who is running their machine learning operations.


Appreciate the detail here. Given your relevant experience, it sounds like something the devs need to address.


Being in the ML community, I have a lot of criticisms of it. There are far too many people, especially in production, who think "just throw a deep neural net at it and it'll work." There is far more to it than that. We see a lot of it.[0]

[0] https://news.ycombinator.com/item?id=28252634


Wow fascinating. What do you ideally want to see in terms of datasets enabled by user data?

Having vendors vacuum up my data is sub-optimal from a privacy/ownership standpoint. I'm curious how to enable models without giving away my data. Open source models owned by society? Numerai style training (that I don't understand) https://numer.ai/ ?


Datasets are actually pretty hard to create. You can see several papers specifically studying ImageNet[0], including some on fairness and how labels matter. There's also Google's famous private JFT-300M dataset[1]. JFT was specifically built with heavy tails in its distribution to better study these areas, which is exactly the problem we're interested in here, and one that is not solved in ML. Even with more uniform datasets like CIFAR, many features are still noisy in the latent space.

This is often one of the issues with facial recognition, and why it has problems with people with darker skin. Even if you have the same number of dark-skinned people as light-skinned people, you may be ignoring the fact that cameras often do not have a high dynamic range, so albedo and that dynamic range play a bigger role than simply "1M white people and 1M black people." There are tons of effects like this that add up quickly (this is just an easy-to-understand example, and one close to public discourse).

You can think back to how Google's image search at one point showed black people if you searched "gorilla." On one hand you can think "oh, it found a dark-colored humanoid," or you can think "oh no... dear god..." That's not a mistake you want to make, even if we understand why the model made it. It is also hard to find these mistakes, especially because their specifics aren't shared universally across cultures; this one depends on historical context.
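To make the dynamic-range point concrete, here's a toy sketch (my own illustration, not from the thread; the exposure number is made up): under photon shot noise, a sensor's signal-to-noise ratio scales with the square root of the light collected, so low-albedo regions are measured with proportionally more noise no matter how balanced your class counts are.

```python
import math

def snr(albedo, exposure_photons=1000):
    """Shot-noise-limited SNR: signal / sqrt(signal) = sqrt(photon count)."""
    photons = albedo * exposure_photons
    return photons / math.sqrt(photons)

print(snr(0.8))  # bright region: ~28.3
print(snr(0.1))  # dark region: 10.0, so measurements are ~3x noisier
```

Balancing the dataset by headcount does nothing about this; the raw measurements themselves carry less information for darker subjects.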

This is still an unsolved problem in ML. Not only do we have dataset biases (as discussed above), but models can also exaggerate those biases. So even if you get a perfectly distributed dataset, your model can still introduce problems.
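A toy sketch of that amplification effect (my own construction, numbers made up): if a label co-occurs with some attribute 70% of the time in the training data, an accuracy-maximizing predictor that leans on that attribute will emit the co-occurring label every time, pushing the correlation in its outputs to 100%, stronger than in the data itself.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000
attr = rng.integers(0, 2, n)                           # some binary attribute
label = np.where(rng.random(n) < 0.7, attr, 1 - attr)  # 70% correlated label

# The accuracy-maximizing rule given only `attr` is to always predict
# the label that co-occurs with it most often:
pred = attr

print((label == attr).mean())  # ~0.70 co-occurrence in the data
print((pred == attr).mean())   # 1.00 in the model's outputs: bias amplified
```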

But in either case, we don't have the same concerns in research as we have in production. While there are people researching these topics, most of us are still trying to just get good at dealing with large data (and tails) in the first place. Right now the popular paradigm is "throw more data at the model." There are nuanced opinions about why this may not be the best strategy and why we should be focusing on other aspects (opinions being key here).

Either way, "using publicly available datasets" is an answer that suggests 1) they might not understand these issues and 2) the model is going to have a ton of bias because they're just using off-the-shelf data and models. I want some confidence that these people actually understand ML instead of throwing a neural net at the problem and hitting go.

> I'm curious how to enable models without giving away my data.

Our best guess right now is homomorphic encryption, but it is currently very slow and less accurate. There's federated learning, but that has issues too. Remember, we can often reconstruct images from the training set if we have the trained model[2]. You'll see in this reference that while the reconstructions aren't perfect, they are more than good enough to be a privacy problem. So right now we should probably rule out federated learning.
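To see why shared gradients alone can leak data, here's a minimal sketch (my own construction, not the attack in [2]): for a linear layer trained with MSE on a single example, the weight gradient is a rank-1 outer product of the error and the input, so a server receiving that gradient can recover the client's input exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random(8)        # a client's private input
t = rng.random(3)        # its private target
W = rng.random((3, 8))   # shared model weights: y = W @ x

err = W @ x - t          # prediction error
grad = np.outer(err, x)  # dL/dW for 0.5*||W @ x - t||^2: a rank-1 matrix

# Each nonzero row of the gradient is a scalar multiple of x,
# so dividing out that scalar recovers the input:
i = np.argmax(np.abs(err))
recovered = grad[i] / err[i]
print(np.allclose(recovered, x))  # True
```

Deep networks make the inversion harder, but [2] shows it is often still feasible in practice.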

> Open source models owned by society?

Actually, models aren't the big issue. Google and Facebook have no problem sharing their models because that isn't their secret sauce. The secret sauce is the data (like Google's proprietary JFT-300M) and the training methods (and though most of the training methods are public as well, few are able to actually reproduce them without millions of dollars in compute).

I hope this accurately answers your questions and further expands on the reasoning behind my concerns (and specifically why I don't think the responses to me are sufficient).

[0] https://image-net.org/about.php

[1] https://arxiv.org/abs/1707.02968 (personally, it bugs me that this dataset is proprietary and used in their research. Considering how datasets can allow for gaming the system, I think this is harmful to the research space. We shouldn't have to just trust them. I don't think Google is being nefarious, but that's 300M images, and mistakes are pretty easy to make.)

[2] https://arxiv.org/abs/2003.14053


godelski, I really appreciate such a thoughtful response to my curiosity.

Looking at this with a better understanding of the problem, I wonder what features I really want for my own photo library (thinking of iOS Photos). Matching people together seems hard, but grouping photos by GPS location or date is trivial. So we have to get clear on which features are important for home photo libraries.

I can now see how the idea of "use public libraries = solution" falls short. It neither presents a viable solution nor demonstrates rigorous understanding.


Hey, that's what HN is about. You've got experts in very specific niches, and we should be able to talk to each other in detail, right? That's the advantage of this place over somewhere like Reddit, though as it expands in size we face similar issues.

These are good points about GPS and other metadata. I didn't really think about that when considering this problem, but every album I create is pretty much a combination of GPS- and time-based grouping (though I create these with friends). I think you're right in suggesting that there are likely _simple_ ways to group some things that aren't currently being done.
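As a sketch of how simple that grouping can be (my own toy code; the thresholds are made up): start a new album whenever the time gap or the haversine distance to the previous photo exceeds a threshold.

```python
from dataclasses import dataclass
from math import radians, sin, cos, asin, sqrt

@dataclass
class Photo:
    ts: float   # unix timestamp
    lat: float
    lon: float

def km(a, b):
    """Haversine distance between two photos, in kilometers."""
    la1, lo1, la2, lo2 = map(radians, (a.lat, a.lon, b.lat, b.lon))
    h = sin((la2 - la1) / 2) ** 2 + cos(la1) * cos(la2) * sin((lo2 - lo1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(h))

def albums(photos, max_gap_s=3600, max_km=1.0):
    """Group time-sorted photos; start a new album on a big time/space gap."""
    out = []
    for p in sorted(photos, key=lambda p: p.ts):
        if out and p.ts - out[-1][-1].ts <= max_gap_s and km(p, out[-1][-1]) <= max_km:
            out[-1].append(p)
        else:
            out.append([p])
    return out
```

For example, two photos taken ten minutes apart at the same spot land in one album, while a photo taken the next day in another city starts a new one. No ML required for this class of feature.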

> I can now see how the idea of "use public libraries = solution" falls short. It neither presents a viable solution or demonstrates rigorous understanding.

ML is hard, but everyone sells it as easy. Then again, if it were easy, why would Google and Facebook pay researchers so well? There are a lot of people in this space, so it is noisy. But I think if you have a pretty strong math background, you start to be able to pick the signal out from the noise and see that there is a lot more to the research than getting SOTA results on benchmark datasets.



