What a Deep Neural Network thinks about selfies (karpathy.github.io)
262 points by vkhuc on Oct 25, 2015 | 50 comments



A guide on how to take a good selfie that others will like:

  be female
  be blonde
  be attractive
Incidentally, Christian Rudder did a really good "study" on dating-site pictures a few years ago:

http://blog.okcupid.com/index.php/dont-be-ugly-by-accident/


A better guide on how to take a selfie:

    Don't take a selfie.


And, if you are female, chop off your forehead.


Also, long hair in front of your shoulders (no ponytail).


Apparently, she should be white too.


This is neat. I bet Facebook or OkCupid are sitting on all sorts of click data that could be used to develop tools for helping people make their photos look better. (Even if, personally, I can't wait for a cultural backlash against internet narcissism...)

[Edit: Even better, he didn't use click data to train the model, just public likes.]


The idea to use a convnet to reframe the selfie is neat. Makes it 5% better. Also, if it can be run on the phone, it could possibly warn people they are about to post a shitty selfie before they do.


I would like to see a deep dream selfie ...

Feed it an initial picture (noise, clouds, a selfie) and then manipulate the input via backpropagation to maximize the assessed quality of the "selfie".

I guess that would look pretty funny.
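
That's basically activation maximization. A toy, runnable sketch of the idea in Python, with a random linear scorer standing in for the trained ConvNet (the real gradients would come from backprop):

    import numpy as np

    np.random.seed(0)

    # Toy stand-in for the trained selfie scorer: score(x) = w . x.
    # A real ConvNet would replace this, with d(score)/dx from backprop.
    w = np.random.randn(64 * 64)

    def score(x):
        return w.dot(x)

    def grad(x):
        return w  # gradient of the linear toy score w.r.t. the input

    x = np.random.rand(64 * 64)   # start from noise (or a selfie)
    for step in range(100):
        x += 0.1 * grad(x)        # gradient ascent on the input pixels
        x = np.clip(x, 0.0, 1.0)  # keep pixels in a valid range

    print("final score:", score(x))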


He did run something like that for cropping. He showed his favourite two "rude" ones at the bottom, where the 'Net cropped out the face of the person taking the selfie.


Actually, he used random crops and selected the highest rated. A "deep dream selfie" would actually run the neural network in reverse so as to generate a completely different image.
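
Something like this sketch, where score() is a hypothetical stand-in for the trained ConvNet:

    import numpy as np

    np.random.seed(0)

    def score(crop):
        # Hypothetical stand-in for the ConvNet's selfie score.
        return float(crop.mean())

    img = np.random.rand(240, 240)   # fake image
    best_crop, best_score = None, -np.inf
    for _ in range(100):
        # Sample a random square crop of the image.
        size = np.random.randint(160, 241)
        y = np.random.randint(0, 240 - size + 1)
        x = np.random.randint(0, 240 - size + 1)
        crop = img[y:y + size, x:x + size]
        if score(crop) > best_score:
            best_crop, best_score = crop, score(crop)

    print("best crop score:", best_score)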


One thing I always found interesting: LeCun is credited with developing convnets, but Hinton is apparently credited with scaling them up and showing the world how great they are in the 2012 paper. Why was Hinton's group (Toronto) able to publish these groundbreaking results before LeCun's group (NYU)?


Geoff Hinton answers this question in episode 6 of the Talking Machines podcast. http://www.thetalkingmachines.com/blog/2015/3/13/how-machine...

Geoff Hinton had grad students who wanted to work on the problem, but Yann LeCun didn't.

"In about 2012, it should have been Yann's group, but Yann was unlucky, he didn't have a student who really wanted to do it. But we had a couple of students who wanted to do it and we took all of Yann's techniques and added some of our own."


Interesting - I took the course but did not notice that - thanks!


IIRC the deep learning revolution started with pretraining and RBMs, which I believe Hinton invented.


>Be female. Women are consistently ranked higher than men. In particular, notice that there is not a single guy in the top 100.

This sounds true, but it can't be the real reason—selfies are ranked relative to the other images by the same user. So unless users are taking a lot of #selfies of people of different genders, we can assume the dataset is already controlled for the gender of the person in the image, no? Unless there's some confounding factor at play, such as some demographic segment being more likely to optimize for good selfies occasionally but have boring feeds the rest of the time.

It would be super interesting, if the data is available, to normalize this by exposure. Of the people who saw an image, how many clicked "like"?


Well, one of the other factors is long hair and the tendency to oversaturate the face. Those factors don't seem independent to me: men are less likely to sport long hair, and they're also less likely to oversaturate the face to measure up to some skin-perfection standard (think of it as the photographic equivalent of makeup).

> but it can't be the real reason

Can't? On top of the above-listed aspects, it is entirely possible that there is a bias whereby both sexes find female appearance somewhat more aesthetically pleasing.

Similar to how focus group testing for computer voices tends to result in female voices being chosen (at least that's what I often hear, couldn't find a solid source).

Even if the bias is small the correlated factors would amplify it when you're optimizing for a maximum, i.e. for the top selection.


Neither of those explain why it would rank above the average of other female faces, in general.

Discussion about this with the author reveals that I was misinterpreting how they were collecting averages. I was assuming the "like" count was coming from each photo collected, but instead they collected the photos and average likes in separate steps, where the average likes were across recent posts by that user, rather than the selfies by that user.


I screwed up on this point by the way - I had done this part of the experiment a few months ago and I incorrectly remembered the details. I went back and looked through the code and updated the post with more detail regarding this important point. In particular:

"Now it is time to decide which ones of those selfies are good or bad. Intuitively, we want to calculate a proxy for how many people have seen the selfie, and then look at the number of likes as a function of the audience size. I took all the users and sorted them by their number of followers. I gave a small bonus for each additional tag on the image, assuming that extra tags bring more eyes. Then I marched down this sorted list in groups of 100, and sorted those 100 selfies based on their number of likes. I only used selfies that were online for more than a month to ensure a near-stable like count. I took the top 50 selfies and assigned them as positive selfies, and I took the bottom 50 and assigned those to negatives. We therefore end up with a binary split of the data into two halves, where we tried to normalize by the number of people who have probably seen each selfie. In this process I also filtered people with too few followers or too many followers, and also people who used too many tags on the image."


Still no men in the top 100? There must be something deep to learn about the difference in the sexes there; I am just not sure what it is.


> focus group testing for computer voices tends to result in female voices being chosen

I personally prefer the Alex voice from Mac OS to female voices. It has nice intonation. If only I could make it correct some of the mistakes it makes, for example not being able to distinguish "read" in past tense from "read" in present tense which makes it sound silly. Another error it makes is confusing "live" as in "live concert" with "live" as in "live in USA" (they are called heteronyms and are a special case in TTS).


You can fix this by misspelling your input text. Use 'red' for the past tense of "read". Use 'laif' and 'lif' for the latter.
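
On a Mac you can try this from Python with the built-in `say` command (macOS only; the respellings are the ones suggested above, and results may vary by voice):

    import subprocess

    # Respell heteronyms so the TTS engine picks the intended pronunciation.
    subprocess.run(["say", "-v", "Alex", "I have red that book."])
    subprocess.run(["say", "-v", "Alex", "We went to a laif concert."])
    subprocess.run(["say", "-v", "Alex", "I lif in the USA."])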


Yeah, female users probably post more pictures and also probably have more friends.


This also would be controlled for by the method the blog author used, though: if a woman has more friends, then she would also probably get more likes on all of the rest of her images. Not sure if posting more photos would drive the average up or down, but it would probably affect the "above the baseline" selfies in the same way.


More friends, but the likes are not uniformly distributed as the number of friends increases. Also, more pictures means the "best picture" could be more of an outlier.

So the best pictures might rise further above the baseline for that person. That is, the top picture gets 1000 likes, but most pictures get zero. Sort of like the Zipfian distribution of words.

Anyway, these things are actually really hard to control for, particularly because different types of friends/people have different effects on the likes. Now add to this cultural differences between countries/states/universities/rural-urban, etc.

I think the best method that is actually practical was the one OkCupid did at some point with "My Best Face", where you rate a bunch of people's pictures and they rate yours. Then you figure out from the data which pictures are good.

If they kept the data for all these contests, it would be much easier to interpret in aggregate.


How to take a good selfie: don't be black or dark-skinned, unless you're a celebrity.

How do we prevent our AIs from learning racism?

EDIT> Informative article, BTW. A good read.


If a given question has an answer that is due to racism, the answer is still the answer. For example, if society has some underlying racism that factors into what it considers attractive, that doesn't change what it considers attractive.

I don't think these algorithms are learning racism. They are only being blunt in revealing what already exists.


> If a given question has an answer that is due to racism, the answer is still the answer.

That's why it's important to be clear about the question. This ConvNet doesn't really answer the question "What makes a good selfie?". It answers a much narrower question that is more complicated to state.

The absence of reflection in the system means that if it's used to answer a question that's superficially similar to the designer's intent, there's no way to reason around the bias in the training data.

Imagine I'm a Canadian who trains an automated turret to classify friend / foe based on data from Afghanistan and Iraq. I've not trained the system to answer "Is this group of pixels a friend / foe", in the general sense. If the system is used outside the narrow context of its validity, say in Northern Ireland, or in a civilian Muslim neighbourhood in Paris, we should expect bad results.

So you're right to point out that the racism is in the social context. But I'm arguing that we don't actually want a classifier to learn that if there's a good chance it'll be used in a way that discards or ignores that social context. Same as using an expert system outside its domain.


This is an important point. People are thinking about it, and a lot of it will have to do with how the input data is gathered and curated.


I think it's less about the head getting chopped off than about having "the head take up about 1/3 of the image," as Karpathy says. So what the net is learning is composition, or balance in an image, which is really cool. The rule of thirds is actually pretty well known to people in photography:

https://en.wikipedia.org/wiki/Rule_of_thirds

(Our deep-learning framework http://deeplearning4j.org missed his list, but it's got working convnets, too.)


Possibly, but none of the cropped examples have cropped chins. It's also well known in photography that you can cut off someone's forehead, but never their chin.


Echoing a law of video games: "nobody looks up"!


One caveat with this machine-inspired knowledge: it is prone to error, probably more than humans are, at least for now.

For example, if you train a CNN directly on human faces, its recognition rate falls way below what a human is capable of. Only after you apply tons of handcrafted optimizations, which are mostly black art, will you get close to or surpass a human's capability. Without much domain specific tuning, an AI's insight is far from reliable.


This is more wrong than right.

The example is correct, but not for the reasons stated. Humans are very, very good at face recognition. However, CNNs are pretty close to human performance for face detection.

> Only after you apply tons of handcrafted optimizations, which are mostly black art, will you get close to or surpass a human's capability. Without much domain specific tuning, an AI's insight is far from reliable.

This just isn't the case. Take the GoogLeNet or VGGNet papers, build the CNN as described using Caffe/whatever, train as described in the paper and you'll end up with something that is pretty much on par with human performance for categorizing ImageNet images.

Take that same CNN architecture, and retrain it for another domain and it will perform roughly as well there too, for the task of categorizing into ~1K-10K image classes.

This isn't domain specific tuning. It's domain specific training, which is very different (although collecting the data is a big job).

> Only after you apply tons of handcrafted optimizations, which are mostly black art, will you get close to or surpass a human's capability.

For CNNs, this is pretty much entirely false.


A GoogLeNet or VGGNet has tons of parameters: how many convolutional layers are stacked together, the size and stride of each one, where to put the dropout layers, where to put the fully connected layers, how they are connected together, global learning rate and momentum and decay, local learning rate and momentum and decay. Each of these myriad parameters has an unpredictable effect on the final result. The initialization of the network also has a major bearing on the final outcome. It is almost a chaotic system where nothing small can be safely ignored. One time my result of training a CNN was swung by the `batch_size` parameter, and to this day I don't know why.

Those parameters are exactly the type of handcrafted optimizations I am talking about. You cannot just fill in arbitrary numbers and expect the network to fare well. In fact, you cannot even expect it to converge.

You can take those papers and build a world class classifier only because someone else has taken all the time to optimize for the specific case. Once you switch the task, the result will be OK, but nowhere close to what a human or a true AI would give you. Not until you take the time to optimize the parameters.


> A GoogLeNet or VGGNet has tons of parameters.

Kinda, but they are defined for you. For example, the GoogLeNet design is described in [1]. Page 5 lists the parameters, and the diagram on page 6 shows how the layers are linked.

Yes, I agree that the design of a new neural network architecture is a skilled process, and there is a lot of hard work there. I couldn't agree with that more, but that isn't what we are talking about here.

It is quite possible to take a CNN like GoogLeNet designed for a specific purpose and reuse it in similar situations. GoogLeNet will always do pretty well for image classification.

I think of it as analogous to a piece of software like a database. Designing a new database system is hard, but taking something like SQLite and using it is easy. Yes, you can tune it and get better performance out of it, and yes, it will break if you use it in the wrong circumstances, but it is generally pretty reliable if used as designed.

Now this analogy breaks down because industrial use of CNNs is pretty new compared to database systems. It's more like trying to get msql[2] running on your Slackware 0.9 system in 1993 than getting Postgres running on Ubuntu 15.10.

Nevertheless, there isn't really a black art to using an existing CNN. Lots of schlepping to get CUDA running on your machine, though.

[1] http://www.cv-foundation.org/openaccess/content_cvpr_2015/pa...

[2] Not MySQL, msql: https://en.wikipedia.org/wiki/MSQL


With the training/test data sets, wouldn't it be possible to find the best parameters with a genetic algorithm? I mean, sure, it'd take really long ... well, probably too long.
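
People do use random/evolutionary search for this; the catch is the cost of each evaluation. A toy sketch of random search (a simpler cousin of a genetic algorithm), where the objective is a made-up stand-in for "train the whole net and measure validation accuracy":

    import random

    random.seed(0)

    def validation_accuracy(params):
        # Made-up stand-in: in reality this means training the CNN with
        # these hyperparameters, which can take days per trial.
        lr, batch_size, dropout = params
        return 1.0 - abs(lr - 0.01) - abs(dropout - 0.5) \
                   - abs(batch_size - 128) / 1000.0

    def random_params():
        return (10 ** random.uniform(-4, -1),       # learning rate
                random.choice([32, 64, 128, 256]),  # batch size
                random.uniform(0.2, 0.8))           # dropout rate

    best = max((random_params() for _ in range(50)),
               key=validation_accuracy)
    print("best hyperparameters found:", best)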


What type of handcrafted optimizations are you talking about here?

The state of the art I've read about* (deep CNNs) in recent years relies more on generalized tricks like augmenting the training data (artificially inflating the data set), pre-training and fine-tuning, ReLU, regularization methods like dropout, etc. (See the sketch after the links below.)

For anyone interested, here [1] are some benchmarks.

* Late night here, but often in the vein of this [0] work.

[0]: https://www.cs.toronto.edu/~ranzato/publications/taigman_cvp...

[1]: http://vis-www.cs.umass.edu/lfw/results.html
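
For concreteness, the data-augmentation trick mentioned above boils down to something like this sketch (crop size, counts, and the fake image are illustrative):

    import numpy as np

    np.random.seed(0)

    def augment(img, n=8, crop=56):
        # Inflate one image into n randomly shifted and flipped crops.
        h, w = img.shape
        out = []
        for _ in range(n):
            y = np.random.randint(0, h - crop + 1)
            x = np.random.randint(0, w - crop + 1)
            patch = img[y:y + crop, x:x + crop]
            if np.random.rand() < 0.5:
                patch = patch[:, ::-1]   # horizontal flip
            out.append(patch)
        return out

    img = np.random.rand(64, 64)   # fake training image
    copies = augment(img)
    print(len(copies), "augmented copies of shape", copies[0].shape)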


It seems this neural network has a sense of humor if you look at the "Finding the Optimal Crop for a selfie" area.

You can see it optimized the last selfie by cropping the face fully out of the picture.. :))


DNNs are a key technology of the future. I highly recommend the educational programs Professor Karpathy mentions at the end of this post. All are excellent and free.



A really good read. Good intro to ConvNets, and a well designed and implemented test. And funny.


Looking at the top 100, one can only wonder how Hollywood figured it out well before the mighty power of computers :)


For me, this thing about having the top of your head cut out of the picture is new. Who would have thought...


Makes a bit of sense: in combination with the "be female" advice, cutting off the forehead puts the center of the photograph closer to her cleavage and typically shows off her entire chest.


Cleavage does not feature a lot in the top 100 actually, but I'm halfway there, in the sense that I'm female. I'll definitely try the half-forehead thing next time!


We could try and see if the activation for good selfies comes from the cleavage or the eyes.


I thought cutting off the forehead happens when the subject is closer to the camera, so it feels more personal.


It seems that eyes and mouth, and their alignment, matter most for female attractiveness.[0,1]

[0] http://www.nbcnews.com/id/34482178/ns/health-skin_and_beauty...

[1] http://www.ncbi.nlm.nih.gov/pubmed/25836007


Well, you don't need to ask a deep neural network to tell you that selfies are getting stupider by the day, with teens sticking their tongues out.


BEEP BEEP. Bad selfie detected. You run the risk of making a fool of yourself! BEEP BEEP



