DeepFace: Closing the Gap to Human-Level Performance in Face Verification (facebook.com)
90 points by mlla on March 13, 2014 | 45 comments



It's surprising to me that there isn't a single positive comment in this thread considering how amazing this is. Sure, it has other implications, but were we really hoping to prevent computers from recognizing faces permanently?


No, we were just hoping to prevent creeps like facebook from using it to tailor advertising.


Ah, that's how you get to Minority Report-style stores/advertising being used in practice...

1. You upload photos to Facebook. Facebook detects various commercial products or even--assuming deep learning can take things to "higher level" representations--"style preferences" of the people recognized in the photo. It will also infer style preferences from context, location, etc. (dive bar or classy lounge?). Because it knows who is in the picture, it can correlate all of that with an identity.

2. Facebook then sells this "identity" to the Gap -- no actual information about the user, just these raw vectored "style preferences" which contain all knowable brand information about each user. Facebook can provide massive coverage here.

3. You walk into a Gap store. Gap has installed software provided by Facebook to detect your face/person/style preferences (but no personally identifiable information, just that "the person with this face has these preferences and probably makes this much money so you might offer X,Y,and Z at these prices") and you then get an offer via facebook message (or "facebook offer"?) on your phone to buy what you think is actually a really cool jacket at an admittedly reasonable price (based on what you're used to paying for jackets).

This probably has massive ramifications for outfits like Costco or Target or Walmart where individual consumer preference/taste/whatever can really make a difference in choosing effective lineups of products... Maybe they manage to offer deals that sort of price themselves based on what they know the user will pay?

Almost not a bad idea...


"Luddite news"


The key innovation is an accurate, reliable method for rotating faces so they're 'looking straight at the camera' before feeding them to a deep neural network. They call this 3D photo rotation process "frontalization." Figure 1 on page 2 of the paper shows at a very high level how this is being done. Very nice!
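For intuition only, here is a much simpler 2D cousin of that alignment step (not the paper's 3D frontalization): estimate a similarity transform that maps two detected eye landmarks onto canonical positions. All coordinates below are made up for illustration.

```python
import numpy as np

def similarity_from_eyes(left_eye, right_eye,
                         canon_left=(30.0, 40.0), canon_right=(70.0, 40.0)):
    """Estimate a 2x3 similarity transform (rotation + uniform scale +
    translation) mapping detected eye centers onto canonical positions.
    This is in-plane 2D alignment only -- far weaker than the paper's
    3D frontalization, but the same idea of normalizing pose first."""
    src = np.array(right_eye, float) - np.array(left_eye, float)
    dst = np.array(canon_right, float) - np.array(canon_left, float)
    # Complex-number trick: the ratio dst/src encodes rotation and scale at once.
    s = (dst[0] + 1j * dst[1]) / (src[0] + 1j * src[1])
    a, b = s.real, s.imag
    R = np.array([[a, -b], [b, a]])
    t = np.array(canon_left, float) - R @ np.array(left_eye, float)
    return np.hstack([R, t[:, None]])  # 2x3, as consumed by image-warp routines

M = similarity_from_eyes((35, 50), (75, 46))  # made-up detected landmarks
print(M @ np.array([35, 50, 1.0]))  # left eye lands on (30, 40)
```

Applying `M` to every pixel coordinate would then produce the "rotated to face the camera" crop, in 2D.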


Just a little background the paper itself doesn't provide:

The 3-d modeling and rotation builds on the work Yaniv did as part of Face.com (a face recognition startup), which was acquired by Facebook. Benchmarked here: http://vis-www.cs.umass.edu/lfw/results.html

Also Marc'Aurelio was just hired away from Google and is a deep learning expert.


Actually, that's only one of the contributions, and I'm not so sure it's the "key innovation". Every other recent face recognition method also tries to do some kind of alignment to make faces more similar in pose/expression/lighting prior to classifying them; and of these, several also fit faces to a 3-d model to rotate to frontal (with varying quality).

See my other comment for my guess on what's actually providing the boost: https://news.ycombinator.com/item?id=7393378


Yes, other attempts "also fit faces to a 3-d model to rotate to frontal (with varying quality)," as you put it, but this method for rotating faces appears to be superior -- more accurate and reliable.

It looks like the main contribution to me :-)


Yeah, generally speaking for object detection, with scale, position, and rotation: you can typically identify something really efficiently under 2 of these (for example, wavelets give you a transform that is sufficient for object identification under scale and position transformations in O(N) operations).

If you'd rather have orientation identification (ie. rotation angle) and scale in that mix but don't care about position, the radon transform is nice and easy to work with.

But beyond inverting 1-2 key transformations, one usually has to pay a pretty hefty computational cost which often precludes online (near real-time) use.
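As a toy illustration of buying invariance to one of those transformations cheaply: the magnitude of the discrete Fourier transform is unchanged by circular shifts, so it gives an O(N log N) translation-invariant descriptor. (This is a generic signal-processing fact, not something from the paper.)

```python
import numpy as np

rng = np.random.default_rng(0)
signal = rng.standard_normal(256)
shifted = np.roll(signal, 37)  # circularly translated copy

# A circular shift only changes the phase of the DFT, not its magnitude,
# so |FFT| is a translation-invariant descriptor computable in O(N log N).
desc_a = np.abs(np.fft.fft(signal))
desc_b = np.abs(np.fft.fft(shifted))
print(np.allclose(desc_a, desc_b))  # True
```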


Important to note:

- they still need 1000 labeled samples per identity

- their network can only handle 4000 distinct identities (at 97.25% accuracy) at a time

It's still a very worrying development for online and offline privacy.


Actually, based on my reading of the paper, it seems that they learn a representation using one data set (the one with 1000 labeled samples per identity), then use that representation to classify on other data sets (like Labeled Faces in the Wild, which has 13,233 photos of 5,749 celebrities). In fact, from what I can tell from section 5.1, they seemed to use face pairs (and so trained on 1 sample per person, and then tested on the other sample).

tl;dr: They don't need 1,000 labeled samples per identity (once done with the representation phase), and they achieved 97.25% accuracy on ~6,000 distinct identities, with only one training photo per identity.


I think you guys are reading different parts of the article.

They present both results, supervised and unsupervised (where unsupervised uses the SFC dataset to train). They achieved 95.92% accuracy on LFW with unsupervised (section 5.3) - so they can train on SFC and then classify a single image in a different domain with 95.92% accuracy.

The 97.25% accuracy level was achieved, as you say, when they let the pairs into the training set. But they overfit with LFW alone, and had to add an additional 100k identities with more samples (30) per identity. A very impressive result, but not quite as good as being able to generalize with 97% accuracy from a single training photo.


I looked hard, and could not prove they can identify a single image with that accuracy. I couldn't tell WHAT the trial protocol was. Where is that in the document?


The LFW test is verification: given two faces you've never seen before, tell whether they're the same person or not. That's a much weaker test than saying "who is this?"

The LFW benchmark protocol is described in the original tech report: http://vis-www.cs.umass.edu/lfw/lfw.pdf


Actually no. For one thing, this isn't exactly a stealthy or cheap thing to do. It involves datacenters full of computing resources even for 4000 identities.

I also don't believe it's so much a privacy issue. If I upload pictures to facebook I actually want them to be seen by human beings. The face-recognition only helps with that.

If facebook recognizes me on a picture someplace else, I actually rather want to know about it. I'm not super famous, so unexpected pictures of myself can be more of a bad thing...


Privacy issue is not about what you want, it's about what can be done (and often is without you knowing).

Automated facial recognition is a serious privacy concern; it's not just about the slimy, despicable thing Facebook is. For example, in the UK, where there are more CCTV video feeds than people to watch them, automated facial recognition can track you around constantly. Their current automatic number plate recognition is already a serious privacy concern.

Remember 7th Cube's Voyeur's Dream[1], released in 2005? The same, but with more cameras, now able to identify you by your facial features. [1]: http://www.pouet.net/prod.php?which=16410


Wouldn't it actually be better to stop a government from doing that? If you can't, there's something wrong that no restrictions/bans on technological progress will fix...

Usually it's easier to get the government in line than to prevent some technology from being developed...


So how is that NSA reform coming along? Good I hope?


This can already be done by someone who truly wants to keep an eye on you, only it takes more manpower (well, detective power).

This is simply the same trend as ATMs replacing (some) bank tellers many years ago.

Will it be easier to perform and abuse mass surveillance? Sure. Will people with something important to hide still wear disguises? I'd bet yes.

My stance is that we can't fight progress, but we can start fighting the people bent on abusing the powers that progress bring. Identifying them is another issue (perhaps some form of facial recognition? :)


>Will people with something important to hide still wear disguises? I'd bet yes.

Facial recognition systems can be (and are) trained not to be fooled by things like facial hair, hair changes, eyeglasses and sunglasses, hats, etc. -- although such things can certainly obscure your features.

If someone does something really odd to try to avoid the facial recognition (I've seen people posting things like making your hair non-symmetrical, or just avoiding the camera in the first place), then they just train on those things too. And that person gets flagged as really suspicious. And things like full face masks are already banned in some places.


And slowly, suspicious behavior is suppressed.

This will have two effects.

People who are truly suspicious for nefarious reasons will gradually stop being so.

On the other hand, many will be incorrectly flagged as suspicious and harassed without reason.

The trick is to strike a fair balance between the two. But who decides what's fair?


> For one thing, this isn't exactly a stealthy or cheap thing to do.

That's actually the more disturbing part about this: it requires computational resources that only governments and large corporations can afford, so they are the ones who gain the most from it; and it gives them more leverage over the population.


"Leverage" is kind of vague. It's hard to make money from large-scale abuses...

And it doesn't look like the NSA for example is doing anything useful from its own perspective with the data they actually have...


"Leverage" doesn't always equate to profit (or even revenue). The "goal" of the NSA is not profit, but mass surveillance. Whereas surveillance historically has been centralized, with the onslaught of mobile devices, we now have prevalent distributed surveillance (and perhaps more so than surveillance).

All the data everyone is feeding into FB/G+/Twitter/etc about their friends and acquaintances could not have been collected on such a mass scale by the NSA alone.

If you're a privacy-conscious individual, there are definitely reasons to be fearful of this approach to surveillance by any governmental intelligence agency.


> If facebook recognizes me on a picture someplace else, I actually rather want to know about it. I'm not super famous, so unexpected pictures of myself can be more of a bad thing...

If all they did was notify you, that would be OK, I guess, although I'm not totally cool with it. I'm more worried about them seeing my picture and tagging it as me without my permission.

I think the crux of it is that facebook is pretty much the least trustworthy, most give-an-inch-they-take-a-mile company out there as far as privacy is concerned...


wow, does anyone apart from facebook and the government actually want facebook to do this? it's pretty terrifying


I do. Assuming that Facebook uses it in what appears to be the logical choice (auto-tagging photos), this would be a fantastic way to find photos of me that I don't already know about.


Not surprisingly, they are called facebook after all...


deepfacebook

Btw, does this software match faces to people, or just draw a rectangle around faces?


It matches your face to your name with 97.25% accuracy, assuming that they have at least 1000 labeled photos of you to start with.


1 or 2 photos are enough for that accuracy. The 1000 photos were only needed in finding the right face representation.


Having worked on this problem before (the comparison to human performance they cite is from my work) and seeing all the recent successes of deep learning, I'd bet that a lot of the gain here comes from what deep learning generally provides: being able to leverage huge amounts of outside data in a much higher-capacity learning model.

Let me try to break this down:

In machine learning, when you have input data that is labeled with the kinds of things you are directly trying to classify, that is called "supervised". In this case it's not quite supervised, because their main evaluations are on the LFW dataset, which is a verification dataset, whereas their training on SFC is a recognition task. The difference is that in verification, you are given photos of two people you've never seen before and have to identify if they're the same or not. In recognition, you are given one or more photos of several people as training data, and asked to identify a new face as one of them. In theory, you could build recognition out of verification (verify all pairs between training images and test input images and assign the top-scored name as the person) but in practice it's much better to build dedicated recognition classifiers for each person.
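The "recognition out of verification" construction described above can be sketched in a few lines. Here `verify_score` is a stand-in (plain cosine similarity over precomputed feature vectors), not the paper's learned verifier, and the gallery numbers are made up:

```python
import numpy as np

def verify_score(feat_a, feat_b):
    """Stand-in pairwise verifier: cosine similarity between face features.
    In a real system this would be the learned verification model."""
    return float(np.dot(feat_a, feat_b) /
                 (np.linalg.norm(feat_a) * np.linalg.norm(feat_b)))

def recognize(query_feat, gallery):
    """Naive recognition built from verification: score the query against
    every labeled gallery face and return the best-scoring name."""
    return max(gallery, key=lambda name: verify_score(query_feat, gallery[name]))

# Toy gallery with one feature vector per known person (made-up numbers).
gallery = {"alice": np.array([1.0, 0.1, 0.0]),
           "bob":   np.array([0.0, 1.0, 0.2])}
print(recognize(np.array([0.9, 0.2, 0.0]), gallery))  # alice
```

As the comment notes, in practice dedicated per-person classifiers beat this all-pairs scheme, not least because it scales linearly in gallery size per query.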

Their main network is trained on a recognition task, using their SFC dataset. They show these recognition results in Table 1 and the middle column of Table 2. An error rate of 8.74% (DF-4.4M), for example, means that they were able to successfully name the person in 91.26% of input images. However, this error rate crucially depends upon two key factors: (1) the number of people they're trying to distinguish between, and (2) the number of images they have per person. For this test, it was ~4,000 people and ~1,000 images/person, respectively.

If you were to add more people to the database, or have fewer images per person, this accuracy would drop. You can see this clearly in Table 1, where subsets DF-3.3M and DF-1.5M have correspondingly lower error rates because they have fewer people (3,000 and 1,500, resp). Similarly, the middle column of that table shows how error rates rise when you reduce the number of images per person.

In contrast, all subsequent results are shown on verification benchmarks (LFW and Youtube Faces). In large part, I suspect this is because of the realities of publishing in the academic face recognition literature: you have to evaluate on some dataset that the community is familiar with to get your paper accepted, and LFW is the de-facto standard these days, and it only does verification not recognition.

Here, their performance is certainly very good, and an improvement over previous work, but not an unexpectedly huge leap. If you look at the LFW results page, you can see that recent papers have been edging up to this number quite steadily: 95.17% (high-dim LBP), 96.33% (TL Joint Bayesian), 97.25% (this paper) http://vis-www.cs.umass.edu/lfw/results.html

Nevertheless, how are they able to get this boost in performance? What recent papers in this field have increasingly been discovering is that having higher-dimensional features can really give you a big boost, or to put it another way: having a higher-capacity model is what buys you the additional performance.

In machine learning, the "capacity" of a model refers (in a loose sense) to how powerful it is. The basic tradeoff is that a higher-capacity learner can more accurately classify testing data BUT it requires much more training data to learn. The problem is that for the LFW benchmark, the amount of direct training data you have is strictly limited: there are 6,000 pairs of faces, and you train on 90% of them and test on the remaining 10%. This is not nearly enough data to train a high-capacity model.
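For concreteness, that "train on 90%, test on 10%" setup is a 10-fold cross-validation over the 6,000 pairs. A rough sketch of the protocol (my own simplification, not the official benchmark code), where the only thing "trained" on LFW data is a decision threshold:

```python
import numpy as np

def ten_fold_accuracy(scores, labels):
    """LFW-style 10-fold protocol sketch: for each held-out fold, pick the
    decision threshold that works best on the other nine folds, then
    measure accuracy on the held-out fold.  `scores` are pairwise
    similarity scores, `labels` are 1 (same person) / 0 (different)."""
    scores, labels = np.asarray(scores, float), np.asarray(labels)
    folds = np.array_split(np.arange(len(scores)), 10)
    accs = []
    for i, test_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        # Choose the threshold that maximizes accuracy on the training folds.
        cands = np.unique(scores[train_idx])
        best_t = max(cands, key=lambda t: np.mean(
            (scores[train_idx] >= t) == labels[train_idx]))
        accs.append(np.mean((scores[test_idx] >= best_t) == labels[test_idx]))
    return float(np.mean(accs))

# Toy check: perfectly separated scores give 100% accuracy.
labels = np.array([1, 0] * 30)
scores = labels * 0.8 + 0.1          # 0.9 for same-person pairs, 0.1 otherwise
print(ten_fold_accuracy(scores, labels))  # 1.0
```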

So what people have been doing is training the bulk of their models on some other data, for some other task, and then adapting that model to the LFW problem, using the LFW training data essentially to "tweak" the classification model for this particular task. That's why the LFW results tables are now broken up into different sections according to how much outside data was used and in what form.

In the case of DeepFace, this takes the form of the SFC dataset and learning a network for recognition, not verification. Since they have access to lots of data of this form, they can successfully train a high-capacity model for it. Then they simply chop off the last layer of the network -- the one that does the final recognition task -- and replace it with a component for verification using only LFW training data. Or, for their "unsupervised" results, using no LFW training data ("unsupervised" in quotes because it's not really unsupervised).

BTW, this approach of training a deep network for some task, and then cutting off the last layer to apply it to a different task (in effect making it simply a feature-extraction method) is quite common, and has been applied successfully to many problems that might not have enough data to train a high-capacity model directly.
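That chop-off-the-last-layer pattern, in a toy numpy sketch. The weights here are random and the dimensions tiny (DeepFace's real network has a 4096-d feature layer, over 120M parameters, and was trained on SFC); the point is only the structure:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy two-layer "recognition" network: pixels -> hidden features -> identity scores.
W1 = rng.standard_normal((64, 1024))  # pixel vector -> 64-d hidden features
W2 = rng.standard_normal((10, 64))    # hidden features -> scores for 10 identities

def recognize_scores(pixels):
    hidden = np.maximum(0.0, W1 @ pixels)  # ReLU hidden layer
    return W2 @ hidden                     # per-identity logits (the "last layer")

def extract_features(pixels):
    """The same network with the final recognition layer chopped off:
    the hidden activations become a generic face descriptor that can be
    reused for verification on a different dataset."""
    return np.maximum(0.0, W1 @ pixels)

face = rng.standard_normal(1024)
print(recognize_scores(face).shape, extract_features(face).shape)  # (10,) (64,)
```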

Anyway, if people have more questions, I can try and answer them. (I'm not one of the authors, but I am in the field.)


Thanks for the write-up. This is very informative.

Could you elaborate a bit more on the "capacity" of learning models? Can it be quantified and is it some how related to the VC dimension of a particular learning problem? It would be great if you could give some example of "capacity" for the more well known models: trees, naive bayes, SVM, one hidden layer neural nets, etc.


Yes, capacity is intimately tied to VC dimension; in particular, VC dimension is one way to measure capacity. See the Wikipedia article for more information: http://en.wikipedia.org/wiki/Vc_dimension

I'm not an expert on deep learning (although I generally understand how they work on vision problems), so I'm not sure if you can precisely measure the capacity of deep networks. Informally, the primary number that seems to matter is the number of parameters in the network that have to be learned. This paper quotes that at "more than 120 million".

SVMs, in contrast, typically work with feature dimensionalities (i.e., # of parameters) that are on the order of 1,000 - 100,000. You can't directly compare these numbers because there are various non-linearities involved, but this deep learning network is definitely much higher capacity than an SVM would be with normal feature dimensionalities.
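A back-of-the-envelope way to see the capacity gap. The layer sizes below are hypothetical, chosen only to land in the "more than 120 million" ballpark the paper quotes, not the paper's actual architecture:

```python
def dense_param_count(layer_sizes):
    """Parameters in a fully-connected stack: a weight matrix plus a bias
    vector for each consecutive pair of layers."""
    return sum(m * n + n for m, n in zip(layer_sizes, layer_sizes[1:]))

# Hypothetical fully-connected network over 152x152 pixel inputs:
net_params = dense_param_count([152 * 152, 4096, 4096, 4030])
# A linear SVM over a 10,000-d feature vector: one weight per dimension plus a bias.
svm_params = 10_000 + 1
print(net_params, svm_params)  # roughly 128 million vs. about 10 thousand
```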


I envy your ability to write clearly and combine different threads together.


A paper posted on arxiv a few days ago by Fan et al. claims to have a similar level of accuracy (97.3% on LFW): http://arxiv.org/pdf/1403.2802.pdf

Both methods use deep neural networks, but have a lot of differences, e.g., the Fan et al. paper doesn't use a 3D face model.


May be a stupid question - The Social Face Classification (SFC) dataset that they refer to - Is it published to the world? I wonder if they can deduce "emotions" from SFC dataset and use it as a training set for images in the wild.


Aren't you really glad now you uploaded all these photos to facebook?


I didn't. I joined to keep up with a few friends and family, and they uploaded photos with me in them, nicely tagged.


you can disable their ability to do so.


No, you can't stop them from uploading photos and mentioning your name, even if they didn't specifically put a rectangle on your face.


Reducing error by 25% from the previous best of 96.33% (3.67% error down to 2.75%) gives their stated 97.25% accuracy. About 0.9 percentage points fewer errors. Still amazing, but less impressive than the abstract makes it sound.


Actually, 25% is the right way to judge this improvement. For example, let's say performance was currently at 99.9% and you improve it to 99.99%. That's not a 0.09% improvement (99.99 - 99.9), but rather a ten-fold improvement (.01% errors vs 0.1% errors).

This has to do with the fact that accuracy/errors are not linear.
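The arithmetic in this subthread, as a tiny helper (accuracies in percent):

```python
def relative_error_reduction(old_acc, new_acc):
    """Fraction of the remaining errors eliminated when accuracy moves
    from old_acc to new_acc (both in percent)."""
    old_err, new_err = 100.0 - old_acc, 100.0 - new_acc
    return (old_err - new_err) / old_err

# Previous LFW best (96.33%) to DeepFace (97.25%): about a quarter of errors gone.
print(round(relative_error_reduction(96.33, 97.25), 2))  # 0.25
# The 99.9 -> 99.99 example above: nine tenths of the remaining errors gone.
print(round(relative_error_reduction(99.9, 99.99), 2))   # 0.9
```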


Great name.


It comes from deep learning, which has been all the rage in machine learning for a few years now.



