This is great work for someone who's been dabbling with deep learning. I used to work on OCR systems for postal address recognition and for the most part this is how I would have approached the problem.
A couple of suggestions:
You will almost certainly get better accuracy with a broader array of input fonts. I'm not sure how well you can simulate all of the image conditions that might occur in a real system so adding some real images to the training set (if you can find them - but I suspect there are OCR type training sets available these days) would probably help.
For the speed problem I like the idea of using a simpler net, probably with a coarse-scale image for pre-detection. You can set an appropriate detection threshold to avoid false negatives and thereby avoid running the full net on the vast majority of images which won't contain anything that looks like a plate number.
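Roughly what I have in mind, as a Python sketch (cheap_detector and full_net are placeholders for whatever small and full models you'd actually use, not anything from the project):

    PRE_THRESHOLD = 0.05  # set low enough that false negatives are very unlikely

    def detect(image, cheap_detector, full_net):
        # Run the small net on a coarse (downscaled) version of the image first.
        coarse = image[::4, ::4]                     # crude 4x downscale
        if cheap_detector(coarse) < PRE_THRESHOLD:
            return None                              # almost certainly no plate here
        # Only the small fraction of images that pass get the expensive net.
        return full_net(image)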
Can you elaborate on why a broader array of input fonts would help?
Wouldn't best accuracy come from obtaining the single font that is actually used for the licence plates (or finding the closest match), since in this scenario there isn't any variation as all plates are made according to a single standard?
Even if all the fonts are rigorously identical the viewing conditions won't be. There will be plates seen at various angles, bent plates, dented plates, dirty plates etc.
I think that if you're going to use a net you should try to train it in such a way that it learns to generalize the fundamental shape characteristics of the characters, much like the human visual system does, and that requires training with a wide range of different types of characters, all considered to be of the same class.
If you're convinced that you only need to deal with identical characters you're probably better off using some form of template matching. But I think this is rarely the case for real world systems.
The openalpr project is pretty good, but it uses an older approach with a lot of hand-coded logic instead of the convolutional neural net approach of this project. openalpr is faster than this, at least for now, but I think the neural net has the potential to be faster and more accurate.
The author could take a camera on the road and shoot pics for a day or two, then use OpenALPR to predict the plates and check this dataset to see how accurate it is. If it's good enough, it can be used to train a CNN model with real images of cars and no manual labeling.
I started saving files from my dashcam several months ago with that idea - to use the openalpr results as training data for a neural net, but the results I'm getting from openalpr aren't good enough to use without a lot of manual review.
Hard to tell. Unfortunately neither this nor the openalpr project have any benchmarks or evaluations on a test set. It could be that the openalpr project already gives you 99% accuracy (unlikely), in which case there really isn't a point in trying to optimize it more (e.g. move on to harder problems).
If this is the state of the art, then we needn't be very worried. The demo site shows horrible results - the same plate is recognized 2 dozen times as different numbers every time, over a time span of 2 seconds.
I'm pretty sure there are solutions available that work much better, though; my point is that describing openalpr as 'state of the art' isn't quite accurate.
I think you are misreading it. It is listing the top 10 possibilities that might represent the number plate and the expected accuracy of each one, with the top suggestion being correct.
Yes I got that part, but it still turns out I'm the idiot here - I had set the view mode to 'individual plates' instead of 'plate groups' (I didn't understand what it was for so I was poking around) and then it shows you the separate images of each car, instead of grouping by car and taking the most likely from several attempts, which is what I complained they didn't do in the GP. When I look at the right data, they get all but one of the two dozen or so I checked right. Impressive (and a bit scary).
Typically you can treat the output value as a probability. This is supposed to be the case with a softmax output layer (as is currently recommended for multiclass classifiers) and I believe it should also work with multiple logistic outputs if you normalize the output vector to sum to 1.
In my experience sometimes both the decision and the confidence will be wildly wrong (i.e. a wrong answer with 99% probability) but most of the time this works reasonably well.
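To make "treat the output value as a probability" concrete, a small numpy sketch (the logits here are made up):

    import numpy as np

    logits = np.array([2.1, 0.3, -1.0, 0.5])      # made-up raw outputs for 4 classes

    # Softmax outputs sum to 1, so the max can be read directly as a confidence.
    softmax = np.exp(logits) / np.exp(logits).sum()
    print(softmax.argmax(), softmax.max())

    # With independent logistic (sigmoid) outputs, normalize to sum to 1 first.
    sigmoid = 1.0 / (1.0 + np.exp(-logits))
    normalized = sigmoid / sigmoid.sum()
    print(normalized.argmax(), normalized.max())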
> ...decision and the confidence will be wildly wrong...
Isn't "confidence" measuring the internal state of the net, the probability based on classes known to the net? In that case the answer can be wrong, but "confidence" is never wrong.
Isn't "confidence" measuring the internal state of the net
I don't quite follow your meaning but no, that's not quite how I think about it.
If x is the input image and C is the class selected by the classifier we expect the confidence to be representative of the conditional probability P( C | x ). There is theory that shows this to be true given certain assumptions, but as is very often the case in this field the extent to which those assumptions are valid is rarely clear. So if the net outputs a confidence of 0.9999 we expect there to be only 1 chance in 10000 that the true class is something other than C. If we test the net and find such a confidence associated with an incorrect classification more frequently than predicted then we can say that the confidence is objectively incorrect.
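As a sketch of that test (confidence, predicted and y_true would come from running the net on a held-out test set; this is just numpy, nothing specific to any framework):

    import numpy as np

    def check_confidence(confidence, predicted, y_true, lo=0.99, hi=1.0):
        # Look at all test cases where the net claimed a confidence in [lo, hi).
        mask = (confidence >= lo) & (confidence < hi)
        if not mask.any():
            return None
        claimed = confidence[mask].mean()
        observed = (predicted[mask] == y_true[mask]).mean()
        # If the observed accuracy is well below the claimed confidence, the
        # confidence is objectively overestimated in this range.
        return claimed, observed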
This isn't something I know about, but I would think that for a classification problem, if the right loss function is used then the NN should learn to maximise the chance that the true answer is within the intervals it provides. However...
> If we test the net and find such a confidence associated with an incorrect classification more frequently than predicted then we can say that the confidence is objectively incorrect.
On the test data performance is never going to match the NN's training set performance, so it would always overestimate its confidence.
It sounds like you are describing accuracy, whereas confidence is closer to a measure of precision. You can have class outputs with low confidence (relative to 1) but still have a highly accurate NN - so long as the correct output has a higher confidence relative to the other outputs (even if the difference is 0.0001). If you've got 4 outputs and you're getting results like {0,1,0,0}, then you should really make sure that your net isn't overfit - because that is the most obvious sign of overfitting.
I think you're describing accuracy, as in prediction accuracy.
Confidence is simply how much confidence you have that the predicted class is correct.
This is often important because in a real system you often want to know whether to accept the model's prediction or pass the case on to a more accurate, more expensive model (sometimes a human). You can then select a confidence threshold that gives an acceptable trade-off between false positives and false negatives (i.e. choose an operating point on the ROC curve). If your model just gives a yes/no answer with no measure of confidence, or if you don't know how to interpret the output as a confidence, then you don't have sufficient information to do this.
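For instance, with scikit-learn you can sweep the threshold and pick an operating point (the labels and scores below are made up; in practice they'd come from your test set):

    import numpy as np
    from sklearn.metrics import roc_curve

    y_true     = np.array([1, 1, 0, 1, 0, 0, 1, 0])                    # made-up labels
    confidence = np.array([0.9, 0.8, 0.7, 0.65, 0.3, 0.2, 0.85, 0.1])  # made-up scores

    fpr, tpr, thresholds = roc_curve(y_true, confidence)

    # e.g. pick the operating point with the best true positive rate among those
    # keeping the false positive rate at (or near) zero; everything scoring below
    # the chosen threshold goes to the slower model or a human.
    ok = fpr <= 0.01
    threshold = thresholds[ok][np.argmax(tpr[ok])]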
Over-fitting is often part of the problem but in my experience things aren't as black or white as you seem to imply. There are things you can do to limit over-fitting but I don't know of any simple test that will give you a 0 if you're over-fitting and 1 if you're not.
> Confidence is simply how much confidence you have that the predicted class is correct.
If you mean correct among the number of output nodes, then we are talking about the same thing. If you mean correct among all the possibilities to include those unknown to the NN, then I'd appreciate a link - because I've never seen that before and it sounds incredibly useful.
> ...I don't know of any simple test that will give you a 0 if you're over-fitting and 1 if you're not.
Neither do I, that is why I used all those weasel words - but I also don't know of a neural network that wasn't either being misused or being overfit that returned such results. A simple example would be the 2 inputs for the xor test data, a hidden layer with enough nodes to memorize every permutation, and 2 output nodes - each representing the confidence for the potential result of the binary operation.
On your first point: in my experience, if you use a net with logistic output nodes and you feed it an input that looks nothing like any of the (labeled) training examples, I would expect to see low values for all outputs. I would add the caveat that there may be the odd input that is not an example of any class but still produces an erroneously high confidence output, but this is not common. These are the types of nets that I have the most experience with. I don't have a link for this, only my recollections from when I was working with these types of nets. I did once see a paper that proved the relationship between P(C|x) and logistic output values trained with the squared error loss, but I don't remember if it applied to the case you're talking about, and I doubt I could find it again - it may even be behind a paywall I no longer have access to.
I don't think this holds with a softmax output layer because I think that provides an output vector that is implicitly normalized across all classes. It also obviously doesn't hold if you explicitly normalize the sum of the output vector to 1.
For your second point I'm not sure I disagree with what you're saying, certainly not on the basis of the example you provided before since I would never expect to see 0's and 1's in the output of a net.
EDIT: After thinking about this some more I realized that I may have been mistaken in what I said above. It may be that for an unknown input you would expect to see a high entropy output distribution whose sum is close to 1 but where the average output value is about 1/N (where N is the number of output classes). If this is the case then the logistic and softmax cases would be similar. This still indicates a low confidence though; I wouldn't expect to see outputs in the upper quartile, unlike for a cleanly recognized input where you might see 0.8 or 0.9. I'm going to try this with a net I've trained for MNIST using random inputs to check, but unfortunately I have to retrain the net since I didn't save it, and my laptop's busy with another task that I don't want to stop now, so it may take a while.
OK this isn't a very scientific test but just eyeballing some outputs for a lightly trained net (classification error about 10% whereas a fully trained net can get well below 2%) what I'm seeing is that with the test set inputs every instance I've looked at has had one output > 0.5, most > 0.7. For random images the highest output I'm seeing is 0.27.
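If anyone wants to repeat this kind of check, the gist is something like this (predict is a placeholder for whatever function gives you the per-class outputs for a batch of images; shapes assume MNIST-sized inputs):

    import numpy as np

    def max_outputs(predict, images):
        # Highest per-class output value for each image.
        return predict(images).max(axis=1)

    def compare(predict, test_images, n_random=100):
        # Uniform noise images, same shape as MNIST digits.
        random_images = np.random.rand(n_random, 28, 28)
        print("test images:  ", max_outputs(predict, test_images).mean())
        print("random images:", max_outputs(predict, random_images).mean())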
If the net outputs a probability of 0.9999 but the lower bound of the CI is 0.4, then you don't have to trust it. With linear models you can estimate the CI from the data and the mean squared error of the classifier.
It sounds like you want to stack another probability on top of a probability. The net gives you an estimate of P(C|x) and then you want a probabilistic measure of how accurate an estimate that is. This starts to get a little difficult for me to think about. It seems like you might be able to estimate it by running the net on a test set (disjoint of course from the training set) and applying standard statistical methods to the results, just like you could for any experiment.
EDIT: Maybe it's more complicated than that though because the experiment is itself estimating a probability, not just the value of some random variable. Maybe we need a real statistics expert to look into this.
The net output is overconfident, but this can be (partially) corrected with probability calibration. It's also used for SVM, which is usually underconfident.
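For reference, scikit-learn has this built in (Platt scaling with method='sigmoid', or isotonic regression); a sketch on made-up data:

    from sklearn.calibration import CalibratedClassifierCV
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.svm import LinearSVC

    X, y = make_classification(n_samples=2000, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Wrap the uncalibrated classifier; Platt scaling fits a sigmoid on held-out folds.
    calibrated = CalibratedClassifierCV(LinearSVC(), method="sigmoid", cv=5)
    calibrated.fit(X_train, y_train)
    probs = calibrated.predict_proba(X_test)   # calibrated probabilities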
SVM doesn't give CI either. In fact, it doesn't even give probability estimates per sample unless using Platt's technique whose probabilities don't always agree with the classifier's outputs.
What I mean is a confidence interval on probabilities. That is, if you did the training on 100 different training sets that follow the same distribution as your training set, they would have a range of outputs for a particular sample. The 95th highest and 5th lowest would have specific values. These values can be estimated and are the bounds of the CI.
For example, for logistic regression, you can estimate standard error per coefficient like so: https://en.wikipedia.org/wiki/Ordinary_least_squares#Finite_...
That would give you a vector of standard errors the size of your variable (feature) set. Then you can use that to estimate the overall range of the CI per sample.
This would be useful in the example provided in OP's link. For example, let's say that the classifier says there is 86% probability that the license plate is RBX735.
Can we really trust this 86% probability or is the model primarily relying on some information that is available in only 2 samples out of 1000 that might have been outliers?
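One rough way to get at that question, short of the closed-form standard errors, is to bootstrap the training set and look at the spread of the predicted probability for that one sample (a sketch with made-up data and scikit-learn's LogisticRegression, not the OP's model):

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=1000, random_state=0)
    x_new = X[:1]                          # the sample whose probability we care about

    rng = np.random.RandomState(0)
    probs = []
    for _ in range(100):                   # refit on 100 resampled training sets
        idx = rng.randint(0, len(X), len(X))
        model = LogisticRegression(max_iter=1000).fit(X[idx], y[idx])
        probs.append(model.predict_proba(x_new)[0, 1])

    lo, hi = np.percentile(probs, [5, 95])   # 5th lowest and 95th highest, roughly a 90% CI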
You may be correct technically. I'm not familiar with the formal definition of "confidence intervals" in statistics. Informally we often speak of confidence as an estimate of the conditional probability P( C | x ) where C is the class returned by a classifier and x is the input. That's what I thought the parent comment was referring to but I may have been mistaken.
EDIT: For my own information what would knowing the confidence intervals tell you that P( C | x ) wouldn't ?
If C is binary or categorical, then confidence intervals don't make sense to talk about (although you could still talk about them if C is integer valued), and the equivalent thing is P(C|x). Unless, I guess, you have a posterior distribution over distributions...
The definition of a Q (e.g. Q = 0.95) confidence interval a < X < b is just that P(a < X < b) = Q. It's not even necessary that the interval contain the most probable value of X, but when you actually calculate them you want to tack on extra constraints to prevent stuff like that, like minimising the length of the interval.
If C is binary or categorical, it totally makes sense to talk about confidence intervals. Why wouldn't it? You are trying to figure out if your probability output for a class is reliable for that sample.
The only problem with that project is that the authors "reuse" the "OpenALPR" exact name for a commercial, paid, hosted SaaS version of it. It's devious sneak-marketing that piggy-backs off of the open-source portion's good will.
Totally agreed. A lot of ML posts are "I ran this. This happened." - where what we actually want to read is "This is the problem I had. This is how I broke that problem down into these components, and this is how and why I solved those."
I cloned this a few days ago and had some questions when I tried setting it up, and Matthew's been very helpful when I emailed him, so yes, seconded. Great project!
I started training about 24 hours ago with a single i7 860 (2.8 GHz) and it's still running, but I'm watching the loss going down so I think it might wrap up within a day or two.
The machine I'm using has an Nvidia GTX 970, but I failed to enable CUDA in the TensorFlow installation so it's not using it. When I noticed that I decided just to let it run and then, when it finishes, train it again with the GPU and compare.