2) the 1000 most common lego parts, plus 'other' and 'mess'. The eventual idea is to get to 20K classes and to sort directly into sets. That is very much a pipe dream at the moment, but I think it is doable given a large enough set of samples. The problem is that the machine has to see each of those parts at least a hundred times or so before they are detected reliably.
3) too little :( The training data is still woefully insufficient, but it is now good enough to bootstrap the rest. This took a while to achieve because without any sorted lego to begin with you have nothing to train on. So the first 20 kg or so were sorted by hand and imaged on the sorter without any actual sorting happening (everything went into the run-off bin), and then the results were labeled by hand until the accuracy on the test set (500 parts or so) went over 80%. That was a week ago and since then it has been improving steadily day by day.
4) one training run per night, typically a few hundred epochs on the current set, but this will change soon. The machine is now expanding the training set rapidly, with an associated improvement in accuracy. This means the training sessions are taking longer and longer, but I'll be running fewer of them. What I'll probably do is offload training to one machine, which will drop off a newly trained network once per week or so, and do inference on another, which does the sorting and captures the new training data.
Checking the logged images for errors still takes up a bit of time, but with the current error rate that is quite manageable. (Before, it was an endless nightmare.)
For more training data, I wonder if you could model Lego parts in SketchUp or some other 3D program, then render them in a 'scene' similar to your camera setup using a renderer like Maxwell or V-Ray. Then you might be able to generate unlimited numbers of sample images to train on.
I'm doing a similar experiment now, training a model to read the 7-segment LCD display of a blood pressure monitor. To do it I separated out each segment of the display as a mask in Gimp/Photoshop, and then I can create my own images by overlaying the masks on top of an image of a blank LCD display. That gets me basically unlimited training photos.
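In rough Python/Pillow terms, the compositing looks something like the sketch below; the file names, alignment and digit encoding here are illustrative assumptions, not the actual code:

    import random
    from PIL import Image

    # Standard 7-segment encoding: which segments (a-g) are lit per digit.
    SEGMENTS = {
        0: "abcdef", 1: "bc", 2: "abdeg", 3: "abcdg", 4: "bcfg",
        5: "acdfg", 6: "acdefg", 7: "abc", 8: "abcdefg", 9: "abcdfg",
    }

    # A photo of the blank display plus one pre-aligned RGBA mask per segment.
    blank = Image.open("blank_lcd.png").convert("RGBA")
    masks = {s: Image.open(f"segment_{s}.png").convert("RGBA") for s in "abcdefg"}

    def render_digit(digit):
        """Composite the lit segments for one digit onto the blank display."""
        img = blank.copy()
        for seg in SEGMENTS[digit]:
            img.alpha_composite(masks[seg])
        return img

    # Emit one labeled synthetic training image.
    label = random.randint(0, 9)
    render_digit(label).save(f"sample_{label}.png")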
If you could render the 3D parts from various angles, in various colours, etc., then something similar might be possible.
Also, you said you're doing a modified VGG with 20k classes. That works, but another thing to maybe try is using binary_crossentropy as the loss function and a sigmoid (instead of a softmax) on the final activation layer, to be able to do multi-label classification.
Then your labels could be a vector of shape possibilities, colour possibilities, or whatever else you could divide your 20k classes into.
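A minimal sketch of that change in tf.keras (the tiny backbone and tag count here are placeholders, not the project's actual network):

    from tensorflow import keras
    from tensorflow.keras import layers

    num_tags = 64  # e.g. shape tags + colour tags instead of 20k exclusive classes

    model = keras.Sequential([
        keras.Input(shape=(224, 224, 3)),
        layers.Conv2D(32, 3, activation="relu"),  # stand-in for the VGG backbone
        layers.GlobalAveragePooling2D(),
        # Sigmoid gives an independent probability per tag, so several tags
        # can be active at once, unlike a softmax over mutually exclusive classes.
        layers.Dense(num_tags, activation="sigmoid"),
    ])

    # binary_crossentropy treats each output unit as its own yes/no decision.
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=[keras.metrics.BinaryAccuracy()])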
I've tried the rendering trick, but it didn't work well enough; the real pictures give much better results when used on unseen data.
> Also, you said you're doing a modified VGG with 20k classes. That works,
Right now there are 1002 classes: the 1000 most common lego parts, plus 'mess' and 'other'.
> but another thing to maybe try is using binary_crossentropy as the loss function and a sigmoid (instead of a softmax) on the final activation layer, to be able to do multi-label classification. Then your labels could be a vector of shape possibilities, colour possibilities, or whatever else you could divide your 20k classes into.
Tagging/multi-label classification is useful because it'll help tame the explosion of classes if you want to expand. For example, it can handle stuck-together parts by tagging them as both parts rather than putting them into a generic 'other' class, and you could include separate tags for colour, fakeness or damage, avoiding the need for 100,000 categories like 'fake damaged red square brick'. It might also improve learning, since it's a more natural way of describing the data.
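A toy encoding to make that concrete (the tag names here are invented):

    # Toy illustration of multi-hot tag labels; the tag names are made up.
    TAGS = ["brick_2x4", "plate_1x2", "red", "blue", "damaged", "fake"]

    def encode(active):
        """Multi-hot encoding: 1 for each active tag, independent of the rest."""
        return [1 if t in active else 0 for t in TAGS]

    # Two stuck-together parts simply carry both shape tags plus a colour tag:
    print(encode({"brick_2x4", "plate_1x2", "red"}))  # [1, 1, 1, 0, 0, 0]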
By expanding the training set, do you mean that each new sample is added to the next training set by default?
Do you use artificial augmentation of the training set (random rotations, translations)?
Somewhat related aside: I have also worked on a classification task (albeit a much simpler one): detecting the direction of grain in a piece of wood. I built the first version by manually extracting features (essentially, a few direction-sensitive Gabor filters) so that I could collect a training dataset for a CNN.
It turned out that the accuracy of the manual version was more than enough (~98%), so I didn't get to play with the fun stuff :(
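For anyone unfamiliar with the approach, a rough OpenCV sketch of the hand-crafted version might look like this (the filter parameters are illustrative, not the production values):

    import numpy as np
    import cv2

    def grain_direction(gray, n_orientations=8):
        """Pick the Gabor filter orientation with the strongest response."""
        responses = []
        for i in range(n_orientations):
            theta = i * np.pi / n_orientations
            kern = cv2.getGaborKernel(ksize=(31, 31), sigma=4.0, theta=theta,
                                      lambd=10.0, gamma=0.5)
            filtered = cv2.filter2D(gray, cv2.CV_32F, kern)
            # Variance of the filtered image ~ strength of that orientation.
            responses.append(filtered.var())
        return np.argmax(responses) * np.pi / n_orientations

    # Usage: pass in a grayscale board image, get back an angle in radians.
    # angle = grain_direction(cv2.imread("board.png", cv2.IMREAD_GRAYSCALE))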
> By expanding the training set, do you mean that each new sample is added to the next training set by default?
Every part that passes through the machine is logged and will be made part of the training set for the next round of training.
> Do you use artificial augmentation of the training set (random rotations, translations)?
Yes, quite a bit. It's not of the same quality as really having more samples, though it is definitely useful.
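Such on-the-fly augmentation could be set up along these lines in Keras (an illustrative sketch; the parameter values and directory layout are not from the actual sorter):

    from tensorflow.keras.preprocessing.image import ImageDataGenerator

    datagen = ImageDataGenerator(
        rotation_range=180,      # parts can land in any orientation on the belt
        width_shift_range=0.1,   # small random translations
        height_shift_range=0.1,
        horizontal_flip=True,
        fill_mode="nearest",     # fill pixels exposed by the transforms
    )

    # flow_from_directory expects one sub-directory of images per class.
    train_iter = datagen.flow_from_directory(
        "training_images/", target_size=(224, 224), batch_size=32)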
> Somewhat related aside: I have also worked on a classification task (albeit a much simpler one): detecting the direction of grain in a piece of wood. I built the first version by manually extracting features (essentially, a few direction-sensitive Gabor filters) so that I could collect a training dataset for a CNN.
> It turned out that the accuracy of the manual version was more than enough (~98%), so I didn't get to play with the fun stuff :(
Thanks for the responses! I have a feeling that projects involving vision will become more common with the falling cost of cameras and image processing, so I'm interested to learn from others' experiences.
> Slick! Do you use that to determine orientation prior to lamination so the laminate does not warp when it ages?
In this specific case, to reduce surface defects during planing. Interestingly, the client forgot to specify this requirement when we designed the equipment that feeds the planer. Feeding the boards in the correct orientation was something the operators had learned from each other as they improved reject rates at the downstream scanner, but nobody in management knew about that.
1) What's the input image resolution?
2) How many classes do you have?
3) How many samples per class did you need to achieve acceptable accuracy?
4) How long did the training take? How many epochs did it require?