I have yet to see an illustration that gets the concept of multichannel convolution filters (MCCF) across clearly. Why does the channel stack keep growing from layer to layer? How are the channels actually connected?
The point is that each conv filter consists of one kernel per input channel (that's why first-layer filter visualisations are coloured, btw - a colour image is a "3-channel" image). We convolve each kernel with its corresponding input channel, then sum (that's the key) the responses into a single output channel. Having multiple MCCFs (usually more at each layer) yields a new multi-channel image, say 16 channels, and we then apply a new set of, say, 32 16-channel MCCFs to it (which we can no longer visualise on their own - we'd need a 16-channel image for each filter), yielding a 32-channel image. That sort of thing is almost never explained properly.
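Roughly, in NumPy (the sizes, names, and the use of scipy's correlate2d are just illustrative, not anyone's actual implementation):

```python
import numpy as np
from scipy.signal import correlate2d  # cross-correlation, which is what "convolution" means in most DL code

def conv_layer(image, filters):
    """image: (in_ch, H, W); filters: (out_ch, in_ch, kh, kw) -> (out_ch, H - kh + 1, W - kw + 1)."""
    out_ch, in_ch, kh, kw = filters.shape
    H, W = image.shape[1:]
    out = np.zeros((out_ch, H - kh + 1, W - kw + 1))
    for j in range(out_ch):        # one output channel per multi-channel filter
        for i in range(in_ch):     # one kernel per input channel ...
            out[j] += correlate2d(image[i], filters[j, i], mode="valid")  # ... responses summed (the key)
    return out

rgb = np.random.rand(3, 32, 32)                      # colour image = 3-channel input
l1 = conv_layer(rgb, np.random.rand(16, 3, 5, 5))    # 16 filters, each with 3 kernels -> 16 channels
l2 = conv_layer(l1, np.random.rand(32, 16, 5, 5))    # 32 filters, each with 16 kernels -> 32 channels
print(l1.shape, l2.shape)                            # (16, 28, 28) (32, 24, 24)
```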
Ever seen "word2vec"? Each word is embedded in a high-dimensional "concept" space, where words that are similar to each other will be close by.
A CNN does this for localized sections of an image. Each layer looks over a wider area of the original image (because of the convolution) and embeds the "concepts" in that area in a higher-dimensional space than the layer before it.
Valid-only convolution (in the MATLAB sense) by itself reduces the dimensionality of the input; for images, it goes from (h x w) to (h - kh + 1) x (w - kw + 1) for each plane.
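For example, with scipy's analogue of MATLAB's conv2 (sizes made up):

```python
import numpy as np
from scipy.signal import convolve2d  # scipy's analogue of MATLAB's conv2

h, w, kh, kw = 10, 12, 3, 5
plane, kernel = np.random.rand(h, w), np.random.rand(kh, kw)
print(convolve2d(plane, kernel, mode="valid").shape)  # (8, 8) == (h - kh + 1, w - kw + 1)
```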
You can think of a convnet as a series of feature transformations, consisting of a normalization/whitening stage, a filter bank that is a projection into a higher dimension (onto an overcomplete basis), a non-linear operation in the higher-dimensional space, and then possibly pooling down to a lower-dimensional space.
The "filter bank" (aka convolution) and non-linearity produce a non-linear embedding of the input in a higher dimension; in convnets, the "filter bank" itself is learned. Classes or features are easier to separate in the higher-dimensional space. There are some still-developing ideas on putting all of this on firmer mathematical ground (connections to wavelet theory and the like), but for the most part, it just works "really well".
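A minimal sketch of one such stage, assuming ReLU for the non-linearity and max-pooling (those specific choices are illustrative, not implied by the description above):

```python
import numpy as np
from scipy.signal import correlate2d

def stage(x, filters, pool=2):
    """x: (in_ch, H, W); filters: (out_ch, in_ch, kh, kw). One stage: normalize -> filter bank -> ReLU -> max-pool."""
    x = (x - x.mean()) / (x.std() + 1e-5)          # crude normalization (stand-in for the whitening stage)
    out_ch, in_ch, kh, kw = filters.shape
    y = np.zeros((out_ch, x.shape[1] - kh + 1, x.shape[2] - kw + 1))
    for j in range(out_ch):                        # filter bank: project in_ch planes onto out_ch planes
        for i in range(in_ch):
            y[j] += correlate2d(x[i], filters[j, i], mode="valid")
    y = np.maximum(y, 0)                           # non-linearity in the higher-dimensional space
    c, h, w = y.shape
    h, w = h - h % pool, w - w % pool              # trim so the pooling windows tile evenly
    return y[:, :h, :w].reshape(c, h // pool, pool, w // pool, pool).max(axis=(2, 4))  # pool down
```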
For an image network, at each layer there are (input planes x output planes) convolution kernels of size (kh x kw).
Each output plane `j` is a sum over all input planes `i` individually convolved using the filter (i, j); the reduction dimension is the input plane.
See the linked code for a loop nest that shows what the forward pass of a 2-d image convnet convolution module does. It's gussied up with convolution stride and padding and a bunch of C++11 mumbo jumbo, but you should be able to see what it is doing.
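For reference, a stripped-down Python/NumPy version of such a loop nest (the names and defaults here are illustrative; the code referenced above is C++):

```python
import numpy as np

def conv_forward(x, weight, bias, stride=1, pad=0):
    """x: (in_planes, H, W); weight: (out_planes, in_planes, kh, kw); bias: (out_planes,)."""
    in_p, H, W = x.shape
    out_p, _, kh, kw = weight.shape
    x = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))              # zero padding
    oh = (H + 2 * pad - kh) // stride + 1
    ow = (W + 2 * pad - kw) // stride + 1
    y = np.zeros((out_p, oh, ow))
    for j in range(out_p):                                       # each output plane j ...
        for i in range(in_p):                                    # ... reduces over every input plane i
            for r in range(oh):
                for c in range(ow):
                    patch = x[i, r * stride:r * stride + kh, c * stride:c * stride + kw]
                    y[j, r, c] += np.sum(patch * weight[j, i])   # kernel (i, j) applied to plane i
        y[j] += bias[j]
    return y
```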
A human child learns much more easily: after seeing only a handful of images of a cat, it can recognise almost any kind of cat image as it grows, without ever seeing a million or a billion images. So it seems that, more than the amount of data, what matters is that the "reality" of seeing a real cat includes all possible aspects of a cat? Something seems to be missing in this whole deep learning approach and the way it tries to simulate human cognition.
It is very introductory, just as it's supposed to be for beginners.
I doubt he's trying to be a thought leader; rather, this post looks like notes he made while learning about CNNs and published because they might be useful as a quick start for someone else.
I am new to CNNs/machine learning, but here's my $0.02:
Regardless of which technique you use, it seems that the amount of data required to learn is too high. This article talks about neural networks accessing billions of photographs, a number far beyond the number of photos/objects/whatever a human sees in a lifetime. Which leads me to the conclusion that we aren't extracting much information from the data. These techniques aren't able to calculate how the same object might look under different lighting conditions, viewing angles, positions, sizes, and so on. Instead, companies just use millions of images to 'encode' the variations into their networks.
Imo there should be a push towards adapting CNNs to calculate/predict how the object might look under different conditions, which might lead to other improvements. This could also be extended to areas other than image recognition.
People rarely train on billions of images; we're usually around the scale of a million. This already works quite well in many respects. A back-of-the-envelope calculation assuming about 10fps vision gives ~1B images by the age of 5. And humans aren't necessarily starting from scratch the way our machine learning systems do.
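Back of the envelope (the waking-hours figure below is an added assumption, not from the comment):

```python
fps = 10
waking_hours_per_day = 12            # assumed, not stated above
frames_by_age_5 = fps * waking_hours_per_day * 3600 * 365 * 5
print(f"{frames_by_age_5:.1e}")      # ~7.9e8, i.e. on the order of a billion frames by age 5
```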
It's not clear that people can calculate what an object might look like from different viewing angles; even if they could, it's not clear you would want to in an application; and even if you did, there's quite a bit of work on this (e.g. many related papers here http://www.arxiv-sanity.com/1511.06702v1). At least so far I'm not aware of convincing results suggesting that doing so improves recognition performance (which, in most applications, is what people care about).
That's not a fair comparison. By that time the toddler can also ask questions, generate new labels using adjectives, label novel instances as compositions of previously acquired knowledge and generate sentences representing complex internal states. They are not limited to observed labels. In fact there is very little supervised learning in the form of [item, label, loss]. Beyond that, with enough stimulation and simply from interacting with each other, children can even spontaneously generate languages with complex grammar; without labeled supervision.
They'd also have gained the ability to do very (seriously) difficult things like walking, climbing objects, picking things up and throwing things, along with the rudiments of folk physics. They'd have some rudimentary ability at modeling other agents.
It's good to be happy with current progress, and I do not suffer from the AI effect, but being too lenient can hamper creativity and impede progress by occluding limitations.
Even if we assume that a 5-year-old has seen 1000-1500 pictures of, say, cats in their lifespan, that is still far fewer than the number of images required to train a CNN to label them as accurately as a human can.
And of course, I am not talking about just viewing angles. There are several other factors, but I only mentioned the ones which I could think of.
A human is very good at one-shot learning, but CNNs are actually not too terrible either (and this is also an active area of research, e.g. see http://www.arxiv-sanity.com/1603.05106v2). A human might take advantage of good initialization while CNNs start from scratch. A human might have ~1B images by age 5 (CNNs get ~1M) of continuous RGBD video, possibly taking advantage of active learning (while CNNs see disconnected samples, which has its pros and cons, mostly cons). I.e. CNNs are disadvantaged in several respects but are still doing quite well.
It depends on what we mean by vision. Crows, for example, can do the sort of low-level things CNNs are capable of. But for full visual comprehension, they actively make predictions about physics from a probabilistic world model (learned in part from causal interventions) that feeds back into perception.
I've not yet looked carefully into it, but I expect that sort of feedback should drastically reduce the amount of raw data required. Machines might not (at first) get to build predictive models from interactions, but even our best approaches to transfer and multi-task learning are very constrained compared to the free-form, multi-modal, integrative learning a parrot is capable of, with very little energy spent.
This is good; it means there are still a lot of exciting things left to work out.
And that points to some advantages we still need to do a lot of work to match: mammals and birds can learn online from a few examples per instance, adapt to shifts in the underlying distribution relatively quickly, and do so unsupervised.
This is the right perspective. It seems the OP believes actual photographs are privileged in some way. In reality, any visual input from our eyes counts as training data, as you said.
You seem to forget that the photos are labelled, which counts as supervised learning. What we humans excel at is unsupervised learning, which is difficult for machines. But yes, I agree that humans have the advantage of continuous video access.
Disclaimer: I don't really know what I'm talking about, this is just what I think:
Since we know that things usually don't change instantly (for example, a cat won't suddenly change into a dog), if we assume 10fps vision, 1500 pictures of cats would mean looking at a cat for 2 and a half minutes total in 5 years. And since we know cats won't change into something else, if we see a cat walking somewhere, we'll still know it's a cat, giving us the labels we need for the training.
I think that if we assume 30fps (which still seems kind of low), and we assume that the human looks at a cat for 15 minutes (which still isn't much), that's already 27000 pictures.
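Quick check of the numbers in these two comments:

```python
print(1500 / 10 / 60)    # 1500 frames at 10 fps = 2.5 minutes of looking at cats
print(30 * 60 * 15)      # 30 fps for 15 minutes = 27000 frames
```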
But I argue that those are pictures of the same cat, and most of the images will be very, very similar. Of those thousands of pictures, only a few, noticeably different pictures would matter.
I've seen many examples where networks are trained on thousands or tens of thousands of images.
The most common example is digit recognition with the MNIST dataset. This is a common problem given to beginners, and even many beginners to CNNs achieve human-level accuracy. That dataset is in the tens of thousands of images.
> there should be a push towards adapting CNNs to calculate/predict how the object might look under different conditions
Data augmentation such as rotations, horizontal flipping and random cropping is widespread practice.
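For example, a bare-bones NumPy/SciPy version of those three augmentations (the parameter choices are arbitrary):

```python
import numpy as np
from scipy.ndimage import rotate

def augment(img, rng=np.random):
    """Return one randomly augmented view of an (H, W, C) image: rotation, horizontal flip, crop."""
    img = rotate(img, angle=rng.uniform(-15, 15), reshape=False, mode="nearest")  # small random rotation
    if rng.rand() < 0.5:
        img = img[:, ::-1]                            # horizontal flip
    h, w = img.shape[:2]
    ch, cw = int(0.9 * h), int(0.9 * w)               # random crop to 90% of the original size
    top, left = rng.randint(0, h - ch + 1), rng.randint(0, w - cw + 1)
    return img[top:top + ch, left:left + cw]
```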
I don't have great links for this, but for something less technical, you might look at the blog posts from Kaggle competition winners. Here are a couple of examples: