Could anyone recommend a starting point on Neural Networks for the uninitiated? The parts of this I understood were fascinating, but I quickly realized I was looking up every third word, and not really absorbing much.
If I could only read one thing to gain the technical grounding for this history, what should it be?
It introduces you to some of the underlying principles which haven't changed much over time. I highly recommend it if you want to get deeper intuitions about the principles of CNNs, LSTMs/RNNs, Restricted Boltzmann Machines, etc. Also, Hinton's Coursera lectures, though I'm not sure if you can still access them.
The other suggestions in this thread are quite good. I'll add "Machine Learning: A Probabilistic Perspective" by Murphy. It's not strictly about neural networks, but it's an ML classic and a rigorous introduction to the subject that will give you a principled understanding of the statistical fundamentals. For actual NN implementation, the Karpathy and Nielsen sources are excellent.
Definitely not the starting point for someone who has no clue. Murphy is probably the very last thing one would read before becoming an expert and starting to publish ML papers.
A more accurate title might be "Convolutional Neural Network Architectures" or "Neural Network Architectures for Computer Vision" but still a nice overview!
One thing I'm confused about is that everyone seems to treat "Convolutional Neural Networks" as synonymous with, or as being the thing that enabled, "Deep Learning", but convolutional neural networks are only for image processing, right? Are many-layer ("deep") networks useless outside of image processing? Were there other breakthrough techniques besides convolutional nets that are necessary for deep networks to work well?
While not strictly necessary, these breakthroughs definitely helped: dropout and greedy layer-wise pretraining.
Also, convolutions are not only used in computer vision. For example, AlphaGo used them (the paper is called "Mastering the Game of Go with Deep Neural Networks and Tree Search"). In my opinion, convolutions should be useful whenever your data has a spatial aspect to it.
Convolutional neural networks and many-layered networks are useful for things outside of image processing. CNNs are used for acoustic modeling in speech recognition, and character-convolutional layers are used in language modeling. And pretty much all neural networks in use today, anywhere, are many-layered.
As mentioned in the article, using convolutional layers in ANNs was an idea from the 1980s, but networks that could be trained on the hardware available at the time were never all that competitive until recently. Once we figured out how to train big/deep networks (use GPUs, have lots of data, maybe use pre-training), CNNs started to perform really well. This created a positive feedback loop: as CNNs started to work better, deeper networks in general started to get more attention, which got more people into CNNs, etc.
Are there many-layered deep networks that aren't convolutional neural nets, or are CNNs practically necessary to make deep networks work? Are there specific extra techniques not necessary for CNNs that are necessary to make deep non-convolutional networks work well?
In natural language processing tasks you see a lot of non-CNN architectures. These usually are designed to be able to deal with sequential data, so some kind of "memory" is needed.
Sometimes you see this combined with a CNN. There have been a few question-answering systems that have one or more CNN layers. I don't entirely understand these designs, but presumably the convolutional layers are an attempt to capture the different orderings of words.
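For a rough picture of what a convolutional layer over text looks like, here's a hedged sketch, assuming PyTorch; the vocabulary size, embedding width, and filter count are made up and not taken from any system mentioned in the thread.

```python
import torch
import torch.nn as nn

# A 1D convolution sliding over a sequence of token embeddings; each filter
# looks at a small window of adjacent tokens (here 3 at a time).
embed = nn.Embedding(num_embeddings=5000, embedding_dim=64)   # vocab of 5000
conv = nn.Conv1d(in_channels=64, out_channels=128, kernel_size=3, padding=1)

tokens = torch.randint(0, 5000, (4, 40))   # batch of 4 sequences, 40 tokens each
x = embed(tokens).transpose(1, 2)          # (4, 40, 64) -> (4, 64, 40) for Conv1d
features = conv(x)                         # (4, 128, 40): one 128-dim feature per position
print(features.shape)
```

Because each filter spans a few neighboring embeddings, this is one way such layers can pick up on local word order.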
There are lots of techniques that people use to try to make deep networks work well. Mostly these are about making error backpropagation work better. One of the most successful recent innovations is the ResNet architecture (https://arxiv.org/abs/1512.03385), and the related highway networks.
There are successful deep networks with purely feed-forward, non-convolutional layers. There are also deep stacks of other, more exotic non-convolutional flavors like LSTMs and GRUs, which are particularly useful for sequence-to-sequence tasks like machine translation.
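Here's a minimal sketch of the residual (skip) connection behind the ResNet architecture mentioned above. It omits batch normalization and uses made-up channel sizes, and it assumes PyTorch, so treat it as illustrative rather than the paper's exact block.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    """Simplified residual block: the input is added back to the output of
    two conv layers, giving gradients a short path through the identity."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x):
        out = F.relu(self.conv1(x))
        out = self.conv2(out)
        return F.relu(out + x)   # the skip connection

block = ResidualBlock(64)
x = torch.randn(1, 64, 32, 32)
print(block(x).shape)   # torch.Size([1, 64, 32, 32])
```

Because the output is `out + x`, gradients can flow back through the identity path even when the convolutions' gradients are small, which is a big part of why very deep stacks of such blocks stay trainable.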
Aren't data and processing power the most important things with neural networks? Even if I knew how they worked, as a hobbyist without access to huge amounts of data like companies have, I would have no idea what to do with them that hasn't been done already.
No. As a researcher, you can make it your goal to find/invent the smallest possible architecture for a given task (in terms of number of parameters or number of operations). Alternatively, you can try to invent an architecture that learns faster from data (or requires less data to achieve state-of-the-art results).
Hobbyist researcher? Seems like a more accessible plan anyway; fuck bothering with huge datasets and long training times, just focus on optimizing small architectures.
Yes. This is often, perhaps willfully given the incentives, overlooked by computer scientists (and people trained in that discipline). For a complete, cogent argument see:
Yes, you need data. But the amount depends on your task, and there are pretty significant sources of large amounts of data available online.
For example I was at a presentation where a person built a pretty interesting neural model based on 190k clinical records released via Kaggle. In most fields it is surprising how much data is easily accessible.
> The NiN architecture used spatial MLP layers after each convolution, in order to better combine features before another layer. Again one can think the 1x1 convolutions are against the original principles of LeNet
Why would it be against the original principles of LeNet?
As far as I understood from the description, in LeNet the convolutional layer lets you avoid training parameters that will effectively be doing the same thing as a convolution. Adjacent pixels are highly correlated, so convolutions can capture most of the information in groups of adjacent pixels without having to train a fully-connected layer of neurons. Effectively, you're kinda downsampling the image without losing information.
So, if you're using 1x1 convolutions, I think you're basically having a neuron per pixel, so you're forcing your fully-connected layers to learn the spatial correlations of pixels, instead of capturing that information in a convolutional layer. In other words, you're wasting training on capturing spatial correlations of adjacent pixels instead of other correlations.
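One rough way to see the parameter savings from weight sharing that's being described here: compare a dense layer on a flattened image with a convolutional layer that reuses the same small filters at every position. A hedged sketch, assuming PyTorch and arbitrary layer sizes not taken from LeNet itself:

```python
import torch.nn as nn

fully_connected = nn.Linear(28 * 28, 100)          # one weight per (pixel, unit) pair
convolutional = nn.Conv2d(1, 100, kernel_size=5)   # 100 shared 5x5 filters

def n_params(module):
    # Total number of trainable parameters in the module.
    return sum(p.numel() for p in module.parameters())

print(n_params(fully_connected))  # 78500 (784*100 weights + 100 biases)
print(n_params(convolutional))    # 2600  (100*25 weights + 100 biases)
```

The convolutional layer gets away with so few parameters because the same filters are slid across the whole image, which is exactly how it exploits the correlation between adjacent pixels.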
> So, if you're using 1x1 convolutions, I think you're basically having a neuron per pixel, so you're forcing your fully-connected layers to learn the spatial correlations of pixels, instead of capturing that information in a convolutional layer.
Saying "a neuron per pixel" doesn't mean anything, really, that way of thinking isn't helpful unless you're looking at small multi-layer perceptrons. The right way to think about things is that you have tensors and layers that compute new tensors from old tensors.
A 1x1 convolution only 'sees' the feature channels of a pixel, and does the same thing to each pixel. So a 1x1 convolution on a grayscale input (e.g. a 1x28x28 tensor in the case of MNIST) does nothing, basically, other than scale and bias every pixel by the same linear function. It doesn't "force the network to learn" anything, it's just totally pointless.
One of the uses of 1x1 convolutions is to collapse the feature dimension when you're deeper in the network (e.g. 100 channels to 10 channels) to reduce the number of parameters subsequent layers need to operate on. It's a "channelwise fully connected layer".
I think what you're thinking of (and perhaps what the author was thinking of) is the practice, prior to convnets, of collapsing the image into a vector and then applying a fully connected layer to it. That indeed doesn't exploit the translation invariance of natural images, requires the net to learn the same features separately at every spatial position at great expense, and so on. But that has nothing to do with 1x1 convolutions.
Ah yes, you're right, I was thinking of it that way. Thanks a bunch for your clear and thorough explanation, it makes a lot of sense! So if I understand what you're saying, a 1x1 convolutional layer for collapsing 100 channels to 10 channels would take a 100x512x512 tensor and collapse it to a 10x512x512 tensor?
[Also, sorry for attempting to answer your question incorrectly. I was thinking of putting a disclaimer saying I hadn't worked with CNNs and so might be misunderstanding what the convolutions are doing; probably should have haha]
Maybe when the author was saying 'one can think the 1x1 convolutions are against the original principles of LeNet', he was anticipating my kind of confusion? :)
> So if I understand what you're saying, a 1x1 convolutional layer for collapsing 100 channels to 10 channels would take a 100x512x512 tensor and collapse it to a 10x512x512 tensor?
Correct. As I understand it, this would be applying a 1x1 convolution with 10 filters to a 100x512x512 tensor.
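A quick sketch of that exact transformation, assuming PyTorch (the thread doesn't name a framework):

```python
import torch
import torch.nn as nn

# A 1x1 convolution acting as a "channelwise fully connected layer":
# it mixes the 100 input channels into 10 output channels at every
# spatial position independently.
collapse = nn.Conv2d(in_channels=100, out_channels=10, kernel_size=1)

x = torch.randn(1, 100, 512, 512)   # batch of 1, 100 channels, 512x512
y = collapse(x)
print(y.shape)                      # torch.Size([1, 10, 512, 512])
```

Each output channel is just a learned linear combination of the 100 input channels at the same spatial location, which is why it gets described as channelwise fully connected.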