Neural Networks Are Impressively Good at Compression

kaffeinecoma · on April 30, 2016

People are knocking this guy for not being an expert and maybe getting some details wrong. Maybe it's a little bit like watching a non-programmer stumble their way through a blog post about learning to program- experienced programmers may cringe a bit.

But I really appreciate these kinds of write-ups: he declares his non-expertise up-front, and then proceeds to document his understanding as he goes along. There's something useful about this kind of blog post for non-experts.

I'm working my way through Karpathy's writeup on RNNs (http://karpathy.github.io/2015/05/21/rnn-effectiveness). I've mechanically translated his Python to Go, and even managed to make it work. But I still don't entirely understand the math behind it. Now obviously Karpathy IS an expert, but despite his extremely well-written blog post, a lot of it is still somewhat impenetrable to me ("gradient descent"? I took Linear Algebra oh, about 25 years ago). So sometimes it's nice to see other people who are a bit bewildered by things like tanh(), yet still press on and try to understand the overall process.

And FWIW I had the same reaction as the author when I started toying around with neural nets- it's shocking how small the hidden layer can be and still do useful stuff. It seems like magic, and sometimes you have to run through it step-by-step to understand it.

karpathy · on April 30, 2016

Sorry about that! There's a lot to cover for one blog post to do satisfyingly. I encourage you to check CS231n for a more thorough treatment where we also discuss, for example, the tradeoffs of different activation functions like tanh(), have a more gentle introduction on gradient descent, I devote a whole lecture to char rnn, assignment #1 (they are available) would demystify the backward pass, etc.

Also definitely +1 for not putting down people who write similar posts. I encourage everyone who is trying to learn to do it through blog posts because it lets you explain/organize thoughts. I also enjoy reading them quite a bit because it illustrates the kinds of conceptual problems beginners face (which is not at all obvious once you've been in the area for a few years). And it's also interesting to see many different interpretations of the same concepts, as everyone has different background and the way they reason through things is usually quite unique. Granted, this one could have been named something more appropriate!

kaffeinecoma · on April 30, 2016

No need to apologize- I learned SO much from your blog, thank you. I didn't realize the course was online (https://www.youtube.com/watch?v=NfnWJUyUJYU). Also, looks like there's a subreddit for it as well: https://www.reddit.com/r/cs231n

It's really wonderful that all of this is freely available, thank you.

jdminhbg · on April 30, 2016

The lecture that covers gradient descent in the Youtube list you linked there is the first time gradient descent actually clicked for me, and I made it through the entire Andrew Ng Coursera ML course. Highly highly recommend it.

andycjw · on May 5, 2016

the video became private, anyone know the title of the video or is there another copy of it somewhere else?

stavros · on April 30, 2016

> People are knocking this guy for not being an expert and maybe getting some details wrong.

I think this style of teaching has great value. Someone who's learning something themselves is the person most suitable to teach it to others, since they know exactly what a novice user doesn't know. For example, I wanted to write something up for monads the other day, since it's a simple concept that's made super confusing by people who dive into mathematical notation right away. The downside with this approach is that the novice lacks experience, so what they're learning may not be entirely accurate.

I think the best approach is a hybrid: Someone who is learning the material explains it, and someone who already knows it points out mistakes. In this case, HN can serve as the expert, and we all end up with a very informative post.

tobz · on April 30, 2016

One of my great regrets with leaving my last job was that: I had little FP experience, and we had one guy with a lot of it. We had done informal teaching sessions, and had planned to try to write blog posts / record podcasts of our little sessions, precisely for the reasons you mentioned, hoping that it would help others absorb the material more easily.

stavros · on May 1, 2016

Why can't you still do it? I'd read that.

tobz · on May 2, 2016

There's probably no reason I can't still do it, aside from physical distance problems and getting the free time lined up on both sides, which was far easier when we could just schedule it during working hours as 20% time.

If we/I ever do it, I'll make sure to send you a link. :)

pvnick · on April 30, 2016

Nando De Freitas has a great youtube channel with videos that you might find helpful, including this one in particular on unconstrained optimization: https://www.youtube.com/watch?v=QGOda9mz_yA&list=PLE6Wd9FR--...

pmarreck · on April 30, 2016

FYI, gradient descent is covered in one of the very first weeks of Andrew Ng's Coursera machine learning class, so perhaps just watch those lessons (free)

Gradient descent is the approximation solution basically because getting the exact solution requires a good computation of inverse matrices which is apparently not yet doable (it's too slow)

toxik · on April 30, 2016

I think the reason people do gradient descent is that the datasets are too large to solve for all inputs simultaneously. It isn't impossible in theory, really.

ctandre · on May 1, 2016

Do you mean to say that it is possible to design your parameters over all inputs without gradient descent? I'm somewhat confused, as I think that that would not be possible in the general case (e.g. nonlinear problems are hard to crack without resorting to an iterative procedure like gradient descent). I can see that gradient descent might still make sense for problems that do have clean analytic solutions (if that's what you meant), as those solutions often turn out to be junk at scale. Linear regression is a good example, as it has a nice closed form expression if the solution exists. But the complexity scales poorly as the naive implementation requires a matrix inversion, so a different method might be employed for a large problem - gradient descent could be a candidate.

I think gradient descent is attractive because it's a memoryless process at the batch level - you can process training data in batches instead of processing the entire dataset in one go, without any explicit tracking of the previous batch history. This is a great feature when the scale of your dataset is mind-boggling. I think this is what you were suggesting?

nshm · on May 1, 2016

Strictly speaking if you split the parameter set on batches and iterate over batches optimizing each set of parameters with a gradient, it is not strictly a gradient decent, it is more a combination of coordinate decent (because you select the subset of coordinates to optimize first) and a gradient decent.

ctandre · on May 1, 2016

Ah yes - that sounds like the stochastic gradient descent I've been hearing about. That makes a lot of sense for very expensive models. Thanks for the response nshm - I've recently taken an interest in ML (coming in with some familiarity with optimization), and it's much appreciated to have some 'REPL' in the learning process.

SixSigma · on April 30, 2016

http://cs224d.stanford.edu/syllabus.html

covers RNN on lec 8

svantana · on April 30, 2016

I thought this would be about something like giving the Hutter Prize [1] a go using character RNNs [2]. Instead, it's a somewhat confused "gentle introduction" to neural nets (which there are plenty of already, of higher quality) and compression is sort of handwavely discussed, not properly with bits and entropy like us information theorists would have it :)

[1] http://prize.hutter1.net [2] https://github.com/karpathy/char-rnn

walrus · on April 30, 2016

It's been done before. Alex Graves achieved 1.33 bits per character using stacked LSTMs.[1] The current Hutter Prize record is 1.28 bpc. Note, however, that the 1.33 bpc value was calculated on a held-out validation set and doesn't include the network weights.

[1] http://arxiv.org/abs/1308.0850

twotwotwo · on May 1, 2016

Man, the RNN-generated Wikipedia data around page 14 of http://arxiv.org/pdf/1308.0850v5.pdf is fascinating.

Some things you'd expect from a modern language model--like, your phone keyboard's predictor already generates some grammatical-looking phrases, and there's only room to do so much in a phone. I'm more impressed that (as the author notes) it gets syntax right, including matched pairs over long distances, occasionally makes words up using plausible components, suffixes, etc. ("quanting"), and learns other quirks (e.g. there's Cyrillic after the [[ru:]] link and some Hangul after the [[ko:]]). Those aren't intuitively things you'd expect from applying arithmetic to a bunch of characters.

blazespin · on April 30, 2016

It's an interesting idea though, as sophomore as the attempt was. I'd be curious to see images lossy compressed via a neural net. Do you know of any better attempts than this?

petra · on April 30, 2016

Not excatly what you've asked, but even more interesting:There's a startup that uses deep learning to take a compressed video and knows how add details and improve the quality, just by understanding what we are expecting.

Can't find the startup's name though.

hellameta · on May 1, 2016

http://www.piedpiper.com/

fegu · on May 1, 2016

The middle out Compression company

kaffeinecoma · on April 30, 2016

A "kind of" compression using polygons: https://rogeralsing.com/2008/12/07/genetic-programming-evolu...

I believe it was discussed here a few years back.

tacos · on April 30, 2016

Naive backprop is just another way of expressing the same algorithm. This corollary was actually used to analyze early nets in the 90s. I can't pull up papers for some reason, they're out there though.

jg8610 · on April 30, 2016

It's good to see people write up their experiments, it's useful for the rest of us to test how we understand neural nets.

I think there are a few mistake in your maths though. You can learn a 1-1 discrete mapping through a single node where you are using a one-hot vector. You just assign a weight to each of the input nodes, and then use a delta function on the other side. If I understood correctly, this is what you are doing.

Also, if you use a tanh in your input layer, but keep a linear output layer (as you start off with), you are still doing a linear approximation because you have a rank H (where H is the hidden layer) matrix that is trying to linearly approximate your input data. This is done optimally using PCA.

I'd second the advice to look into the coursera courses, or the nando de freitas oxford course on youtube (that actually has a really nice derivation of backprop).

brendanofallon · on April 30, 2016

I think some of his statements are a little off-base, for instance, regarding the choice of tanh() for the activation function he says: "But mostly it’s 'because it works like this and it doesn’t work when we change it.'" People have spent a fair amount of time investigating the properties of different activation functions, and my understanding is that tanh() is generally not in favor, and ReLU (rectified linear) is probably a better choice for many applications, for reasons that are well understood. Maybe the author isn't all that familiar with the field?

tgflynn · on April 30, 2016

It depends on who you read. LeCun certainly advocated tanh for a long time. ReLU seems to be currently favored, especially for deep nets but I'm not sure it's being used universally. In my own experiments I am seeing it train faster than logistic, but not necessarily give better results (maybe because you don't want the net to train too fast to reach the best optimum).

His overall point that the field is largely empirical with little solid theory to guide one's choices is, in my opinion, not wrong. It may be that a general theory doesn't exist and different network architectures and hyperparameters are needed for different types and sizes of problems.

argonaut · on May 1, 2016

It doesn't depend on who you read. ReLUs (and their variants) are virtually universal now. LeCun advocated tanh but that was years and years ago before ReLUs came out (they came out in 2011) and he advocated ReLUs as early as three years ago (http://www.cs.nyu.edu/~yann/talks/lecun-ranzato-icml2013.pdf). Empirically they give better results and are faster.

(I say "almost" because while I haven't seen a single state of the art result without ReLUs, I'm sure there is probably a random paper out there)

tgflynn · on May 4, 2016

As far as I can tell LSTM's are still using a combination of logistic and tanh transfer functions, at least through 2014 and some of the most impressive deep learning results have been obtained with these.

For example: http://arxiv.org/abs/1308.0850

walrus · on April 30, 2016

In my totally unscientific experiments, tanh worked better on shallow networks.

astazangasta · on April 30, 2016

Lossy compression. Reminds me of this hilarious bit: http://web.archive.org/web/20050402231231/http://lzip.source...

algorithm314 · on April 30, 2016

Neural networks are used for over a decade in Context Mixing compressors like paq https://en.wikipedia.org/wiki/PAQ

pigscantfly · on April 30, 2016

I was expecting to read about some experiments with autoencoders here, not this tutorial. I'm not sure how the author is learning about neural nets or where they are now, but anyone at this level of knowledge who wants to know more would be well served by going through Geoff Hinton's class on Coursera.

jcoffland · on April 30, 2016

This is not compression. The author focuses on the small number of nodes and calls it compression but it's the edges that encode the information and there are a lot of them.

zardeh · on April 30, 2016

>If you count the number of connections on that last picture you will notice that there are fewer than there were in the first network. There were 55=25 at first, now there are 52*2=20. That reduction would be larger if I had more input and output nodes. For 10 nodes that reduction would be from 100 connections down to 40 when inserting two nodes in the middle. That’s where the compression in the title of this blog post comes from.

tgflynn · on April 30, 2016

You can get (lossy) compression by using an autoencoder with a hidden layer that has fewer nodes than the number of inputs, since then you only need to store the hidden layer excitations for each example. The weights and network architecture itself are part of the compressor, not the compressed data.

As for how well this works compared to other approaches, I don't know, and this blog post certainly doesn't address the question.

frozenport · on April 30, 2016

How can the author draw conclusions without citing any code or experiments. For example, if we try to use the features DeepDream leaned we can see that our image will have terrible compression artifacts (faces of cats). DNNs do dimensional reduction, it is not clear if this correctly preserves image features.

IamFermat · on April 30, 2016

This is really cool. Totally can see a way where the algo would adapt based on object getting compressed

john_reel · on May 3, 2016

ZPAQ and pngwolf have very nice approaches to learning algorithms for compression. Especially ZPAQ.

rdlecler1 · on April 30, 2016

I'm sure with perturbation analysis you could also remove even more edges from the ANN.

tacos · on April 30, 2016

This is both a fascinating read of an obviously bright and clever fellow and also a TERRIBLE way to approach building intuition around the emergent behaviors of neural networks. If you want to learn about them without repeating the terrible mistakes this guy is about to make if he continues down this path, pick up a book.

CardenB · on April 30, 2016

What sort of terrible mistakes are you referring to?

tacos · on April 30, 2016

A fundamental misunderstanding of the space, focusing on the wrong things, thinking he can visualize (or should visualize) the mechanics, hyperfocus on topologies, misunderstanding of key distance functions, sigmoids, training/overtraining, add buzzword here.

Right now he's a chef who obsessively spent six months thinking about knives. But it's about the meat and the heat, not the chef knife.

Just pick a problem and start playing with it! If you lean to this sort of pedantic "I MUST UNDERSTAND ALL FUNDAMENTALS" (I do!) I still recommend Matlab (available for $49-$99 in student editions) because the tools are so good and the docs are so damn readable. This post is basically a matlab doc page minus the links to the stuff that actually works.

eli_gottlieb · on April 30, 2016

>Just pick a problem and start playing with it! If you lean to this sort of pedantic "I MUST UNDERSTAND ALL FUNDAMENTALS" (I do!) I still recommend Matlab (available for $49-$99 in student editions) because the tools are so good and the docs are so damn readable.

Don't pay to torture yourself! Use Python instead! Numpy and scipy are the standard libraries for most scientific code nowadays, and you'll be able to use Theano, Caffe, TensorFlow, PyMC, scikit-learn, pybrain, and loads of other machine-learning and statistics libraries with Python interfaces!

tacos · on April 30, 2016

+1 for python however good luck building fundamentals given the sea of chaotic products. The whole point of the Matlab recommendation is that, like Amazon S3, they haven't really touched the damn NN toolkit in a decade. While y'all were tinkering with MVC apps in 2007 some of us were researching and publishing and laying the foundation for the nine different open source learning libraries that can't agree on terminology or workflow patterns or which cloud provider they are clandestinely pushing you toward.

That said, I deploy using Python and several of these libraries (and more interesting unmentioned ones). But often, Matlab for exploratory dev! I re-recommend it for getting started.

eli_gottlieb · on April 30, 2016

>While y'all were tinkering with MVC apps in 2007 some of us were researching and publishing and laying the foundation for the nine different open source learning libraries

I was a kid in 2007, and I'd prefer that you didn't insinuate I developed MVC apps!

Right now I'm trying to get transfer-learning to show up correctly in a Hakaru model translated from a Church model so I can do some extra information-theoretic measurements.

thatcat · on April 30, 2016

>more interesting unmentioned ones

care to share more info?

tacos · on April 30, 2016

NN is flavor of the month at HN but these core toolkits don't really do much. They're all running the same Fortran code and MIT FFT libraries Matlab's been using for 20+ years. It's the base that the cool experiments happen on top of.

Once you dig in and try to actually do something, you stumble on the cool stuff. It's often Python throwaway or Matlab spaghetti code in a .edu directory that starts with a tilde. And it disappears when you need it most.

(If it's Python, it's guaranteed to be something wonky that uses 2.7 if you're on 3.x or vice versa, or requires three additional packages that throw obscure build errors on your distro. And if it's Matlab, it inevitably requires you to shell out another $299 for some stupid toolbox just to generate white noise...)

Good times. I won't kill your fun by being overly specific. Dig in!

argonaut · on May 1, 2016

Nobody uses MATLAB in deep learning... 90%+ of the neural net knowledge from the 90s is pretty much obsolete.

tacos · on May 3, 2016

A quick Google pulls up tons of recent papers and GitHub projects.

But more to the point of this discussion: the celebrated online class that builds the first principles that the creator of this article flubbed uses Matlab (and Octave).

deepnet · on April 30, 2016

Octave is a free (GPL) MATLAB alternative.

Andrew Ng's introductory Machine Learning Coursera uses Octave to teach programming a simple Neural Net that recognises handwritten digits (MNIST).

tacos · on April 30, 2016

The course also uses Matlab. Perhaps we can avoid the Octave vs Matlab debacle by linking to this previous discussion.

https://news.ycombinator.com/item?id=10232595

Consensus (I'll use the term loosely) was that you get what you pay for. Either spend the bucks on Matlab or preferably use Python. I'll dare to provide a stronger opinion: Octave just sucks and is one of those unfortunate GNU tools that forever seems stuck 15 years in the past. Many people moved to Python, we're all eyeing Julia, the rest of us are largely stuck in Matlab.

deepnet · on May 1, 2016

I really liked Octave and used it for a lot of prototyping - it was certainly sufficient for my needs and the maintainers are helpful & friendly and were unofficial TA's during the 1st run of Professor Ng's Coursera course.

Given the range of finances of a MOOC class of 160,000, Octave's price (free) was vital for some to participate.

Matlab looks really cool, but I feel it unfair to advocate un-Free software if my students can't take it home. I never got a chance to play with Matlab so can't really comment.

As you suggest I graduated from Octave to Python, sometimes Lua & Torch7 and then the subsets of C for CUDA & OPENCL's GPU parallelism.

Javascript is so universal accessible a platform that I like to give demonstrations using Karpathy's ConvnetJS & ReinforceJS as everyone can explore it straight away using the browser console as REPL.

Professor Ng's MOOC students managed a good fundraiser for Octave development so those that could, paid something.

tacos · on May 1, 2016

If exploring neural nets in a REPL is the spark that finally lights a fire under Octave, I'm all for it! But I think they already lost the battle to Python.

> Matlab looks really cool, but I feel it unfair to advocate un-Free software if my students can't take it home.

Why is it your job to advocate anything? Inform them of what's available. Challenge yourself to make your lessons applicable to a broader range of tools, including ones your students might craft themselves. Let them know they can get the job done with GIMP but there's Photoshop at $9.99/month that might save them time and make them more employable.

$1999 MacBook, $649 phone, but somehow $99 for a critical piece of software is unacceptable. And people wonder why students have debt but can't find jobs.

venomsnake · on April 30, 2016

>Right now he's a chef who obsessively spent six months thinking about knives. But it's about the meat and the heat, not the chef knife.

I can only guess you never learned about sushi. Or any other cuisine in which high precision cuts are the foundation/hallmark of excellence. Good chefs are obsessed over their knives. With good reason.

tacos · on April 30, 2016

Good chefs are obsessed with technique. This transcends cultures--from French to Japanese. Moreover, they actually prepare food.

In contrast, enthusiasts spend six months obsessing about different types of knives. And then write a blog post detailing their thought process. Which, like this post, is often meaningless because it's a guy writing about knives who's never cut into anything.

tobz · on April 30, 2016

Meaningless to who? Professional chefs who already know which knives they need and want to use? No shit, they're professional chefs and this guy is a just an enthusiastic. They wouldn't be reading this post saying "nice, I can finally figure out which knife I need!".

Reading a few of your comments in this thread, it seems like you're heavily involved in this area of technology/research, but it's depressing that you're making comments that range from passive-aggressive to condescending to dismissive. Further, you've yet to actually offer constructive criticism about the post, or even detailed criticism, one way or another.

tacos · on May 1, 2016

I can't fix this guy... he's off in the weeds doing his own thing.

However I can provide constructive input by providing broader perspectives and alternate viewpoints. This technically weak, mis-titled post sat at the top of HN for most of a day. It's now a top hit on Google when you search "Neural Net Compression" -- ranked above classic research papers from the 80s and 90s. That's the depressing part.

pjbrunet · on April 30, 2016

I suppose that one episode of POI (the chain of laptops on ice with Pink Floyd) could have implied a neural network of sorts, manifesting at the hardware level. If you remember the episode, they had to compress the AI to fit in a briefcase.

robbiep · on April 30, 2016

Ok so 3 layers increases the complexity but simplifies the connections. The human cortex has 6 layers.

Now create 6 layers, classify different sets of inputs as 'different' to represent different neurochemicals (you need several excitatory and several inhibitory and then a couple of very small master neurochemicals that have major excitatory and inhibitory responses to represent the dopamine network and a whole system for the amygdala), cluster different groups to either respond to inputs or create outputs, and set it loose on an environment. How close would we come to something that behaves as if conscious?

zamalek · on April 30, 2016

> How close would we come to something that behaves as if conscious?

The human brain is recursive in nature - the neurons and synapses form a cyclic graph. You can't do that with a simple perceptron.

otempomores · on April 30, 2016

If the human brain is recursive... whats stopping it from going in infinite circles ... Chemistry? If you inhibit that reaction..do you crash? can you canbecome a master of a field by having the specialized neurons keep there cycle support ed?

__s · on April 30, 2016

Cyclic graph with distributed/parallel execution means infinite loops don't halt the entire system. Some specialized neurons being tightly recursive don't generate new information, thus wouldn't build expertise, but obsession can be fruitful