Hacker News
Don't use deep learning when your data isn't that big (simplystatistics.org)
79 points by simplystats2 on May 31, 2017 | 35 comments



If you are Google, Amazon, or Facebook and have near infinite data it makes sense to deep learn. But if you have a more modest sample size you may not be buying any accuracy

The author explores sample sizes up to 85, and then suggests this is the relevant range except at Google, Amazon, Facebook, etc.

But the VAST majority of people considering deep learning have sample sizes between those extremes. Results on small samples are interesting, but it's disingenuous to market this as typical of the world outside Google.


Yeah, what a dog argument. It basically shows either that the author has little comprehension of ML or that he happened to design a terrible demonstration of small-data issues.


> For low training set sample sizes it looks like the simpler method (just picking the top 10 and using a linear model) slightly outperforms the more complicated methods.

This is a very bad argument for the given clickbaity headline. A methodology that works well for one dataset with few observations will not necessarily work well for another dataset.

You can do almost whatever you want with small datasets, it's just harder than with big data (and is necessary if obtaining data is expensive, e.g. medical trials). Specifically, you'll want to do bootstrapping to simulate additional data and reduce the uncertainty due to a low amount of data.

The "almost" is that you can't have hundreds of features if you have a small dataset (Curse of Dimensionality: https://en.wikipedia.org/wiki/Curse_of_dimensionality)


The benefits of deep learning have more to do with the number of features than the size of the dataset; e.g. when you are dealing with million-pixel images, you need a deep net to extract useful higher-level features automatically. From there, yes, more data is better, but a better post in this vein would be "Don't use deep learning when your data doesn't have that many features".


I'd say the benefits come from the information content of the features or lack thereof. When you have uninformative features like pixel colors or word identities, there's nothing for traditional methods to work with. You have to start with feature engineering and pruning before decision trees or linear classifiers have a chance.

Most of the wins under the "deep learning" umbrella come from extracting meaning from homogeneous features like "the pixel at x-2,y+1 has red=123" or "the word at n+1 is 'king'". That's why we see latent variable embeddings like word2vec come from the DL world even though they're not deep.

When you want to include highly informative features in a deep network, it's often better to feed them into a separate logistic model, as shown in the Tensorflow wide-deep tutorials.
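
Roughly what that split looks like in Keras, as a hedged sketch (the input shapes and layer widths are made up, and this is not the TensorFlow tutorial code itself):

    # Hypothetical sketch: route hand-picked, informative features through a
    # linear ("wide") path and raw, low-level features through a deep path,
    # then combine them for the final prediction. Sizes are arbitrary.
    import tensorflow as tf

    raw_features = tf.keras.Input(shape=(784,), name="raw")       # e.g. pixels
    expert_features = tf.keras.Input(shape=(12,), name="expert")  # e.g. engineered signals

    # Deep path: learn higher-level features from the raw inputs.
    h = tf.keras.layers.Dense(256, activation="relu")(raw_features)
    h = tf.keras.layers.Dense(64, activation="relu")(h)

    # Wide path: the informative features go almost straight to the output.
    combined = tf.keras.layers.Concatenate()([h, expert_features])
    output = tf.keras.layers.Dense(1, activation="sigmoid")(combined)

    model = tf.keras.Model(inputs=[raw_features, expert_features], outputs=output)
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])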


Hmm I would refrain from saying DL is useful only when approaching 1 megapixel images.

State-of-the-art performance on MNIST is held by a 6-layer convnet (4 convolutional layers, 2 FC layers). MNIST is just 28 x 28 grayscale images, so 784 dimensions. There are many more datasets on the same order of dimensionality. CIFAR-10/100 (32 x 32 pixel images) is also dominated by DL convnets, AFAIK.
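
For reference, a small convnet along those lines might look like this, as a minimal Keras sketch (filter counts and layer sizes are arbitrary; this is not the actual state-of-the-art model):

    # Hypothetical sketch of a small 4-conv + 2-FC network for 28x28 grayscale digits.
    import tensorflow as tf

    model = tf.keras.Sequential([
        tf.keras.Input(shape=(28, 28, 1)),
        tf.keras.layers.Conv2D(32, 3, activation="relu"),
        tf.keras.layers.Conv2D(32, 3, activation="relu"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(64, 3, activation="relu"),
        tf.keras.layers.Conv2D(64, 3, activation="relu"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])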


Sure, didn't mean to imply megapixel images were the lower threshold, just that it is more to do with the number of features and the need to automatically extract higher level features.


Of course you need a lot of rows if you have a lot of columns...


>From there, yes, more data is better

Data augmentation is also a thing.
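
For instance, a crude numpy-only sketch of the idea (real pipelines usually use library transforms; the flip/shift choices here are arbitrary):

    # Hypothetical sketch: multiply a small image dataset with random flips
    # and small shifts. Assumes images shaped (N, H, W, C).
    import numpy as np

    def augment(images, max_shift=2, seed=0):
        rng = np.random.default_rng(seed)
        out = []
        for img in images:
            aug = img
            if rng.random() < 0.5:                        # random horizontal flip
                aug = aug[:, ::-1, :]
            dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
            aug = np.roll(aug, (dy, dx), axis=(0, 1))     # small random translation
            out.append(aug)
        return np.stack(out)

    # fresh_batch = augment(train_images)   # call once per epoch for new variations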


I've been baffled by this as well. I can understand why deep learning has done well in fields that can roughly be described as sensory perception, but it has never improved on basic random forests or SVMs for problems in my domain. And we have lots of data (at least more than can fit in an R instance).

Even taking data size out of the picture, functionally it is not there yet for most tasks. Maybe it will be in the future, but the big problem with it is that with n neurons, you have n^n possible topologies, and finding the right neural topology is a major optimization problem that we're only barely learning basic human heuristics for.

I'm willing to bet the deep learning thing is just one more Neat fad that will eventually cause disillusionment at its lack of results, reverting us back to the Scruffy view that intelligence is far too complex to be described holistically by small sets of simple algorithms. The great thing about the Scruffy philosophy is that it isn't derogatory...deep learning will always have a place as a tool in its tool set. It merely doesn't hold unreasonable expectations.


>I'm willing to bet the deep learning thing is just one more Neat fad that will eventually cause disillusionment at its lack of results, reverting us back to the Scruffy view that intelligence is far too complex to be described holistically by small sets of simple algorithms. The great thing about the Scruffy philosophy is that it isn't derogatory...deep learning will always have a place as a tool in its tool set. It merely doesn't hold unreasonable expectations.

Deep learning is already a Scruffy fad. It basically says, "Hey, let's use a really huge hypothesis space of circuits that often includes a heavy prior towards convolutions." Gradient descent is a Neat principle, but the whole point of things like improved training methods, new objective functions, and convolutions was to deal with the exploding-gradient problem.

Deep learning didn't come up with its own Neat principle, it invented Scruffy methods to apply a Neat principle to a really fucking huge hypothesis space, so long as you've got a pretty big dataset.


What is your domain?


Transportation, Logistics, Supply Chain Management, (physical) Operations.

I suspect the reason why deep learning has done so poorly in my domain is that the underlying data is a result of things that are very poorly abstracted as a "function". We have lots of discrete events, stateful buffering, hard non-linearities, discontinuities, numerical bounds, etc. It's more like learning business rules and physical process design than learning a mathematical function. This is part of the reason I don't see deep learning being a holistic solution for self driving cars...once you get past sensory perception and simple 2d path planning, driving is more of a rule based process than anything.

That being said, ML tends to be a pretty niche technique for us anyway. If a process and its components are well known and understood, we tend towards solutions that come from Operations Research over Machine Learning. It is only when things are poorly understood that we use ML (example: predicting product demand fluctuations based on media coverage or predicting truck arrival times given severe weather patterns and traffic backups). PGMs do really well here, but are far more difficult to understand, formulate, and train...for most tasks Random Forests are almost always Good Enough(TM).


+1. For the real-world business problems that I most frequently encounter doing consulting, it's hard to beat Random Forests and/or Gradient Boosting. Truth be told, most business problems I encounter turn out to be largely helped by good old linear models.
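
As a rough illustration of that comparison (a scikit-learn sketch on a stand-in tabular dataset; the models and hyperparameters are arbitrary):

    # Hypothetical comparison: random forest, gradient boosting, and a plain
    # linear model, all cross-validated on a placeholder tabular dataset.
    from sklearn.datasets import load_diabetes
    from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import cross_val_score

    X, y = load_diabetes(return_X_y=True)
    models = {
        "random forest":     RandomForestRegressor(n_estimators=300, random_state=0),
        "gradient boosting": GradientBoostingRegressor(random_state=0),
        "linear (ridge)":    Ridge(alpha=1.0),
    }
    for name, model in models.items():
        scores = cross_val_score(model, X, y, cv=5, scoring="r2")
        print(f"{name:18s} R^2 = {scores.mean():.3f} +/- {scores.std():.3f}")

On problems like this the plain linear model often lands within noise of the ensembles, which is roughly the point being made above.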


Agreed! I have often done more sophisticated analysis and then stepped back and concluded that a simpler analysis was actually better for the business. It moved the business into a better place for informed decisions and gave them simple (analysis-backed) rules of thumb that every manager/director/VP could understand and use by just checking a couple of numbers and doing a simple, easily remembered bit of math.

Understandable models with clear intervention points are what most businesses seem to need once you get to digging around in their operations, customer and sales data.


Are there any examples of PGMs being applied in this field?


It's true that simple models often outperform more complex ones on small datasets. But the comparison seems rather unfair in this case: the "deep learning" model employed seems to be a simple feedforward discriminative classifier, and these are known to perform badly on small datasets. There are other "deep learning" models that would likely perform much better on small sample sizes. I've written a blog post about one idea [1]. If you prefer published peer-reviewed research (and of course you should), then e.g. Semi-Supervised Learning with Deep Generative Models [2] is a good starting point.

1. http://www.openias.org/hybrid-generative-discriminative

2. https://pdfs.semanticscholar.org/b6b9/39ffc9920cd8521299a6fe...


It's commonly understood in the field^ that 60,000 examples is the sweet spot for training and validation data: 50k for training, 10k for test/validation. This is largely because the MNIST set is exactly that size and is so commonly used successfully. You get very high accuracy and fewer instances of overfitting.

That said, you can do a lot with a relatively small dataset. This 2012 paper puts the range between 80 and 570 samples [1], again depending on the model and required outcomes. Leslie Smith at NRL has been working on this problem as well and is showing great progress on really small sample sets.

The major takeaway here is that there is such a thing as both too big and too small a dataset for classification accuracy, but those definitions are rapidly changing (one way to see where your own data sits is to plot a learning curve, sketched below).

^Your mileage may vary depending on model, fine tuning, transfer learning etc...

[1] https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3307431/
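
For instance, a quick learning-curve check with scikit-learn (placeholder dataset and model; the point is just to measure accuracy as a function of training set size rather than trusting a fixed rule of thumb):

    # Hypothetical sketch: cross-validated accuracy at several training set sizes.
    import numpy as np
    from sklearn.datasets import load_digits
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import learning_curve

    X, y = load_digits(return_X_y=True)
    sizes, train_scores, val_scores = learning_curve(
        RandomForestClassifier(random_state=0), X, y,
        train_sizes=np.linspace(0.05, 1.0, 8), cv=5)

    for n, score in zip(sizes, val_scores.mean(axis=1)):
        print(f"{n:5d} training examples -> cv accuracy {score:.3f}")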


> It's commonly understood in the field^ that 60,000 examples is the sweet spot for training and validation data: 50k for training, 10k for test/validation.

I don't understand how this could be true. Shouldn't the sweet spot be a function of the dimensionality of the data?


No, the deep learning network should be smart enough to reduce the data to its essence (and that's what it ultimately does).

If DL needed more training data for higher-dimensional inputs, then DL would lose to a simple pattern-matching (correlation) algorithm at some point.


Imagine you are trying to use a neural network to classify single bit data as either being 1 or 0 (I know you obviously don't need a neural net for this, but it's an example). Aside from not needing a very deep network, you would not need much training data.

Then imagine classifying the color of a single pixel as "light" or "dark". There are three dimensions: red, green, and blue. You would also need much less training data here than if you were trying to train a network to recognize a car, right?

I think this is what zeroxfe is referring to.


> But I’ve always thought that the major advantage of using deep learning over simpler models is that if you have a massive amount of data you can fit a massive number of parameters.

The major advantage of deep learning is not that it works better on more data. It's that it automatically learns features that would otherwise take expert humans a lot of time and energy to figure out and hardcode into the system.


I don't really think of it as learning. I think of it as statistics on crack.


That's an advantage of convolutional nets. Deep fully connected nets don't do that afaik.


They do, at least as far as I understand the statement. Historically the big benefit was training them layer by layer, which was like training a feature detector, then a feature-of-features detector, and so on. If that's still how they're trained (it's been nearly a decade for me now) then they discover features rather than you engineering them.

This meant that you could train on large unlabelled data and then small amounts of labelled data.
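
Roughly what that layer-by-layer scheme looked like in one of its variants (a hedged Keras sketch using stacked autoencoders; the data, layer widths, and training details are made-up placeholders, not a faithful reproduction of the historical recipes):

    # Hypothetical sketch of greedy layer-wise pretraining: train each layer as
    # an autoencoder on unlabelled data, then stack the encoders and fine-tune
    # on a (possibly much smaller) labelled set. Sizes are arbitrary.
    import numpy as np
    import tensorflow as tf

    def pretrain_layer(inputs, width, epochs=5):
        """Train Dense(width) to reconstruct its own input; return the encoder."""
        encoder = tf.keras.layers.Dense(width, activation="relu")
        decoder = tf.keras.layers.Dense(inputs.shape[1], activation="linear")
        inp = tf.keras.Input(shape=(inputs.shape[1],))
        auto = tf.keras.Model(inp, decoder(encoder(inp)))
        auto.compile(optimizer="adam", loss="mse")
        auto.fit(inputs, inputs, epochs=epochs, verbose=0)
        return encoder

    unlabelled = np.random.rand(1000, 64).astype("float32")   # stand-in data

    encoders, h = [], unlabelled
    for width in (32, 16):
        enc = pretrain_layer(h, width)
        encoders.append(enc)
        h = enc(h).numpy()     # features from the new layer feed the next one

    # Stack the pretrained encoders and add a classifier head for fine-tuning.
    model = tf.keras.Sequential([tf.keras.Input(shape=(64,)), *encoders,
                                 tf.keras.layers.Dense(10, activation="softmax")])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
    # model.fit(small_labelled_X, small_labelled_y, epochs=10)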


Yeah now that I think about it my statement didn't make any sense, since each intermediate layer computes a projection of the previous one, which is technically feature learning. I still disagree with the original comment though, because the intermediate representations of the data computed by a fully connected network are nothing like the ones that would be built by a human doing feature engineering. The ones learned by a convolutional layer would be closer to human-understandable features.


Loosely speaking, convolutional nets are just a smart way of computing a function that would otherwise take the computational load of a fully connected net.
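
To put rough numbers on that (a back-of-the-envelope sketch; the layer sizes are just an example):

    # Hypothetical arithmetic: parameter counts for mapping a 32x32x3 image to
    # 64 feature maps with a 3x3 convolution vs. a fully connected layer.
    h, w, c_in, c_out, k = 32, 32, 3, 64, 3

    conv_params = (k * k * c_in + 1) * c_out           # weights shared across positions
    fc_params = (h * w * c_in + 1) * (h * w * c_out)   # every input to every output unit

    print(f"conv layer:            {conv_params:,} parameters")   # 1,792
    print(f"fully connected layer: {fc_params:,} parameters")     # ~201 million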


All kinds of other structures also do that - for example, the "family" of recurrent networks in many non-visual problems.


The article is spot on, but it also misses a simple thing: as with all hype, the DL hype is built on human irrationality. Most people do not understand DL well (or at all), but they see some high-profile teams boasting about their success over and over. So, if I just used that magical DL tool, just like them, maybe I could do those awesome things too. Of course, there is also a simpler, more mundane explanation: beefing up the CV with yet another hyped technology. Hadoop: check; blockchain: check; deep learning: check. Maybe even deep learning on the blockchain distributed over a million-node Hadoop cluster. Keep them coming!


I am surprised by all the criticism. Sure, this does not make the point perfectly (with various details one could nitpick), but the basic premise seems completely agreeable and even boring: there are ML and statistics techniques that are simpler, more interpretable, and faster than deep learning ones and that can often be sufficiently robust for various problems/goals (SVMs and random trees/forests in particular are lovely). At least, that's how it seemed to me. The wording might be improved by altering "when your dataset isn't that big" to "when your target function isn't that complex / when your data isn't that complex", sure, but the point stands.


The main issue is that the VC dimension of a deep network is very high (IIRC it grows proportionally to the number of edges in the network, which grows combinatorially with its depth), and for any dataset smaller than that the network can just memorize the dataset and achieve 100% training accuracy. However, regularizing the network usually mitigates that problem.
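
For what it's worth, the usual regularization knobs look roughly like this in Keras (a minimal sketch; all layer sizes, rates, and penalties are arbitrary):

    # Hypothetical sketch of common regularizers for a small dataset:
    # L2 weight decay, dropout, and early stopping. Values are arbitrary.
    import tensorflow as tf

    l2 = tf.keras.regularizers.l2(1e-3)
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(20,)),
        tf.keras.layers.Dense(64, activation="relu", kernel_regularizer=l2),
        tf.keras.layers.Dropout(0.5),
        tf.keras.layers.Dense(64, activation="relu", kernel_regularizer=l2),
        tf.keras.layers.Dropout(0.5),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    stop = tf.keras.callbacks.EarlyStopping(patience=5, restore_best_weights=True)
    # model.fit(X_train, y_train, validation_split=0.2, epochs=200, callbacks=[stop])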


If your dataset is small then there are techniques to handle it. I bet the author was never enrolled in a deep learning class. He just came up with a title borrowed from "don't use big data if your data isn't big".


There are so many issues with this post, let me enumerate:

1. A straw-man tweet by some non-practitioner is used to set up the straw-man argument.

2. The whole Digits example is ridiculous; statisticians "love" toy problems to prove theorems and make "arguments", etc. ML is empirical, and not just the performance but the entire pipeline from data to application matters.

Let me illustrate: suppose your aim is to predict 1 vs 0 from images of digits. As an ML researcher I would write a program to synthesize images in all the different combinations of fonts, font colors, background colors, and locations available (roughly sketched after this list). The data would easily be more than ~100,000 images. At that point one cannot use LASSO on the top 10 pixels (due to jittering), and a deep model would be necessary. But in reality my model would outperform, because the thinking process as an ML researcher was not to make an "argument" but to "solve" the problem of detecting 1 vs 0.

3. But the biggest flaw is the following argument """The sample size matters. If you are Google, Amazon, or Facebook and have near infinite data it makes sense to deep learn."""

This is another issue with biostatisticians (the author of this post is a biostatistics professor): they are fundamentally unable to recognize the importance of programming and of the ability to collect data. Even if you are not Google, Amazon, or Facebook you can easily collect data; even labeled data at terabyte scale can be collected within days or a week. Every single PhD student I know is limited not by the size of the data but by the computational power and storage available to them. I personally have several terabytes of video and data from YFCC 100M that I would love to process and build models on, but I am limited only by computational power and AWS costs. If you want a concrete example, see the Google PlaNet paper [1]. I already have enough data (~5 TB) to replicate it and build an open-source geolocation model; the only hurdles are storage and computation costs.

[1] https://arxiv.org/abs/1602.05314
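
Roughly the kind of generator meant in point 2 above (a hedged Pillow sketch; the fonts, colours, jitter ranges, and dataset size are placeholders):

    # Hypothetical sketch: synthesize '0' vs '1' images with random colours and
    # jittered positions. Swap in real .ttf fonts for more variety.
    import random
    from PIL import Image, ImageDraw, ImageFont

    def make_example(label, size=28, seed=None):
        rng = random.Random(seed)
        bg = tuple(rng.randrange(256) for _ in range(3))
        fg = tuple(rng.randrange(256) for _ in range(3))
        img = Image.new("RGB", (size, size), bg)
        draw = ImageDraw.Draw(img)
        font = ImageFont.load_default()
        x, y = rng.randrange(4, 14), rng.randrange(2, 10)   # jitter the location
        draw.text((x, y), str(label), fill=fg, font=font)
        return img

    dataset = [(make_example(lbl, seed=2 * i + lbl), lbl)
               for i in range(10000) for lbl in (0, 1)]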


> 3.

How much of this is students doing research where they already have access to big data, which makes sense if your goal is to do deep learning research, vs being given a problem a business wants to solve? Can you make the same statement for the average problem at your average small-medium sized business? Can you really get big data that is relevant to the local, non-chain coffee shop down the street?

If you can it seems like an amazing business opportunity - to bring Google level insights to businesses that don't directly have Google-level data.


The issue of whether some business like "the local, non-chain coffee shop down the street" has any reason to use machine learning whatsoever seems orthogonal to the problem discussed in the article, which is the choice of approaches if you're going to do some machine learning.

There's a classical quote from Tukey "The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data." - yes, it's quite likely that an average small-medium sized business has no problems where the possible benefit of ML-driven insights won't match the costs required to analyze whatever data they have.

However, if a small-medium business has some problem with a large enough likely payback to justify making some ML system, it is quite likely that deep learning may be applicable on their data.

A big issue is transfer learning - in many domains while you may have a small amount of data, you'd want a system that has learned to generalize on a huge quantity of similar external data, and just tuned on your data. For example, if a cookie bakery needs analysis of cookie pictures or reviews of cookies, and has limited data samples, it would be reasonable to include e.g. ImageNet data or Amazon review corpus. You'd "teach" the system how pictures/internet reviews/English language/whatever else works on the biggest data available, and just retrain/adapt it to your particular problem afterwards.
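
A hedged sketch of that adapt-to-your-data approach with a pretrained ImageNet backbone (the model choice, input size, and head are arbitrary; the "cookie" task is just the example from above):

    # Hypothetical sketch of transfer learning: reuse ImageNet-pretrained
    # features and train only a small head on a limited dataset.
    import tensorflow as tf

    base = tf.keras.applications.MobileNetV2(
        input_shape=(224, 224, 3), include_top=False, weights="imagenet")
    base.trainable = False                     # keep the generic visual features frozen

    inputs = tf.keras.Input(shape=(224, 224, 3))
    x = tf.keras.applications.mobilenet_v2.preprocess_input(inputs)
    x = base(x, training=False)
    x = tf.keras.layers.GlobalAveragePooling2D()(x)
    outputs = tf.keras.layers.Dense(1, activation="sigmoid")(x)   # e.g. "good cookie" or not

    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
                  loss="binary_crossentropy", metrics=["accuracy"])
    # model.fit(small_cookie_dataset, epochs=10)   # then optionally unfreeze and fine-tune

With the backbone frozen, the number of trainable parameters is tiny, so even a few hundred labelled examples can be enough to get a usable classifier.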



