Hacker News

There are so many issues with this post, let me enumerate:

1. It opens with a straw-man tweet by some non-practitioner, which is used to set up the straw-man argument.

2. The whole Digits example is ridiculous. Statisticians "love" toy problems for proving theorems and making "arguments". ML is empirical, and it is not just the performance that matters but the entire pipeline from data to application.

Let me illustrate. Suppose your aim is to predict 1 vs 0 from images of digits. As an ML researcher I would write a program to synthesize images in every available combination of font, font color, background color, and location. The data would easily exceed ~100,000 images. At that point one cannot use LASSO on the top 10 pixels (because of the jittering), and a deep model would be necessary. And in reality my model will outperform, because my thinking as an ML researcher was not to make an "argument" but to "solve" the problem of detecting 1 vs 0.
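A minimal sketch of the kind of synthesis program described above, using only the Python standard library. The glyph bitmaps (standing in for real fonts), the 28x28 canvas, and the names `synthesize` and `make_dataset` are all assumptions made for illustration; a real pipeline would render actual fonts with an imaging library.

```python
import random

# Hypothetical 5x3 bitmaps standing in for real fonts (two "fonts" per digit).
GLYPHS = {
    0: [
        ["###", "#.#", "#.#", "#.#", "###"],
        [".#.", "#.#", "#.#", "#.#", ".#."],
    ],
    1: [
        [".#.", ".#.", ".#.", ".#.", ".#."],
        [".#.", "##.", ".#.", ".#.", "###"],
    ],
}


def random_color(rng):
    """Pick a random RGB color."""
    return tuple(rng.randrange(256) for _ in range(3))


def synthesize(label, size=28, scale=3, rng=random):
    """Return (image, label); image is a size x size grid of RGB tuples."""
    glyph = rng.choice(GLYPHS[label])          # random "font"
    background = random_color(rng)             # random background color
    foreground = random_color(rng)             # random font color
    while foreground == background:            # keep the digit visible
        foreground = random_color(rng)
    height, width = len(glyph) * scale, len(glyph[0]) * scale
    top = rng.randrange(size - height + 1)     # random location (the jitter)
    left = rng.randrange(size - width + 1)
    image = [[background] * size for _ in range(size)]
    for r in range(height):
        for c in range(width):
            if glyph[r // scale][c // scale] == "#":
                image[top + r][left + c] = foreground
    return image, label


def make_dataset(n, seed=0):
    """Generate n labeled images of 0s and 1s."""
    rng = random.Random(seed)
    return [synthesize(rng.randrange(2), rng=rng) for _ in range(n)]
```

Because the digit lands at a different offset in every image, no fixed set of 10 pixels carries the label, which is the point about jittering made above.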

3. But the biggest flaw is the following argument: """The sample size matters. If you are Google, Amazon, or Facebook and have near infinite data it makes sense to deep learn."""

This is another issue with biostatisticians (the author of this post is a biostatistics professor): they are fundamentally unable to recognize the importance of programming and the ability to collect data. Even if you are not Google, Amazon, or Facebook you can easily collect data; even labeled data on the scale of terabytes can be collected within days or a week. Every single PhD student I know is limited not by the size of the data but by the computational power and storage available to them. I personally have several terabytes of video and data from YFCC 100M that I would love to process and build models on, but I am limited only by computational power & AWS costs. If you want a concrete example, see the Google PlaNet paper [1]. Today I have enough data (~5 TB) to replicate it and build an open source geolocation model; the only hurdles are storage and computation costs.

[1] https://arxiv.org/abs/1602.05314




> 3.

How much of this is students doing research where they already have access to big data, which makes sense if your goal is to do deep learning research, vs being given a problem a business wants to solve? Can you make the same statement for the average problem at your average small-medium sized business? Can you really get big data that is relevant to the local, non-chain coffee shop down the street?

If you can, it seems like an amazing business opportunity: bringing Google-level insights to businesses that don't directly have Google-level data.


The issue of whether some business like "the local, non-chain coffee shop down the street" has any reason to use machine learning whatsoever seems to be orthogonal to the problem discussed in the article, which is the choice of approach if you're going to do some machine learning.

There's a classic quote from Tukey: "The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data." Yes, it's quite likely that an average small-medium sized business has no problems where the possible benefit of ML-driven insights would match the costs required to analyze whatever data they have.

However, if a small-medium business has some problem with a large enough likely payback to justify building some ML system, it is quite likely that deep learning is applicable to their data.

A big issue is transfer learning: in many domains you may have only a small amount of data, so you'd want a system that has learned to generalize from a huge quantity of similar external data and is then just tuned on your data. For example, if a cookie bakery needs analysis of cookie pictures or reviews of cookies, and has limited data samples, it would be reasonable to include e.g. ImageNet data or an Amazon review corpus. You'd "teach" the system how pictures/internet reviews/English language/whatever else works on the biggest data available, and just retrain/adapt it to your particular problem afterwards.
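A toy sketch of that pretrain-then-adapt pattern, using only the Python standard library. Everything here is an assumption made for illustration: a small logistic regression stands in for a deep network, and the synthetic `external` and `bakery_*` datasets stand in for something like ImageNet and the bakery's own samples.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def make_data(true_weights, n, rng):
    """Sample n points labeled by a hypothetical linear rule."""
    data = []
    for _ in range(n):
        x = [rng.gauss(0, 1) for _ in true_weights]
        y = 1 if sum(w * xi for w, xi in zip(true_weights, x)) > 0 else 0
        data.append((x, y))
    return data


def train(weights, data, epochs, lr):
    """Logistic-regression SGD that starts from the given weights."""
    weights = list(weights)
    for _ in range(epochs):
        for x, y in data:
            error = sigmoid(sum(w * xi for w, xi in zip(weights, x))) - y
            weights = [w - lr * error * xi for w, xi in zip(weights, x)]
    return weights


def accuracy(weights, data):
    """Fraction of points whose predicted class matches the label."""
    hits = 0
    for x, y in data:
        predicted = 1 if sum(w * xi for w, xi in zip(weights, x)) > 0 else 0
        hits += predicted == y
    return hits / len(data)


rng = random.Random(0)
external_task = [1.0, -1.0, 0.5, 0.0, 0.0]  # stand-in for the big generic data
bakery_task = [1.0, -1.0, 0.5, 0.3, 0.0]    # related but slightly different

external = make_data(external_task, 2000, rng)   # plenty of external data
bakery_train = make_data(bakery_task, 20, rng)   # very little of your own
bakery_test = make_data(bakery_task, 500, rng)

# "Teach" the model on the big data, then adapt it on the small data.
pretrained = train([0.0] * 5, external, epochs=3, lr=0.1)
finetuned = train(pretrained, bakery_train, epochs=5, lr=0.05)

# Baseline for comparison: the small data alone, no pretraining.
from_scratch = train([0.0] * 5, bakery_train, epochs=5, lr=0.05)

print("fine-tuned:  ", accuracy(finetuned, bakery_test))
print("from scratch:", accuracy(from_scratch, bakery_test))
```

The only difference between the two final models is the starting point passed to `train`: the fine-tuned one begins from weights learned on the large external set, while the baseline begins from zeros.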



