Great article! (The author's books on ML are very good, highly recommended.)
Even though Deep Learning on tabular data is a topic that is picking up interest, I'd try to stay away from it as much as possible. The main advantage of "statistical methods" (in contrast to Deep Learning) is interpretability.
Most business applications of Machine Learning happen on "tabular" data (explicit features), where the company has a lot of knowledge about the features selected.
A simple Decision Tree gives you AMAZING accuracy and you can still understand what's going on behind the model.
Interpretability in Business ML applications is, IMO at least, the single most important trait of the model selected.
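To make the interpretability point concrete, here is a minimal sketch of inspecting a shallow decision tree with scikit-learn (the built-in dataset is just a stand-in for real business data):

    # Minimal sketch: fit a shallow decision tree and print its rules.
    # The built-in dataset is a placeholder for real business data.
    from sklearn.datasets import load_breast_cancer
    from sklearn.tree import DecisionTreeClassifier, export_text

    X, y = load_breast_cancer(return_X_y=True, as_frame=True)
    tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

    # The whole model reads as a short list of if/else rules.
    print(export_text(tree, feature_names=list(X.columns)))

With the depth capped, the printed rules are short enough to sanity-check directly against domain knowledge.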
Fully agree! In the case that tabular data means business process records, there's absolutely no need to use anything more complicated than a principled statistical model. Focus on the quality of the data and the business problem, not exotic ML.
In more detail: business data is prone to evolving processes/systems/products/markets/customers; errors, omissions, and corrections; tail events; hirings and firings; data loss; and so on. The datasets tend to be small, messy, complicated, and subjective. Nothing about this suggests the need for a large, complicated model.
This is such an odd, dogmatic generalization. There are plenty of business cases where, e.g., SHAP is enough for explainability and higher accuracy is what matters.
SHAP is a recent de-facto standard way to get feature importance out of any model, whether you care about how much each column in your dataset contributes to a given decision or, say, which pixels in an image mattered the most.
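For reference, a minimal sketch of how that typically looks with the shap package on a tree model (the exact plotting API varies a bit across shap versions; the model and data here are placeholders):

    # Minimal SHAP sketch: per-feature contributions from a fitted tree model.
    # Assumes the shap and xgboost packages; the dataset is a placeholder.
    import shap
    import xgboost
    from sklearn.datasets import load_breast_cancer

    X, y = load_breast_cancer(return_X_y=True, as_frame=True)
    model = xgboost.XGBClassifier(n_estimators=200).fit(X, y)

    explainer = shap.TreeExplainer(model)   # fast, exact for tree ensembles
    shap_values = explainer(X)              # one contribution per row and column

    shap.plots.beeswarm(shap_values)        # global: which columns matter overall
    shap.plots.waterfall(shap_values[0])    # local: why this one row got its score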
> The gap between tree-based models and deep learning becomes narrower as the dataset size increases (here: 10k -> 50k).
I am curious if there is a sample threshold where it's worth exploring deep learning approaches to tabular data. I wonder if there are other considerations (e.g., inference speed, explainability, etc.).
>if there is a sample threshold where it's worth exploring deep learning
Not especially, but there are tasks where DL models seem to occasionally outperform by a little. If you really want to milk extra accuracy, it can be worth trying a DL model; if it performs as well or better, you can ensemble it with your GBM (a sketch of a simple blend is below) or replace the GBM, though it's rarely worth it. If you check tabular-data Kaggle winner write-ups, most use GBMs, or an ensemble for a tiny boost over a GBM alone.
Assuming limited time to work on the problem, you'd almost always want to focus on further feature engineering first and likely some hyperparameter tuning second.
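A minimal sketch of the blending idea, using scikit-learn stand-ins for the GBM and the DL model (synthetic data; in practice you'd swap in, e.g., LightGBM and your tuned network):

    # Minimal blending sketch: mix GBM and NN probabilities, pick the weight on
    # a validation split, and keep the blend only if it beats the GBM alone.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.metrics import log_loss
    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPClassifier

    X, y = make_classification(n_samples=20000, n_features=30, random_state=0)
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

    gbm = GradientBoostingClassifier().fit(X_tr, y_tr)
    nn = MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=300).fit(X_tr, y_tr)

    p_gbm = gbm.predict_proba(X_val)[:, 1]
    p_nn = nn.predict_proba(X_val)[:, 1]

    weights = np.linspace(0.0, 1.0, 21)
    scores = [log_loss(y_val, w * p_nn + (1 - w) * p_gbm) for w in weights]
    best_w = weights[int(np.argmin(scores))]
    print(f"GBM-only log-loss: {scores[0]:.4f}, "
          f"best blend (w_nn={best_w:.2f}): {min(scores):.4f}")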
I worked on a little side project for classification on tabular data, a really challenging use case where the data was prone to a lot of noise and there was some randomness in the dependent variable. Tree models couldn't get a high enough accuracy, and when the dataset was under roughly 6k entries deep learning performed even worse (as expected).
What was really interesting was that once the dataset had more than 6k entries or so, the deep learning model was suddenly much more accurate, and by a wide margin. At roughly the 10k mark, the DL model was easily outperforming the tree model.
It depends on the "DL model", which is a very vague term. Both a model with 10K parameters and a model with 10T parameters fit this description equally well.
Yeah, tree-based models are great for tabular datasets that are primarily numeric, with only a few categorical variables. But as soon as your categorical variables have 1000+ potential values that need one-hot encoding, or you have any natural-language text associated with your rows, deep learning almost always outperforms in my experience, especially if you have over 50K instances (rough sketch at the end of this comment).
The major downside of DL is the slow training, and therefore the slow iteration feedback loop. Couple that with an exponentially growing number of hyperparameters to tune, and you get something very powerful but costly in terms of the time it takes to use.
But if you want the best possible accuracy, and data collection isn't expensive, DL is the way to go. Just expect to spend 10x the amount of time tuning it vs trees to get a 10% to 20% reduction in error.
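For what it's worth, the usual DL move for those 1000+-value columns is a learned embedding instead of one-hot features. A rough PyTorch sketch with purely illustrative sizes and synthetic tensors:

    # Rough sketch: embed a high-cardinality categorical ID and concatenate it
    # with numeric features. All sizes and data here are illustrative.
    import torch
    import torch.nn as nn

    N_CATEGORIES = 5000   # e.g. a product or store ID column
    EMBED_DIM = 16
    N_NUMERIC = 10

    class TabularNet(nn.Module):
        def __init__(self):
            super().__init__()
            self.embed = nn.Embedding(N_CATEGORIES, EMBED_DIM)  # replaces one-hot
            self.mlp = nn.Sequential(
                nn.Linear(EMBED_DIM + N_NUMERIC, 64), nn.ReLU(),
                nn.Linear(64, 1),
            )

        def forward(self, cat_ids, numeric):
            x = torch.cat([self.embed(cat_ids), numeric], dim=1)
            return self.mlp(x).squeeze(1)

    # Tiny synthetic batch just to show the shapes end to end.
    model = TabularNet()
    cat_ids = torch.randint(0, N_CATEGORIES, (32,))
    numeric = torch.randn(32, N_NUMERIC)
    logits = model(cat_ids, numeric)
    loss = nn.BCEWithLogitsLoss()(logits, torch.randint(0, 2, (32,)).float())
    loss.backward()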
>your categorical variables have 1000+ potential values that need one-hot encoding
You typically do not need to one-hot encode categorical variables, as the common implementations like LightGBM and CatBoost have efficient native ways to handle them (minimal sketch at the end of this comment). Googling around, I can't easily find cases where people get better results with GBM + one-hot, and I haven't either, though I haven't worked much with 1000+-value categorical variables.
>deep learning almost always outperforms
This isn't the case in the article we are commenting on, nor on Kaggle, but given that DL models do occasionally (though rarely) outperform, I'm willing to believe this is one of those cases. Any recommendation on which DL models in particular I should test this claim with?
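Since it came up, here is a minimal sketch of what "native handling" looks like in both libraries (synthetic data; assumes the lightgbm and catboost packages):

    # Minimal sketch: categorical column passed directly, no one-hot encoding.
    # Synthetic data; assumes the lightgbm and catboost packages.
    import numpy as np
    import pandas as pd
    import lightgbm as lgb
    from catboost import CatBoostClassifier

    rng = np.random.default_rng(0)
    df = pd.DataFrame({
        "num": rng.normal(size=5000),
        "cat": rng.integers(0, 1000, size=5000).astype(str),  # high cardinality
    })
    y = rng.integers(0, 2, size=5000)

    # LightGBM: cast to pandas 'category' dtype and it is handled natively.
    df_lgb = df.copy()
    df_lgb["cat"] = df_lgb["cat"].astype("category")
    lgb_model = lgb.LGBMClassifier().fit(df_lgb, y)

    # CatBoost: just list the categorical columns at fit time.
    cb_model = CatBoostClassifier(iterations=100, verbose=False)
    cb_model.fit(df, y, cat_features=["cat"])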