It's important to note that this is from 2022. I'm not saying it no longer holds, but neural models have gotten much better in the two years since.
(I'm personally using NN models to predict certain values on tabular data, and at least in my case the NN works better than state-of-the-art tree models.)
There has been some work on training on lots of different data sets and then specializing on the one you care about. But I think people were trying that approach pre-2022 as well.
Sorry, I don't have references off the top of my head. I just recall coming across it while I was working on something related to timeseries forecasting.
Tooling around embeddings has improved. Creating and fine-tuning custom embeddings for your tabular data should be easier and more powerful these days.
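For example, the usual pattern is an entity-embedding layer per categorical column, learned jointly with the rest of the network. A rough sketch in PyTorch (the column counts and sizes here are made up):

    import torch
    import torch.nn as nn

    # Toy setup: one categorical column with 1000 distinct values embedded into
    # 16 dims, concatenated with 10 numeric columns, then a small MLP on top.
    class TabularNet(nn.Module):
        def __init__(self, n_categories=1000, emb_dim=16, n_numeric=10):
            super().__init__()
            self.emb = nn.Embedding(n_categories, emb_dim)
            self.mlp = nn.Sequential(
                nn.Linear(emb_dim + n_numeric, 64),
                nn.ReLU(),
                nn.Linear(64, 1),
            )

        def forward(self, cat_ids, numeric):
            x = torch.cat([self.emb(cat_ids), numeric], dim=1)
            return self.mlp(x)

    model = TabularNet()
    cat_ids = torch.randint(0, 1000, (32,))   # a batch of category indices
    numeric = torch.randn(32, 10)             # the numeric features
    print(model(cat_ids, numeric).shape)      # -> torch.Size([32, 1])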
Not the parent, but NNs typically work better when you can't linearize your data, i.e. find a transformation into a space where the problem becomes linear: for classification, a space in which hyperplanes separate the classes; for regression, a space in which a linear approximation is good.
Take the classic toy example of two classes arranged as concentric rings in 2D. That doesn't look immediately linearly separable, but since it's 2D we have the insight that parameterizing by radius would do the trick. Now try doing that in 1000 dimensions. Sometimes you can, sometimes you can't or don't want to bother.
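Concretely, with the rings example (a quick sketch using scikit-learn's make_circles as stand-in data; the numbers are arbitrary):

    import numpy as np
    from sklearn.datasets import make_circles
    from sklearn.linear_model import LogisticRegression

    # Two classes arranged as concentric rings: not linearly separable in (x, y).
    X, y = make_circles(n_samples=500, factor=0.3, noise=0.05, random_state=0)

    linear = LogisticRegression().fit(X, y)
    print("raw (x, y) accuracy:", linear.score(X, y))       # roughly chance level

    # Parameterize by radius: r = sqrt(x^2 + y^2) makes the classes
    # separable by a simple threshold.
    r = np.linalg.norm(X, axis=1, keepdims=True)
    X_r = np.hstack([X, r])

    linear_r = LogisticRegression().fit(X_r, y)
    print("with radius feature:", linear_r.score(X_r, y))   # near-perfect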
Note that if linear separability is the only issue, you can just use kernel methods. In fact, Gaussian processes are equivalent to a single-hidden-layer neural network with an infinite number of hidden units.
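Same toy data as above, but letting an RBF kernel do the lifting instead of a hand-crafted radius feature (again just a scikit-learn sketch):

    from sklearn.datasets import make_circles
    from sklearn.svm import SVC

    X, y = make_circles(n_samples=500, factor=0.3, noise=0.05, random_state=0)

    # The RBF kernel implicitly maps into a space where the classes separate,
    # so no hand-crafted features are needed.
    rbf_svm = SVC(kernel="rbf", gamma="scale").fit(X, y)
    print("RBF-SVM accuracy:", rbf_svm.score(X, y))  # near-perfect on this data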
The magic of deep neural networks comes from modeling complicated conditional probability distributions, which lets you do generative magic but isn't going to give you significantly better results than ensemble kNN when you're discriminating and the conditional distribution is low variance.

Ensemble methods are like a form of regularization, and they also act as a weak bootstrap that better models population variance, so it's no surprise that when they're capable of modeling the domain, they perform better than an unregularized, un-bootstrapped neural network model. There are still tons of situations where ensemble methods can't model the domain, and if you incorporated regularization and bootstrapping into a discriminative NN model, it would probably perform equivalently to the ensemble model.
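That last part is easy to sanity-check: bolt weight decay and bootstrap aggregation onto a small discriminative net and compare it against a tree ensemble. A rough sketch of the experiment (assuming scikit-learn 1.2+, where BaggingClassifier takes `estimator`; synthetic data, so the actual numbers don't mean much):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.neural_network import MLPClassifier

    X, y = make_classification(n_samples=2000, n_features=20, n_informative=10,
                               random_state=0)

    # Plain tree-ensemble baseline.
    gbt = GradientBoostingClassifier(random_state=0)

    # "Regularized + bootstrapped" NN: weight decay (alpha) plus bagging over
    # bootstrap resamples of the training set.
    bagged_mlp = BaggingClassifier(
        estimator=MLPClassifier(hidden_layer_sizes=(64,), alpha=1e-2,
                                max_iter=1000, random_state=0),
        n_estimators=10,
        bootstrap=True,
        random_state=0,
    )

    for name, model in [("GBT", gbt), ("bagged MLP", bagged_mlp)]:
        scores = cross_val_score(model, X, y, cv=5)
        print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")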
That's an advantage over linear models, but GBTs handle non-linearly-separable data just fine. Each individual tree can represent an arbitrary piecewise-constant function given enough depth, and each tree in turn minimizes the loss on the residual of the previous trees. As such, they're effectively as expressive as a neural network with two hidden layers.
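The residual-fitting part is only a few lines if you want to see it spelled out; a toy sketch of squared-error boosting with shallow scikit-learn trees (all values arbitrary):

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(0)
    X = rng.uniform(-3, 3, size=(1000, 1))
    y = np.sin(2 * X[:, 0]) + 0.1 * rng.normal(size=1000)  # clearly non-linear target

    learning_rate = 0.1
    prediction = np.full_like(y, y.mean())  # start from the mean
    trees = []

    for _ in range(100):
        residual = y - prediction                      # gradient of squared error
        tree = DecisionTreeRegressor(max_depth=3).fit(X, residual)
        prediction += learning_rate * tree.predict(X)  # each tree corrects what's left
        trees.append(tree)

    print("final MSE:", np.mean((y - prediction) ** 2))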
This explanation doesn’t make sense to me. What do you mean by “linearize your data”? Tree methods assume no linear form and are not even monotonically constrained. Classification is not done by plane-drawing but by probability estimation plus a cost function.
I assume it's because there are some very complex relationships and patterns that cannot be captured by decision trees. Tree models work better on simpler data; at least that's my gut feeling based on previous experiments with similar data.
Interesting. Usually I have better luck with xgboost for tabular data, even when the relationships are complex (which usually means deeper trees). It does fall flat a lot of the time for very high dimensions, though. All data is different, I guess.
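For what it's worth, the depth knob I mean is roughly this (a sketch with xgboost's sklearn-style wrapper on stand-in data; the grid values are placeholders):

    from sklearn.datasets import make_classification
    from sklearn.model_selection import GridSearchCV
    from xgboost import XGBClassifier

    # Stand-in tabular data; swap in your own features/labels.
    X, y = make_classification(n_samples=5000, n_features=50, n_informative=20,
                               random_state=0)

    # "Complex relationships" mostly means letting the trees go deeper, with
    # subsampling as a brake against overfitting.
    model = XGBClassifier(n_estimators=300, learning_rate=0.05,
                          subsample=0.8, tree_method="hist")

    search = GridSearchCV(model, param_grid={"max_depth": [3, 6, 10]}, cv=3)
    search.fit(X, y)
    print("best depth:", search.best_params_,
          "cv accuracy:", round(search.best_score_, 3))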
There is some work on zero-shot (decoder-only) time-series prediction from Google, plus an open-source variant. Curious to see how these approaches stack up as they're explored.