It's important to note that this is from 2022. I'm not saying it no longer holds, but neural models have gotten much better in the two years since.
(I'm personally using NN models to predict certain values on tabular data, and at least in my case the NN works better than state-of-the-art tree models.)
There has been some work on training on lots of different data sets and then specializing on the one you care about. But I think people were trying that approach pre-2022 as well.
Sorry, I don't have references off the top of my head. I just recall coming across it while I was working on something related to timeseries forecasting.
Tooling around embeddings has improved. Creating and fine-tuning custom embeddings for your tabular data should be easier and more powerful these days.
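For example, the usual pattern is an entity-embedding layer per categorical column, learned jointly with the rest of the network. A rough sketch in PyTorch (the column counts and sizes here are made up):

    import torch
    import torch.nn as nn

    # Toy setup: one categorical column with 1000 distinct values embedded into
    # 16 dims, concatenated with 10 numeric columns, then a small MLP on top.
    class TabularNet(nn.Module):
        def __init__(self, n_categories=1000, emb_dim=16, n_numeric=10):
            super().__init__()
            self.emb = nn.Embedding(n_categories, emb_dim)
            self.mlp = nn.Sequential(
                nn.Linear(emb_dim + n_numeric, 64),
                nn.ReLU(),
                nn.Linear(64, 1),
            )

        def forward(self, cat_ids, numeric):
            x = torch.cat([self.emb(cat_ids), numeric], dim=1)
            return self.mlp(x)

    model = TabularNet()
    cat_ids = torch.randint(0, 1000, (32,))   # a batch of category indices
    numeric = torch.randn(32, 10)             # the numeric features
    print(model(cat_ids, numeric).shape)      # -> torch.Size([32, 1])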
Not the parent, but NNs typically work better when you can't linearize your data, i.e. find a transformation into a space where the problem becomes linear: for classification, a space in which hyperplanes separate the classes; for regression, a space in which a linear approximation is good.
Take the classic toy example of two classes arranged as concentric rings in 2D. That doesn't look immediately linearly separable, but since it's 2D we have the insight that parameterizing by radius would do the trick. Now try doing that in 1000 dimensions. Sometimes you can, sometimes you can't or don't want to bother.
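Concretely, with the rings example (a quick sketch using scikit-learn's make_circles as stand-in data; the numbers are arbitrary):

    import numpy as np
    from sklearn.datasets import make_circles
    from sklearn.linear_model import LogisticRegression

    # Two classes arranged as concentric rings: not linearly separable in (x, y).
    X, y = make_circles(n_samples=500, factor=0.3, noise=0.05, random_state=0)

    linear = LogisticRegression().fit(X, y)
    print("raw (x, y) accuracy:", linear.score(X, y))       # roughly chance level

    # Parameterize by radius: r = sqrt(x^2 + y^2) makes the classes
    # separable by a simple threshold.
    r = np.linalg.norm(X, axis=1, keepdims=True)
    X_r = np.hstack([X, r])

    linear_r = LogisticRegression().fit(X_r, y)
    print("with radius feature:", linear_r.score(X_r, y))   # near-perfect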
Note that if linear separability is the only issue, you can just use kernel methods. In fact, Gaussian processes are equivalent to a single-hidden-layer neural network with an infinite number of hidden units.
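Same toy data as above, but letting an RBF kernel do the lifting instead of a hand-crafted radius feature (again just a scikit-learn sketch):

    from sklearn.datasets import make_circles
    from sklearn.svm import SVC

    X, y = make_circles(n_samples=500, factor=0.3, noise=0.05, random_state=0)

    # The RBF kernel implicitly maps into a space where the classes separate,
    # so no hand-crafted features are needed.
    rbf_svm = SVC(kernel="rbf", gamma="scale").fit(X, y)
    print("RBF-SVM accuracy:", rbf_svm.score(X, y))  # near-perfect on this data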
The magic of deep neural networks comes from modeling complicated conditional probability distributions, which lets you do generative magic but isn't going to give you significantly better results than ensemble kNN when you're discriminating and the conditional distribution is low variance.

Ensemble methods are like a form of regularization, and they also act as a weak bootstrap that better models population variance, so it's no surprise that when they're capable of modeling the domain, they perform better than an unregularized, un-bootstrapped neural network model. There are still tons of situations where ensemble methods can't model the domain, and if you incorporated regularization and bootstrapping into a discriminative NN model, it would probably perform equivalently to the ensemble model.
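That last part is easy to sanity-check: bolt weight decay and bootstrap aggregation onto a small discriminative net and compare it against a tree ensemble. A rough sketch of the experiment (assuming scikit-learn 1.2+, where BaggingClassifier takes `estimator`; synthetic data, so the actual numbers don't mean much):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.neural_network import MLPClassifier

    X, y = make_classification(n_samples=2000, n_features=20, n_informative=10,
                               random_state=0)

    # Plain tree-ensemble baseline.
    gbt = GradientBoostingClassifier(random_state=0)

    # "Regularized + bootstrapped" NN: weight decay (alpha) plus bagging over
    # bootstrap resamples of the training set.
    bagged_mlp = BaggingClassifier(
        estimator=MLPClassifier(hidden_layer_sizes=(64,), alpha=1e-2,
                                max_iter=1000, random_state=0),
        n_estimators=10,
        bootstrap=True,
        random_state=0,
    )

    for name, model in [("GBT", gbt), ("bagged MLP", bagged_mlp)]:
        scores = cross_val_score(model, X, y, cv=5)
        print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")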
That's an advantage over linear models, but GBTs handle non-linearly-separable data just fine. Each individual tree can represent an arbitrary piecewise-constant function given enough depth, and each tree in turn minimizes the loss on the residual of the previous trees. As such, they're effectively as expressive as a neural network with two hidden layers.
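The residual-fitting part is only a few lines if you want to see it spelled out; a toy sketch of squared-error boosting with shallow scikit-learn trees (all values arbitrary):

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(0)
    X = rng.uniform(-3, 3, size=(1000, 1))
    y = np.sin(2 * X[:, 0]) + 0.1 * rng.normal(size=1000)  # clearly non-linear target

    learning_rate = 0.1
    prediction = np.full_like(y, y.mean())  # start from the mean
    trees = []

    for _ in range(100):
        residual = y - prediction                      # gradient of squared error
        tree = DecisionTreeRegressor(max_depth=3).fit(X, residual)
        prediction += learning_rate * tree.predict(X)  # each tree corrects what's left
        trees.append(tree)

    print("final MSE:", np.mean((y - prediction) ** 2))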
This explanation doesn’t make sense to me. What do you mean by “linearize your data”? Tree methods assume no linear form and are not even monotonically constrained. Classification is not done by plane-drawing but by probability estimation plus a cost function.
I assume it's because there are some very complex relationships and patterns that cannot be captured by decision trees. Tree models work better on simpler data; at least that's my gut feeling based on previous experiments with similar data.
Interesting. Usually I have better luck with xgboost for tabular data, even when the relationships are complex (which usually means deeper trees). It does fall flat a lot of the time for very high dimensions, though. All data is different, I guess.
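For what it's worth, the depth knob I mean is roughly this (a sketch with xgboost's sklearn-style wrapper on stand-in data; the grid values are placeholders):

    from sklearn.datasets import make_classification
    from sklearn.model_selection import GridSearchCV
    from xgboost import XGBClassifier

    # Stand-in tabular data; swap in your own features/labels.
    X, y = make_classification(n_samples=5000, n_features=50, n_informative=20,
                               random_state=0)

    # "Complex relationships" mostly means letting the trees go deeper, with
    # subsampling as a brake against overfitting.
    model = XGBClassifier(n_estimators=300, learning_rate=0.05,
                          subsample=0.8, tree_method="hist")

    search = GridSearchCV(model, param_grid={"max_depth": [3, 6, 10]}, cv=3)
    search.fit(X, y)
    print("best depth:", search.best_params_,
          "cv accuracy:", round(search.best_score_, 3))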
There is some work on zero-shot (decoder-only) time-series prediction from Google, plus an open-source variant. Curious to see how these approaches stack up as they're explored.