Decision Trees in Python: Predicting Diabetes (statisticallyrelevant.com)
91 points by rbanffy on Oct 6, 2022 | 19 comments



Two things are missing here.

The first is the null model result: if a model simply predicted the most common class, we would get an accuracy of 0.649. So the CART model has improved on that by ~14%. A good start.

The second is that in many business scenarios, and especially medical ones, you also need to know sensitivity (how good we are at predicting that a person with diabetes does have it) and specificity (how good we are at predicting that a person without diabetes does NOT have it). Always calculate these metrics and a confusion matrix for binary classification.
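For anyone following along in sklearn, something like this gets you both numbers (a rough sketch; y_test and y_pred are assumed to come from the article's fitted CART model):

    from sklearn.metrics import confusion_matrix

    # sklearn orders the 2x2 matrix as [[tn, fp], [fn, tp]]
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()

    sensitivity = tp / (tp + fn)   # recall on the positive (diabetic) class
    specificity = tn / (tn + fp)   # recall on the negative (non-diabetic) class
    print(f"sensitivity: {sensitivity:.3f}, specificity: {specificity:.3f}")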

If a simple model predicted you had low vitamin C, a doctor in my country would simply prescribe it even if the model had terrible specificity, because the treatment is harmless and cheaper than a blood test. On the other hand, if a simple model predicted cancer, they would refer you for a variety of scans and tests and not make a diagnosis until they were quite sure.


Take it one step further. A test _never_ has a binary outcome, this is _always_ the result of some score thresholded against something else (could be another score). Sensitivity and specificity are a function of the data used to validate the test, as well as of this tunable threshold! The tunability of the threshold should be a core concern when reporting the performance of a test. Look at ROC or DET curves and their performance measures (EER, false accept rate at a given false reject rate or vice versa, AUC).

This allows applications to weigh the costs and benefits of tuning this thresholding parameter, and to understand what happens when you accept a higher false reject rate versus a higher false accept rate.

Just reporting a specificity and sensitivity totally misses the point that the threshold is tunable and different applications can have fundamentally different needs for this parameter.

The vitamin-C case is one where false accepts (the model predicts I have a deficiency when in reality I do not) are essentially cost-free, so you can take the threshold where FAR == 100% and then report the FRR at that point.
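A rough sketch of what that looks like in code, assuming a fitted model with predict_proba and a held-out X_test/y_test (the 20% constraint is just an illustration):

    import numpy as np
    from sklearn.metrics import roc_curve, roc_auc_score

    # scores for the positive class, not hard 0/1 predictions
    scores = model.predict_proba(X_test)[:, 1]

    # fpr is the false accept rate, 1 - tpr is the false reject rate
    fpr, tpr, thresholds = roc_curve(y_test, scores)
    print("AUC:", roc_auc_score(y_test, scores))

    # example constraint: false accept rate at most 20%; report the false reject rate there
    ok = fpr <= 0.20
    i = np.argmax(tpr[ok])  # best admissible operating point
    print("threshold:", thresholds[ok][i], "FRR:", 1 - tpr[ok][i])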


> Take it one step further. A test _never_ has a binary outcome, this is _always_ the result of some score thresholded against something else (could be another score).

By test do you mean model/method? Because not all models have a score, e.g. discrete classifiers.

CART classifiers are simply large if/else statements. There is no tunable threshold, and while some people sort of draw something that looks like an ROC curve, it's not quite as useful.

But of course you are right: in the real world you would be comparing many different models built with different methods and hyperparameters, using a variety of metrics.
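To make the "large if/else statement" point above concrete, scikit-learn will print a fitted CART tree in exactly that form (quick sketch; clf and feature_names are assumed to be the fitted tree and its column names):

    from sklearn.tree import export_text

    # dumps the fitted tree as nested if/else rules over the raw feature thresholds
    print(export_text(clf, feature_names=feature_names))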


Unless you are using some very exotic method I don't know, you are wrong about discrete classifiers not having a score. Every method I know of extracts an integer from a vector of scores over the options, the most common being argmax over likelihoods trained with cross-entropy.


Nope. Decision trees typically output binary (or multinomial) values directly.

Sure, you could tune the node thresholds (or optimize them during tree construction) to vary the trade-off between specificity and sensitivity, but this may not be straightforward for deep trees or those with many leaf nodes, and it will never be as simple as tuning a single threshold (unless you are using decision stumps rather than decision trees).


How do you construct your decision tree, though? All the methods I'm familiar with use a criterion that comes back down to a score, which also ties back to the set-theoretic foundations of probability theory: assign a score and then normalise to obtain a probability.
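A minimal sketch of that "criterion is a score" idea, using CART-style Gini gain on a single numeric feature (toy code, not the article's implementation):

    import numpy as np

    def gini(y):
        # Gini impurity: 1 minus the sum of squared class proportions
        _, counts = np.unique(y, return_counts=True)
        p = counts / counts.sum()
        return 1.0 - np.sum(p ** 2)

    def best_split(x, y):
        # score every candidate threshold on one feature by the impurity decrease it buys
        parent = gini(y)
        best_t, best_gain = None, 0.0
        for t in np.unique(x):
            left, right = y[x <= t], y[x > t]
            if len(left) == 0 or len(right) == 0:
                continue
            child = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if parent - child > best_gain:
                best_t, best_gain = t, parent - child
        return best_t, best_gain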


> Decision trees typically output binary (or multinomial) values directly.

Is there some Bayesian version of decision trees where you model each decision in the tree as a beta distribution or something? Then you could simply choose the leaf or path with the highest confidence.

I just tried a cursory search about this but came up with a lot of theory justifying the use of thresholds as an approximation to a Bayesian interpretation, but I couldn't find any actual implementations that propagate a confidence measure or probability distribution through the tree, so I don't know if this is standard methodology.

I can imagine having confidence in the final decision to also be useful for model ensembling. Curious what the right way to do this might be, or to know if there are no advantages vs. simple threshold trees.


> Is there some Bayesian version of decision trees where you model each decision in the tree as a beta distribution or something? Then you could simply choose the leaf or path with the highest confidence.

Sure... there are many things you could do to "force" a decision tree to output a real scalar/vector (which you can then threshold): for example, you could use a regression tree whose leaves are logistic/multinomial GLMs.

My point is that most types of decision trees (including CART decision trees) do not work like that by default (and they are considered pretty standard classifiers in ML literature), unlike what was claimed by igorkraw.

> I can imagine having confidence in the final decision to also be useful for model ensembling. Curious what the right way to do this might be, or to know if there are no advantages vs. simple threshold trees.

Sure, you could use ensembles of decision trees (e.g. random forests) to estimate a continuous value that perhaps (but not surely) correlates with "confidence" or "probability", which you can then threshold to obtain a binary value.

But, again, this is beyond what a "vanilla" decision tree is attempting to do (note: a decision tree is usually not explicitly trying to model the probability of each sample belonging to class X or Y).
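A rough sketch of that random-forest route (assuming the usual X_train/y_train split; the 0.3 cutoff is an arbitrary illustration, not a recommendation):

    from sklearn.ensemble import RandomForestClassifier

    rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

    # averaged per-tree class fractions: a continuous score you can threshold,
    # but not necessarily a well-calibrated probability
    scores = rf.predict_proba(X_test)[:, 1]

    # move the cutoff away from the default 0.5 to trade specificity for sensitivity
    y_pred = (scores >= 0.3).astype(int)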


> (note: a decision tree is usually not explicitly trying to model the probability of each sample belonging to class X or Y).

I realize that. Maybe my question wasn't clear. I was specifically trying to ask if there exist interesting or useful models beyond standard decision trees that might do so.


I think I got your question right (but perhaps I was not totally clear in my answer).

In a nutshell: sure, there are lots of such possibly useful models that use trees (you can even come up with new ones yourself). Random forests or GLM regression trees are two examples.

My point is that, because standard decision trees do not attempt to explicitly model probabilities (they just try to partition the data in "homogeneous groups", in some sense), you have no pre-baked assurances that your "score" will actually be well calibrated (i.e., no assurances that it will actually be a good predictor of the raw probability/confidence).

Sure, you could come up with some "decision tree"-like (i.e., partitioning algorithm) Bayesian method that would (maybe) ensure reasonable calibration of "confidence"... but, at that point, are you even using "decision trees" anymore? Probably not.
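If you do want to check how far off those leaf "probabilities" are, one sketch (the max_depth and the isotonic wrapper are arbitrary illustrative choices):

    from sklearn.calibration import CalibratedClassifierCV, calibration_curve
    from sklearn.tree import DecisionTreeClassifier

    tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_train, y_train)

    # compare binned leaf scores against the observed positive rate in each bin
    frac_pos, mean_score = calibration_curve(y_test, tree.predict_proba(X_test)[:, 1], n_bins=10)

    # one common remedy: wrap the tree and refit a monotone map from score to probability
    calibrated = CalibratedClassifierCV(DecisionTreeClassifier(max_depth=4), method="isotonic", cv=5)
    calibrated.fit(X_train, y_train)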


The point is that the number of features is reduced by one at each node during training, but the features could themselves be scores from a probability distribution. It would be a form of preprocessing. The threshold is selected to minimize the difference in size of the split (training) sets, or to maximize information gain.

During classification, each node has a binary output.


My two most hated statistical words are

1. Significant. "Statistically significant" just means "we think we measured something more than random noise because this wouldn't happen by random chance very often", a usage completely unlike any other usage of the word in English. I think it should have been "statistically suggestive" because that's all it is.

2. "Predict" - 99% of "predictions" are just inferences. They don't actually provide information that we didn't already have.


Yes, people forget that the core of Fisher's approach was to start with a hypothesis, a plausible mechanism which would cause (or at least explain) the outcome. Only then is it fair to use a p-value to show that your measurement is "something that wouldn't happen by random chance very often."

As validation of a hypothesis which has merit on its own.


What are the 1% of predictions that aren't just inferences? I can't imagine.


I’m slowly working my way through the FastAI deep learning course right now. In that course, everything is a nail and deep learning is a hammer — which makes sense, because it’s a course about deep learning.

As a total noob, I’m curious: how do you know when to pick a particular AI approach for a problem?

E.g. decision trees. Are they preferred when features must be well-defined? When data is smaller? When individual steps need to be observable? Etc.


The way I understand it, as a quick rule of thumb: if you have tabular-type data, try random forests first (decision tree ensembles), then an NN, and see which gives you better results. If you have other types of data (images, sound, text, etc.), an NN is likely the way to go.
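A minimal sketch of that try-both workflow (X and y are placeholders for whatever tabular features and labels you have; the MLP is just a stand-in for "some NN"):

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.neural_network import MLPClassifier
    from sklearn.model_selection import cross_val_score

    # fit both on the same tabular data and keep whichever cross-validates better
    for model in (RandomForestClassifier(n_estimators=200, random_state=0),
                  MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=1000, random_state=0)):
        print(type(model).__name__, cross_val_score(model, X, y, cv=5).mean())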


Thank you! That seems like a pretty rational starting point.

I’m enjoying what I’ve learned so far but the ecosystem is large and has its own technical language. It’s often difficult to dig up simple answers like this.


No problem! The fast.ai forums and discussion groups are really friendly and will help if you have questions.

I will admit that fast.ai goes a little ... fast ... but that is kind of their thing. If you do not have any math-heavy background and want to learn ML at a little deeper level there is a book titled "Grokking Machine Learning" by Luis Serrano that I would recommend. The author is a good communicator and uses non-technical language to explain technical terms in a way that is easy to understand. I mostly already knew everything in the book when I read it, but it was so engaging that I read it anyway and did learn some new things.


Tossing out insulin = 0 (which is too vague... hepatic? Supplemented? Total daily dose? Average dose in units per hour?) will bias your dataset toward having more diabetics.
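If you want to see the effect on your own copy of the data, a quick sketch (the diabetes.csv filename and Pima-style column names are assumptions):

    import pandas as pd

    df = pd.read_csv("diabetes.csv")  # Pima-style columns assumed: "Insulin", "Outcome"

    # positive-class rate before and after dropping the zero-insulin rows
    print("all rows:   ", df["Outcome"].mean())
    print("insulin > 0:", df.loc[df["Insulin"] > 0, "Outcome"].mean())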

- a diabetes researcher



