How to visualize decision trees

parrt · on Sept 26, 2018

Decision trees are the fundamental building block of gradient boosting machines and Random Forests™, probably the two most popular machine learning models for structured data. Visualizing decision trees is a tremendous aid when learning how these models work and when interpreting models. Unfortunately, current visualization packages are rudimentary and not immediately helpful to the novice. So, we've created a general package called animl for scikit-learn decision tree visualization and model interpretation.

b_tterc_p · on Sept 26, 2018

This is cool. I like it, and will probably use it in my work, but it feels like there’s a lot going on. I don’t like how some of the final leaf nodes seem to be shown differently than the nodes higher up. Sometimes different chart types, sometimes reversed axes. I would also recommend using swarm plots for your regression scatter plots. Swarm plots are sexy, but not in the laughably uncomfortable way of the very similar violin plot.

parrt · on Sept 26, 2018

Yep, the leaves are predictor nodes whereas internal nodes are decision nodes. They are doing different things so we figured we should show them using different visualizations.

cschmidt · on Sept 26, 2018

Wow, I wondered why you put a TM on Random Forests. I guess it is a trademark of Salford Systems, which is kind of weird. Maybe we can just call them random forests and ignore that.

msla · on Sept 26, 2018

> Maybe we can just call them random forests and ignore that.

Legally, yes, you can, as the use is not mandatory:

https://academia.stackexchange.com/questions/21521/is-it-man...

> Although owners of trademarked names may suggest otherwise, publishers are not obligated to denote the trademark status of a name when that name is mentioned in text. Authors representing trademark owners frequently feel obligated to use the trademark or registered-trademark symbol (™ or ®) after the first mention of their product names but often do not use these symbols consistently to indicate the trademark status of other names not owned by their particular sponsor or employer.

The people who own the trademark may feel obligated to use those marks, but nobody else ever is.

There's a lot of "folk law" (that is, urban legends repeated by the ignorant) surrounding this concept, so if you think I'm wrong, please do yourself and the rest of us a favor and research good cites to show that there's actual law saying I'm wrong. Thanks.

jph00 · on Sept 26, 2018

I'm often guilty of this too - but we really should put the (tm) there. It's nice that they made code of the algorithm publicly available and all they ask is that we respect their trademark in return. I think that's more than fair. :)

(I discussed this a few years ago with the co-inventor of random forests, Adele Cutler, and she confirmed that this is something that she wants to see happen.)

marktangotango · on Sept 26, 2018

Are algorithms patentable? Last I checked, in the US they were. Are they copyrightable?

kgwgk · on Sept 26, 2018

Not the answer to your question, but in case it helps anyone: trademarks are unrelated to patents. You can use a random forest but you can not call them “random forest”. “Aleatory jungle” is fine, though.

sacado2 · on Sept 27, 2018

"stochastic treeset". Sounds way more scientific, which can be required to convince a pointy-hair boss. "Random" forest sounds... well, I can flip a coin too, how is that going to solve my problem?

For the same reason, "naive" Bayes classifiers are very hard to sell, to the point that I stopped naming them and now just say "a very fast machine learning algorithm", unless specifically asked.

oscilloscope · on Sept 26, 2018

There's a nice interactive version of a decision tree diagram here, in the section "Growing a Tree".

http://www.r2d3.us/visual-intro-to-machine-learning-part-1/

parrt · on Sept 26, 2018

Indeed. They were the inspiration for this visualization. I wanted to do something for my book with Jeremy Howard https://mlbook.explained.ai/ and those guys show the way, but of course it isn't a general library. Love that r2d3.us page.

comboy · on Sept 26, 2018

This is the first time ever that a website does something when I scroll and I'm not mad.

anonytrary · on Sept 26, 2018

I was particularly impressed with the continuous scroll which lets you go through the animations frame-by-frame, so to speak.

benmccann · on Sept 27, 2018

I have a model trained with XGBoost Java. Can I take the model file and read it with scikit-learn to visualize it with this library?

jdonaldson · on Sept 27, 2018

Good to see others looking into tree model viz. I've done work with larger scale tree visualizations and found you quickly run out of space. I wound up using interactivity to reveal branch level info, dynamically pruned the tree based on train support, and I used a more sophisticated layout technique to pack more info in. https://www.google.com/amp/s/blog.bigml.com/2012/01/23/beaut...

FWIW visualizing trees like that helps spot problems really quickly. Overfitting behavior typically involves overusing a certain field, or growing long and relatively narrow branches.
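
Both warning signs can also be checked numerically. Here is a minimal sketch, assuming a hypothetical nested-dict tree format (the keys "feature", "left", "right" and "n" are made up for illustration and are not any library's API). It counts how often each field is used for a split and records the depth of every leaf:

```python
from collections import Counter

def tree_stats(node, depth=0, stats=None):
    """Walk a nested-dict tree and collect split counts and leaf depths."""
    if stats is None:
        stats = {"feature_counts": Counter(), "leaf_depths": []}
    if "feature" not in node:            # no split field, so this is a leaf
        stats["leaf_depths"].append(depth)
        return stats
    stats["feature_counts"][node["feature"]] += 1
    tree_stats(node["left"], depth + 1, stats)
    tree_stats(node["right"], depth + 1, stats)
    return stats

# Hypothetical toy tree: a long, narrow chain that keeps splitting on "age".
toy_tree = {
    "feature": "age",
    "left": {"n": 40},
    "right": {
        "feature": "age",
        "left": {"n": 3},
        "right": {
            "feature": "age",
            "left": {"n": 1},
            "right": {"feature": "income", "left": {"n": 1}, "right": {"n": 1}},
        },
    },
}

stats = tree_stats(toy_tree)
most_used, uses = stats["feature_counts"].most_common(1)[0]
total_splits = sum(stats["feature_counts"].values())
max_depth = max(stats["leaf_depths"])
mean_depth = sum(stats["leaf_depths"]) / len(stats["leaf_depths"])

print(f"most used field: {most_used} ({uses} of {total_splits} splits)")
print(f"max leaf depth: {max_depth}, mean leaf depth: {mean_depth:.2f}")
```

One field dominating the splits, or a maximum leaf depth far above the mean, matches the shapes described above. Any thresholds are a judgment call.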

parrt · on Sept 27, 2018

Thanks for that link. Super useful. Looks like BigML uses the same layout I did for ANTLR parse trees. Really packs stuff in; e.g., https://cdn-images-1.medium.com/max/1760/1*k0mO4kJyQvPCyyev0...

jdonaldson · on Sept 27, 2018

Yeah the general algorithm is "Reingold-Tilford" with some tweaks from Buchheim.

https://en.m.wiktionary.org/wiki/Reingold-Tilford_algorithm

Really good algorithm to have in an arbitrary visualization toolkit.
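
For anyone who wants a feel for tree layout, here is a deliberately simplified sketch. It is not Reingold-Tilford, which tracks subtree contours to pack subtrees tightly. It only places leaves in consecutive x slots and centers each parent over its children. The nested-dict format with "name" and "children" keys is an assumption for illustration, and node names must be unique:

```python
def layout(node, depth=0, next_x=None, positions=None):
    """Assign (x, y) to every node.

    Leaves take consecutive integer x slots (left to right);
    each parent is centered between its first and last child.
    """
    if next_x is None:
        next_x = [0]              # mutable counter: next free leaf slot
    if positions is None:
        positions = {}
    children = node.get("children", [])
    if not children:
        x = next_x[0]
        next_x[0] += 1
    else:
        child_xs = []
        for child in children:
            layout(child, depth + 1, next_x, positions)
            child_xs.append(positions[child["name"]][0])
        x = (child_xs[0] + child_xs[-1]) / 2
    positions[node["name"]] = (x, depth)
    return positions

# Hypothetical toy tree.
toy_tree = {
    "name": "root",
    "children": [
        {"name": "a", "children": [{"name": "c"}, {"name": "d"}]},
        {"name": "b"},
    ],
}

positions = layout(toy_tree)
for name, (x, y) in sorted(positions.items(), key=lambda kv: (kv[1][1], kv[1][0])):
    print(f"{name}: x={x}, depth={y}")
```

This version wastes horizontal space on unbalanced trees, which is exactly what the contour-based approach is meant to address.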

parrt · on Oct 2, 2018

Just a note that dtreeviz works cross-platform now! Mac, Windows, Linux. "pip install -U dtreeviz" See more at https://github.com/parrt/dtreeviz

beamatronic · on Sept 27, 2018

If you model the decision tree as a directed graph, there’s no reason you couldn’t export it into a Doom/Quake level or even Minecraft or Roblox
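
A rough sketch of the first step, assuming a hypothetical nested-dict tree with "name", "left" and "right" keys (not any library's format): flatten the tree into directed edges, which a level exporter could then treat as rooms and corridors.

```python
def to_edges(node, edges=None):
    """Flatten a nested-dict binary tree into (parent, child, branch) edges."""
    if edges is None:
        edges = []
    for branch in ("left", "right"):
        child = node.get(branch)
        if child is not None:
            edges.append((node["name"], child["name"], branch))
            to_edges(child, edges)
    return edges

# Hypothetical toy tree; the node names are just labels.
toy_tree = {
    "name": "petal_len<2.5",
    "left": {"name": "setosa"},
    "right": {
        "name": "petal_wid<1.8",
        "left": {"name": "versicolor"},
        "right": {"name": "virginica"},
    },
}

edges = to_edges(toy_tree)
for parent, child, branch in edges:
    print(f"{parent} -[{branch}]-> {child}")
```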

parrt · on Sept 27, 2018

Heh that’s a cool idea. Fly through the tree like a maze

nestorD · on Sept 27, 2018

Not sure about the choice of pie chart as the default leaf format (humans are bad at guessing proportions from pie charts) but otherwise it does look great and conveys the information efficiently.

parrt · on Sept 27, 2018

Howdy! We use a pie chart for classifier leaves, despite their bad reputation. For the purpose of indicating purity, the viewer only needs an indication of whether there is a single strong majority category. The viewer does not need to see the exact relationship between elements of the pie chart, which is one key area where pie charts fail.
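
To make "a single strong majority category" concrete, here is a minimal sketch that reduces a leaf to its majority class and that class's share of the samples. The function name and the sample labels are hypothetical, and this is only one simple way to summarize purity, not necessarily what the library computes:

```python
from collections import Counter

def leaf_purity(labels):
    """Return (majority_class, fraction of samples in that class)."""
    counts = Counter(labels)
    majority, n = counts.most_common(1)[0]
    return majority, n / len(labels)

# Hypothetical leaves: one nearly pure, one mixed.
pure_leaf = ["setosa"] * 9 + ["versicolor"]
mixed_leaf = ["versicolor"] * 5 + ["virginica"] * 5

pure_result = leaf_purity(pure_leaf)
mixed_result = leaf_purity(mixed_leaf)
print(pure_result)
print(mixed_result)
```

A pie for the first leaf is almost one solid color, while the second is split down the middle. That is the only distinction the viewer needs to read off.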

swinghu · on Sept 27, 2018

Good, a lot of information distilled from the drawing.