Visualizing popular machine learning algorithms (jsfiddle.net)
195 points by ingve on Oct 24, 2015 | 38 comments



For learners it's confusing to see nonlinear decision boundaries for linear and logistic regression; IMO a note about the feature expansion should be added.


Good point, I've updated my post. For linear and logistic regression there's a cubic expansion on the features (which is how they can fit curved problems). The relevant JavaScript code is on lines 91 and 96.

PS: It can be changed to "linear" or "quadratic" as well.
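For the curious, here's a minimal sketch of what a cubic expansion on two features looks like (the expandCubic name is mine for illustration, not nerdy.js's actual API):

    // Expand two raw features (x, y) into all terms up to third order.
    // A linear model fit on these nine features can trace a curved
    // boundary in the original (x, y) plane.
    function expandCubic(x, y) {
      return [
        x, y,                    // linear terms
        x * x, x * y, y * y,     // quadratic terms
        x * x * x, x * x * y,    // cubic terms
        x * y * y, y * y * y
      ];
    }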


Yeah, that or label the axes...


How would labeling the axes help explain what is going on?


Awesome. Would be great to have execution times.

Also what is nerdy.js? I saw it was related to "Carl Edward Rasmussen" but couldn't find another reference on the net


It's a JavaScript library I put together a long time ago for dealing with datasets and machine learning algorithms. I used it for some of my own personal projects and haven't focused on releasing it into the wild (although I'm considering it now).

The reference to Carl Edward Rasmussen is because I based my minimize function heavily on this one: http://learning.eng.cam.ac.uk/carl/code/minimize/


I'd be interested in the library :D


Me too!


I'm intrigued also. It's definitely some sort of machine-learning-related library anyway. I found something related to it, but it doesn't really have any substantial information on it either:

http://nerdyjs.appspot.com/


Looks like the original author is David Wybiral: http://davywybiral.blogspot.ca/2015/10/visualizing-popular-m...


Looks like K-nearest neighbor does pretty well.


Yes, k-NN is theoretically one of the best ML algorithms in the sense that it will find the closest items in the training set. For classification, or for finding similar-looking items, it's great. However, it has pretty poor running times when evaluating unseen data (http://nlp.stanford.edu/IR-book/html/htmledition/time-comple...). This is in contrast to something like neural networks, which take a while to train but then evaluate very quickly. For real-world use the training times matter to an extent, but in a web app or real-time application the latency of k-NN is just impractical.
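For intuition, here's what a brute-force k-NN prediction looks like (an illustrative sketch, not nerdy.js's actual code); every single query has to scan the whole training set:

    // Brute-force k-NN: O(n) distance computations per query, plus a sort.
    // A trained neural net, by contrast, answers in time independent of n.
    function knnPredict(train, query, k) {
      const neighbors = train
        .map(p => {
          const dx = p.x - query.x, dy = p.y - query.y;
          return { label: p.label, dist: dx * dx + dy * dy }; // squared Euclidean
        })
        .sort((a, b) => a.dist - b.dist)
        .slice(0, k);
      // Majority vote among the k nearest neighbors.
      const votes = {};
      for (const n of neighbors) votes[n.label] = (votes[n.label] || 0) + 1;
      return Object.keys(votes).reduce((a, b) => (votes[a] >= votes[b] ? a : b));
    }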


That's why we developed the "Boundary Forest" algorithm, a nearest-neighbor-type algorithm that generalizes at least as well as k-NN while responding to queries very quickly.

It maintains trees of examples that let it train and respond to test queries in time logarithmic in the number of stored examples, which can be much smaller than the overall number of training samples. It thus keeps k-NN's very fast training time, is an online algorithm, and can be used for regression problems as well as classification.

See our paper that was presented at AAAI 2015 here: http://www.disneyresearch.com/publication/the-boundary-fores...


It also suffers from the curse of dimensionality, making it weaker as the number of features increases.


I wouldn't call it theoretically the best. It is affected by outliers and doesn't do any generalization at training time. That latter point raises the question of whether it even deserves the name "learning". I'd say linear models are typically a better learning algorithm; I wouldn't know what to call "the best" algorithm, but it might be deep learning nowadays.


These visualisations are great, but they're misleading regarding the performance of these classifiers. In practice you don't have a lot of data in a small number of dimensions (2 in this case); you have a little bit of data in zillions of dimensions. Think of classifying a 100x100 pixel RGB image: that's 3x100x100 = 30,000-dimensional data. You may not even have one training sample per class per dimension. Generalizing from comparatively little data to a very high-dimensional space is the true difficulty of machine learning. Unfortunately, you can't easily visualize that.


Try the "Island inside an island" test (put a blue cluster inside an orange island). Only k-means and SVM dealt with it satisfactorily.


Shouldn't a neural net with sufficient units do that too?


Yes, if you increase the hidden layer from 5 to 10 nodes:

http://jsfiddle.net/udb95202/


There is also MLDemos [1], which is open source.

[1]: http://mldemos.epfl.ch/


I'm a little surprised that the neural network comes up with a straight line and linear regression doesn't; I'd have thought linear regression would do so by definition (e.g. on two normal groups).

Some discussion of the methods, i.e. how many hidden layers/nodes for the neural network, would probably help make sense of it.

Random forest could be worth adding.


Looking at the code (http://jsfiddle.net/wybiral/3bdkp5c0/light/), it seems they're expanding the features to include all second- and third-order terms (options.expansion = cubic); that's why linear regression doesn't come up with a straight line.


Sounds interesting, but I can't see the results with Firefox 38.2.1.


Try this one: https://jsfiddle.net/752pqyvp/embedded/result

It's because the browser is blocking mixed content: the JS libraries are being loaded over HTTP, but the JSFiddle is served over HTTPS.

The version above loads the libraries over HTTPS via cdnjs.com.


I can't see the results either. Chrome 46.0.2490.71 (64-bit)


Can someone please explain this?


It's using the X and Y locations of the dots as training data. Each algorithm is trained on (x, y) -> color in an attempt to build up a rule for predicting what color an unseen (x, y) pair would be. The hypothesis it builds is then used to color the background so that you can see the decision boundary.
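Roughly, the loop looks like this (trainClassifier, dots, and the canvas ctx are placeholder names, not the fiddle's actual identifiers):

    // Train on the user-placed dots, then color each cell of a coarse
    // grid with the predicted class to reveal the decision boundary.
    const step = 4;                    // grid resolution in pixels
    const model = trainClassifier(     // placeholder for the real training call
      dots.map(d => [d.x, d.y]),      // inputs: dot positions
      dots.map(d => d.color)          // targets: dot colors
    );
    for (let y = 0; y < canvas.height; y += step) {
      for (let x = 0; x < canvas.width; x += step) {
        ctx.fillStyle = model.predict([x, y]);  // predicted color at (x, y)
        ctx.fillRect(x, y, step, step);
      }
    }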


There's a bug somewhere.

Refresh, choose dataset: curved, algorithm: k-means clustering. You get this:

http://imageshack.com/a/img633/7110/sfteaE.png

If you play around and select different algorithms before selecting k-means clustering, you can get very different results. :)


I accidentally left k-means in there as an option, and it doesn't make much sense in the context of this example. So, yeah, it's a bit of a bug. Realistically, linear regression doesn't make sense to include either, but it still kinda works.


Some content is loaded over HTTP rather than HTTPS, so that's why it might display a blank page for people who have HTTPS forced.


Any visualization of these algorithms in two dimensions (with cubic feature expansion!) is completely misleading if you intend to work on a real problem with many dimensions. Also, for those asking for execution times: those would be horribly misleading as well.


+1!

Are you aware of any reasonable high-dimensional "visualizations"? They can't be fully accurate, of course, but capturing the essential features would be nice.

E.g. here is a 4d cube: https://commons.wikimedia.org/wiki/File:8-cell.gif


How is there no training time delay? How is training all these classifiers not putting my CPU into a sweat??

edit: I should also mention: this is very cool :)


The dataset is quite small and you have a fast machine. On my laptop, a 7-year-old Core 2, there's a slight delay when running some of the heavier algorithms (e.g. the neural net or SVM on the island dataset).


Note: you can click on the graph to add datapoints.


No source on GitHub, but why? Nerdy.js looks like an interesting library, but I failed to find any relevant information about it.


Wow, this is awesome. Can we also play around with the classifier parameters (e.g. the k of k-NN)?


Only in the code :) Click "Edit in JSFiddle" and look at line 109 in the JavaScript section. You'll see: options.k = 5
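That is, it's a one-line tweak (the alternative values here are just suggestions):

    options.k = 5;  // try k = 1 (noisier boundary) or k = 15 (smoother)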



