Implementing a Principal Component Analysis In Python (2014) (sebastianraschka.com)
151 points by nafizh on Feb 6, 2017 | 26 comments



If anyone is interested, I also wrote a post on PCA this past weekend, but it uh... didn't get any upvotes :(

http://michaeljflynn.net/2017/02/06/a-tutorial-on-principal-...


One more proof that python > R :)

I guess this kind of tutorial is more popular in Python, since the R / Matlab people come with a statistics background and don't really need tutorials on PCA.



PCA is basically a low-rank decomposition of data in Euclidean space (where the principal axis accounts for the bulk of the variation in the data, and each successive axis is orthogonal and accounts for the remaining variation). Notably, though, most problems do not lie in a Euclidean vector space; they often exist on a lower-dimensional manifold embedded in such a space. More recent work considers the problem of performing low-rank decompositions when you assume constraints on the factors, i.e., general priors. The work on this is quite recent and may be of interest to someone. You can essentially perform PCA with constraints using a message passing algorithm (originally derived from physics applications):

https://arxiv.org/abs/1701.00858


If anyone is interested, a much faster technique is using the randomized methods from https://arxiv.org/abs/0909.4061 - see e.g. http://scikit-learn.org/stable/modules/generated/sklearn.dec...
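A minimal sketch of what that looks like in practice (the toy matrix shape and component count are made up):

    import numpy as np
    from sklearn.decomposition import PCA

    X = np.random.randn(10000, 500)                      # toy tall matrix
    # the 'randomized' solver implements the Halko et al. method from the paper above
    pca = PCA(n_components=10, svd_solver='randomized', random_state=0)
    scores = pca.fit_transform(X)                        # approximate principal component scores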


I recently implemented PCA by computing the covariance matrix and using power iteration to obtain the eigenvectors. I found these pages useful for understanding how PCA works:

http://www.stat.cmu.edu/~cshalizi/uADA/12/lectures/ch18.pdf

https://en.wikipedia.org/wiki/Power_iteration

I've used it to analyse the Voynich Manuscript: http://web.onetel.com/~hibou/voynich/VoynichPagesPCA.html
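For anyone curious, a rough sketch of that covariance-plus-power-iteration approach in numpy (my own illustration, not the code used for the Voynich analysis; the iteration counts are arbitrary):

    import numpy as np

    def pca_power_iteration(X, k=2, n_iter=100):
        Xc = X - X.mean(axis=0)                  # centre the data
        C = Xc.T @ Xc / (len(Xc) - 1)            # covariance matrix
        components = []
        for _ in range(k):
            v = np.random.randn(C.shape[0])
            for _ in range(n_iter):
                v = C @ v                        # power iteration step
                v /= np.linalg.norm(v)
            components.append(v)
            C -= (v @ C @ v) * np.outer(v, v)    # deflate: remove the component just found
        return np.array(components)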


Mildly curious question: I've read a few papers recently which talk about 'Projection to Latent Structures' (PLS). I know PLS and PCA are related, but I'm struggling to understand the differences and when you should use one or the other. Are there any good references people could recommend?


I believe PLS is related to PLSR; the difference from PCA is that you are trying to maximize the covariance between the predictor variables and the observed variables, rather than just the variance of the predictors.
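If it helps to see it concretely, scikit-learn ships an implementation; a minimal sketch with made-up toy data:

    import numpy as np
    from sklearn.cross_decomposition import PLSRegression

    X = np.random.randn(100, 10)     # predictors
    Y = np.random.randn(100, 2)      # observed responses
    # PLS picks directions in X that maximise covariance with Y,
    # where PCA would only look at the variance of X itself
    pls = PLSRegression(n_components=3).fit(X, Y)
    T = pls.transform(X)             # X scores in the latent space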


It's good for educational purposes, but I would say there isn't much to "implement" since PCA is simply the eigen-decomposition of a matrix.


Big difference between knowing the math in theory and actually using it to do something useful with your data.


Agreed, but that's a double-edged sword. The thing you do with your data which starts out useful at first can and will easily erode into infrastructural debt if not built on sound foundations, the kind found off the shelf in standard scientific computing libraries that have been around for decades.


That difference approaches zero as programming becomes more symbolic and as implementations become more available.

Libraries like TensorFlow, Theano, etc. are great because then I can focus on what I want (math) instead of spending my time telling a computer how it should multiply things together in an imperative sense.


Or, more likely in an actual implementation, an SVD of A (rather than a direct eigendecomposition of A^T * A).
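To make the equivalence concrete, a small numpy sketch (toy data assumed):

    import numpy as np

    X = np.random.randn(200, 4)
    Xc = X - X.mean(axis=0)
    # eigendecomposition of the covariance matrix (the A^T * A route)
    evals, evecs = np.linalg.eigh(Xc.T @ Xc / (len(Xc) - 1))
    # SVD route: the right singular vectors span the same principal axes
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    # s**2 / (n - 1) matches evals (eigh returns ascending order), and the
    # rows of Vt match the columns of evecs up to sign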


There's no such thing as "simply" in numerical linear algebra.


you can skip sklearn and other packages and just use numpy (which is supported in pypy now), and just do an SVD of the (mean-centred) input matrix:

    import numpy as np

    Xc = X - X.mean(axis=0)            # centre the data first
    U, d, Vt = np.linalg.svd(Xc, full_matrices=False)
    D = np.diag(d)
    Xhat = U[:, :2].dot(D[:2, :2])     # scores on the first two principal components


in addition, if you want a way to project new vectors into the reduced SVD space:

    l = 2                                    # desired number of components
    U, d, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = (Vt.T * (1.0 / d))[:, :l]            # V * diag(1/d): maps feature space to U-space coordinates
    xnew = (x - X.mean(axis=0)).dot(P)       # centre the new vector the same way as the training data


Does anyone know of a similar understandable introduction to Independent Component Analysis (ICA)?


One issue is that ICA covers a few different implementations.

My favourite would be the '95 Infomax paper by Bell and Sejnowski. Under this setup:

ICA is a single layer neural network that maps N inputs to N outputs. The loss function being maximised is the mutual info between input and output.
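A very rough sketch of that update rule (the natural-gradient form of the Bell-Sejnowski algorithm; this assumes whitened, zero-mean data, and the learning rate and iteration count are made up):

    import numpy as np

    def infomax_ica(X, lr=0.01, n_iter=100):
        n, p = X.shape
        W = np.eye(p)                               # unmixing matrix: N inputs -> N outputs
        for _ in range(n_iter):
            for x in X:
                u = W @ x                           # the single-layer output
                y = 1.0 / (1.0 + np.exp(-u))        # logistic nonlinearity
                # natural-gradient Infomax update
                W += lr * (np.eye(p) + np.outer(1.0 - 2.0 * y, u)) @ W
        return W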


ICA is also pretty neat


I agree. I'm surprised by the number of situations where people use PCA when ICA really would be more appropriate. I suppose that's the difference between knowing the tools and knowing the math behind the tools, though.


The late David MacKay said this about PCA:

"Principal Component Analysis is a dimensionally invalid method that gives people a delusion that they are doing something useful with their data. If you change the units that one of the variables is measured in, it will change all the "principal components"! It's for that reason that I made no mention of PCA in my book. I am not a slavish conformist, regurgitating whatever other people think should be taught. I think before I teach."

https://www.amazon.com/review/R16RJ2PT63DZ3Q


I wouldn't read a book from an author who reasons and writes like that.

The issue he is having with PCA can be solved by normalizing the data.
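i.e. something like this minimal sketch (toy data; the unit scales are made up):

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    X = np.random.rand(100, 3) * [1.0, 1e3, 1e6]   # columns measured in wildly different units
    Z = StandardScaler().fit_transform(X)          # zero mean, unit variance per column
    pca = PCA(n_components=2).fit(Z)               # components no longer depend on the units chosen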


MacKay did foundational research in machine learning; was a chief scientific advisor to the UK government's Department of Energy and Climate Change; created eye tracking user interface software for paralysed people; and wrote an excellent text book on information theory. I.e. a proper Olympian. So when you read a statement like that, you should take it seriously and possibly realign your world-view.


I'm sure David MacKay is well aware of the need for normalization — his point is that people use PCA without understanding what it means and how to perform it properly. Memorizing a technique without having any idea of why the technique works often leads to meaningless (or worse, incorrect) interpretations of data.


Let's say I'm building something with naive Bayes, but some of my inputs are correlated. Could PCA help me find an independent set of inputs?


You're describing collinearity. Yes, the PCA components (the transformed scores) will not be correlated.
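A minimal sketch of chaining the two (the dataset and component count are just for illustration):

    from sklearn.datasets import load_iris
    from sklearn.pipeline import make_pipeline
    from sklearn.decomposition import PCA
    from sklearn.naive_bayes import GaussianNB

    X, y = load_iris(return_X_y=True)
    # PCA rotates the correlated inputs into uncorrelated scores,
    # which sits better with GaussianNB's conditional-independence assumption
    model = make_pipeline(PCA(n_components=2), GaussianNB()).fit(X, y)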



