I guess this kind of tutorial in Python is more popular since the R / Matlab people come with a statistics background and don't really need tutorials on PCA.
PCA is basically a decomposition of data into a low-rank Euclidean space, where the first principal axis accounts for the bulk of the variation in the data and each successive axis is orthogonal to the previous ones and accounts for as much of the remaining variation as possible. Notably, though, most problems do not lie in a Euclidean vector space; they often exist on a lower-dimensional manifold embedded in such a space. More recent work considers the problem of performing low-rank decompositions when you assume constraints on the factors, i.e., general priors, and may be of interest to someone here. You can essentially perform PCA with constraints using a message passing algorithm (originally derived from physics applications):
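For anyone who wants to see the plain, unconstrained version of that concretely, here is a minimal sketch of PCA as a low-rank decomposition via SVD. The data and variable names are made up for illustration, and the constrained / message-passing variant is well beyond this sketch:

```python
import numpy as np

# Plain, unconstrained PCA as a low-rank decomposition via SVD
# (toy data; the constrained / message-passing variant is beyond this sketch).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
Xc = X - X.mean(axis=0)                 # center the data

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
k = 2
X_rank_k = (U[:, :k] * s[:k]) @ Vt[:k]  # best rank-k approximation in Frobenius norm

# The principal axes are orthonormal, and s**2 / (n - 1) gives the
# variance accounted for by each successive axis.
print(np.allclose(Vt @ Vt.T, np.eye(Xc.shape[1])))
print(s**2 / (Xc.shape[0] - 1))
print(np.linalg.norm(Xc - X_rank_k))    # reconstruction error of the rank-k model
```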
I recently implemented PCA by computing the covariance matrix and using power iteration to obtain the eigenvectors. I found these pages useful for understanding how PCA works:
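For reference, here is a minimal sketch of that covariance-matrix + power-iteration approach (my own toy version with deflation, not the parent's actual code):

```python
import numpy as np

# Covariance matrix + power iteration, with deflation to get successive eigenvectors.
def pca_power_iteration(X, n_components, n_iter=500):
    Xc = X - X.mean(axis=0)
    C = Xc.T @ Xc / (Xc.shape[0] - 1)        # sample covariance matrix
    rng = np.random.default_rng(0)
    components = []
    for _ in range(n_components):
        v = rng.normal(size=C.shape[0])
        for _ in range(n_iter):               # power iteration
            v = C @ v
            v /= np.linalg.norm(v)
        eigval = v @ C @ v                    # Rayleigh quotient ~ eigenvalue
        components.append(v)
        C = C - eigval * np.outer(v, v)       # deflate, then repeat for the next axis
    return np.array(components)

X = np.random.default_rng(1).normal(size=(300, 4))
print(pca_power_iteration(X, 2))
```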
Mildly curious question: I've read a few papers recently which talk about 'Projection to Latent Structures' (PLS). I know PLS and PCA are related, but I'm struggling to understand the differences and when you should use one or the other. Are there any good references people could recommend?
I believe PLS is closely related to PLS regression (PLSR); the difference from PCA is that instead of maximizing the variance of the predictors alone, you are trying to maximize the covariance between the predictor variables and the response variables.
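In case it helps, a rough sketch of the difference on made-up data: the first PCA axis only looks at the variance of X, while the first PLS weight vector is the leading left singular vector of X^T Y, i.e., it maximizes covariance with the responses.

```python
import numpy as np

# Toy comparison (invented data, just to show the two objectives differ).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                                       # predictors
Y = X @ rng.normal(size=(5, 2)) + 0.1 * rng.normal(size=(100, 2))   # responses
X -= X.mean(axis=0)
Y -= Y.mean(axis=0)

# First PCA axis: leading right singular vector of X (max variance of X alone).
pca_dir = np.linalg.svd(X, full_matrices=False)[2][0]

# First PLS weight vector: leading left singular vector of X^T Y
# (max covariance between a projection of X and a projection of Y).
pls_dir = np.linalg.svd(X.T @ Y, full_matrices=False)[0][:, 0]

print("first PCA axis:", np.round(pca_dir, 3))
print("first PLS weights:", np.round(pls_dir, 3))
```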
Agreed, but that sword cuts both ways. The thing you do with your data that starts out useful can and will easily erode into infrastructural debt if it isn't built on sound foundations, and those foundations cannot simply be found off the shelf in standard scientific computing libraries that have been around for decades.
That difference approaches zero as programming becomes more symbolic and as implementations become more available.
Libraries like TensorFlow, Theano, etc. are great because then I can focus on what I want (math) instead of spending my time telling a computer how it should multiply things together in an imperative sense.
I agree. I'm surprised by the number of situations where people use PCA when ICA really would be more appropriate. I suppose that's the difference between knowing the tools and knowing the math behind the tools, though.
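For what it's worth, a quick sketch of where the two diverge, using scikit-learn on a toy blind-source-separation problem (the data here is invented for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA, FastICA

# Two independent, non-Gaussian sources mixed linearly.
rng = np.random.default_rng(0)
t = np.linspace(0, 8, 2000)
s1 = np.sign(np.sin(3 * t))           # square wave
s2 = rng.laplace(size=t.size)         # heavy-tailed source
S = np.c_[s1, s2]

A = np.array([[1.0, 0.5], [0.5, 1.0]])  # mixing matrix
X = S @ A.T                             # observed mixtures

# PCA returns orthogonal directions of maximal variance, which in general
# do NOT line up with the original sources...
X_pca = PCA(n_components=2).fit_transform(X)

# ...whereas ICA looks for statistically independent components, which is
# what you want when the goal is to recover the underlying sources.
X_ica = FastICA(n_components=2, random_state=0).fit_transform(X)
```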
"Principal Component Analysis is a dimensionally invalid method that gives people a delusion that they are doing something useful with their data. If you change the units that one of the variables is measured in, it will change all the "principal components"! It's for that reason that I made no mention of PCA in my book. I am not a slavish conformist, regurgitating whatever other people think should be taught. I think before I teach."
MacKay did foundational research in machine learning; was Chief Scientific Advisor to the UK government's Department of Energy and Climate Change; created eye-tracking user interface software for paralysed people; and wrote an excellent textbook on information theory. I.e., a proper Olympian. So when you read a statement like that, you should take it seriously and possibly realign your world-view.
I'm sure David MacKay is well aware of the need for normalization — his point is that people use PCA without understanding what it means and how to perform it properly. Memorizing a technique without having any idea of why the technique works often leads to meaningless (or worse, incorrect) interpretations of data.
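To make that concrete, here is a toy illustration (my own numbers, not MacKay's): changing the units of one column changes the leading principal component, while standardizing the columns first, i.e., doing PCA on the correlation matrix, removes that dependence.

```python
import numpy as np

# Two correlated variables; one is then re-expressed in different units.
rng = np.random.default_rng(1)
height_m = rng.normal(1.7, 0.1, size=500)
weight_kg = 40 + 30 * height_m + rng.normal(0, 5, size=500)

def leading_pc(X):
    Xc = X - X.mean(axis=0)
    # principal axes are the right singular vectors of the centered data
    return np.linalg.svd(Xc, full_matrices=False)[2][0]

X_meters = np.c_[height_m, weight_kg]
X_millimeters = np.c_[height_m * 1000, weight_kg]   # same data, new units

print(leading_pc(X_meters))         # dominated by the weight column
print(leading_pc(X_millimeters))    # now dominated by the height column

# Standardizing each column (equivalently, PCA on the correlation matrix)
# makes the result independent of the units.
def standardize(X):
    return (X - X.mean(axis=0)) / X.std(axis=0)

print(leading_pc(standardize(X_meters)))
print(leading_pc(standardize(X_millimeters)))   # same direction up to sign
```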
http://michaeljflynn.net/2017/02/06/a-tutorial-on-principal-...