Linear compression in Python: PCA vs unsupervised feature selection (efavdb.com)
77 points by efavdb on Aug 13, 2018 | 16 comments



I found this line confusing:

> The printed lines above show that both algorithms capture more than 50% of the variance exhibited in the data using only 4 of the 50 stocks.

Based on the sklearn PCA documentation [1], this has nothing to do with the coefficients on individual stocks; for PCA the sentence should read more like: "[...] capture more than 50% of the variance exhibited in the data using only 4 components [...]", which is not the same thing.

1. http://scikit-learn.org/stable/modules/generated/sklearn.dec...
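For illustration (a minimal sketch with made-up data standing in for the post's 50-stock returns matrix), the sklearn quantity in question sums over components, not stocks:

    import numpy as np
    from sklearn.decomposition import PCA

    # Hypothetical stand-in for the (n_days, 50) stock-returns matrix.
    rng = np.random.default_rng(0)
    returns = rng.normal(size=(250, 50))

    pca = PCA(n_components=4).fit(returns)
    # The summed ratio is the variance captured by 4 *components* -- each a
    # linear mix of all 50 stocks -- not by 4 individual stocks.
    print(pca.explained_variance_ratio_.sum())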


Not really. PCA extracts components (or factors) from the individual items (in this case, each stock). OP points out that those 4 stocks load strongly on the first PCA component (a large amount of their variation goes into calculating it). I would interpret the explained variances more as a way to make sense of a component (maybe stocks of tech companies load on it) than as a measure of extraction quality. Usually, the eigenvalues of the correlation matrix can be used to get a sense of how well PCA is performing and how many dimensions are present in the raw data.
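A minimal sketch of that eigenvalue check (made-up data in place of the actual returns matrix):

    import numpy as np

    # Hypothetical (n_days, n_stocks) matrix of daily percent changes.
    rng = np.random.default_rng(0)
    returns = rng.normal(size=(250, 50))

    # Eigenvalues of the correlation matrix: a few large leading eigenvalues
    # suggest few effective dimensions; a flat spectrum suggests many.
    corr = np.corrcoef(returns, rowvar=False)
    eigenvalues = np.sort(np.linalg.eigvalsh(corr))[::-1]
    print(eigenvalues[:5], eigenvalues[:5].sum() / eigenvalues.sum())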


The idea (ranking stocks by sum of the absolute value of the coefficients) is valid, but that line of code (variance explained by each component) doesn't do that.
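Something like this would implement the ranking idea (a sketch with made-up data; `explained_variance_ratio_` is a different quantity entirely):

    import numpy as np
    from sklearn.decomposition import PCA

    # Hypothetical (n_days, n_stocks) returns matrix.
    rng = np.random.default_rng(0)
    returns = rng.normal(size=(250, 50))

    pca = PCA(n_components=4).fit(returns)
    # Rank stocks by the summed absolute value of their loadings across the
    # retained components, then keep the top 4.
    scores = np.abs(pca.components_).sum(axis=0)
    top_stocks = np.argsort(scores)[::-1][:4]
    print(top_stocks)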


I think your criticism is right. In that line I was thinking about the feature selector, which does pull out 4 of the individual stocks -- these capture more than 50% of the variance in the full set. As you pointed out, that description isn't quite right for the PCA line, which uses four hybrid components.


To expand upon this, in PCA, different components have "loadings" not on single variables but on many variables (a.k.a. columns), which are often highly correlated with each other.


Does it even make sense to run PCA on the percentage change of a stock? To me it would make more sense to use it with physical properties of the underlying company. PCA helps you reduce a higher-dimensional space to a lower-dimensional one so you can group stocks together. I am a little confused by what the author is trying to do.


Sure it does. The author could've been clearer about the goals, though. Say that rather than pre-selecting 50 stocks, we ran the analysis on the whole market - then each component intuitively corresponds to some segment of the market. For example, one component might correspond to agro companies whose prices fluctuate with the weather. Another might correspond to NFLX and other companies that rely on AWS, which fluctuate together based on cloud storage price changes.

This kind of interpretation falls out of the math (the eigendecomposition/SVD/covariance matrix view of PCA in particular).
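As a sketch of how you'd eyeball that (made-up tickers and returns):

    import numpy as np
    from sklearn.decomposition import PCA

    # Hypothetical daily percent changes for 50 tickers over 250 days.
    rng = np.random.default_rng(0)
    returns = rng.normal(size=(250, 50))
    tickers = ["TICK%02d" % i for i in range(50)]

    pca = PCA(n_components=3).fit(returns)
    # For each component, list the tickers with the largest absolute loadings;
    # these are the stocks that tend to move together along that direction.
    for i, component in enumerate(pca.components_):
        top = np.argsort(np.abs(component))[::-1][:5]
        print(i, [tickers[j] for j in top])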



I wish people were better acquainted with the literature, e.g. https://www.nowpublishers.com/article/Details/ECO-002

(Ed: yeah, that's just a sample of the book but has a large bibliography at the end.)


I can't seem to make the COD reach 1.0

   >>> selector.ordered_cods
   [0.43298218, ... , 0.5068577, 0.5068577]
Do you think this is a problem/bug?


Did you use the code for a supervised application? i.e., did you pass both an `X` and a `y` to the selector? If so, then getting less than 1 just means you can't get a perfect fit to `y` with your features. Please let me know if that's not it.

If interested, you can find some detailed examples in the tutorials here: https://github.com/EFavDB/linselect_demos
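Roughly, the supervised case I'm describing looks like this (a sketch with synthetic data; assumes linselect's FwdSelect interface):

    import numpy as np
    from linselect import FwdSelect

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 10))
    # y is a noisy linear function of the first three columns, so no feature
    # subset can fit it perfectly and the final COD stays below 1.0.
    y = (X[:, :3].sum(axis=1) + rng.normal(size=200)).reshape(-1, 1)

    selector = FwdSelect()
    selector.fit(X, y)                # supervised: pass both X and y
    print(selector.ordered_cods[-1])  # roughly the R^2 of the full fit; < 1.0 here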


Another technique for unsupervised feature selection is Principal Feature Analysis (PFA): http://venom.cs.utsa.edu/dmz/techrep/2007/CS-TR-2007-011.pdf
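Not the paper's code, but the gist as I understand it (function name and parameters here are my own): run PCA, represent each original feature by its vector of loadings, cluster those vectors, and keep the feature nearest each cluster center.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.decomposition import PCA

    def principal_feature_analysis(X, n_keep, n_components):
        # Each row of A describes one original feature via its loadings on
        # the retained principal components.
        A = PCA(n_components=n_components).fit(X).components_.T
        km = KMeans(n_clusters=n_keep, n_init=10, random_state=0).fit(A)
        # Keep the feature whose loading vector sits closest to each centroid
        # (duplicates are possible, hence the set).
        picks = {int(np.argmin(np.linalg.norm(A - c, axis=1)))
                 for c in km.cluster_centers_}
        return sorted(picks)

    rng = np.random.default_rng(0)
    returns = rng.normal(size=(250, 50))   # hypothetical returns matrix
    print(principal_feature_analysis(returns, n_keep=4, n_components=10))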


This dataset could be interesting as it consists of stocks and cryptos https://vectorspace.ai/recommend/datasets


This title seems a bit confusing, since PCA is a form of unsupervised feature selection (or rather, feature weighting).

The title reads like it has the form "<specific method> vs <broader category that method fits into>".


I tend to think of feature selection as methods like LASSO that induce conceptual sparsity in the feature space, and use the label "dimensionality reduction" for methods that reduce covariant dimensionality without inducing conceptual sparsity -- PCAing 100 features down to 3 principal components may or may not lend itself to a simpler interpretation; often it just substitutes the problem of labeling the loaded principal components for the original problem (toy contrast below).

Not disagreeing with you, just spitballing.
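A toy contrast of the two behaviors (synthetic data, just to make the point):

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 10))
    y = 2 * X[:, 0] - 3 * X[:, 1] + rng.normal(size=200)

    # LASSO: sparsity in the original feature space -- most coefficients are
    # driven to exactly zero, so the surviving features keep their meaning.
    lasso = Lasso(alpha=0.1).fit(X, y)
    print(np.flatnonzero(lasso.coef_))

    # PCA: fewer dimensions, but every component is a dense mix of all 10
    # features, so each one still has to be interpreted/labeled.
    pca = PCA(n_components=3).fit(X)
    print(pca.components_.shape)   # (3, 10), dense loadings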


That's fair and was helpful to hear! I had dimensionality reduction in the back of my mind, and now that you mention it, your point about conceptual sparsity definitely seems important here.



