
This sounds interesting. I am in a field where much more of the focus is on visualizing samples under different distance metrics using PCoA rather than regular PCA.

If I just scroll through the Zheng et al. arXiv paper, it all seems a little arbitrary to me: selecting 1,000 features, then 50 components. They argue it is for computation-time reasons, but is there any kind of benchmark suggesting this is a better strategy than just plotting the first two components, or than using MDS, which in this scenario also has the advantage of being convex?
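
For concreteness, here is a minimal Python sketch of the kind of pipeline I mean (not the paper's actual code; the matrix shape, random data, and both cutoffs are placeholders):

    import numpy as np
    from sklearn.decomposition import PCA

    # X: cells-by-genes expression matrix (random placeholder data)
    rng = np.random.default_rng(0)
    X = rng.poisson(1.0, size=(500, 20000)).astype(float)

    # Step 1: keep the 1,000 most variable genes (the arbitrary cutoff in question)
    top = np.argsort(X.var(axis=0))[-1000:]
    X_hv = X[:, top]

    # Step 2: reduce to 50 principal components, typically fed to t-SNE/clustering
    pcs = PCA(n_components=50).fit_transform(X_hv)
    print(pcs.shape)  # (500, 50)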




There are so many arbitrary choices made when analyzing single-cell RNA-seq data: coverage cutoffs to decide when a gene is expressed, arbitrary QC thresholds to decide when a cell is "good quality", the PCs chosen for t-SNE, the genes identified as more variable than the estimated level of technical noise, and so on. It is very frustrating. This leads to huge reproducibility problems: almost every paper uses its own in-house custom analysis pipeline, and they rarely make them open source.
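
To make the arbitrariness concrete, a toy Python sketch of typical QC filtering (the data and both thresholds are made up, which is exactly the point):

    import numpy as np

    # counts: cells-by-genes matrix of raw read counts (toy data)
    rng = np.random.default_rng(1)
    counts = rng.poisson(0.5, size=(1000, 5000))

    # Two of the arbitrary cutoffs in question; neither number is principled
    MIN_COUNTS_PER_CELL = 2000  # below this, a cell is "low quality"
    MIN_CELLS_PER_GENE = 3      # a gene is "expressed" if seen in >= 3 cells

    good_cells = counts.sum(axis=1) >= MIN_COUNTS_PER_CELL
    expressed_genes = (counts > 0).sum(axis=0) >= MIN_CELLS_PER_GENE
    filtered = counts[good_cells][:, expressed_genes]
    print(filtered.shape)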


The main reason is that no one wants to publish technical or "methods"-type papers that assess the technology. There are usually one or two initial big papers introducing the technology that make a big splash. No one subsequently wants to assess it or improve much on it, because it won't publish well and you likely will not get cited for it anyway.


Yes, I know the feeling.

Our field has some very arbitrary thresholds for noise on single features; it sounds like single-cell genomics has a slightly more principled strategy?


The Seurat R package (http://satijalab.org/seurat/) tries to give you more information on how to choose your PCs. But as with anything in biology, it becomes subjective at some point.
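
Seurat itself is R, but the underlying idea is easy to sketch in Python: plot explained variance per PC and look for an elbow (toy data below; where you draw the cutoff is still a judgment call):

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.decomposition import PCA

    # X: normalized cells-by-genes matrix (random placeholder data)
    rng = np.random.default_rng(2)
    X = rng.normal(size=(300, 2000))

    pca = PCA(n_components=50).fit(X)

    # "Elbow" plot: keep PCs up to where the curve flattens out
    plt.plot(np.arange(1, 51), pca.explained_variance_ratio_, marker="o")
    plt.xlabel("principal component")
    plt.ylabel("explained variance ratio")
    plt.show()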


Thanks!

I happen to have some super high-dimensional data (~100k-1M dimensions), which takes a huge amount of time to work with because I have to write everything myself, and I notice they claim all their underlying functions use sparse matrix representations. Have you tried it in a very high-dimensional context?
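
In case it helps, here is a minimal sketch of the sparse route I would expect such a package to take under the hood (generic scipy/scikit-learn, not their actual internals; shapes and density are placeholders):

    import scipy.sparse as sp
    from sklearn.decomposition import TruncatedSVD

    # Sparse cells-by-features matrix with ~1M columns (placeholder shape/density)
    X = sp.random(2000, 1_000_000, density=1e-4, format="csr", random_state=0)

    # TruncatedSVD accepts sparse input directly, so the matrix is never
    # densified (plain PCA mean-centers, which would destroy the sparsity)
    svd = TruncatedSVD(n_components=50, random_state=0)
    pcs = svd.fit_transform(X)
    print(pcs.shape)  # (2000, 50)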


With single-cell data, the expression of each gene is a dimension and each cell is a sample. So you end up with something in the range of 20-30k genes and possibly thousands of cells (a ~25k x thousands matrix). I don't think the technology is at the scale of hundreds of thousands of cells yet, so I am not sure whether this package will handle 100k-1M dimensions.
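
Back-of-the-envelope for that scale (the 10% nonzero fraction is a guess for illustration):

    # Dense vs. sparse storage for a 25,000-gene x 5,000-cell count matrix
    genes, cells = 25_000, 5_000
    dense_bytes = genes * cells * 8                  # float64 everywhere
    nnz = int(genes * cells * 0.10)                  # assume ~10% nonzeros
    sparse_bytes = nnz * (8 + 4) + (cells + 1) * 4   # CSC: data + indices + pointers
    print(f"{dense_bytes / 1e9:.2f} GB dense")       # 1.00 GB
    print(f"{sparse_bytes / 1e9:.2f} GB sparse")     # ~0.15 GB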



