> The data science field has been flooded with PhDs with nowhere else to go that...

baron_harkonnen · on Jan 15, 2021

My question is: "Why are you pursuing a PhD if you want to end up as a data scientist?"

I've known a surprisingly large number of people who are mid-PhD and thinking about data science as a career. Don't pursue 5+ years of learning to master the world of academic research if your goal is to help people sell t-shirts or whatever.

Certainly there are some people pursuing specific PhDs, such as those in computer vision and NLP, where there are some industry options that might offer more challenging/interesting research than academia. It makes sense if you're a PhD student at NYU or Stanford in CS fields related to neural networks to go work for Yann LeCun at Facebook or Geoffrey Hinton at Google.

But if you're, say, a biologist who wants to sell clothes online... why spend 6 years working in academia to do that? Is your dream really to optimize clothing sales? If so, don't be a biologist. If your dream is biology, why in the world would you set your course on selling clothes?

I get it if your dream is biology but you can't find a tenure-track job and so you pivot to industry... but if you are mid-PhD, what are you doing there? If you love your subject, try to find a way to work in that, and if you don't, don't waste your time.

Data Science is not a glamorous job, and at the vast majority of companies it is literally bullshit. The people solving mind-bendingly hard problems are already in programs specializing in those problems because that's what they are passionate about. On top of that, DS is way over-indexed at most companies. If you're mid-PhD now, I would expect a serious contraction in DS jobs in the next 5 years. DS will be a niche job after the next market "correction".

zwaps · on Jan 15, 2021

I mean, you have to think about cause and effect here. DS will contract because many/some data scientists simply aren't good enough, and most DS just doesn't do what it is supposed to.

First, like you said, there are the stray PhDs who do it since they know research and some statistical applications. Second, there are hordes and hordes of DS people who "learned" their skill with some bootcamps or online courses, which means they know enough to write notebooks and glue together functions. Their understanding of theory is often shallow. In either case, it is hard to "blame" someone for taking an attractive job. But it isn't good for the discipline.

The appeal of DS is clear for companies. But the problems it promises to solve are much more complex than we collectively recognize - or are willing to admit. In my opinion, doing causal inference is a difficult, unsolved, and deep topic and no single course would equip you to tackle it. It takes domain knowledge and multiple years of stats/math/ML (all of them, not one of them). And yet, causal inference is what 90% of people want ML to be. A model that works on some dataset is not a model that is useful in light of the true latent DGP. Yet, when we want to sell T-Shirts, what do we really want?
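To make the DGP point concrete, here is a minimal toy sketch. Everything in it is an assumption made up for illustration: the variable names, the coefficients, and the idea that a hidden factor `z` (say, seasonal demand) drives both how much we discount (`d`) and how much we sell (`y`). The true effect of the discount is set to 1.0, but a naive regression of sales on discount comes out at around 2.5 under these coefficients, because it absorbs the confounder. Only if you know about `z` and adjust for it do you get back something near 1.0.

```python
import random

def slope(x, y):
    """OLS slope of y on x, with an intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def residuals(x, y):
    """What is left of y after regressing it on x (with an intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = slope(x, y)
    return [(yi - my) - b * (xi - mx) for xi, yi in zip(x, y)]

def simulate(n=20000, true_effect=1.0, seed=0):
    """Hypothetical DGP: z confounds both the 'treatment' d and the outcome y."""
    rng = random.Random(seed)
    z = [rng.gauss(0, 1) for _ in range(n)]          # hidden confounder (assumed)
    d = [zi + rng.gauss(0, 1) for zi in z]           # discount depends on z
    y = [true_effect * di + 3.0 * zi + rng.gauss(0, 1)
         for di, zi in zip(d, z)]                    # sales depend on d and z

    naive = slope(d, y)                              # ignores z: biased upward
    # Adjusting for z via partialling out (Frisch-Waugh): regress the
    # z-residuals of y on the z-residuals of d.
    adjusted = slope(residuals(z, d), residuals(z, y))
    return naive, adjusted

if __name__ == "__main__":
    naive, adjusted = simulate()
    print(f"naive slope:    {naive:.2f}")
    print(f"adjusted slope: {adjusted:.2f}")
```

Both regressions "work" on the dataset in the sense of fitting it; only one answers the question the business actually asked. And this is the easy case, where the confounder is observed and the model is linear.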

Hence, when I look at the problems that ML is supposed to solve, I think that most people calling themselves DS on LinkedIn are not really equipped for it. And there is a case to be made that people from fields where PhD researchers train to answer such causal questions are indeed better equipped to tackle the issue.

For example, if it's about selling shirts, I would take an econometrician with some data engineering skills over a Coursera superstar any day of the week. I think if you do a PhD in ML/Stats/Biostats/Econometrics/etc., it is reasonable to pursue a career in DS. It's what statistics _is_ now.

If you have some other PhD and know some ANOVA, OLS and Stata - or if you have a CS background but know some Jupyter and Keras - then it's essentially a career change. It might work, but probably not without a hitch.

So I agree with you, but I'd reframe it: It's unclear to me whether we need a contraction, or whether we instead need a quality update.

I disagree with you on one point: I do not think we will make progress in DS (getting it to work in more use cases) by treating it like a solved problem, a skill like milling that needs talent and experience, but not academic education. If we do that, I think DS will contract because it will stagnate in usefulness.

My point here is not to accuse anyone of being a bad DS. I am sure there are many ways to become efficient. But even the theory of causal inference with simple linear models goes far, far beyond what I saw in ML hiring tests, online courses and so forth. And solving the problems it tackles is not accomplished by throwing more layers at it. For other ML algos, we aren't even close yet to understanding these issues on a similar level.

In the end, what we need are actual ML scientists. They should neither be pure statisticians, nor pure subject-matter experts, nor pure computer scientists - as we mostly have now. We also need more than the current ML programs that are mostly cobbled together from other areas. For example, people who publish in ML research are probably very useful in a company that has to deal with that exact problem. Any scientist knows, of course, that even a fairly adjacent question may already require tons of different knowledge. DS is, will remain, and probably should be an academic field, because there are more open than solved problems right now.