
Having to deal with data scientists, I absolutely agree. The thing I've seen that lands in the "lab" vs. production distinction is that these people expect their data to be pristine. They flip out when the world isn't as perfect as their models want. This leads to me, as just a normal software developer, having to do the data analysis and figure out how to clean it up.

I also end up having to be the one to talk to data vendors to understand their data feeds and essentially translate that for the data scientists. Having to sit in the middle is annoying for me and suboptimal for the business.




The data science field has been flooded with PhDs with nowhere else to go, who have no background in engineering and sadly often have a very poor understanding of both machine learning and statistics.

Companies were in a rush to hire "data scientists," and boot camps like Insight were more than happy to pump out very impressive PhDs with just enough understanding to build a Keras model.

I've worked in industry a while doing DS work and have been astounded at the number of PhDs who both don't know how to write Python that doesn't live in a notebook and throw away years of disciplined experimentation experience to just throw Keras models at data until the needle moves.

There do exist excellent data scientists out there, who are both very solid software engineers and really know their stuff mathematically, but I've found most of these people can't reliably find jobs, because the people interviewing them know so little that good data scientists get penalized for answering a stock question correctly.

The field has been so flooded with amateurs who have no idea what they're doing that potential mentors have been driven out, and now it's just a mess. To get a DS job when you do know what you're doing, you have to play a weird game where you guess the incorrect answer the interviewer has in mind.


Not to mention the dark pattern of giving data scientist candidates an unsolved industry problem as their interview take-home task, and then telling them to only spend 4 hours on it. Data science hiring often feels like a competition where the winner is the one who has the most free time and willingness to do other people's work without compensation.

It's kind of a fucked up field right now.


I work at a place with a very high count of PhDs. Some of them write code. All of them view writing code as something menial and unimportant, and it shows in the resulting work, which in my experience is atrocious.

Of course I understand that YMMV, but after working here I will forever be skeptical of anyone with a PhD writing code.


Are they CS/EE PhDs?


Think back to your own CS professors: were any of them particularly good software engineers?

I've found that physics PhDs tend to have the highest probability of being good coders, since a certain subset of them get bitten by the software bug when they need to write non-trivial amounts of code to solve research problems.


I got my physics PhD in the early 90s. Physics has had a tradition of interest in programming that goes back decades. We've always had "big data," meaning big relative to the tools available at any given moment. We ran out of problems that could be solved by pencil and paper in the 1930s.

Every physics student at my college had to take FORTRAN, plus programming was assumed in many of the other courses, and we also took an electronics course that included digital techniques. And maybe the main thing was simply that programming was interesting and fun.

We've also had a tradition of learning to do everything ourselves, for better or worse. I had no access to a professional programmer.


The ones whose work I've seen are EEs.


I teach an introductory Python lab course at my university. It's targeted at engineers who still create graphs in Excel and then normally level up to MATLAB if things get complicated (think insets, ...). I'd guess about 30% of the people have previously done at least some of the YouTube/Udemy "courses" on data science. It's really horrifying for me (not being an engineer myself, but IMO having a relatively engineering-like mindset) to see these people balk at simple tasks like writing a variadic function. "What do I need this for?" Well, it's part of using the programming environment. And then we let them code up a simple version of Levenberg-Marquardt. The level of "why do I need to do this" is astonishing again...
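For anyone curious what that kind of exercise looks like, here's a minimal sketch (the model and data are made up for illustration): a variadic model function handed to SciPy's curve_fit, which defaults to Levenberg-Marquardt for unbounded problems.

    import numpy as np
    from scipy.optimize import curve_fit

    def poly_model(x, *coeffs):
        # Variadic: *coeffs lets the same function fit any polynomial degree.
        return sum(c * x**i for i, c in enumerate(coeffs))

    rng = np.random.default_rng(1)
    xdata = np.linspace(0.0, 1.0, 50)
    ydata = 1.0 + 2.0 * xdata + 0.5 * xdata**2 + rng.normal(scale=0.05, size=50)

    # With a *args signature, p0 is required so curve_fit knows how many
    # parameters to fit; with no bounds it uses Levenberg-Marquardt ('lm').
    popt, pcov = curve_fit(poly_model, xdata, ydata, p0=[1.0, 1.0, 1.0])
    print(popt)  # roughly [1.0, 2.0, 0.5]

The variadic signature is the whole point of the exercise: the fitting routine doesn't care how many parameters the model takes, so one function covers every degree.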


> why do I need to do this

IMO this is the number one problem of our modern culture around education. Popular culture makes it fashionable to treat education as pointless, and this even affects students who are pursuing difficult degrees. "Why do I need to study humanities? Why should I learn to code if I think I was born to be someone else's boss?"

On the other hand, many teachers in K12 and early university have no ability to connect the "what" with the "why." "The curriculum is the curriculum. The test is the test."

If we can solve these problems, our societies will be much better off.


If the educator cannot explain why the knowledge is useful, then he is unfit to teach it.


> The data science field has been flooded with PhDs with nowhere else to go, who have no background in engineering and sadly often have a very poor understanding of both machine learning and statistics.

I am a PhD student in a non-engineering field. I've been taking as many math and stats courses as I can, but what other courses should I be trying to take if I want to excel as a data scientist? Software engineering CS type courses?


My question is: "Why are you pursuing a PhD if you want to end up as a data scientist?"

I've known a surprisingly large number of people who are mid-PhD and thinking about data science as a career. Don't pursue 5+ years of learning to master the world of academic research if your goal is to help people sell t-shirts or whatever.

Certainly there are some people pursuing specific PhDs, such as those in computer vision and NLP, where there are industry options that might offer more challenging/interesting research than academia. It makes sense for a PhD at NYU or Stanford in CS fields related to neural networks to go work for Yann LeCun at Facebook or Geoffrey Hinton at Google.

But if you're, say, a biologist who wants to sell clothes online... why spend 6 years working in academia to do that? Is your dream really to optimize clothing sales? If so, don't be a biologist. If your dream is biology, why in the world would you set your course on selling clothes?

I get it if your dream is biology but you can't find a tenure-track job and so you pivot to industry... but if you are mid-PhD, what are you doing there? If you love your subject, try to find a way to work in it; if you don't, don't waste your time.

Data science is not a glamorous job, and at the vast majority of companies it is literally bullshit. The people solving mind-bendingly hard problems are already in programs specializing in those problems, because that's what they are passionate about. On top of that, DS is way over-indexed at most companies. If you're mid-PhD now, I would expect a serious contraction in DS jobs in the next 5 years. DS will be a niche job after the next market "correction".


I mean, you have to think about cause and effect here. DS will contract because many data scientists simply aren't good enough, and most DS just doesn't do what it is supposed to.

First, like you said, there are the stray PhDs who do it since they know research and some statistical applications. Second, there are hordes and hordes of DS people who "learned" their skill with some bootcamps or online courses, which means they know enough to write notebooks and glue together functions. Their understanding of theory is often shallow. In either case, it is hard to "blame" someone for taking an attractive job. But it isn't good for the discipline.

The appeal of DS is clear for companies. But the problems it promises to solve are much more complex than we collectively recognize - or are willing to admit. In my opinion, causal inference is a difficult, unsolved, and deep topic, and no single course would equip you to tackle it. It takes domain knowledge and multiple years of stats/math/ML (all of them, not one of them). And yet, causal inference is what 90% of people want ML to be. A model that works on some dataset is not a model that is useful in light of the true latent DGP. Yet when we want to sell t-shirts, what do we really want?
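A toy example of how "fits the dataset" and "useful against the true DGP" come apart, with simulated data (everything here is made up for illustration): an unobserved confounder drives both the "treatment" and the outcome, so a naive regression fits well yet badly misestimates the effect.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 100_000

    z = rng.normal(size=n)              # unobserved confounder (e.g. season)
    t = 2.0 * z + rng.normal(size=n)    # "treatment" (e.g. ad spend), driven by z
    y = 1.0 * t + 5.0 * z + rng.normal(size=n)  # true effect of t on y is 1.0

    # Naive regression of y on t alone: predicts y well, coefficient is biased.
    X_naive = np.column_stack([np.ones(n), t])
    beta_naive, *_ = np.linalg.lstsq(X_naive, y, rcond=None)

    # Adjusting for the confounder recovers the true effect.
    X_adj = np.column_stack([np.ones(n), t, z])
    beta_adj, *_ = np.linalg.lstsq(X_adj, y, rcond=None)

    print(beta_naive[1])  # ~3.0: confounded estimate
    print(beta_adj[1])    # ~1.0: the true effect, once z is observed

Both models "work" in the predictive sense; only one answers the question the t-shirt seller is actually asking ("what happens if I change t?"), and telling them apart requires knowing the DGP, not just the fit.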

Hence, when I look at the problems that ML is supposed to solve, I think most people calling themselves DS on LinkedIn are not really equipped for them. And there is a case to be made that the fields where PhD researchers train to answer exactly such causal questions are indeed better equipped to tackle the issue.

For example, if it's about selling shirts, I would take an econometrician with some data engineering skills over a coursera superstar any day of the week. I think if you do a PhD in ML/Stats/Biostats/Econometrics/etc., it is reasonable to pursue a career in DS. It's what statistics _is_ now.

If you have some other PhD and know some ANOVA, OLS, and Stata - or if you have a CS background but know some Jupyter and Keras - then it's essentially a career change. It might work, but probably not without a hitch.

So I agree with you, but I'd reframe it: it's unclear to me whether we need a contraction, or whether we instead need a quality upgrade.

I disagree with you on one point: I do not think we will make progress in DS (getting it to work in more use cases) by treating it like a solved problem - a skill like milling that needs talent and experience, but not academic education. If we do that, I think DS will contract because it will stagnate in usefulness.

My point here is not to accuse anyone of being a bad DS. I am sure there are many ways to become effective. But even the theory of causal inference with simple linear models goes far, far beyond what I've seen in ML hiring tests, online courses, and so forth. And solving the problems it tackles is not accomplished by throwing more layers at them. For other ML algos, we aren't even close to understanding these issues on a similar level.

In the end, what we need are actual ML scientists. They should be neither pure statisticians, nor pure subject-matter experts, nor pure computer scientists - which is mostly what we have now. We also need more than the current ML programs that are mostly cobbled together from other areas. For example, people who publish in ML research are probably very useful in a company that has to deal with that exact problem. Any scientist knows, of course, that even a fairly adjacent question may already require tons of different knowledge. DS is, will remain, and probably should be an academic field, because there are more open than solved problems right now.


Do you have any suggestions for where to start looking for good places to apply that don't suffer from this?


I personally have given up and turned mercenary. Even if you're passionate about statistics, machine learning, or any DS-related discipline, don't make work a bigger part of your identity than the average Starbucks barista does. Find a team/company that pays you well and isn't too opinionated, with low ego (if possible). Don't look for challenging work; the few places it exists already have the people they need, whereas the companies that pretend they have challenging problems tend to be insufferable. Look for a team where you can check in and check out without too much stress and get paid well.


Just want to say that while the data science profession definitely includes a wide range of people and skillsets, a good data scientist should be practical and able to work with the available data in whatever state it's in.

No good data scientist should ever expect data to be pristine. And a good data scientist, even if they don't have quite the engineering chops necessary to build a production-quality ETL, should know enough about the process to help guide it. If they aren't a part of that process, they're not being a good DS. They can't expect someone not involved with their problem to know what tradeoffs to make, and if they don't know exactly how their data went from raw form to the ETL-ed form, they're probably going to make bad assumptions, and those assumptions may very well make their architected solution a complete pile of garbage. Not to mention, how can a DS offer suggestions for solutions if they aren't deeply familiar with the raw data that's available?

To me, a good data scientist should, at bare minimum, have several skills.

* They should first and foremost (but not solely) be the in-house expert in statistics and machine learning, knowing what can and can't be done with data. They should arrive with that knowledge. Engineers, I think, have a tendency to trivialize this, but true expertise in this domain comes only with years of experience.

* They should strive to find modeling solutions that are right for a particular business problem. If they seem to be only applying the hottest research regardless of the tradeoffs for the particular business problem, that's a red flag.

* Their focus should be on integrating themselves with the product/business as much as possible, and with the engineering team as much as possible. If they're expecting to be handed directives, that's a recipe for a ton of wasted time.

DS should never, ever be siloed into their own little DS world. They will be useless without a deeply intimate knowledge of the business goals, the needs of product, and the capabilities of the engineering team.

As they progress, they should become more and more "full-stack", otherwise they are stagnating.


A good data scientist should also be good at science. Otherwise, you can simply hire people with engineering skills - you don't need scientists. If you hire scientists and then are surprised they aren't good at engineering, the hiring process needs a reality check.


Statistics is a science as well. Unfortunately the title is overloaded in business terms and can mean anything from "knows means and regressions" to "has a copy of Meyn and Tweedie on their shelf".


Instead of sneering at "having to deal with" data scientists, consider that the data scientists themselves would often much rather have data engineers and dev ops people involved in the process.

Data scientists like to quip that 80% of the job is data cleaning, with the remaining 20% divided up arbitrarily among other tasks to suit the joke. In some shops nowadays, it's more like 45% data cleaning, 45% data engineering/ops/programming just trying to make your results available to the rest of the organization, and 10% research.

If I can spend less time learning/doing software engineering and devops and more time doing actual data science, that's great. At a previous job, my team was clamoring for more data engineer hiring, and part of the reason our projects were slipping and starting to fail was lack of data engineering support. Our tooling was shit, our processes were shit, our code was shit, and access to (and trust of) our data sources was especially wet and stinky shit.

It made the daily work of doing data science a miserable slog of ad-hoc duct-tape solutions, and it contributed to us being generally ineffective as a team.

All of this would have been fixed if we had one competent data engineer with some actual real-world data/ML engineering experience and good communication/advocacy skills. Let alone two or three!


If the DE tooling was shit and you couldn't hire more fast enough, why didn't your team members start addressing these problems? Surely spending half the time cleaning up the pipes would increase the value of what you do with the other half?


This implies a lack of rigorous training. In the physical sciences, one wouldn't become an applied scientist without conducting an experiment to test a phenomenon, and enduring the teeth-gnashing that goes with making that experiment work.

Those who have been fed pristine data, without having to undergo the trials and tribulations of actually collecting it, have missed a crucial part of scientific training. Like you, I find this lack of rigour rather common among data scientists. Not all, but quite a few.


That's what I was wondering reading this thread. Much of science is dirty work in other fields and I think that is a good thing.

How ridiculous to assume that a scientist doesn't clean their tools and set up their experiments.

(Surely as one gets more experienced and older, the job likely becomes less manual, more about teaching and coordinating.)


I think that this is more of a problem with the specific people that you have worked with and it isn't inherent to the role of a data scientist.


It’s becoming more inherent, especially as the field is populated with people who have no experience with the “science” part. That is, with the very real and ubiquitous problem of collecting and cleaning data to make it fit for scientific study. Even theoretical physicists, for example, participate in and rely on empirical data collection, and understand deeply how messy and fraught with error it is.

I don’t see the same appreciation or consideration in general in the field of data “science.”


I remember working with someone who has a PhD in Physics and who worked at CERN - and one comment I loved: "a key skill is knowing how to place the legend so it obscures that annoying outlier data point"


> with people who have no experience with the “science” part

It's interesting that you put it that way because a lot of the other complaints in this thread are that the people who expect their data to be ready for use are exactly the people with science experience but without the relevant technical background.


Doesn't sound like a modern data scientist; sounds more like a statistician with 30+ years of experience.



