Data science is different now (veekaybee.github.io)
295 points by AhtiK on Feb 14, 2019 | hide | past | favorite | 52 comments



There's something very relevant in this story. Data Science is too glamorous a term. There's an implication that the DS person is some sort of magician who maybe isn't as good at general coding but has special data magic skills, making them more valuable than your average grunt.

In my years in finance, there was a similar problem. One guy in particular I worked with reckoned himself an "ideas guy" and would simply spout out gibberish that he expected the rest of us to implement. He could barely use Excel himself, let alone code.

The fact is the best coders I met never fancied themselves as specialists. They could certainly fit some models for you, but they could also write some SQL, set up replication and other maintenance, write cron jobs, set up ssh keys, merge some git branches, and write front and back end code in several different languages, declarative and imperative. I always put it down to a mix of curiosity and humility, giving these people a very good grasp of the fundamentals plus a foothold in almost every area of coding that I could think of.


There's also specialists (mostly epidemiologists, statisticians, etc.) who view their "special data magic skills" as often actively dangerous.


To be fair their special skill can be actively dangerous. In the same way that my devops skills can also be dangerous. Disclaimer: I've periodically been doing data science since before it was called that.


DevOps skills don't help define public policies with gibberish and baseless assertions disguised as facts extracted from data.

Sometimes so-called data scientists come across as shamans interpreting numerical bones and plotted fire shapes generated from computational peyote, and people take those conclusions as answers to their problems.


Data science as such isn't really a thing that defines public policy. That's more the realm of classical statistical analysis. Although there are some major flaws in classical statistical analysis too, such that the risks of Type I and Type II errors are far too high.


I have been a data scientist for the last 4 years.

I think (one of) the problems with the data science career field is that there are a lot of juniors who want to run sklearn and call it a day, following tutorials that seem to "just work" in ways that real-world data never does without a fight.

To get value out of the work, you have to be methodical, careful, and really dig into the data. The observation that 85% of the time is cleaning doesn't eliminate the need to know what you're doing, what approaches to use, how to judge success, how to communicate results, etc.
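That "85% cleaning" rarely looks like the tutorials. A minimal sketch of the kind of work it actually involves, using pandas on a made-up messy extract (column names and values are hypothetical):

```python
import io

import pandas as pd

# A deliberately messy CSV extract of the kind tutorials skip.
raw = io.StringIO(
    "user_id,signup_date,revenue\n"
    "1,2019-01-05,10.50\n"
    "1,2019-01-05,10.50\n"  # exact duplicate row
    "2,unknown,N/A\n"       # unparseable date, sentinel value
    "3,,7.25\n"             # missing date
)

df = pd.read_csv(raw, na_values=["N/A"])
df = df.drop_duplicates()
# Unparseable dates become NaT instead of raising, so bad rows surface
# as missing values you can count and investigate.
df["signup_date"] = pd.to_datetime(df["signup_date"], errors="coerce")
df["revenue"] = pd.to_numeric(df["revenue"], errors="coerce")

print(len(df))              # 3 rows after deduplication
print(df["revenue"].sum())  # 17.75 -- the sentinel is treated as missing
```

None of this is glamorous, but deciding that "N/A" means missing rather than zero is exactly the judgment call that changes the answer.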

Another thing to consider: I've found big, boring companies are usually better to do DS at than small ones. Big, boring companies have better discipline in collecting and managing data. Also, a 1% improvement to an existing process matters a lot at BigCo, and very little at a startup - and a lot of DS models are that sort of incremental progress over rules engines or heuristics.


In my world, working on data at a BigCo (an industrial plant in my case), I'd say there are three schools of people:

1) 'The Old Guard', who are extremely skeptical. They tend to be extremely dismissive of models and predictions and distrust anything but the most basic analysis. If they can't do the analysis in an Excel spreadsheet, it's too complicated and "will never work". These people tend to be engineers (mechanical and chem types) and in Plant Operations roles. A lot of the time there is value in listening to their skepticism, but they tend to be extremely conservative by nature (Fortran ought to be enough for anyone...).

2) 'The Optimists', people who think "big data" and "machine learning" are the panacea to every problem in our org. To these people a prediction is as good as a real measurement - they trust forecasting implicitly. They have probably read an article somewhere about machine learning but don't really grasp any of the intricacies. These people tend to be in logistics/accounting/finance type roles, and a large part of my job tends to be spent on phone calls with them explaining why their forecasts did not match the actual results.

3) 'The KPI guy' - usually a manager who is somewhat out of their depth and wants to distill everything he can into a single number that can be displayed on a dashboard. The end result is a Dilbert-esque situation where the 'KPI guy' decides that to make his mark in the org he needs to come up with a new metric. You end up with the bizarre situation where people are discussing a 'super metric' made by combining other metrics into a single number. I also spend a lot of time on the phone with these guys, because they forget what underpins their super metrics and don't understand all the subtleties they've distilled out of the data by focusing so much on higher-level metrics. They get angry when you question the value of their dashboard. Whenever someone starts talking about "Yield", "OEE", or "DIFOT", there's a good chance they're a 'KPI guy'.

Most of my job is balancing out interactions between the three 'customers': tempering the Optimists' enthusiasm, reining in the KPI guys, and nudging the Old Guard.


This is so spot on about Data Science in the "enterprise" or "legacy" organisations (i.e. basically pre-dating the data hype).

Personally, I find getting stuff done with data in this environment more satisfying than using the latest neural network. I presume you're the same?


I have a thing I do which I like to call "Artificial Stupidity". That is, I like to take a naive implementation and see how far it can get me. Chances are I'll do it in perl. If I need to do some more serious statistics or visualisation with it I'll haul out R. I have not yet had the opportunity to bring some python into the mix, but that's 50% lack of opportunity and 50% that I've not yet found a normal general-purpose computing activity I can't do with perl.


Testify! QFFT.

There are other pathologies, too, but it's amazing how much worldview and the basic behavioral psychology elements of high/low trust and autonomy manifest themselves in what should be "objective" analytics projects.


One thing I think would freshen things up, and it's something I am going to try to push, is a data scientist 'study abroad' or 'exchange' program. Data scientists I know in one org mostly do time series, and I mostly do anomaly detection. Others do NLP, etc., and we would all like to work on all of that stuff, so why not exchange us around from time to time?

I think it would make things much more interesting


According to this, I'm a data scientist. I've done and do everything on that list except for "put python in production" and "Scaling sharing of Jupyter notebooks". I've put R in production (albeit not my code, I am responsible for making sure it runs correctly and surfacing errors to the system/developer of that code). I maintain a data lake, multiple SQL servers, deal with gobs of json, version control my SQL Schemas, vc our data types (admittedly they change quite rarely), etc. etc.

But I'm really just a developer who's good at databases and ETL, along with my regular tasks of writing near-realtime background processing systems, web APIs, SQL, etc.

I think the data science industry seems to have been massively overhyped, and now they want people who can use AI and statistical learning methods and all this other stuff I don't know to do plain old data engineer work.

A sad outcome for a discipline that once held so much promise.


> I'm really just a developer who's good at databases and ETL

On the other side of things, this might be rarer than you think.

My experience is that a lot of newish programmers have very little database experience. What they do have is often centered on Mongo or other non-relational stores, used more for persistent storage than as interactive entities. The ability to get info out of a SQL database is pretty standard, obviously. But handling aggregated or joined tables is not entirely standard. (Interviewing for an entry-level backend dev job at a major company, I was pretty startled to have the databases section cap out at 'group by' and 'join'.) And anticipating error sources (e.g. MySQL's rollup handling), reading and responding to 'explain' plans, or knowing about backend issues like InnoDB settings is well outside a lot of developers' familiarity.
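For what it's worth, the "caps out at group by and join" ceiling isn't hard to get past. A small sketch of the next rung up, using sqlite3 with hypothetical users/orders tables: an aggregate over a join, plus the habit of reading the query plan:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users  (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, amount REAL);
    INSERT INTO users  VALUES (1, 'alice'), (2, 'bob');
    INSERT INTO orders VALUES (1, 1, 10.0), (2, 1, 5.0), (3, 2, 2.5);
""")

# Aggregate over a join: per-user totals, keeping users with no orders.
query = """
    SELECT u.name, COALESCE(SUM(o.amount), 0) AS total
    FROM users u
    LEFT JOIN orders o ON o.user_id = u.id
    GROUP BY u.name
    ORDER BY total DESC
"""
print(conn.execute(query).fetchall())  # [('alice', 15.0), ('bob', 2.5)]

# Reading plans is the next habit; this is sqlite's version of EXPLAIN.
for row in conn.execute("EXPLAIN QUERY PLAN " + query):
    print(row)
```

The specific plan output varies by engine, but the reflex of checking it before blaming the database does not.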

I assume part of this is the heavy focus of bootcamps and some college programs on building web apps, and the optional status of databases classes in many college CS programs. But I could imagine a lot of other factors stopping people from picking it up elsewhere, like the changing divisions among DBA/SysAdmin/DevOps/SRE.

So on one end, a data science boom turned out a lot of people with advanced skills in a field with lots of simple work, and at the other there's a gap in developer knowledge which makes it convenient to hire highly-trained people and dump them into roles that are a mix of analyst and DBA work.


> The ability to get info out of a SQL database is pretty standard, obviously.

I wouldn't have a job if this were true. And on a related note, I've found software engineers to be generally poor at writing queries (compared to DBAs/Analysts/Data Scientists).


Not at all, actually.

Cathy O'Neil and Rachel Schutt's book, Doing Data Science (http://shop.oreilly.com/product/0636920028529.do) covers this almost immediately. Because the term developed sort of organically as a cross-discipline approach to solving certain challenging problems, students in their classes were often a mix of scientists, statisticians, and software/database engineers like yourself.

So, you may not consider yourself "a data scientist", and that's fine, but there's certainly a role for your specialization in data science, and that doesn't at all indicate trouble in the field. On the contrary, it's exciting that there's a marriage of these specializations underway.

If I had the time, I'd write a much more in-depth reply along the same lines to many of the criticisms in the article. The lack of a clear definition for "data science" or "data scientist" has caused some confusion, but at the same time, there is new technology available and new approaches to working with large-scale datasets that weren't available before, and that does represent new skillsets.


"Data science", as originally conceived, is something that requires an existing base of engineering infrastructure and scale (and decision-making culture) that's typically only found in a capital-T Tech Company. But 5 years of HBR think pieces and McKinsey studies later, execs all over think they can just hire a bunch of statisticians, put them on their own team far from engineers, and get intelligent software systems out of it.

They, in turn, realize their predicament and set out building their own such infrastructure to the best of their abilities. Sometimes they do okay and squeeze out real value, but they're never going to produce the cutting-edge AI and prescriptive analytics that execs think they will.


As originally conceived by whom?


I think they want data scientists to do plain-old-data-engineer work, but not just plain-old-data-engineer work. Getting and cleaning data is part of the job description, IMHO. You can't be a data scientist and be entirely reliant on others to do data cleaning, or you'll end up programming through Slack messages.


I worked in 3 DS roles over ~5 years, and recently made the "official" jump to SWE. I've also interviewed dozens of candidates for several openings during that time.

This post rings extremely true to my experience, and largely aligns with what I've been telling people for the last couple of years. I see so many bootcamp or Masters grads with a wildly skewed understanding of what the job entails. I also see a lot of MBA types diluting the meaning of the DS term as a whole.

A "data science" curriculum as such will basically prepare you only for an analyst role. You're not going to be able to compete with the glut of science PhDs flooding every open role, either. DS may be your title but you will not be doing any of the exciting things you want to be doing. To differentiate yourself you need to specialize, and good engineering skills are a prime way to do that.


MBA-type here, the way I have seen data science described in large organizations is much more like an analyst with a more robust set of tools than excel for deriving information/wisdom from data. Using the tools that other "data scientists" develop to solve business problems.

That's almost certainly diluting the term, but it's much closer to the work I do than the title might imply. Since 90% of business problems can be solved with regression (typically logistic) or decision trees, knowing what tools are appropriate to apply to a certain problem is more valuable than being able to actually write those tools. Bootcamps don't spend enough time on the why of what we do; I think it's just something you pick up through experience.


I think the data science role was always vague. Depending on the company it could mean Analyst, Data Engineer, Machine Learning Engineer, Machine Learning Researcher or a combination thereof.


Heh, it's like how automakers started describing any software position around crappy navigation as a "self-driving car" job.


Likewise, every hedge fund is quantitative, and puts “quant” or “quantitative” into as many job titles as it can.

All these trendy terms eventually devolve into noisy marketing to attract talent.


Yes, they are all 'AV' jobs now!

Even at the companies that used to be famous for just making tires.


>> ...in the past 2 years, % of any given project that involves ML: 15%, that involves moving, monitoring, and counting data to feed ML: 85%

As it should be. In order to have confidence in your ML you need to really understand your data and data processing.


Yes. The point I took away from this is that this is not at all a focus of most academic settings. This ends up creating a huge gap and leaving candidates with an academic DS background woefully unprepared and undesirable.


That seems strange to me. People on forums like this often describe Data Science practitioners as "statisticians that can code". If academic Data Science programs aren't emphasizing data engineering as part of their curriculum, what differentiates a Data Science program from statistics or business intelligence?


> If academic Data Science programs aren't emphasizing data engineering as part of their curriculum, what differentiates a Data Science program from statistics or business intelligence?

In my experience, they're emphasizing software-based data work like machine learning, but not the (vital) peripherals like cleaning/studying/loading data or monitoring and sanity-checking outputs.

A data science student might get a process-first task like making predictions from data using KNN, regressions, t-tests, or neural nets, choosing a method and optimizing based on performance. A statistics student might focus on theory, choosing an appropriate analysis method in advance based on the dataset, and reasoning about the effects of error instead of just trying to reduce it.

But the data scientist could still be training on a clean, wholly-theoretical dataset or a highly predictable online-training environment. The result is a lot of entry-level data scientists who are mechanically talented but stymied by real-world hurdles. Issues handling dirty or inconsistent data, for one. But there are a lot of others: a tendency to do analysis in a vacuum, without taking advantage of knowledge about the domain and data source; or judging output effectiveness based on training accuracy, without asking whether the dataset is (and will stay) well-matched to the actual task.

I don't mean that to sound dismissive; there are lots of people who do all of that well, even newly-trained. But it does seem to be a common gap in a lot of data science education.
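The training-accuracy trap mentioned above is easy to demonstrate. A hedged sketch with scikit-learn on synthetic data: a 1-nearest-neighbour classifier scores perfectly on the data it was fit on while telling you almost nothing about generalisation.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data where most features are noise, as in real extracts.
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)

# 1-NN memorises the training set: training accuracy is exactly 1.0.
print(model.score(X_train, y_train))
# Held-out accuracy is what actually matters, and it will be lower.
print(model.score(X_test, y_test))
```

The statistics-trained habit of asking "what does this error mean?" instead of "how do I push this number up?" is precisely the gap being described.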


4th year EE undergraduate student here, taking both "Data Analysis/Pattern Rec" and "Computer Vision" electives this term. My early courses prepared me more for a path focused in circuit design, but I jumped ship through exposure to wonderful, wonderful DSP. A lot of what I'm learning now is very new to me, so I appreciate comments like yours that give a sense of potential gaps in my learning. Thank you.

I'm currently working on an assignment for CV in which we extract Histogram of Oriented Gradient features from the CIFAR-10 dataset using python, then use them to train one of three classifiers (SVM, Gaussian Naive Bayes, Logistic Regression). I had asked about preprocessing, but was told it was outside the scope of this assignment, so we're just using the dataset as-is. :(

The nice bit is, I have a research internship coming up in a lab that will have me working on actual datasets, rather than toy examples. And, there's a data science club on campus that has an explicit focus on cleaning data which I plan on regularly attending. So... hopefully I'm on the right track!


Don't worry, when you have real problems you will have time to learn. Most of the time is not even data cleaning, but debugging, getting into the details of the data or code written by somebody else to understand why something is not working (and there's always something that's not working :) ). The main differentiator is whether you have interest / patience for that or not.


I'm not familiar with academic Data Science programs but I've worked with statisticians for over fifteen years and they are usually very involved on the data engineering side. If they aren't running the systems then they are working closely with those people to test and confirm inputs and outputs before running analyses.


> they are working closely with those people to test and confirm inputs and outputs before running analyses

In terms of data science training, at least, this is often a missing element. It's easy to create classroom tasks that focus on teaching how to do analyses and neglect practical aspects like validating data and sanity-checking results. People pick it up on the job, of course, but I wouldn't be surprised if statisticians get a better academic grounding from things like reasoning about uncertainty.

(It's not a problem specific to data science, either. I've heard plenty of complaints about new engineers who are so used to made-up problems that they don't balk at ludicrous data or results when they start doing real work.)


> It's not a problem specific to data science, either. I've heard plenty of complaints about new engineers who are so used to made-up problems that they don't balk at ludicrous data or results when they start doing real work.

This is one of the reasons I think we need to better integrate technology (and general data analysis follows the same reasoning) across the curriculum: an increasing share of work (and a more rapidly increasing share of good-paying work) is knowledge work that involves both data analysis and working with people who are doing automation, on top of work that is primarily automation or data analysis. But we don't teach other knowledge skills in relation to automation and analysis, which too often leaves people specialized in automation and analysis and people specialized in domain skills talking to each other over a wide gap, with a lot falling through the cracks.


The differentiator is breadth over depth. Not getting much into theoretical underpinnings.


Couple of notes.

Be prepared for most of your data scientist work to not be data science. Adjust your skillset for that.

Same in real science - for every minute you spend thinking about what nature might be doing, you spend tens of hours carrying things around, mixing things, checking things, repeating things, etc. This is how all real work is.

Most modern languages are procedural: Java, Python, Scala, R, Go, etc.

If someone has a friend who does Scala, can they read them this quote and film the reaction? Thanks.


Reading the quote in context:

> Isn’t SQL a programming language? It is, but it’s declarative. You specify the outputs you want (i.e. which columns from your table you want to pull), but not how those columns are actually returned to you. SQL abstracts a lot of what’s going on under the covers of a database.

You want a procedural language, one where you have to specify how and where the data is selected from. Most modern languages are procedural: Java, Python, Scala, R, Go, etc.

The author is trying to contrast fully Turing complete languages with a declarative domain specific language like SQL. (Yes, I know that some extensions provided by various database implementations make SQL Turing-complete.) Unfortunately, the word she chose to express this is already a term-of-art in the programming language world which means something different. Luckily, we're all charitable readers, so we can correct on the fly and understand what she meant.
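The contrast itself is easy to make concrete. A small sketch (sqlite3 in memory, hypothetical events table): the declarative query states what result is wanted, while the "procedural" version spells out how to compute it step by step.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (user_id INTEGER, clicks INTEGER);
    INSERT INTO events VALUES (1, 3), (2, 5), (1, 4);
""")

# Declarative: say WHAT you want; the engine decides how to compute it.
declarative = dict(conn.execute(
    "SELECT user_id, SUM(clicks) FROM events GROUP BY user_id"
).fetchall())

# Procedural: spell out HOW -- iterate, accumulate, index by key.
procedural = {}
for user_id, clicks in conn.execute("SELECT user_id, clicks FROM events"):
    procedural[user_id] = procedural.get(user_id, 0) + clicks

print(declarative == procedural)  # same result, different contract
```

Whether the host language is "procedural" or functional, the distinction the author is reaching for is that it makes the how explicit; SQL delegates it to the query planner.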


Oh, absolutely, the meaning is perfectly clear! I just want to see a Scala programmer cry.


As someone working in DS for the last 4 years this is pretty accurate.

If you have a good academic background it can be possible to enter a DS role immediately but often you will be doing work far more towards the Business Intelligence end of things rather than deploying Deep Neural Nets in production or whatever.

I have friends who transitioned into Data Engineering and it does seem like the outlook is better there.

It's an excellent post.


There's nothing wrong with the data science industry becoming different, as long as expectations are managed. Specifically, as this article notes, the probability of getting hired due to the increased competition, and the realities of the real-world job.

Both are currently not transparent enough for data science newbies, which is why on my end I try to be as transparent as possible whenever the topic comes up (I wrote a post similar to the OP last year: https://minimaxir.com/2018/10/data-science-protips/).


The market value (i.e. the big bucks) I think will shift into Data Engineering and the role that's abstractly called "Machine Learning Engineer".

Reliably getting any data science analysis or model running in a real world setting is a demand that's naturally going to follow from the Data Science glut.


When I started my first data-science role, the role description of my company sounded a bit like "software engineer who happens to know stats and ml". The description was fairly specific on the fact that data scientists would build and deploy models and services. Nowadays it seems not to fall under the software engineering umbrella. And I do think the change started with the deep learning craze. It distorted a lot in the field. Nowadays I see so many overfitting and complicated models that cannot be operated in production. But they sure make impressive slides and reports.


Totally agree. I've been consulting in the data world for some years now. Most companies want to do data science, but they have so much low-hanging fruit that it makes no sense to do any ML. If they actually manage to get a senior data scientist hired, then they typically torture them with boring BI dashboard creation.


BI dashboard creation is torture in the same sense that creating web applications might be torture for a software developer. In other words, if you find it to be torture you are probably in the wrong field. At the end of the day I've found myself much happier as an engineer when I am less focused on the tools I am using and instead the impact I can have on the business I am serving. If a BI dashboard provides that value, then that is fun because I am making an impact. Maybe more so than if I was using a complicated RNN somewhere when it provided little value.


I have been dispatching the same arguments for the last 3 years. Schools have all launched Data Science programs, flooding the market with statisticians reconverted into data scientists with basic programming skills and an even lighter notion of data engineering, DevOps tooling, and operational understanding. In 2015, our Big Data major was renamed Data Science, no matter that we are still teaching NoSQL, Hadoop, Spark... I've been careful never to take Adaltas down the DS road, not because we didn't like it but because of the hype around it and the market distortion it created. I tell my customers that we have Data Engineers who can excel in Machine Learning if needed, placing their models in streaming processing with Spark or Flink and pushing them into production with an understanding of operational constraints. Lately, we engaged a young Data Scientist consultant with the right resume to back it up; the first thing we did was put him on a four-month diet to teach him how to deploy and secure a platform as an InfraOps and how to write data ingestion as a Data Engineer.


> “In those early years [2012], there was no real formalized way to learn ‘data science’”

Yeah they were called quants (aka mathematics/statistics graduates).


I'm not a data scientist but I've worked with a few over the past 10 years and I strongly agree with this article that the work has changed a lot over that time.

The first generation of machine learning experts were proper scientists with proper Ph.D. degrees, academic track records, etc., who would typically be very opinionated about which algorithms to use (and quite possibly wrote a few of their own) but were not necessarily experienced engineers. I saw a lot of clumsy engineering and convoluted testing and evaluation processes.

This explains a lot about the current state of the art which involves a lot of tools that are aimed at people who are not primarily engineers and need to be shielded from complex infrastructure and code but do know a lot about statistics, machine learning algorithms, and all the stuff that first generation machine learning experts would know.

The second generation of machine learning experts is basically riding an ongoing commoditization boom. They use toolkits from Google, Facebook and others pretty much as is. These tools are easy to use for them but not necessarily for non expert engineers that know a lot about pumping data around but not necessarily about machine learning algorithms. This is getting a lot easier. I've heard of high school kids getting ML jobs with no college training whatsoever and just high school math and a bit of online training. My impression is that you can get nice results with a little effort.

The next generation of machine learning engineers won't be scientists, and they'll indeed mostly work on manipulating data. All the machine learning algorithms will be provided in the form of black-box libraries and tools that will mostly work in a fully automated mode. IMHO the whole point of deep learning is that the algorithms figure things out by themselves. Even the job of picking the right algorithms and configuring them is ultimately going to be something that machine learning algorithms will be better at than a junior engineer with no relevant scientific background.

Or indeed better at it than an experienced software engineer with a classic computer science background, like myself. I have no clue what e.g. a tensor is. Articles on the topic seem to be very math heavy and tend to give me headaches. But should I even have to care, to be able to configure some black boxes that process data and produce models that I can plug into my runtime? My pet theory is that we're already past that point and that lots of companies are getting decent results without having to care about the underlying algorithms already.

I went to a great meetup at SoundCloud last week about how they used off-the-shelf machine learning tooling to improve their search ranking in Elasticsearch. It was all about the training data, the parameters in the search query that they wanted to machine-learn, their tooling for evaluating model performance in terms of being able to rank real queries against real data, tooling for annotating training data, integrating models with their software, the devops for retraining the models, etc.

My experience working with the machine learning search group at Nokia Maps (now Here) eight years ago was that the tools were an obstacle to getting results fast and that iterations on model improvements were measured in months. A lot of engineering went into things like feature extraction, model tuning, and other stuff that scientists do, as well as building essentially all of the tools from the ground up so that models could actually be generated, evaluated, and integrated. Only problem: many of these people weren't experienced engineers, so the tools were kind of clunky and there were lots of integration headaches, insanely long integration cycles, and lots of missed opportunities to fix (rather obvious) data problems due to a bias towards endless tweaking of algorithms instead of applying pragmatic fixes to the data. It kind of worked and the search wasn't horrible, but the biggest problem was that the underlying data wasn't great to begin with (mis-categorized, full of duplicates, incomplete/stale, etc.).

The people at SoundCloud got it down to iterating in hours, with a few months of engineering. That's from idea to proof of concept to having code in production that outperformed a manually crafted query.

That sounds like something I could do but it also sounds like a greenfield for proper tools to emerge that make all of this a lot less painful than it currently is. The next generation hopefully won't have to build a lot of in house tooling and reinvent a lot of wheels while doing so.


> articles on the topic seem to be very math heavy and tend to give me headaches

Of course. Academic papers (and a disturbingly large number of Wikipedia pages) are not meant to explain things, they’re meant to emphasise just how smart the authors are.

> I have no clue what e.g. a tensor is.

Well, even Einstein struggled with Tensors. In the context of TensorFlow they’re just multi-dimensional arrays.
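That "just multi-dimensional arrays" framing can be made concrete in a few lines of NumPy (the shapes here are arbitrary, chosen for illustration):

```python
import numpy as np

scalar = np.array(5.0)              # rank-0 tensor: a single number
vector = np.array([1.0, 2.0, 3.0])  # rank-1 tensor: shape (3,)
matrix = np.ones((2, 3))            # rank-2 tensor: shape (2, 3)
# rank-4 tensor: the shape that actually flows through an image model,
# i.e. (batch, height, width, channels)
batch = np.zeros((32, 28, 28, 1))

for t in (scalar, vector, matrix, batch):
    print(t.ndim, t.shape)
```

The mathematical definition carries more structure (transformation rules under change of basis), but for using a framework like TensorFlow, "n-dimensional array with a shape" is the working mental model.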


Yeah - tons of traditional analyst jobs (Business Intelligence / Analysis, Marketing analyst, etc.) have been re-labeled as Data Science.

I'd be amazed if even 10% of the people are able to do anything more than just import scikit-learn, and train a classifier through tutorials.

This is IMO no different than when the software dev craze started, and people with 3 weeks of coding experience started applying for entry-level jobs. You start interviewing them, and they can't even explain the difference between a for and a while loop.

In the end, there's just more noise, and both qualified candidates and employers need to find a good way to cut through it.


fast.ai youtube lesson view numbers:

1. Lesson: 355k

2. Lesson: 144k

7. Lesson: 34k

Surprisingly close to those 7%.


This is a pretty honest and acute description of the industry landscape, and of where it's heading.

I think DS has been abused by some people as an umbrella to avoid producing quality code, yet somehow they place themselves higher in the value chain.

However, I do see a real position for DS in the industry, but it should be a specialization for senior SDEs when they decide to further their career, not its own job family. Otherwise it should be renamed data analyst for clarity.


> I think DS has been abused by some people as an umbrella to avoid producing quality code, yet somehow they place themselves higher in the value chain.

Hit the nail on the head here. I worked in a DevOps/ETL team across from the data science team; all they did was write SELECT * FROM sales, complain Teradata was slow, and when they got the result set they'd use R to SUM the column and display it with ggplot.
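The fix for that anti-pattern is a one-liner: push the aggregation into the database instead of shipping the whole table to the client. A sketch with sqlite3 and a hypothetical sales table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, amount REAL);
    INSERT INTO sales VALUES
        ('north', 100.0), ('north', 50.0), ('south', 75.0);
""")

# The anti-pattern: pull every row across the wire, then sum client-side.
rows = conn.execute("SELECT * FROM sales").fetchall()
client_side = sum(amount for _, amount in rows)

# Let the database do it: one row comes back instead of the whole table.
(db_side,) = conn.execute("SELECT SUM(amount) FROM sales").fetchone()

print(client_side, db_side)  # identical results
```

On a three-row toy table it makes no difference; on a Teradata warehouse, the SELECT * is why it felt slow.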


I loved the tone of this article because it's fairly relevant, and with a small facelift, could have been advice to web developers circa the early 2000s.

Data science is still a thing, and it's maturing the way applied sciences do when they reach the point of needing a little more engineering background. Tech just is never that glamorous, but the dirty secret is that only people in tech seem to really get that, so we have this hype cycle every few years.



