In "What barriers are faced at work?", I really wish they broke down the "dirty data" response into more categories. In particular, I'd love to know if people are dealing with data quality issues, feature engineering issues, or something else all together.
In my opinion, this is representative of the problems with data science tools today. There is so much focus on the machine learning algorithms and comparatively little on getting data ready for them. While there is a question that lets respondents pick which of 15 different modeling algorithms they use, there's nothing that asks what technologies people use to deal with "dirty data", which is agreed to be the biggest challenge for data scientists. I think the more formal study of data preparation and feature engineering is too frequently ignored in the industry.
Completely agree. Data quality issues were a big part of our motivation for Kaggle Datasets (an open data platform where the quality of the dataset improves as more people use it) and Kaggle Kernels (a reproducible data science workbench that combines versioned data, code, and compute environments to create reproducible results).
Two examples of this: Kaggle Datasets supports wiki-like editing of metadata (file and column descriptions) and makes it easy to see, fork, and build on all the analytics created on the data so far.
We're just getting started with each of these products: we want Kaggle Datasets to support a fully collaborative model for working with all your data in the future, and Kaggle Kernels to support every analytics and machine learning use case.
Of course everyone agrees that "cleaning data" is difficult and boring, and it's always mentioned, but what I don't really understand is what kind of tools people expect for this beyond what is already available. E.g. pandas is pretty good at merging tables, re-ordering, finding duplicates, filling or dropping unknowns, etc. There are also tools for visualizing large amounts of data, looking for outliers, etc. Beyond the basic tools, it seems to me that each dataset requires decisions to be made that can't be automated (e.g. do I drop or fill the unknowns?). I don't see how this could be improved, as every decision has a solid, semantic implication related to whatever the overarching research question is.
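To be concrete, here is the kind of "basics" I mean, as a rough pandas sketch (toy data, made-up column names):

import pandas as pd

orders = pd.DataFrame({"customer_id": [1, 1, 2, 3, 3],
                       "amount": [10.0, 10.0, 25.5, None, 7.0]})
customers = pd.DataFrame({"customer_id": [1, 2, 3],
                          "region": ["north", "south", None]})

merged = orders.merge(customers, on="customer_id", how="left")  # merging tables
deduped = merged.drop_duplicates()                              # finding/removing duplicates
filled = deduped.fillna({"region": "unknown"})                  # filling unknowns...
kept = filled.dropna(subset=["amount"])                         # ...or dropping them
print(kept.sort_values("amount"))                               # re-ordering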
So statements like "getting data ready for the algorithms" seem kind of meaningless to me, in the sense of general methodologies. How could you possibly "get the data ready" without considering what it is, how it will be used, etc.? How can it possibly be generalized to anything beyond the specific requirements of each problem instance?
I'm just really curious what you are imagining when you say that better tools are needed here.
I am the lead contributor to a Python library called Featuretools[0]. It is intended to perform automated feature engineering on time-varying and multi-table datasets. We see it as bridging a gap between pandas and machine learning libraries like scikit-learn. It doesn't necessarily handle data cleaning, but it does help get raw datasets ready for machine learning algorithms.
We have actually used it to compete on Kaggle with success, without having to perform manual feature engineering.
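For anyone curious, a minimal sketch of how it's used on a two-table dataset (toy data; the exact API may differ between versions):

import pandas as pd
import featuretools as ft

customers = pd.DataFrame({"customer_id": [1, 2],
                          "join_date": pd.to_datetime(["2017-01-01", "2017-02-01"])})
transactions = pd.DataFrame({"transaction_id": [1, 2, 3],
                             "customer_id": [1, 1, 2],
                             "amount": [10.0, 20.0, 5.0],
                             "transaction_time": pd.to_datetime(["2017-03-01", "2017-03-02", "2017-03-03"])})

es = ft.EntitySet(id="retail")
es = es.entity_from_dataframe(entity_id="customers", dataframe=customers, index="customer_id")
es = es.entity_from_dataframe(entity_id="transactions", dataframe=transactions,
                              index="transaction_id", time_index="transaction_time")
es = es.add_relationship(ft.Relationship(es["customers"]["customer_id"],
                                         es["transactions"]["customer_id"]))

# Deep Feature Synthesis: automatically builds aggregation/transform features per customer
feature_matrix, feature_defs = ft.dfs(entityset=es, target_entity="customers")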
I'm starting to build up various utilities to help with this kind of thing, but I fully agree. The decisions require understanding the business requirements (do I use source X or Y for field 1, what errors are OK, what types of error are worst, etc.), but the process of finding some of these issues could be better.
One simple example is missing data. Missing data is rarely a null; I've seen (on one field, in one dataset):
N/A
NA
" "
Blank # literally the string "Blank"
NULL # Again, the string
No data
No! Data
No data was entered for this field
No data is known
The data is not known
There is no data
And many, many more. None can be clearly identified automatically, but some processes are useful, like:
Pull out the most common items, manually mark some as "equivalent to blank" and remove.
Identify common substrings with known text (N/A, NULL, etc) and bring up those examples.
I'd like to extend these with more clustering and analysis to bring out issues that are common in general but rare in any specific form. There are lots of similar things with encodings, etc. too.
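Something like this, as a rough pandas sketch of those two processes (reusing values from the list above):

import pandas as pd

raw = pd.Series(["42", "N/A", "NA", " ", "Blank", "NULL", "No data",
                 "There is no data", "17", "No data was entered for this field"])
values = raw.astype(str).str.strip()

# 1. Pull out the most common items for manual review
print(values.value_counts().head(50))

# 2. Flag values containing known "no data" substrings and bring up those examples
markers = ["n/a", "null", "blank", "no data", "not known"]
lowered = values.str.lower()
looks_missing = lowered.isin(["", "na", "none"]) | lowered.apply(
    lambda v: any(m in v for m in markers))
print(values[looks_missing].unique())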
Other things that might be good are clearer ways for me to supply general conditions I expect to hold true, and then have the most egregious violations brought to my attention so I can either clear them out or deal with them in some way. A good way of recording issues that have already been analysed and found to be OK would be great too.
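A rough sketch of what I mean (the column names, conditions, and accepted-issue list are all made up):

import pandas as pd

df = pd.DataFrame({"age": [34, -2, 51, 240, 29],
                   "order_total": [10.0, 99.0, -5.0, 42.0, 13.0]})

# General conditions I expect to hold true
checks = {"age is between 0 and 120": df["age"].between(0, 120),
          "order_total is non-negative": df["order_total"] >= 0}

# Issues already analysed and found to be OK (e.g. row 2 is a known refund)
accepted = {"order_total is non-negative": {2}}

for name, ok in checks.items():
    bad = set(df.index[~ok]) - accepted.get(name, set())
    if bad:
        print(f"{name}: {len(bad)} new violations, e.g. rows {sorted(bad)[:5]}")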
Yes, completely agree that each dataset requires decisions to be made that can't be automated, but there are huge opportunities for tools to assist users in understanding what cleaning decisions they might want to make and how those decisions affect the data. Most data cleaning tools do a very poor job of helping the user visualize and understand the impact cleaning has on data - they're usually very low level (such as pandas).
As an example of a tool: Trifacta (disclaimer I work here) https://www.trifacta.com/products/wrangler/. We're trying to improve data cleaning with features such as suggesting transforms the user might want, integrating data profiling through all stages to discover and understand, and transform previews so the user can understand the impact.
I think there's a huge opportunity for better tools in the problem space.
That's precisely the problem with Kaggle: the data is mostly cleaned for you. That cleaning is most of the job of a DS in industry. Cleaning your data improves performance way more than working hard on optimizing your ML algo.
> There is so much focus on the machine learning algorithms rather than getting data ready for the algorithms.
Generally, once a problem at work has come to the point of being a "kaggle problem", it's trivially easy. The main problem is unstructured data, with countless ways of specifying similar measurements of the same attribute, and lots of leeway to build an unmaintainable data pipeline between the data generation process and the model at the end.
I disagree that a "kaggle problem" style problem is trivially easy, but I strongly agree with the sentiment that dealing with unstructured data is often a much bigger, deeper, and broader problem than the choice of a particular algorithm or ensemble of them.
The ability to efficiently and effectively derive insights from such data is scarce.
Right, by "kaggle problem" I mean the general case where we roughly know what we're going to want to have on the right hand side of the model we're going to run (plus or minus some feature engineering, model choice and other hyperparameter specification, etc.)
Dirty data is not as much of a problem for me as human-biased data. Dirty data engineering, like modeling, will soon be largely automated.
Let's say you are predicting store sales. You create a feature that holds the store sales from one year back. The feature works really well and you are happy with your evaluation. But you have captured bias: the previous model the store used was "predict today's sales by looking at last year's sales". Store managers fitted their sales tactics to that model (when the model predicted sales that were too high, the store managers did their best to get rid of the surplus inventory, for instance by adding discounts or moving the products to a more prominent spot in the store).
So in the end you end up with a model that evaluates well, but you have actually (over)fitted to previous policies/models. You have not created the best possible sales predictor. How do you ever find this out without a costly, intimate deep dive into the data and the data generation processes?
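To be clear, building the lagged feature itself is trivial (made-up data below, pandas), which is exactly why it's so easy to bake this kind of bias into a model without noticing:

import pandas as pd

sales = pd.DataFrame({"store": ["A", "A", "A", "B", "B", "B"],
                      "week": [1, 2, 3, 1, 2, 3],
                      "sales": [100, 120, 90, 40, 55, 60]})

# "Sales one period back" per store; the bias lives in what this value already encodes
# (the previous model's predictions plus the store managers' reactions to them)
sales = sales.sort_values(["store", "week"])
sales["sales_lagged"] = sales.groupby("store")["sales"].shift(1)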
>Dirty data engineering, like modeling, will soon be largely automated.
I don't agree. For every modern tech company that collects data that lends itself to automated data cleaning, there's a 40+ year old company that defined what data would be collected in 1990, designed an "automated system" in 1995, and has been shoehorning improvements onto that system ever since.
At my last job I was given access to a database with 150+ tables with no data dictionary. The person who wrote the load process and ETL (the output was a lot of summaries) had left 10 years before and nobody truly understood how anything actually worked or the downstream dependencies. It took me a week of digging just to find out which of those 150 tables were just temp tables for one of the many queries that executed on that system.
It's going to be a while before somebody figures out how to clean that data automatically, or even find issues in that data. That's the reality of the world of data for many organizations.
It seems to me you were given three jobs: database admin, data engineer, and data scientist.
When I am talking about automated data cleaning, I am talking more about preprocessing text, dealing with missing variables, discarding duplicates, removing noisy/uninformative variables and outliers, correcting spelling, and generating feature interactions and transformations. All of these can be (and are being) largely automated. [1] [2]
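For example, a rough scikit-learn sketch of the kind of steps I mean (made-up data; deduplication and spelling correction need other tooling, and the exact class locations vary by sklearn version):

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.feature_selection import VarianceThreshold

X = np.array([[1.0, 2.0, 5.0],
              [np.nan, 3.0, 5.0],
              [2.0, 2.5, 5.0]])

pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),   # missing values
    ("drop_constant", VarianceThreshold()),         # uninformative variables
    ("interactions", PolynomialFeatures(degree=2, interaction_only=True)),  # feature interactions
    ("scale", StandardScaler()),                    # transformations
])
X_clean = pipeline.fit_transform(X)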
A data lake with 150+ undocumented tables is garbage in, garbage out, both for machines and humans. I'd almost label that as the barrier "Data not available", not "Dirty data". While this is a reality for some companies, such a company really needs a DB admin or data engineer, not to shoehorn an (expensive) data scientist into those roles.
If I understand you correctly, the way you'd address this is by using counterfactuals. See this course[1] for an overview and this paper[2] which talks about the bias problem in the context of movie recommendations.
Yes, counterfactual inference is relevant to this. But it is not so much about answering "what would have happened if?"; it is more about control theory and feedback loops: your model is never a static function, but a node inside a giant recursive net composed of other models and humans.
Another example (this time on the output-end): You build a model to route emails to sets of experts inside an organization. Your proxy loss is multi-class logistic loss on topic classes. You are interested in improving response times (which you can more or less measure in aggregate) and quality of response (which is harder to measure, if at all).
You build a first iteration of the model and response times improve. Then you create new features and modeling techniques and you improve logistic loss, but when you deploy this model, response times get much worse. What happened? Maybe the experts fitted/adapted to the model output: they learned how to quickly answer a specific type of email because it kept getting routed to them. The new model does better at matching topics to emails, resulting in those emails now being sent to another expert. While this expert may in the long term become better at answering emails closer to his/her topic expertise, in a faster and more informative manner, in the short term he/she will be slower and of lower quality, as they need to adapt to the new types of emails they are getting and lack the priors for dealing with ambiguous emails.
Both on the input and the output of models there are all sorts of these nasty human-feedback loops that are very hard to even identify and harder to solve.
1. More people learn data science and ML from MOOCs than from university courses.
2. Tensorflow is the tech people most want to learn in the next year.
3. 40% of people surveyed spend >1-2 hours per week searching for another job. Surprising, given that all companies complain about the difficulty of finding data scientists/machine learners.
>40% of people surveyed spend >1-2 hours per week searching for another job. Surprising, given that all companies complain about the difficulty of finding data scientists/machine learners.
I've hired data scientists in the past. One thing I found is that a lot of interviewees want to talk about all the algorithms (e.g. gradient boosting) they've used but are not able to describe how they thought through the problems before they applied the algorithms. It's easier to find somebody who downloaded some mostly clean data and copy/pasted some code than a person who actually thinks through the quantification of a problem. There are a lot of buzzword artists out there.
This is important because in a lot of organizations the business problems have not yet been quantified in a way that lends itself to getting meaningful and valid results from an algorithm. The Data Scientist has to be able to work with others to quantify a problem. Or at a minimum, recognize that there are issues with the current way the problem is quantified and think of ways to improve it. It's much easier to teach somebody to run a data algorithm than it is to actually understand a business problem.
There are issues with people doing the hiring as well. In my last job (not a software company), the VP of the group had pushed to get headcount for a data science team and was fearful of making the wrong hire because he didn't want to say "We hired a data scientist at 2X-3X the cost of a Business Analyst and that was a bad hire." The end result was a massive amount of paralysis, an insanely long and convoluted job description, and complaints about the hiring pipeline.
I agree with your assessment that a lot of the time the business problems have not been put into a form that lends itself to exploitation by machine learning. Sometimes a company has a lot of data that's actually useless.
Most of the time, I've found that business people do not understand the value of data. Often I hear, "we have this data set, let's unleash the data scientist on it to tell us something" or "we have this data set, but what are the so-whats here?".
I spend a lot of my time explaining that there must first be a business objective, a key question, or hypothesis that can then be understood through data. I cannot take a haystack and find the needle that is interesting to you. And if I do find that needle, many times there are no resulting changes made to our strategy.
I think we're still in a place where the value in a data scientist is not that she knows how to write:
fit <- lm(target ~ ., data = customers)
The value exists when she can take a problem from the business, understand how to find a solution with data, and then convey that back to the business in a meaningful way that allows them to easily understand how they can make changes to positively impact the bottom line.
>I spend a lot of my time explaining that there must first be a business objective, a key question, or hypothesis that can then be understood through data. I cannot take a haystack and find the needle that is interesting to you. And if I do find that needle, many times there are no resulting changes made to our strategy.
IMO a number of data science positions should be considered partly research positions. You are hiring somebody to think critically about how to generate high value/impact from data. This includes exploring whether there is a different way to think about a business problem than how it has been formulated in the past. It may include defining and collecting data when you discover the existing (or non-existent) data isn't appropriate. As with any research, you'll sometimes realize the path you are on is wrong and a correction is needed.
The "find all the needles in this haystack" is a totally different worldview and throws a lot of critical thinking out the window. I think this really plays into the idea that an organization can hire a person who is going to do immediate "magic" with algorithms and zero effort beyond that. You can slice/dice and p-hack your way into a million thoughtless and useless "insights."
The organizations I have seen that do best at this have teams of data scientists collaborating with devs/engineers and business analysts... there need to be a lot of different research activities going on, most of them working off the same data/compute infrastructure, but with some people dissatisfied and pushing the edge, of course. Also, regarding hiring pipelines, I would discourage hiring based on technology keywords, as anyone who is a good fit should be intelligent and curious enough to pick up their new employer's tech stack relatively quickly.
>3. 40% of people surveyed spend >1-2 hours per week searching for another job. Surprising, given that all companies complain about the difficulty of finding data scientists/machine learners.
Not too surprised about this. I think a lot of people who go into the field want to do more interesting things than what they find being used in the field. I don't think pay is necessarily the gap here, as others have pointed out, so much as interesting work (or at least the intersection of interesting work and pay).
All of these things, coupled with the data on top complaints, make a lot of sense if one views the items you list as identifying a set of people who are mostly self-taught, have a perhaps insufficiently broad and deep quantitative background, and are looking to merely catch the "data science wave".
Yeah. The median salary for a machine learning engineer is definitely higher than what most companies are used to paying (even for software eng roles).
My argument is that machine learning is also higher leverage than most roles. Think of an algorithm to predict loan defaults or customer churn for a bank: in the hands of a great machine learner, that one algorithm can generate a huge ROI.
Anyone else find it weird that when you click "other" for gender, the data looks more like garbage?
I was trying to actually compare male and female salaries out of interest, but have a hard time believing so many people earn <$20k/yr, even when you switch the filters around. The best I could find was just filtering for the US, but the number of respondents is so low, ~1k total (~200 females, ~800 males), that it becomes difficult to make accurate comparisons ($22k difference, but women had more master's degrees and a similar share of PhDs, by percentage).
Has anyone sorted through this data and tried to account for these factors? I'd be interested in the uncertainty and in how the information was gathered.
I pulled a few gender stats here. http://bit.ly/2zjrSJD
Accounting for country, education, and industry really reduces the population you're sampling from, but those deviations are huge. You need to account for industry especially.
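i.e. something along these lines (the file and column names here are hypothetical, not the survey's actual schema):

import pandas as pd

df = pd.read_csv("survey_responses.csv")  # hypothetical export of the survey data

subset = df[(df["country"] == "United States") &
            (df["education"] == "Master's degree") &
            (df["industry"] == "Technology")]
print(subset.groupby("gender")["salary"].agg(["count", "median"]))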
Well, this really doesn't address the error associated with the data, which is what I was trying to get at. There seems to be a lot of it, which makes accurate predictions difficult to make.
Next it'd be interesting to see Python 2 vs. Python 3. My own experience tells me that the majority of top Kagglers still use Python 2, despite Kaggle Kernels being Python 3 exclusively.
I'm also quite amazed by the predominant use of logistic regression. I wonder if that is less about interpretability / ease of engineering, and more about the barriers data scientists face when using more complex methods: lack of data science talent, lack of management support, results not used by decision makers, limitations of tools.
If Kaggle results are anything to go by, every business that cares about getting the best performance on structured data should be using some form of gradient boosting.
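E.g. the kind of head-to-head that's easy to run with sklearn (toy data here, not a real benchmark, so the gap may well be negligible):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Compare a logistic regression baseline against gradient boosting with 5-fold CV
for model in (LogisticRegression(max_iter=5000),
              GradientBoostingClassifier(n_estimators=200, learning_rate=0.05)):
    score = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    print(type(model).__name__, round(score, 3))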
With the rise of Tensorflow and sklearn, the strong Python showing makes sense.
However, I wish Python had a solid IDE for interactive work like RStudio. Jupyter notebooks are fine but being able to easily inspect variables is super convenient.
Spyder doesn't cut it. Y-hat's Rodeo was still a bit buggy last time I tried it. Any other suggestions?
Interesting that the most common models being used are the simpler ones, logistic regression and decision trees. This is despite all the hype for the more complicated techniques like neural nets and GBMs. Is it just because these models are faster to train and easier to interpret or something else?
in my experience, doing deep learning is a lot harder than building simpler ml models. training times are killer, you need lots of data, overfitting is a challenge, results are hard to interpret, and lots of things can go wrong. deep learning is the future from a mathematical standpoint (with neural nets you can essentially learn arbitrary functions in some Borel space or something, whereas simpler ml models are basically a special case of deep learning) but it's definitely harder.