Dirty data is not as much of a problem for me as human-biased data. Dirty data engineering, like modeling, will soon be largely automated.
Let's say you are predicting store sales. You create a feature that holds the store's sales from one year back. The feature works really well and you are happy with your evaluation. But you have captured bias: the previous model the store used was "predict today's sales by looking at last year's sales". Store managers fitted their sales tactics to that model (when the model predicted sales too high, the managers did their best to get rid of the surplus inventory, for instance by adding discounts or moving the products to a more prominent spot in the store).
So in the end you have a model that evaluates well, but you have actually (over)fitted to previous policies/models. You have not created the best possible sales predictor. How would you ever find this out without a costly, intimate deep dive into the data and the data-generation process?
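To make the failure mode concrete, here is a minimal simulation sketch (all numbers invented): managers act to close part of the gap between the old forecast ("last year's sales") and true demand, so the lag feature ends up correlating with observed sales better than with the demand you actually want to predict.

    # Minimal sketch with hypothetical numbers: managers close part of the gap
    # between the old forecast ("last year's sales") and true demand, so the
    # lag feature correlates with observed sales better than with demand.
    import numpy as np

    rng = np.random.default_rng(0)
    weeks = 104
    true_demand = 100 + 10 * np.sin(np.arange(weeks) * 2 * np.pi / 52)  # yearly cycle

    sales = np.empty(weeks)
    sales[:52] = true_demand[:52] + rng.normal(0, 10, 52)   # first year: no old model yet
    for t in range(52, weeks):
        old_forecast = sales[t - 52]           # the store's old model: last year's sales
        gap = old_forecast - true_demand[t]    # predicted surplus or shortfall
        # assumption: discounts/promotions close ~80% of that gap
        sales[t] = true_demand[t] + 0.8 * gap + rng.normal(0, 10)

    lag_feature = sales[:52]   # "sales one year back" for the second year
    observed = sales[52:]
    print("corr(lag feature, observed sales):", np.corrcoef(lag_feature, observed)[0, 1])
    print("corr(lag feature, true demand):   ", np.corrcoef(lag_feature, true_demand[52:])[0, 1])
    # The first number looks great, but part of it is the managers reacting to
    # the old forecast, not genuine predictive power over demand.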
>Dirty data engineering, like modeling, will soon be largely automated.
I don't agree. For every modern tech company that collects data that lends itself to automated cleaning, there's a 40+ year old company that defined what data to collect in 1990, designed an "automated system" in 1995, and has been shoehorning improvements onto that system ever since.
At my last job I was given access to a database with 150+ tables and no data dictionary. The person who wrote the load process and ETL (the output was a lot of summaries) had left 10 years before, and nobody truly understood how anything actually worked or what the downstream dependencies were. It took me a week of digging just to find out which of those 150 tables were just temp tables for one of the many queries that executed on that system.
It's going to be a while before somebody figures out how to clean that data automatically, or even find issues in that data. That's the reality of the world of data for many organizations.
It seems to me you were given three jobs: database admin, data engineer, and data scientist.
When I talk about automated data cleaning, I mean things like preprocessing text, dealing with missing values, discarding duplicates, removing noisy/uninformative variables and outliers, spelling correction, and feature interactions and transformations. All of these can be (and are being) largely automated. [1] [2]
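As a rough illustration of what I mean, here is a sketch of that kind of automatable cleaning; the column names and thresholds are hypothetical, it just shows the shape of such a step, not a reference implementation:

    # Rough sketch of automatable cleaning (hypothetical columns/thresholds):
    # duplicates, uninformative columns, crude outlier removal, then imputation,
    # scaling and encoding wrapped in a reusable preprocessing pipeline.
    import numpy as np
    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.impute import SimpleImputer
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OneHotEncoder, StandardScaler

    def auto_clean(df: pd.DataFrame) -> pd.DataFrame:
        df = df.drop_duplicates()
        # drop uninformative columns: constant, or almost entirely missing
        keep = [c for c in df.columns
                if df[c].nunique(dropna=True) > 1 and df[c].isna().mean() < 0.95]
        df = df[keep]
        # crude outlier removal on numeric columns via z-score
        num = df.select_dtypes(include=np.number)
        z = (num - num.mean()) / num.std(ddof=0)
        return df[((z.abs() < 4) | z.isna()).all(axis=1)]

    numeric = ["sales", "num_customers"]      # hypothetical column names
    categorical = ["store_type", "region"]
    preprocess = ColumnTransformer([
        ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                          ("scale", StandardScaler())]), numeric),
        ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                          ("onehot", OneHotEncoder(handle_unknown="ignore"))]), categorical),
    ])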
A data lake with 150+ undocumented tables is garbage in, garbage out, both for machines and for humans. I'd almost label that barrier "data not available" rather than "dirty data". While this is a reality for some companies, such a company really needs a DB admin or a data engineer, rather than trying to shoehorn an (expensive) data scientist into those roles.
If I understand you correctly, the way you'd address this is by using counterfactuals. See this course[1] for an overview and this paper[2] which talks about the bias problem in the context of movie recommendations.
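For readers who haven't seen that material: one common trick in this literature is off-policy (counterfactual) evaluation, estimating how a new policy would have performed using only data logged under the old one, by reweighting with the logging propensities. A toy sketch with made-up numbers:

    # Toy sketch of off-policy (counterfactual) evaluation via inverse
    # propensity scoring: estimate a new recommender's value from data logged
    # under the old one, without deploying it. All numbers are synthetic.
    import numpy as np

    rng = np.random.default_rng(1)
    n, n_items = 10_000, 5

    # logging policy: probabilities with which the old system showed each item
    logging_probs = np.array([0.4, 0.3, 0.15, 0.1, 0.05])
    shown = rng.choice(n_items, size=n, p=logging_probs)

    # true (unknown) click rates per item; clicks observed only for shown items
    true_ctr = np.array([0.02, 0.05, 0.10, 0.15, 0.20])
    clicks = rng.random(n) < true_ctr[shown]

    # new policy we want to evaluate offline: prefers the rarer, better items
    new_probs = np.array([0.05, 0.10, 0.15, 0.30, 0.40])

    # IPS estimate: reweight logged rewards by new_prob / logging_prob
    weights = new_probs[shown] / logging_probs[shown]
    print("IPS estimate of new policy's CTR:", np.mean(weights * clicks))
    print("True value of new policy:        ", np.dot(new_probs, true_ctr))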
Yes, counterfactual inference is relevant here. But it is not so much about answering "what would have happened if?" as about control theory and feedback loops: your model is never a static function, but a node inside a giant recursive net composed of other models and humans.
Another example (this time on the output end): you build a model to route emails to sets of experts inside an organization. Your proxy loss is multi-class logistic loss on topic classes. What you actually care about is improving response times (which you can more or less measure in aggregate) and response quality (which is harder to measure, if it can be measured at all).
You build a first iteration of the model and response times improve. Then you add new features and modeling techniques and improve the logistic loss, but when you deploy this model, response times get much worse. What happened? Maybe the experts adapted to the old model's output: they learned to answer a specific type of email quickly because it kept getting routed to them. The new model matches topics to emails better, so those emails are now sent to a different expert. While that expert may, in the long term, become better at answering emails closer to their topic expertise, faster and more informatively, in the short term they will be slower and lower-quality, as they need to adapt to the new types of emails they are getting and lack the priors for dealing with ambiguous ones.
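A toy simulation of that dynamic (all numbers invented): experts are fast on the topics they have been seeing, the "better" router reshuffles the topic-to-expert assignment, and aggregate response time jumps until familiarity builds back up.

    # Toy simulation (invented numbers): experts are fast on the topics they
    # have been getting; a reshuffled routing, better on the proxy loss, resets
    # that familiarity and response times jump before slowly recovering.
    import numpy as np

    n_topics = n_experts = 5
    # familiarity[e, t]: how practiced expert e is with topic t
    familiarity = np.eye(n_experts) * 5.0   # experts adapted to the old routing

    def response_time(expert, topic):
        # assumption: more familiarity -> faster answers, with a 1-hour floor
        return 1.0 + 8.0 / (1.0 + familiarity[expert, topic])

    def simulate(routing, weeks=10):
        weekly_means = []
        for _ in range(weeks):
            times = []
            for topic in range(n_topics):
                expert = routing[topic]
                times.append(response_time(expert, topic))
                familiarity[expert, topic] += 1.0   # experts adapt to what they see
            weekly_means.append(np.mean(times))
        return weekly_means

    old_routing = np.arange(n_topics)      # the assignment experts adapted to
    new_routing = np.roll(old_routing, 1)  # "better" topic matching on paper
    print("avg response time per week after the switch:",
          np.round(simulate(new_routing), 2))
    # Times start high (nobody is familiar with their new topics) and only come
    # down as experts adapt, even though the proxy loss got better.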
Both on the input side and the output side of a model there are all sorts of these nasty human feedback loops, which are hard even to identify and harder still to solve.