Correct. If you build your instrumentation correctly, then you don't really need to do any "cleaning."
That doesn't mean you won't need to transform the data for different uses (for example, changing data types, like turning a bool into an int), though ideally you wouldn't need to.
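To make the distinction concrete, here is a minimal sketch in Python, not anyone's actual pipeline. The `ClickEvent` type, its fields, and the `to_model_row` helper are all hypothetical names made up for illustration. The idea is that the instrumentation rejects bad values when the event is created, so downstream code only has to do a use-specific transformation (here, bool to int) rather than any cleaning.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClickEvent:
    # Hypothetical event emitted by well-instrumented application code.
    user_id: int
    is_conversion: bool

    def __post_init__(self):
        # Reject bad values at the source so nothing downstream needs "cleaning".
        # bool is a subclass of int in Python, so it is excluded explicitly.
        if not isinstance(self.user_id, int) or isinstance(self.user_id, bool):
            raise TypeError("user_id must be an int")
        if not isinstance(self.is_conversion, bool):
            raise TypeError("is_conversion must be a bool")


def to_model_row(event: ClickEvent) -> dict:
    # Transformation for one specific use: a consumer that wants 0/1
    # instead of True/False. This is a type change, not a repair.
    return {"user_id": event.user_id, "is_conversion": int(event.is_conversion)}


events = [ClickEvent(1, True), ClickEvent(2, False)]
rows = [to_model_row(e) for e in events]
```

Something like `ClickEvent(1, "yes")` fails immediately with a `TypeError`, which is the point: the producer finds out about the bad value, not the analyst months later.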
The problem is that data engineers who are geared toward analytics very rarely control the systems that create the data. If you're lucky, you have the task of hounding a team within your company to get their data management practices in order. And the conversation there is whether they should make their job harder in order to make your job easier.
Unfortunately, data engineers rarely deal with purely in-house data. You're gonna be pulling data from a variety of data sources. I can assure you that if you're pulling from government data sources, you're gonna have a hell of a time. Speaking from direct experience, my team is probably going to spend $10M/year just trying to keep a government dataset in order, because they won't do it themselves. I'm talking lawyers, legal analysts, data engineers, data scientists, data entry personnel, etc., just to fix data that should never have been broken in the first place.
It shouldn't be a shock that cleaning the data is the path of least resistance for many.
Hence why I said DEs need to be involved as early as possible. Aspirational, sure, but that's what I've seen work best, and repeatably. It's the only scalable solution IMO; otherwise you're perpetually playing catch-up.
On the point about the govt: I literally built a completely new contract type and civilian hiring practices for the DoD to bring in data engineers so they could do exactly what I describe to make your life easier.
Do data engineers have good analysis skills? Do business analysts have good engineering skills? I don't think either of them can fill the data scientist role.
The scientific training and mindset (scientific method, hypothesis, experiment setup, etc.) needed to even create an accurate model is an undervalued skill here, no? Even if data cleaning is automated, these skills cannot be easily learned.
There is a reason so many PhDs get into the field: they were trained in the exploratory/research mindset that no amount of engineering or analytics skill can replace. Correct me if I am wrong.
> Do business analysts have good engineering skills?
Depends on the analyst.
> I don't think either of them can fill the data scientist role.
> The scientific training and mindset (scientific method, hypothesis, experiment setup, etc.) needed to even create an accurate model is an undervalued skill here, no? Even if data cleaning is automated, these skills cannot be easily learned.
It's not about replacing data scientists with data engineers; it's about both roles working together to make everything more efficient.
The hiring rate for data scientists has plateaued. The industry doesn't need any more of them. Why? Because data scientists often can't solve problems fast enough. It's a commonly quoted statistic that 70% of any data science task is data cleansing and/or ETL. A data engineer's job is to take that 70% and turn it into 10%. The data engineer saves the data scientist time, so they can focus on what they're supposed to do: build models.
If we only had to use first-party data, that might be easier. But then again, if you’re building your product incrementally, you’re still going to have instrumentation holes that you may or may not be able to partially backfill.