Good to see that data cleaning was #1 on that list. Whenever I work on a side project, it takes way, way more time to get and structure the data than it does to run the algorithms. Granted, that's because I have to go out and get the data in the first place, and then make sure it's usable and in the correct format.
Take the recent project on the data blog I write (https://bigishdata.com), where I'm trying to classify country music songs by topic: the time spent scraping lyrics, removing duplicate / incorrect songs, and then manually classifying songs for training data is far longer than the time spent running the ML algorithms once I've gone through all that.
I've been looking for jobs recently, and I've seen only one job posting that mentions data cleaning as a necessity, whereas the rest only talk about data science and algorithm knowledge, or overall ETL design on the data engineering side. Seems like data set knowledge should be emphasized more.
I agree, and would add: data cleaning's importance to the quality of the result is also often underemphasized compared to the much bigger focus on the quality of the algorithms. A single bad decision on data cleaning can have a large effect on the end result (in many cases, more than choosing between algorithms, assuming you pick some vaguely reasonable algorithm). Especially any choice that ends up producing non-random effects, like deduplicating things in a way that ends up biased: it's common that missed duplicates in an automatic deduplication process aren't randomly distributed. Or a scraping process that ends up with biased samples. You can correct for these kinds of things in various ways (e.g. incorporating an estimate of the bias in a statistical model), but people who don't consider data cleaning a "real" part of the whole statistical modeling pipeline in the first place usually don't.
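To make that concrete, here's a toy sketch (entirely made-up records, with a hypothetical "source_b" that tags titles) of how exact-match dedup can miss duplicates non-randomly:

    # Toy illustration: exact-match dedup silently misses near-duplicates,
    # and the misses aren't randomly distributed -- every record from
    # "source_b" carries a trailing "(Live)" marker, so that source ends up
    # systematically over-represented after "deduplication".
    records = [
        {"title": "Jolene", "source": "source_a"},
        {"title": "Jolene", "source": "source_a"},         # caught: exact duplicate
        {"title": "Jolene (Live)", "source": "source_b"},  # missed: same song, different string
        {"title": "Tennessee Whiskey", "source": "source_a"},
        {"title": "Tennessee Whiskey (Live)", "source": "source_b"},
    ]

    seen, deduped = set(), []
    for r in records:
        if r["title"] not in seen:  # naive exact-match key
            seen.add(r["title"])
            deduped.append(r)

    print(deduped)  # the "(Live)" variants survive, biasing any per-song stats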
Hundred times this. You see Qlik/Cognos Analytics/PowerBI/Alteryx/whatever sales guys giving demos that make executives drool over the apparent easiness and wow factor these tools can produce. When the time comes to plug them into your production operational systems, CRM, what else, there comes the "now wait a minute" moment, especially if your systems and their data models happen to be even slightly on the more complex side.
Edit: The basis of successful implementation of these tools is having the data in a digestible format, and I feel that transforming the data into that business-usable format is where the big job is.
In my opinion, well-done ETL and DW are not going anywhere, even though in some circles they are said to be things of yesterday. And there's a huge difference between an OK ETL/DW and a brilliant ETL/DW. Designing a good ETL process is as much business and context knowledge as it is an application of data engineering skills. For example, it requires business knowledge AND data engineering knowledge to determine what kind of granular-level advanced metrics could or should be calculated during ETL. Service-level metrics and service-level categorization for different kinds of customers/claims/orders/... would be a perfect, simple-to-understand example problem: there could be attributes and value ranges behind multiple relations that probably need to be taken into account and understood.
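To make the service-level example concrete, here's a toy pandas sketch; the thresholds and column names are invented, and choosing them correctly is exactly the business-knowledge part:

    import pandas as pd

    orders = pd.DataFrame({
        "order_id": [1, 2, 3],
        "promised_days": [2, 5, 2],
        "actual_days": [2, 9, 4],
    })
    orders["delay"] = orders["actual_days"] - orders["promised_days"]
    # The bin edges encode a business rule, not an engineering one
    orders["service_level"] = pd.cut(
        orders["delay"], bins=[-1000, 0, 2, 1000],
        labels=["on_time", "acceptable", "breach"],
    )
    print(orders)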
Edit 2: I've been involved in both sales and execution of so-called data discovery sprints: 4-6 week periods where we bring together a data engineer, a subject matter expert, and the client's key personnel, and let them go "fishing". The key thing is that this gives clients a low-cost way to gain insight into the potential their data could provide. On the other hand, many prospective clients simply have such messy data that this data discovery job can't be recommended, which leads to other possible opportunities (MDM, ETL, DW).
Yes, correct. In a way it's a combination of philosophy, agreed practices, and technical solutions. I think this one is a good introduction: The What, Why, and How of Master Data Management https://msdn.microsoft.com/en-us/library/bb190163.aspx
Yes! Came here to write this too. I've been spending the last few years thinking, inquiring and talking about this, and I think there definitely is a field emerging here. There are already some companies trying to think about this stuff, and classes give it lip service but I don't think we've even seen the tip of the iceberg.
It's interesting to talk to different people about data quality and what they think it means, or how they choose to deal with it, and it's all over the place. Some people just mean open and consistent formats, some people have stylistic preferences for data shape, some people talk about accuracy of values, etc etc.
In some ways it's an extension of the idea that the world is inherently noisy, and we've been thinking about that one already; it just turns out you don't need sensor data a la robotics to get noisy data - it's already in the datasets we know and love, and you accumulate more of it the more sources you pull into your analysis.
Aye, but the locked-down nature of data governance actually fights against using it for insights. I've seen a situation where a small two-way frequency table was requested by Team 1, and it took six _months_ because Team 2 had access to the data but not the metadata, Team 3 had the metadata and could help design a query but didn't understand the technical details and couldn't run it, Team 4 had to approve the process, and Team 5 had to review Team 3's query before it could run. In the meantime Team 2 was reorged and Team 1, the original requestors, found upstream sources.
Data in a regulatory regime can be excruciatingly difficult, and lends itself to "gut instinct" being used because fear of risk and regulation locks things down too tight to be useful.
I agree with this. It's fine to teach machine learning using the iris dataset, but there is rarely, if ever, a section dedicated to "real" problems. It was a shock to me just how high a percentage of time is spent cleaning data. It is a fundamental skill that is not only underestimated but "undertaught".
>> I've been looking for jobs recently, and I've seen only one job posting that mentions data cleaning as a necessity, whereas the rest only talk about data science and algorithm knowledge, or overall ETL design on the data engineering side. Seems like data set knowledge should be emphasized more.
Actual data cleaning, usually in an automated sense, is more 'data engineering' than 'data science' or applied statistics. Feature engineering and 'massaging' training data is more related to DS but it's understood that this data being consumed by the DS is already in decent shape.
I'd hesitate to call a lot of the work I do cleaning data "engineering".
I think perhaps the problem here is that the term "science" covers a lot of disciplines.
I propose that the harder stats be "data theoretical physics", with "data biology" and the like referring to cases with messier real-world complications. I'm sure we can come up with a full spectrum.
Any specific resources you'd recommend on data cleaning, verification, etcetera? I've just started reading this: https://www.amazon.com/Accuracy-Economic-Observations-Oskar-... . I've seen a few other books on the subject which I'm planning to get into, but I'd be interested if anyone has specific recommendations.
Do you use R and the "hadleyverse"? (Or "tidyverse" I think as he prefers?)
I'm a programmer by trade but I use R because the people who actually work with data use it, and they write good tools for it... I think there is some confusion in the programming world about this. Programmers work with data, but they don't do it nearly as much as "professionals".
Tidy data is a good intro if you're not familiar with it.
No, I don't use R; thanks for the reference. Most of what I've done lately has been part of a software development process, along the lines of validating the effectiveness of different techniques for solving a known problem with a smallish test dataset -- typical engineering-style optimization. I'm looking to impose more structure on the process.
Honestly, as another commenter pointed out, seems like an emerging field. Best practices and processes are just being figured out, and I haven't seen any great resources online talking about what to do, especially since most of what you need to do depends on the data set and how you're storing the data, and that can vary widely.
Like, I recently dealt with finding duplicate song lyrics in my set of 5000 lyrics, and to do that I just had to google around for StackOverflow answers or random blog posts before I found something I could adapt and change for what I had.
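For what it's worth, the thing I eventually adapted boiled down to something like this (simplified; the normalization and the 0.9 threshold are guesses you'd tune per dataset):

    from difflib import SequenceMatcher

    def normalize(lyrics):
        # collapse case and whitespace so trivial differences don't mask duplicates
        return " ".join(lyrics.lower().split())

    def probably_duplicates(a, b, threshold=0.9):
        return SequenceMatcher(None, normalize(a), normalize(b)).ratio() >= threshold

    song_a = "On the road again / Just can't wait to get on the road again"
    song_b = "On the road again, just can't wait to get on the road again!"
    print(probably_duplicates(song_a, song_b))  # True

Comparing all pairs of 5000 songs is ~12.5 million comparisons, so in practice you'd first block on something cheap like artist + title prefix.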
Agreed! However, data cleaning is a pretty hard problem. My previous company stored merchant credit card transactions, and these transactions were large unwieldy beasts whose data model had changed many times over the course of the company. Old data was completely invalid, and yet we couldn't remove it because the dollars and cents had to add up. The cleaning significantly hindered new development.
Validations when storing new data definitely help, but changes to the data model are tough to reconcile with old data.
I agree - I have a project that has similar problems (cataloging standalone lectures - https://findlectures.com). The biggest advantage of it being a side project is there's no pressure to get the data cleaning done, but in a work environment with time pressure this type of project is a huge pain.
I'm currently preparing a lecture on the topic of logging for my students. Part of the lecture is of course how to use various logging frameworks, but the main part is what to log and how to structure logs.
Basically we're trying to get them to pre-emptively do data cleaning, so their logs will actually be useful for potential future data projects.
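A rough sketch of what we push them toward: structured (JSON) log lines with explicit fields, so future analysis is a parse away instead of a regex archaeology project. The field names here are just examples:

    import json, logging, time

    class JsonFormatter(logging.Formatter):
        def format(self, record):
            # one JSON object per line: trivially machine-readable later
            return json.dumps({
                "ts": time.strftime("%Y-%m-%dT%H:%M:%S", time.gmtime(record.created)),
                "level": record.levelname,
                "event": record.getMessage(),
                "context": getattr(record, "context", {}),
            })

    logger = logging.getLogger("app")
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)

    logger.info("order_placed", extra={"context": {"order_id": 1234, "total_cents": 5600}})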
I don't get why companies would hire data scientists to do ETL jobs. These are properly left to engineers with expertise in data warehousing. From what I see though, this is a pretty common occurrence in Silicon Valley.
In spite of assurances from business process owners that the underlying data sources are clean ... they almost certainly are not.
Multiple legacy systems with no consistent cross reference to unambiguously identify the same customer. Assured that systems have been gone through and all the names made consistent. Consistency for a human is not consistency for a computer. "Commers Ltd" is not the same as "Commers Ltd." And, isn't it lovely when a salesperson decides to add a location to a customer name. Now we have "Commers Ltd Dallas" as a unique customer. Business process discipline is often lacking and will mess you up.
Subscription data sources that change their schema with no notification to paying customers. And, when you are scraping data from websites you need to constantly be checking that your scrapers are still working properly. Source websites change regularly.
Crazy processes like entering a negative invoice to indicate a refund to customers but forgetting to zero out the cost of goods related to the invoice. We may have refunded the money but we didn't do the work twice. Arggh! Errors abound.
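For the name mess specifically, the first thing I'd reach for is normalization along these lines; it's a sketch, not a solution, and the suffix/location lists are assumptions you'd have to build per dataset:

    import re

    LEGAL_SUFFIXES = {"ltd", "llc", "inc", "co", "corp", "gmbh"}  # extend per dataset
    KNOWN_LOCATIONS = {"dallas", "houston", "chicago"}            # assumes you can build such a list

    def normalize_customer(name):
        tokens = re.sub(r"[^\w\s]", "", name.lower()).split()
        tokens = [t for t in tokens if t not in KNOWN_LOCATIONS]  # drop "dallas" etc.
        while tokens and tokens[-1] in LEGAL_SUFFIXES:
            tokens.pop()                                          # drop trailing "ltd", "inc", ...
        return " ".join(tokens)

    for raw in ["Commers Ltd", "Commers Ltd.", "Commers Ltd Dallas"]:
        print(raw, "->", normalize_customer(raw))  # all three map to "commers"

And even this will merge legitimately distinct customers sooner or later; every cleaning rule like this is a judgment call with consequences.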
> And, isn't it lovely when a salesperson decides to add a location to a customer name. Now we have "Commers Ltd Dallas" as a unique customer.
Good luck modeling data to not need this. I have caught myself actively telling people to do this more than once, and I'm really thinking about replicating some data so those changes are less disruptive.
That "Commers Ltd Dallas" probably has differing billing and delivery addresses, points of contact, invoice formats, customers representatives and preferred sellers, product selection, and probably everything else you have on your DB.
It's really the only option if your input/management software cannot model the complexity you require "after the fact". E.g. the company has multiple offices, delivery addresses are per office, not per company, etc.
The real problem is when there is chaotic, organic mixing, matching, and re-purposing. I've seen it many times with "non-technical" individuals. They don't know what their software can do. E.g. Redmine: the support people just log everything under the same IssueType, then "categorize" it using a Category custom field instead of the standard category, which has enumerations. And then they use that Category field to drive reports/processes, instead of using a different IssueType or Tracker, which is what it was designed for, and which has tools that help you leverage/manage the complexity of different standardized processes.
Then they decide to add "sub-categories" into the Category field, instead of using a project hierarchy or something. Then they want to do billing reports from the time logged per X and of types A, B, C, and at that point it's a giant mess and I stop caring. If they want to not use the software as intended, and then do "fixing" by filtering and fiddling with Redmine CSV exports in Excel afterwards, that's their problem. Oh, and they ask that everyone has permissions to everything, allowing all users to change the status of each IssueType as they please, without any process.
I just feel sorry for the poor individual who gets a raw extract of that data and has to use it for something.
That is fine, but don't later ask why the system has 10 different customers for Commers Ltd.
I ended up building another table and logic just to do roll-ups and account for name variations. But I told the client that they really need to invest in serious cross-referencing middleware that tracks identities across all these systems and uses ID numbers to coordinate all the legacy systems.
But yes, trying to develop models or even simple aggregations that rely on this kind of data can be quite frustrating.
The big one that's missing: There is nothing you can conclude from your data. It's clean, it makes its way properly to the analyst, and yet, there's just nothing there...
A great example of this was reported to me from a head of data science at a (UK) national newspaper. The business set a task of predicting subscriber numbers using online user behaviour from the previous month as features.
But it turned out that most subscribers converted within three days of first hitting the site. The previous month of data was almost completely worthless. Like you say, the data just didn't contain the information that it 'should' have.
At least in online contexts, this often means 'there might be an effect, but it is smaller than our experiment could have detected. Let's keep adding samples.'
That's why it's called "data mining" - you keep digging until you find something. And with lots of data, you can always find something if you look hard enough - which leads to things like http://tylervigen.com/discover.
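It's easy to demonstrate: generate enough random, meaningless series and some pair will look impressively correlated.

    import numpy as np

    rng = np.random.default_rng(0)
    series = rng.normal(size=(200, 50))  # 200 independent random "metrics", 50 observations each

    # exhaustively "mine" all ~20k pairs for the strongest correlation
    best = max(
        ((i, j, np.corrcoef(series[i], series[j])[0, 1])
         for i in range(len(series)) for j in range(i + 1, len(series))),
        key=lambda t: abs(t[2]),
    )
    print(best)  # typically |r| > 0.5 despite there being no relationship at all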
Absolutely agree that data cleaning should be at the top -- how someone prioritizes data cleaning is, for me, the main litmus test of how effective they are at real-world data problems. I also agree with how the author summarizes the issue, but he runs into the same issue I have: data cleaning is such a broad term that it obscures how difficult and important a problem it is.
For example, some people think data cleaning is "convert 12-FEB-2012 to 2012-02-12" type problems, and can't believe that such a task could be 80 to 90% of the difficulty in data work (compared to, say, learning enough ggplot2 to make a nice chart).
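To be fair, the easy version really is a near-one-liner in Python, which is exactly why people underestimate the rest; the 80-90% shows up when dates arrive in a dozen undocumented, sometimes ambiguous formats:

    from datetime import datetime

    # the "toy" version of data cleaning: a single well-specified format
    print(datetime.strptime("12-FEB-2012", "%d-%b-%Y").strftime("%Y-%m-%d"))  # 2012-02-12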
On the other side of the equation, you have people who want to do a JOIN-GROUP-BY aggregate so they can calculate how much "evil" Wall Street money goes to each political candidate, a la OpenSecrets's calculation [0], only to find that the FEC does not classify campaign contributions by industry type or company, nor is the "employer" field filled with normalized entries such as "Evil Wall Street Company" that would lend themselves to easy GROUP BY calls. For fuck's sake, I've found that executive-level/professor folks can't even spell "Goldman Sachs" and "Berkeley" correctly (even on a typed form).
And that doesn't even scratch the surface of how little this person knows about the data question they purport to answer, or about how the FEC, the American political system, and real life work. Among the data cleaning problems they will have to mitigate are also the two hardest problems in computer science (how things are named/classified, and how up-to-date the data is).
I don't have any better ideas at the moment for how to break apart the category of "data cleaning" in a way that reveals the many facets of the problem but still preserves the interrelatedness of those facets. But it's possible to be very good at some parts of data cleaning without knowing the rest.
Besides, "cleaning" implies some entry-level gig, something for a QA hack, not someone experienced in complex systems. Marketing types love this kind of twist as a way to maintain control of the project (and ensure its failure). It's like a scapegoat. I just got out of a meeting where marketing claimed that data "hygiene" is gonna be a priority in 2017.
I have spent a lot of time talking with customers/prospects about this topic, but I use "data remediation", which I feel brings more accurate and precise connotations, to wit:
* Implies that the data is deficient/falls short of expectations.
* Implies that the shortcoming currently makes it ineligible to graduate to the next level.
* Implies that with hard work and additional time likely it can be made sufficient though still not ideal.
* Implies that someone failed to help the data to meet expectations.
* Implies that you need special outside expertise, namely someone with the knowledge needed to assess the shortfall, possibly help you clarify your standards, design steps that when followed should result in "good enough" data, and who is able to articulate the remaining weakness(es), which need to be accounted for when assessing the future suitability of that dataset for a given purpose.
* Implies that your data will be stuck in school all summer while their friends are out having so much fun.
A lot of data scientists these days (me included) are former academics with backgrounds in numerical simulation in fields like chemistry, physics, mechanical engineering, etc.
They live and breathe numerical linear algebra and are comfortable reading advanced theoretical books or papers.
It's easy for them to pick up the basics needed to pass interviews and find a data science job. How would they go about adding some rigor to their understanding of ML and statistics?
I wouldn't expect the majority of data science jobs to be particularly focused on the math behind the algorithms. Rudimentary understanding of probability and how to translate the jargon into your academic background's jargon is more important than deep understanding for these jobs. Passing the interviews for these jobs is one thing. Unless you're specifically looking for jobs that focus on generating new modeling techniques or algorithms for computational statistics, expect to be far removed from even basic linear algebra in actual practice. Source: me. I fall in your described bucket and have worked in data science/machine learning jobs in both contexts (new modeling techniques/stats versus application of off-the-shelf tools).
One more interesting thing I have observed in failing data projects: organizational culture around data, and the gap between the data science team and engineers. Say you have two top-notch data scientists who know enough (stats, Markov chains, algorithms, and so on). But say an average engineer in the organization doesn't know even a bit about A/B testing, or the difference between building a machine learning model vs. obtaining predictions from an already-built model. Then no matter how good your so-called data scientists are, the end result in terms of product or solution delivery is always sub-optimal. If the engineering and data science teams can't speak a common language, the result is always disastrous. Note that the gap is specifically about understanding data analysis as a domain.
The effort to narrow this gap must be driven by the lead data scientist or the CTO. Something like a mandatory "data bootcamp" for every new joiner can help. I've read that Facebook has such a mandatory bootcamp.
I enjoyed the conciseness of the article, but I don't have a big stake in this field so I can't comment on the content itself. I wanted to make a small formatting note: having key parts of the article only as images (not reproduced or described in the text) is bad for accessibility, search engines, etc. An article should still make sense if you take out all the images.
I think point 7 needs work. Often people use words like "interpretable" to avoid having to think about the data, usually in the context of linear or logistic regression. The model seems "interpretable" because the coefficients are "meaningful", but often the model is just as much a black box as any other: regression coefficients depend on what other features you include and on the scale of those features, and regression p-values are very easy to misinterpret. I think you should use the data to determine what the model is doing, regardless of the model you use.
In summary, it is important to iterate quickly and to validate your results. With complex models like gradient boosted decision trees, you can often iterate much more quickly than with simple models because you don't have to do extensive data preparation. Many analysts are stuck in the mode of using linear or logistic regression for every problem when there are better tools out there.
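A small numpy-only sketch of the coefficient point (synthetic data, with ordinary least squares standing in for the general case): the same underlying relationship yields very different coefficients depending on feature scale and on which correlated features you include.

    import numpy as np

    rng = np.random.default_rng(1)
    n = 1000
    x1 = rng.normal(size=n)
    x2 = 0.9 * x1 + rng.normal(scale=0.1, size=n)  # highly correlated with x1
    y = 2.0 * x1 + rng.normal(size=n)              # true relationship involves only x1

    def coefs(*features):
        X = np.column_stack(features)
        return np.linalg.lstsq(X, y, rcond=None)[0]

    print(coefs(x1))        # ~[2.0]  -- looks "meaningful"
    print(coefs(x1 * 100))  # ~[0.02] -- same relationship, different scale, different story
    print(coefs(x1, x2))    # x1's coefficient moves once a correlated feature joins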
I think this is great, but point #4 is way off the mark. It's pretentious and it's gatekeeping. It's not true that a strong analytics leader as you've described is incapable of understanding selection bias, measurement bias, Simpson's paradox, and statistical significance. I would posit that a great analytics leader needs to understand all of those. It's not true that only a "data scientist" with a bunch of fancy programming and modeling experience is capable of understanding bias.
Also you misspelled "breathe". As in "live and breathe".
This is me. I work for a non-profit that is stuck in the stone age--not for lack of money, mind you, but because the IT Director is an incompetent megalomaniac who views "security" as a reasonable justification to refuse any and all requests, and treats everyone like an enemy.
I haven't been allowed to use Python or R. In fact, the only programming language I have access to is VBA (for applications, not the stand-alone variant). Of course that's a huge mess because the IT director disables macros once a month, generally right after another crypto attack makes the news. Thankfully, he didn't even realize that it was possible to use VBA from inside any office application until after I had already used it to create several Access applications which made the jobs of the most important people in the organization easier. So when he breaks VBA every director in the organization yells at him and the functionality is restored nearly instantly.
Of course he could restrict the applications to run only signed macros, but he won't give me permission to sign things because he is (literally) afraid I might hack something.
On top of that, my computer is a Core 2 Duo from 2007 or so with 4 gb of ram. He bought over 100 of them used from a computer recycler about 2 years ago. For the first three months at this job I had a Pentium D, which literally couldn't run Excel and Firefox at the same time. I'm not allowed to get a better computer, because the employee handbook states that every computer needs to be the same for "security" reasons. If my director used our budget to purchase a computer I wouldn't be granted access to any of the databases containing our data because of "HIPAA compliance." (For the record, we don't have any medical data whatsoever. We only have names, addresses, and donation amounts. We don't even know the birthdays of our constituents.)
The worst part is that we randomly started losing data after all of our network drives were moved offsite at one point to provide "redundancy." I created several tickets about this issue, and each time I was told that it couldn't possibly have happened and there was no record of the file ever existing. I wrote a script that created a log file each hour with a list of files and their attributes from each directory, to try to record proof of this happening. After I had recorded about a week of files disappearing randomly overnight, he reported me to HR for hacking.
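The script itself was nothing fancy; in Python terms (the real one had to be VBA, of course, and the paths here are placeholders) it was roughly:

    import os, time

    WATCHED = r"\\fileserver\shared"  # placeholder network drive path
    LOG_DIR = r"C:\snapshot_logs"     # placeholder

    # one snapshot file per run; schedule it hourly
    stamp = time.strftime("%Y%m%d_%H%M")
    with open(os.path.join(LOG_DIR, f"snapshot_{stamp}.log"), "w") as log:
        for root, _dirs, files in os.walk(WATCHED):
            for name in files:
                path = os.path.join(root, name)
                st = os.stat(path)
                log.write(f"{path}\t{st.st_size}\t{st.st_mtime}\n")

Diffing consecutive snapshots shows exactly which files vanished overnight.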
Once I proved nothing I did was wrong, he amended the "IT security" section of the employee handbook. Several of the new measures were impossible to follow because of restrictions he had placed on the computers/network. I brought this up with HR, and they removed those measures from the handbook. Once this happened, he sent an email to me, cc'ing my boss and HR, accusing me of trying to frame him by deleting files. I don't know how that accusation even made sense, because the files would still have to show up in transaction logs.
Despite all this, I KNOW my director and HR aren't going to believe me when I tell them I'm quitting because our IT director is an incompetent tyrant. From their perspective, IT issues are something that can be solved by compromise, just like everything else. So IT has to let me use VBA, and that should be enough.
Anyways, long story short, anybody hiring in Chicago?
Is the nonprofit you work for affiliated with a particular University on the south side?
I can sympathize with having to deal with VBA. I'm working in a lab that deals with lots of questionnaire data and uses Access as the main tool for gathering said data because that's the way things have been done in the past, despite the fact that nobody in the lab really knows how it works.
I can also sympathize with having your network drives go down and render everything inoperable. Everything we use in the lab is stored on an offsite network drive, probably because of HIPAA compliance, and said network drive has dropped out twice in the past couple weeks. Once for almost an entire day, and once for an hour or two.
It is not. I've heard working there is hit or miss depending on who your boss is.
I don't want to hate too much on Access. It's really an amazing program. You should see the processes some of my coworkers managed to create despite not knowing an iota of SQL.
Thanks for the well-wishes. I'm being very methodical because I really want to make the move count. I'm okay with waiting for the right opportunity because I really enjoy my job outside of the horrible IT situation: my boss is awesome, the people I work with are awesome, there is a lot of variety in the role, and I get to make a lot of decisions. Regardless, the IT situation is limiting my growth, so I'm on the lookout for the next thing.
To be honest, your IT director sounds like an idiot, but the real problem is YOUR managers. They should have protected you and told the IT director to f...k off a long time ago.
I've had the same thought. I actually met with my boss and HR about this, and they said there was nothing they could do. The issue is the company structure. I work for a 501c3 affiliated with another organization. The parent company provides all of our admin functions like HR, IT, etc. We share a board, and the director of each department reports directly to the board. The IT director is really well liked by the board because he never spends any money.
The average age of our board is somewhere north of 80 (not kidding), and they don't understand what IT is. True story--our parent organization didn't have a website until 2003, because the IT director thought websites were a fad. He was forced to buy the domain after someone else bought it and used it to post stuff the board found unappealing.
The only way to describe the entire situation is Kafkaesque.
In the end the only way to fix this is probably to leave the company if you can. If senior leadership doesn't see a problem there isn't much you can do.
I've thought about putting all my code on github, just for kicks. My code is a nightmare though, because I don't have version control, and Office VBA doesn't allow inheritance. As such, the amount of code that has been copy-pasted is bonkers.
That being said, the amount of VBA I've seen that doesn't work with Option Explicit is a little staggering, so maybe I shouldn't be so self-conscious.
I am with you on everything but the complaint about
> my computer is a Core 2 Duo from 2007 or so with 4 gb of ram
That's a super freaking powerful machine; you have to be efficient in your programs and good with your algorithms. My machine was a Pentium III with 800MB of RAM for the longest time. There is a lot you can do on that. Use algorithms that load data in chunks, exploit memory mapping, and generate native code if you can. They go a long way, likely much further than some may think.
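For example, a mean over a file far bigger than RAM, without ever holding it all in memory (file name and dtype are placeholders):

    import numpy as np

    # memory-map the raw binary file; nothing is read until sliced
    data = np.memmap("big_measurements.f64", dtype=np.float64, mode="r")

    chunk, total, count = 1_000_000, 0.0, 0
    for start in range(0, data.shape[0], chunk):
        block = data[start:start + chunk]  # only this slice gets paged in
        total += float(block.sum())
        count += block.size
    print(total / count)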
I can make my programs work, but it's not a good allocation of resources. Two hours of my time costs as much as that computer. The computer that would let me do my job with the most efficacy costs the equivalent of about 10 hours of my time.
The question is whether the better computer saves more than 8 hours over the course of a year, and the answer is yes. The amount of time it takes to load stuff from disk slows down programs immensely. If my computer had 16 GB of memory I'd be able to store and manipulate all my data in memory.
That's not even counting how much time is wasted optimizing code that I run rarely. That time is much better spent doing other things.
An honest question: if I can make really good money helping businesses make sense out of "smallish" data in Excel, why would I subject myself to the miasma that is "data science"? Will I be able to charge lawyer-like rates?
A big fallacy seems to be that it is meaningful to just use existing data in whatever cruddy non-normalized form it comes in, and let the 'algorithms' sort it out.
There needs to be strong management support for getting outcomes, because the production team is going to have to change to support it, and they usually like doing things their own way. Typically without analytics, sensible logging formats, or a clue as to why the outside world behaves the way it does.
This is great advice. I'd add that the culture around data engineering projects tends to be very different.
I've seen companies that treat data projects as if they were great unknowns, where the developers could get away with using bad patterns or none at all, and with not following the patterns that other applications in the company use.
Technologies like Spark have made it more common and easier to develop big data applications and to implement design patterns that regular engineers can understand and follow.
Couple a great data engineer with a great data scientist using tools like Spark, R, H2O, Alluxio, Parquet, etc., and companies can truly exploit their large data sets effectively.
The problem is DevOps and bridging the gap between a scientist's environment and a production environment and keeping both as flexible and testable as possible.
We started a company to bootstrap companies into this culture by providing DevOps services and UIs that simplify the deployment of Kubernetes, Spark, Druid, H2O, etc. clusters. We also provide tools and services for simplifying and automating ETL pipelines with which models can be trained.
If you are interested in finding out more about these services contact us at: miguel@zero-x.co.
What I have seen in my limited practice with machine learning and big data projects is that it is easy to fool yourself into thinking your methods work. And the problem is that the people who are good at this get promoted, while those who find the mistakes are shunned.
"Somebody heard: Data is the new Oil: No it isn't. Data is not a commodity, it needs to be transformed into a product before it's valuable"
Err...Oil needs to be transformed and refined before it can be called a product (like gasoline, plastics). So the analogy is good and even supports #1!
I was recently working with a startup that is trying to tackle some of these issues. Feel free to check them out and provide some feedback! (https://datablade.com)
This is related to point 7, a solution in search of a problem. I've been guilty of this myself when I wanted to use deep learning models just because I could. My much more experienced boss gently dissuaded me, and I ended up with a 'boring old' logistic regression, which was completely adequate for the job.
Another time I was working with an engineer who built a neural net to predict something. It turned out to be a really poor choice, as interpretability was important for the problem and the neural net's predictive power was actually worse than that of more traditional models.