Of course everyone agrees that "cleaning data" is difficult and boring, and it's...

kmax12 · on Oct 30, 2017

I am the lead contributor of a python library called Featuretools[0]. It is intended to perform automated feature engineering on time-varying and multi-table datasets. We see it as bridging a gap between pandas and libraries for machine learning like scikit-learn. It doesn't handle data cleaning necessarily, but it does help get raw datasets ready for machine learning algorithms.

We have actually used it to compete on Kaggle without having to perform manual feature engineering with success.

[0] https://www.featuretools.com

ScottBurson · on Oct 31, 2017

Wow, this looks very cool!

IanCal · on Oct 31, 2017

I'm starting to build up various utilities to help with this kind of thing, but I fully agree. The decisions require understanding the business requirements (do I use source X or Y for field 1, what errors are OK, what types of error are worst, etc), but the process of finding some of these could be better.

One simple one is missing data. Missing data is rarely a null, I've seen (on one field, in one dataset):

    N/A
    NA
    " "
    Blank # literally the string "Blank"
    NULL # Again, the string
    No data
    No! Data
    No data was entered for this field
    No data is known
    The data is not known
    There is no data

And many, many more. None can be clearly identified automatically, but some processes like:

Pull out the most common items, manually mark some as "equivalent to blank" and remove.

Identify common substrings with known text (N/A, NULL, etc) and bring up those examples.

Are useful, I'd like to extend with more clustering and analysis to bring out other common general issues but rare specific issues. Lots of similar things with encodings, etc. too.

Other things that might be good are clearer ways I could supply general conditions I expect to hold true, then bring the most egregious ones to my attention so I can either clear out / deal with them in some way. A good way of recording issues that have already been analysed and found to be OK would be great too.

philvb · on Oct 31, 2017

Yes, completely agree that each dataset requires decisions to be made that can't be automated, but there are huge opportunities for tools to assist users in understanding what cleaning decisions they might want to make and how those decisions affect the data. Most data cleaning tools do a very poor job of helping the user visualize and understand the impact cleaning has on data - they're usually very low level (such as pandas).

As an example of a tool: Trifacta (disclaimer I work here) https://www.trifacta.com/products/wrangler/. We're trying to improve data cleaning with features such as suggesting transforms the user might want, integrating data profiling through all stages to discover and understand, and transform previews so the user can understand the impact.

I think there's a huge opportunity for better tools in the problem space.