I'm starting to build up various utilities to help with this kind of thing, but I fully agree. The decisions require understanding the business requirements (do I use source X or Y for field 1, what errors are OK, what types of error are worst, etc.), but the process of finding some of these could be better.
One simple one is missing data. Missing data is rarely a null; I've seen all of these (on one field, in one dataset):
N/A
NA
" "
Blank # literally the string "Blank"
NULL # Again, the string
No data
No! Data
No data was entered for this field
No data is known
The data is not known
There is no data
And many, many more. None can be clearly identified automatically, but some processes are useful, like:
Pull out the most common items, manually mark some as "equivalent to blank" and remove them.
Identify common substrings with known text (N/A, NULL, etc.) and bring up those examples.
I'd like to extend this with more clustering and analysis to bring out other issues that are common in their general form but rare in any one specific form. There are lots of similar things with encodings, etc. too.
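As a rough illustration of those two processes, here is a minimal Python sketch. The names (`KNOWN_TOKENS`, `normalise`, `most_common_values`, `placeholder_candidates`) and the seed token list are made up for this example, not taken from any real utility, and the output is only a list of candidates for a human to review.

```python
import re
from collections import Counter

# Hypothetical seed list of text that often means "no value" (an assumption).
KNOWN_TOKENS = ("n/a", "na", "null", "blank", "no data", "not known")


def normalise(value):
    """Collapse whitespace, trim and lower-case so similar spellings group together."""
    return re.sub(r"\s+", " ", value).strip().lower()


def most_common_values(values, n=20):
    """Process 1: the n most frequent normalised values, for manual marking."""
    return Counter(normalise(v) for v in values).most_common(n)


def placeholder_candidates(values, tokens=KNOWN_TOKENS):
    """Process 2: distinct values that are empty or contain a known token.

    Tokens must not be embedded in a longer word, so "Canada" is not
    flagged just because it contains "na".
    """
    alternatives = "|".join(re.escape(t) for t in tokens)
    pattern = re.compile(r"(?<!\w)(?:" + alternatives + r")(?!\w)")
    counts = Counter(normalise(v) for v in values)
    return {v: c for v, c in counts.items() if v == "" or pattern.search(v)}


if __name__ == "__main__":
    field = ["N/A", "NA", " ", "Blank", "NULL", "No data", "No! Data",
             "No data was entered for this field", "The data is not known",
             "Canada", "Nathan"]
    print(most_common_values(field, n=5))
    print(placeholder_candidates(field))
```

Something like "No! Data" slips past the token match, which is the point about nothing being reliably automatic: the frequency list and a human pass still have to catch it.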
Other things that might be good are clearer ways to supply general conditions I expect to hold true, and then have the most egregious violations brought to my attention so I can either clear them out or deal with them in some way. A good way of recording issues that have already been analysed and found to be OK would be great too.
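A minimal sketch of what that could look like in Python, purely as an assumption about the shape of such a tool: each expected condition is a named function that returns a severity score (0 meaning the row is fine), violations are ranked worst first, and anything already reviewed and accepted is skipped. `Expectation`, `worst_violations`, `reviewed_ok` and the age rule are all hypothetical names for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Expectation:
    name: str
    # Returns 0 when the row satisfies the condition, larger when worse.
    severity: Callable[[dict], float]


def worst_violations(rows, expectations, reviewed_ok=frozenset(), limit=10):
    """Return up to `limit` (score, expectation name, row index) tuples, worst first.

    `reviewed_ok` holds (expectation name, row index) pairs that have
    already been analysed and accepted, so they are not reported again.
    """
    found = []
    for index, row in enumerate(rows):
        for exp in expectations:
            score = exp.severity(row)
            if score > 0 and (exp.name, index) not in reviewed_ok:
                found.append((score, exp.name, index))
    found.sort(key=lambda item: item[0], reverse=True)
    return found[:limit]


# Hypothetical rule: ages should fall in 0..120; severity is the distance outside.
age_in_range = Expectation(
    "age_in_range",
    lambda row: max(0 - row["age"], row["age"] - 120, 0),
)

if __name__ == "__main__":
    rows = [{"age": 34}, {"age": -5}, {"age": 130}, {"age": 900}]
    already_checked = {("age_in_range", 2)}  # row 2 was looked at and accepted
    print(worst_violations(rows, [age_in_range], reviewed_ok=already_checked))
```

Keying accepted issues by row index only works while the dataset is not reordered; a stable record ID would be a safer key in practice.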