I'm starting to build up various utilities to help with this kind of thing, but ...

I'm starting to build up various utilities to help with this kind of thing, but I fully agree. The decisions require understanding the business requirements (do I use source X or Y for field 1, what errors are OK, what types of error are worst, etc), but the process of finding some of these could be better.

One simple one is missing data. Missing data is rarely a null, I've seen (on one field, in one dataset):

    N/A
    NA
    " "
    Blank # literally the string "Blank"
    NULL # Again, the string
    No data
    No! Data
    No data was entered for this field
    No data is known
    The data is not known
    There is no data

And many, many more. None can be clearly identified automatically, but some processes like:

Pull out the most common items, manually mark some as "equivalent to blank" and remove.

Identify common substrings with known text (N/A, NULL, etc) and bring up those examples.

Are useful, I'd like to extend with more clustering and analysis to bring out other common general issues but rare specific issues. Lots of similar things with encodings, etc. too.

Other things that might be good are clearer ways I could supply general conditions I expect to hold true, then bring the most egregious ones to my attention so I can either clear out / deal with them in some way. A good way of recording issues that have already been analysed and found to be OK would be great too.