
From a pragmatic viewpoint, the CSVs that I get from finance (usually saved as .xlsx) all have the same parsing issues. Because the issues are consistent, I can automate the conversion from .xlsx to CSV, then process the CSV with awk to eliminate errors before parsing it further (for import, analysis, etc.). Sure, I'm essentially parsing the CSV twice, but because the issues are consistent, automating it keeps the process efficient.

Obviously that wouldn't work for CSVs with different structures, but it can be effective in the workplace in certain scenarios.
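For illustration, here is a rough sketch of that kind of second cleanup pass, written in Python rather than awk. The specific issues it fixes (padded cells, blank spacer rows, thousands separators in the last column) are assumptions made up for the example, not the actual problems in the finance exports, and `clean_rows` is a hypothetical helper name.

```python
import csv
import io


def clean_rows(raw_text):
    """Second pass over an already-exported CSV.

    Assumed (hypothetical) issues: padded cells, blank spacer rows,
    and amounts in the last column written like "1,234.50".
    """
    reader = csv.reader(io.StringIO(raw_text))
    header = [h.strip() for h in next(reader)]
    cleaned = [header]
    for row in reader:
        cells = [c.strip() for c in row]
        if not any(cells):
            continue  # skip rows that are empty or contain only empty cells
        # Drop thousands separators so the amount parses as a plain number.
        cells[-1] = cells[-1].replace(",", "")
        cleaned.append(cells)
    return cleaned


if __name__ == "__main__":
    sample = 'Account , Amount\n"Rent","1,234.50"\n,\n Travel ,"12.00"\n'
    for row in clean_rows(sample):
        print(row)
```

Because the fixes are hard-coded to one known layout, this only pays off when every file from the same source breaks in the same way.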




As long as a human didn't generate the file, all things can be automated.

However, if you ever have the misfortune of dealing with human-generated files (particularly Excel files), then you will suffer much pain and loss.

I once had to deal with a "CSV" which had not one, not two, but six(!) distinct date formats in the same file. Life as a data scientist kinda sucks sometimes :shrug:.
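One common way to cope with a column like that is to try a list of candidate formats in order and take the first that parses. The sketch below assumes six formats chosen purely for illustration (the actual six in that file are not known), and `parse_date` is a hypothetical helper.

```python
from datetime import datetime

# Hypothetical candidates; order matters when a value is ambiguous.
CANDIDATE_FORMATS = [
    "%Y-%m-%d",   # 2021-03-04
    "%d/%m/%Y",   # 04/03/2021 (day first)
    "%m-%d-%Y",   # 03-04-2021 (month first)
    "%d %b %Y",   # 4 Mar 2021
    "%Y%m%d",     # 20210304
    "%B %d, %Y",  # March 4, 2021
]


def parse_date(text):
    """Return a date for the first format that matches, or None."""
    text = text.strip()
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue  # not this format, try the next one
    return None  # caller decides what to do with unparseable values
```

This does not resolve genuine ambiguity: if the same separator is used for both day-first and month-first dates, a value like 03/04/2021 is simply read by whichever format comes first in the list, so unparseable and suspicious rows still need a human look.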


Before 2010 and UTF-8 everywhere, I regularly had the misfortune of dealing with multi-encoding CSVs. Someone got CSVs from multiple sources and catted them together. One source used ISO 8859-1, another ISO 8859-15, another UTF-8, and sometimes a Greek or Russian encoding or even EBCDIC was in there. Fun trying to guess where one stopped and the other began. Of course, none of them were consistent in line endings (CRLF) or escaping.
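A crude heuristic for files like that is to decode line by line: try strict UTF-8 first and fall back to a single-byte encoding when it fails. The sketch below assumes ISO 8859-1 as the fallback and uses a hypothetical `decode_mixed` helper. It is only a guess at each line's encoding: it cannot tell ISO 8859-1 from ISO 8859-15 or a Greek or Cyrillic code page, and it would not help with EBCDIC, where even the newline bytes differ.

```python
def decode_mixed(data, fallback="iso-8859-1"):
    """Decode bytes line by line, returning (text, encoding_used) pairs.

    Strict UTF-8 is tried first; lines that are not valid UTF-8 are
    decoded with the assumed single-byte fallback, which never fails.
    """
    decoded = []
    # bytes.splitlines() accepts \n, \r\n and \r, so mixed line endings are fine.
    for raw in data.splitlines():
        try:
            decoded.append((raw.decode("utf-8"), "utf-8"))
        except UnicodeDecodeError:
            decoded.append((raw.decode(fallback), fallback))
    return decoded


if __name__ == "__main__":
    blob = "café,1\r\n".encode("utf-8") + "naïve,2\n".encode("iso-8859-1")
    for text, encoding in decode_mixed(blob):
        print(encoding, text)
```

Recording which encoding was used per line at least shows roughly where one source stops and the next begins.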



