Data Analysis: The Hard Parts (mikiobraun.de)
7 points by SanderMak on Feb 17, 2014 | 1 comment



Great article outlining some of the issues that make really sound data analysis hard work even though tool sets have progressed substantially.

One thing I do is to remind myself to walk through the basics of a new set of data before I start constructing models.

These include:

1) Measurement - Manually examine the data in the rawest form possible. You may find, for example, that 14% of the data collected at stage 1 has not made it to stage 2. Why? To look for measurement error you need to actually dig into how your data is being captured. Also ask yourself to what extent the measured variables capture the relevant business reality.

2) Data fusion - Manually examine individual cases after the raw data has been fused into a dataset/frame for analysis. Can you verify that the merge actually worked properly? In particular, sort your data to find cases where one or more of the variables are extreme. Extremes often reveal either errors or a glimmer of the predictive gold you are mining for.

3) Frequency distributions - Run them on all your variables, looking for out-of-bounds data, oddly shaped distributions, etc.

4) Pairwise comparisons - Run them on individual variables. Create scatterplots, run basic correlations, and think about the relationships between the measured variables. This will start to give you a basis for creating more sophisticated models.
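
For what it's worth, here is a rough sketch of those four checks in plain Python. Everything in it is made up for illustration: the toy records, the field names (`id`, `age`, `spend`, `segment`), the 0 to 120 age bound, and the `pearson` helper are all assumptions, not anything from the article. In practice you would likely reach for a dataframe library, but the logic is the same.

```python
from collections import Counter
from math import sqrt

# Hypothetical toy data: records captured at stage 1 and at stage 2, keyed by "id".
stage1 = [
    {"id": 1, "age": 34, "spend": 120.0},
    {"id": 2, "age": 51, "spend": 310.0},
    {"id": 3, "age": 29, "spend": 95.0},
    {"id": 4, "age": 430, "spend": 150.0},  # implausible age, planted on purpose
    {"id": 5, "age": 45, "spend": 260.0},
]
stage2 = [
    {"id": 1, "segment": "a"},
    {"id": 2, "segment": "b"},
    {"id": 4, "segment": "a"},
    {"id": 5, "segment": "b"},
]

# 1) Measurement: which stage-1 records never made it to stage 2?
stage2_ids = {r["id"] for r in stage2}
missing_ids = [r["id"] for r in stage1 if r["id"] not in stage2_ids]
drop_rate = len(missing_ids) / len(stage1)

# 2) Data fusion: merge on "id", then sort so the extremes are easy to eyeball.
segment_by_id = {r["id"]: r["segment"] for r in stage2}
merged = [
    {**r, "segment": segment_by_id[r["id"]]}
    for r in stage1
    if r["id"] in segment_by_id
]
by_age = sorted(merged, key=lambda r: r["age"])
youngest, oldest = by_age[0], by_age[-1]

# 3) Frequency distributions and out-of-bounds values.
#    The 0..120 bound for age is an assumption chosen for this toy data.
segment_counts = Counter(r["segment"] for r in merged)
out_of_bounds = [r for r in merged if not 0 <= r["age"] <= 120]

# 4) Pairwise comparison: a basic Pearson correlation on the in-bounds rows.
def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

clean = [r for r in merged if 0 <= r["age"] <= 120]
r_age_spend = pearson([r["age"] for r in clean], [r["spend"] for r in clean])

print("missing from stage 2:", missing_ids, "drop rate:", drop_rate)
print("extremes by age:", youngest, oldest)
print("segment counts:", dict(segment_counts))
print("out of bounds:", out_of_bounds)
print("corr(age, spend):", round(r_age_spend, 3))
```

Sorting by each variable in turn is what surfaces the planted age of 430 here. Whether a value like that is an error or a real signal is still something you have to decide by looking at the case itself.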

Obviously there is more, but if I discipline myself to do these basics before moving on to building predictive models, I save myself a lot of grief. The danger, as the post points out, is that easy-to-use tools for the model-building stages of analysis tempt one to forgo much of that foundational work and go straight to the fun of building models.



