
Maybe one problem is that some projects naively start at "fit a model to whatever data we've got" when they would be better off starting from a statistical experimental design perspective and thinking about exactly how the data would need to be collected: https://en.wikipedia.org/wiki/Design_of_experiments

Here's one detailed anecdote of how projects can fail:

> Imagine you want to design some algorithm to detect cancer. You get data of healthy and sick people; you train your algorithm; it works fine giving you high accuracy and you conclude that you’re ready for a successful career in medical diagnostics.

> Not so fast …

> Many things could go wrong. In particular, the distributions that you work with for training and those in the wild might differ considerably. This happened to an unfortunate startup I had the opportunity to consult for many years ago. They were developing a blood test for a disease that affects mainly older men and they’d managed to obtain a fair amount of blood samples from patients. It is considerably more difficult, though, to obtain blood samples from healthy men (mainly for ethical reasons). To compensate for that, they asked a large number of students on campus to donate blood and they performed their test. Then they asked me whether I could help them build a classifier to detect the disease. I told them that it would be very easy to distinguish between both datasets with probably near perfect accuracy. After all, the test subjects differed in age, hormone level, physical activity, diet, alcohol consumption, and many more factors unrelated to the disease. This was unlikely to be the case with real patients: Their sampling procedure had caused an extreme case of covariate shift that couldn’t be corrected by conventional means. In other words, training and test data were so different that nothing useful could be done and they had wasted significant amounts of money.

-- https://blog.smola.org/post/4110255196/real-simple-covariate...
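
To make the failure mode in that anecdote concrete, here is a toy simulation. It is only a sketch under made-up assumptions: age stands in for every nuisance variable, the age distributions are invented, and the "classifier" is a single threshold rather than anything the startup actually built. It shows how a model can separate patients from students almost perfectly and still be useless on the population it is meant for.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable

def sample_ages(n, mean, sd):
    """Draw n ages from a normal distribution (invented numbers)."""
    return [rng.gauss(mean, sd) for _ in range(n)]

# Training data as collected in the anecdote:
# sick = older patients, "healthy" = students on campus.
train = ([(age, 1) for age in sample_ages(500, 70, 6)] +
         [(age, 0) for age in sample_ages(500, 22, 3)])

# Deployment population: older men, some sick and some healthy,
# with the same age distribution in both groups.
deploy = ([(age, 1) for age in sample_ages(500, 70, 6)] +
          [(age, 0) for age in sample_ages(500, 70, 6)])

def fit_threshold(data):
    """Toy classifier: midpoint between the two class means."""
    pos = [x for x, y in data if y == 1]
    neg = [x for x, y in data if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

def accuracy(data, threshold):
    """Fraction of samples where 'age above threshold' matches the label."""
    return sum((x > threshold) == bool(y) for x, y in data) / len(data)

threshold = fit_threshold(train)
train_acc = accuracy(train, threshold)
deploy_acc = accuracy(deploy, threshold)

print(f"threshold age:       {threshold:.1f}")
print(f"training accuracy:   {train_acc:.3f}")
print(f"deployment accuracy: {deploy_acc:.3f}")
```

The model has learned "old vs. young", not "sick vs. healthy", so on the deployment population it labels nearly everyone as sick and does no better than a coin flip.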




True, training and test data usually differ, but we still can't say it was impossible. Perhaps in this particular case the training data and the variables used were simply not sufficient.



