So, essentially, this is a contest to build a model that predicts which patients are most at risk of going back to the hospital.

While this sounds nice, there are some issues.

1. How can this do anything but hurt people? Medical professionals already do all they can to keep people from returning to the hospital, explaining to patients what they should be doing medically. The only real use I can see is to deny insurance or raise rates on "high risk" people.

2. Should they implement the winning solution and then act on it, by sending additional "how to be healthy" propaganda or otherwise trying to keep those people out of the hospital, the patterns of behavior will change accordingly, likely breaking the model's predictive capability.

This is not like the Netflix "present better suggestions" problem; it does not need to be as fast, efficient, or creative. Taking a large set of statistics from the dataset (which seems rather small) and building a large Bayesian network to crunch out the probability of needing medical care in a given time frame seems like the best solution to the problem.
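
To make that concrete, here is a toy sketch of the Bayesian approach using naive Bayes, the simplest possible Bayesian network. Everything in it is invented for illustration: the feature names, the records, and the labels are made up, not drawn from the contest data.

    # Toy sketch: naive Bayes (a degenerate one-layer Bayesian network)
    # over binary patient features. Feature names and records are
    # hypothetical; the real contest data is far richer than this.
    from collections import defaultdict

    records = [
        # (features, readmitted_within_year)
        ({"diabetic": 1, "over_65": 1, "prior_admissions": 1}, 1),
        ({"diabetic": 0, "over_65": 1, "prior_admissions": 0}, 0),
        ({"diabetic": 1, "over_65": 0, "prior_admissions": 1}, 1),
        ({"diabetic": 0, "over_65": 0, "prior_admissions": 0}, 0),
    ]

    def train(records):
        # Count positive feature values separately for each label.
        counts = {0: defaultdict(int), 1: defaultdict(int)}
        totals = {0: 0, 1: 0}
        for feats, label in records:
            totals[label] += 1
            for name, value in feats.items():
                counts[label][name] += value
        return counts, totals

    def p_readmit(feats, counts, totals):
        # P(readmit | features) via Bayes' rule with Laplace smoothing.
        scores = {}
        for label in (0, 1):
            p = (totals[label] + 1) / (sum(totals.values()) + 2)
            for name, value in feats.items():
                p_feat = (counts[label][name] + 1) / (totals[label] + 2)
                p *= p_feat if value else (1 - p_feat)
            scores[label] = p
        return scores[1] / (scores[0] + scores[1])

    counts, totals = train(records)
    print(p_readmit({"diabetic": 1, "over_65": 1, "prior_admissions": 0},
                    counts, totals))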

I am interested in seeing other views on these points. Heavens, I might learn something about a field I am a dilettante in from a master. (Ironically, that is more the goal than being "right" is.)




You should read the New Yorker story http://www.newyorker.com/reporting/2011/01/24/110124fa_fact_... . It mostly answers your questions. For (1), if the health insurer is obligated to treat those patients and knows who they are, it can spend a bit of money on preventive and follow-up care and save a lot on hospitalizations, surgeries, etc. For (2), this is true, but if the algorithm is retrainable (and it should be, as it's machine learning), it may be that a bit of domain adaptation is all it takes to keep things going; if that doesn't work, another contest 5 years from now will probably pay for itself.
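
As an illustration of what "a bit of domain adaptation" might look like, here is a hedged sketch of one standard recipe, covariate-shift reweighting. The data, features, and split into "old" and "new" cohorts are all synthetic assumptions; a real contest entry might do something entirely different.

    # Covariate-shift reweighting sketch: retrain the old model on old
    # data, but weight each example by how much it resembles the new,
    # post-intervention population.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    X_old = rng.normal(0.0, 1.0, size=(500, 5))  # pre-intervention patients
    y_old = (X_old[:, 0] + X_old[:, 1] > 0).astype(int)
    X_new = rng.normal(0.5, 1.0, size=(500, 5))  # shifted patient mix

    # 1. Train a classifier to tell old patients from new ones.
    domain = LogisticRegression().fit(
        np.vstack([X_old, X_new]),
        np.r_[np.zeros(len(X_old)), np.ones(len(X_new))])

    # 2. Weight each old example by P(new)/P(old): examples that look
    #    like the new cohort count more during retraining.
    p_new = domain.predict_proba(X_old)[:, 1]
    weights = p_new / (1.0 - p_new)

    # 3. Retrain the readmission model on the reweighted old data.
    model = LogisticRegression().fit(X_old, y_old, sample_weight=weights)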

The problem with your proposed solution is precisely that there seem to be far too few data points and far too many variables. Not only that, but I expect most of the information to be in the interactions between variables, and in clever features that capture them. Most ways of learning Bayesian networks don't work very well when you have to model interactions. I'd bet on the usual winning approaches for this sort of thing: clever boosting, matrix decomposition, and random forests, all of which can model interactions and deal somewhat with incomplete data.
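
For illustration, here is a minimal sketch of the boosting approach with modern tooling, scikit-learn's HistGradientBoostingClassifier (available in recent versions), which learns feature interactions via trees and accepts NaN for missing entries directly. The synthetic data merely stands in for real claims records.

    # Gradient-boosted trees on data with an interaction effect and
    # missing entries; the dataset here is synthetic, for illustration.
    import numpy as np
    from sklearn.ensemble import HistGradientBoostingClassifier
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(1)
    X = rng.normal(size=(1000, 10))
    # Label depends on an interaction between two features, the kind of
    # structure a flat linear or naive-Bayes model misses.
    y = ((X[:, 0] * X[:, 1]) > 0).astype(int)
    # Knock out 20% of entries to mimic incomplete records.
    X[rng.random(X.shape) < 0.2] = np.nan

    clf = HistGradientBoostingClassifier(max_iter=200)
    print(cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean())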



