
So knowledge like: Did the passenger have kids on board? Was the passenger nobility? Was the passenger travelling first class? Where was the passenger located on the ship after boarding? And how do these factors influence survivability?

And reality like: The actual sinking of the Titanic?

If your model concludes that a passenger who was nobility, travelled first class, was close to the exits, and had no family on board had a higher chance of surviving, is that fancy nonsense or a false belief?

You make a really strange case for your view.


Correlation does not imply causation. There were many more relevant but "invisible" variables, probably related to genetic factors: the ability to withstand exposure to cold water, the ability to calm oneself down and avoid panic, self-control in general, a strong survival instinct to literally fight others, etc. The variables you described, except the passenger's age, are visible but irrelevant. And pure luck must carry far more weight, and it is obviously related to those favorable genetic factors, age, health and fitness.


This challenge is not about causal inference. I do agree it is more of a toy dataset, to get started with the basics, and that there are a lot of other variables that go into survivability. But to say these variables, except for age, are irrelevant is mathematically unsound: You can show with cross-validation and test set performance that your model using these variables generalizes (around 0.80 ROC AUC). You can do statistical/information-theoretic tests that show that the majority of these variables carry a significant signal for predicting the target.
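
For concreteness, here is a minimal sketch of what that cross-validation claim looks like, assuming the standard Kaggle train.csv with its usual columns (Survived, Pclass, Sex, Age, SibSp, Parch, Fare, Embarked); the file path, preprocessing and model choice here are illustrative, not anyone's actual solution:

  # Minimal sketch: 5-fold cross-validated ROC AUC on the Kaggle Titanic
  # training set. Assumes a local copy of the standard train.csv.
  import pandas as pd
  from sklearn.compose import ColumnTransformer
  from sklearn.ensemble import GradientBoostingClassifier
  from sklearn.impute import SimpleImputer
  from sklearn.model_selection import cross_val_score
  from sklearn.pipeline import Pipeline
  from sklearn.preprocessing import OneHotEncoder

  df = pd.read_csv("train.csv")  # hypothetical path to the Kaggle file
  y = df["Survived"]
  X = df[["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]]

  # Impute the most frequent level, then one-hot encode the categoricals.
  categorical = Pipeline([
      ("impute", SimpleImputer(strategy="most_frequent")),
      ("onehot", OneHotEncoder(handle_unknown="ignore")),
  ])
  preprocess = ColumnTransformer([
      ("cat", categorical, ["Pclass", "Sex", "Embarked"]),
      # Median-impute the numeric columns (Age has missing values).
      ("num", SimpleImputer(strategy="median"), ["Age", "SibSp", "Parch", "Fare"]),
  ])

  model = Pipeline([
      ("prep", preprocess),
      ("clf", GradientBoostingClassifier(random_state=0)),
  ])

  scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
  print(f"ROC AUC: {scores.mean():.2f} +/- {scores.std():.2f}")  # typically ~0.8

The point is only that the visible variables generalize to held-out folds; it says nothing about causation.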

In real life it is also very rare to have free pickings of the variables you want. Some variables have to be substituted with available ones.

The Titanic story is just there to make things interesting for beginners. One could leave out all the semantics of this challenge, anonymize the variables and the target, and still use this dataset to learn about going from a table of variables to a target. In fact, doing so teaches you to leave your human bias at the door. Domain experts get beaten on Kaggle, because they think they need other variables, or that some variables (and their interactions) can't possibly work.

Let the data and evaluation metric do the talking.


>> Domain experts get beaten on Kaggle, because they think they need other variables, or that some variables (and their interactions) can't possibly work.

That sounds a bit iffy. A domain expert should really know what they're talking about, or they're not a domain expert. If the real deal gets beaten on Kaggle it must mean that Kaggle is wrong, not the domain expert.

Not that domain experts are infallible, but if it's a systematic occurrence then the problem is with the data used on Kaggle, not with the knowledge of the experts.

I mean, the whole point of scientific training and research is to have domain experts who know their shit, know what I mean?


The people who win Kaggle competitions are consistently machine learning experts, not domain experts.

Notably: https://www.kaggle.com/c/MerckActivity

> Since our goal was to demonstrate the power of our models, we did no feature engineering and only minimal preprocessing. The only preprocessing we did was occasionally, for some models, to log-transform each individual input feature/covariate. Whenever possible, we prefer to learn features rather than engineer them. This preference probably gives us a disadvantage relative to other Kaggle competitors who have more practice doing effective feature engineering. In this case, however, it worked out well.

http://blog.kaggle.com/2012/11/01/deep-learning-how-i-did-it...

> Q: Do you have any prior experience or domain knowledge that helped you succeed in this competition? A: In fact, no. It was a very good opportunity to learn about image processing.

http://blog.kaggle.com/2016/09/15/draper-satellite-image-chr...

> Do you have any prior experience or domain knowledge that helped you succeed in this competition? I didn't have any knowledge about this domain. The topic is quite new and I couldn't find any papers related to this problem, most probably because there are not public datasets.

http://blog.kaggle.com/2015/09/16/icdm-winners-interview-3rd...

> Do you have any prior experience or domain knowledge that helped you succeed in this competition? M: I have worked in companies that sold items that looked like tubes, but nothing really relevant for the competition. J: Well, I have a basic understanding of what a tube is. L: Not a clue. G: No.

http://blog.kaggle.com/2015/09/22/caterpillar-winners-interv...

> We had no domain knowledge, so we could only go on the information provided by the organizers (well honestly that and Wikipedia). It turned out to be enough though. Robert says it cannot happen again, so we’re currently in the process of hiring a marine biologist ;).

http://blog.kaggle.com/2016/01/29/noaa-right-whale-recogniti...

> Through Kaggle and my current job as a research scientist I’ve learnt lots of interesting things about various application domains, but simultaneously I’ve regularly been surprised by how domain expertise often takes a backseat. If enough data is available, it seems that you actually need to know very little about a problem domain to build effective models, nowadays. Of course it still helps to exploit any prior knowledge about the data that you may have (I’ve done some work on taking advantage of rotational symmetry in convnets myself), but it’s not as crucial to getting decent results as it once was.

http://blog.kaggle.com/2016/08/29/from-kaggle-to-google-deep...

> Oh yes. Every time a new competition comes out, the experts say: "We've built a whole industry around this. We know the answers." And after a couple of weeks, they get blown out of the water.

http://www.slate.com/articles/health_and_science/new_scienti...

Competitions have been won without even looking at the data. Data scientists/machine learners are in the business of automating things -- so why should domain knowledge be any different?

OK, sure, it can help, but it is not necessary, and it can even hamper your progress: you are searching only where you think the answer is -- thousands are searching everywhere and finding more than you, the expert, can.


How does this not violate [1]? That is, this seems specifically anti-statistical. The best you can come up with on this is a predictive model that you then have to test on new events. In this case, that would likely mean new crashes.

[1] https://en.wikipedia.org/wiki/Testing_hypotheses_suggested_b...


Because we are not doing hypothesis testing; we are doing classification on a toy dataset. Sure, one could treat this as a forecasting challenge, but then one would need another Titanic sinking in roughly the same context, with the same features... That demand is as unreasonable as calling this modeling knowledge competition meaningless.

And if you see classification as a form of hypothesis testing, then cross-validation is a valid way of testing whether the hypothesis holds on unseen data.
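
As a sketch of what that can look like in practice: wrap a classifier in a label-permutation test, so the "hypothesis" that these variables carry signal is checked against shuffled labels on held-out folds. Again this assumes the standard Kaggle train.csv; the feature subset and model are illustrative:

  # Sketch: permutation test on cross-validated ROC AUC. Shuffling the labels
  # destroys any real association; if the AUC on the true labels stands far
  # above the shuffled-label AUCs, the "no signal" hypothesis is rejected.
  import pandas as pd
  from sklearn.linear_model import LogisticRegression
  from sklearn.model_selection import permutation_test_score
  from sklearn.pipeline import make_pipeline
  from sklearn.preprocessing import OneHotEncoder

  df = pd.read_csv("train.csv")  # hypothetical path to the Kaggle file
  y = df["Survived"]
  X = df[["Pclass", "Sex", "Embarked"]].fillna("missing").astype(str)

  model = make_pipeline(OneHotEncoder(handle_unknown="ignore"),
                        LogisticRegression(max_iter=1000))

  score, perm_scores, p_value = permutation_test_score(
      model, X, y, cv=5, scoring="roc_auc", n_permutations=100, random_state=0)
  print(f"ROC AUC = {score:.2f}, permutation p-value = {p_value:.3f}")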


I think that is the rub. With the goal just being to find some variables that correlate together, it is a neat project. But it is ultimately not indicative of predictive classification, if only because you do not have any independent samples to cross-validate with; all samples are from the same crash.

This would be like sampling all the coins from my pockets and thinking you could build a predictive model from year printed to value of the coin. You probably could for the change I carry, but it would not be a wise predictor.


You are right, but only in a very strict, not-fun, manner :). Even if we had more data on different boats sinking, the model would not be very useful: we don't sail on the Titanic anymore and all the icebergs have been charted. Still, if a cruise ship were to go down, I'd place a bet on ranking younger women of nobility traveling first class higher for survivability than old men with family traveling third class, wise predictor or no.

This dataset is more in line with what you are looking for: https://www.kaggle.com/saurograndi/airplane-crashes-since-19...


Makes sense. And yes, my points were meant in a purely pedantic manner. :)


> You can show with cross-validation and test set performance that your model using these variables generalizes (around 0.80 ROC AUC).

It shows only that the given set of variables (observable and inferred) can be used to build a model. The given data set is not descriptive, because it does not contain the more relevant hidden variables, so any predictions or inferences based on it are nothing but a story, a myth made from statistics and data.
