Lol. Predicting the survival of passengers on the Titanic is meaningless and misleading - there is literally no connection to reality, despite the framing of the task, which suggests one. There is absolutely nothing that could be predicted. It is just a simulation of an oversimplified model that describes nothing but an oversimplified view of a historical event. It is as meaningless as the ant simulator written by Rich Hickey to demonstrate features of Clojure - it has about that much connection to real ants.
If you work through the data, you'll find that women, children and first-class passengers had a higher survival rate than men with lower-class tickets[1].
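If anyone wants to check those rates themselves, here is a minimal sketch, assuming the standard Kaggle train.csv with its default Survived, Sex and Pclass columns (the file and column names are that dataset's conventions, not something stated in this thread):

```python
import pandas as pd

# Load the Kaggle Titanic training set (assumes the standard train.csv layout).
df = pd.read_csv("train.csv")

# Survival rate broken down by sex and ticket class.
print(df.groupby(["Sex", "Pclass"])["Survived"].agg(["mean", "count"]))

# Overall rates by sex alone.
print(df.groupby("Sex")["Survived"].mean())
```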
This matches exactly the stories of what happened: staff took first-class passengers to the lifeboats first, then women and children. Then they ran out of lifeboats.
So the data shows correlation, and eyewitness accounts show causation. That's close to the ideal combination: eyewitness accounts alone can be unreliable because we can't know how representative they are, and correlation alone doesn't show causation.
But the combination of them both is pretty much the best case for studying something which can't be replicated.
This is only one of many aspects of that event. The data reflects that the organized evacuation effort at the beginning was efficient.
But any attempt to frame it as a "prediction", an accurate model of the event, or an adequate description of reality is just nonsense.
Calling things by their proper names (precise use of language) is the foundation of the scientific method. This is a mere oversimplified, non-descriptive toy model of one aspect of a historical event, built from statistics of a partially observable environment. A few inferred correlations reflect that there was not total chaos but some systematic activity. No doubt about it. But it is absolutely unscientific to say anything more about the toy model, let alone to claim that any predictions based on it have any connection to reality.
There is a clear correlation between gender and survival rates. Given the data, a decent prior would absolutely take that into account.
Yes, there are other factors. But the foundation of statistical models is simplification, and descriptive statistics are an important part of that.
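To make that prior concrete, here is a rough sketch of the simplest sex-based rule as a baseline classifier, again assuming the standard Kaggle train.csv; I'm computing the accuracy rather than quoting a figure:

```python
import pandas as pd

df = pd.read_csv("train.csv")  # standard Kaggle Titanic training set

# The simplest prior that uses the sex/survival correlation:
# predict that women survive and men do not.
pred = (df["Sex"] == "female").astype(int)

accuracy = (pred == df["Survived"]).mean()
print(f"sex-only baseline accuracy: {accuracy:.3f}")
```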
In any case, it isn't exactly clear that there are magical hidden factors which predicted survival. It appears you may be unfamiliar with the event, because basically those who got into a lifeboat survived, and those who didn't, didn't.
To quote Wikipedia:
Almost all those who jumped or fell into the water drowned within minutes due to the effects of hypothermia. ... The disaster caused widespread outrage over the lack of lifeboats, lax regulations, and the unequal treatment of the three passenger classes during the evacuation. ... The thoroughness of the muster was heavily dependent on the class of the passengers; the first-class stewards were in charge of only a few cabins, while those responsible for the second- and third-class passengers had to manage large numbers of people. The first-class stewards provided hands-on assistance, helping their charges to get dressed and bringing them out onto the deck. With far more people to deal with, the second- and third-class stewards mostly confined their efforts to throwing open doors and telling passengers to put on lifebelts and come up top. In third class, passengers were largely left to their own devices after being informed of the need to come on deck.
Even more tellingly:
The two officers interpreted the "women and children" evacuation order differently; Murdoch took it to mean women and children first, while Lightoller took it to mean women and children only. Lightoller lowered lifeboats with empty seats if there were no women and children waiting to board, while Murdoch allowed a limited number of men to board if all the nearby women and children had embarked.
All this behavior matches exactly what the model tells us about the event.
I'd be very interested if you can point to something specific that is wrong about it.
I think you are making the point for him. If you look at the predictive models people make on these, they make a big deal about your sex and status being the main indicators of who survived. The reality is that the main causal indicator for survival was access to a lifeboat.
Now, it so happens that lifeboat access correlated heavily with class, though not as much as with sex. And while there were some places where being male hurt your chances (as you point out with the one officer not allowing men on boats), by and large these factors were secondary - they correlated with survival rather than driving it.
The Titanic's sister ship (the Britannic) was torpedoed during WW1 and sunk. However, the lesson of the Titanic (too few lifeboats) had been learnt, and only 26 people died.
I don't know what point you are trying to make - yes, I agree that history never repeats, but lessons can be learnt from it, and they can be quantified and they can be useful.
>> The Titanic's sister ship (the Britannic) was torpedoed during WW1 and sunk. However, the lesson of the Titanic (too few lifeboats) had been learnt, and only 26 people died.
This happened because they made a _statistical_ model of the Titanic disaster, and learned from it? Like, they actually crunched the numbers and plotted a few curves etc, and then said "aha, we need more boats"?
I kind of doubt it, and if it wasn't the case then you can't very well talk about a "model", in this context. It's more like they had a theory of what factor most heavily affected survival and acted on it. But I'd be really surprised to find statistics played any role in this.
> This happened because they made a _statistical_ model of the Titanic disaster, and learned from it?
No - statistics as the discipline that we think of today wasn't really around until the work of Gosset[1] and Fisher[2], which was done a few years after this.
I'm sure you noted that I was very careful with what I claimed: "the lesson of the Titanic (too few lifeboats) had been learnt".
These days we'd quantify the lesson with statistics. Then, they didn't have that tool.
Instead, we have testimony[3] relaying the same story:
"Just one question. Have you any notion as to which class the majority of passengers in your boat belonged? - (A.) I think they belonged mostly to the third or second. I could not recognise them when I saw them in the first class, and I should have known them if there were any prominent people.
(Q.) Most of them were in the boat when you came along? - (A.) No.
(Q.) You put them in? - (A.) No. Mr. Ismay tried to walk round and get a lot of women to come to our boat. He took them across to the starboard side then - our boat was standing - I stood by my boat a good ten minutes or a quarter of an hour.
(Q.) At that time did the women display a disinclination to enter the boat? - (A.) Yes."
So yes, I agree - it was a theory, which our modern modelling tools can show matched well with what the statistics showed happened.
My whole point is that this is very useful, unlike the OP who dismissed it as useless.
OK, tell me, please, what it is that you can predict? That some John Doe, holding a first-class ticket for a cabin next to the exit, would survive the collision of the next Titanic with a new iceberg? That being a woman gives you better chances of securing a seat in a lifeboat? What is the meaning of the word "predict" here?
There is a very important notion in Herbert A. Simon's book The Sciences of the Artificial: the behavior of an ant that is visible to an external observer (its tracks, if you wish) is due not to its supposed intelligence, but mostly to the obstacles in its environment.
Most of the models mimic and simulate (very naively) that observable behavior, not its origin.
When people cite "the map is not the territory", this is what they mean. A simulation is not even an experiment. It is merely an animation of a model - a cartoon.
It is swarm intelligence: How does the system keep finding successful paths in a changing environment? Can we take inspiration from this behavior to create better optimization algorithms?
Why not. I remember a paper which compares the behavior of foraging ants (the colony sends out more or fewer foragers according to the rate at which foragers return with food) to TCP's adjustment of the window size based on the data rate.
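I don't have the paper to hand, so the following is only a toy sketch of the analogy as I remember it, not the paper's actual model: treat the outgoing forager rate like TCP's congestion window, increasing it when foragers come back with food (the analogue of acks) and backing off when they don't.

```python
# Toy sketch of the ant-foraging / TCP analogy described above; illustrative only.

def forager_rate(returns_with_food, alpha=1.0, beta=0.5, rate=1.0):
    """Adjust the outgoing forager rate the way TCP adjusts its window:
    additive increase on success (a forager returns with food),
    multiplicative decrease on failure (a forager returns empty)."""
    history = []
    for success in returns_with_food:
        if success:
            rate += alpha                  # food coming back -> send out more foragers
        else:
            rate = max(rate * beta, 0.1)   # food drying up -> back off
        history.append(rate)
    return history

# Food is plentiful at first, then disappears, then comes back.
print(forager_rate([True] * 10 + [False] * 5 + [True] * 10))
```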
Simulations are not experiments. A simulation is an animation of formalized imagination, if you wish.
Because otherwise it should be called machine hallucinations?
The process of learning could be defined as the task of extracting relevant information (knowledge) about reality (a shared environment), not the mere accumulation of fancy nonsense or false beliefs.
So knowledge like: Did the passenger have kids on board? Was the passenger nobility? Was the passenger travelling first class? Where was the passenger located on the ship after boarding? And how do these factors influence survivability?
And reality like: The actual sinking of the Titanic?
If your model concludes that a passenger of nobility, traveling first class, close to the exits, without family, has a higher chance of surviving, is that fancy nonsense or a false belief?
Correlation does not imply causation. There were many more relevant but "invisible" variables, probably related to genetic factors: the ability to withstand exposure to cold water, the ability to calm oneself down and avoid panic (self-control in general), a survival instinct strong enough to literally fight others, etc. The variables you have described, except the age of a passenger, are visible but irrelevant. And pure luck must carry a far bigger weight, and it is obviously related to favorable genetic factors, age, health and fitness.
This challenge is not about causal inference. I do agree it is more of a toy dataset, to get started with the basics, and that there are a lot of other variables that go into survivability. But to say these variables, except for age, are irrelevant is mathematically unsound: you can show with cross-validation and test-set performance that a model using these variables generalizes (around 0.80 ROC AUC), and you can run statistical/information-theoretic tests showing that most of these variables carry significant signal for predicting the target.
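To make that concrete, here is roughly the kind of check I mean, sketched with scikit-learn on the Kaggle train.csv; the feature choices and the model are illustrative, and the exact score will wobble around the 0.80 mark depending on both:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import cross_val_score

df = pd.read_csv("train.csv")  # standard Kaggle Titanic training set

# Minimal preprocessing: encode sex, fill missing values, keep a handful of variables.
X = pd.DataFrame({
    "sex": (df["Sex"] == "female").astype(int),
    "pclass": df["Pclass"],
    "age": df["Age"].fillna(df["Age"].median()),
    "sibsp": df["SibSp"],
    "parch": df["Parch"],
    "fare": df["Fare"].fillna(df["Fare"].median()),
})
y = df["Survived"]

# Cross-validated ROC AUC: does a model built on these variables generalize?
auc = cross_val_score(RandomForestClassifier(n_estimators=200, random_state=0),
                      X, y, cv=5, scoring="roc_auc")
print("ROC AUC per fold:", auc.round(3), "mean:", round(auc.mean(), 3))

# Information-theoretic check: mutual information between each variable and the target.
mi = mutual_info_classif(X, y, random_state=0)
print(dict(zip(X.columns, mi.round(3))))
```

None of this says anything about causation, of course; it only shows that the variables carry usable signal.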
In real life it is also very rare to have free pick of the variables you want. Some variables have to be substituted with available ones.
The Titanic story is to make things interesting for beginners. One could leave out all the semantics of this challenge, anonymize the variables and the target, and still use this dataset to learn about going from a table with variables to a target. In fact, doing so teaches you to leave your human bias at the door. Domain experts get beaten on Kaggle, because they think they need other variables, or that some variables (and their interactions) can't possibly work.
Let the data and evaluation metric do the talking.
>> Domain experts get beaten on Kaggle, because they think they need other variables, or that some variables (and their interactions) can't possibly work.
That sounds a bit iffy. A domain expert should really know what they're talking about, or they're not a domain expert. If the real deal gets beaten on Kaggle it must mean that Kaggle is wrong, not the domain expert.
Not that domain experts are infallible, but if it's a systematic occurrence then the problem is with the data used on Kaggle, not with the knowledge of the experts.
I mean, the whole point of scientific training and research is to have domain experts who know their shit, know what I mean?
> Since our goal was to demonstrate the power of our models, we did no feature engineering and only minimal preprocessing. The only preprocessing we did was occasionally, for some models, to log-transform each individual input feature/covariate. Whenever possible, we prefer to learn features rather than engineer them. This preference probably gives us a disadvantage relative to other Kaggle competitors who have more practice doing effective feature engineering. In this case, however, it worked out well.
> Q: Do you have any prior experience or domain knowledge that helped you succeed in this competition? A: In fact, no. It was a very good opportunity to learn about image processing.
> Do you have any prior experience or domain knowledge that helped you succeed in this competition? I didn't have any knowledge about this domain. The topic is quite new and I couldn't find any papers related to this problem, most probably because there are not public datasets.
> Do you have any prior experience or domain knowledge that helped you succeed in this competition? M: I have worked in companies that sold items that looked like tubes, but nothing really relevant for the competition. J: Well, I have a basic understanding of what a tube is. L: Not a clue. G: No.
> We had no domain knowledge, so we could only go on the information provided by the organizers (well honestly that and Wikipedia). It turned out to be enough though. Robert says it cannot happen again, so we’re currently in the process of hiring a marine biologist ;).
> Through Kaggle and my current job as a research scientist I’ve learnt lots of interesting things about various application domains, but simultaneously I’ve regularly been surprised by how domain expertise often takes a backseat. If enough data is available, it seems that you actually need to know very little about a problem domain to build effective models, nowadays. Of course it still helps to exploit any prior knowledge about the data that you may have (I’ve done some work on taking advantage of rotational symmetry in convnets myself), but it’s not as crucial to getting decent results as it once was.
> Oh yes. Every time a new competition comes out, the experts say: "We've built a whole industry around this. We know the answers." And after a couple of weeks, they get blown out of the water.
Competitions have been won without even looking at the data. Data scientists/machine learners are in the business of automating things -- so why should domain knowledge be any different?
Ok, sure it can help, but it is not necessary, and can even hamper your progress: You are searching for where you think the answer is -- thousands are searching everywhere and finding more than you, the expert, can.
How does this not violate [1]? That is, this seems specifically anti-statistical. The best you can come up with on this is a predictive model that you then have to test on new events. In this case, that would likely mean new crashes.
Because we are not doing hypothesis testing, we are doing classification on a toy dataset. Sure, one could treat this as a forecasting challenge, but then one would need another Titanic sinking in roughly the same context, with the same features... That demand is as unreasonable as calling this modeling knowledge competition meaningless.
And if you see classification as a form of hypothesis testing, then cross-validation is a valid way of testing whether the hypothesis holds on unseen data.
I think that is the rub. With the goal just being to find some variables that correlate, it is a neat project. But it is ultimately not indicative of predictive classification, if only because you do not have any independent samples to cross-validate with, all samples being from the same crash.
This would be like sampling all the coins from my pockets and thinking you could build a predictive model of the year printed to the value of the coin. You probably could for the change I carry. Not a wise predictor, though.
You are right, but only in a very strict, not-fun manner :). Even if we had more data on different ships sinking, the model would not be very useful: we don't sail the Titanic anymore and we have charted all the icebergs. Still, if a cruise ship were to go down, I'd place a bet on ranking young women of nobility traveling first class higher for survivability than old men with family traveling third class, wise predictor or no.
> You can show with cross-validation and test set performance that your model using these variables generalizes (around 0.80 ROC AUC).
It shows only that the given set of variables (observable and inferred) can be used to build a model. The given dataset is not descriptive, because it does not contain the more relevant hidden variables, so any predictions or inferences based on it are nothing but a story, a myth made from statistics and data.