I see what you mean, and it's indeed quite likely that texts containing such hypothetical scenarios were included in the dataset.
Nonetheless, the implication is that the model was able to extract the conditional being expressed, recognize when its antecedent was in fact met (or at least asserted: "The queen died."), and then derive the entailed conclusion.
To me that demonstrates a reasoning capability, even if, for example, it memorized/encoded entire Quora threads in its weights (which seems unlikely).
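The inference pattern being attributed to the model here is essentially modus ponens applied to a stored conditional. A toy sketch of that pattern (purely illustrative; the rule table, its contents, and the `infer` function are hypothetical, not anything the model literally contains):

```python
# Toy illustration of modus ponens over an extracted conditional.
# The rule below stands in for knowledge like
# "If the queen dies, then Charles becomes king."
rules = {"the queen died": "Charles is now king"}

def infer(assertion: str):
    """Given an asserted fact, apply any matching conditional."""
    # Normalize the assertion before lookup (lowercase, drop the period).
    return rules.get(assertion.lower().strip("."))

print(infer("The queen died."))  # → Charles is now king
```

The point of the sketch is only the three-step structure: a conditional is stored, an assertion is matched against its antecedent, and the consequent is produced.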
If it looks like a duck, swims like a duck, and quacks like a duck, then it probably is a duck.