The short-term solution was a configuration change to force a 1s SYN retransmit timeout, even if 1s is a 99.9th percentile handshake RTT in this environment. In the longer term, we need to evaluate the timeouts in the most widely deployed systems, so that we can offer a calibrated set of completely safe / reasonable / aggressive values.
For example, one of the links in the post suggests that OS X has a 700ms SYN retransmit timeout. If that's true (I haven't tested it yet), and the same holds for iOS, then a 700ms timeout might be a pretty reasonable setting rather than an aggressive one, because if a piece of kit fails for iOS traffic, it's not going to remain in service for long ;-)
One of those cases where you can't do much more than follow the herd.
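To make the trade-off concrete, here is a rough sketch of how the candidate timeouts could be compared. It assumes the retransmit timer doubles after each unanswered SYN (the usual exponential backoff). The function names and the sample RTT list are made up for illustration and are not measurements from any real deployment.

```python
def syn_retransmit_schedule(initial_rto_ms, retries):
    """Times (ms after the first SYN) at which each retransmit fires,
    assuming the timeout doubles after every unanswered SYN."""
    times = []
    elapsed = 0
    rto = initial_rto_ms
    for _ in range(retries):
        elapsed += rto
        times.append(elapsed)
        rto *= 2
    return times


def spurious_fraction(handshake_rtts_ms, initial_rto_ms):
    """Fraction of handshakes whose RTT exceeds the initial timeout,
    i.e. handshakes that would trigger an unnecessary SYN retransmit."""
    slow = sum(1 for rtt in handshake_rtts_ms if rtt > initial_rto_ms)
    return slow / len(handshake_rtts_ms)


if __name__ == "__main__":
    # Hypothetical handshake RTT samples in milliseconds.
    sample_rtts_ms = [40, 120, 250, 800, 90, 1500, 60, 300]
    for rto_ms in (700, 1000):
        print(rto_ms,
              syn_retransmit_schedule(rto_ms, 3),
              spurious_fraction(sample_rtts_ms, rto_ms))
```

With a 1s initial timeout the first three retransmits fire at 1s, 3s and 7s. With 700ms they fire at 0.7s, 2.1s and 4.9s, at the cost of more handshakes crossing the threshold.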
I'm glad to learn that your first solution matched exactly what I assumed it would be. Had the solution been present in the article, I certainly wouldn't have commented that it's just "look how others on the internet are wrong."
Thank you for sharing these new details. With them, I feel much better about reading the article and discussing it. In my view, the approach to the solution as you've now stated it matches the concepts I've stood for in my comments here. I wish you luck in your endeavours.