I'm suspicious about the IP 169.150.221.147
My guess: there is a misconfigured bogon IP filter somewhere, and instead of 169.254.0.0/16 (RFC 3927) something like 169.0.0.0/8 is configured to be blocked on some firewall.
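Hypothetically, the difference could be a single fat-fingered prefix on a Linux-based firewall (an illustrative sketch, not anything actually seen in their config):

  # intended: drop only RFC 3927 link-local traffic
  iptables -A FORWARD -d 169.254.0.0/16 -j DROP
  # fat-fingered version: drops the whole of 169/8, which includes 169.150.221.147
  iptables -A FORWARD -d 169.0.0.0/8 -j DROP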
I once was a customer of an ISP that mistakenly blocked the whole 192.0.0.0/8 net, which caused some confusion, but they fixed it after I pointed it out.
But then why would the ICMP echo/reply (ping) be allowed through? And how are the initial SYN and the SYN/ACK reply getting through? It's only the final ACK that's (apparently) blocked.
ICMP (the protocol ping uses) is a totally separate protocol from TCP and UDP. Blocking ICMP can break a lot of things and offers no real benefit outside of a handful of specific edge cases.
BTW your assumption "a successful ICMP ping = TCP and UDP work" is an extremely common one that I too had before I was taught otherwise.
I did not assume. The comment to which I was responding suggested it was the destination IP that was the problem. Generally (but not always) an IP filter would be applied irrespective of protocol. I also pointed out that the initial SYN and reply SYN/ACK are getting through the hypothesized bogon filter and those are part of TCP. I don't think the bogon filter is a hypothesis that fits the evidence.
AWS doesn't decide or even care about this; customers configure security group rules for their own services. Nothing is allowed by default, so if you want ICMP you need to allow it. Most don't bother because it's not that helpful in a cloud environment (you can just monitor the TCP port instead and get similar information).
It's been a very long time since I've diagnosed something like this, but I've had problems in the past when the MTU is smaller than the default and ICMP is blocked (interfering with path MTU discovery). Often IPSec or some other tunneling was involved. The initial packets got through but as soon as a full packet was sent it was dropped.
EDIT - I've now scrolled down in HN and saw that this was ruled out.
Actually, I think you might still be right. Ping uses ICMP, which is almost never blocked in my experience, so a working ping doesn't tell you much. I learned that because early in my career I too assumed that a successful ping meant TCP and UDP also work.
Something similar happened to me not so long ago. One day a junior admin asked me to diagnose why SSH to a box started hanging. After a bit of diagnosis it became evident that TCP's 3-way handshake got through, and then it all stopped. No normal network behaves like this.
My answer was "some network admin is having fun with a middlebox; you will have to speak with them." They did, and the response that came back was "we are moving to a network with real security." Access was restored. It was a Palo Alto NGFW, as others here have mentioned.
IMHO the industry is looking for ways to move high-priced gear, and they convince someone with a sparrow problem, who barely knows how to handle an air rifle, to buy bazookas. A bit of collateral damage is to be expected...
And what does DNS have to do with packets being dropped? The name has already been resolved to an IP address at this point, and we're seeing a SYN and SYN+ACK, which tells me it's not a routing issue. The fact that it happens at the start of a TLS connection (Client Hello) makes me think it's some kind of web application firewall, reverse proxy, or other intervening firewall causing this.
My guess is the ticket either got some boilerplate response from L2 instead of actually going to a network engineer, or it did go to a network engineer who is connecting from a different network with different traffic management and doesn't see the issue.
At my old uni, L1 were paid students, L2 were paid staff, and L3 were the actual netops/sysadmins, so sometimes L2 would try to close something out that needed to be escalated.
In addition, they had resnet (residential network) and pronet (professional network), where the former was for student housing and the latter everything else. Resnet had more restrictions and traffic shaping such that pronet traffic was prioritized. Resnet wireless also had a different NAT setup, whereas resnet wired used public IPs with inbound traffic blocked. This led to all kinds of caveats, like online gaming via UPnP only working on wireless despite wired having public IPs.
Regardless of which network they connect from, I would expect a network engineer to know that if a TCP handshake with the web server (i.e. after the DNS lookup) fails at the third step, then it's not DNS. The fact that the TCP handshake began at all is evidence that DNS works.
Think I was able to reproduce it. I configured my router to drop established connections for IP 169.150.221.147 in the policy attached to my WAN interface for outgoing traffic (important detail: applying it inbound would drop the SYN/ACK instead). For reference, it's a Ubiquiti EdgeRouter that uses iptables to filter traffic.
In the linked picture [0] I have packet #436 selected; it's a retransmission of the handshake SYN/ACK with seq=0 ack=1, repeating a few more times later, same as the OP.
So, as others suggested, likely a misconfigured BOGON rule with 169.0.0.0/8, but one that for some reason also matches outbound established connections rather than new/any state.
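For anyone wanting to reproduce it, the rule boils down to something like this in raw iptables (a sketch; the WAN interface name is just an example):

  # drop only packets belonging to an already-established flow towards that IP, on WAN egress
  iptables -A FORWARD -o eth0 -d 169.150.221.147 -m conntrack --ctstate ESTABLISHED -j DROP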
Good find, that fits the symptoms perfectly and means it's more likely than not a problem with a firewall on the source end (the campus network). Did you email the author?
As a network engineer it piqued my interest (unemployment is boring), as there were no completely satisfying answers, though some were close. Thought it was the old MTU problem at first, but as it was the ACK of the handshake being retransmitted, that wasn't likely. So I tried a few things with my router.
This is how you get NOCs to help you quickly: give them not only the problem but the root cause as well. It's not that they (or I) are lazy; it's just that so many things can be a potential cause of a problem, especially when you only have incomplete information to go on.
I'm rather surprised that Berkeley Student Tech Services would keep people around who either don't know how DNS works, or who do know but make up excuses to dismiss a problem.
The problem really should be escalated and the nonsense answer pointed out, because if they care (and they should), they'll want to educate the person who gave that response.
You’d think that. But having spent time hiring for operations/support in higher ed, it's really hard to attract people with a quality foundation of knowledge.
Feels like some stateful device within someone's network is mishandling the connection state, like the author guesses.
It's interesting that your side thinks the three-way handshake worked, but the remote side continues to resend the [SYN, ACK] packets, as if they've never received the final [ACK] from you.
Had a hellish time troubleshooting a similar problem several years ago with F5 load balancers: there was a bug in the hashing implementation used to assign TCP flows to different CPUs. If you hit this bug (odds in the parts-per-thousand range), your connection would be assigned to a CPU with no record of that flow existing, so the connection would be alive but would no longer pass packets. It would take a long time for the local TCP stack to go through its exponential retries and finally decide to drop the connection and start over.
99% chance it's MTU size. Had this recently, specifically with TLS, due to large initial packets containing certificates. Results could even depend on the user agent: some fail, some work.
Try reducing the MTU on the client; 1280 is a good starting point.
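On Linux that's something like this (the interface name is just an example):

  # temporarily lower the interface MTU
  ip link set dev eth0 mtu 1280
  # then verify the path with do-not-fragment pings; 1252 = 1280 minus 28 bytes of IP+ICMP headers
  ping -M do -s 1252 169.150.221.147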
Indeed, it's right there in the packet capture screenshot. The ACK has payload length 0.
I've debugged a lot of TCP/IP issues over the years but this one has me scratching my head. The author has done reasonable troubleshooting: tried from different devices and operating systems, HTTP and HTTPS, over wired and WiFi, and to different destinations. The common denominator is the wired network.
It can't hurt to reduce the MTU, but I see nothing in the evidence presented that this is likely to be the cause.
I once had a destination firewall blocking packets from Linux but not OS X and it turned out to be that Linux was an early adopter of ECN and the destination firewall rejected any packets with the ECN bits set. I've also had frame relay networks with MTU limitations, NICs with corrupted checksums, overflowing NAT tables, asymmetric ARP tables, misconfigured netmasks, and stuff I'm sure I've forgotten.
But we don't know the full story for HTTP, as no capture was provided. Typically when you have an MTU issue you get stuck on the TLS handshake, as we do here for HTTPS; so if it were an MTU issue, the HTTP capture should show the 301 redirect coming back.
Agreed, my best guess is it's due to a smaller MTU between the CDN and your device. They are probably replying with a TLS Server Hello, which would typically max out a standard 1500-byte packet. It's also likely why HTTP isn't working either: they would ACK the connection and you could probably issue the GET /, but you would never get a response back, since the HTTP response payload is larger than a single packet.
A few ideas to test this theory:
1) Find an asset on their server that is smaller than 500-1000 bytes so the entire payload will fit in a packet. Maybe a HEAD would work?
2) Clamp your MSS for this IP to something much smaller, like 500, instead of the standard 1460. This should force the server to send smaller packets and works better in practice than changing your MTU. See: https://tldp.org/HOWTO/Adv-Routing-HOWTO/lartc.cookbook.mtu-...
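For option 2, on Linux the clamp can be done with something along these lines (a sketch; the MSS value is deliberately tiny for testing):

  # rewrite the MSS we advertise in our outgoing SYNs to this destination
  iptables -t mangle -A OUTPUT -d 169.150.221.147 -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --set-mss 500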
I believe this is relatively easy to test as I think you can gradually increase the size of the ICMP packet until it stops responding.
I have done something along those lines in the past but it was a long time ago.
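Something like a sweep of do-not-fragment pings usually finds the break point (sizes are just examples; 1472 = 1500 minus 28 bytes of IP+ICMP headers):

  for size in 1200 1300 1400 1450 1472; do ping -M do -c 1 -s $size 169.150.221.147; done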
Edit: on reading a few more comments, I think this is probably all wrong...
The TLS Client Hello is not that big (the client's FIN is at seq=518), and the server is only sending packets with seq=0. As others pointed out, this likely means the server that received the SYNs is not receiving the final ACK and data packets.
From what I can tell, the example IP is not broadly anycast. From my test hosts in Seattle, traceroute takes me through transit to San Jose, and then either
vl201.sjc-eq10-dist-1.cdn77.com or vl202.sjc-eq10-dist-1.cdn77.com and finally
169-150-221-147.bunnyinfra.net
I'm not sure how easy it is to run a TCP traceroute with different flags. But if the OP can run a traceroute with only the SYN flag, and again with only the ACK flag, that might be pretty interesting. I suspect this is an issue inside BunnyCDN's network where packets from this user/network with SYN go to one server host, and those with ACK go to another. Maybe there's an odd router somewhere routing these differently, but if they both make it to Bunny, they should both work.
With
$ traceroute --version
Modern traceroute for Linux, version 2.1.2
Copyright (c) 2016 Dmitry Butskoy, License: GPL v2 or any later
I can specify to do a traceroute with syn or ack with
traceroute 169.150.221.147 -p 443 -q 1 -T -O ack
or
traceroute 169.150.221.147 -p 443 -q 1 -T -O syn
Wrong answer about MTU below for posterity:
Yeah, that would be my bet too. Especially with the "after 60 seconds things start to work" detail: I think that's the timeout for Windows to do PMTU blackhole probing (which is painfully slow; iOS and I think macOS do it much sooner, and I think even Android has gotten around to doing it in a reasonable amount of time).
But if it's really only happening with BunnyCDN, it's possible that most of their routes are 1500-MTU clean (or have working path MTU discovery) and only the routes to BunnyCDN aren't. Of course, a lot of popular services intentionally drop their advertised MTU and allowed outbound MTU to work around the many broken networks out there, so "services X and Y work" doesn't really mean the path is clean.
ClientHello isn't that big, but the ServerHello in the reply can be quite large, and since TCP packets have the DF flag set, some middlebox may toss it if PMTUD didn't work correctly.
I had seen this exact issue with Fastly a few years ago.
Yeah, I expected a large ServerHello, but then I would expect the server to send Seq=[LargeNumber] packets. Often you'd get an ACK for the ClientHello, then a missed packet or several, then the final packet of the ServerHello, which is often small. Or at least an ACK from the resend of the ClientHello with a large sequence number.
I guess I've seen pmtud issues way too often in my life, and I just jumped ahead. :D
A raw packet capture would be useful to look deeper. Actually two: one of the IP in question and one of any other site, both from the problem source network. I would wager one of these things is not like the other, but I need the .cap files, as there is not enough information in the screenshot. The output of ss -emoian (as text, not a screenshot) would also be useful to grab just after the connections to both destinations are attempted.
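Something along these lines would do it (the interface name is just an example, and the second host is any known-good site for comparison):

  tcpdump -i eth0 -w bunny.pcap host 169.150.221.147
  tcpdump -i eth0 -w control.pcap host example.com
  ss -emoian > ss-output.txt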
My guess would be something related to your campus having more than one external connection available.
Maybe from the server's point of view the SYN and ACK are coming from distinct addresses and this is tripping them up?
I have two internet connections at home and would encounter some strange bugs whenever I used both connections at the same time. I never debugged these cases, but they always disappeared when I used just one connection and left the second as a backup.
I wouldn't be surprised if someone (your uni) is mistakenly blocking some 169.x.x.x traffic, since 169.254.0.0/16 is used for link-local IPs. Someone put the wrong subnet mask in a firewall rule or ACL someplace.
First off, the HTTP site 301s to the HTTPS site, so HTTPS is still the likely trigger.
Second, I see that whatever client he's using is specifying a very old TLS 1.0. If it's not MTU (which others have mentioned), then my guess would be a firewall with a policy specifying a minimum TLS version, dropping this connection on the floor.
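That's easy enough to probe from the affected network (the SNI placeholder is whatever hostname resolves to that IP for you):

  openssl s_client -connect 169.150.221.147:443 -servername <hostname> -tls1_2
  openssl s_client -connect 169.150.221.147:443 -servername <hostname> -tls1_3

If forcing 1.2 vs 1.3 changes whether the handshake completes, that would point at a TLS version policy somewhere on the path.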
Certainly weird that Wireshark shows TLSv1 while curl shows TLSv1.3. That shouldn't happen unless something interfered with the Client Hello (or the Wireshark version is outdated).
If a TLS handshake is aborted partway through, Wireshark will label it “TLSv1”. It actually retroactively labels the 1.0 TLS packets as 1.3 after a successful TLS 1.3 handshake finishes.
This makes sense because a TLSv1.3 handshake actually starts out labeled as 1.0 and only upgrades to 1.3 with (IIRC) the Server Hello response to the Client Hello.
The following links document this behavior, in case you or your organization’s security team is nervous TLSv1 is actually being used:
Oh, indeed, that's quite surprising. A TLSv1.3 Client Hello always contains the supported_versions extension, which should allow wireshark to label it correctly, regardless of whether or not the handshake actually finishes. Though, tbf, it does say TLSv1 and not TLSv1.0. I wonder how it would look had TLSv1.3 been named TLSv2.0 after all...
My guess is that your original SYN did not go to the target, but was redirected somewhere close by. I'd look at the TTL value in the IP header of your first SYN-ACK, and play with such things as traceroute.
Such redirection is often done on a specific port basis, so that trying to access different ports might produce a different result, such as a RST packet coming back from port 1234 with a different TTL than port 443.
There is so much cheating going on with Internet routing that the TTL is usually the first thing I check, to make sure things are what they claim to be.
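A quick way to see those TTLs without firing up Wireshark (the interface name is just an example):

  # -v prints the IP TTL; tcp[13] & 0x12 = 0x12 matches packets with both SYN and ACK set
  tcpdump -i eth0 -n -v 'src host 169.150.221.147 and tcp[13] & 0x12 = 0x12'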
Sounds to me as if they have a Palo Alto NGFW at the edge, filtering the traffic. UC Berkeley appears to be running a Palo Alto for at least part of their infrastructure.
Wow. What an embarrassing answer from "Berkeley Student Tech Services"...
That is on the same level as, e.g., the customer hotline at a phone company ("did you try turning it off and on again?"). I would have thought that Berkeley, of all universities, would have higher standards than that.
It's not like they know anything about the Internet there or ever created anything for it that's still in use... ;)
It's indeed sad how more and more unis outsource all their IT, as if they've become too stupid to manage the tech they created. A friend of mine just told me how his old college is currently moving their email to Google, and they're also looking to move all the web hosting somewhere else. What's next, having the whole network managed by Comcast? Paying per connected device?
The symptoms match my experience with a mid-network firewall/router that isn't aware of TCP window scaling and strips out the scaling factor while leaving the window scaling option enabled. See https://lwn.net/Articles/92727/
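If someone on the affected wired network wants to test that theory, disabling window scaling on the client is a one-liner on Linux (remember to turn it back on afterwards):

  sysctl -w net.ipv4.tcp_window_scaling=0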
Maybe it's not a network fault at all - it might be a purposeful action taken by a network device (IPS, web filter, etc.) that is killing the connection based on some rule set.
It's possible, but the way the connection is blocked is surprising. If you were blocking based on an IP, you'd just drop the first SYN and the client would never receive the SYN/ACK. If you were blocking based on the SNI, you'd wait for the first TLS Client Hello, but in that case packets are dropped before the Client Hello is even sent.
I mean, yes, that was my instinctive response based on just the title. It's always the MTU. But in this case the packet that's being lost is a pure ACK.
No. To repeat: the client is sending a pure ACK with no data. You can see that packet with a 0-byte payload length in the screenshot. That packet is also getting lost; if it weren't getting lost, the server wouldn't keep retransmitting the SYN/ACK.
Lots of good things to investigate already in the thread. I would throw in the potential for an anycast routing issue. TCP is stateful, and if there is asymmetric routing, maybe the packets are coming from one anycast device but the returning packets are being routed to a different one.
I would suspect some of the other responses first, but if they don't help, this could be a possibility if they are using anycast.
I don't think the IP shared is anycast. All of my personal test nodes are Seattle based, and they all see the same basic path to the IP that was shared; transit to San Jose, then two hops in BunnyCDN's network. Additionally, I get a different IP when I lookup the test hostname, that traces to Seattle.
It does feel like maybe a different server/network path getting the SYN+ACK vs the ACK, but probably in BunnyCDN's equipment --- but maybe something weird in Berkeley's (wired) network causes weird behavior for BunnyCDN? Hard to really know without pcaps from both ends, which are hard to get. Something funky in the load balancer seems like a good guess to me.
Some 10 years back I was working for a solar company doing SCADA stuff (monitoring remote power plant equipment, reporting generation metrics, handling grid interconnect stuff, etc).
We had a big room with lots of monitors that looked like a set in a Hollywood film, no doubt inspired by them. You could see all the solar installations around the world that we monitored. The monitoring crew put out a call for engineers, stat, and as I walked into the monitoring room I could see perhaps a tenth of the power plant icons on the wall were red with "lost communication"; one plant went from green to red right in front of me.
This started a shitstorm, with all hands summoned. Long story short, somebody had decided the best way to get the external IP for one of our remote gateways was a curl command to a whatismyip.com-type service, but instead of targeting Google (or, you know, a server under our control), it hit some random ISP in Italy. The ISP must have eventually realized they were getting hit by thousands of devices 24/7, so they decided to silently drop some percentage of incoming requests, and of course the curl call was blocking with no timeout. When the remote gateway's request was dropped, it blocked indefinitely.
I skipped a lot in between, but it was definitely a fun firefighting session. It was particularly hampered by a couple of engineers quite high up the food chain getting led in the wrong direction (as to the root cause) at the beginning and fighting particularly hard against any opposing theories. It was the one time I basically got to drop the "I'm right and I bet my job on it." Fun times.
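For anyone tempted to reuse that pattern: at minimum bound the call, e.g. something like the following, and of course point it at an IP-echo endpoint you actually control (the URL here is only a placeholder):

  curl --connect-timeout 5 --max-time 10 https://<your-own-ip-echo-endpoint>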
Are we ruling out content filtering? Any content filter that filters HTTPS without SSL decryption is going to look at the SNI, which is sent in the clear in the Client Hello.
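A quick way to check whether the Client Hello (and its SNI) even leaves the machine (the interface name is just an example):

  tshark -i eth0 -Y 'tls.handshake.type == 1 and ip.addr == 169.150.221.147'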