Investigating the impact of HTTP3 on network latency for search (dropbox.tech)
154 points by kiyanwang on May 22, 2023 | 64 comments



Related tangent: highest possible recommendation for https://hpbn.co/ (High-Performance Browser Networking), an extraordinarily helpful book and website.


I second this recommendation, but note that this book predates and does not cover HTTP3/QUIC.


Good disclaimer. HPBN is still recommended reading (almost prereq?) before tackling H3/QUIC, though.


Didn't know about this book; it looks extremely interesting, thank you :).


This is a nice post, but it feels a little weird: while HTTP/3 gets them 10ms on average, they could get 30ms on average by reducing Asia latencies (15% of traffic, ~300ms) to North America latencies (~100ms), since 15% × (300ms − 100ms) = 30ms. I'm sure they have some internal constraint that makes that difficult, but it stuck out at me.


That might require them to put replicas of the search service in Asia, and operating a distributed system over distant geographies is a hard operational problem.


Not necessarily: they could terminate the initial connection close to the user and then run a new connection. While this sounds like it shouldn't help, because the second connection is fully under their control they can make it highly reliable and very fast. Additionally, any recovery from dropped packets (on the user's home network) only has the latency of an in-region connection.

This also sets them up for running a light cache in-region, and potentially scaling that up to handle more kinds of API calls.
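A minimal sketch of the idea (Python asyncio, hypothetical host name): terminate the user's TCP connection at a regional edge box and relay the bytes over your own connection back to the US origin, so retransmits on the lossy last mile only pay the short in-region RTT.

    import asyncio

    ORIGIN = ("search-origin.example.com", 443)   # hypothetical far-away origin

    async def pump(reader, writer):
        # Copy bytes in one direction until EOF, then close our side.
        while data := await reader.read(65536):
            writer.write(data)
            await writer.drain()
        writer.close()

    async def handle_client(client_r, client_w):
        origin_r, origin_w = await asyncio.open_connection(*ORIGIN)
        await asyncio.gather(pump(client_r, origin_w),
                             pump(origin_r, client_w),
                             return_exceptions=True)

    async def main():
        server = await asyncio.start_server(handle_client, "0.0.0.0", 8443)
        await server.serve_forever()

    asyncio.run(main())

A real deployment would terminate TLS at the edge and keep pooled, warmed-up connections back to the origin, but the latency argument is the same.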


Is this another way of saying "put a regional proxy there", or do you mean something more?


A regional proxy should be the same thing, yes. The important thing is that it's not a replica.


Yes, I'm sure they've considered doing that, as it seems so obvious. I would also think that any one account's usage is very region-specific, so could they actually keep that account's data near the user? It would be interesting to hear how they ended up at the compromise they have.


Or they could place the to-be-searched data closer to their users and get rid of traffic to North America entirely for most users.


It would be nice to see the 99th percentile, the 99.9th percentile, and the 99.99th percentile.

My browser has made over 100,000 network requests so far today, and I expect yours has too. I'm sure at least some of those will be in the 99.99th percentile.


Very interesting reading, thanks. I just wonder why there are these geographic differences from one region to another.


Speed of light. It takes time for a packet to reach North America from Asia.
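Back of the envelope, assuming roughly 10,000 km of trans-Pacific fiber and light travelling at about 2/3 of c in glass (both are rough assumptions, not measured figures):

    distance_km = 10_000          # rough East Asia <-> US West Coast fiber path
    fiber_km_per_s = 200_000      # ~2/3 of c in glass
    one_way_ms = distance_km / fiber_km_per_s * 1000   # ~50 ms
    rtt_ms = 2 * one_way_ms                            # ~100 ms floor

Real paths add routing, queuing, and any extra round trips on top of that.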


RTT to the US servers will be most of it, IMHO.


Note that some of the users on the worst networks are also users that won't be able to use HTTP3 due to corporate firewall restrictions that block UDP and downgrade TLS.

Source: I've worked on network protocols for video conferencing and have seen my share of corporate network horrors.


Hopefully, as HTTP3 picks up steam, this will stop happening.

But yeah, there's no amount of innovation that can't be undone by a motivated adversary. ;-)


One thing I'm wondering about: the HTTP3 path here seems to be from the end user to search, which is a general serving problem and not really about search itself. Do Dropbox clients call the search service directly?


I have wondered recently about head-of-line blocking's impact on latency at the AZ level within AWS. How reliable are TCP connections within the same datacenter, and do these issues noticeably impact performance at the tail?


Maybe there are some answers to that here

https://aws.amazon.com/blogs/hpc/in-the-search-for-performan...

where AWS explains why TCP isn't good enough for them and why they are developing a replacement.


It would be interesting to see similar experiments published from big edge networks like Cloudflare and Fastly. Do they have something similar published?


Just wondering: what HTTP3 solutions are there on the Linux side? Can IPVS be HTTP3 compatible at all?


Why not? As long as you can send and receive UDP traffic, there's no reason why you can't do http/3.

Caddy and HAProxy both support the protocol. nginx doesn't have it in the public stable version yet, though there are preview packages available for modern server platforms.


HTTP/3/QUIC supports migrating connections between two networks, such as when a user switches from Wi-Fi to LTE. IPVS or any plain UDP load balancer won't handle this scenario properly, since it doesn't introspect the QUIC header and load balance based on the QUIC connection ID. That connection ID is what keeps the connection stable when the device needs to switch networks. If operators have any sort of load balancer (like IPVS) between the client and the point where the HTTP/3 connection is terminated, they will need to ensure that it has proper QUIC support. One example is Katran[1], which supports this method of load balancing.
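To make that concrete, here's a minimal Python sketch (not Katran's actual code) of connection-ID-aware routing, assuming QUIC v1 long headers and self-issued 8-byte connection IDs for short-header packets:

    # Route by hashing the Destination Connection ID so a client that
    # migrates networks keeps landing on the same backend.
    import hashlib

    def backend_for_packet(payload: bytes, backends: list) -> str:
        first_byte = payload[0]
        if first_byte & 0x80:
            # Long header (Initial/Handshake): 1 flag byte, 4 version bytes,
            # then a 1-byte DCID length followed by the DCID itself.
            dcid_len = payload[5]
            dcid = payload[6:6 + dcid_len]
        else:
            # Short header: the DCID length is implicit, so this only works
            # if we issued fixed-length (here 8-byte) connection IDs ourselves.
            dcid = payload[1:9]
        h = int.from_bytes(hashlib.sha256(dcid).digest()[:8], "big")
        return backends[h % len(backends)]

A 5-tuple hash would break the moment the client's source address changes; hashing the connection ID doesn't.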

[1] https://github.com/facebookincubator/katran


Any other L4 OSS around with support for HTTP/3 besides Katran?

I've tried to use it, but it was a pain :).


Not that I am aware of.


> nginx doesn't have it in the public stable version yet

It's near, roadmapped for 1.25:

https://trac.nginx.org/nginx/roadmap


Yeah. I should test it; it would be interesting to see whether DR (direct routing) works with UDP balancing (via IPVS).


Can we use HTTP3 today? Is it widely supported?


Yes. Google and Facebook have deployed HTTP/3 across their properties for quite some time now. Support in popular servers is not mature but if you're using a CDN like Cloudflare, it's trivial to enable HTTP/3 for your users.


Nginx 1.25 has HTTP/3 in its roadmap, first cut from that branch is scheduled to land tomorrow:

https://trac.nginx.org/nginx/roadmap


We use it on our rinky-dink ecommerce site, and it represents 40% of requests at the moment. HTTP/1 and HTTP/2 are at 30% each. Our users skew a bit older, and definitely aren't techno-savvy (one scrolled past all the products and used the contact us page to ask for a printed product catalogue and order form...).


HTTP/1 is surprising, what browser is that coming from?


Can't tell, sorry, the cache service provider doesn't supply that info.


Wondering the same: how's compatibility with very restrictive networks? Say a university/corporate network that only allows port 443 and DPI'd port 80 egress. QUIC won't pass through that since it's UDP. Can the server and client reliably negotiate a fallback to HTTP/2 in such a case?


Yes. Browsers typically initiate a TLS (TCP) connection to port 443 for HTTP1/2, and then upgrade to HTTP/3 based on the Alt-Svc header.
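A quick way to see the advertisement (stdlib-only Python sketch): the first request goes over TCP, and the Alt-Svc response header is what tells the client it may retry over HTTP/3 on UDP 443.

    import urllib.request

    resp = urllib.request.urlopen("https://cloudflare-quic.com")
    print(resp.headers.get("Alt-Svc"))   # e.g. h3=":443"; ma=86400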


This. There are also clients that, with a little config, will cache the support level per host, and even let you provide a list of hosts for which the initial request should race TCP and QUIC.

https://developer.android.com/guide/topics/connectivity/cron...


What if your browser receives an Alt-Svc header and switches to HTTP/3 on one network (say, mobile data), but then you switch to a restrictive WiFi that has UDP disabled, all without restarting your browser, i.e. within one "session" of your HTTP client? Wouldn't you start having connectivity issues that would be hard to troubleshoot? In that scenario, having HTTP3 disabled is beneficial.


Changing network interfaces breaks TCP connections and forces a new handshake; the browser session works at a different layer and doesn't prevent that.

QUIC actually lets you migrate a connection between networks (because packets are identified by a connection ID in each UDP packet rather than by the 5-tuple). Clients will also typically re-test QUIC reachability occasionally and downgrade as needed, which is what makes this work.


HTTP/1.1 isn’t going anywhere in any popular server, so yeah, HTTP/2 fallback isn’t a problem. (HTTP/1.1 is basically required even if you just want to serve an https redirect on port 80, since h2 requires TLS. The cleartext h2c protocol has no adoption AFAICT.)


Yep; just like IPv4 hasn't gone anywhere, HTTP/1 and H2 aren't going anywhere anytime soon.


HTTP2 is probably the first one to go or be replaced. You can live without it, or upgrade to whatever the next thing is.


nit: it's HTTP/1.1; HTTP/1.0 is long gone.


Sadly, 1.0 is _still_ around. Rare, but it happens.

I remember being gobsmacked several years ago that F5 LBs only supported HTTP/1.0 health checks. Doing HTTP/1.1 health checks required writing one specifically for it, something the community had sorted out a long time ago.


I'm pretty sure I saw HTTP/1.0 in a PLC's status page recently.


IPv4 is not going away because it's valuable (as in, limited supply creates a market). HTTP/1.1 is already unusable on the modern web, as you'll be instantly blocked by Cloudflare and the gang. HTTP2 is likely to follow in the near future, as replacing it will be much easier than replacing HTTP/1.1.


Adoption was also slow because there were lots of legacy switches and routers that did not support it. Upgrades took a long time.


> IPv4 is not going away because it's valuable (as in limited supply creates a market).

That's what I've been telling my bros about crypto.


I'm not sure you understand. IP scarcity means IPv4 addresses are more valuable in the web-automation context. IPv6 availability will make it harder to "price" web automation -> more bots online -> more captchas and privacy invasions. That's a real challenge that is hard to solve, despite your snarky comments.


The IPv4 address scarcity means that bots cause a lot more collateral damage, because if your neighbor runs a bot from home, you'll most likely get banned too.

Yeah, there's value in convenience of being able to block misbehaving IPv4 addresses. There's also value in burning CPU cycles to produce a transaction on a blockchain.

Make of it what you will.


It remains missing from Safari. It used to be offered as an experimental feature, but I don't seem to have it anymore on Safari 16.5.


Safari added support for HTTP/3 in version 14, released in September 2020. That’s why it isn’t listed as an experimental feature any more.

https://developer.apple.com/documentation/safari-release-not...

You can check with Cloudflare’s test page here:

https://cloudflare-quic.com

Alternatively, open up developer tools, go to the network panel, right-click on the headings and enable the protocol field. It will tell you which protocol was used for each request made by your browser. Obviously the server also needs to support HTTP/3 for a request to use HTTP/3.


Another user here commented that it's been enabled randomly for a subset of users. But not for me. And there's nothing on the web I've found about how to enable it. It was once on the Experimental menu, but no longer.


It is supported and randomly enabled for ~50% of Safari 16 users, according to https://github.com/Fyrd/caniuse/pull/6664#issuecomment-14934...


You can enable http/3 just fine even if you also serve clients that are falling behind on web standards, as long as you also serve http/2 as a fallback.

Safari not supporting it isn't a problem unless your business only and specifically targets Safari, and even then it's a relatively painless way to help the few customers who run preview versions get a nicer experience.


IIRC Chrome has whitelisted some domains, but requires a flag to enable it generally.


QUIC is enabled by default for all domains across most major browsers: https://caniuse.com/http3

For Cloudflare, HTTP/3 accounts for 28% of traffic (https://radar.cloudflare.com/)


While HTTP3 does provide some improvements, it's clear that further optimizations are needed, particularly for users in Europe and Asia.

One potential solution that hasn't been discussed much here is leveraging AWS managed services like AWS CloudFront or AWS Global Accelerator. It's worth noting that Dropbox's website already uses AWS CloudFront, so they are already leveraging AWS in some capacity. Based on my cost calculations, using AWS CloudFront would cost around $40k a month, while AWS Global Accelerator would be around $22k a month.

As of August 22, 2022, AWS CloudFront supports terminating HTTP/3 in a Point of Presence (POP), which could potentially help with the latency issues Dropbox is facing. AWS Global Accelerator, on the other hand, is designed to improve the performance of applications by terminating UDP/TCP as close to users as possible then routing user traffic through the AWS global network infrastructure. This could help reduce latency by ensuring that user traffic is routed through the most optimal path, even if the user is located far from a Dropbox data center.

It's hard to estimate what the potential latency reduction of using e.g. AWS Global Accelerator would be, especially at higher percentiles. However, using https://speedtest.globalaccelerator.aws/, and assuming symmetry, my connections to Asia show 35-40% lower latency.

Of course, there are trade-offs to consider when using managed services like AWS CloudFront and AWS Global Accelerator. While they can provide significant performance improvements, they also come with additional costs and potential vendor lock-in. However, given the scale of Dropbox's operations and the importance of providing a fast, reliable search experience for their users, it may be worth exploring these options further.

---

Cost estimates

Assumptions:

    1. Dropbox's peak traffic is 1,500 queries per second (QPS).
    2. Average data transfer per query is 100 KB.
    3. 50% of the traffic comes from North America, 25% from Europe, and 15% from Asia (remaining 10% from other regions, for pricing purposes put it into Asia's calculations).
---

1) AWS CloudFront cost estimation:

    Data transfer:
    - North America: 1,500 QPS * 0.5 * 100 KB * 60 seconds * 60 minutes * 24 hours * 30 days = 194.4 TB  
    - Europe: 1,500 QPS * 0.25 * 100 KB * 60 seconds * 60 minutes * 24 hours * 30 days = 97.2 TB  
    - Asia: 1,500 QPS * 0.25 * 100 KB * 60 seconds * 60 minutes * 24 hours * 30 days = 97.2 TB

    Data transfer cost:
    - North America: 194.4 TB * $0.085/GB = $16,524
    - Europe: 97.2 TB * $0.085/GB = $8,262
    - Asia: 97.2 TB * $0.120/GB = $11,664

    Total data transfer cost: $16,524 + $8,262 + $11,664 = $36,450

    HTTP requests:
    Total requests: 1,500 QPS * 60 seconds * 60 minutes * 24 hours * 30 days = 3,888,000,000

    Using the updated AWS CloudFront pricing (as of May 22, 2023):
    - HTTP requests cost: 3,888,000,000 * $0.0075/10,000 = $2,916
Total estimated monthly cost for AWS CloudFront: $36,450 (data transfer) + $2,916 (HTTP requests) ≈ $40k

---

2) AWS Global Accelerator cost estimation:

    Data transfer:
    - Total data transfer: 194.4 TB (NA) + 97.2 TB (EU) + 97.2 TB (Asia) = 388.8 TB

    Using the updated AWS Global Accelerator pricing (as of May 22, 2023):
    - Data transfer cost (averaged across regions): 388.8 TB * $0.035/GB = $13,608
    - (Also need to add EC2 egress cost: 388.8 TB * $0.02/GB = $7,776)
    - Total data transfer cost = $21,384

    Accelerator:
    - Assuming 1 accelerator with 2 endpoints (1 for HTTP/2 and 1 for HTTP/3)
    - Accelerator cost: 1 * $18/accelerator/day * 30 days = $540
Total estimated monthly cost for AWS Global Accelerator: $21,384 (data transfer) + $540 (accelerator) = $22k
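Here's the same arithmetic as a small Python sketch, in case anyone wants to tweak the inputs (the prices, QPS, and traffic split are the assumptions above, not authoritative AWS figures):

    QPS = 1_500
    KB_PER_QUERY = 100
    SECONDS_PER_MONTH = 60 * 60 * 24 * 30

    monthly_gb = QPS * KB_PER_QUERY * SECONDS_PER_MONTH / 1_000_000   # ~388,800 GB
    split = {"NA": 0.50, "EU": 0.25, "Asia": 0.25}   # remaining 10% folded into Asia

    # CloudFront: per-GB egress by region plus a per-10k-request charge
    cf_price = {"NA": 0.085, "EU": 0.085, "Asia": 0.120}
    cf_transfer = sum(monthly_gb * share * cf_price[r] for r, share in split.items())
    cf_requests = QPS * SECONDS_PER_MONTH / 10_000 * 0.0075
    print(f"CloudFront: ~${cf_transfer + cf_requests:,.0f}/month")        # ~ $39k

    # Global Accelerator: per-GB premium + EC2 egress + fixed accelerator fee
    ga_transfer = monthly_gb * (0.035 + 0.02)
    ga_fixed = 18 * 30
    print(f"Global Accelerator: ~${ga_transfer + ga_fixed:,.0f}/month")   # ~ $22k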


Former Dropbox employee, just correcting one assumption:

> Dropbox's peak traffic is 1,500 queries per second (QPS).

I can't speak to search QPS directly, but most individual serving hosts for file sync/retrieval were receiving tens of thousands of QPS. The overall edge QPS peaked at several hundred thousand every day across all the hosts. So I'd guess that even just search is an order of magnitude higher than 1,500 :)


I misread the article when it stated:

> Traffic regularly exceeded 1,500 queries per second (QPS) at peak times

It may be quite cost prohibitive to use a managed service like AWS Global Accelerator.


tl;dr: HTTP3 provides a little improvement, but optimizing search speed is much more important in this case.


Although HTTP3 was a significant win for people with lossy connections and high latency.


like mobile


> tl;dr: HTTP3 provides a little improvement, but optimizing search speed is much more important in this case.

The article itself shows otherwise:

https://dropbox.tech/frontend/investigating-the-impact-of-ht...

    HTTP3 vs. HTTP2 | North and Central America | Europe         | Asia
    p25             | -3.20ms / -6%             | -2.34ms / -2%  | -3.73ms / -2%
    p50             | -4.21ms / -5%             | -3.84ms / -3%  | -5.12ms / -2%
    p75             | -9.03ms / -8%             | -11.1ms / -6%  | -15.0ms / -4%
    p90             | -44.9ms / -17%            | -47.3ms / -13% | -77.3ms / -14%
    p95             | -118ms / -22%             | -141ms / -21%  | -200ms / -22%

While the average-case improvement was negligible, the tail cases fared significantly better because HTTP3 removes TCP's head-of-line blocking.



