Shared Cache Is Going Away (jefftk.com)
381 points by luu on Nov 3, 2019 | 146 comments



I always felt like this was mostly a dream anyway due to the diversity of libraries, versions, and CDNs across the web. Everything would have to line up perfectly, within the TTL, to get the performance advantage of loading from cache. And even then it was only really an advantage on the first page load of a site visit; subsequent page loads would hit the cache anyway from the first page.

And speaking of privacy... if everyone across the web is loading resources from one CDN, that seems like an interesting stream of data for that CDN.


I think it may have had more benefit in the jQuery heyday. Things are much more fragmented now.


Absolutely. And resources like disk space and bandwidth have gotten much cheaper in the 13 years since jQuery was invented. Fewer cache hits, lower cache value, and less cost savings all point in the direction of retiring this feature.


You say this, but people who have shit internet are acutely aware of how CDNs no longer help things from the user's perspective, beyond the mere "CDNs are better at delivering some assets than Joe Website is."

It doesn't help that, relative to everything else, the churn in websites is immense, making it more likely that you'll have to pull things in again. And "relative to everything else" is quite a statement, since churn in software generally is pervasive.

EDIT: that is, I'm just complaining, not claiming the status quo (or what was before) was better, obviously.


That’s what I found as well: every time I measured, the benefits were much lower than hoped, and especially on mobile, where you wanted it the most, the cache sizes were small enough that they churned frequently. Way back in the day, the low per-host connection limits were a consideration, but that era is firmly dead.

The other side was that people notice slow performance more than fast, and the failure modes were always worse than the savings when some fraction of connections would take, say, 2 seconds to connect to Google’s CDN even though their time to yours was much better. You don’t have an easy option for those slow clients hitting your property but you can at least reduce the number of dependencies to that one service.


One way to prevent a CDN from seeing every request for a shared resource could be a new attribute that allows serving the resource locally if the resource is not in the shared cache.

For example, <script src="/jquery-3.4.1.min.js" try-shared="https://code.jquery.com/jquery-3.4.1.min.js" try-shared="another-src">

@src can be locally hosted. The browser checks each @try-shared URL against the shared cache (without actually loading the resource from the CDN); if none matches, the browser downloads @src from your own domain.

Of course, this doesn't solve the Shared Cache issue raised by the article. I suppose the only way to solve that would be to add resources to the shared cache explicitly. The most effective way (I assume) would be a header provided by the CDN of a shared resource, e.g. X-Shared-Cache: true, that a browser would recognize... Then @src/@try-shared could still get the benefits of the shared cache and developers wouldn't have to worry about it.


If everyone would do it like that, the file would never be pulled from the CDN at all (since a cache miss pulls from the local src), therefore the try-shared attribute would never be useful.


Sure, if everyone decides to host their own files, a CDN would never be used. Although, there are benefits to using a CDN, so it's likely the try-shared value could exist and would be useful.

Note the proposed scheme doesn't prevent anyone from using a CDN as the src while also trying other CDNs to increase the chances of a cache hit, which has its own benefits.

For example, <script src="https://code.jquery.com/jquery-3.4.1.min.js" try-shared="https://ajax.googleapis.com/ajax/libs/jquery/3.4.1/jquery.mi... try-shared="another-src">


The better way to do this is to specify the vendored location and hash of the resource.

That way, you can safely leverage any version regardless of its downloaded location.

Content digests are already used in the `integrity` HTML attribute; the same digest could be used as a cache key too.

It still has the version fragmentation problem, but you don't have to worry about picking a popular CDN.
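
For reference, this is roughly what the existing integrity mechanism looks like; the idea would be to key the cache on the digest rather than the URL (the hash below is a placeholder, not the real digest of that file):

  <script src="https://code.jquery.com/jquery-3.4.1.min.js"
          integrity="sha384-PLACEHOLDER_DIGEST_OF_THE_EXACT_FILE"
          crossorigin="anonymous"></script>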


> if everyone across the web is loading resources from one CDN, that seems like an interesting stream of data for that CDN

And if big brother isn't balls deep in CloudFlare I'll eat my hat.


You don't even need to be considering outside/government intervention for this to be concerning. How many sites rely on JS from the Google JS CDN? (https://developers.google.com/speed/libraries/)

For the vast majority of people, the negative effects of Google tracking them is probably more concerning than the government tracking them.


Ironically the change described in this article will give the CDN more information as it now receives a request for each site and not just the first.


For many websites, first load is all that matters. If the user is hooked, they'll wait for subsequent loads.


If your perspective is only that of a developer (which is fine, this is a developers' website), then in fact it doesn't matter at all, because the whole space has this experience, so no one actor is at a disadvantage anymore unless you, like, self-host on your home PC.

For users on the other hand...


> I'm sad about this change from a general web performance perspective and from the perspective of someone who really likes small independent sites, but I don't see a way to get the performance benefits without the [privacy] leaks

Maybe I'm missing something, but the obvious solution to me would be more cache-control headers.

The only notable case where a shared cache is useful is resources on public CDNs hosting libraries and other common assets. These could just send a "cache-control: shared" header, or "cache-sharing: true" if adding new values to existing headers breaks too many existing implementations. This puts them in a shared cache; everything else gets a segmented cache.
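
A sketch of what the CDN's response might look like under that scheme (the Cache-Sharing header is the hypothetical opt-in; the Cache-Control line is just ordinary practice today):

  Cache-Control: public, max-age=31536000, immutable
  Cache-Sharing: true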


I think the page that loads the resources would itself need the 'cache-sharing' header, since other websites could still perform a timing attack if it loads a CDN asset that specifies 'cache-sharing: true'. Even then, enabling cache sharing would still leave you open to a timing attack, and the effectiveness of a shared resource cache would dwindle as fewer and fewer sites share that cache.


If Google Fonts serves Roboto with cache-sharing true that is unlikely to leak any data. Sure, you can detect that I at some point visited some site that uses Roboto, but that's vague enough to be useless.

There is some potential for leakage with uncommon assets. Maybe only a handful of websites use JQuery 1.2.65 or Helvetiroma Slab in font weight 100. It's a less severe vector than just testing if someforum.example/admin.css is cached, but still it's leaking data. The CDN could mitigate that by only sending a cache-sharable header on sufficiently popular assets, but depending on others going out of their way to preserve privacy is probably a bad idea.


If a website uses 10 common assets, that's often an uncommon combination. And if you have 100 websites on your "targets list" (let's say, fetish websites, or LGBT communities) then you could get a positive match on some of them.


The ten common assets have to be uniquely uncommon for this to be a risk. Tinymodeltrains.com might have a distinct combination of ten assets, but if my browser caches two of them from my visit to reddit, three from hackernews, two more from imgur, and the last three from pornhub, your tracking data will be meaningless.


Not entirely meaningless; it's kind of like a Bloom filter. False positives exist, but false negatives are unlikely. Combined with other data in the style of the Panopticlick, one can obtain a target set to which to apply closer scrutiny.


Then maybe have the browser enforce "common assets only" by tracking how many unique first party websites use a particular asset and only sharing the cache if the number of such sites is sufficiently high. Though I suppose that would reduce the effectiveness of the cache.


Many websites these days don't link 10 different files, they glob them at build time with gulp or whatever the latest hotness is, and serve one lib.js file instead.


The attack could still work in some cases, and is as described in the linked post.

Your webmail provider has a search box, and the content that is returned is styled with Roboto. If the search finds nothing, then Roboto isn't loaded. The attacker forces Roboto out of the cache with a specially formatted fetch() request, then loads an iframe of the search. Then the attacker checks if Roboto is in the cache or not. This allows the attacker to essentially read your email inbox.

https://sirdarckcat.blogspot.com/2019/03/http-cache-cross-si...
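
A minimal sketch of the cache-probing half of such an attack, assuming a plain timing check (the font URL and the 20 ms threshold are illustrative; the linked write-up describes more reliable tricks than raw timing):

  // Guess whether a cross-origin resource is already in the HTTP cache.
  async function probablyCached(url) {
    const t0 = performance.now();
    // no-cors gives an opaque response; we only care how long it took.
    // force-cache answers from cache when possible, otherwise hits the network.
    await fetch(url, { mode: 'no-cors', cache: 'force-cache' });
    return performance.now() - t0 < 20; // cache hits come back almost instantly
  }

  // After loading the search iframe, probe the font the results page would use
  // (hypothetical URL):
  probablyCached('https://fonts.example/roboto.woff2')
    .then(hit => console.log(hit ? 'search had results' : 'no results'));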


I think it would be fine to let the CDN mark the common shared resource as "Caching: shared" as an opt-in, and also allow the including page to override with another header as an opt-out. If you are including shared CDN resources on a sensitive page, you are already doing it wrong. The CDN could already control its header to only send the opt-in for very commonly used resources, in order to avoid fingerprinting based on less common ones.


This is a wonderful idea. You could also opt-in client-side with a shared attribute:

<script shared src="//jquery.com/jquery.js"></script>


Hypothetically, there's nothing to stop some useragents from "blessing" certain libraries as common enough to justify shipping them with the useragent and satisfying requests for them locally. That could leak details on useragent, but none that shouldn't be available from the http headers anyway.

But I suspect this will be unnecessary; even the bandwidth-constrained use case is getting more bandwidth every year.


> Even the bandwidth-constrained use case is getting more bandwidth every year.

In my experience, websites are piling on bullshit faster than my mobile internet is getting faster.


You can try this out with an add-on like Decentraleyes.



This wouldn’t work for Safari, because Safari is trying to defend against data being transferred between even cooperating sites, as part of their anti-tracking work.


I think a lot of people are unclear on the threat model here. If I have it correct, there's no way around it: either you live with the privacy leak, or you disable the shared cache.

The threat is that when you navigate to a creepy website, it loads some library and tracks the timing. They use that to infer that you've accessed some resource from a sensitive site.

None of the workarounds with extra attributes are going to help, because they rely on the web developer to

1. know about the attack

2. know that some library or asset is a realistic candidate for the attack, and take appropriate action.

Neither one is that realistic. We developers are just too lazy to get stuff like that right, even if we know about it. Cargo culting is the rule.

As for the effects, I suspect this will have a modest effect on the average website. The sources I've encountered seem to cast doubt on the effectiveness of the shared cache (https://justinblank.com/notebooks/browsercacheeffectiveness....). I poked around the mod_pagespeed docs and project, and couldn't find any indication of how they'd measured impacts when they implemented the canonicalization feature.

I wonder if you'll see a big impact on companies like Squarespace and Wix, where there are a lot of custom domains that are all built using the same stack.


Off the top of my head I can think of several ways to compromise on this by making shared caching opt-in.

One way is for the requester to specify if the asset is shared. A new 'shared' attribute on html tags and XMLHttpRequest would do this. Browsers enforce cache isolation _unless_ the shared attribute is set, in which case it comes from a 'shared' cache.

So if the attacker requests www.forum.example/moderators/header.css from the _shared_ cache, but the forum software itself didn't specify it was shared (so it never got loaded into the shared cache), then nothing is leaked.

And as it would only make sense to opt to share stuff like jquery.js from a CDN, the forum wouldn't naturally share that css file and so on.

The other approach is for the response to specify sharing, e.g. new cache control headers. Only the big CDNs would bother to return these new headers, and most programmers wouldn't have to change anything to regain the speedup they just lost from going to isolated caches once the CDNs catch up and return the header.

In either case, sharing can _still_ be an information channel if the shared resource is sufficiently rare, e.g. the forum admin page is pretty much the only software stuck on version x.y of lib z. The attacker can see if it's in the cache, and infer whether the victim is a logged-in admin or not. Etc.
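
A minimal sketch of the two opt-in shapes described above (the shared attribute, the XHR flag, and the header directive are all hypothetical):

  <!-- Requester-side opt-in: -->
  <script shared src="https://code.jquery.com/jquery-3.4.1.min.js"></script>

  // ...and a hypothetical flag for script-initiated requests:
  var xhr = new XMLHttpRequest();
  xhr.sharedCache = true; // hypothetical property
  xhr.open('GET', 'https://code.jquery.com/jquery-3.4.1.min.js');
  xhr.send();

  Response-side opt-in from the CDN (hypothetical directive):
  Cache-Control: public, max-age=31536000, shared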


I think the trouble with both of these plans is that it shifts cognitive load to a lot of people who aren't expert in the topic. How many people would put "shared" on something because it sounds good, or is the default in a template? And even if they don't, how many brain-hours do we have to burn on people understanding the complexity of an optimization that probably doesn't make much difference to the average website?


Isn't the bigger problem that the developer then chooses for the user whether to leak information or not?


If the enemy is the developer then you've already lost. It's not like cache sharing is how a developer chooses to unmask your anonymity when browsing between sites; they have cookies to do that in much better ways.

A long time ago PHK wrote some very salient comments about HTTP 2.0 efforts https://varnish-cache.org/docs/trunk/phk/http20.html https://queue.acm.org/detail.cfm?id=2716278 etc. He puts forward the case for a browser-picked client-session-id instead of a server-supplied cookie.


> If the enemy is the developer then you've already lost.

It's not that the developer is the enemy.

Pretend I create a website called "Democratic Underground: how to foster democracy under a repressive regime." I'm naive, or I want it to load quickly, or I accidentally include a framework that does one of those two -- either way, some library versions end up in the shared cache.

Now, the EvilGov includes cache-detection scripting on its "pay your taxes here" webpage. Despite my salutary goals, shared caching leaks some subset of my readers to the government.


Yes, precisely this. You can't let web sites "opt in" to tracking users. That's the exact wrong threat model to be using here.


I don’t think it does. I think it shifts the load to CDN maintainers. Which is fine because we just gave them a task to do that avoids obsolescence.

Browsers have always allowed cross-domain requests, which have been tolerated until now but have required all of us to be aware of XSS and CSRF issues, or to suffer the consequences.

Removing shared cache is the beginning of the end for cross domain requests by default. The other obvious use these days is ad networks, but they also get used for integrations like SSO and shared services like Apple Pay and presumably PayPal? And other collaborations between companies.

But those could also be opt in.


What if:

a) the origin sharing the resource must have a .well-known/static_resource file in place.

b) The presence of .well-known/static_resource prevents any request on this origin from sending cookies, and any Set-Cookie header is ignored.

c) The document that includes the resource on this sharing origin must use subresource integrity attributes when loading the shared resource.

d) the resource cannot be cached unless the cache-control header is public and has a lifetime of at least 1 hour.

This guarantees that the resource is always requested cookieless, and that the resource can't vary per request, otherwise the subresource integrity check would fail.
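
A sketch of what the including document might look like under this scheme (cdn.example and the digest are placeholders; only the subresource integrity part exists today):

  <!-- cdn.example hosts /.well-known/static_resource, so (hypothetically)
       requests to it carry no cookies and any Set-Cookie is ignored. -->
  <script src="https://cdn.example/jquery-3.4.1.min.js"
          integrity="sha384-PLACEHOLDER_DIGEST"
          crossorigin="anonymous"></script>

  And per (d), the CDN's response must be publicly cacheable for at least an hour:
  Cache-Control: public, max-age=3600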


Why .well-known/ instead of HTTP headers (possibly on a HEAD request beforehand, like CORS)?


To ensure the entire origin had the same policy. Perhaps that's unnecessary though.


Having the server be explicit about an asset being OK to share via cache thwarts the attack described in the link. And that's not too much for the CDNs to enable.


From Google Chrome design doc: https://docs.google.com/document/d/1U5zqfaJCFj_URrAmSxJ0C7z0...

> "early experimental results in canary/dev channels show the cache hit rate drops by about 4% but changes to first contentful paint aren’t statistically significant and the overall fraction of bytes loaded from the cache only drops from 39.1% to 37.8%."

What about exceptions for loading common JS libraries from a shared CDN? I'm looking at the Google Chrome design doc and don't see how one gets around this. Maybe I'm just missing something, but if not, it seems like they need to dig more into perf from the perspective of the slower end of the distribution; it could make a big difference.


I too find their performance numbers hard to believe... More digging required I think!


After reading the other comments I think I was probably wrong. There are so many more choices of libraries and versions nowadays that the chance of a cache hit on a CDN has decreased.


I would think fonts specifically would take a large hit to first contentful paint, given that tons of sites load a few common fonts.


Why does Chrome (or maybe other browsers) not just keep a local cache of the most used Google web fonts by default? Seems like an easy performance win if they're used on lots of sites…


Good. I've been advocating for this since publishing a history-leaking attack on Chrome's shared bytecode cache, which also doesn't rely on the network (CVE-2019-13684 - see page 8 of [0]). Would also like to see this applied to visited link state eventually. Shared state between origins inevitably leads to information leaks.

[0] https://www.spinda.net/papers/smith-2018-revisited.pdf


I've wondered about visited link states for a while, and I could easily see them getting focused on soon as well.


Is it just me, or was shared caching not on its way out already? I mean, it was great when every website had jQuery on it, but with the proliferation of new JavaScript libraries, the chance of getting a shared cache hit must be getting smaller.

Besides, Webpack and similar bundlers with tree-shaking abilities make it practical to just load a subset of a large library.

And last (but certainly not least) there is the security angle. Imagine if someone managed to sneak malicious code onto CDNJS or Bootstrap CDN; how many nasty things might they be able to get up to, even if everyone remembered to set crossorigin="anonymous" on their shared assets?


Not all shared resources are JavaScript. There are fonts and others too. A shared cache for fonts makes sense to me even in 2019.


That is why SRI exists


It's not clear from the Chromium Design Document[1] whether resources loaded via Subresource Integrity (SRI) will have a shared cache or not. It's not explicitly mentioned, so it's probably best to assume it's not until someone has tested it.

[edit] The SRI spec github project has an issue for shared cache [2] that seems to be coming to the consensus that there will not be a shared cache for SRI:

> "it seems rather unlikely that we can ever consider a shared cache"

[1] https://docs.google.com/document/d/1XJMm89oyd4lJ-LDQuy0CudzB...

[2] https://github.com/w3c/webappsec-subresource-integrity/issue...


Why doesn't the browser just record the original request time for the resource and simulate the same download speed when a different domain requests it for the first time? Maybe even randomize the delay some.

Of course you get a false delay on first load, but it still saves network bandwidth while preventing information leakage.


That's clever, but it still sounds like a way that information could be leaked. Download the target resource, and then download it again concurrently with a known unique resource, and see if the timing changes, for example.

It's an arms race where the browser would ultimately have to simulate every consequence of actually downloading every resource over the slowest link in the network. You're making the problem (and its solution) more complex but not completely solving it.


Yes, I suppose adding a cache-breaker query string will trigger a true network download that can be compared.

Although if the network hasn't changed much, the true and simulated timings should be very similar, so how would you really know whether it's a real or simulated request?


> but still saves network bandwidth while still preventing information leakage.

If it saves network bandwidth, then you just have to measure the network bandwidth, like a speedtest page does. As Spectre and friends have shown, even the tiniest difference can be used for an information leak.


The speedtest pages work by measuring the time it takes to download a file. If the browser used a simulated time length, then the "bandwidth" will look the same.


The attack idea is that a site could check if a file is cached by downloading a test file at the same time and seeing if the test file download speed is affected.


I think you'd also need to be careful about freshness of cached data. If you give stale data but delay it to give the illusion it was just loaded, an attacker might be able to infer it actually was from the cache after all by looking at the data and cross-checking that against what the current data looks like.

Consider, as an example, an HTTP resource that contains a text string representing the current time and which is updated once a minute. Its cache lifetime is set to 1 minute. A page fetches it at 9:01:30AM and gets the string "9:01AM". This goes into the cache. At 9:02:15AM (45 seconds later), an attacker loads it, you give the cached data which is still the string "9:01AM". To cross-check the data, the attacker hits another server (say, its own proxy that it runs, which fetches the resource and forwards it on), so it can tell that the data it should have gotten is "9:02AM".

In other words, if you give it stale data slowly, it might be able to detect the staleness instead of trying to detect the slow loading time.

Perhaps you could fix this by validating the freshness of the cache, using ETags or something. You'd hit the server, validate what's in your cache is fresh, and then still delay it, thus giving a more complete illusion to the attacker.

I'm not sure if HTTP allows a page to access the TTL of cached data, but if so, you might want to fake that too. If you give the real TTL numbers, then some of the time it's going to look like you just loaded it but it's about to expire.


If you happen to download some resource on slow 3G, then all subsequent requests will be slow?


Only the first request from a different domain.


The article links to a reliable attack that doesn't need to test timing/bandwidth at all. It requests the resource from a page that has a very long URL, which causes the origin server to fail due to the Referer header exceeding its header length limit. Cached = OK 200, uncached = error 431.


In that case, the use of the shared cache doesn’t reduce latency, so you’re giving up a big part, arguably the majority, of the benefit of caching.

Where they come apart, I think it’s common to say latency is more important than bandwidth. It certainly is to me, though if you’re on a metered connection, you could certainly view things differently.


> I have Firefox 70.0.1 and it doesn't seem to be enabled.

It's behind a flag, browser.cache.cache_isolation: https://hg.mozilla.org/mozilla-central/file/tip/modules/libp...

Similarly Chrome has a bunch of feature flags, I'm not sure if they can be enabled from the UI: https://cs.chromium.org/chromium/src/net/base/features.h?typ...


chrome://flags/

(but I can't find this one yet)


Just a note: chrome://flags/ only has some of the flags; many must be passed manually. https://peter.sh/experiments/chromium-command-line-switches/

Still can't find this option as a flag though; it must be compile-time only.


I'm having difficulty determining how this impacts subdomains.

From what I can tell a.example.com, b.example.com, and example.com would all have their own caches, correct?

We have multiple (sub)domains a|b|c.xxx.example.com that share a template, and therefore resources (we're a .edu). If we're now looking at an initial load hit for all of them, that may impact how we've been setting up landing pages for campaigns.

I can't see us completely moving away from a CDN because of the other benefits they provide.


It depends on the implementation currently. Some implementations consider subdomains isolated and others don't. See Chrome's implementation experiments: https://docs.google.com/document/d/1U5zqfaJCFj_URrAmSxJ0C7z0...


I wonder how the dust will eventually settle when these happy, naive times of using shared caches for great performance gains are in the past, anywhere from CPUs (Meltdown, Spectre) to the web. Will we decide that the extra cost of security is not worth it in all but a few critical applications? Or will we accept it as a necessary tax?


It should be safe by default, with an option to disable security in favour of performance. Just like I can disable CPU patches now because I don't believe in their severity.


DNS?


>Unfortunately, a shared cache enables a privacy leak. Summary of the simplest version:
>
>- I want to know if you're a moderator on www.forum.example.
>- I know that only pages under www.forum.example/moderators/private/ load www.forum.example/moderators/header.css.
>- When you visit my page I load www.forum.example/moderators/header.css and see if it came from cache.

You would expect fewer requests to www.forum.example/moderators/private/ than to, for example, www.forum.example/public. If you look at caching from the server-load angle vis-à-vis security, then it could be inexpensive to not cache www.forum.example/moderators/header.css, so you would simply not allow browsers to cache this resource.

If site A thinks that allowing the user's browser to cache a certain resource puts them at a security risk, then this resource should be treated as not-public.
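
In concrete terms, the sensitive stylesheet would be served with something like:

  Cache-Control: no-store

no-store keeps it out of every cache; "private" would only keep it out of shared proxies, and the browser cache is exactly what the attack probes.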


Browsers can't rely on web developers to protect the privacy of their users, since they probably won't care.


That makes the experience worse for the moderator though, especially on mobile networks. Caching isn't just about server resources.


In this specific case, I think only marginally worse. If you're a moderator, you use the resource every day, going through many entries. The slowdown on the first request is both unlikely (already cached) and insignificant for the task. This is not an "extra 100ms will cost you a customer" situation.


Perhaps a little 'allowcache' property on script tags/images/other resources could be of use here to prevent leaking info?

Something like:

<script src='jquery.js' allowcache></script>

That way we can specifically say which items we're willing to share with other sites and which ones we want an independent copy of.


How would that help? You're incentivizing developers to mark everything "allowcache" (to make their pages faster), and it's the users who will suffer (via privacy attacks).

If your plan to fix this situation is "trust that developers are competent and benevolent", we can achieve the same result by not doing anything.


I want Roboto font to load instantly for my first time visitors and knowing Roboto is cached leaks basically nothing. On a personal level, I don't want any site to be able to figure out what media sites I am reading based on which images are cached in my browser.

"allowcache" would cause developers to do something stupid like put it on all images, but "multisite-shared" may cause developers to make reasonable choices.

A change that makes small sites a bit slower to load things like fonts becomes another brick in the wall of walled gardens.


The problem is that the security risks aren’t obvious and that feature would instantly lead to a ton of posts, StackOverflow answers, etc. saying you need to put it on everything for performance reasons (no doubt billed as a huge SEO advantage). Then we’d learn that, say, a Roboto load wasn’t so harmless when someone used it to detect fonts or Unicode subsets used by, say, Uighur or Rohingya sites, and it turned out that the combination of scripts, font variants, etc. was more unique than expected.

The other thing to question is how much impact this actually has on small sites. CDN performance is variable enough that in the HTTP/2 era I’d be skeptical that most sites’ load times are significantly impacted by this, rather than by loading too much JavaScript or delaying rendering.


By default, all existing code would not have the allowcache property, so would load from an isolated cache. Those devs that explicitly care about speed for certain resources, where leaks are not a concern (e.g. loading a lib from a CDN) can set the allowcache property on those resources.


I think there are two categories of developers who would use that. One is smart, experienced people who have correctly evaluated both security and performance concerns and decided to turn this on for a specific narrow case where it's truly valuable to speed up first-time page loads.

The other is people who want things to go faster and flip a lot of switches that sound fast without really understanding what they do, and then not turning off the useless ones because they're not doing any real benchmarking or real-world performance profiling. This group will get little or no benefit but open up security holes.

Given the declining usefulness of shared caching (faster connections, cheaper storage, explosion of libraries and versions), I expect the second group to be one or two orders of magnitude larger than the first.


I agree with you, for now. But, I can imagine a future where library payloads will increase significantly. In those cases, shared caching will be pretty useful (I'm thinking along the lines of a ffmpeg WASM lib for web based video editing apps - sounds crazy, I know, but I think we're heading in that direction!). I could of course be totally wrong, and instead we just get fancier browser APIs with little need for WASM... I guess we wait and see!


If you're opening a video editing web app, I would expect a bit of loading time the first time trying the app.

WASM modules can also be compiled as they stream in (unlike JavaScript, which only executes after being fully loaded), decreasing the value of relying on a cache in general.

> I can imagine a future where library payloads will increase significantly.

TBH I see the opposite; to use the focus of the article, jQuery was obviated by browser improvements, the pace of which is not really slowing down.


I'd like to know what you think "where leaks are not a concern" might mean. As a web developer, I have no idea how I'd be able to know that, even if I were perfectly benevolent and competent. Loading resources from a CDN is exactly the sort of thing that a malicious website can use for a timing attack.

This sounds to me no different than a developer wanting to opt-out of memory protection, on the basis that it will be a little faster -- and my program doesn't have any bugs or viruses!


But that's a separate issue, no? The leak issue is all about someone knowing whether you've accessed a resource previously or not (i.e. checking to see if the resource comes for cache or not).

For a lib hosted on a CDN, who cares?! However, if someone wants to track if you've been to myservice.com, they could try and load myservice.com/logo.png - if it's from the cache, then bingo, you've been there. That's a leak.

Maybe I've misunderstood; could you explain your timing attack mechanism in more detail please?


Perhaps "allowcache" would make this too tempting. Perhaps something like "public-shared-resource"?


You don't need allowcache. You can do this already by serving the script from a domain under your control. The only potential problem is if the attacker also loads the same resource from your domain.


> The only potential problem is if attacker will also link same resource to your domain.

That's the example used in the article (www.forum.example/moderators/header.css).


CDNs only embolden developers to pile up more and more resources on a page. Good thing to see it go away.

And maybe they made sense for things like JS in the 2000s, but many super-cheap hosting providers provide unmetered bandwidth nowadays. (And OF COURSE there are the privacy/security things.)


It also made sense when everything was HTTP 1.1, but that’s on the way out too.

Browsers throttled the number of requests per domain because parallelism was expensive for the servers. Loading from another domain could happen simultaneously. If you had a fast internet connection you’d see a reduction in page load time. You’d also see that to a lesser extent if your connection was shared with others.


But I assume that, unfortunately, web devs will still keep using CDNs for all their libs, since this is now industry standard and shows how pro you are.


Unless the libraries are bundled in, which is the case for most of the stuff I use?

Exception - the analytics stuff I'm obliged to add.


You're right, I think I mostly see fonts from Google and trackers nowadays.


Great, now let's go and build websites that don't require 100 requests per page.


Well, this might remove one argument against splitting the JS bundles too far.

If you'll never benefit from eg. a shared JQuery library on a CDN anyway, might as well include a (reduced) version of it in your bundle.


That's as long as the shared library is changing about as frequently as your custom code.

If you're using the same framework-x.y.z library for months at a time, but doing daily/weekly code changes and pushes, you're losing out on the cachability of the library.

But if your project is only being updated as frequently as the third party libraries it uses, maybe it makes sense.


My thought exactly. JS devs have told me for years that the fact that their site uses >1MB of JS doesn't affect performance because almost all of it is already in the cache for 90% of users. Now there's a counter-argument and we'll get back to reasonable page sizes.


But the future is single-page apps with 500MB JavaScript bundles, and Rube Goldberg microservices with 2000 dependencies that need containerization to keep from breaking everything else on the server. Bro.


Browser vendors are in a good position to make this call because they can use telemetry to measure the effectiveness of shared caching. Personally, I doubt shared caching is as effective as it used to be. Surely the remaining effectiveness would be decimated by any attempt to implement an isolated-by-default policy that would require website authors to opt in. So all in all, disabling the shared cache strikes me as a reasonable option.

Browser vendors could choose to bundle some popular fonts and libraries, but that comes with its own set of problems.


If you're pulling in things from a dozen or two different domains, the client-side DNS lookups and TLS negotiations get expensive (in milliseconds) when it's all added up.


In general, the ability for HTML to reference resources outside of the current domain seems to be a privacy and security nightmare: XSS attacks, privacy leaks, adtech tracking cookies, etc.


It's weird to me that sites get to know whether files loaded from the cache in the first place. I guess you could time it, but that wouldn't be perfect.


It's not intended. The fine article prominently links to a page that explains how it works:

https://sirdarckcat.blogspot.com/2019/03/http-cache-cross-si...


Guess this kills SXG? Or will they consider it "same site"? https://developers.google.com/web/updates/2018/11/signed-exc...


How would it kill it?


Does the browser at least dedupe these files internally? For example, it goes through the motions of a real download and so forth but afterwards it just stores things in a content-addressable fs. Or will I now have 50 identical copies of React on my hard drive?


I don’t think it’s that clever; the origin will now be part of the cache key, so more likely 50 identical copies.


The performance benefits of using the shared CDN copy of resources, versus hosting your own with HTTP keep-alive, are vastly overstated. In theory, if everyone were using the same version of the resources and everyone were using the same CDN, you'd see a benefit (maybe). In practice, there are too many variables and you end up cache-missing most of the time anyway.

Besides, this was only ever a concern for bad devs loading tons of tracking scripts and hacking together sites via copy-paste anyway. If you're really concerned about performance, you should be building, tree-shaking, and minifying all of your JS into a single file.


With HTTP/2 multiplexing, and upcoming HTTP/3 updates, I’m skeptical of how much performance gain is actually achieved with a shared cache anyway. As far as I’m aware, a TCP connection is still opened when using the cache, as well as TLS with all its overhead. This is all just speculation, but it would be an interesting experiment to compare how much benefit (if any) is seen pulling in that common jQuery script while still loading custom JavaScript from your own host, vs bundling it all together, or just loading them both separately from your own host. HTTP connections are certainly not free.


> As far as I’m aware, a TCP connection is still opened when using the cache, as well as TLS with all its overhead.

If the cache control headers say it expires in the future, the browser will not usually make any request; it will just load it from disk. Hence the typical practice of setting an expiration date very far in the future and just changing the URL when the resource is updated (thereby forcing the browser to request the new representation).
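
For example (the filename hash is illustrative):

  Cache-Control: public, max-age=31536000, immutable

  <!-- the hash in the filename changes when the contents change,
       which forces the browser to fetch the new version -->
  <script src="/assets/app.d41d8c.js"></script>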


I wonder if a compromise could be caching things only if they are widely used across public sites. A browser vendor could use telemetry or crawling to aggregate information about commonly used resources across the web. The browser could cache these resources, even proactively. It's certainly more complex than the shared cache, but it could achieve an end that is broadly similar. Then again, maybe the vendors' telemetry is telling them that first site load is not that common and that the shared cache doesn't move the needle that much. This wouldn't be surprising to find out.


Please, no telemetry of that kind. How would you distribute such a list?


“What does this mean for developers? The main thing is that there's no longer any advantage to trying to use the same URLs as other sites. You won't get performance benefits from using a canonical URL over hosting on your own site (unless they're on a CDN and you're not) and you have no reason to use the same version as everyone else (but staying current is still a good idea).”


Isn't this just a direct quote of the 4th paragraph?


From a security and privacy perspective there are already good reasons to self-host JS code and other external artifacts instead of sourcing them from CDNs. In some situations even without this change it's faster (if it's not already cached - because you can fetch it from the same host in the existing connection via http2).

So self-host those JS files, and also use fewer of them if possible.


This might be a crazy idea but... why is it that browsers haven't implemented something like Java Maven's package cache and proxy yet?

Basically the website says "I need com.google.angularjs:2.0.1" and the browser grabs and caches the package for all future usages. It seems to work very well for Java... why hasn't there been any such initiative for the web?


That's essentially what the browser cache is.. except you indicate the package by URL, since there's no central repository of packages.


So why isn't there a central repository of packages?


If the idea is to track a browser, can't you just use DNS resolution time? Are they looking at per-host DNS caches?


Less effective, because browsers don't cache, recursive resolvers do, and they are often shared; and it may be harder to tell the difference between a cache hit and a cache miss in DNS (responses can be very fast).

But I guess it could work too


Browsers could safely pull a list of very commonly requested, content-addressable resources from various CDNs and pre-cache them (independently of any request). That would even help with first-request latency, and for mobile (where bandwidth is expensive) you could do the pre-caching on WiFi.


> I know that only pages under www.forum.example/moderators/private/ load www.forum.example/moderators/header.css.

Correction: only moderators and anyone who visits the author's page which loads header.css for everybody. And any other page which is doing the same speculative probe.


Anyone care to explain why, in the presented example, the resource www.forum.example/moderators/header.css is accessible to anyone and not only to clients with moderator access?


On many web applications, static files like CSS/JS files and non-user-generated images are not served by the application server, but directly from the filesystem. This conserves CPU resources and might also improve network throughput because one less application is involved in the path.


Do people still use unpkg and the like? I thought left-pad showed the problem there.


The general trend with regard to the most popular javascript and css libraries is that their features have eventually made it into web standards.

We've known forever that the cache can be used for fingerprinting. This change won't be that bad if it encourages greater adoption of web standards.


Great, now just ban third-party origin resources.



Isn't this solved with Firefox containers?


Yes, but only if users know about containers, and they do proper compartmentalization. Which they don't.


As a side note:

CDNs, which are usually the use case for global caches, are also kind of critical when it comes to the GDPR and other privacy laws.

Having no global cache may kill off the usefulness of CDNs (which is somewhat doubtful, given the amount of stuff available). But you are not allowed to use them anyway unless the site is plastered with some allow-all-the-things popup.


CDNs will always be useful for as long as the speed of light exists and DDoS is a threat.


I meant more in the context of caching for fonts and js libraries like the article mentioned.

Afaik this is the main use case of CDNs.

I am pretty sure that there are way more pages with Google Fonts than Cloudflare protection.

And even for the sole purpose of DDoS prevention, the privacy issue still holds. Sadly that means popups, redirects, or other user-unfriendly crap on the pages.


What if the cache key to use is sent as a response header?



>When you visit my page I load www.forum.example/moderators/header.css and see if it came from cache.

Why can your page know if a certain resource came from cache? Can't that hole be plugged, instead?


Just measure the time it takes to load the resource.


Timing attacks. Not really pluggable.


It's eminently pluggable if we stop running hostile general-purpose code on our own machines, giving it a large, poorly-defined attack surface! That's the eventual answer here. Websites have a perfectly cromulent place to run whatever code they'd like: on their own servers. If you knew someone was trying to kill you, you wouldn't invite them into your home for a party so they could easily tamper with your medicine cabinet.


Can't we just show a "This site may harm your computer" message whenever a site is recording too much timing data? The page is justifiably considered malware at that point.

For example there's nothing preventing someone from timing all my keyboard events for keystroke biometrics: https://en.wikipedia.org/wiki/Keystroke_dynamics


Consider the requestAnimationFrame API. It will give you a 60Hz timer (even higher on high-refresh-rate displays) and is used for a ton of animation-related tasks as well as games. That said, it can effectively be used as a timer, which in this case would likely be precise enough.

What do you do in the case where a ton of websites use this API for legitimate animations?


So, based on the response to Spectre and friends ("intel knowingly sacrificed security in the pursuit of performance and everyone should sue them") [0][1], is the correct response here "browser vendors knowingly sacrificed security in the pursuit of performance and everyone should sue them?"

0: https://news.ycombinator.com/item?id=20867672 1: https://news.ycombinator.com/item?id=20873452


It's not exactly the same as having bought a processor, and then having to give up a significant fraction of performance to have it be secure, while other processor manufacturers had much smaller performance penalties...


Neither of those comments suggested suing Intel. Browsers are also free, whereas you pay for a CPU.


Why not inline such a critical resource directly in the HTML? For CSS and JS it would work just fine. And for fonts, I don't think they leak any private information.



