Amazon Cloud Traffic Is Suffocating Fedora's Mirrors (phoronix.com)
143 points by heywire 3 months ago | 90 comments



Something irks me about volunteers spending real money to support all the businesses freeloading on OSS. I'm talking about companies with market caps in the billions. Almost none of them can be bothered to kick back even a modicum of financial support to the authors of the software that runs their business, and to add insult to injury, they soak the members of the community who distribute the binaries for their bandwidth.


I wonder why Amazon did not create their own mirror. They would sync with the Fedora mirrors once, and all AWS traffic could then go to their own mirrors.

Isn't that the sensible thing to do?


AWS CodeArtifact doesn’t support RPM. Have your TAM +1 the PFR for RPM support. Then all those AWS customers running AL2 or RHEL 8 can use a secure, local-to-the-VPC mirror.


So, let's say you set up an EL7 system with stock everything, and then enable epel with dnf install epel-release or equivalent.

You now get the stock EPEL mirror listing, not a custom package Amazon rewrote; they'd need to add a custom mirror and an override for that package to make it work that way.

So I would _guess_ one or more large companies migrated a bunch of things onto cloud-hosted EL7-based systems after EOL meant god help you finding new hardware support.

(I would also guess that either EPEL doesn't have a mirror on Amazon, or it does and the auto-picker should have chosen it but didn't. Either way, Amazon will probably go "oh, our bad", since I believe they host local mirrors of things anyway...)
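
To make that concrete, here's a minimal sketch of what such a baseurl override could look like on a single host. The internal mirror URL and layout are purely hypothetical, and in practice you'd bake this into the image rather than edit it after the fact:

```python
# Hypothetical sketch: repoint a host's EPEL repo at an internal (e.g.
# VPC-local) mirror instead of the public metalink. The mirror URL/layout is
# made up; nothing here reflects anything Amazon actually ships.
import configparser

REPO_FILE = "/etc/yum.repos.d/epel.repo"
INTERNAL_MIRROR = "https://mirror.internal.example/epel/$releasever/Everything/$basearch/"

cfg = configparser.ConfigParser(interpolation=None)
cfg.read(REPO_FILE)

if "epel" in cfg:
    # Drop the public metalink so dnf stops asking Fedora's MirrorManager,
    # then pin the repo to the internal mirror.
    cfg["epel"].pop("metalink", None)
    cfg["epel"]["baseurl"] = INTERNAL_MIRROR

with open(REPO_FILE, "w") as f:
    cfg.write(f)
```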


How about just supporting community distributions? ...and no, Fedora is not one of them.

Just stop supporting brands and corporations if you don't get paid?


Does Red Hat pay for Fedora's infra? Genuine question, and I'm not saying it's better either. It's just that it would be weird for Fedora to operate on volunteer/donated infra when it's quite important to Red Hat, considering it's the "upstream" (not sure if that's the correct term here) of RHEL.


From TFA:

> The massive uptick in Fedora/EPEL activity puts additional pressure on Fedora web proxies for mirror data and then the mirrors themselves that tend to be volunteer run. Much of this new traffic is coming from the Amazon/AWS cloud.


Ah, I read the article but wasn't sure what they meant by "volunteer run" in this context (for example, individuals only or corporate too). Your interpretation makes sense though!


I think DigitalOcean has their own package mirror for their images.

AWS is just being ignorant. If I were in charge of Fedora infrastructure I'd block them and send them instructions on how to set up a mirror.


The fact that this is EPEL strongly suggests that it was set up by an AWS user, not by Amazon themselves. EPEL is not used by default in any common AWS AMIs. Perhaps it is an Amazon Linux user who enabled EPEL via Amazon's package, but it's not supported in the most recent version of AL, so Amazon seems to have addressed that issue anyway.


More likely this is a change in a popular AMI or container image that is being used by a lot of different users, e.g. a startup script that unnecessarily pulls or syncs from EPEL.

I suspect there are only a handful of vendors with popular enough AMIs or container images that could account for this, though.


> that it was set up by an AWS user,

A user with "five million additional systems" on AWS?


> “A user with "five million additional systems" on AWS?”

Someone is going to be in for a big surprise when they get their AWS bill this month and realise there’s an infinite-loop bug in their instance spawning script.


It’s clearly either a large contract that would have been negotiated before any instances were spun up, or Amazon themselves.


A long time ago I knew a guy who uploaded a Counter-Strike patch to his ISP personal hosting and ended up on the official mirror list. It ended up taking down the ISP, IIRC.


I don't think it's one user; I think it's a ton of them. Want to use Let's Encrypt in your Openshift-on-AWS deployment? certbot's in EPEL, along with a lot of other quality-of-life stuff for log-shipping, monitoring, etc.


Each with a unique public IPv4 address too!


That is some massive AI training!


Escalating delays can help with this. Get it to be slow enough that people notice.

XML schemas have had a similar history, tanking w3.org servers.


They totally deserved it for making namespaces that "just happen to be" URLs. XML is insane.


Ah I remember the good old days of strict XML parsers that would fail if they didn't have Internet access to pull the schema in.


Which you didn’t figure out for a while and so your CI/CD pipeline and dev code/build/test cycles hammer their servers for months.

Then you prefetch to fix that problem, and now some slow calls you hadn’t gotten to the bottom of suddenly aren’t slow anymore.


well, but they're URIs. See, the difference is right there. An identifier not a location. Nobody should ever confuse the two!

/s of course


Why are they even URLs? The only reasonable suggestion I could find is that it was part of an abandoned or poorly adopted idea to also host the schema at that URL.


Probably the same sort of thought process that led to the convention of Java packages being named com.example.whatever. It identifies the creator and gives you some structure to create a unique identifier.

Lot of half-baked ideas floating around in the early years of the commercial internet, but the Java thing held up better.


It's a way to ensure global uniqueness. In the end they're just compared byte-for-byte as strings.


Precisely, so why are they URLs and not something like Java's packages? Or just a URL without the `http://`?


Snarky answer: because they're the W3C and are high on their own supply ;-)

But it does allow more flexibility. If you don't want to be tied to a domain name, you can use a URN with a UUID like urn:uuid:4603d9d3-e895-4000-9077-0ab0f2776e1e


You could also just do `com.mydomain.uid.4603d....` if you wanted. But yeah I think your snarky answer is probably more than a little true!


In addition to the uniqueness others have mentioned, where do you find the canonical definition of the schema if it doesn’t have a URL? So they just cut out the mapping and made them one.


This is nonsensical. Where did you get the XSD from? You could get a proper spec from the same place. Contrast this with JSON, which makes zero attempt. People (generally) prefer that. If you were dead-set on embedding the documentation URL in the namespace, you could at least remove the tripping hazard and use a new "protocol" like `xsd-spec://host/specs/foo/v1.2` or something.


Schemas != namespaces.

I'm sure some braindead software out there attempts to retrieve namespace URIs, but it would surely be a drop in the bucket compared to traffic for schemas/DTDs (which are intended to be retrieved).


Though there are probably things AWS could do anyway, this could well be caused by a large customer using a custom AMI, and not because of anything Amazon did or didn't do.


Surely AWS knows how to set up a mirror. It's just a mistake; they'll correct it. Also, simply blogging about it (which gets amplified by Phoronix, then by HN) is a better strategy for getting their attention than blocking.


I worked in ops for 20+ years.

If someone blocks you it becomes an incident, a post mortem and you learn your lesson.

If someone blogs about it, or e-mails you, it gets added to a todo list and might get fixed in a few weeks by an uninterested intern.


Per the original blog:

> ADDENDUM (2024-05-30T01:08+00:00): Multiple Amazon engineers reached out after I posted this and there is work on identifying what is causing this issue. Thank you to all the people who are burning the midnight oil on this.

You worked in ops, but not in a context where your employer could get shamed by IBM in public on the pages of Phoronix and HN. Call it "cloud scale" ops, I guess.


Maybe they have a commercial relationship and don't want to harm it over a bug?


How do you know "it's just a mistake"?


I've made a point of calling out DigitalOcean in Linux mirroring talks as the gold standard for being a good citizen; they run their own internal mirrors, which are FAST, making it a value-add feature for them as well.


I doubt Amazon builds the Fedora images. So if they’re pointed to the wrong place, that’s not AWS’s fault.


Nahh, that seems needlessly cruel; they should continue to serve them, just at 100k speed.


I wish apt, dnf/rpm, flatpak, etc. utilized a decentralized distribution option, like IPFS or BEP46 mutable torrents. It would be neat if the project leads seeded new package update hashes, volunteers ran seedboxes instead of HTTP mirrors, and clients had (default-on?) seeding of package binaries in addition to only downloading. It would be nice to see the open source community contributing to support each other's experience.


You could update Arch Linux with pacman over IPFS from around 2015 until last year. https://github.com/ipfs/notes/issues/84

Then the mirror became too slow and couldn't handle the amount of package data and was shut down: https://github.com/RubenKelevra/pacman.store

It worked rather well and even automatically mirrored the packages on my LAN. Maybe it'll be back some day.


I'm not quite following (including trying to skim your links) - what mirror? I thought the point was that it was decentralized and should scale arbitrarily?


It worked like this: rsync2ipfs-cluster fetched the packages from an rsync mirror into an IPFS MFS folder, updated the link in IPNS, and then added this folder to a running IPFS instance/cluster.

All of those packages are then available for pacman under the (unchanging) IPNS name, and the packages themselves are stored in IPFS as usual files. All you had to do was add `http://x86-64.archlinux.pkg.pacman.store.ipns.localhost:8080...` to your mirror list after running your IPFS node (or use a public IPFS gateway).

The website sadly is down now. It had a nicer explanation. None of this still works now however.

EDIT: the folder/list of files was managed by the user hosting the project. The files/packages themselves were distributed. That's just how IPFS/IPNS works. Somebody has to put them into IPFS.

EDIT2: technically the example I posted is a DNSLink which points to a (now gone) TXT record which points to an IPNS name (which is a mutable pointer) that contained the directory of all current archlinux packages.

I hope that was comprehensible.
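
If it helps, here's a rough client-side sketch of what that meant in practice; port 8080 is the default local gateway port, and the package filename is made up:

```python
# Rough sketch of the client side: pacman just saw an ordinary HTTP mirror,
# served by the local IPFS gateway, which resolved the mutable IPNS/DNSLink
# name to the current package directory. The package name is hypothetical.
import urllib.request

GATEWAY = "http://127.0.0.1:8080"
MIRROR = "/ipns/x86-64.archlinux.pkg.pacman.store"   # DNSLink -> IPNS -> current repo dir
PACKAGE = "example-1.0-1-x86_64.pkg.tar.zst"         # hypothetical package file

with urllib.request.urlopen(f"{GATEWAY}{MIRROR}/{PACKAGE}") as resp:
    blob = resp.read()   # the gateway pulls the blocks from whichever peers have them
```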


Ah, so when you say "Then the mirror became too slow and couldn't handle the amount of package data and was shut down" you're talking about the side that added packages to IPFS and did the initial seeding. That's what I wasn't following, thx.


Exactly. As I understood it, the problem lay in the import of new packages. That was hanging, and deleting a single file from the folder took ~30s. In older versions (<=0.9) this worked faster. This caused the IPFS mirror to be days out of date, and then the project was discontinued.


I have a local MITM Squid proxy on my home network to try and alleviate this problem, but this is why the prevalence of TLS, with no protocol design for how we do proxying, has gotten pretty annoying.

I did try doing something fun with network-interface scripts and avahi to make this more dynamic - the idea was that my desktop and laptop both ran Squid proxies and would dynamically set each other as peer proxies if they detected each other on the network. It didn't work great given the overhead.

It really feels like there should be a better way to do this in general, and without having to break TLS directly for things which are not really "secret".


Why aren't packages based on torrents? Most distros come with a torrent client.


Torrents tank when there is lack of popularity. Linux can be configured for a wide variety of purposes, from IoT driver to gaming workstation, so not all packages are equally popular with all people, and just because a package is unpopular doesn't mean it isn't important. If I'm the one person in the world running a service that everyone else relies on, then the packages which support that service are important even if no one else uses them.

Therefore to use torrents requires one of two choices, neither of which are really viable when you think about it. Either:

* Require everyone who uses the distribution to torrent releases of the full set of packages for the distribution. (And remember that packages regularly get updated even in LTS releases, due to backporting security fixes etc.) It means everyone using the distro really needs to be running dozens or hundreds of torrents.

* Everyone torrents only the packages they need, so access to packages becomes up to popular vote instead of actual importance.

Torrents are fine for ISOs but intractable for ongoing package management.


Right, it wouldn't eliminate the need for mirrors (or seeds in the torrenting case). But it would greatly reduce the bandwidth needed for those mirrors.

It shouldn't be any worse than hosting old versions of packages on an HTTP file server.


A couple thoughts:

* You can mitigate the popularity thing by using webseeds; if it's popular then it's p2p, and if not then it just falls back to HTTP (sketch below)

* Alternatively, you can 80/20 it; only torrent the 10 biggest and 10 most popular packages (changing 10 to taste), which is a relatively small number of torrents to run while minimizing load on mirrors
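
On the webseed point (BEP 19), the HTTP fallback is just a top-level "url-list" key in the torrent metainfo. A minimal sketch, with a made-up tracker, mirror URL, and payload:

```python
# Minimal sketch of a single-file torrent with a webseed (BEP 19): clients
# fall back to the plain HTTP mirror in "url-list" when no peers are seeding.
# All URLs and the payload are placeholders.
import hashlib

def bencode(obj) -> bytes:
    if isinstance(obj, int):
        return b"i%de" % obj
    if isinstance(obj, bytes):
        return b"%d:%s" % (len(obj), obj)
    if isinstance(obj, str):
        return bencode(obj.encode())
    if isinstance(obj, list):
        return b"l" + b"".join(bencode(x) for x in obj) + b"e"
    if isinstance(obj, dict):
        return b"d" + b"".join(bencode(k) + bencode(v) for k, v in sorted(obj.items())) + b"e"
    raise TypeError(type(obj))

payload = b"\x00" * 262144                      # stand-in for the package bytes
metainfo = {
    "announce": "udp://tracker.example:6969/announce",           # hypothetical tracker
    "url-list": ["https://mirror.example/pool/example-1.0.rpm"],  # HTTP fallback (webseed)
    "info": {
        "name": "example-1.0.rpm",
        "length": len(payload),
        "piece length": 262144,
        "pieces": hashlib.sha1(payload).digest(),  # one piece -> one 20-byte SHA-1
    },
}

with open("example-1.0.rpm.torrent", "wb") as f:
    f.write(bencode(metainfo))
```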


If every torrent includes a webseed, then you're still left with the problem of needing to build a full HTTP CDN, and now you also have to maintain the largest tracker infrastructure ever deployed for BitTorrent.

Under normal conditions with well behaved clients, raw bandwidth for the large packages is essentially never the issue. Misbehaving clients, cache thrashing, IOPS are the sort of issues that cause pain for mirrors.


Do torrents have additional overhead that regular hosting does not?


Some ISPs try to block traffic that looks torrent like.


I would love my package manager to support a simple fetch-helper protocol to plug in distributed caches: torrents (with webseeds), mDNS, heck, just a Redis service tracking which systems should have which packages cached per site, or something not much more complex than an old-fashioned HTTP caching proxy.


Well, if you use apt you might be in luck. There are a bunch of apt transports. I don't know how mature they are, but you can google apt-transport-debtorrent, apt-transport-s3, and one for IPFS.

https://github.com/JaquerEspeis/apt-transport-ipfs


Torrents don't handle frequent updates very well. You'd have tons of outdated torrents floating around. They do work well for archives and installation media, which don't change that often.


Honestly, default behavior should be to at least share package files on the local network. But sharing to the wider internet should be fairly trivial in <current year>; we have no shortage of technologies that accomplish this.

I wish more systems had this kind of feature. I have a fat fiber connection; I'd be thrilled to pop up an unofficial mirror for something like a Linux distro. I've tried mirroring Linux ISO torrents, but it seems almost nobody ever downloads from a torrent, so I end up never actually uploading any of these images.


This would make it so much easier for end users to cache/host a mirror than the myriad tools and hacks like Squid.

I'm sure this will happen once the Linux community standardizes on a package format /s.


At the end of the day, there's a trust issue. The way distribution works nowadays mitigates a lot of that. Making it decentralised would be a step backwards.


Aren't the RPMs signed?


Packages can be and generally are signed and verified.


You know that packages are signed, right? That's why anyone can be a LibreOffice or Arch Linux mirror... or a Fedora one.


And because of this, these mirrors are often non-HTTPS, so your ISP can actually intercept and provide the data to you directly.

This is why torrents would actually work fine in this model.


> ISP can actually intercept and provide the data to you directly

What a nice gesture! However, my mirror servers are HTTPS plus rsync (for some projects). But aren't there some ISPs that block torrent traffic completely?


BitTorrent is hard to block completely from my understanding of how the protocol works, but I may be wrong.


Unless you can break the SHA-1 hashes for the torrent chunks and the global hash, torrents also authenticate the content. If you have a valid "info" dictionary, at least.


Where "intercept" usually means pointing DNS for the mirror domains at the ISP's local mirror, and "ISP" could mean your cloud provider.


A digital signature does not protect you against a malicious actor who starts distributing outdated, vulnerable versions to you.


How do I know you don't know what you're talking about? ...a mystery ;)


I mean replay attacks and freeze attacks as described in https://doi.org/10.1145/1455770.1455841 . It's very likely that I'm not up to date on mitigations in individual package managers.


> all of these package managers have vulnerabilities that can be exploited by a man-in-the-middle or a malicious mirror.

Well, that's how software is, but that article is from 2008 and things have gotten a lot better in the meantime (I think Arch Linux wasn't even signing their packages back then).

If the distribution's private keyring isn't compromised and you don't have third-party repos, your packages are as trustworthy as your distribution team (and upstream).


Suppose you have an Artifactory server that mirrors/caches a lot of public stuff, so one is (hopefully) a good citizen and doesn’t spam public mirrors with constant requests for the same thing.

But every tool has its own config to set to use the Artifactory. One setting for the OS package manager (which is different for different Linux distributions), another for PyPI, another for NPM (or Yarn or whatever), another for Maven/Gradle, something else for Go, then I need to download this Postgres extension and build it from source; the list goes on. So almost inevitably something gets missed, one ends up not being as good a citizen as one ought to be, and then one day some random Jenkins job fails because some external dependency could not be downloaded.

I wish there was an easier way. Like some standard mechanism for saying “for this URL use this proxy”.

I guess one could just use a proxy server (http_proxy environment variable), but with most things on HTTPS it needs to MITM the TLS, which then means you need that certificate installed in the build process - which is another one of those “everything can do it but everything does it differently” problems. And in any event, MITM is a bad smell.


It's always seemed ridiculous to me that `apt` by default isn't just implemented as a global hash lookup. Once I have my package indexes and signatures, where a package comes from really doesn't matter - I should be able to fire a request into the ether and get routed to whoever has it, not depend on one specific mirror not breaking mid-update.


> But every tool has its own config to set to use the Artifactory. One setting for the OS package manager (which is different for different Linux distributions), another for PyPI, another for NPM (or Yarn or whatever), another for Maven/Gradle, something else for Go, then I need to download this Postgres extension and build it from source - the list goes on.

Tools like Artifactory are a hack built upon a hack built upon a hack; redirecting certain HTTP requests to proxies would just be piling even more crap on top.

Content-addressing is a much cleaner option: identify files by their hash, rather than as the result of an HTTP request to some particular URL (many tools will already use these hashes to verify the result anyway; that's what "lock files" are for!). Content-addressed data is agnostic about how it's retrieved, which makes caching trivial. There's no need to care about the data format, whether it's RPM, Deb, a source tarball, a patch, a Python "egg", or whatever.

For example, I led a transition to Nix at a previous employer. We had a bunch of projects with various build processes (Maven, SBT, Gradle, PyPI, NPM). After wrapping these in Nix, the whole lot could be cached on S3 by simply copying files around (see https://nix.dev/manual/nix/2.22/store/types/s3-binary-cache-... )
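
For anyone who hasn't seen it, the core of the content-addressed idea fits in a few lines. A minimal sketch (not Nix itself; the URLs and hash are placeholders):

```python
# Sketch of content-addressed fetching: the artifact is identified by its
# SHA-256 (as recorded in a lock file), so *where* the bytes come from doesn't
# matter; any cache or mirror that has them will do. URLs are placeholders.
import hashlib
import urllib.request

def fetch_by_hash(expected_sha256: str, candidates: list[str]) -> bytes:
    for url in candidates:
        try:
            with urllib.request.urlopen(url) as resp:
                data = resp.read()
        except OSError:
            continue  # mirror down or missing the object; try the next one
        if hashlib.sha256(data).hexdigest() == expected_sha256:
            return data  # any source is acceptable once the hash matches
    raise RuntimeError("no candidate produced the expected content")

# Usage would be something like: local cache first, then S3, then upstream.
# blob = fetch_by_hash("<sha256 from the lock file>", [
#     "http://cache.internal.example/<sha256>",
#     "https://bucket.s3.example/<sha256>",
#     "https://upstream.example/foo-1.0.tar.gz",
# ])
```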


I have been turning over in my mind lately what it will take to deploy a Pulp Project instance for my business. Even in a small company it’s a bunch of work. I have a Kubernetes cluster, Ansible AWX running a bunch of playbooks using custom Execution Environments, a bunch of Ubuntu servers managed by AWX, and I’m evaluating the idea of migrating from GitHub to Gitea, which would include GitHub Actions. A few critical apps are written in Laravel or Python, so that’s in the package/artifact caching mix too. I keep punting because these workloads keep feeling like a time-consuming chicken-and-egg problem.

Pulp is perfect for this but the demands on my time make it hard to see around the corner.


Proxy auto config (PAC) supports specifying different proxies for different URLs. Unfortunately, a PAC file is just a file that contains a JavaScript function to pick the proxy, so they're crazily over-powered for the task, and support for them isn't very broad. Browsers support them, but I guess most command line tools wouldn't.

https://en.m.wikipedia.org/wiki/Proxy_auto-config


Another solution: an HTTP proxy server listening on localhost to which you send HTTPS requests using GET https:// instead of CONNECT. Then the proxy server could have all the logic about which requests to handle via the cache versus which to fetch directly. It could also handle authentication to a cache server if that is required.

The problem is most clients don’t do GET https://, because in your old-school corporate web proxy use case, the proxy server is remote, and sending HTTPS requests to it over HTTP eliminates the security of HTTPS.

If only there were some standard environment variable like artifact_proxy, which had to be a localhost http URI, and which tools would understand as meaning “send HTTP GET to this proxy, even for https://, delegating all the TLS stuff to it, but only if you are trying to download a build artifact, not for any runtime use”.

The hard part wouldn’t be implementing this idea (the local proxy server and the environment variable); the hard part would be getting all the different tool developers to agree to support it.
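
For what it's worth, a toy sketch of what that local proxy half could look like; the port, cache location, and env var name are just the hypotheticals from above, and a real version would need streaming, error handling, and a policy for TLS verification:

```python
# Toy sketch of the proposed local "artifact proxy": tools send a plain HTTP
# GET with an absolute https:// URL to localhost, and this process does the
# real TLS fetch, consulting a naive on-disk cache first. Not production code.
import hashlib
import pathlib
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

CACHE = pathlib.Path("/var/cache/artifact-proxy")   # arbitrary cache location

class ArtifactProxy(BaseHTTPRequestHandler):
    def do_GET(self):
        url = self.path                      # proxy-style request: the path is the absolute URL
        if not url.startswith("https://"):
            self.send_error(400, "expected an absolute https:// URL")
            return
        key = CACHE / hashlib.sha256(url.encode()).hexdigest()
        if key.exists():
            body = key.read_bytes()          # cache hit: serve the local copy
        else:
            with urllib.request.urlopen(url) as upstream:   # the proxy does the TLS hop
                body = upstream.read()
            CACHE.mkdir(parents=True, exist_ok=True)
            key.write_bytes(body)            # cache keyed by URL hash, no expiry
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    # e.g. export artifact_proxy=http://127.0.0.1:3128 and teach tools to honour it
    HTTPServer(("127.0.0.1", 3128), ArtifactProxy).serve_forever()
```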


> Amazon Cloud Traffic Is Suffocating Fedora's Mirrors

An astounding milestone for the English language.

Imagine what this sentence possibly could have meant in 1990.


Native English speaker here who has a Fedora box as a daily driver.

I thought this title referenced cars driving in a South American rain forest.


It has always surprised me that no one complains about the current trend of automated processes or builds quasi-systematically retrieving packages from public repositories like PyPI, Debian, GitHub, ... each time a Debian image or something is built, or an automated test or GitHub Action is run, without their own cache.

A decade ago, each company used to have its own cache of all public packages for CI/CD, but it looks like no one cares anymore.


> A decade ago, each company used to have its own cache of all public packages for CI/CD, but it looks like no one cares anymore.

Even worse, many of those processes are running an 'apt-get update' (or equivalent), so there's no way to know what packages they'll get each time!


It might be more appropriate to link to the original blog post: http://smoogespace.blogspot.com/2024/05/where-did-5-million-...


Checks out. The normal stuff is mirrored, but not EPEL: https://repost.aws/knowledge-center/ec2-enable-epel


Start putting ever harsher rate limits on their IP ranges until an actual human at AWS reaches out, bypassing their so-called "support" channels, by making it their problem?


So I know this is only kind of relevant, but... why is EPEL on Fedora mirrors at all? AFAIK EPEL is specifically for RHEL et al. and its packages don't even target Fedora.


EPEL is a separate module from fedora-enchilada, but it uses the same backend CDN infrastructure, and most mirrors tend to carry both /fedora/ and /epel/. So they're not technically the same mirrors, but in practice they mostly are.


It's built and maintained by the Fedora community


Just upload distributions in the blockchain



