Making Fastestmirror Less Awful
Fastestmirror is a configuration option for the yum and dnf package managers in the RPM ecosystem (e.g. Fedora, RHEL, CentOS, AlmaLinux) that everyone loves to hate. The name of the option is so alluring; who wouldn’t want to use the fastest mirror when downloading software packages? Enabling fastestmirror is a staple of every “First ten things to do after installing Fedora” article from Linux content farms, so it’s pretty well known among people just getting started on Fedora.
But here’s the rub… It isn’t really that great of a feature.
Normally, without fastestmirror, DNF requests an ordered list of mirrors from the project and starts at the first mirror on that list to download the RPM files it is looking for. The distro’s mirrorlist server can make a good guess at where the client is based on its IP address, look up information about the mirrors such as their physical location and how much bandwidth or weight each mirror was configured with, and shuffle the list a bit to load balance across mirrors in the same region as the client. So DNF receives a list with closer and faster mirrors generally towards the top, with some randomness mixed in so a high density of clients in one place doesn’t all get the same mirrorlist and clobber a single mirror.
Enabling fastestmirror tells DNF to ignore the provided ordering of the list of mirrors. Instead, DNF spends two seconds going down the list of mirrors measuring their speed, then sorts the mirrors by their locally measured speed and picks the first one.
The problem is that fastestmirror has a very silly concept of what makes a mirror fast
DNF measures the speed of a mirror by opening an HTTP connection to the mirror and… measuring how long the TCP handshake (SYN, SYN-ACK, ACK) took. That’s it. Straight up latency measurement of the TCP socket open. The problem is that latency is at best a decent proxy for physical distance, and has almost nothing to do with available bandwidth or the resulting user experience. As long as mirrors are within the bandwidth-delay product range of the client’s maximum TCP window size, you really shouldn’t be seeing a profound difference in performance between a mirror that’s 10ms away vs 75ms away, all other things being equal.
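To put rough numbers on the bandwidth-delay point, here is a back-of-the-envelope sketch in Python. The 4 MiB maximum window is purely an assumption for illustration (real values depend on the OS and its autotuning), and the function name is made up for this post. The ceiling it computes is simply "one full window per round trip"; it ignores loss, congestion control, and slow start.

```python
def window_limited_throughput_mbps(window_bytes: int, rtt_ms: float) -> float:
    """Upper bound on single-connection TCP throughput in Mbit/s.

    A sender can have at most one window of data in flight per round trip,
    so the ceiling is window / RTT.
    """
    rtt_s = rtt_ms / 1000.0
    return window_bytes * 8 / rtt_s / 1_000_000


# Assumed maximum TCP window, for illustration only.
WINDOW_BYTES = 4 * 1024 * 1024

for rtt_ms in (10, 75):
    ceiling = window_limited_throughput_mbps(WINDOW_BYTES, rtt_ms)
    print(f"{rtt_ms} ms RTT: window-limited ceiling of about {ceiling:.0f} Mbit/s")
```

Under that assumption the ceiling is roughly 3355 Mbit/s at 10ms and 447 Mbit/s at 75ms, so a client on a link of a few hundred Mbit/s is limited by its own connection in both cases, not by the extra 65ms.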
So latency is a silly metric, but it is also undeniably simple, and more importantly, cheap. Mirror operators are already footing the bill for all the bandwidth they’re serving to users of free software, so it is unreasonable to expect mirror operators to tolerate each client performing frivolous bandwidth tests against some big file stored on the mirror just to pick the fastest one. Ideally you could be recording the historic performance of each mirror for RPM downloads in the past as a measure of download performance, but this data would be noisy (small requests dominated by latency vs large requests dominated by bandwidth) and could very well have a short shelf life if the DNF client is something portable like a laptop moving between different networks. Regardless, DNF doesn’t record that, and everyone has just kind of settled for latency being a tolerable metric that is turned off by default, and hopefully the provided mirrorlist is just good enough as-is that you leave fastestmirror in the default off position and never think about it.
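To show just how cheap this kind of probe is, here is a minimal Python sketch of timing a TCP connect. This is not librepo's code, just an illustration of the idea; the function name and the `mirror.example.org` hostname in the comment are hypothetical. Note that when given a hostname, `socket.create_connection` also performs the DNS lookup, so the number includes resolution time unless you pass an IP address.

```python
import socket
import time


def tcp_connect_ms(host: str, port: int = 80, timeout: float = 2.0) -> float:
    """Return how long it took to open a TCP connection, in milliseconds.

    No HTTP request is sent and no payload is transferred: the cost to the
    mirror is a single handshake followed by an immediate close.
    """
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=timeout):
        elapsed = time.perf_counter() - start
    return elapsed * 1000.0


# Example (hypothetical hostname, needs network access):
#   print(tcp_connect_ms("mirror.example.org"))
```

Compare that to a bandwidth test, which would need to pull megabytes from every candidate mirror before downloading a single package.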
The problem is when people turn it on
For either valid or invalid reasons, people do turn on fastestmirror, and the problem is that it really has quite a few undesirable properties to it:
- If the closest mirror to a user happens to have really bad performance, fastestmirror will consistently pick that one mirror and downloads will always be slow. This isn’t nearly as noticeable when you’re using the mirrorlist order provided by the project CDN, because those lists are shuffled so even if you happen to get a bad mirror one day, you’ll probably get a better mirror the next day.
- When a lot of clients in close proximity to a single mirror all turn on fastestmirror, they will all reliably select the single closest mirror. This happened to us on MicroMirror, where we turned up a 1Gbps mirror server that turned out to be the lowest latency mirror to AWS in Virginia, so we became the preferred EPEL mirror for approx 1.4 million CentOS 7 servers. We ultimately needed to simply stop hosting EPEL on that one mirror because of how badly the mirror was getting crushed with traffic.
So I decided that I should do something about this undesirable behavior. I was never going to fix the behavior of CentOS 7 clients, but thankfully that problem solved itself by the whole OS becoming obsolete. If I could get some kind of behavior change accepted by the upstream DNF / librepo maintainers, in 5-15 years it’s possible that as a mirror operator, I won’t see as much undesirable behavior due to fastestmirror being enabled.
My fix: Don’t always pick the single fastest mirror
So ultimately, my pull request into librepo turned out to be quite a short patch, and in my opinion a pretty clever fix given that there is no appetite to fundamentally change the measurement used to rank the mirrors.
Now, instead of librepo ranking mirrors by a strict ordering of lowest latency first, my change measures the latency of all the mirrors and separates mirrors into two lists: mirrors with less than twice the best latency, and mirrors with higher latency than twice the fastest. We take the pool of “mirrors with less than 2x latency” and shuffle them together, and then append the rest of the mirrors sorted by latency.
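The idea is easier to see in code than in prose. The actual patch lives in librepo, which is written in C; what follows is only a Python sketch of the same logic, with a made-up function name and mirror names. How a mirror sitting at exactly 2x the best latency is treated is an assumption here (it goes in the "rest" list).

```python
import random
from typing import Optional


def order_mirrors(latencies: dict[str, float],
                  rng: Optional[random.Random] = None) -> list[str]:
    """Order mirrors: shuffled pool of 'close enough' mirrors, then the rest.

    latencies maps a mirror name to its measured connect latency.
    """
    rng = rng or random.Random()
    if not latencies:
        return []

    # Strict latency ranking, which is what fastestmirror used on its own.
    ranked = sorted(latencies, key=latencies.get)
    best = latencies[ranked[0]]

    # Pool: everything under twice the best latency.
    # Assumption: a mirror at exactly 2x counts as "far".
    near = [m for m in ranked if latencies[m] < 2 * best]
    far = [m for m in ranked if latencies[m] >= 2 * best]

    rng.shuffle(near)   # rotate load across the near pool
    return near + far   # far mirrors stay sorted by latency as fallbacks


measured = {"a": 10.0, "b": 15.0, "c": 19.0, "d": 40.0, "e": 25.0}
print(order_mirrors(measured))
```

With those made-up numbers, "a", "b" and "c" come first in a random order, followed by "e" and then "d".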
This change means that:
- When a single mirror is significantly closer to the user than the rest, there is no change in behavior and that one mirror is always picked.
- When there are a few mirrors that are about the same as the one closest mirror, instead of always picking the closest one, we randomly rotate across the pool.
In both instances, DNF still has the full list of mirrors to fall back on if the first mirror(s) are unavailable.
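A quick simulation illustrates the two cases. This is a toy Python model, not a measurement: the mirror names and latencies are invented, and it only looks at which mirror ends up first in the list. Picking uniformly from the pool is equivalent to taking the first element of a shuffled pool.

```python
import random
from collections import Counter


def first_pick_counts(latencies, runs=10_000, seed=0):
    """Count which mirror lands at the top of the list over many DNF runs."""
    rng = random.Random(seed)
    best = min(latencies.values())
    # Pool of mirrors under 2x the best latency; fall back to the single
    # fastest mirror if the pool is empty (only possible with a 0 latency).
    pool = [m for m, ms in latencies.items() if ms < 2 * best]
    pool = pool or [min(latencies, key=latencies.get)]
    return Counter(rng.choice(pool) for _ in range(runs))


# One mirror far closer than the others: behavior is unchanged.
lone = {"near": 5.0, "far1": 40.0, "far2": 60.0}
print(first_pick_counts(lone))

# Three mirrors at similar latency: load is spread across them.
cluster = {"m1": 20.0, "m2": 24.0, "m3": 31.0, "m4": 90.0}
print(first_pick_counts(cluster))
```

In the first case every run picks "near". In the second, "m1", "m2" and "m3" each take roughly a third of the runs and "m4" is never the first choice.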
My Concerns
I think there are two ways that this change is going to go sideways:
- We are no longer picking the single lowest latency mirror, so it is possible that users are going to have instances where they’re picking something slightly further away that has significantly lower performance for reasons that aren’t latency related.
- Shuffling the mirror list means that every time you run DNF, you may be hitting a different mirror, so we’re relying more on mirror-to-mirror consistency than before. This means that if one of the nearby mirrors is behind on updates and doesn’t have the expected RPM files, the user is going to see more errors than previously.
The counter arguments boil down to “having DNF always pick the best mirror is a fundamentally unsolvable problem” and “without fastestmirror we already see the inconsistency issue”.
So I’m not fundamentally fixing fastestmirror here, and the answer for poor mirror performance is still always “turn off awful performing mirrors, add more mirrors to scale horizontally”. But hopefully this small change makes fastestmirror less bad, and once it eventually lands in major distributions I’m going to be very interested to see what the community feedback is with real world experience. Once it does end up getting shipped, I’ll be sure to update this blog post to mention where it’s available.