Using Squid StoreIDs to optimize Steam's CDN
As part of my new router build, I'm playing around with transparent HTTP caching proxies.
Caching proxies are a really neat idea; when one computer has already downloaded a web page or image, why download it again when another device right next to it asks for the same image? Ideally, something between all of the local devices and the bottleneck in the network (namely, my DSL connection) would intercept every HTTP request, save all of the answers, and interject its own responses when it already knows the answer.
My setup is pretty typical for caching proxies. On my router, I have a rule in iptables that any traffic from my local 10.44.0.0/20 subnet headed for the Internet on port 80 should be redirected to port 3127 on my router, where I have a squid proxy running in "transparent" mode.
The basic transparent proxy deserves a post of its own once I finish polishing it, but for right now I'm writing this mainly as notes to myself, because the lead time on the next part is going to be pretty long.
My protocol-compliant caching proxy seems to be able to answer about 2-5% of HTTP requests from the local cache, which means that the responses are coming back in the 1-3ms range instead of 40-200ms. 2-5% isn't something to sneeze at, but it isn't particularly profound either. Squid does allow you to write all kinds of rules about when to violate a response's cacheable meta-data or how to completely make up your own. A common rule is:
refresh_pattern -i \.(gif|png|jpg|jpeg|ico)$ 3600 90% 43200
which tells Squid to treat any image that lacks explicit freshness headers as fresh for 90% of its current age (the time since it was last modified), bounded below by 3600 minutes and above by 43200 minutes. This opens a whole rabbit hole of how deeply you want to abuse and mangle cacheable headers in the name of squeezing out a few more hits. I've played that game before, and it usually ends up causing a lot of pain because incorrectly cached items tend to break websites in very subtle ways...
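To make the three numbers in that rule concrete, here is a rough Python sketch of the freshness heuristic as I read it from the squid.conf documentation. It only covers responses with no explicit expiry, the function and parameter names are made up for illustration, and Squid's real logic has more cases than this.

```python
def is_fresh(age_s, lm_age_s, min_minutes=3600, percent=0.90, max_minutes=43200):
    """Simplified refresh_pattern check for an object without explicit expiry.

    age_s    -- seconds since the object was fetched into the cache
    lm_age_s -- seconds between Last-Modified and the fetch (0 if unknown)
    The defaults mirror the example rule: min 3600, 90%, max 43200 (minutes).
    """
    if age_s > max_minutes * 60:
        return False                      # older than the upper bound: stale
    if lm_age_s > 0 and age_s / lm_age_s < percent:
        return True                       # still within 90% of its age at fetch
    if age_s < min_minutes * 60:
        return True                       # younger than the lower bound: fresh
    return False


HOUR, DAY = 3600, 86400

# Fetched an hour ago, and the image was 10 days old at the time.
recent = is_fresh(age_s=1 * HOUR, lm_age_s=10 * DAY)
# Fetched 31 days ago: past the 43200 minute (30 day) upper bound.
ancient = is_fresh(age_s=31 * DAY, lm_age_s=10 * DAY)
# Fetched 2 days ago, only 1 day old at fetch: saved by the 3600 minute minimum.
young = is_fresh(age_s=2 * DAY, lm_age_s=1 * DAY)
```

The point of the sketch is that the percentage does most of the work for old, stable images, while the minimum and maximum only matter at the edges.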
Another problem with caching proxies is the opposite of the previously mentioned over-caching. While that was an issue of a single URL mapping to different content over time, there is also the issue of multiple URLs mapping to the same content.
This is very common; any large content delivery network will have countless different servers each locally serving the same content. The apt repositories for Ubuntu or Debian are perfect examples of this: universityA.edu/mirror/ubuntu/packagename and universityB.edu/mirror/ubuntu/packagename are the same file, even though they have different URLs.
Squid, in version 3.4, has finally added a feature called StoreID, which lets you work around this many-URLs-to-one-content problem. It allows you to have Squid pass every URL through an external helper program that maps each URL to the ID Squid stores the response under, to try to generate a one-to-one mapping between stored IDs and content. I decided to play with this on the Steam CDN.
When you download a game in Steam, it is actually downloaded as 1MB chunks from something on the order of four different servers at once. In the menu Steam - Settings - Downloads - Download Region you can tell Steam which set of servers to download from, but it still selects exactly which servers to use beyond your control.
A typical Steam chunk URL looks like this:
http://valveSERVERID.cs.steampowered.com/depot/GAMEID/chunk/CHUNKID
- SERVERID is a relatively small number (two or three digits) and identifies which server this chunk is coming from. At any one point, a Steam client seems to be hitting about four different servers. valve48 and valve271 are two that I'm seeing a lot in San Jose, but the servers seem to come and go throughout the day.
- GAMEID is a number assigned to each game, although I've seen some games move from one ID to another halfway through the download. The largest game ID I've seen is in the high 50,000s. I strongly suspect that these are sequentially issued.
- CHUNKID is a 160-bit hex number. Presumably a SHA1 checksum of the chunk? I haven't bothered poking at it.
The main takeaway is that, even when I have three computers downloading the same update, since each one of them is going to hit different servers for each chunk, I'm only seeing 25-40% cache hits for three sets of the exact same {GAMEID, CHUNKID} pairs.
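A small Python sketch of why the hit rate is so poor: the same chunk requested from three servers looks like three unrelated objects to a cache keyed on the full URL. The chunk hash, the game ID 440, and the server number 105 below are made up for illustration, and the regexes simply assume the URL layout described above.

```python
import re
from urllib.parse import urlsplit

# Assumed layout: valveSERVERID.cs.steampowered.com/depot/GAMEID/chunk/CHUNKID
HOST = re.compile(r"^valve(?P<server>\d+)\.cs\.steampowered\.com$")
PATH = re.compile(r"^/depot/(?P<game>\d+)/chunk/(?P<chunk>[0-9a-f]{40})$")


def parse_chunk_url(url):
    """Return (SERVERID, GAMEID, CHUNKID), or None if it isn't a chunk URL."""
    parts = urlsplit(url)
    host = HOST.match(parts.hostname or "")
    path = PATH.match(parts.path)
    if not (host and path):
        return None
    return int(host["server"]), int(path["game"]), path["chunk"]


# Hypothetical requests: one chunk, fetched by three machines from three servers.
chunk = "0123456789abcdef0123456789abcdef01234567"
urls = [
    f"http://valve{server}.cs.steampowered.com/depot/440/chunk/{chunk}"
    for server in (48, 271, 105)
]

url_keys = set(urls)                                   # what a plain cache sees
content_keys = {parse_chunk_url(u)[1:] for u in urls}  # {GAMEID, CHUNKID} pairs

print(len(url_keys), "distinct URLs for", len(content_keys), "distinct chunk")
```

A cache only scores a hit here when two machines happen to pick the same server for the same chunk.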
Using Squid's new StoreID feature, I'm able to map each {SERVERID, GAMEID, CHUNKID} vector to the correct {GAMEID, CHUNKID} and now see 100% cache hits for every download after the first. With the VM I'm using for testing, I'm seeing about 20 MB/s throughput for anything that has already been accessed by any other system, and that is limited by the VM's NIC maxing out. I expect to be seeing close to Gigabit throughput once I move this to my router with its SSD.
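For reference, a minimal sketch of what such a StoreID helper might look like in Python. This is not the exact script I ran. It assumes helper concurrency is disabled, so each input line is just the URL followed by optional extra fields, and it replies with `OK store-id=...` or `ERR` as I understand the helper protocol. The names `store_id_for`, `answer`, and `serve` are my own, and the canonical hostname is the made-up `.squid` one discussed in this post.

```python
import re

# Any valveNNN server, followed by the /depot/GAMEID/chunk/CHUNKID path.
STEAM_CHUNK = re.compile(
    r"^http://valve\d+\.cs\.steampowered\.com"
    r"(/depot/\d+/chunk/[0-9a-fA-F]{40})$"
)

# Deliberately invalid hostname so the store ID can't collide with a real URL.
CANONICAL_PREFIX = "http://valveX.cs.steampowered.squid"


def store_id_for(url):
    """Drop the SERVERID from a chunk URL; return None for anything else."""
    match = STEAM_CHUNK.match(url)
    return CANONICAL_PREFIX + match.group(1) if match else None


def answer(line):
    """Turn one request line from Squid into one reply line."""
    fields = line.split()
    if not fields:
        return "ERR"
    store_id = store_id_for(fields[0])   # first field is the URL
    return f"OK store-id={store_id}" if store_id else "ERR"


def serve(requests, replies):
    """Answer requests line by line until Squid closes the pipe."""
    for line in requests:
        replies.write(answer(line) + "\n")
        replies.flush()                  # Squid waits on each reply
```

To use it as an actual helper, the script would end with a call to `serve(sys.stdin, sys.stdout)` (after `import sys`) and be named in squid.conf's `store_id_program` directive. Everything that isn't a Steam chunk gets `ERR`, which leaves Squid's normal cache key alone.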
In hindsight, I think rewriting all the URLs to a consistent steamX.cs.steampowered.com is a poor choice. If you're going to rewrite URLs, you may as well go all in and rewrite them to an invalid hostname so there isn't the chance of colliding with some future change on Valve's part. A rewrite to something like valveX.cs.steampowered.squid likely prevents any future namespace problems. I really hope the documentation for StoreID catches up and starts presenting some best practices, because short of reading the code, I'm finding the existing documentation a little lacking...
Related rant: I really wish the Internet DNS system codified a top level domain for site-local use like IPv4 did in RFC1918 for the 10.0.0.0/8, 192.168.0.0/16 and 172.16.0.0/12 subnets. There exists a draft RFC from 2002 proposing "private.arpa.", but I'd like to see a shorter TLD like "lan." I personally use "lan.", but with how ICANN keeps talking about making TLDs a free-for-all, I dread the day that they make "lan." live.
In the end, the drag here is that Squid 3.4 is so new that there aren't any packages for it in Ubuntu or Debian. Even Debian bleeding edge is at 3.3.8. It's obviously possible to compile and run Squid 3.4.6 on your own, but I really hate trying to maintain software outside of the package manager unless I really have to. I don't see myself using this new StoreID feature until Ubuntu 16.04 unless Debian packages it really soon in Jessie and I'm somehow convinced to switch my router back to Debian.