Using Squid StoreIDs to optimize Steam's CDN
As part of my new router build, I’m playing around with transparent HTTP caching proxies.
Caching proxies are a really neat idea; when one computer has already downloaded a web page or image, why download it again when another device right next to it asks for the same image? Ideally, something between all of the local devices and the bottleneck in the network (namely, my DSL connection) would intercept every HTTP request, save all of the answers, and interject its own responses when it already knows the answer.
My setup is pretty typical for caching proxies. On my router, I have a rule in iptables that any traffic from my local 10.44.0.0/20 subnet headed for the Internet on port 80 should be redirected to port 3127 on my router, where I have a Squid proxy running in “transparent” mode.
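For reference, the redirect boils down to something like the following rule; the subnet and ports are from my setup above, but exactly where the rule belongs in your firewall is up to you:

```shell
# NAT any port-80 traffic from the LAN into the local Squid instance.
iptables -t nat -A PREROUTING -s 10.44.0.0/20 -p tcp --dport 80 \
    -j REDIRECT --to-ports 3127
```

On the Squid side, the matching listener is `http_port 3127 intercept`, which tells Squid to expect redirected traffic rather than explicit proxy requests.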
The basic transparent proxy deserves a post of its own once I finish polishing it, but for right now I’m writing this mainly as notes to myself, because the lead time on the next part is going to be pretty long.
My protocol-compliant caching proxy seems to be able to answer about 2-5% of HTTP requests from the local cache, which means those responses come back in 1-3ms instead of 40-200ms. 2-5% isn’t something to sneeze at, but it isn’t particularly profound either. Squid does let you write all kinds of rules about when to override a response’s caching metadata, or how to make up your own entirely. A common rule is:
    refresh_pattern -i \.(gif|png|jpg|jpeg|ico)$ 3600 90% 43200
which tells Squid to treat any image lacking explicit caching headers as fresh for 90% of its current age, clamped between a lower bound of 3600 minutes and an upper bound of 43200 minutes. This opens a whole rabbit hole of how deeply you want to abuse and mangle caching headers in the name of squeezing out a few more hits. I’ve played that game before, and it usually ends up causing a lot of pain, because incorrectly cached items tend to break websites in very subtle ways…
Another problem with caching proxies is the opposite of the previously mentioned over-caching. While that was an issue of a single URL mapping to different content over time, there is also the issue of multiple URLs mapping to the same content.
This is very common; any large content delivery network will have countless different servers each locally serving the same content. The apt repositories for Ubuntu or Debian are perfect examples of this: universityA.edu/mirror/ubuntu/packagename and universityB.edu/mirror/ubuntu/packagename are the same file, even though they have different URLs.
Squid, in version 3.4, finally added a feature called StoreID, which lets you work around this multiple-URLs-to-one-content problem. It allows you to have Squid pass every URL through an external helper program that mangles each URL, trying to generate a one-to-one mapping between URLs and content. I decided to play with this on the Steam CDN.
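Wiring a helper in takes only a few lines of squid.conf. A sketch, assuming the helper script lives at /etc/squid/steam_store_id.py (the path and the ACL name are my own; adjust to taste):

```
# squid.conf (Squid 3.4+): route URLs through an external StoreID helper.
store_id_program /etc/squid/steam_store_id.py
store_id_children 5 startup=1

# Only bother the helper with Steam CDN traffic.
acl steam_cdn dstdomain .cs.steampowered.com
store_id_access allow steam_cdn
store_id_access deny all
```

The store_id_access ACL matters: without it, every request on the proxy would take a round trip through the helper.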
When you download a game in Steam, it actually arrives as 1MB chunks from something on the order of four different servers at once. In the menu Steam - Settings - Downloads - Download Region you can tell Steam which set of servers to download from, but exactly which servers the client uses is beyond your control.
A typical Steam chunk URL looks like this:
http://valveSERVERID.cs.steampowered.com/depot/GAMEID/chunk/CHUNKID
- SERVERID is a relatively small number (two or three digits) and identifies which server this chunk is coming from. At any one point, a Steam client seems to be hitting about four different servers. valve48 and valve271 are two that I'm seeing a lot in San Jose, but the servers seem to come and go throughout the day.
- GAMEID is a number assigned to each game, although I've seen some games move from one ID to another halfway through the download. The largest game ID I've seen is in the high 50,000s. I strongly suspect that these are sequentially issued.
- CHUNKID is a 160-bit number written in hex. Presumably a SHA1 checksum of the chunk? I haven't bothered poking at it.