my previous aside explaining the basic concept of buffer bloat and what is making consumer internet such a miserable experience at times, let's go on to talk about what you can do to improve the situation.
Consumer Internet connections have been getting progressively faster, but this growth has been more than offset by the inappropriate growth of the buffer the modem uses to stage packets before it can fit them through the inevitably finite upload bandwidth available. As these buffers fill, packets take longer and longer to get uploaded, just as a longer line at Disneyland means you spend more and more time waiting. If they had limited the ride queues to only 15 minutes' worth of people, you could have gotten on and ridden Star Tours thrice over instead of doing nothing but waiting in line for Space Mountain. Faster Internet connections mean that you can download and upload larger files in less time, but the vast majority of the time this isn't the property of your Internet connection that you even care about.
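To put rough numbers on that waiting time: the worst-case delay a full buffer adds is its size divided by the upload rate. The following is a minimal Python sketch of that arithmetic. The function name and the 256 KiB / 1 Mbit/s figures are illustrative assumptions, not measurements of any particular modem.

```python
def queue_delay_seconds(buffer_bytes: int, uplink_bits_per_second: float) -> float:
    """Worst-case time a packet waits behind a completely full buffer."""
    if uplink_bits_per_second <= 0:
        raise ValueError("uplink rate must be positive")
    # Every byte ahead of the packet must be serialized onto the uplink first.
    return buffer_bytes * 8 / uplink_bits_per_second


if __name__ == "__main__":
    # Hypothetical modem: 256 KiB of buffer in front of a 1 Mbit/s uplink.
    delay = queue_delay_seconds(256 * 1024, 1_000_000)
    print(f"A full buffer adds about {delay:.2f} s of latency")
```

With these assumed numbers, a full buffer adds roughly two seconds of delay to every packet behind it, which is the same order as the multi-second stalls described here.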
When you're using an Internet browser, you usually care more about the latency of your Internet connection than your bandwidth. What difference does it make if a picture loads in 50ms or 100ms when it doesn't even start loading until 2 seconds after you clicked on it? Under heavy load, your modem becomes a black hole, where packets disappear for several seconds before hopefully reappearing out on the Internet. Even once the modem has run out of buffer and dropped a packet, it takes so long for news of the drop to reach the sender (the packet is never delivered to the receiver, who then has to give up waiting for it before deciding that it really is lost) that the damage has already been done. TCP continues to merrily dump packets onto the Internet, and a whole bunch of control theory mumbo-jumbo shows that the system is very likely to oscillate and be unstable (if not entirely collapse, as happened several times in the ARPANET and NSFNet days).
So this needs to be fixed. Jim Gettys et al. have already made remarkable progress on bringing the issue to light, and hopefully future Internet device manufacturers will follow the rules more closely. This is all well and good, but what can the home user do now to help mitigate this while waiting for someone to build a better modem? Many routers, such as the WRT54GL and the newer WNDR3800, are built using open source firmware, meaning that researchers and home users can reach in up to their elbows right now and start tweaking knobs in their routers to fix the issue. Another advantage of running a custom third-party firmware like Tomato comes to light...
The key to mitigating the modem's buffer is to keep the modem's buffer empty. Using a service like SpeedTest or Netalyzr, figure out how fast your modem can push data out onto the Internet, and then use your (much smarter) router to ensure that you never give the modem packets faster than that. By feeding it packets slower than it can upload them, we make sure the modem never has an excuse to start hoarding packets and destroying the uplink's latency, at the small cost of a few percent of gross bandwidth. This moves the critical buffer off of the modem and onto the router, which is much more under our control.
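As a small worked example of picking that limit, here is a Python sketch that shaves a margin off a measured upload speed. The function name and the 5% default headroom are assumptions for illustration only; the right margin for a given line is something to find by experiment.

```python
def shaped_upload_kbit(measured_kbit: float, headroom_percent: float = 5.0) -> int:
    """Return an upload limit slightly below the measured uplink speed.

    measured_kbit    -- upload result from a speed test, in kbit/s
    headroom_percent -- how much bandwidth to give up to keep the modem's
                        buffer empty (5% is an assumed starting point)
    """
    if measured_kbit <= 0:
        raise ValueError("measured speed must be positive")
    if not 0 <= headroom_percent < 100:
        raise ValueError("headroom must be in the range [0, 100)")
    # Round down: erring on the slow side keeps the queue in the router.
    return int(measured_kbit * (100 - headroom_percent) / 100)


if __name__ == "__main__":
    # Hypothetical speed-test result of 1000 kbit/s upload.
    print(shaped_upload_kbit(1000))
```

If latency still climbs under load with the resulting value, lowering the limit a little further is the obvious next thing to try.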
On a Tomato installation, visiting http://192.168.1.1/qos-settings.asp gives you the choice of enabling QoS, setting the maximum upload bandwidth (which you should set to a few percent less than the upload result you got from testing), and then allocating bandwidth to up to ten different priorities of traffic (ten is a little excessive; I can only come up with enough meaningful rules for five). You also have the option of controlling in-bound traffic, but remember that this traffic has already been handled by the DSL bottleneck in our network, so I tend to not bother with download QoS on edge routers (but do use it to rate-limit guest access points inside my network).
One thing that is not particularly clear on the basic settings page is what the two numbers for each class of traffic mean. I will go into this in more detail when I talk about the tc tool running the back-end of this, but they mean "this class is guaranteed this much bandwidth" and "this class can borrow from higher priority classes up to this much bandwidth." So in the screenshot above, I guarantee "high" priority traffic 20% of the bandwidth, and let it borrow unused bandwidth from the "highest" class up to 100%. Looking instead at the "low" and "lowest" classes, I guarantee them almost no bandwidth, and don't even let them borrow all of the bandwidth from higher classes. These classes of traffic are background jobs where latency is unimportant, so I'd rather let bandwidth go unused while making sure it's available for higher classes than try to have my BitTorrent cake and upload it too.
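To make the two numbers concrete, here is a Python sketch that turns a guaranteed percentage and a borrowing ceiling into kbit/s figures for each class. Only the 20% / 100% pair for "high" comes from the description above. The other percentages, the 950 kbit/s limit, and all names are hypothetical, and this is not a claim about how Tomato computes its values internally.

```python
from typing import Dict, Tuple

# class name -> (guaranteed %, ceiling %). Only "high" matches the text;
# the remaining rows are made-up values in the same spirit.
EXAMPLE_CLASSES: Dict[str, Tuple[int, int]] = {
    "highest": (30, 100),
    "high": (20, 100),
    "medium": (10, 100),
    "low": (2, 80),
    "lowest": (1, 50),
}


def class_limits_kbit(
    max_upload_kbit: int, classes: Dict[str, Tuple[int, int]]
) -> Dict[str, Tuple[int, int]]:
    """Convert per-class percentages into (guaranteed, ceiling) in kbit/s."""
    limits = {}
    for name, (rate_pct, ceil_pct) in classes.items():
        if not 0 <= rate_pct <= ceil_pct <= 100:
            raise ValueError(f"bad percentages for class {name!r}")
        # Integer division rounds down so the totals never exceed the limit.
        limits[name] = (
            max_upload_kbit * rate_pct // 100,
            max_upload_kbit * ceil_pct // 100,
        )
    return limits


if __name__ == "__main__":
    for name, (rate, ceil) in class_limits_kbit(950, EXAMPLE_CLASSES).items():
        print(f"{name:8s} guaranteed {rate:4d} kbit/s, may borrow up to {ceil:4d} kbit/s")
```

Capping the bulk classes below 100% is what deliberately leaves some bandwidth idle for the latency-sensitive classes.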
As for exactly how many classes and what bounds you use, that is more than anything else a matter of personal preference and experimentation. My rules are all written with the first priority being everyone else in the house not noticing how much I hammer the internet, but obviously your priorities may differ.
L7 or "deep" packet scanning). Tomato will take each packet and, starting at the top, work its way down until it finds a criterion that matches, and assign that class, so the order of the rules on this page does matter. For example, I give web traffic a high priority (since uploaded web traffic is usually short requests for content), but once a web connection transfers more than 512kB, I decide that the user is probably uploading something big, like a picture, instead of doing something latency-sensitive like requesting an HTML page.
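The top-down, first-match behavior can be sketched in a few lines of Python. This is only a model of the matching logic described above, not Tomato's actual implementation. The `Rule` type, its field names, and the "medium" fallback class are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class Rule:
    dst_ports: Tuple[int, ...]  # destination ports the rule applies to
    min_kb: int                 # connection must have sent at least this much
    max_kb: Optional[int]       # ...and less than this (None = no upper bound)
    traffic_class: str


def classify(dst_port: int, sent_kb: int, rules: Sequence[Rule],
             default: str = "medium") -> str:
    """Return the class of the first rule that matches, scanning top to bottom."""
    for rule in rules:
        if dst_port not in rule.dst_ports:
            continue
        if sent_kb < rule.min_kb:
            continue
        if rule.max_kb is not None and sent_kb >= rule.max_kb:
            continue
        return rule.traffic_class  # first match wins; later rules are ignored
    return default


WEB_PORTS = (80, 443)

# Small web connections are interactive; large ones are probably uploads.
RULES = [
    Rule(WEB_PORTS, 0, 512, "high"),
    Rule(WEB_PORTS, 512, None, "low"),
]

# A broad rule placed first shadows the more specific rule below it.
SHADOWED = [
    Rule(WEB_PORTS, 0, None, "low"),
    Rule(WEB_PORTS, 0, 512, "high"),
]

if __name__ == "__main__":
    print(classify(443, 10, RULES))     # small request
    print(classify(443, 600, RULES))    # past the 512 kB threshold
    print(classify(443, 10, SHADOWED))  # the "high" rule is never reached
```

The `SHADOWED` list shows why ordering matters: a catch-all rule near the top silently swallows everything the narrower rules beneath it were meant to handle.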
Again, how you write these rules depends most on what kinds of traffic your network tends to produce and which of that traffic you care about most. Getting the settings "right" on the first two screens is very much an iterative process based on qualitative measurements, lots of small changes, and watching results from the last two screens (and eventually asking the internal traffic control system exactly what is going on, but that's for my next blog post).
Now, to finish off this article, let's walk through my classification list and talk about my justifications for each rule. QoS design is a very qualitative art, so I not only expect, but hope, that you will disagree with my decisions and have your own ideas on what you think is important.
Domain Name System, which is needed to resolve web URLs into IP addresses. Every time a user types a URL into a web browser, the computer needs to first use DNS to turn that URL into an IP address before it can even start loading the website, so this traffic should be handled before any other. DNS traffic tends to be very small ("google.com" -> "184.108.40.206"), so handling this traffic before anything else should be trivial.
Moving down the list, any packets that look like they're SIP (the VoIP system I use myself) and small web requests (which are usually destined for ports 80 or 443) are given "high" priority. Again, the vast majority of web requests are small and trivial. Remember that the majority of web traffic is from the server to the client (which is why your internet connection's download bandwidth is so much greater than your upload), so we are not dealing with the lion's share of web traffic which is the content you ask for, but only the initial requests ("Facebook, please give me these 12 jpegs to look at"). Ideally, we would also look at this traffic from the rest of the internet and filter it, but that must be done on the far side of the DSL connection bottleneck; a point well beyond our control and why download QoS is usually ignored.
While most out-going web requests will be very small, only asking for the contents of specific documents, some web requests can become very large. While you will usually be looking at pictures uploaded by other users, you will intermittently decide to upload your own batch of photos. It's not unusual to see jpegs that are half a megabyte or more, so uploading an entire album of photos can quickly turn into a web request on the order of tens of MB. Of course, when you're uploading photos, you don't much care if it completes in 5 minutes or 5 minutes and 10 seconds, so while interactive requests for content are latency sensitive, these bulk uploads are much less so, and should be treated as "low" priority traffic. Prioritizing the small requests for content over the large uploads of new content means that you can open another tab while waiting for the upload to finish and continue to browse the web, without the upload making the Internet unbearably slow.
Finally, the last three rules are looking for traffic which is obviously large bulk uploads, and giving them the lowest priority.
Anything the router can identify as BitTorrent traffic, which is my P2P transfer client of choice, is clearly unimportant since interacting with that program is measured on the order of hours, not seconds. This L7 classification rule does unfortunately depend on being able to read the BitTorrent packets, which is easily defeated by encrypting the torrent connections. I personally make sure that any torrent clients on my network are configured to not encrypt traffic, so this single rule works quite well, but a more adversarial network may require more sophisticated rules. Some remote peers will require you to encrypt your traffic to defeat QoS on their end of the Internet, but these users will also often use port 53, since it must be open to traffic for the web to work at all. Adding the additional rule that any traffic seen on port 53 which is more than 2kB is clearly not actually DNS does an appreciably good job for this encrypted traffic.
Finally, I have the last rule that any connection which uploads more than 4MB must be of the lowest priority. Of course, scenarios where a single connection could upload this much data while latency is still important do exist; any real-time video game will likely generate this much data during the course of a game, and would be inappropriately classified as lowest priority. I do not see any gaming traffic on my network, so I ignored this possibility. You may well not be able to make the same decision.
If the router has made it all the way to the bottom of this list of filters, and still hasn't managed to find a match for a connection, the QoS system will then look to the setting in the Basic Settings tab for the default traffic class. I decided to treat this traffic as the unknown middle-ground between latency-sensitive and obviously bulk traffic and classify it as medium priority, but others will argue that if they can't decipher what the traffic is, it must be either significantly lower or higher priority than anything matching a written rule, and change the default class accordingly. Again, your choice for the default behavior is your own.
For even more detail, and probably a better design overall, Toastman on the Linksys forums has some excellent write-ups on the topic: QoS Tutorial, Example Rule Set.
So that's the glossy exterior of the Tomato QoS system. Using just these four pages, you can write a set of rules which will do a decent job of making your Internet experience less painful in the face of adverse circumstances. In the next part of this series, I'll lift the hood on the QoS system and discuss the mechanics of exactly how the traffic is treated between when it is classified and when it is finally sent off to the Internet. This next part of the system doesn't lend itself to being easily reconfigured in the Tomato firmware, but understanding the router's queuing behavior and being able to monitor the queue buffers can be useful while tuning your QoS rules in the web interface.