Monday, December 31, 2018

FCIX - State of the Exchange

The First Year of Running an Internet Exchange

It has been a little over a year since one of my friends challenged me on a whim to get an autonomous system number and stand up my own little corner of the Internet, and what a long slippery slope that has been. One of the advantages of running your own autonomous system is that you can blend your own connectivity to the Internet via peering, so as we kept making more friends in the Hurricane Electric FMT2 data center whom we wanted to peer with, the number of desired cross connects started to get out of hand, particularly since they aren't free, and we have all of about zero budget for... just about everything.

It's that quadratic growth of the number of interconnects in a full mesh that really gets you.

But that's exactly the problem that Internet Exchange Points are meant to solve: you have N networks that want to interconnect in one place, and rather than running O(N²) Ethernet cables between each other, everyone connects to one central Ethernet switch, which is just as effective at a much lower cost of entry.
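
To make that concrete, here's a quick back-of-the-envelope sketch in Python; the member counts are just example numbers:

def full_mesh_links(n):
    """Every pair of networks needs its own cross connect: n choose 2."""
    return n * (n - 1) // 2

def ixp_ports(n):
    """With a shared switch, each network needs just one port and one cross connect."""
    return n

for n in (5, 15, 25):
    print(f"{n} networks: {full_mesh_links(n)} full-mesh cross connects "
          f"vs {ixp_ports(n)} switch ports")
# 25 networks: 300 full-mesh cross connects vs 25 switch ports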

So eight months ago, we jokingly registered the domain FCIX.net, grabbed a spare /24 + /64 of address space we had laying around, and founded the Fremont Cabal Internet Exchange.

We had a good laugh setting up a cheeky little website to make it look like we're a real Internet Exchange, which lasted all of about two weeks before the owner of the data center, Hurricane Electric, applied to join our Internet Exchange, brought in the 75,000 prefixes from their customer cone, and put us on their advertising material for the building.
Well crap. That joke got out of hand rather quickly.

Membership Growth

For the first few months there, we were handling two or three new membership applications per week, so it became evident that we needed to get our act together rather quickly.

The first order of business was dealing with the fact that we were running this no-longer-just-between-friends Internet Exchange on borrowed address space, so we needed to get the exchange its own ASN and /24 + /64. But that's $550 for the ASN and $250 for the resource service agreement for the address space... A bit of a problem when we have zero budget...

But at that point we had the advantage of having ~15 current + pending members, so we passed around the hat and between our membership and some other very amused on-lookers, very quickly managed to scrape together the $800 needed to cover the registration costs of FCIX's resources.

Not only did this enable us to re-number onto a real ASN and IXP address space, but getting such a concrete signal of support from our members was touching. This thing wasn't much of a joke anymore; we're actually providing a service to networks which they value enough to throw a few hundred bucks our way to make it happen.

New membership applications have slowed down, but as of this writing we're up to 25 members, which I don't think is half bad for an exchange in the east bay that is less than a year old and runs at a single site.

Sponsors

So after getting a round of donations to cover the start-up costs for the ARIN resources, the next question was how to handle getting networks actually connected to FCIX. Originally, we were just running FCIX as a VLAN on the Cisco 6506 powering my personal network (AS7034), but that suffered from a few issues, the largest being that the 6506 is so old that 10G Ethernet was a cutting-edge feature when it was designed. At best, 10G line cards for the 6500 support 16 ports, but the 10G line card we had managed to scrounge for AS7034 only had four XENPAK ports and burned 100W per port, so offering 10G for FCIX was going to be problematic at such a low port density, even ignoring the hassle of sourcing more line cards and such vintage optics.

This is where Arista stepped in and contributed to FCIX in a huge way. I got a call from a long-time friend who works at Arista, who liked what we were doing and was interested in getting us a real switch to run the exchange on. The result is that we got a pair of Arista 7050S-64 switches, which have 48 SFP+ ports that support either 1G or 10G optics, plus another four QSFP+ ports for 40G, because hey, maybe we'll need 40G at some point...

That left only the issue of optics. Every member port that we turn up needs an LX or LR optic, which even from third-party vendors starts to add up quickly (remember how we have zero budget?), so we were very rapidly tapping out the junk bins of left-over optics we all had lying around. While we sat there brainstorming ways to work around this sustainability problem, we got an awesome direct message on Twitter from Flexoptix!

Flexoptix is a third-party optical transceiver vendor with an additional advantage: they sell what they call the "FLEXBOX", which lets you insert one of their optics and reprogram it over USB for whichever vendor's switch you need. So even though we've got that scrappy "we'll use whatever switch we can drum up" aesthetic to us, we only need to stock one tray of 1G and 10G optics to cover any possible switch we'd want to plug these optics into. Furthermore, as we moved from my Cisco 6506 to our shiny new Arista, we were able to simply reprogram the optics and reuse them, so the flexibility of their universal transceivers has already borne fruit.

The Cost of Entry

Having been started originally as mostly a joke, we've been very against charging any kind of membership fee to join FCIX, for multiple reasons:

  1. There are already several established pay-to-play IXPs in the bay area, so trying to charge for ours when we have none of the valuable peers that the existing exchanges have would be kind of silly.
  2. We all have real day jobs, so if someone paid to join FCIX and it then stopped working during the day, they'd rightfully kind of expect us to get it working again ASAP. My day job boss probably wouldn't appreciate that, so we don't charge anything, and problems get fixed when we can get to them. Outages, of course, are refunded in full. (see what we did there?)
This zero-cost-of-entry, zero-membership-fee model definitely wouldn't have been possible without all the donations we've gotten, and particularly Flexoptix donating trays of optics so we can light our end of every new member's port without needing each new member to somehow contribute an optic to FCIX to light their port.

This has led to a few issues with people trying to abuse our free model, so we very quickly needed to institute an informal "one port per cabinet" rule, since we at one point got six applications from different people sharing one cabinet, and although Hurricane Electric is sponsoring all the cross connects, I'm not going to abuse that deal to run six pieces of fiber to one cabinet. Charging a one-time turn-up fee like SIX does probably would have been a good idea and would have prevented most of our issues with low-quality membership applications.

Plans for Growth and Value Add

At this point the basic framework for the exchange has been set up. Adding new members is relatively painless and mainly involves generating them a letter of authorization to redeem for a free cross-connect from HE and adding them to a CSV which propagates to the website and route servers automatically.
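
For flavor, here's a hypothetical sketch of the "one CSV drives everything" idea. The column names, the documentation-prefix addresses, and the BIRD-flavored output are all invented for illustration; this is not FCIX's actual tooling:

import csv, io

# Hypothetical member list -- documentation addresses and made-up ASNs.
members_csv = """asn,name,ipv4
65001,Example Net One,192.0.2.10
65002,Example Net Two,192.0.2.11
"""

# A BIRD-flavored stanza as one possible output; the real route server
# config format isn't documented here, so treat this as illustrative only.
TEMPLATE = """protocol bgp member_{asn} {{
    description "{name}";
    neighbor {ipv4} as {asn};
}}
"""

for row in csv.DictReader(io.StringIO(members_csv)):
    print(TEMPLATE.format(**row))
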
The most exciting piece of news with regards to growth is that Hurricane Electric has agreed to sponsor FCIX with a second cabinet in their other building, FMT1. That, plus a pair of dark fibers between the two buildings (thanks HE!), plus a pair of 40G LR4 optics (thanks Flexoptix!), plus a second switch (thanks Arista!), and FCIX will soon be multi-site! Membership applications from FMT1 are now open.

The other challenge has been coming up with projects to focus on to help increase the value of the exchange to existing members. Adding new members is easy, but we have also been working on getting things like cache appliances and DNS servers on-net. Verisign was kind enough to contribute a J root + .com/.net DNS server to the exchange, so anyone running their own recursive DNS resolver gets to enjoy direct access to J root and B.gTLD over the exchange. Work on other value-add appliances is ongoing.

As we head into the new year, I couldn't be happier with or more grateful for how far we've gotten with this project while keeping it sustainable. The annual expenses specific to the exchange are still below $500 between our ARIN fees and other miscellaneous costs, and contributions of hardware and money from sponsors and members have enabled us to grow much further than we would have been able to fund on our own.

So thank you again to everyone involved in FCIX, and I wish all of you a lovely 2019.

Monday, July 16, 2018

A Quickstart Guide to IRR & RPSL

In case you are in the very limited pool of people in the world who are responsible for an Autonomous System and need to set up IRR for peer route filtering, I've written a whitepaper on the matter for FCIX.

If you aren't a network engineer for an ISP, I suspect that article will be less interesting on its own.

Sunday, July 8, 2018

Measuring Anycast DNS Networks Using the RIPE Atlas

As I have previously mentioned, I set up a RIPE Atlas probe in my autonomous system to help measure the Internet and accrue Atlas credits to spend on my own custom measurements. As I have been looking for networks to invite into our newly created Internet Exchange, anycast DNS operators are good candidates, since they're networks that want to be at a lot of IXPs (Internet Exchange Points) and tend to be tiny (typically a single 1U server or VM), so we are able to just offer them free hosting as part of the exchange.

Anycast DNS is where you pick one global IP address that all your users query against, and instead of tolerating the nuisance of the quite limited speed of light (lol) back to a single server, you create multiple identical copies of your server (even down to using the same IP address) and distribute them across the entire Internet. So identical that it really doesn't matter which server the query goes to; they all come back with the same answer anyway. This ability to have the same IP address in multiple places is made possible by dedicating a whole /24 IPv4 block and a /48 IPv6 block to the server's public-facing address (those being the smallest blocks generally accepted in the global BGP routing table). You then have each server advertise those two blocks into BGP, and as networks see the same advertisement from each of the distributed servers, they use the typical BGP best path selection process to select the "closest" server (where closest is unfortunately in the sense of network topology and business policy, not necessarily geographic distance or latency like you'd hope).

So this is great; anycast is used to make it seem like a single server is beating the speed of light with how fast it's able to answer queries from across the globe, and is a popular technique used by practically all of the root DNS servers, as well as many other DNS servers, be they authoritative DNS servers for specific zones or public recursive DNS resolvers like those provided by Google or CloudFlare.

So my challenge to myself has been to find any of these anycast networks who would plausibly be interested in tying into our IXP, and would benefit from an anycast node added in California.

This. This is where the RIPE Atlas credits I've accumulated until now come in super handy. I have the entire global network of RIPE Atlas probes at my disposal to poke at anycast servers, measure their performance, and see how good their global deployments really are.

The process consists of these major steps:
  1. Trawl the Internet for anycast DNS services and "Top [Number] DNS Servers You Should Use Instead of Whatever You're Using" articles listing popular DNS services (oddly, often in a very similar order to the other "Top [Number] DNS Servers" articles...) and collate all of these into a list of potential DNS networks to measure.
  2. Craft a DNS query which can be sent to each instance of an anycast server and its response time measured to give a good indication of how well the anycast network is performing.
  3. Roll this query out to 500 RIPE Atlas probes and spend some super exciting evenings analyzing the data to try and discover what each DNS network's current topology is, and if they'd seemingly benefit from a node in Fremont. 
So that first step isn't too bad; that's just some Google-Fu and a willingness to really confuse ad networks with my sudden willingness to read click-bait articles. 

The second step is where it starts to get interesting, because we need to somehow measure how fast the closest anycast node can answer a DNS query. There are DNS benchmarking tools which run through a few hundred of the most popular websites and try to resolve them to see how quickly the DNS resolver responds, but those measurements are a little problematic for what I'm looking for, because they don't only measure the network round-trip time to the anycast node, but also how hot its cache is and how well connected it is to the actual authoritative DNS servers. If I were an end user looking to pick the best DNS server, how quickly it can answer real DNS queries would be important, but that's not really what I'm interested in. I want to know how quickly a random place on the Internet can get a query to this server, and once the DNS server has done all the work of resolving the query into an answer, how long it takes to get that answer back to the client. (In the case of authoritative DNS servers which don't provide open recursive resolution, these "top 500 websites" benchmarks are worthless, since the authoritative servers will only answer queries for zone files they're hosting themselves.)

So really, I need a DNS query which the anycast servers can definitely answer themselves locally, without having to send off queries to other DNS servers and walk the DNS hierarchy trying to find the answer. And hang up the rotary phone, guys, because most servers actually support a query that meets this criterion, and it even usually returns a TXT answer which identifies which specific instance of the anycast network you queried, which is much easier than running a traceroute from each probe to the DNS server and trying to analyze 500 traceroutes to figure out which end node each one happens to be getting routed to.

The "id.server" domain name (as documented in RFC4892) is a handy TXT record, which (for presumably very histericalorical reasons) is a CHAOS class domain instead of an IN class domain (like every other DNS query you're used to) and tells you a unique name for each instance of a DNS server. By convention, global things like backbone routers and anycast DNS servers identify themselves by the nearest IATA three letter airport code, so knowing that, it's often quite successful figuring out generally where a DNS answer is coming from.

A Specific Example

So as an example, let us study one of the anycast DNS resolvers out there; namely, UncensoredDNS, an open resolver based out of Denmark. They happen to have both a unicast node hosted in Denmark, as well as an anycast IP address (91.239.100.100, 2001:67c:28a4::) hosted from multiple locations. 

The first thing to do is confirm that this DNS operator has actually implemented id.server, so we can use a local tool like nslookup to send the CHAOS-class id.server query and see if it even works.
nslookup -class=chaos -type=txt id.server 91.239.100.100

Gives us an answer of: "deic-lgb.anycast.censurfridns.dk"

Excellent! So this tells us two things:
  1. They answer to id.server, so it's possible for us to run that query using the Atlas network to get a good perspective on where they are.
  2. They seemingly put something useful in their id.server string, namely perhaps a hosting network name and a city designator (although LGB is the IATA code for Long Beach, California, and based on the 153ms ping time from Fremont, California, I doubt it's really in the same state as me). So maybe not entirely IATA city designators, but something unique at least...
So now we need to create a custom Atlas measurement to run this query from everywhere and see what comes back and how quickly. 
From the "My Atlas" dashboard, I select "Create a New Measurement". The measurement creation wizard has three parts:
  1. Measurement definition: This is where we specify to query "CHAOS: id.server" against the DNS server's IP address
  2. Probe selection: This is where you decide how many probes you're willing to spend credits on to take this measurement. Since these measurements are relatively cheap, I max this out at the allowed 500 probes.
  3. Timing: Since I'm not trying to monitor these networks but just characterize them, I just make absolutely sure that I check the "This is a one-off" box so the measurement doesn't get re-run on a schedule.
For the test definition, add a DNS measurement for each DNS server address you're interested in querying (I kind of wish there was a "clone this other measurement definition" button, but regardless you can create multiple measurements to run against the same probe set). When filling out the test definition, the important fields are the target (the DNS server of interest), the query class (CHAOS), the query type (TXT), and the query argument ("id.server"). I also give the test a meaningful description so I stand a chance of finding it later, and tagged it "dns" so my DNS measurements are all generally grouped together.

If you're scheduling a long-running measurement, the interval field would be important, but since I'll be checking the "This is a one-off" box in the timing options, this interval field will eventually disappear.
By default, measurements are run on ten probes randomly selected from anywhere in the world, but we want a larger test group than that, so press the X on the "Worldwide 10" selection and "+ New Set - wizard" to open the map to add more probes.
Ideally, you would be able to just type in "Worldwide", select it, say you want 500 probes, and be done with it, but the problem is that since RIPE is the European RIR, the concentration of probes in Europe is unusually high, so an unspecified "worldwide" selection tends to have the majority of its probes in Europe and a thin selection everywhere else. Since I'm particularly interested in the behavior across the continental US, but also curious what each anycast server's behavior looks like worldwide, I've settled on a bit of a compromise of selecting 250 probes from "worldwide" and 250 probes from "United States". This ensures a good density in the US where I need it, but still gives me a global perspective in the same measurement.
Type in worldwide, click on it in the auto-completion drop-down, and it should pop up a window on the right asking you how many probes you'd like to select. I choose 250, and press yes.
It will then ask you which tags you'd like to include or exclude when selecting probes. The system automatically tags Atlas probes with various categories like what hardware version they are, or how stable their IP addresses are, or if they're able to resolve AAAA records correctly, etc. I'm unsure if Atlas is smart enough to do this already, but particularly when measuring an IPv6 DNS server, I make sure to specify that probes tagged "IPv6 Works" or "IPv6 Stable 1d" should be used. For an IPv4 measurement, filtering on tags is probably not critical, but I usually select "IPv4 Stable 1d" just for good measure.
Click add and the 250 worldwide probes get added to the sidebar, and repeat the same process for the United States to get up to the maximum 500 probes. Once that's done, press OK and it returns to the measurement form.
The probe selection box should then be filled in with the 250 + 250 selection. In the Timing section, simply check the "This is a one-off" box, and the measurement(s) are ready to launch.

Press the big grey "Create My Measurement(s)" button.
At this point, if all goes well, it should pop up with a window giving you a hyperlink to your measurement results. The results won't be immediately available, but for one-off DNS measurements it should only take a minute or two before most of the results start populating on the results page. Link to the results of this example measurement.
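
If you'd rather script this than click through the wizard, the same one-off measurement can also be created via the RIPE Atlas REST API. The sketch below reflects my reading of the v2 API at the time of writing; the endpoint, field names, and the placeholder API key are all things to double-check against the current documentation:

# Hedged sketch: create the same one-off id.server measurement via the
# RIPE Atlas v2 REST API. Requires `requests` and an Atlas API key with
# enough credits; verify field names against atlas.ripe.net's API docs.
import requests

ATLAS_API = "https://atlas.ripe.net/api/v2/measurements/"
API_KEY = "YOUR-ATLAS-KEY-HERE"   # placeholder

payload = {
    "definitions": [{
        "type": "dns",
        "af": 4,
        "target": "91.239.100.100",        # UncensoredDNS anycast address
        "query_class": "CHAOS",
        "query_type": "TXT",
        "query_argument": "id.server",
        "use_probe_resolver": False,        # query the target directly
        "description": "id.server survey: UncensoredDNS",
    }],
    "probes": [
        {"type": "area",    "value": "WW", "requested": 250},  # worldwide
        {"type": "country", "value": "US", "requested": 250},  # extra US density
    ],
    "is_oneoff": True,                      # the all-important one-off flag
}

resp = requests.post(ATLAS_API, json=payload,
                     headers={"Authorization": f"Key {API_KEY}"},
                     timeout=30)
resp.raise_for_status()
print("measurement IDs:", resp.json().get("measurements"))
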
For the measurements I'm doing, the most useful results page is the map, where it color codes the results by latency, and you can then click on any single dot to see which probe that is and what result it got for the TXT query we requested.
Looking at the map, you can see some green hotspots in Europe, so their coverage in Europe seems to be pretty good, and there's a green gradient from east to west across the US, so there's seemingly a node on the east coast. Clicking on any of the green east coast probes tells us that their TXT result was "rgnet-iad.anycast.censurfridns.dk", where IAD is the IATA code for Washington DC, so chances are the east coast node is there.

Browsing around on the west coast, there are zero green measurements, and most of them are in the ~150ms range, answered by servers located in Europe, so this result actually points to this being a reasonably good candidate of an anycast network to try and bring into our exchange...
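
And if you'd rather crunch the raw numbers than browse the map, the results are also available as JSON. Here's another hedged sketch; the measurement ID is a placeholder, and the result field names ("result", "rt", "abuf") are my reading of the Atlas DNS result schema, so verify them against the docs:

# Hedged sketch: pull a one-off DNS measurement's results and group probes
# by which anycast instance answered (decoded from the abuf field) and the
# reported response time. Requires `requests` and `dnspython`.
import base64
from collections import defaultdict
from statistics import median

import dns.message
import requests

MSM_ID = 12345678   # placeholder measurement ID

url = f"https://atlas.ripe.net/api/v2/measurements/{MSM_ID}/results/"
results = requests.get(url, timeout=30).json()

rtts_by_instance = defaultdict(list)
for entry in results:
    reply = entry.get("result")
    if not reply or "abuf" not in reply:
        continue                                # timeouts, errors, etc.
    msg = dns.message.from_wire(base64.b64decode(reply["abuf"]))
    if not msg.answer:
        continue
    instance = msg.answer[0][0].to_text().strip('"')   # e.g. rgnet-iad....
    rtts_by_instance[instance].append(reply["rt"])     # response time in ms

for instance, rtts in sorted(rtts_by_instance.items()):
    print(f"{instance:45s} probes={len(rtts):3d} median_rtt={median(rtts):6.1f} ms")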

Wednesday, June 13, 2018

Peering with Root DNS Servers

The Domain Name System is a recursive system for resolving host names to IP addresses and from IP addresses back to host names, which is really handy, since ideally no one interacts with IP addresses and instead refers to servers by names like google.com or blog.thelifeofkenneth.com.

When you're resolving a hostname like blog.thelifeofkenneth.com, it's actually a multi-step process where you first figure out which name servers are authoritative for the .com domain, then ask one of them where the name server for thelifeofkenneth.com is, then ask that server what the address for the "blog" host is. This is a well documented process elsewhere, but what I'm particularly interested in is that first little step where you somehow find the first DNS server; this is done by asking one of the 13 root name servers, which are 13 specific servers (lettered A through M) hard-coded into every recursive DNS implementation as a starting point for resolving any other address.
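
As an aside, each link in that chain is easy to poke at yourself. Here's a small dnspython sketch that simply asks your local resolver for each level of the delegation; a real recursive resolver walks this iteratively from the root hints rather than asking another resolver, so treat it as an illustration of the shape of the lookup, not the mechanism:

# Peek at each level of the delegation chain for this blog's hostname.
# Requires dnspython 2.x (older versions call resolve() "query()").
import dns.resolver

resolver = dns.resolver.Resolver()

for qname, rdtype in [(".", "NS"),                          # the 13 root servers
                      ("com.", "NS"),                       # who serves .com
                      ("thelifeofkenneth.com.", "NS"),      # who serves the zone
                      ("blog.thelifeofkenneth.com.", "A")]: # the actual answer
    answer = resolver.resolve(qname, rdtype)
    names = ", ".join(sorted(r.to_text() for r in answer)[:4])
    print(f"{qname:32s} {rdtype:3s} -> {names}")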

The reason I'm interested is that I recently became part of the team running the Fremont Cabal Internet Exchange. IXPs often peer with root name servers to make the fabric more valuable, since root name servers tend to be really important to the other networks connecting to the IXP. This is possible because many of the root servers aren't implemented as one enormous DNS server in a specific place like you'd imagine, but are actually many identical copies of the same server, each advertising the same anycast prefix.

This means that even though we're a small IXP in the bay area, we actually stand a chance of an instance of several of the root servers being close by, or of their operators being willing to ship us equipment to host an instance. We have spare rack space, so providing them the space and power to host their hardware is worth it to increase our value and make the Internet generally better.

For curiosity's sake, I've been stepping through the list of root DNS servers to try and find what information I can on them, and figured these notes would be useful for some small fraction of other people online.


  • A ROOT - Run by Verisign
    • Homepage
    • Status: Only hosted in six locations: Ashburn, Los Angeles, New York, Frankfurt, London and Tokyo
  • B ROOT - Run by University of Southern California, Information Sciences Institute
    • Homepage
    • Status: Only hosted in Los Angeles and Miami
  • C ROOT - Run by Cogent
    • Homepage
    • Status: Only hosted in 10 locations: LA, Chicago, New York, etc.
  • D ROOT - Run by University of Maryland
    • Homepage
    • 136 Sites
    • Partially hosted by Woodynet (AS42), which means they're already in FMT2
  • E ROOT - Run by NASA Ames Research Center
    • Homepage
    • Status: 194 sites
    • Also partially hosted by Woodynet (AS42), which means they're already in FMT2
  • F ROOT - Run by Internet Systems Consortium
  • G ROOT - Run by Defense Information Systems Agency
    • Homepage
    • Status: Only 6 sites, none in California
  • H ROOT - Run by US Army Research Lab
    • Homepage
    • Status: Only 2 sites: San Diego and Aberdeen
  • I ROOT - Run by netnod
    • Homepage
    • Hosting Requirements: Contact info[at]netnod[dot]se
    • Peering Requirements
    • Status: 68 sites, including one somewhere in San Francisco 
  • J ROOT - Run by Verisign
    • Homepage
    • Requirements include:
      • 1U space, 2x power
      • 2x network, peering LAN and /29+/64 management interface
    • Status: Already somewhere in San Francisco 
  • K ROOT - Run by RIPE
    • Homepage
    • Hosting Requirements include:
      • Provide a Dell server with 16GB RAM, quad core, 2x500GB HDD, etc.
      • Public IPv6 address with NAT64
    • Status: Seems the physically closest one is on TahoeIX in Reno.
  • L ROOT - Run by ICANN
    • Homepage
    • Hosting Requirements:
      • Sign NDA
      • Purchase code named appliance to host inside own network
    • Status: Somewhere in San Jose, per their FAQ they are not joining any additional IXPs.
  • M ROOT - Run by WIDE Project
    • Homepage
    • Status: Somewhere in San Francisco, nine sites total

Summary:
  • Roots that will never be in the bay area: A, B, C, G, H
  • Roots already in the bay area: D, E, F, I, J, L, M
My rationale for the first list is that several of the root servers only have 2-10 instances spread across the world, so they're presumably not in the business of deploying the 100-200 anycast nodes that several of the other ones are. If they don't happen to already be in the bay area, it's not like we can afford to lease fiber out to where they already happen to be. 

The roots on the "already in the bay area" list are also problematic, since our exchange currently is only in Hurricane Electric's building, so if they already have a local node, it's unlikely that we would be able to convince them to build another node in the east bay just for us unless they already happen to be co-located with us.

But you'll notice that between those two lists, there are only 12 roots... K root isn't in the bay area.
So I did some more digging. Using the Atlas probe in my rack, I can see that K root is currently 70ms away from us, so it has a not-quite-optimal latency to the bay area. It looks like it's currently reachable via its node in Utah, but it has a physically closer node in Reno connected to the TahoeIX.

TahoeIX is interesting for two reasons:
  1. They're a fantastic example of another tiny IXP that has done a remarkably good job of collecting value-add peers for their network: Verisign, PCH/WoodyNet, Akamai, AS112, K root, and F root.
  2. Hurricane Electric is in "provisioning" with them, so presumably at some point soon, HE will have access to K root from Tahoe, dropping its latency to the bay area quite a bit.
So this opportunity posed by K root being the last root server not yet built out in the bay area very well might disappear soon. This is a bit of a drag since that might make RIPE less likely to entertain us hosting yet another California node, and the K roots don't come for free. We would need to provide a Dell server meeting all of their specifications, which I just priced out at $1475 for a Dell R230.

Bummer.

So, at this point, it's possible that I will be able to get F root to join FCIX, since they're just a cross connect away in the same building (and I happen to already be friendly with them), plus D and E if they happen to be on the local AS42 node. I, J, and M are in the area, so getting them on the fabric is conceivable except that they aren't in our building, so that problem would need to somehow be solved. And K is currently several states away, so I'd need to convince them that yet another west coast node is worth their bother, and we'd need to pony up $1500 to get the gear they require.

There are other value-add networks we can work on getting on our IXP, such as AS112 to trap bogon DNS requests, and CDNs like Akamai, CloudFlare, and (maybe?) Netflix. It also seems like there are 13 DNS servers for the gTLDs, but I can't find much information on who hosts those or how they're rolled out. Presumably they're hosted by just Verisign, so one of those would come with a J root.

Overview of Fiber Optic Transceivers

Video: [embedded video: an overview of fiber optic transceivers]

Thanks again to Arista Networks and FlexOptix for helping make FCIX possible.