Sunday, February 3, 2019

Running NAT64 in a BGP Environment

IPv6 is one of those great networking technologies, like multicast, which makes your life as a network operator much easier, but is generally pretty useless outside of your own network if you expect it to work across the Internet.

People like to focus on how the percentage of traffic going over IPv6 has slowly been creeping up, but unfortunately until that number reaches practically 100%, an Internet connection without IPv4 connectivity is effectively broken. Ideally, everyone runs dual-stack, but the whole shortage of IPv4 addresses was the original motivator for IPv6, so increasingly networks will need to run native IPv6 plus some form of translation layer for IPv4, be it carrier grade NAT, 464XLAT, or as I'll cover today, NAT64.

NAT64 is a pretty clever way to add IPv4 connectivity to an IPv6-only network, since it takes advantage of the fact that the entire IPv4 address space can actually be encoded in half of a single subnet of IPv6. Unfortunately, I also found it somewhat confusing since there are actually quite a few moving parts in a functional NAT64-enabled network, so I figured I'd write up my experience adding NAT64 to my own network.

There are really two parts to NAT64, and to make it a little more confusing, I decided to add BGP to the mix because it's a Sunday and I'm bored.

  1. A DNS64 server, which is able to give a AAAA response for every website. For what few websites actually have AAAA records, it just returns those. For websites which are IPv4 only, it returns a synthetic AAAA record which is the A record packed into a special IPv6 prefix. The RFC recommended global prefix for this packing is 64:FF9B::/96, but there could possibly be reasons to use any /96 out of your own address space instead (I can't really think of any). I'm using the standard 64:FF9B::/96 prefix mainly because it means I can use Google's DNS64 servers instead of running my own. 
  2. A NAT64 server, which translates each packet destined for an IPv4 address packed in this special IPv6 /96 prefix to an IPv4 packet from an IPv4 address in the NAT64's pool destined for the actual IPv4 address to be routed across the Internet. My NAT64 server goes a step further and NATs all of these IPv6-mapped IPv4 addresses to one public address, since mapping IPv6 to IPv4 addresses one-for-one doesn't gain me much.
  3. BGP to inject a route to 64:FF9B::/96 towards the NAT64 server into my core router. Realistically, you could just do this with a static route on your router pointed towards your NAT64's IPv6 address, but I want to be able to share my NAT64 server with friends I'm peered with over FCIX, and since BGP is a routing policy protocol, it helps. If you're not running a public autonomous system, just ignore anything I say about BGP and point a static route instead.
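
The address packing described in step 1 can be sketched in a few lines of Python. This is only an illustration of embedding an IPv4 address in the low 32 bits of a /96 prefix using the standard library's ipaddress module, not how any particular DNS64 server is implemented. The function names synthesize_aaaa and extract_ipv4 are made up for this example, and 192.0.2.1 is a documentation address standing in for a real A record.

```python
import ipaddress

# The standard NAT64 prefix discussed above.
NAT64_PREFIX = ipaddress.IPv6Network("64:ff9b::/96")


def synthesize_aaaa(ipv4, prefix=NAT64_PREFIX):
    """Pack an IPv4 address into the low 32 bits of a /96 prefix."""
    if prefix.prefixlen != 96:
        raise ValueError("this sketch only handles /96 prefixes")
    v4 = ipaddress.IPv4Address(ipv4)
    return ipaddress.IPv6Address(int(prefix.network_address) | int(v4))


def extract_ipv4(ipv6, prefix=NAT64_PREFIX):
    """Recover the IPv4 address, as the NAT64 side has to do."""
    addr = ipaddress.IPv6Address(ipv6)
    if addr not in prefix:
        raise ValueError("address is not inside the NAT64 prefix")
    return ipaddress.IPv4Address(int(addr) & 0xFFFFFFFF)


if __name__ == "__main__":
    synthetic = synthesize_aaaa("192.0.2.1")
    print(synthetic)                     # 64:ff9b::c000:201
    print(extract_ipv4(str(synthetic)))  # 192.0.2.1
```

The round trip is lossless, which is why the NAT64 server needs no lookup table to find the real IPv4 destination: it is sitting in the last 32 bits of the destination address.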

So to do this, I'm using a software package called Tayga, along with these numbers (to make my specific example clearer):

  • is the public IPv4 address of my NAT64 server. You would replace this address with whatever IPv4 address you wanted all your NATed traffic to be sourced from in the end.
  • 2620:13B:0:1000::4/64 is the public IPv6 address of my NAT64, which is the gateway to the special /96 prefix containing all IPv4 addresses. This address again needs to be something specific to your network, and is mainly important because it's where your route to the /96 needs to point.
  • 64:FF9B::/96 is the standard prefix for NAT64. There's no reason to change it to anything else.
  • is an arbitrary big chunk of local-use-only addresses I picked for this project. For how I have my NAT64 server configured, it really didn't matter what prefix of CGNAT or RFC1918 address space I picked here, since it's really only used as an intermediary between the client's actual IPv6 source address and the single public IPv4 address that they're all NATed to. A /16 is plenty of address space, since trying to NAT even 64k hosts to a single public IPv4 address is going to be a bad time. Theoretically there are ways to use CGNAT concepts to use a whole pool of IPv4 addresses on the public side, but one address is plenty for my needs.
  • AS4264646464 is an arbitrary private-use ASN I picked to be able to set up an eBGP peering session between my NAT64 server and my core AS7034 router.

The source IPv6-only host makes a DNS query against a DNS64 server, which encodes IPv4 A records into the bottom 32 bits of the 64:FF9B::/96 subnet as a AAAA record, which gets routed to the NAT64 server:
Tayga allocates an address out of its NAT pool for the IPv6 source address, and translates the packet on the nat64 loopback interface to:
The iptables nat MASQUERADE rule set up by the Tayga RC file then NATs the local subnet to the one public address and sends the packet on its way: > [DESTINATION IPv4 ADDRESS]
And dramatically, the IPv6 packet has been translated into an IPv4 packet, with the NAT64 server holding two pieces of NAT state: one in Tayga for the IPv6 to IPv4 translation, then one in iptables for the CGNAT address to public address translation.
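
To make the two pieces of state a bit more concrete, here is a toy Python model of the packet walk. It is a sketch under stated assumptions, not Tayga's or iptables' actual logic: the DynamicPool class and translate function are hypothetical names, 100.64.0.0/16 is just one possible CGNAT pool, and the 2001:db8::, 192.0.2.x and 198.51.100.x addresses are documentation ranges standing in for real hosts.

```python
import ipaddress


class DynamicPool:
    """Toy first-stage state: one pool IPv4 address per IPv6 source."""

    def __init__(self, pool_cidr, reserved=()):
        self._free = ipaddress.IPv4Network(pool_cidr).hosts()
        self._reserved = {ipaddress.IPv4Address(a) for a in reserved}
        self._map = {}

    def lookup(self, src6):
        key = ipaddress.IPv6Address(src6)
        if key not in self._map:
            # Hand out the next unused pool address, skipping reserved ones.
            for candidate in self._free:
                if candidate not in self._reserved:
                    self._map[key] = candidate
                    break
            else:
                raise RuntimeError("dynamic pool exhausted")
        return self._map[key]


def translate(src6, dst6, pool, public_v4):
    """Show the (source, destination) pair after each NAT stage."""
    # The real IPv4 destination is the low 32 bits of the IPv6 destination.
    dst4 = ipaddress.IPv4Address(int(ipaddress.IPv6Address(dst6)) & 0xFFFFFFFF)
    stage1_src = pool.lookup(src6)
    return {
        "after_tayga": (str(stage1_src), str(dst4)),
        # Second stage: every pool address is hidden behind one public address.
        "after_iptables": (str(ipaddress.IPv4Address(public_v4)), str(dst4)),
    }


if __name__ == "__main__":
    # Assumption: the first pool address is reserved for the translator itself.
    pool = DynamicPool("100.64.0.0/16", reserved=["100.64.0.1"])
    print(translate("2001:db8::10", "64:ff9b::c000:201", pool, "198.51.100.7"))
```

Only the first stage needs per-host memory in this model; the second stage in a real deployment is ordinary connection-tracked NAT44, which the sketch reduces to a source rewrite.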

Setting up Tayga

Starting from a stock Ubuntu 18.04 system, sudo apt install tayga gets us the needed NAT64 package. Then edit the /etc/netplan/01-netcfg.yaml file to match your network.

The Tayga config (/etc/tayga.conf) is relatively short, ignoring all the comments. The main parameters to update are the "ipv4-addr" so it's inside your local use IPv4 block, "dynamic-pool" to match the ipv4-addr, "ipv6-addr" to your public IPv6 address on the server, and turn on the "prefix 64:ff9b::/96" option.

Tayga config listing:

The second file you need to edit to enable Tayga is the /etc/default/tayga file, which feeds parameters into the /etc/init.d/tayga RC file (which you don't need to touch).

The key changes in the defaults file are to set RUN to yes, and to make sure both CONFIGURE_IFACE and CONFIGURE_NAT44 are turned on so the RC file will do all the iptables setup for us.

Tayga default listing:

At this point, you should be able to sudo service tayga start and do a sanity check by looking at your IPv4 and IPv6 routing tables and seeing the prefixes referenced in the Tayga config added to a new nat64 tun network interface. Listing:

Enabling Forwarding in sysctl.conf

Unfortunately, while the Tayga RC file does a good job of setting up the tun interface and NAT iptables rules, the one thing it doesn't do is turn on IPv4 and IPv6 forwarding in general, which is needed to forward packets back and forth between the public interface and the nat64 tun device. Uncomment the two relevant lines in /etc/sysctl.conf (net.ipv4.ip_forward=1 and net.ipv6.conf.all.forwarding=1) and restart. Listing:

Setting up BGP to advertise the NAT64 prefix

For exporting the NAT64 /96 prefix back into my network's core, I decided to use the Quagga BGP daemon. This is a pretty standard configuration for Quagga, exporting a directly attached prefix, but for completeness...

sudo apt install quagga
sudo cp /usr/share/doc/quagga-core/examples/zebra.conf.sample /etc/quagga/zebra.conf
sudo touch /etc/quagga/bgpd.conf
sudo service zebra start
sudo service bgpd start
sudo vtysh

I then interactively configured BGP to peer with my core and issued the write mem command to save my work. There are plenty of Quagga BGP tutorials out there, and the exact details of this part are a little out of scope for getting NAT64 working.

Quagga BGPd config listing:

At this point, my network has NAT64 support, so any IPv6-only hosts which are configured with a DNS64 name server can successfully access IPv4 hosts. I wouldn't depend on this one VM to serve a whole ISP's worth of IPv6-only hosts, but for a single rack of VMs which just need to be able to reach GitHub (which is still amazingly IPv4 only) to clone my config scripts, this setup has been working great for me.

Friday, January 25, 2019

Implementing BCP214 on Catalyst 6500

When it comes to running an Internet Exchange, at its most basic level, you're a metro Ethernet provider with a range of as little as 19". The most basic IXP is a single rack mount Ethernet switch, that you plug in, power on, and start plugging customers into.

Which is great, right up until you want to unplug or move any of those customers.

The problem is that, unlike a private interconnect between two routers, with an IXP switch in the middle, if you just yank the cord on one of the routers, it may be able to see that the interface is down and start calculating better routes from other BGP peers than those on the IXP, but every other customer on the IXP sending traffic towards the poor sap who you just disconnected won't see any change. When you unplug customer A, customer B will continue to see their link to the IXP switch up, and will continue to send traffic towards customer A until the BGP session between them eventually times out, which can be on the order of minutes.

So in an ideal world, before yanking the cord on an IXP peer, you'd like to be able to make it seem like you've yanked the cord (without actually doing it), give BGP the minutes it takes to reconverge around the soon-to-be-down link, and then finally unplug the physical cable only once all of the traffic has drained and doing so won't result in a minute or two of black-holing traffic.

The simplest way to do this is to send an email to the customer in question the day/week/month before and say "hey, I'm going to be unplugging your port, so turn down all of your BGP sessions with others first" but it's pretty unrealistic to expect that level of cooperation from another autonomous system, and it wastes a lot of time on the customer's part manually turning down all their peering sessions before, and then turning them back up after.

A better way to do it is to actively force the BGP sessions to go down without disrupting any other traffic, then wait for the reconvergence that will happen because of that. This technique is called BCP214, and basically involves using the IXP switches' ability to filter traffic to specifically filter the IPv4 and IPv6 BGP packets going between peers on the exchange.

I've been doing this "turn down to move the peer to another switch" action quite a bit in the last few months for FCIX, where we've been moving everyone off my Cisco 6506 to a much nicer Arista 7050S-64. The problem is that, while BCP214 helpfully provides some sample configs in the appendix to implement this technique on Cisco IOS, for some reason which is beyond my understanding of the history of IOS command syntax, the Cisco sample doesn't seem to work on my Cisco 6500 running IOS 15.1(2)SY11.

It took some digging to figure out the exact syntax needed to implement the needed ACLs and then apply them to an interface on my 6506, so just in case anyone else needs these, enjoy:

The first part of implementing BCP214 is permanently creating two ACLs for specifically dropping BGP traffic from one IXP address to another (one for IPv4 and the second for IPv6). It's important to appreciate why you want to be specific in filtering only BGP with IXP addresses on it; multi-hop BGP could be flowing over the IXP between two routers not connected to the IXP for some reason, and that traffic shouldn't be dropped. These example ACLs use the FCIX subnets of and 2001:504:91::/64, but need to be modified to match your own IXP subnets.

ip access-list extended acl-v4-bcp214
 deny   tcp eq bgp
 deny   tcp eq bgp
 permit ip any any
ipv6 access-list acl-v6-bcp214
 deny tcp 2001:504:91::/64 eq bgp 2001:504:91::/64
 deny tcp 2001:504:91::/64 2001:504:91::/64 eq bgp
 permit ipv6 any any

There are two deny lines on each ACL because you don't know if this customer's router happened to be the initiator of the TCP connection for BGP or not, so the source TCP port might be port 179, or the destination port might be 179. You want to drop both of those.
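
The matching logic of those ACLs can be modeled in Python to check the reasoning about source versus destination ports. This is a hedged sketch of the filter's intent, not of how the switch evaluates ACLs in hardware; the function name culled is invented for this example, and the 2001:db8:: addresses are documentation space standing in for routers that are not on the peering LAN.

```python
import ipaddress

BGP_PORT = 179
# IPv6 peering LAN quoted in the ACL above; swap in your own IXP subnet.
IXP_SUBNET = ipaddress.ip_network("2001:504:91::/64")


def culled(src, dst, sport, dport, proto="tcp", ixp=IXP_SUBNET):
    """Return True if the packet would hit one of the two deny lines."""
    src_ip = ipaddress.ip_address(src)
    dst_ip = ipaddress.ip_address(dst)
    if proto != "tcp":
        return False
    # Both endpoints must be on the peering LAN, so multi-hop BGP that
    # merely transits the exchange falls through to the final permit.
    if src_ip.version != ixp.version or dst_ip.version != ixp.version:
        return False
    if src_ip not in ixp or dst_ip not in ixp:
        return False
    # Either side may have opened the TCP session, so check both ports.
    return sport == BGP_PORT or dport == BGP_PORT


if __name__ == "__main__":
    print(culled("2001:504:91::1", "2001:504:91::2", 50000, 179))  # True
    print(culled("2001:504:91::2", "2001:504:91::1", 179, 50000))  # True
    print(culled("2001:db8::1", "2001:504:91::2", 50000, 179))     # False
    print(culled("2001:504:91::1", "2001:504:91::2", 50000, 443))  # False
```

The third case is the multi-hop session the ACL deliberately leaves alone, and the fourth shows that ordinary traffic between two peers keeps flowing while their BGP session times out.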

With those ACLs now part of the config, when you need to cause a port to drain its traffic, you temporarily apply those two ACLs to the interface's config and give it a few minutes for the BGP sessions to time out, the routers on both sides to re-converge, and the rest of the Internet to pick up the slack, so that no traffic is black-holed when you then shut down the interface.

interface GigabitEthernet1/3
 description Peering: FCIX Peer
 switchport access vlan 100
 switchport mode access
 ip access-group acl-v4-bcp214 in
 ipv6 traffic-filter acl-v6-bcp214 in

Monday, December 31, 2018

FCIX - State of the Exchange

The First Year of Running an Internet Exchange

It has been a little over a year since one of my friends challenged me on a whim to get an autonomous system number and stand up my own little corner of the Internet, and what a long slippery slope that has been. One of the advantages to running your own autonomous system is that you can blend your own connection to the Internet via peering, so as we continued to make more friends in the Hurricane Electric FMT2 data center who we wanted to peer with, the number of desired cross connects started to get out of hand; particularly since they aren't free, and we have all of about zero budget for... just about everything.

It's that quadratic growth of the number of interconnects in a full mesh that really gets you.

But that's exactly the problem that Internet Exchange Points are meant to solve; I have N number of networks that want to interconnect in one place without running O(N^2) Ethernet cables between each other, so everyone connects to one central Ethernet switch and it's just as effective at a much lower cost of entry.
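
The arithmetic behind that comparison is short enough to write out. The snippet below is just a worked example of the full-mesh formula N(N-1)/2 against one port per network on a shared switch; the function names and the sample sizes are picked for illustration.

```python
def full_mesh_links(n):
    """Cross connects needed for every network to reach every other directly."""
    return n * (n - 1) // 2


def ixp_links(n):
    """Cross connects needed when every network plugs into one shared switch."""
    return n


if __name__ == "__main__":
    for n in (5, 10, 25):
        print(f"{n:>3} networks: {full_mesh_links(n):>4} links in a full mesh, "
              f"{ixp_links(n):>3} via an exchange")
```

At 25 networks the full mesh needs 300 cross connects against 25 through an exchange, which is the gap that makes a single switch worth the trouble.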

So eight months ago, we jokingly registered the domain, grabbed a spare /24 + /64 of address space we had laying around, and founded the Fremont Cabal Internet Exchange.

We had a good laugh setting up a cheeky little website to make it look like we're a real Internet Exchange, which lasted all of about two weeks before the owner of the data center, Hurricane Electric, applied to our Internet Exchange, brought in the 75,000 prefixes from their customer cone, and put us on their advertising material for the building.

Well crap. That joke got out of hand rather quickly.

Membership Growth

For the first few months there, we were handling two or three new membership applications per week, so it became evident that we needed to get our act together rather quickly.

The first point of order was dealing with the fact that we were running this no-longer-just-between-friends Internet Exchange on borrowed address space, so we needed to get the exchange its own ASN and /24 + /64, but that's $550 for the ASN and $250 for the resource service agreement for the address space... A bit of a problem when we have zero budget...

But at that point we had the advantage of having ~15 current + pending members, so we passed around the hat and between our membership and some other very amused on-lookers, very quickly managed to scrape together the $800 needed to cover the registration costs of FCIX's resources.

Not only did this enable us to re-number onto a real ASN and IXP address space, but getting such a concrete signal of support from our members was touching. This thing wasn't much of a joke anymore; we're actually providing a service to networks which they value enough to throw a few hundred bucks our way to make it happen.

New membership applications have slowed down, but as of this writing we're up to 25 members, which I don't think is half bad for a less than one year old exchange in the east bay in a single site.


So after getting a round of donations to cover the start-up costs for the ARIN resources, the next question was how to handle getting networks actually connected to FCIX. Originally, we were just running FCIX as a VLAN on my Cisco 6506 which was powering my personal network (AS7034), but that suffered from a few issues, the largest of which is that the 6506 is so old that 10G Ethernet was a cutting edge feature at the time. At best, 10G line cards for the 6500 support 16 ports, but the 10G line card we had managed to scrounge for AS7034 only had four ports for XENPAKs, and burns 100W per port, so offering 10G for FCIX was going to be problematic at such a low port density, even ignoring the issue of sourcing line cards and such vintage optics.

This is where Arista stepped in and has contributed to FCIX in a huge way. I got a call from a long-time friend who works at Arista who liked what we were doing and was interested in getting us a real switch to run the exchange on. This means that we got a pair of Arista 7050S-64 switches, which have 48 SFP+ ports which can support either 1G or 10G optics, plus another four QSFP+ ports for 40G, because hey, maybe we'll need 40G at some point...

This now only left the issue of optics. Every member port that we turn up needs an LX or LR optic, which even from third party vendors starts to add up quickly (remember how we have zero budget?), so we were very rapidly tapping out our junk bins of left-over optics we all had lying around. So while we sat there brain-storming ways to work around this sustainability problem, we got an awesome direct message on Twitter from Flexoptix!

Flexoptix is a third-party optical transceiver vendor with an additional advantage: they sell what they call their "FLEXBOX", which lets you insert one of their optics and reprogram it over USB for whichever vendor's switch you need. So even though we've got that scrappy "we'll use whatever switch we can drum up" aesthetic to us, we only need to stock one tray of 1G and 10G optics to cover any possible switch we'd want to plug these optics into. Furthermore, as we moved from my Cisco 6506 to our shiny new Arista, we were able to simply reprogram the optics and reuse them, so already the flexibility of their universal transceivers has borne fruit.

The Cost of Entry

Having been started originally as mostly a joke, we've been very against charging any kind of membership fee to join FCIX, for multiple reasons:

  1. There are already several established pay-to-play IXPs in the Bay Area, so trying to charge for ours when we have none of the valuable peers that existing exchanges have would be kind of silly.
  2. We all have real day jobs, so if someone paid to join FCIX and it then stopped working during the day, they'd rightfully kind of expect us to get it working again ASAP. My day job boss probably wouldn't appreciate that, so we don't charge anything, and problems get fixed when we can get to them. Outages, of course, are refunded in full. (see what we did there?)

This sort of zero cost of entry and zero membership fee model definitely wouldn't have been possible without all the donations we've gotten, and particularly Flexoptix donating trays of optics so we can light our end of every new member's port without trying to deal with every new member somehow contributing an optic to FCIX to light their port.

This has meant a few issues with people trying to abuse our free model, so we very quickly needed to institute an informal "one port per cabinet" rule, since we at one point got six applications from different people sharing one cabinet, and although Hurricane Electric is sponsoring all the cross connects, I'm not going to abuse that deal to run six pieces of fiber to one cabinet. Charging a one time turn-up fee like SIX does probably would have been a good idea and prevented most of our issues with low quality membership applications.

Plans for Growth and Value Add

At this point the basic framework for the exchange has been set up. Adding new members is relatively painless and mainly involves generating them a letter of authorization to redeem for a free cross-connect from HE and adding them to a CSV which propagates to the website and route servers automatically.

The most exciting piece of news with regards to growth is that Hurricane Electric has agreed to sponsor FCIX with a second cabinet in their other building, FMT1. That, plus a pair of dark fiber between the two buildings (thanks HE!), plus a pair of LR4 40G optics (thanks Flexoptix!), plus a second switch (thanks Arista!), and FCIX will soon be multi-site! Membership applications from FMT1 are now open.

The other challenge has been coming up with projects to focus on to help increase the value of the exchange to existing members. Adding new members is easy, but we have also been working on getting things like cache appliances and DNS servers on-net. Verisign was kind enough to contribute a J root + .com/.net DNS server to the exchange, so anyone running their own recursive DNS resolvers gets to enjoy direct access to J root and B.gTLD over the exchange. Work on other value-add appliances is on-going.

As we head into this new year, I couldn't be happier or more grateful with how far we've gotten with this project while keeping it sustainable. The annual expenses specific to the exchange are still below $500 between our ARIN fees and other misc fees, and contributions of hardware and money from sponsors and members have enabled us to grow much further than we would have been able to fund on our own.

So thank you again to everyone involved in FCIX, and I wish all of you a lovely 2019.

Saturday, November 24, 2018

Inspiration for Date Nights

My previous romantic relationship was recently classified as "not successful," which is one of those things that generally sucks. A lot.

After a few days of just being all around miserable about it, I moved on to the RCCA phase of me moping around after a relationship and tried to take a hard look at the Root Causes and Corrective Actions for my part in the relationship not being successful. One of the issues I identified was that my tendency as an introvert to not plan outings doesn't lend itself well to building healthy relationships, so I clearly need a more concrete framework for date night.

Putting a few days into the idea during these holidays, this is what I came up with: trawl as many click-bait "top 100 date ideas" articles as I can stomach, cherry-pick any that I thought I'd enjoy with someone or that wouldn't be a terrible way to push the envelope with someone, and write each one on a 2.5"x3" index card. Each card isn't an actionable plan, but a category of activity (e.g. "Go for a hike"), with space below it to fill in specifics. Whether those specifics are something I fill in on my own or I sit down with my partner to spend an evening workshopping "exactly where would we be interested in hiking?" to fill in the rest of the space is unclear, but I'm interested to see if this proves helpful to me in the future.

Since others have already asked for this list, I figured I could post my current collection of cards here. Some of them have pretty obvious "more specifics" to be filled in, but some of them do suffer from the difficulty of figuring out exactly how to find a place to do them:

  1. Visit a local tourist destination
  2. Visit an aquarium/zoo
  3. Go to a library together
  4. Go shopping at a thrift shop together
  5. Go shopping at a flea market together
  6. Go on a local walking tour
  7. Go to a movie
  8. Go for a hike
  9. Go camping
  10. Night of coloring/doodling
  11. Host a dinner party
  12. Host a board game night
  13. Go to an art gallery
  14. Visit a museum
  15. Work on a Home Depot project/build something
  16. Have a picnic
  17. Go to a pottery class
  18. Go fruit picking
  19. Take a cooking class
  20. Sample drinks at a bar/brewery/vineyard
  21. Morning coffee date/brunch
  22. Day at the beach
  23. Go to a used book store
  24. Make dinner together
  25. Go to an amusement park
  26. Go to a county fair
  27. Take a boat/ferry ride
  28. Go to a farmers market
  29. Go play pool
  30. Go out bowling
  31. Go to an arcade
  32. Go to a shooting range
  33. Train for and run a race together
  34. Brew beer together
  35. Visit a local nursery for house plants
  36. Read the same book
  37. Visit a botanical garden
  38. Work on a puzzle together
  39. Go play miniature golf
  40. Go garage/estate sale hopping

Monday, July 16, 2018

A Quickstart Guide to IRR & RPSL

In case you are in the very limited pool of people in the world who are responsible for an Autonomous System and need to set up IRR for peer route filtering, I've written a whitepaper on the matter for FCIX.

If you aren't a network engineer for an ISP, I suspect that article will be less interesting on its own.