Sunday, February 3, 2019

Running NAT64 in a BGP Environment

IPv6 is one of those great networking technologies, like multicast, which makes your life as a network operator much easier inside your own network, but is generally pretty useless if you expect it to work across the Internet.

People like to focus on how the percentage of traffic going over IPv6 has slowly been creeping up, but unfortunately until that number reaches practically 100%, an Internet connection without IPv4 connectivity is effectively broken. Ideally, everyone runs dual-stack, but the whole shortage of IPv4 addresses was the original motivator for IPv6, so increasingly networks will need to run native IPv6 plus some form of translation layer for IPv4, be it carrier grade NAT, 464XLAT, or as I'll cover today, NAT64.

NAT64 is a pretty clever way to add IPv4 connectivity to an IPv6-only network, since it takes advantage of the fact that the entire IPv4 address space can actually be encoded in half of a single subnet of IPv6. Unfortunately, I also found it somewhat confusing since there are actually quite a few moving parts in a functional NAT64-enabled network, so I figured I'd write up my experience adding NAT64 to my own network.

There are really two parts to NAT64, and to make it a little more confusing, I decided to add BGP to the mix as a third because it's a Sunday and I'm bored.

  1. A DNS64 server, which is able to give a AAAA response for every website. For what few websites actually have AAAA records, it just returns those. For websites which are IPv4 only, it returns a synthetic AAAA record which is the A record packed into a special IPv6 prefix. The RFC recommended global prefix for this packing is 64:FF9B::/96, but there could possibly be reasons to use any /96 out of your own address space instead (I can't really think of any). I'm using the standard 64:FF9B::/96 prefix mainly because it means I can use Google's DNS64 servers instead of running my own. 
  2. A NAT64 server, which translates each packet destined for an IPv4 address packed into this special IPv6 /96 prefix into an IPv4 packet, sourced from an IPv4 address in the NAT64's pool and destined for the actual IPv4 address, to be routed across the Internet. My NAT64 server goes a step further and NATs all of these IPv6-mapped IPv4 addresses to one public address, since mapping IPv6 to IPv4 addresses one-for-one doesn't gain me much.
  3. BGP to inject a route to 64:FF9B::/96 towards the NAT64 server into my core router. Realistically, you could just do this with a static route on your router pointed towards your NAT64's IPv6 address, but I want to be able to share my NAT64 server with friends I'm peered with over FCIX, and since BGP is a routing policy protocol, it makes that sharing easier. If you're not running a public autonomous system, just ignore anything I say about BGP and point a static route instead.
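To make the packing in step 1 concrete, here's a quick sketch in Python using only the standard library's ipaddress module (the function name is mine, not from any NAT64 tooling):

```python
import ipaddress

def synthesize_aaaa(v4, prefix="64:ff9b::/96"):
    """Pack an IPv4 address into the bottom 32 bits of a NAT64 /96,
    the same mapping a DNS64 server uses to synthesize AAAA records."""
    net = ipaddress.ip_network(prefix)
    packed = int(net.network_address) + int(ipaddress.IPv4Address(v4))
    return str(ipaddress.IPv6Address(packed))

# An IPv4-only host like 192.0.2.1 maps to 64:ff9b::c000:201
print(synthesize_aaaa("192.0.2.1"))
```

The NAT64 server just runs this mapping in reverse: any packet routed to it inside 64:ff9b::/96 has the real IPv4 destination sitting in the last 32 bits.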

So to do this, I'm using a software package called Tayga, along with these numbers (to make my specific example clearer):

  • is the public IPv4 address of my NAT64 server. You would replace this address with whatever IPv4 address you wanted all your NATed traffic to be sourced from in the end.
  • 2620:13B:0:1000::4/64 is the public IPv6 address of my NAT64, which is the gateway to the special /96 prefix containing all IPv4 addresses. This address again needs to be something specific to your network, and is mainly important because it's where your route to the /96 needs to point.
  • 64:FF9B::/96 is the standard prefix for NAT64. There's no reason to change it to anything else.
  • is an arbitrary big chunk of local-use-only addresses I picked for this project. For how I have my NAT64 server configured, it really didn't matter what prefix of CGNAT or RFC1918 address space I picked here, since it's really only used as an intermediary between the client's actual IPv6 source address and the single public IPv4 address that they're all NATed to. A /16 is plenty of address space, since trying to NAT even 64k hosts to a single public IPv4 address is going to be a bad time. Theoretically there are ways to use CGNAT concepts to spread the load across a whole pool of public IPv4 addresses, but one address is plenty for my needs.
  • AS4264646464 is an arbitrary private-use ASN I picked to be able to set up an eBGP peering session between my NAT64 server and my core AS7034 router.

The source IPv6-only host makes a DNS query against a DNS64 server, which encodes IPv4 A records into the bottom 32 bits of the 64:FF9B::/96 subnet as a AAAA record, which gets routed to the NAT64 server:
Tayga allocates an address out of its NAT pool for the IPv6 source address, and translates the packet on the nat64 tun interface to:
The iptables NAT (MASQUERADE) rule set up by Tayga then NATs the local subnet to the one public address and sends the packet on its way: > [DESTINATION IPv4 ADDRESS]
And just like that, the IPv6 packet has been translated into an IPv4 packet, with the NAT64 server holding two pieces of NAT state: one in Tayga for the IPv6 to IPv4 translation, and one in iptables for the CGNAT address to public address translation.

Setting up Tayga

Starting from a stock Ubuntu 18.04 system, sudo apt install tayga gets us the needed NAT64 package; then edit the /etc/netplan/01-netcfg.yaml file to match your network.

The Tayga config (/etc/tayga.conf) is relatively short, ignoring all the comments. The main parameters to update are the "ipv4-addr" so it's inside your local use IPv4 block, "dynamic-pool" to match the ipv4-addr, "ipv6-addr" to your public IPv6 address on the server, and turn on the "prefix 64:ff9b::/96" option.

Tayga config listing:
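With the comments stripped, the result boils down to something like the sketch below. The 100.64.0.0/16 pool and the ipv4-addr inside it are hypothetical stand-ins for whatever local-use block you picked; the ipv6-addr is my server's public address from above.

```
tun-device nat64
# Tayga's own IPv4 address, inside the local-use block (hypothetical)
ipv4-addr 100.64.0.1
# The NAT64 server's public IPv6 address
ipv6-addr 2620:13b:0:1000::4
# The standard NAT64 prefix
prefix 64:ff9b::/96
# Local-use pool that IPv6 clients get mapped into (hypothetical)
dynamic-pool 100.64.0.0/16
data-dir /var/spool/tayga
```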

The second file you need to edit to enable Tayga is the /etc/default/tayga file, which feeds parameters into the /etc/init.d/tayga RC file (which you don't need to touch).

The key changes in the defaults file are to set RUN to yes, and to make sure both CONFIGURE_IFACE and CONFIGURE_NAT44 are turned on so the RC file will do all the iptables setup for us.

Tayga default listing:
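The relevant knobs in my defaults file end up looking like this:

```
# Actually run the daemon
RUN="yes"
# Let the RC script create and configure the nat64 tun interface
CONFIGURE_IFACE="yes"
# Let the RC script set up the iptables NAT44 (MASQUERADE) rules
CONFIGURE_NAT44="yes"
```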

At this point, you should be able to sudo service tayga start and do a sanity check by looking at your IPv4 and IPv6 routing tables and seeing the prefixes referenced in the Tayga config added to a new nat64 tun network interface. Listing:
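With my hypothetical 100.64.0.0/16 pool from above, the sanity check looks roughly like this (exact flags and metrics will vary):

```
$ ip route | grep nat64
100.64.0.0/16 dev nat64 scope link
$ ip -6 route | grep nat64
64:ff9b::/96 dev nat64 metric 1024
```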

Enabling Forwarding in sysctl.conf

Unfortunately, while the Tayga RC file does a good job of setting up the tun interface and NAT iptable rules, the one thing it doesn't do is turn on IPv4 and IPv6 forwarding in general, which is needed to forward packets back and forth between the public interface and the nat64 tun device. Uncomment the two relevant lines in sysctl.conf and restart. Listing:
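For reference, the two lines to uncomment in /etc/sysctl.conf are:

```
net.ipv4.ip_forward=1
net.ipv6.conf.all.forwarding=1
```

A sudo sysctl -p will apply them immediately if you'd rather not reboot.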

Setting up BGP to advertise the NAT64 prefix

For exporting the /96 NAT64 prefix back into my network's core, I decided to use the Quagga BGP daemon. This is a pretty standard configuration for Quagga, exporting a directly attached prefix, but for completeness...

sudo apt install quagga
sudo cp /usr/share/doc/quagga-core/examples/zebra.conf.sample /etc/quagga/zebra.conf
sudo touch /etc/quagga/bgpd.conf
sudo service zebra start
sudo service bgpd start
sudo vtysh

I then interactively configured BGP to peer with my core and issued the write mem command to save my work. There are plenty of Quagga BGP tutorials out there, and the exact details of this part are a little out of scope for getting NAT64 working.

Quagga BGPd config listing:
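As a sketch, a minimal bgpd.conf along these lines does the job; the router-id and the neighbor address (my core AS7034 router) here are hypothetical placeholders:

```
router bgp 4264646464
 bgp router-id 192.0.2.4
 no bgp default ipv4-unicast
 neighbor 2620:13b:0:1000::1 remote-as 7034
!
address-family ipv6
 network 64:ff9b::/96
 neighbor 2620:13b:0:1000::1 activate
exit-address-family
```

The network statement works because the nat64 tun interface already put a connected route to 64:ff9b::/96 in the kernel table.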

At this point, my network has NAT64 support, so any IPv6-only hosts which are configured with a DNS64 name server can successfully access IPv4 hosts. I wouldn't depend on this one VM to serve a whole ISP's worth of IPv6-only hosts, but for a single rack of VMs which just need to be able to reach GitHub (which is still amazingly IPv4 only) to clone my config scripts, this setup has been working great for me.

Friday, January 25, 2019

Implementing BCP214 on Catalyst 6500

When it comes to running an Internet Exchange, at its most basic level, you're a metro Ethernet provider with a range of as little as 19". The most basic IXP is a single rack mount Ethernet switch, that you plug in, power on, and start plugging customers into.

Which is great, right up until you want to unplug or move any of those customers.

The problem is that, unlike a private interconnect between two routers, with an IXP switch in the middle, if you just yank the cord on one of the routers, that router may see that its interface is down and start calculating better routes via BGP peers other than those on the IXP, but every other customer on the IXP sending traffic towards the poor sap you just disconnected won't see any change. When you unplug customer A, customer B will still see their link to the IXP switch as up, and will continue to send traffic towards customer A until the BGP session between them eventually times out, which can take on the order of minutes.

So in an ideal world, before yanking the cord on an IXP peer, you'd like to be able to make it seem like you've yanked the cord (without actually doing it), give BGP the minutes it takes to reconverge around the soon-to-be-down link, and then finally unplug the physical cable only once all of the traffic has drained and doing so won't result in a minute or two of black-holing traffic.

The simplest way to do this is to send an email to the customer in question the day/week/month before and say "hey, I'm going to be unplugging your port, so turn down all of your BGP sessions with others first," but it's pretty unrealistic to expect that level of cooperation from another autonomous system, and it wastes a lot of the customer's time manually turning down all their peering sessions beforehand and then turning them back up after.

A better way to do it is to actively force the BGP sessions to go down without disrupting any other traffic, then wait for the reconvergence that will happen because of that. This technique is codified as BCP214, and basically involves using the IXP switches' ability to filter traffic to specifically filter the IPv4 and IPv6 BGP packets going between peers on the exchange.

I've been doing this "turn down to move the peer to another switch" action quite a bit in the last few months for FCIX, where we've been moving everyone off my Cisco 6506 to a much nicer Arista 7050S-64. The problem is that, while BCP214 helpfully provides some sample configs in the appendix to implement this technique on Cisco IOS, for some reason which is beyond my understanding of the history of IOS command syntax, the Cisco sample doesn't seem to work on my Cisco 6500 running IOS 15.1(2)SY11.

It took some digging to figure out the exact syntax needed to implement the needed ACLs and then apply them to an interface on my 6506, so just in case anyone else needs these, enjoy:

The first part of implementing BCP214 is permanently creating two ACLs specifically for dropping BGP traffic from one IXP address to another (one for IPv4 and the second for IPv6). It's important to appreciate why you want to be specific in filtering only BGP between IXP addresses: multi-hop BGP could be flowing over the IXP between two routers not connected to the IXP for some reason, and that traffic shouldn't be dropped. These example ACLs use the FCIX subnets of and 2001:504:91::/64, but need to be modified to match your own IXP subnets.

ip access-list extended acl-v4-bcp214
 deny   tcp eq bgp
 deny   tcp eq bgp
 permit ip any any
ipv6 access-list acl-v6-bcp214
 deny tcp 2001:504:91::/64 eq bgp 2001:504:91::/64
 deny tcp 2001:504:91::/64 2001:504:91::/64 eq bgp
 permit ipv6 any any

There are two deny lines in each ACL because you don't know whether this customer's router happened to be the initiator of the TCP connection for BGP or not, so the source TCP port might be 179, or the destination port might be 179. You want to drop both of those.

With those ACLs now part of the config, when you need to cause a port to drain its traffic, you temporarily apply them both to the interface's config and give it a few minutes for the BGP sessions to time out, the routers on both sides to re-converge, and the rest of the Internet to pick up the slack, so that no traffic is black-holed when you then shut down the interface.

interface GigabitEthernet1/3
 description Peering: FCIX Peer
 switchport access vlan 100
 switchport mode access
 ip access-group acl-v4-bcp214 in
 ipv6 traffic-filter acl-v6-bcp214 in
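Once the interface counters show the traffic has drained, the cleanup is just the reverse; the filters only need to stay on while the port is draining:

```
interface GigabitEthernet1/3
 shutdown
 no ip access-group acl-v4-bcp214 in
 no ipv6 traffic-filter acl-v6-bcp214 in
```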