Update, June 22, 2022: In light of the root cause analysis published by Cloudflare for their recent outage [1], we thought we'd refresh this article since it remains relevant.
Much as was the case with Facebook back in October 2021, the downtime was the result of a misconfiguration of BGP – in the case of Cloudflare, modifications to BGP communities, and in the Facebook case, an errant command which dropped BGP advertisements.
**************************
On October 4th, 2021, Facebook experienced a six-hour outage that extended to related properties including WhatsApp, Instagram, and Oculus VR. Given the magnitude of this event, we thought it would be good to dig a little bit deeper into some of the Internet technologies that we rely on.
It's Always DNS
As we’ve said before, DNS is a single point of failure for Internet systems. DNS (Domain Name System) maps names, such as facebook.com, to IP addresses, allowing users to easily refer to sites by name. DNS, in effect, provides translation between names and IP addresses, like an address book. When a site’s DNS servers are down, this lookup cannot happen, and people will be unable to reach the site. Keeping your DNS servers up, operational, and secure is a critical piece of site reliability.
Except when it’s BGP
Underneath that, there’s another technology that is at least as critical as DNS: a routing protocol (one of many) called BGP, the Border Gateway Protocol. BGP is the protocol that allows Autonomous Systems (collections of networks controlled by a single entity) to let other Autonomous Systems know how to reach the networks they control. It doesn't do the routing directly, but rather is the protocol that shares reachability information between routers. Having received this information, routers can make decisions about where to forward data.
Why Is BGP Important?
As an example, suppose you type “f5.com” into a web browser. This causes your computer to perform a DNS lookup, and the local DNS server your computer uses will hopefully return an IP address of 107.162.162.40. That’s the address book part.
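As a rough sketch of the address book idea, the following Python models a resolver as a simple table lookup. The names `ADDRESS_BOOK` and `resolve` are hypothetical and exist only for illustration; only the f5.com entry comes from the example above, and a real resolver queries DNS servers over the network rather than a local dictionary.

```python
from typing import Optional

# Toy "address book": the single entry is taken from the example above.
ADDRESS_BOOK = {"f5.com": "107.162.162.40"}


def resolve(name: str, dns_available: bool = True) -> Optional[str]:
    """Return the IP address for a name, or None if the lookup cannot happen."""
    if not dns_available:
        # DNS servers are down: the name cannot be translated at all.
        return None
    # Names are case-insensitive, and a trailing dot is just the DNS root.
    return ADDRESS_BOOK.get(name.lower().rstrip("."))
```

In real code, Python's standard `socket.getaddrinfo` performs the actual lookup; the toy version simply makes it easy to see that when the lookup fails, the client has no address to send traffic to.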
Now, however, your computer must be able to send traffic to that IP address. It's important to note that routing decisions are made on a hop-by-hop basis. Each router your data passes through will decide what the next step of the route should be by looking at the destination IP address and consulting its routing table to determine the next place to forward the data.
If the router participates in BGP, this routing table is constructed from the announcements it has received from other BGP enabled routers. This will include information on what networks can be reached by which routers. It will also have information about how close that route is to the destination. Close in this case doesn't mean the number of routers the data will have to go through, but rather the number of Autonomous Systems that the data will traverse. There is a complex algorithm used to determine which of the possible routes is best. Best can mean a lot of things too, as factors such as egress policies and transit agreements between ISPs are considered as well.
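To make the idea of "closeness" concrete, here is a minimal sketch of choosing between announcements for the same network. It is a heavy simplification of the real BGP decision process: it considers only an operator-assigned preference and the number of Autonomous Systems in the path. The `Announcement` class, the router names, and the `local_pref` default of 100 are assumptions for illustration, and the AS numbers are drawn from the range reserved for documentation.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Announcement:
    prefix: str                # network being announced
    next_hop: str              # hypothetical neighbor router name
    as_path: Tuple[int, ...]   # Autonomous Systems the route traverses
    local_pref: int = 100      # policy knob: higher is preferred


def best_route(announcements: List[Announcement], prefix: str) -> Optional[Announcement]:
    """Pick the preferred announcement for a prefix, or None if there is none."""
    candidates = [a for a in announcements if a.prefix == prefix]
    if not candidates:
        return None
    # Policy first (higher local_pref wins), then the shorter AS path.
    return min(candidates, key=lambda a: (-a.local_pref, len(a.as_path)))


# Two announcements for the same network, learned from different neighbors.
ROUTES = [
    Announcement("107.162.0.0/16", "router-b", (64500, 64501, 64502)),
    Announcement("107.162.0.0/16", "router-c", (64510, 64502)),
]
```

With this data, `best_route(ROUTES, "107.162.0.0/16")` selects the announcement via `router-c`, because its path crosses two Autonomous Systems instead of three. Raising `local_pref` on the longer route would override that, which is how policy can outweigh path length.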
If Router A’s routing table shows two routers to which it can forward data destined for 107.162.162.40, it will pick one of the two based on those metrics. Similar routing decisions are made by each router that receives the data, either forwarding it to another router, or determining that it is directly connected to the 107.162.0.0/16 network and delivering the data to the final destination. The same process is performed in reverse to route the reply back to the client, possibly through a different series of routers.
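The lookup each router performs can be sketched with Python's standard `ipaddress` module. When several entries in the table match a destination, IP forwarding uses the most specific one (longest-prefix match). The table contents and router names below are hypothetical; only the destination address and the /16 network come from the example.

```python
import ipaddress
from typing import Dict, Optional


def next_hop(routing_table: Dict[str, str], destination: str) -> Optional[str]:
    """Pick the next hop for a destination using longest-prefix match."""
    dest = ipaddress.ip_address(destination)
    matches = [
        (ipaddress.ip_network(prefix), hop)
        for prefix, hop in routing_table.items()
        if dest in ipaddress.ip_network(prefix)
    ]
    if not matches:
        return None  # no matching route: the packet would be dropped
    # The most specific (longest) matching prefix wins.
    return max(matches, key=lambda match: match[0].prefixlen)[1]


# Hypothetical routing table for Router A.
ROUTER_A = {
    "0.0.0.0/0": "router-b",       # default route
    "107.162.0.0/16": "router-c",  # more specific route toward the destination
}
```

Here `next_hop(ROUTER_A, "107.162.162.40")` returns `router-c`, while an address outside 107.162.0.0/16 falls back to the default route via `router-b`.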
There are a lot of advantages to this scheme. As long as an eventual final destination router for the traffic is available (and most companies with large Internet presences have many such routers), our data should (eventually) end up there. As the information required to serve a site is broken down into many packets, they may even take different routes.
This is a feature – if some intermediate router goes down, the packets that compose our request or response can be re-routed to avoid the issue. After all, the Internet is often said to have been designed to route around nuclear strikes. All of this works, however, only as long as the routing tables are consistent and have good information in them.
Can You Supply an Enlightening Metaphor?
Consider the following metaphor: you want to get to your friend’s house, but you’ve never been there. You look up their address. (That’s the DNS part.) Now you need to figure out how to get there, so you go to the nearest intersection and ask someone which way you should go. They tell you to turn left. You go along that road until you reach another intersection, and ask again. This person tells you to go right.
You continue this process until you reach your destination. It's possible that someone will tell you "Normally, I'd say go over the bridge, but the bridge is out, so go left here and ask at the next intersection." Or they may say "going left is more direct, but going right and getting on the highway is actually faster."
The route you take won't always be the most direct way to get there, nor even necessarily the fastest, but it will help you avoid road blocks, collapsed bridges, and washed-out roads. You will, assuming that everyone you ask has good info, get to where you're going.
The means by which that good info is communicated is BGP. If BGP is providing incorrect information, or no information at all about how to get where you want to go, bad things can happen.
Is BGP Bulletproof?
In a word, no. It's very robust, and scales well, which is a critical feature when you're trying to interconnect billions of hosts. But problems can occur.
A route announcement can omit routes it should be providing – meaning that the associated network simply disappears from the Internet. No one knows how to get there and the traffic destined for that network will be dropped.
When this is done intentionally, it’s called blackholing a route, and it’s typically used to block connections to or from a given network: for example, to block DDoS traffic from a hostile network or, in some circumstances, to remove an entire country from the Internet during a time of civil crisis. The result is that the network traffic is simply discarded, often with no notification back to the sender. The blackholed network will receive no traffic and will effectively be cut off from the (digital) world.
A route can be announced incorrectly as well. A misconfiguration on the part of an Autonomous System can make it appear as if it can route traffic to networks it does not control. Done intentionally, this is called BGP hijacking, and while there are defenses against this, it has happened many times, causing large amounts of traffic to be routed to very strange places, perhaps as an attempt to capture and inspect the traffic for purposes of espionage.
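One family of defenses checks whether the Autonomous System originating an announcement is actually allowed to announce that network. The sketch below is loosely modeled on route origin validation but is greatly simplified: the `AUTHORIZED_ORIGINS` registry, the function name, and the AS numbers (taken from the documentation range) are all assumptions for illustration. Real validation also distinguishes prefixes that have no registry entry from ones that are explicitly invalid; this toy version treats both as unauthorized.

```python
from typing import Dict, Set, Tuple

# Hypothetical registry: which AS is allowed to originate each prefix.
AUTHORIZED_ORIGINS: Dict[str, Set[int]] = {
    "107.162.0.0/16": {64500},
}


def origin_is_authorized(prefix: str, as_path: Tuple[int, ...]) -> bool:
    """Return True if the AS that originated the route may announce the prefix."""
    if not as_path:
        return False
    # The originating AS is the last entry in the AS path.
    origin = as_path[-1]
    return origin in AUTHORIZED_ORIGINS.get(prefix, set())
```

An announcement for 107.162.0.0/16 with an AS path ending in 64500 passes this check, while the same prefix originated by AS 64511 would be flagged, whether it came from a typo or a deliberate hijack.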
Accidents are far more common: a network operator or automated system misconfigures something. The necessary route either disappears entirely, or the misconfiguration ends up creating a routing loop (where traffic is forwarded back and forth between routers until its time-to-live expires), or it sends the traffic to a router that doesn’t know anything about the route, which then drops it.
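These failure modes can be illustrated with a toy hop-by-hop simulation. Everything here is hypothetical: the router names, the `forward` function, and the small hop limit standing in for a packet's time-to-live. Each router maps a destination network to a next hop, and `"LOCAL"` marks a router that is directly connected to the network.

```python
from typing import Dict

Tables = Dict[str, Dict[str, str]]
NET = "107.162.0.0/16"


def forward(tables: Tables, start: str, network: str, ttl: int = 8) -> str:
    """Follow next hops from a starting router and report what happens."""
    router = start
    while ttl > 0:
        hop = tables.get(router, {}).get(network)
        if hop is None:
            return f"dropped at {router}: no route"
        if hop == "LOCAL":
            return f"delivered by {router}"
        router = hop
        ttl -= 1  # each hop uses up one unit of time-to-live
    return f"dropped at {router}: TTL expired (possible routing loop)"


HEALTHY = {"A": {NET: "B"}, "B": {NET: "C"}, "C": {NET: "LOCAL"}}
WITHDRAWN = {"A": {NET: "B"}, "B": {}, "C": {NET: "LOCAL"}}  # B lost the route
LOOP = {"A": {NET: "B"}, "B": {NET: "A"}}                    # A and B point at each other
```

Running `forward` from router `A` delivers the traffic in the healthy case, drops it at `B` when the route has disappeared, and drops it after the hop limit runs out when `A` and `B` point at each other.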
Conclusion
Based on what’s been said so far, it’s likely that the Facebook outage was caused by a BGP misconfiguration. We understand that a large number of BGP updates happened at Facebook shortly before the outage [2, 3]. To be clear: at this stage, this is all speculation, and we’ll have to wait until Facebook decides to say what really went on, and why. In the absence of more information, though, it appears that those updates were at least part of the problem. We are grateful that an organization as conspicuous as Facebook provided us with such a great example to spread the word about this lesser-known but critically important part of the Internet's plumbing, and how it helps us get all of those cat videos to our browsers in one (eventual) piece.