Facebook, WhatsApp, Instagram and any service that depends on Facebook's platforms were unavailable for around six hours yesterday (4 October 2021), due to a configuration change that effectively disconnected Facebook from the internet.
According to some reports, Facebook engineers were also unable to access the company's servers remotely, which meant that administrators needed physical access to the datacentre hardware to resolve the issue. The problem was compounded by the design of the internet itself, which automatically propagated the effects of the misconfiguration around the globe. In effect, billions of people were unable to access Facebook-based services.
Santosh Janardhan, vice-president of infrastructure at Facebook, issued an apology in a blog post: “To all the people and businesses around the world who depend on us, we are sorry for the inconvenience caused by today’s outage across our platforms. We apologise to all those affected, and we’re working to understand more about what happened today so we can continue to make our infrastructure more resilient.”
In the post, Janardhan said that configuration changes on the backbone routers that coordinate network traffic between Facebook datacentres caused issues that interrupted communications. “This disruption to network traffic had a cascading effect on the way our datacentres communicate, bringing our services to a halt,” he said.
According to Cloudflare’s analysis of the outage, the configuration change caused Facebook’s DNS names to stop resolving to IP addresses, which meant that Facebook’s infrastructure IPs were unreachable. “It was as if someone had ‘pulled the cables’ from their datacentres all at once and disconnected them from the internet,” Cloudflare noted in a blog post.
“At 16:58 UTC we noticed that Facebook had stopped announcing the routes to their DNS prefixes. That meant that, at least, Facebook’s DNS servers were unavailable. Because of this, Cloudflare’s 1.1.1.1 DNS resolver could no longer respond to queries asking for the IP address of facebook.com or instagram.com,” Cloudflare stated in the blog post.
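To illustrate what “stopped resolving” means in practice, the short Python sketch below (an illustration, not part of Cloudflare’s tooling) asks the operating system’s configured resolver for the addresses behind a hostname. During the outage, a lookup like this for facebook.com would have returned an error instead of an IP address.

```python
import socket

def check_resolution(hostname: str) -> None:
    """Ask the system's configured DNS resolver for a hostname's addresses."""
    try:
        # getaddrinfo sends the query to whichever resolver the system
        # uses (for example 1.1.1.1 or 8.8.8.8).
        results = socket.getaddrinfo(hostname, None, proto=socket.IPPROTO_TCP)
        addresses = sorted({info[4][0] for info in results})
        print(f"{hostname} resolves to: {', '.join(addresses)}")
    except socket.gaierror as exc:
        # During the outage, lookups failed like this rather than
        # returning an address, because Facebook's authoritative
        # nameservers were unreachable.
        print(f"{hostname} did not resolve: {exc}")

for name in ("facebook.com", "instagram.com"):
    check_resolution(name)
```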
According to Cloudflare, the DNS failure traced back to Border Gateway Protocol (BGP), the mechanism used to exchange routing information between autonomous systems (AS) on the internet. The internet is effectively a network of networks bound together by BGP.
Each of these networks has an Autonomous System Number (ASN) with a unified internal routing policy. According to Cloudflare, every ASN needs to announce its prefix routes to the internet using BGP, otherwise no one will know how to connect and where to find internet-based services.
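The sketch below is a hypothetical, much-simplified model of that announce-and-withdraw behaviour, not real BGP code: once a prefix’s announcement is withdrawn, there is no longer any routing-table entry telling traffic where to go. AS32934 is Facebook’s autonomous system number; the prefix used here is purely illustrative.

```python
# Toy model of a routing table keyed by announced prefixes.
import ipaddress

class ToyRouteTable:
    """Maps announced prefixes to the ASN that originates them."""

    def __init__(self):
        self.routes = {}  # ip_network -> origin ASN

    def announce(self, prefix: str, asn: int) -> None:
        self.routes[ipaddress.ip_network(prefix)] = asn

    def withdraw(self, prefix: str) -> None:
        self.routes.pop(ipaddress.ip_network(prefix), None)

    def origin_for(self, address: str):
        ip = ipaddress.ip_address(address)
        # Longest-prefix match, as real routers do.
        matches = [net for net in self.routes if ip in net]
        if not matches:
            return None
        return self.routes[max(matches, key=lambda net: net.prefixlen)]

table = ToyRouteTable()
table.announce("129.134.30.0/24", 32934)   # AS32934 is Facebook; prefix is illustrative
print(table.origin_for("129.134.30.12"))   # -> 32934: traffic has somewhere to go

table.withdraw("129.134.30.0/24")          # the outage: the announcement disappears
print(table.origin_for("129.134.30.12"))   # -> None: no one knows how to reach it
```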
Cloudflare’s logs of internet routing traffic showed a peak of routing changes from Facebook at 15:40 UTC.
“That’s when the trouble began. Routes were withdrawn, Facebook’s DNS servers went offline, and one minute after the problem occurred, Cloudflare engineers were in a room wondering why [1.1.1.1, our DNS resolver] couldn’t resolve facebook.com and worrying that it was somehow a fault with our systems,” said Cloudflare.
The network of networks that makes up the internet is built for resilience: BGP routes traffic between networks, while DNS translates names such as facebook.com into the IP addresses that traffic is sent to. But with the configuration changes Facebook made, other networks’ DNS resolvers could no longer reach Facebook’s nameservers, which translate facebook.com into an IP address, and so assumed they were offline.
“Due to Facebook stopping announcing their DNS prefix routes through BGP, our and everyone else’s DNS resolvers had no way to connect to their nameservers. Consequently, 1.1.1.1, 8.8.8.8, and other major public DNS resolvers started issuing (and caching) SERVFAIL responses,” Cloudflare noted.
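As a rough illustration of that behaviour, the sketch below models generic resolver logic (it is not Cloudflare’s implementation, and the TTL and IP values are made up): when the authoritative nameservers cannot be reached, the resolver answers SERVFAIL and briefly caches that failure so it does not hammer an unreachable target on every query.

```python
# Illustrative sketch of a resolver issuing and caching SERVFAIL responses.
import time

class ToyResolver:
    SERVFAIL_TTL = 30  # seconds; assumed value, real resolvers cache failures briefly

    def __init__(self, upstream_reachable: bool):
        self.upstream_reachable = upstream_reachable
        self.failure_cache = {}  # name -> expiry timestamp

    def query(self, name: str) -> str:
        now = time.time()
        # If we recently failed to reach the authoritative nameservers,
        # answer SERVFAIL from cache instead of retrying upstream.
        if self.failure_cache.get(name, 0) > now:
            return "SERVFAIL (cached)"
        if not self.upstream_reachable:
            # Facebook's nameservers were unreachable because their
            # routes had been withdrawn, so no answer can be obtained.
            self.failure_cache[name] = now + self.SERVFAIL_TTL
            return "SERVFAIL"
        return "NOERROR 157.240.0.35"  # illustrative answer when routes exist

resolver = ToyResolver(upstream_reachable=False)
print(resolver.query("facebook.com"))  # SERVFAIL
print(resolver.query("facebook.com"))  # SERVFAIL (cached)
```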
Because web apps tend to keep retrying lookups even when resolvers return SERVFAIL, Cloudflare said it saw a huge increase in DNS requests; its log data showed a 30-fold rise. According to Cloudflare, Facebook services resumed at 21:28 UTC.
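That retry behaviour can be sketched as follows. This is a hypothetical comparison, not measured client code: a client that retries immediately on every failure sends many times more queries than one that backs off, which is why a total outage multiplies the load on resolvers.

```python
# Hypothetical comparison of retry strategies during a total DNS outage.
import time

def lookup(name: str) -> bool:
    """Stand-in for a DNS query; during the outage every attempt fails."""
    return False

def naive_client(name: str, attempts: int = 30) -> int:
    """Retries immediately on failure: many queries for a single page load."""
    sent = 0
    for _ in range(attempts):
        sent += 1
        if lookup(name):
            break
    return sent

def backoff_client(name: str, attempts: int = 5) -> int:
    """Waits longer after each failure, keeping retry traffic bounded."""
    sent = 0
    delay = 0.05
    for _ in range(attempts):
        sent += 1
        if lookup(name):
            break
        time.sleep(delay)
        delay *= 2  # exponential backoff
    return sent

print("naive client queries:  ", naive_client("facebook.com"))
print("backoff client queries:", backoff_client("facebook.com"))
```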