Cloudflare Outage Whacks 19 Data Centers for Global Traffic
Giant Outage Is Result of a Network Configuration Change Gone Awry
A network configuration change that went awry resulted in a massive Cloudflare outage that left many of the world's most popular websites inaccessible for 75 minutes.
The San Francisco-based internet infrastructure vendor says the configuration change was meant to increase resilience at 19 of Cloudflare's busiest data centers, which handle a significant portion of global traffic. Instead, the change resulted in an outage Tuesday that knocked everything from Minecraft to UPS to DoorDash offline, according to the company blog.
"Although Cloudflare has invested significantly … to improve service availability, we clearly fell short of our customer expectations with this very painful incident," write Cloudflare's Tom Strickx, edge network technical lead, and Jeremy Hartman, senior vice president of production engineering. "We are deeply sorry for the disruption to our customers and to all the users who were unable to access Internet properties during the outage."
Strickx and Hartman say Cloudflare has been working over the past 18 months to convert its busiest locations to a more flexible and resilient architecture called Multi-Colo PoP. Nineteen data centers have been converted to this architecture, including Atlanta, Chicago and Los Angeles in the Americas; London, Frankfurt and Madrid in Europe; and Singapore, Sydney and Tokyo in Asia-Pacific.
"Even though these locations are only 4% of our total network, the outage impacted 50% of total requests," according to Strickx and Hartman.
What Went Wrong?
As part of this new architecture, Strickx and Hartman say there's an additional layer of routing that allows Cloudflare to easily disable or enable parts of the internal network for maintenance or to deal with a problem. They say the new architecture has provided Cloudflare with significant reliability improvements and allowed the company to conduct maintenance without disrupting customer traffic.
Strickx and Hartman say Cloudflare uses the Border Gateway Protocol (BGP) to define which IP addresses are advertised to or accepted by other networks to which Cloudflare needs to connect. A change in policy can mean that a previously advertised IP address is no longer reachable on the internet, according to the blog post.
While deploying a change to the IP address advertisement policies, a reordering of terms caused Cloudflare to withdraw a critical subset of IP addresses. Due to this withdrawal, it became more difficult for Cloudflare engineers to reach the affected locations to revert the problematic change, Strickx and Hartman say.
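To see why term order matters, consider a minimal sketch in Python of a first-match policy. It is not Cloudflare's configuration: the term names, the documentation prefixes and the default action are all assumptions chosen for illustration. The only idea taken from the blog post is that the same terms in a different order can withdraw a prefix.

```python
from ipaddress import ip_network

# Hypothetical policy: an ordered list of (term name, matched prefixes, action).
INTENDED = [
    ("accept-site-local", ["198.51.100.0/24"], "accept"),
    ("reject-the-rest", ["0.0.0.0/0"], "reject"),
]
# Same terms, but the catch-all reject has moved ahead of the accept.
REORDERED = [INTENDED[1], INTENDED[0]]


def evaluate(policy, prefix):
    """Return the action of the first term that matches the prefix."""
    net = ip_network(prefix)
    for _name, matches, action in policy:
        if any(net.subnet_of(ip_network(m)) for m in matches):
            return action
    return "reject"  # assumed default when no term matches


def advertised(policy, prefixes):
    """List the prefixes the policy would still advertise."""
    return [p for p in prefixes if evaluate(policy, p) == "accept"]


PREFIXES = ["198.51.100.0/24", "203.0.113.0/24"]
before = advertised(INTENDED, PREFIXES)
after = advertised(REORDERED, PREFIXES)
withdrawn = [p for p in before if p not in after]
print("withdrawn:", withdrawn)  # withdrawn: ['198.51.100.0/24']
```

Because evaluation stops at the first matching term, moving the catch-all reject to the top means the accept term is never reached, and the prefix it protected drops out of the advertised set.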
Cloudflare began sounding the alarm bells five minutes after the problematic change was pushed through, and the first changes were made on a router to verify the root cause 24 minutes after the 19 data centers were accidentally taken offline. The root cause was found and understood 31 minutes after the incident started. At that point, work began to revert the problematic change, Cloudflare says.
Forty-four minutes after that, all of the problematic changes had been reverted. During this time, Cloudflare says, problems were reappearing sporadically due to network engineers walking over each other's changes, which caused issues that had already been mitigated to resurface. The incident ended up being closed less than 90 minutes after it was first declared, according to Cloudflare.
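The milestones above add up to the 75 minutes of disruption cited at the top of this story. A short Python tally, using only the minute offsets reported here (the variable names are illustrative), makes the arithmetic explicit:

```python
# Minutes after the problematic change was pushed, as reported in the article.
MILESTONES = {
    "alarm raised": 5,
    "first router changes": 24,
    "root cause understood": 31,
}
REVERT_DURATION = 44  # minutes from root cause to the last revert

all_reverted = MILESTONES["root cause understood"] + REVERT_DURATION
diagnosis_time = MILESTONES["root cause understood"] - MILESTONES["alarm raised"]

for name, minute in MILESTONES.items():
    print(f"+{minute:>2} min  {name}")
print(f"+{all_reverted} min  all changes reverted")
print(f"alarm to root cause: {diagnosis_time} min")
```

The "less than 90 minutes" figure is a separate measurement: it runs from the moment the incident was declared to the moment it was closed, not from the change itself.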
What Can Be Done Differently?
Going forward, Strickx and Hartman say they plan to examine Cloudflare's processes, architecture and automation and implement some immediate changes to ensure nothing like this happens again. From a process standpoint, Cloudflare acknowledges that its stagger procedure didn't include any of the most-trafficked Multi-Colo PoP data centers until the final step.
In the future, Strickx and Hartman say, Cloudflare's change procedures and automation need to include Multi-Colo PoP-specific procedures for testing and deployment to avoid unintended consequences.
From an architecture standpoint, Cloudflare says, the incorrect router configuration prevented the proper routes from being announced, which in turn blocked traffic from flowing properly into the company's infrastructure. Going forward, Strickx and Hartman say, the policy statement that caused the incorrect routing to take place will need to be redesigned to prevent unintentional incorrect ordering.
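Cloudflare has not said what the redesigned policy statement will look like. One possible guard, sketched below in Python purely as an assumption, is a pre-deployment check that flags any catch-all reject term that is not the final term in a policy. The policy representation and term names are hypothetical.

```python
CATCH_ALL = "0.0.0.0/0"


def misplaced_rejects(policy):
    """Return names of catch-all reject terms that are not the final term."""
    last = len(policy) - 1
    return [
        name
        for index, (name, matches, action) in enumerate(policy)
        if action == "reject" and CATCH_ALL in matches and index != last
    ]


# Hypothetical policies: ordered (term name, matched prefixes, action) tuples.
good = [
    ("accept-site-local", ["198.51.100.0/24"], "accept"),
    ("reject-the-rest", [CATCH_ALL], "reject"),
]
bad = [good[1], good[0]]  # same terms, reject moved to the top

print(misplaced_rejects(good))  # []
print(misplaced_rejects(bad))   # ['reject-the-rest']
```

A check like this would run before a change is pushed, so a reordering mistake is caught as a failed validation rather than as withdrawn routes.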
The two also say automation initiatives could mitigate some or all of the impact seen from this outage. The automation will focus primarily on enforcing an improved stagger policy for rollouts of network configuration, which Strickx and Hartman say would have significantly lessened the overall impact of the outage.
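A staggered rollout of this kind can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Cloudflare's tooling: the site names, the "mcp-" prefix for Multi-Colo PoP locations, and the `deploy` and `healthy` callbacks are all hypothetical. The point is that placing one high-traffic site in the first stage lets an architecture-specific failure stop the rollout early.

```python
def staged_rollout(stages, deploy, healthy):
    """Deploy stage by stage; stop at the first stage that fails its health check.

    Returns (completed stages, failed stage or None).
    """
    completed = []
    for stage in stages:
        for site in stage:
            deploy(site)
        if not all(healthy(site) for site in stage):
            return completed, stage
        completed.append(stage)
    return completed, None


# Hypothetical site names; "mcp-" marks a Multi-Colo PoP location.
STAGES = [
    ["small-1", "mcp-canary"],  # one high-traffic site exercised early
    ["small-2", "small-3"],
    ["mcp-2", "mcp-3"],
]

deployed = []
completed, failed = staged_rollout(
    STAGES,
    deploy=deployed.append,
    # Pretend the change only breaks sites on the Multi-Colo PoP architecture.
    healthy=lambda site: not site.startswith("mcp-"),
)
print("deployed to:", deployed)
print("halted at stage:", failed)
```

In this run the rollout halts after the first stage, so the remaining Multi-Colo PoP sites never receive the change. Reverting the failed stage is left out of the sketch.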
Finally, Strickx and Hartman say, an automated "commit-confirm" rollback would have greatly reduced the time to resolve during the incident.
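The idea behind a commit-confirm rollback is that a change is applied provisionally and reverts on its own unless an operator confirms it within a set window. That matters when a bad change cuts off the very access needed to undo it. The Python sketch below is a toy model only: the `Router` class, its method names and the use of integer minutes for time are assumptions made for illustration, not any vendor's API.

```python
class Router:
    """Toy device that supports a commit-confirm style change window."""

    def __init__(self, config):
        self.config = config
        self._previous = None
        self._deadline = None

    def commit_confirmed(self, new_config, now, window):
        """Apply new_config; it must be confirmed by now + window."""
        self._previous = self.config
        self.config = new_config
        self._deadline = now + window

    def confirm(self, now):
        """Make the pending change permanent if the window is still open."""
        if self._deadline is not None and now <= self._deadline:
            self._previous = None
            self._deadline = None
            return True
        return False

    def tick(self, now):
        """Roll back automatically once the window expires unconfirmed."""
        if self._deadline is not None and now > self._deadline:
            self.config = self._previous
            self._previous = None
            self._deadline = None


router = Router("good-policy")
router.commit_confirmed("bad-policy", now=0, window=10)
router.tick(now=5)    # window still open, change stays in place
router.tick(now=11)   # nobody confirmed, so the router reverts by itself
print(router.config)  # good-policy
```

If engineers lose reachability right after the commit, they cannot confirm, and the device restores the previous configuration without anyone needing to log in.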
"This incident had widespread impact, and we take availability very seriously," they write. "We have identified several areas of improvement and will continue to work on uncovering any other gaps that could cause a recurrence."