Not So Fastly: Global Outage Highlights Cloud Challenges
Without Resiliency Plans, Cloud Infrastructure Can Become Single Point of Failure
Content delivery network Fastly says its global outage on Tuesday was caused by a software bug. While discovered and fixed quickly, the bug nevertheless disrupted access for many internet users around the world for part of the day.
The outage has led some IT experts to caution that while cloud-based services are cost-effective and provide greater reliability and uptime - and often better security - they can also become single points of failure if they go down, unless users have backup approaches in place.
Fastly has apologized for the outage and provided a first glimpse at what went wrong.
"We experienced a global outage due to an undiscovered software bug that surfaced on June 8 when it was triggered by a valid customer configuration change," says Nick Rockwell, senior vice president of engineering and infrastructure at Fastly, in a postmortem blog post. "We detected the disruption within one minute, then identified and isolated the cause, and disabled the configuration. Within 49 minutes, 95% of our network was operating as normal."
Fastly (@fastly), June 9, 2021: "On June 8, we experienced a global service interruption. Here is what happened - and what happens next." https://t.co/gffDur5Moh
Even so, there were rolling disruptions for at least a short time afterward. As Fastly said in a warning issued two hours after resolving the problem: "Customers could continue to experience a period of increased origin load and lower cache hit ratio." Origin load refers to the load placed on customers' own servers, while cache hit ratio measures the share of requests a cache can answer without going back to those servers.
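To make the link between the two measures concrete, here is a minimal sketch in Python. The function names and traffic figures are illustrative assumptions, not Fastly data or part of any Fastly API; the point is only that when the hit ratio drops, the requests that miss the cache land on the origin.

```python
def cache_hit_ratio(hits: int, misses: int) -> float:
    """Fraction of requests served from cache (0.0 when there is no traffic)."""
    total = hits + misses
    return hits / total if total else 0.0


def origin_requests(total_requests: int, hit_ratio: float) -> int:
    """Requests that miss the cache and fall through to the customer's origin."""
    return round(total_requests * (1.0 - hit_ratio))


# Hypothetical numbers: a warm cache versus one still refilling after an outage.
warm = cache_hit_ratio(hits=9_500, misses=500)
cold = cache_hit_ratio(hits=6_000, misses=4_000)

print(origin_requests(10_000, warm))  # 500
print(origin_requests(10_000, cold))  # 4000
```

In this made-up case, a fall in hit ratio from 95% to 60% multiplies origin traffic eightfold, which is why a recovering CDN can still strain customers' servers.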
Fastly says it failed to anticipate the type of failure that occurred.
"Even though there were specific conditions that triggered this outage, we should have anticipated it," Rockwell says. "We provide mission-critical services, and we treat any action that can cause service issues with the utmost sensitivity and priority. We apologize to our customers and those who rely on them for the outage and sincerely thank the community for its support."
Content delivery networks exist to replicate sites at "points of presence" that are ideally located as physically close as possible to the user browsing the website. When users browse a CDN-hosted site, they might enter the site's actual URL, but behind the scenes the request is typically directed to the CDN infrastructure. If that infrastructure fails, or suffers disruptions, then the CDN-hosted website may simply become unavailable.
Such problems can affect any CDN, including top players such as Cloudflare, Fastly, Amazon CloudFront and Akamai.
Infrastructure expert David Warburton says the outage is a reminder that the internet was built to be decentralized, so that if systems failed, communications would carry on regardless.
"What we’ve seen over the past decade, however, is the unintentional centralization of many core services through large cloud solution providers, like infrastructure vendors and CDNs," says Warburton, who's the principal threat research evangelist at Seattle-based application delivery networking and application security firm F5 Labs.
Over the past decade, more organizations have also come to rely on cloud-based applications such as Salesforce, ServiceNow and Square. Many applications and services run on cloud-based infrastructure, such as Amazon Web Services, Microsoft Azure and Google Cloud.
To provide high levels of availability to their websites and services, many organizations and SaaS providers alike now use CDNs. The world's biggest CDN is Cloudflare, but there are numerous players, including not just Fastly and Amazon CloudFront, but also Akamai, KeyCDN and Microsoft Azure CDN, among others.
Again, such approaches can create centralized models which, if not buttressed, can become single points of failure.
"In a traditional internet app deployment model, an outage of a server or misconfigured application might take out a single website," Warburton says. But with a cloud-based provider, an outage "can end up taking out all of their customers, resulting in not one website being taken offline, but hundreds or thousands. The impact can potentially affect organizations' digital experiences, revenues and reputations."
Not just CDNs but their upstream providers can also become failure points. On Aug. 30, 2020, telecommunications provider CenturyLink was offline for most of that Sunday morning, which in turn led to sites such as Cloudflare, Discord, Feedly, Hulu, PlayStation Network, Xbox Live and many others being unreachable. Cloudflare's outage, in turn, led to outages of dozens of its CDN customers. So too did a July 17, 2020, outage at Cloudflare that lasted a half hour, which the company blamed on a configuration error.
Network Resiliency Planning
Organizations that rely on CDNs need to include such outages in their risk management plans, says Brian Honan, head of cybersecurity consultancy BH Consulting in Dublin.
"Companies need to risk-assess whatever solutions they are implementing to determine what impact any outage in their CDN provider, or one of the CDN provider’s suppliers, could have on their own service and systems," he says. "Based on that risk assessment, they may need to look at what additional controls they may need to put in place to mitigate the risk caused by an outage."
Kris Beevers, CEO of New York-based NS1, which sells application traffic automation and intelligence solutions, says his firm frequently works with customers to create network resiliency plans to avoid outages or at least minimize their impact. He emphasizes that some level of underlying automation and rule sets is essential.
"Just having multiple CDNs isn’t enough because you need to have automation and [have] them configured properly to mitigate risk," he says. "Having a CDN for one type of static content and another for dynamic content, for example, wouldn’t solve the issue."
Hence one approach is to implement infrastructure from multiple vendors, he says, backed by using automated tools to reroute traffic.
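As a rough illustration of automated rerouting, the Python sketch below picks the first provider that passes a health check. The provider names, the `pick_provider` helper and the hard-coded status table are hypothetical; this is not NS1's product or any vendor's API, and a real setup would probe each provider over the network and update DNS or load-balancer records.

```python
from typing import Callable, Optional, Sequence


def pick_provider(
    preferred_order: Sequence[str],
    is_healthy: Callable[[str], bool],
) -> Optional[str]:
    """Return the first provider that passes its health check, else None."""
    for name in preferred_order:
        if is_healthy(name):
            return name
    return None


# Hypothetical health-check results; a real probe would issue HTTP requests
# against each provider and apply timeouts and retry thresholds.
status = {"cdn-primary": False, "cdn-secondary": True}

active = pick_provider(
    ["cdn-primary", "cdn-secondary"],
    lambda name: status.get(name, False),
)
print(active)  # cdn-secondary
```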
Based on their risk tolerance, some organizations instead opt for a more full-blown approach involving multiple CDNs or other providers. "For companies where absolutely no downtime or service impact is acceptable, network and application teams can establish dynamic steering policies to automatically shift traffic based on data about real-time conditions experienced by end users, ensuring that the applications and their customers are not impacted by a provider outage," he says. "Together, redundant infrastructure, appropriate configurations, and dynamic traffic steering will ensure that companies - and their customers - are not impacted by a provider outage."
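One way to picture a dynamic steering policy is as a function that turns recent measurements into traffic weights. The Python sketch below is an assumption-laden illustration rather than a description of any vendor's implementation: the `ProviderStats` fields, the 99% availability floor and the inverse-latency weighting are all choices made for the example.

```python
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass
class ProviderStats:
    name: str
    availability: float      # fraction of recent requests that succeeded
    median_latency_ms: float  # as measured from end users


def steering_weights(
    stats: Sequence[ProviderStats],
    min_availability: float = 0.99,  # assumed threshold for this example
) -> Dict[str, float]:
    """Share traffic among healthy providers, favoring lower latency."""
    if not stats:
        return {}
    eligible = [
        s for s in stats
        if s.availability >= min_availability and s.median_latency_ms > 0
    ]
    if not eligible:
        # Nothing meets the bar: spread traffic evenly rather than send none.
        return {s.name: 1.0 / len(stats) for s in stats}
    scores = {s.name: 1.0 / s.median_latency_ms for s in eligible}
    total = sum(scores.values())
    weights = {s.name: 0.0 for s in stats}
    weights.update({name: score / total for name, score in scores.items()})
    return weights


# Hypothetical measurements taken while one provider is suffering an outage.
weights = steering_weights([
    ProviderStats("cdn-a", availability=0.999, median_latency_ms=40.0),
    ProviderStats("cdn-b", availability=0.50, median_latency_ms=20.0),
    ProviderStats("cdn-c", availability=0.995, median_latency_ms=60.0),
])
print(weights)
```

Here the failing provider receives no traffic even though its latency looks best, and the remaining load is split in favor of the faster of the two healthy providers. Recomputing the weights as new measurements arrive is what makes the policy dynamic.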