'Kill Your Darlings' for Better Disaster Recovery
Lesson From Amazon S3 Outage: Identify Weaknesses
For any of the tens of thousands of organizations that may be smarting from this week's Amazon Web Services Simple Storage Service (S3) outage, take the following advice to heart: "You must kill your darlings."
Recommendations don't get more Gothic than that. The famous quote from William Faulkner is writing advice that warns against the danger of getting too comfortable with what you know.
The same advice, however, also applies to technology, and handily highlights the disaster recovery secrets practiced by the world's most cloud-savvy organizations.
Netflix, for example, uses an internal tool called Chaos Monkey. "This service pseudo-randomly plucks a server from our production deployment on AWS and kills it. At the time we were met with incredulity and skepticism. Are we crazy? In production?!?" the company's chaos team recounts in a 2015 blog post.
The group has even coined a related term: chaos engineering. "We need to identify weaknesses before they manifest in system-wide, aberrant behaviors," the group says.
The principle is simple: Summon demons and learn how to beat them before they sneak up and eat you for lunch.
Netflix says it's continued to refine its approach. "Building on the success of Chaos Monkey, we looked at an extreme case of infrastructure failure. We built Chaos Kong, which doesn't just kill a server. It kills an entire AWS Region."
While such outages are unusual, they do happen, and Netflix says its preparatory work has helped it sidestep many availability blips that it would have otherwise suffered.
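To make the idea concrete, here is a minimal sketch in Python of the two failure-injection styles described above. It is a toy simulation against an in-memory fleet, not Netflix's actual tooling: the `Server` and `Fleet` classes, the `chaos_monkey` and `chaos_kong` functions, the region labels and the "any server alive" health rule are all assumptions made for illustration.

```python
import random
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Server:
    name: str
    region: str
    alive: bool = True


@dataclass
class Fleet:
    servers: List[Server] = field(default_factory=list)

    def healthy(self) -> List[Server]:
        return [s for s in self.servers if s.alive]

    def can_serve(self) -> bool:
        # Hypothetical health rule: the service is up if any server is alive.
        return bool(self.healthy())


def chaos_monkey(fleet: Fleet, rng: random.Random) -> Optional[Server]:
    """Kill one pseudo-randomly chosen healthy server; return it, or None."""
    candidates = fleet.healthy()
    if not candidates:
        return None
    victim = rng.choice(candidates)
    victim.alive = False
    return victim


def chaos_kong(fleet: Fleet, region: str) -> int:
    """Kill every healthy server in one region; return how many were killed."""
    killed = 0
    for server in fleet.servers:
        if server.region == region and server.alive:
            server.alive = False
            killed += 1
    return killed


def build_fleet() -> Fleet:
    # Illustrative fleet: three servers in each of two regions.
    regions = ("us-east-1", "us-west-2")
    return Fleet([Server(f"{r}-{i}", r) for r in regions for i in range(3)])


if __name__ == "__main__":
    fleet = build_fleet()
    victim = chaos_monkey(fleet, random.Random())
    print("killed server:", victim.name, "| still serving:", fleet.can_serve())
    chaos_kong(fleet, "us-east-1")
    print("region down | still serving:", fleet.can_serve())
```

In this toy model, losing one server or one whole region leaves the service up because a second region exists. A fleet built in a single region would fail the second check, which is exactly the kind of weakness the exercise is meant to surface.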
No Amazon S3 For You
Such advice is relevant as more organizations and services rely on cloud-based infrastructure for everything from serving websites, to cloud-enabling IoT devices, to storing backups.
Of course, cloud-connected services can have bad days. Early on Feb. 28, for example, Amazon reported that it was seeing "high error rates with S3" in its eastern United States region, tied to a data center in northern Virginia. "We are working hard at repairing S3," it promised.
Numerous organizations were affected, including Netflix. Indeed, users of the service from around the world experienced disruptions, as did a range of other sites and services, including Medium, GitHub, Yahoo Mail and more.
"Hi all - we are aware of streaming issues in North & South America and we are working quickly to solve them. We will update you when... 1/2" - Netflix CS (@Netflixhelps), February 27, 2017
The outage also affected users of various internet-connected devices, with some people complaining that they couldn't turn their internet-connected lights on or their internet-connected oven off.
Later on Feb. 28, however, Amazon reported that the problem had been fixed. "As of 1:49 PM PST, we are fully recovered for operations for adding new objects in S3, which was our last operation showing a high error rate. The Amazon S3 service is operating normally."
Cloud Upside: Uptime
The outage aside, cloud-based services from the likes of Akamai, Amazon, Cloudflare and Google still generally provide better uptime and availability than the vast majority of enterprises could concoct by themselves, as Microsoft's Carmen Crincoli has noted. The services are also billed based on usage, which can make them especially affordable for smaller organizations.
"No, you wouldn't be doing better than AWS. Stop it. Just stop. You're embarrassing yourself by even thinking it." - Carmen Crincoli (@CarmenCrincoli), February 28, 2017
It behooves any organization that relies on such services to test what might happen if a major part of its cloud-based infrastructure becomes unavailable, and then to put better disaster recovery and failover plans in place. In other words, before disaster strikes, please unleash your chaos monkey.
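As a rough illustration of such a test, the Python sketch below reads an object from a primary store and falls back to a replica when the primary is unavailable, then deliberately switches the primary off to confirm the fallback works. Everything here is hypothetical: `InMemoryStore`, `StoreUnavailable` and `get_with_failover` are stand-ins invented for this example, and no real cloud storage API is called.

```python
from typing import Dict, List, Tuple


class StoreUnavailable(Exception):
    """Raised when a (simulated) storage backend cannot be reached."""


class InMemoryStore:
    """Hypothetical stand-in for an object store in one region."""

    def __init__(self, name: str, objects: Dict[str, bytes]) -> None:
        self.name = name
        self.objects = dict(objects)
        self.available = True  # flip to False to inject an outage

    def get(self, key: str) -> bytes:
        if not self.available:
            raise StoreUnavailable(f"{self.name} is unavailable")
        return self.objects[key]


def get_with_failover(key: str, stores: List[InMemoryStore]) -> Tuple[bytes, str]:
    """Try each store in order; return the value and the store that served it."""
    errors = []
    for store in stores:
        try:
            return store.get(key), store.name
        except StoreUnavailable as exc:
            errors.append(str(exc))
    raise StoreUnavailable("all stores failed: " + "; ".join(errors))


if __name__ == "__main__":
    data = {"report.txt": b"quarterly numbers"}
    primary = InMemoryStore("primary", data)
    replica = InMemoryStore("replica", data)

    primary.available = False  # simulate the outage before it happens for real
    value, served_by = get_with_failover("report.txt", [primary, replica])
    print("served by:", served_by)
```

The point of the drill is less the fallback function than the act of switching the primary off on purpose: if no replica exists, or the replica is stale, the organization learns that during a test rather than during an outage.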
Unfortunately, it's not clear that many organizations - beyond the likes of Netflix and its peers - have adopted these principles. "Sadly, I think organizations adopt these approaches rather like they adopt technology: there are leaders, [the] mainstream and, of course, the laggards," Alan Woodward, a computer science professor at Surrey University, tells me. "You just have to hope the service you may be dependent upon is a leader, not a laggard."
This piece has been updated with comment from Alan Woodward.