I have to give AWS credit for posting a fairly detailed technical description of the cause of their recent outage. Many companies rely on crisis PR people to craft vague announcements that do little to inform customers or put their minds at ease. I must admit, having read the AWS post-mortem a couple of times, I don’t fully understand what happened, but it seems my previous article on automation running wild was not far off. Of course, the point of that article was not to criticize automation. An operation the size of AWS would simply be impossible without it. The point was to illustrate the unintended consequences of automation systems. As a pilot and aviation buff, I can think of several examples of airplanes crashing because of out-of-control automation as well.
AWS tells us that “an automated activity to scale capacity of one of the AWS services hosted in the main AWS network triggered an unexpected behavior from a large number of clients inside the internal network.” What’s interesting here is that the automation event was not itself a provisioning of network devices. Rather, the capacity increase caused “a large surge of connection activity that overwhelmed the networking devices between the internal network and the main AWS network…” This is just the old problem of overwhelming link capacity. I remember a time at Juniper when a lab device started sending a flood of traffic to the Internet, crushing the Internet-facing firewalls. It’s nice to know that an operation like Amazon faces the same challenges. At the end of the day, bandwidth is finite, and enough traffic will ruin any network engineer’s day.
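The post doesn’t spell out what those clients were doing, but synchronized retries are the classic way a capacity event snowballs into a connection storm. Here is a toy simulation of the idea (every number in it is invented, and it is not a claim about how AWS’s clients behave) comparing clients that retry in lockstep with clients that add jitter to their backoff:

```python
import random
from collections import Counter

# Toy model: many clients lose their connections at t=0 and try to
# reconnect through a device that can only absorb CAPACITY new
# connection attempts per second. All numbers are invented.
CLIENTS = 10_000
CAPACITY = 1_500   # connection attempts per second the device can absorb
SECONDS = 30

def retry_times(jitter: bool) -> list[int]:
    """Return the seconds at which one client's retry attempts land."""
    times, delay, t = [], 1, 0.0
    while t < SECONDS:
        times.append(int(t))
        # Exponential backoff, capped at 16 seconds; with jitter the wait
        # is spread uniformly over [0, delay] instead of being identical
        # for every client.
        t += random.uniform(0, delay) if jitter else delay
        delay = min(delay * 2, 16)
    return times

for jitter in (False, True):
    load = Counter()
    for _ in range(CLIENTS):
        for t in retry_times(jitter):
            load[t] += 1
    overloaded = [t for t in range(SECONDS) if load[t] > CAPACITY]
    print(f"jitter={jitter}: seconds over capacity -> {overloaded}")
```

Without jitter, the entire population hammers the device on every retry; with jitter, the load spreads out after the first synchronized burst. It’s a cartoon, but it shows how quickly “a large number of clients” can turn a routine event into congestion.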
“This congestion immediately impacted the availability of real-time monitoring data for our internal operations teams, which impaired their ability to find the source of congestion and resolve it.” This is the age-old problem, isn’t it? Monitoring our networks requires network connectivity. How else do we get logs, telemetry, traps, and other information from our devices? And yet, when our network is down, we can’t get this data. Most large-scale customers do maintain a separate out-of-band network just for monitoring. I would assume Amazon does the same, but perhaps somehow this got crushed too? Or perhaps what they refer to as their “internal network” was the OOB network? I can’t tell from the post.
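For what it’s worth, keeping telemetry off the production network is conceptually simple, even if doing it at AWS scale clearly isn’t. Here is a minimal sketch of the idea in Python, with made-up RFC 5737 addresses standing in for the OOB management interface and collector; none of this comes from the AWS post:

```python
import socket

# Hypothetical addresses: this device's out-of-band management interface
# and a syslog collector that is reachable only over the OOB network.
OOB_SOURCE_IP = "192.0.2.10"
OOB_COLLECTOR = ("192.0.2.250", 514)

def send_oob_syslog(message: str) -> None:
    """Send a syslog message sourced from the OOB management address, so
    monitoring data still gets out when the main network is congested."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        # Bind to the management address; with the collector living on the
        # OOB subnet, this traffic never has to cross the main network.
        sock.bind((OOB_SOURCE_IP, 0))
        sock.sendto(f"<134>{message}".encode(), OOB_COLLECTOR)
    finally:
        sock.close()

if __name__ == "__main__":
    try:
        send_oob_syslog("monitoring heartbeat: device reachable via OOB")
    except OSError:
        # Expected on any machine that doesn't actually own the placeholder
        # management address above.
        pass
```

The interesting engineering problem, of course, is not sending the message; it is making sure the OOB network itself can’t be crushed by the very event you’re trying to observe.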
“Operators continued working on a set of remediation actions to reduce congestion on the internal network including identifying the top sources of traffic to isolate to dedicated network devices, disabling some heavy network traffic services, and bringing additional networking capacity online. This progressed slowly…” I don’t want to take pleasure in others’ pain, but this makes me smile. I’ve spent years telling network engineers that no matter how good their tooling, they are still needed, and they need to keep their skills sharp. Here is Amazon, with presumably the best automation and monitoring capabilities of any network operator, and they were trying to figure out top talkers and shut them down. This reminds me of the first broadcast storm I faced, in the mid-1990s. I had to walk around the office unplugging things until I found the source. Hopefully it wasn’t that bad for AWS!
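Finding top talkers hasn’t changed much since that broadcast storm: add up the traffic per source and go after the biggest contributors. A rough sketch of that triage, using invented flow records where real NetFlow, sFlow, or IPFIX exports would normally be:

```python
from collections import Counter

# Hypothetical flow records: (source IP, destination IP, bytes).
# In practice these would come from flow exports or a packet capture,
# not a hard-coded list.
flows = [
    ("10.0.1.5",  "10.9.0.2", 48_000_000),
    ("10.0.1.5",  "10.9.0.3", 51_000_000),
    ("10.0.2.17", "10.9.0.2",  2_400_000),
    ("10.0.3.9",  "10.9.0.7",    900_000),
]

def top_talkers(records, n=10):
    """Sum bytes per source address and return the n heaviest senders."""
    totals = Counter()
    for src, _dst, nbytes in records:
        totals[src] += nbytes
    return totals.most_common(n)

for src, total in top_talkers(flows):
    print(f"{src:15} {total / 1e6:8.1f} MB")
```

The hard part during an outage isn’t the arithmetic; it’s getting trustworthy flow data off the network while the network is the problem, which loops right back to the monitoring issue above.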
Outages happen, and Amazon has maintained a high level of service with AWS since the beginning. The resiliency of such a complex environment should be astounding to anyone who has built and managed complex systems. Still, at the end of the day, no matter how much you automate (and you should), no matter how much you assure (and you should), sometimes you have to dust off the packet sniffer and figure out what’s actually going down the wire. For network engineers, that should be a reminder that you’re still relevant in a software-defined world.