Skip navigation

Tag Archives: amazon

I have to give AWS credit for posting a fairly detailed technical description of the cause of their recent outage.  Many companies rely on crisis PR people to phrase vague and uninformative announcements that do little to inform customers and put their minds at ease.  I must admit, having read the AWS post-mortem a couple times, I don’t fully understand what happened, but it seems my previous article on automation running wild was not far off.  Of course, the point of the article was not to criticize automation.  An operation the size of AWS would be simply impossible without it.  The point was to illustrate the unintended consequences of automation systems.  As a pilot and aviation buff, I can think of several examples of airplanes crashing due to out-of-control automation as well.

AWS tells us that “an automated activity to scale capacity of one of the AWS services hosted in the main AWS network triggered an unexpected behavior from a large number of clients inside the internal network.”  What’s interesting here is that the automation event was not itself a provisioning of network devices.  Rather, the capacity increase caused “a large surge of connection activity that overwhelmed the networking devices between the internal network and the main AWS network…”  This is just the old problem of overwhelming link capacity.  I remember one time, when I was at Juniper, and a lab device started sending a flood of traffic to the Internet, crushing the Internet-facing firewalls.  It’s nice to know that an operation like Amazon faces the same challenges.  At the end of the day, bandwidth is finite, and enough traffic will ruin any network engineer’s day.

“This congestion immediately impacted the availability of real-time monitoring data for our internal operations teams, which impaired their ability to find the source of congestion and resolve it.”  This is the age-old problem, isn’t it?  Monitoring our networks requires network connectivity.  How else do we get logs, telemetry, traps, and other information from our devices?  And yet, when our network is down, we can’t get this data.  Most large-scale customers do maintain a separate out-of-band network just for monitoring.  I would assume Amazon does the same, but perhaps somehow this got crushed too?  Or perhaps what they refer to as their “internal network” was the OOB network?  I can’t tell from the post.

“Operators continued working on a set of remediation actions to reduce congestion on the internal network including identifying the top sources of traffic to isolate to dedicated network devices, disabling some heavy network traffic services, and bringing additional networking capacity online. This progressed slowly…”  I don’t want to take pleasure in others’ pain, but this makes me smile.  I’ve spent years telling networking engineers that no matter how good their tooling, they are still needed, and they need to keep their skills sharp.  Here is Amazon, with presumably the best automation and monitoring capabilities of any network operator, and they were trying to figure out top talkers and shut them down.  This reminds me of the first broadcast storm I faced, in the mid-1990’s.  I had to walk around the office unplugging things until I found the source.  Hopefully it wasn’t that bad for AWS!

Outages happen, and Amazon has maintained a high-level of service with AWS since the beginning.  The resiliancy of such a complex environment should be astounding to anyone who has built and managed complex systems.  Still, at the end of the day, no matter how much you automate (and you should), no matter how much you assure (and you should), sometimes you have to dust off the packet sniffer and figure out what’s actually going down the wire.  For network engineers, that should be a reminder that you’re still relevant in a software-defined world.

As I write this, a number of sites out on the Internet are down because of an outage at Amazon Web Services.  Delta Airlines is suffering a major outage.  On a personal note, my wife’s favorite radio app and my Lutron lighting system are not operating correctly.  Of course, this outage is a reminder of the simple principle of not putting one’s eggs in a single basket.  AWS became the dominant web provider early on, but there are multiple viable alternatives now.  Long before the modern cloud emerged, I regularly ran disaster recovery exercises to ensure business continuity when a data center or service provider failed.  Everyone who uses a cloud provider better have a backup, and you better figure out a way to periodically test that backup.  A few startups have emerged to make this easier.

While the cause of the outage is yet unknown, there was an interesting comment in an Newsweek article on the outage.  Doug Madory, director of internet analysis an Kentik Inc, said:  “More and more these outages end up being the product of automation and centralization of administration…”  I’ve been involved in automation in some form or another for my entire six years at Cisco, and one aspect of automation is not talked about enough:  automation gone wild.  Let me give a non-computer example.

Back when I worked at the San Francisco Chronicle, the production department installed a new machine in our Union City printing plant.  The Sunday paper, back then, had a large number of inserts with advertisements and circulars that needed to be stuffed into the paper.  They were doing this manually, if you can believe it.

The new machine had several components.  One part of the process involved grabbing the inserts and carrying them in a conveyor system high above the plant floor, before dropping them down into the inserter.  It’s hard to visualize, so I’ve included a picture of a similar machine.

You can see the inserts coming in via the conveyor, hanging vertically.  This conveyor extended quite far.  One day I was in the plant, working on some networking thing or other, and the insert machine was running.  I looked back and saw the conveyor glitch somehow, and then a giant ball of paper started to form in the corner of the room, before finally exploding and raining paper down on the floor of the plant.  There was a commotion and one of the workers had to shut the machine down.

The point is, automation is great until it doesn’t work.  When it fails, it fails big.  You don’t just get a single problem, but a compounding problem.  It wasn’t just a single insert that got hit by the glitch, but dozens of them, if not more.  When you use manual processes, failures are contained.

Let’s tie this back to networking.  Say you need to configure hundreds of devices with some new code, perhaps adding a new routing protocol.  If you do it by hand in one device, and suddenly routes start dropping out of the routing table, chances are you won’t proceed with the other devices.  You’ll check your config to see what happened and why.  But if you set up, say, a Python script to run around and do this via NETCONF to 100 devices, suddenly you might have a massive outage on your hands.  The same could happen using a tool like Ansible, or even a vendor network management platform.

There are ways to combat this problem, of course.  Automated checks and validation after changes is an important one, but the problem with this approach is you cannot predict every failure.  If you program 10 checks, it’s going to fail in way #11, and you’re out of luck.

As I said, I’ve spent years promoting automation.  You simply couldn’t build a network like Amazon’s without it.  And it’s critical for network engineers to continue developing skills in this area.  We, as vendors and promoters of automation tools, need to be careful how we build and sell these tools to limit customer risk.

Eventually they got the inserter running again.  Whatever the cause of Amazon’s outage, let’s hope it’s not automation gone wild.