As I write this, a number of sites out on the Internet are down because of an outage at Amazon Web Services. Delta Airlines is suffering a major outage. On a personal note, my wife’s favorite radio app and my Lutron lighting system are not operating correctly. Of course, this outage is a reminder of the simple principle of not putting all one’s eggs in one basket. AWS became the dominant web provider early on, but there are multiple viable alternatives now. Long before the modern cloud emerged, I regularly ran disaster recovery exercises to ensure business continuity when a data center or service provider failed. Everyone who uses a cloud provider had better have a backup, and had better figure out a way to periodically test that backup. A few startups have emerged to make this easier.
While the cause of the outage is not yet known, there was an interesting comment in a Newsweek article on the outage. Doug Madory, director of internet analysis at Kentik Inc., said: “More and more these outages end up being the product of automation and centralization of administration…” I’ve been involved in automation in some form or another for my entire six years at Cisco, and one aspect of automation is not talked about enough: automation gone wild. Let me give a non-computer example.
Back when I worked at the San Francisco Chronicle, the production department installed a new machine in our Union City printing plant. The Sunday paper, back then, had a large number of inserts with advertisements and circulars that needed to be stuffed into the paper. They were doing this manually, if you can believe it.
The new machine had several components. One part of the process involved grabbing the inserts and carrying them in a conveyor system high above the plant floor, before dropping them down into the inserter. It’s hard to visualize, so I’ve included a picture of a similar machine.
You can see the inserts coming in via the conveyor, hanging vertically. This conveyor extended quite far. One day I was in the plant, working on some networking thing or other, and the insert machine was running. I looked back and saw the conveyor glitch somehow, and then a giant ball of paper started to form in the corner of the room, before finally exploding and raining paper down on the floor of the plant. There was a commotion and one of the workers had to shut the machine down.
The point is, automation is great until it doesn’t work. When it fails, it fails big. You don’t just get a single problem, but a compounding problem. It wasn’t just a single insert that got hit by the glitch, but dozens of them, if not more. When you use manual processes, failures are contained.
Let’s tie this back to networking. Say you need to push a configuration change to hundreds of devices, perhaps adding a new routing protocol. If you make the change by hand on one device, and suddenly routes start dropping out of the routing table, chances are you won’t proceed to the other devices. You’ll check your config to see what happened and why. But if you set up, say, a Python script to push that change via NETCONF to 100 devices, suddenly you might have a massive outage on your hands. The same could happen using a tool like Ansible, or even a vendor’s network management platform.
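To make that concrete, here is a minimal sketch of the risky pattern, written in Python with the ncclient NETCONF library. The device list, credentials, and configuration payload are all hypothetical stand-ins, not a recommendation of how to do this.

```python
from ncclient import manager

# Hypothetical inventory and payload: stand-ins for a real device list
# and a vendor-specific NETCONF snippet that enables a routing protocol.
DEVICES = [f"10.0.0.{i}" for i in range(1, 101)]
NEW_ROUTING_CONFIG = "<config> ... </config>"  # placeholder XML

# The risky pattern: push the change to all 100 devices in one pass,
# with no pause and no check in between. If the change is subtly wrong,
# every device has it before anyone notices routes falling out of the table.
for host in DEVICES:
    with manager.connect(host=host, port=830, username="admin",
                         password="secret", hostkey_verify=False) as conn:
        conn.edit_config(target="running", config=NEW_ROUTING_CONFIG)
```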
There are ways to combat this problem, of course. Automated checks and validation after each change are an important one, but the problem with this approach is that you cannot predict every failure mode. If you program 10 checks, the change is going to fail in way #11, and you’re out of luck.
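For what that mitigation might look like, here is a sketch that gates the rollout on a post-change check, reusing the hypothetical placeholders from the sketch above. The route-count comparison is itself hypothetical, a stand-in for whatever validation you would actually run, and it only catches the failure modes you thought to check for.

```python
from ncclient import manager

# Same hypothetical inventory and payload as the earlier sketch.
DEVICES = [f"10.0.0.{i}" for i in range(1, 101)]
NEW_ROUTING_CONFIG = "<config> ... </config>"  # placeholder XML

def route_count(conn):
    # Hypothetical check: count route entries in the device's operational
    # state. A real validation would be vendor- and model-specific.
    reply = conn.get(filter=("xpath", "//route"))
    return reply.data_xml.count("<route")

for host in DEVICES:
    with manager.connect(host=host, port=830, username="admin",
                         password="secret", hostkey_verify=False) as conn:
        before = route_count(conn)
        conn.edit_config(target="running", config=NEW_ROUTING_CONFIG)
        after = route_count(conn)
        # Stop at the first device that looks unhealthy instead of
        # marching on to the other 99.
        if after < before:
            raise RuntimeError(f"{host}: routes dropped after change, halting rollout")
```

Even with the check in place, a staged rollout (one device, then a handful, then the rest) limits the blast radius when the check itself misses something.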
As I said, I’ve spent years promoting automation. You simply couldn’t build a network like Amazon’s without it. And it’s critical for network engineers to continue developing skills in this area. We, as vendors and promoters of automation tools, need to be careful how we build and sell these tools to limit customer risk.
Eventually they got the inserter running again. Whatever the cause of Amazon’s outage, let’s hope it’s not automation gone wild.