TAC Tales #4: Airline Outage

I don’t advertise this blog so I’m always amazed that people even find it. I figured the least-read articles on this blog were my “TAC Tales,” but someone recently commented that they wanted to see more… Well, I’m happy to oblige.

The recent events at United reminded me of a case where operations were down for one of the major airlines at Miami International Airport. It didn’t directly impact flight operations, but ticketing and baggage handling systems were down. Naturally, it was a P1 and so I dialed into the conference bridge.

This airline had four Cat 6500s acting as the core of their network. The four switches had vastly disparate hardware and software configurations. I seem to recall one of them was running a Supe 1 module, which was old even in 2007 when I took the case, and each switch was running a different software version.

EIGRP was acting funny. As a TAC engineer on the routing protocols team, I absolutely hated EIGRP. EIGRP Stuck-In-Active was my nightmare case: it was always such a pain to track down the source, and meanwhile you’d have peers resetting all over the place. Neither OSPF nor IS-IS does that. I once got into a debate on an internal Cisco alias with some EIGRP guys. Granted, I had insulted their life’s work by stating that EIGRP was fast but unreliable and prone to meltdown. Their retort was that properly designed EIGRP networks do not melt down. Great, but when are networks ever properly designed? They are so often slapped together haphazardly and grow organically, and a routing protocol needs to stay resilient even when the design isn’t planned. Of course, those of us in design and architecture positions do our best to build highly available networks, but you don’t want to be running a protocol that flips out when a route at some far end of the network disappears.
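To be fair to the EIGRP guys, “proper design” in EIGRP-land mostly means bounding query scope so a Stuck-In-Active event can’t ripple across the whole network, typically by making remote routers stubs and summarizing toward the core. A minimal sketch of the idea, with a made-up AS number, addressing, and interface:

! made-up AS number, addressing, and interface
router eigrp 100
 network 10.0.0.0
 eigrp stub connected summary
!
interface Serial0/0
 ip summary-address eigrp 100 10.1.0.0 255.255.0.0

A stub router never gets queried for routes it can’t possibly have, and a router that only knows a summary can answer a query immediately instead of propagating it further. Anyhow, back to the story.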

The adjacencies on all four boxes were resetting constantly. It was totally unstable. Every five minutes or so, some manager from the airline would hop on the bridge to tell us that they were using handwritten tickets and baggage tags, that the lines at the ticket counters were going out the door, and so on. Because that really helps me concentrate. I tried to troubleshoot the way TAC engineers are trained to troubleshoot: collect logs, search for bugs in the relevant software, look for configuration issues. And with routing adjacency flaps on switches, always check for STP issues. I couldn’t figure it out.
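For what it’s worth, the first-pass checks on a box like that look something like this (a rough sketch; exact commands and output vary by platform and IOS version):

show ip eigrp neighbors
show ip eigrp traffic
show logging | include EIGRP|SPANTREE
show spanning-tree detail | include ieee|occurr|from

You’re looking at neighbor uptimes, SIA counts, and whether the STP topology-change timestamps line up with the adjacency flaps.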

Finally some high-level engineer for the airline got on the phone and took over like a five-star general. He had his ops team systematically shut down and reset the switches, one at a time. The instability stopped. Wish I’d thought of that.

The standards for a routing protocol like OSPF are written by slow-moving committees, and hence don’t change much. These committees often have members from multiple competing vendors who disagree on exactly what should be done, and even when they do agree, nothing happens fast in the IETF. Conversely, Cisco owns EIGRP and can change it as much as it wants; even Cisco’s internal committees are nowhere near as bureaucratic as the IETF. This means there can be significant changes in the EIGRP code between IOS releases, much more so than for OSPF, and it is thus vital to keep code revisions among participating routers fairly close.

In this case, the consulting engineers for the airline helped them to standardize the hardware and software revisions. They never re-opened the case.

TAC Tales #2: How to troubleshoot

The case came in P1, and I knew it would be a bad one. One thing you learn as a TAC engineer is that P1 cases are often the easiest. A router is down, send an RMA. But I knew this P1 would be tough because it had been requeued three times. The last engineer who had it was good, very good. And it wasn’t solved. Our hotline gave me a bridge number and I dialed in.

The customer explained to me that he had a 7513 and a 7206 with a multilink PPP bundle between them made up of eight T1 lines. The MLPPP interface had mysteriously gone down/down and they couldn’t get it back, and the member links were all up/down. Why they were connecting them this way was not a question an HTTS engineer was allowed to ask; we were just there to troubleshoot. While I was on the bridge, they were systematically taking each T1 out of the bundle, putting HDLC encapsulation on it, pinging across, and then putting it back into the MLPPP bundle. This bought me time to look over the case notes.

There were multiple RMAs in the notes. They had RMA’d the line cards and the entire chassis, and the 7513 they were shipped had problems, so they RMA’d it a second time. RMA’ing an entire 7513 chassis is a real pain. I perused the configs to see if authentication was configured on the PPP interfaces, but it wasn’t. It looked like a PPP problem (up/down state), but the interface config was plain vanilla MLPPP.
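For context, “plain vanilla” here means something like the following, with made-up interface numbers and addressing (the exact multilink-group syntax varies by IOS release):

! interface numbers and addressing are made up
interface Multilink1
 ip address 192.0.2.1 255.255.255.252
 ppp multilink
!
interface Serial1/0
 no ip address
 encapsulation ppp
 ppp multilink
 ppp multilink group 1

Nothing in there touches authentication, which is why the interface configs looked clean at first glance.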

They finished testing all of the T1s individually. One of the engineers said, “I think we need another RMA.” I told them to hang on. “Take all of the links out of the bundle and give me an MLPPP bundle with one T1,” I said. “But we tested them all individually!” they replied. “Yes, but you tested them with HDLC. I want to test one link with multilink PPP on it.” They agreed, and with a single link the bundle was still down/down. Now we were getting somewhere. I had them switch which link was the active one. Same problem. Now disable multilink and just run straight PPP on a single link. Same thing.

“Can you turn on debug ppp with all options?” I asked. They were worried about doing it on the 7513, but I convinced them to do it on the 7206. They sent me the logs, and this stood out:

AAA/AUTHOR/LCP: Denied

Authorization failed. But why? Nothing was configured under the interface, but I looked at the top of the config, where the AAA commands are, and saw this:

aaa authorization network default

And there it was. “Guys, could you remove this one line from the config?” I asked. They did. The single PPP link came up. “Let’s do this slowly. Add the single link back into multilink mode.” Up/up. “Now add all the links back.” It was working.

It turns out they had a project to standardize their configs across all their routers and had accidentally added that line. They had RMA’d an entire 7513 chassis (twice!) for a single line of config. Replacing a 7513 is a lot of work. I still can’t believe it got that far.
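As an aside, if network authorization is genuinely needed somewhere, the safer pattern is a named method list applied only to the interfaces that want it, rather than changing the default for every PPP link on the box. A sketch, assuming aaa new-model and a TACACS+ server group are already configured, and with a hypothetical list name:

! hypothetical list name; assumes aaa new-model and a tacacs+ server group
aaa authorization network PPP-AUTHOR group tacacs+
!
interface Serial2/0
 ppp authorization PPP-AUTHOR

With a named list, the default method stays untouched and unauthenticated bundles like this one keep coming up.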

Some lessons from this story: first, RMAs don’t always fix the problem. Second, even good engineers make stupid mistakes. Third, when troubleshooting, always limit the scope of the problem; troubleshoot as little as you can. And finally, even hard P1s can turn out easy.

TAC Tales #1: Case routing

Before I worked at TAC, I was pretty careless about how I filled in a TAC case online. For example, when I had to select the technology I was dealing with from the drop-down menu, if I didn’t see exactly what I had, I would pick something at random and figure TAC would sort it out. And then I would get frustrated when I didn’t get an answer on my case for hours. Working in TAC showed me why.

When you open a TAC case and pick a particular technology, your choice determines which queue the case is routed to. For example, if you pick Catalyst 6500, the case ends up in a queue monitored by engineers who are experts on that platform. Under TAC rules (assuming it is a priority 3 case), the engineers have 20 minutes to pick up the case. If they don’t, it turns blue in their display and their duty manager starts asking questions. In high-touch TAC (HTTS), where I worked, we didn’t have too many blue cases, but in backbone TAC it wasn’t uncommon to see a ton of blue and even black (more than an hour old) cases sitting in a busy queue.

If the customer categorized his case wrong, it sat in the wrong queue. Now an engineer had to notice the case, review it, determine where it should go, and “punt” it to the appropriate queue, at which point the counters reset and the case started its wait all over again.

Imagine for a moment that you are an overworked TAC engineer with 30 minutes left on your shift. You are supposed to clear out your queue and take any waiting cases before the next crew comes on (at least that was the expectation in HTTS). You don’t want to take any more cases, however. There is a case sitting in your queue which has turned blue, and your colleagues may not be happy to see it sitting there when they come on shift. Well, you’re an experienced TAC engineer and you know what to do: punt the case to another queue, even if it’s the wrong one. If you pick a busy queue, it will take at least 30 minutes for the engineers on that queue to spot the “mis-queue” and punt the case back, at which point you are off shift and it becomes the problem of your colleagues on the next shift.

My recommendation is to be very careful to select the right menu options when you open a case online with any tech support organization. Make sure you route the case to the right place the first time so you don’t have to wait for engineers and managers to look at it and re-categorize it.