I’ve mentioned before that, despite being on the Routing Protocols team, I spent a lot of time handling crash cases in TAC. At the time, my queue was just a dumping ground for cases that didn’t fit into any other bucket in the High Touch structure. Backbone TAC had a much more granular division of teams, including a team entirely dedicated to crash. But in HTTS, we did it all.
Some crashes were minor, like a (back then) 2600-series router reloading due to a bus error. Others were catastrophic, particularly crashes on large chassis-type routing systems in service provider networks. These could have hundreds of interfaces, and with sub-interfaces, potentially thousands of customers affected by a single outage. Chassis platforms vary in their architecture, but many of the platforms we ran at the time used a distributed architecture in which the individual line cards ran a subset of IOS. Thus, unlike a 2600, which had “dumb” WIC cards for interface connections, on chassis systems the line cards themselves could crash in addition to the route processors. Oftentimes, when one line card crashed, the effect would cascade through the box, taking down multiple line cards and resulting in a massive meltdown.
The 7500 was particularly prone to these. A workhorse of Cisco’s early product line, the 7500 had line cards that ran IOS but forwarded packets to one another by placing them into special queues on the route processor. This was quite unlike later products, such as the Gigabit Switch Router (GSR), which had a fabric architecture that allowed line cards to communicate directly. On the 7500, a misbehaving line card would often write bad data into the shared queues; other line cards would then read that data and crash themselves, causing a cascading failure.
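To make that failure mode concrete, here is a minimal toy sketch (in Python, since this is illustration rather than router code) of a shared-queue forwarding design like the one described above. Everything in it is invented for the example: the class names, the “descriptor” format, and the crash behavior have no relation to real IOS internals. The point is simply that when every card reads from queues that any card can write to, a single bad writer can take down many readers, which is exactly what a per-card switch fabric avoids.

```python
from collections import deque

# Toy model of the shared-queue failure mode: line cards all read and write
# packet queues hosted on the route processor (RP), so one card writing
# garbage can crash every card that later reads it. Purely illustrative --
# not IOS code, and the real 7500 queueing is far more involved.

class LineCardCrash(Exception):
    pass

class RouteProcessor:
    """Holds the shared queue that every line card reads from and writes to."""
    def __init__(self):
        self.shared_queue = deque()

class LineCard:
    def __init__(self, slot, faulty=False):
        self.slot = slot
        self.faulty = faulty

    def enqueue_packet(self, rp, payload):
        # A faulty card writes a malformed descriptor into the shared queue.
        rp.shared_queue.append({"payload": payload, "valid": not self.faulty})

    def dequeue_packet(self, rp):
        # Any card that dequeues a corrupted descriptor goes down.
        descriptor = rp.shared_queue.popleft()
        if not descriptor["valid"]:
            raise LineCardCrash(
                f"line card in slot {self.slot} crashed on a corrupted descriptor"
            )
        return descriptor["payload"]

rp = RouteProcessor()
bad_card = LineCard(slot=1, faulty=True)
healthy_cards = [LineCard(slot=s) for s in (0, 2, 3)]

# The single misbehaving card poisons the shared queue...
for _ in healthy_cards:
    bad_card.enqueue_packet(rp, payload="garbage")

# ...and every healthy card that drains the queue crashes in turn: a
# cascading failure triggered by one bad writer. A fabric architecture
# like the GSR's removes this shared point of corruption.
for card in healthy_cards:
    try:
        card.dequeue_packet(rp)
    except LineCardCrash as err:
        print(err)
```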
One of our big customers, a Latin American telecommunications company I’ll call LatCom, was a heavy user of 7500s. They were a constant source of painful cases, and for some reason had a habit of opening P1 cases on Fridays at 5:55pm. Back then, HTTS day-shift engineers’ shifts ended at 6pm, at which point the night shift took over, but once we accepted a P1 or P2 case, unlike backbone TAC, we had to work it until resolution. LatCom drove us nuts. Five minutes was the difference between going home for the weekend and potentially being stuck on the phone until 10pm on a Friday night. The fact that LatCom’s engineers barely spoke English also proved a challenge and drew out the cases: occasionally we had to work through non-technical translators, and getting them to render “there was a CEF bug causing bad data to be placed into the queue on the RP” into Spanish was problematic.
After years of nightmare 7500 crashes, LatCom finally did what we asked: they dropped a lot of money to upgrade their routers to GSRs with PRPs, at that time our most modern box. All the HTTS RP engineers breathed a sigh of relief, knowing that the days of nightmare cascading line card failures on 7500s were coming to an end. We had never seen a single case of such a failure on a GSR.
That said, we knew that if anything bad was going to happen, it would happen to these guys. And sure enough, one day I got a case with…you guessed it, a massive cascading line card failure on a GSR! The first one I had seen. In the case notes I described the failure as follows:
- Six POS (Packet over SONET) interfaces went down at once
- Fifteen seconds later, slots 1 and 15 started showing CPUHOG messages followed by tracebacks
- Everything stabilized until a few hours later, when the POS interfaces went down again
- Then, line cards in slots 0, 9, 10, 11, and 13 crashed
- Fifteen seconds later, line cards in slots 6 and 2 crashed
- And so forth
My notes said: “basically we had a meltdown of the box.” To make matters worse, four days later they had an identical crash on another GSR!
When faced with this sort of mess, TAC engineers would usually send the details to an internal mailer, which is exactly what I did. The usual attempt by some on the mailer to throw hardware at the problem didn’t go far, since we had seen the exact same crash on a second router. This seemed to be a CEF bug.
Re-reading the rather extensive case notes brings up a lot of pain. Because the customer had just spent millions of dollars to replace their routers with a new platform that, we had assured them, would not be susceptible to the same problem, this went all the way to their top execs and ours. We were under tremendous pressure to find a solution, and frankly, we all felt bad because we had been sure the new platform would put an end to their problems.
There are several ways for a TAC engineer to get rid of a case: resolve the problem, tell the customer it is not reproducible, or wait for it to get re-queued to another engineer. But after two long years at TAC, two years of constant pressure, a relentless stream of cases, angry customers, and problem after problem, my “dream job” at Cisco was taking a toll. When my old friend Mike, who had hired me at the San Francisco Chronicle, my first network engineering job, called and asked me to join him at a gold partner, the decision wasn’t hard to make. And so I took the easiest route to getting rid of cases, a lot of them all at once, and quit. LatCom would be someone else’s problem. My newest boss, the fifth in two years, looked at me with disappointment when I gave him my two weeks’ notice.
Now that I work at Cisco again, I can see the case notes: they solved the case, as TAC does. A bug was filed and the problem was fixed. Still, I can tell you how much of a relief it was to turn in my badge and walk out of Cisco for what I wrongly thought would be the last time. I felt, in many ways, like a failure in TAC, but at my going-away party, our top routing protocols engineer scoffed at my choice to leave. “Cisco needs good engineers,” he said. “I could have gotten you any job you wanted here!” True or not, it was a nice comment to hear.
I started writing these TAC tales back in 2013, when I still worked at Juniper. I didn’t expect they’d attract much interest, but they’ve been one of the most consistently popular features of this blog. I’ve cranked out 20 of these covering a number of subjects, but I’m afraid my reservoir of stories is running dry. I’ve decided that number 20 will be the last TAC Tale on my blog. There are plenty of other stories to tell, of course, but I’m finished with TAC, as I was back in 2007. My two years in TAC were some of the hardest in my career, but also incredibly rewarding. I have so much respect for my fellow TAC engineers, past, present, and future, who take on these complex problems without fear, and find answers for our customers.
5 Comments
I loved reading this tale, Jeff. I learned some new things from this article and laughed at the Spanish translator trying to translate “there was a CEF bug causing bad data to be placed into the queue on the RP”.
You are an inspiration to all of us, Jeff!!
Ah, the CEF bugs! Thanks for that painful memory! The DDTS that I filed had been duped, and duped, and duped again, and even resolved and unresolved many times. I moved out of TAC into business operations from 2007 to 2019 and got calls on that DDTS probably all the way to the end.
Thanks for stopping by Brent! CEF was, of course, a great invention but also a source of quite a few headaches back then. Thankfully I don’t get calls on any of my old bugs now 🙂
I remember this Latam customer very clearly. Nice folks with an often overloaded network! I also remember all the issues they initially and constantly had with the 7513 back pressure and the infamous CEF bugs… man, those were the days! They had their own share of stress but were so fun and full of learning.
I distinctly remember leaving on a Friday afternoon after you had taken your second Friday @ 5:55PM P1 in a row from them… I felt so bad and you were none too happy about it. As I recall you visited them too once upon a time. Thanks for stopping by Edgar!