TAC Tales #20: Crash, burn, and exit

I’ve mentioned before that, despite being on the Routing Protocols team, I spent a lot of time handling crash cases in TAC.  At the time, my queue was just a dumping ground for cases that didn’t fit into any other bucket in the High Touch structure.  Backbone TAC had a much more granular division of teams, including a team entirely dedicated to crash.  But in HTTS, we did it all.

Some crashes were minor, like a (back then) 2600-series router reloading due to a bus error.  Some were catastrophic, particularly crashes on large chassis-type routing systems in service provider networks.  These could have hundreds of interfaces, and with sub-interfaces, potentially thousands of customers affected by a single outage.  Chassis platforms vary in their architecture, but many of the platforms we ran at the time used a distributed design in which the individual line cards ran a subset of IOS.  Thus, unlike a 2600, which had “dumb” WIC cards for interface connections, on chassis systems the line cards themselves could crash in addition to the route processors.  Oftentimes, when one line card crashed, the effect would cascade through the box, taking down more line cards and ending in a massive meltdown.

The 7500 was particularly prone to these.  A workhorse of Cisco’s early product line, the 7500 had line cards that ran IOS but forwarded packets between each other by placing them into special queues on the route processor.  This was quite unlike later products, such as the Gigabit Switch Router (GSR), which had a fabric architecture enabling line cards to communicate directly.  On the 7500, a misbehaving line card would often write bad data into the shared queues, which other line cards would then read and crash on, causing a cascading failure.
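
To make that failure mode concrete, here is a toy sketch in plain Python (nothing to do with actual IOS internals; the queue layout and packet format are invented purely for illustration) of why one sick card on a shared-queue design could take its neighbors down with it:

from collections import deque

# Toy model of shared-queue forwarding -- purely illustrative, not IOS code.
# Line cards hand packets to each other through queues on the RP, so every
# card trusts whatever the previous card wrote there.

class LineCardCrash(Exception):
    pass

def drain(shared_queue, slot):
    # A line card works through the shared queue, packet by packet.
    while shared_queue:
        pkt = shared_queue[0]
        if pkt.get("corrupt"):
            # The card dies before it can dequeue the entry, so the garbage
            # stays in the queue for the next card to trip over.
            raise LineCardCrash("slot %d crashed parsing %r" % (slot, pkt))
        shared_queue.popleft()
        # ...normal forwarding would happen here...

rp_queue = deque([
    {"dst": "10.0.0.1"},
    {"dst": None, "corrupt": True},   # a sick card wrote garbage here
    {"dst": "10.0.0.2"},
])

for slot in (0, 9, 10):               # each card reads the same shared state
    try:
        drain(rp_queue, slot)
    except LineCardCrash as crash:
        print(crash)                   # one bad entry, several victims

The point of the sketch is simply that the shared queues were a single point of trust: any card that wrote garbage could crash every card that read after it.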

One of our big customers, a Latin American telecommunications company I’ll call LatCom, was a heavy user of 7500s.  They were a constant source of painful cases, and for some reason had a habit of opening P1 cases on Fridays at 5:55pm.  Back then HTTS day-shift engineers’ shifts ended at 6pm, at which point the night shift took over, but once we accepted a P1 or P2 case, unlike backbone TAC, we had to work it until resolution.  LatCom drove us nuts.  Five minutes was the difference between going home for the weekend and potentially being stuck on the phone until 10pm on a Friday night.  The fact that LatCom’s engineers barely spoke English also proved a challenge and drew out the cases; occasionally we had to work through non-technical translators, and getting them to render “there was a CEF bug causing bad data to be placed into the queue on the RP” into Spanish was problematic.

After years of nightmare 7500 crashes, LatCom finally did what we asked:  they dropped a lot of money to upgrade their routers to GSRs with PRPs, at that time our most modern box.  All the HTTS RP engineers breathed a sigh of relief knowing that the days of nightmare cascading line card failures on 7500s were coming to an end.  We had never seen a single case of such a failure on a GSR.

That said, we knew that if anything bad was going to happen, it would happen to these guys.  And sure enough, one day I got a case with…you guessed it, a massive cascading line card failure on a GSR!  The first one I had seen.  In the case notes I described the failure as follows:

  1. Six POS (Packet over Sonet) interfaces went down at once
  2. Fifteen seconds later, slots 1 and 15 started showing CPUHOG messages followed by tracebacks
  3. Everything stabilized until a few hours later, when the POS interfaces went down again
  4. Then, line cards in slots 0, 9, 10, 11, and 13 crashed
  5. Fifteen seconds later, line cards in slots 6 and 2 crashed
  6. And so forth

My notes said: “basically we had a meltdown of the box.”  To make matters worse, 4 days later they had an identical crash on another GSR!

When faced with this sort of mess, TAC agents usually would send the details to an internal mailer, which is exactly what I did.  The usual attempt by some on the mailer to throw hardware at the problem didn’t go far, since we had seen the exact same crash on another router.  This seemed to be a CEF bug.

Re-reading the rather extensive case notes brings up a lot of pain.  Because the customer had just spent millions of dollars to replace their routers with a new platform that, we assured them, would not be susceptible to the same problem, this went all the way to their top execs and ours.  We were under tremendous pressure to find a solution, and frankly, we all felt bad because we were sure the new platform would be an end to their problems.

There are several ways for a TAC engineer to get rid of a case:  resolve the problem, tell the customer it is not reproducible, or wait for it to get re-queued to another engineer.  But after two long years at TAC, two years of constant pressure, a relentless stream of cases, angry customers, and problem after problem, my “dream job” at Cisco was taking a toll.  When my old friend Mike, who had hired me at the San Francisco Chronicle, my first network engineering job, called and asked me to join him at a gold partner, the decision wasn’t hard to make.  And so I took the easiest route to getting rid of cases, a lot of them all at once, and quit.  LatCom would be someone else’s problem.  My newest boss, the fifth in two years, looked at me with disappointment when I gave him my two weeks’ notice.

Now that I work at Cisco again, I can see the case notes, and they solved the case, as TAC does.  A bug was filed and the problem fixed.  Still, I can tell you how much of a relief it was to turn in my badge and walk out of Cisco for what I wrongly thought would be the last time.  I felt, in many ways, like a failure in TAC, but at my going away party, our top routing protocols engineer scoffed at my choice to leave.  “Cisco needs good engineers,” he said.  “I could have gotten you any job you wanted here!”  True or not, it was a nice comment to hear.

I started writing these TAC tales back in 2013, when I still worked at Juniper.  I didn’t expect they’d attract much interest, but they’ve been one of the most consistently popular features of this blog. I’ve cranked out 20 of these covering a number of subjects, but I’m afraid my reservoir of stories is running dry.  I’ve decided that number 20 will be the last TAC Tale on my blog.  There are plenty of other stories to tell, of course, but I’m finished with TAC, as I was back in 2007.  My two years in TAC were some of the hardest in my career, but also incredibly rewarding.  I have so much respect for my fellow TAC engineers, past, present, and future, who take on these complex problems without fear, and find answers for our customers.


TAC Tales #13: All Zeros

A common approach for TAC engineers and customers working on a tough case is to just “throw hardware at it.”  Sometimes this can be laziness:  why troubleshoot a complex problem when you can send an RMA, swap out a line card, and hope it works?  Other times it’s a legitimate step in a complex process of elimination.  RMA the card and if the problem still happens, well, you’ve eliminated the card as one source of the problem.

Hence, it was not an uncommon event when, one day, I got a P1 case from a major service provider, requeued (reassigned) to me after multiple RMAs.  The customer had a 12000-series GSR, top of the line back then, and was frustrated because ISIS wasn’t working.

“We just upgraded the GRP to a PRP to speed the router up,” he said, “but now it’s taking 4 hours for ISIS to converge.  Why did we pay all this money for a new route processor when it just slowed our box way down?!”

The GSR is a chassis-type router, with multiple line cards carrying ports of different types, a fabric interconnecting them, and a management module (the route processor, or RP) acting as the brains of the device.  The original RP was called the GRP, but Cisco had released an improved version called the PRP.

The GSR 12000-series

The customer seemed to think the new PRP had performance issues, but this didn’t make sense.  Performance issues might cause some small delays or possibly packet loss for packets destined to the RP, but not delays of four hours.  Something else was amiss.  I asked the customer to send me the ISIS database, and it was full of LSPs like this:

#sh isis database

IS-IS Level-2 Link State Database
LSPID                 LSP Seq Num  LSP Checksum  LSP Holdtime      ATT/P/OL
0651.8412.7001.00-00  0x00000000   0x0000        193               0/0/0

ISIS routers periodically send CSNPs, or Complete Sequence Number PDUs, which contain a list of all the link state packets (LSPs) in the router’s database.  In this case, the GSR was directly attached to a Juniper router, which was its sole ISIS adjacency.  It was receiving the entire ISIS database from this router.  Normally an ISIS database entry looks like this:

#sh isis database

IS-IS Level-2 Link State Database
LSPID                 LSP Seq Num  LSP Checksum  LSP Holdtime      ATT/P/OL
bb1-sjc.00-00         0x0000041E   0xF97D        65365             0/0/0

Note that instead of a router ID, we actually have a router name.  Note also that we have a sequence number and a checksum for each LSP.  As the earlier output shows, something was wrong with the LSPs we were receiving.  Not only was the name not resolving, but the sequence number and checksum were zero.  How can we possibly have an LSP with no sequence number at all?

Even weirder was that as I refreshed the ISIS outputs, the LSPs started resolving, suddenly popping up with names and non-zero sequence numbers and checksums.  I stayed on the phone with the customer for several hours until finally every LSP had resolved and the customer had full reachability.  “Don’t do anything to the router until I get back to you,” I said before hanging up.  If only he had listened.

I was about to pack up for the day when I got a call from our hotline.  The customer had called in and escalated to a P1 after reloading the router.  The entire link state database was zeroed out again, and the network was down.  He only had a short maintenance window in which to work, and now he had an outage.  It was 6pm.  I knew I wasn’t going home for a while.

Whatever was happening was well beyond my ISIS expertise.  Even in the routing protocols team, it was hard to find deep knowledge of ISIS.  I needed an expert, and Abe Martey, who sat across from me, literally wrote the book on ISIS.  The Cisco Press book, that is.  The only issue:  Abe had decided to take PTO that week.  Of course.  I pinged a protocols escalation engineer, one of our best BGP guys.  He didn’t want anything to do with it.  Finally I reached out to the duty manager and asked for help.  I also emailed our internal mailers for ISIS, but after 6pm I wasn’t too optimistic.

Why were we seeing what appeared to be invalid LSPs?  How could an LSP even have a zero checksum or sequence number?  Why did they seem to clear out, and why so slowly?  Did the upgrade to the PRP have anything to do with it?  Was it hardware?  A bug?  As a TAC engineer, you have to consider every single possibility, from A to Z.

The duty manager finally got Sanjeev, an “ISIS expert” from Australia, on the call.  The customer may not realize this while a case is being handled, but if it’s complex and high priority, there is often a flurry of instant messaging going on behind the scenes.  We had a chat room up, and as the “expert” listened to the description of the problem and looked at the notes, he typed in the window:  “This is way over my head.”  Great, so much for expertise.  Our conversation with the customer was getting heated as his frustration with the lack of progress escalated.  The so-called expert asked him to run a command that another TAC engineer had suggested.

“Fantastic,” said the customer, “Sanjeev wants us to run a command.  Sanjeev, tell us, why do you want to run this command?  What’s it going to do?”

“Uh, I’m not sure,” said Sanjeev, “I’ll have to get back to you on that.”

Not a good answer.

By 8:30pm we also had a senior routing protocols engineer in the chat window.  He seemed to think it was a hardware issue and was scraping the error counters on the line cards.  The dedicated Advanced Services NCE for the account also signed on and was looking at the errors.  It’s a painful feeling knowing you and the customer are stranded, but we honestly had no idea what to do.  Because the other end of the problem was a Juniper router, JTAC came on board as well.  We may have been competitors, but we were professionals and put that aside to help the customer as best we could.

Looking at the chat transcript, which I saved, is painful.  One person suggests physically cleaning the fiber connection.  Another thinks it’s memory corruption.  Another believes it is packet corruption.  We schedule a circuit test with the customer to look for transmission errors.

All the while, the 0x0000 LSPs were re-populating with legitimate information until, by 9pm, the ISIS database was fully converged and routing was working again.  “This time,” I said, “DO NOT touch the router.”  The customer agreed.  I headed home at 9:12pm, secretly hoping they would reload the router so the case would get requeued to night shift and taken off my hands.

In the morning we got on our scheduled update call with the customer.  I was tired, and not happy to make the call.  We had gotten nowhere during the night and had received no helpful responses to our emails.  I wasn’t sure what I was going to say.  I was surprised to hear the customer in a chipper mood.  “I’m happy to report Juniper has reproduced the problem in their lab and has identified the problem.”

There was a little bit of wounded pride knowing they found the fix before we did, but also a sense of relief to know I could close the case.

It turns out that the customer, around the same time they installed the PRP, had attempted to normalize the configs between the Juniper and Cisco devices.  They had mistakenly configured a timer called the “LSP pacing interval” on the Juniper side.  This controls the rate at which the Juniper box sends out LSPs.  They had thought they were configuring the same timer as the LSP refresh interval on the Cisco side, but they were two different things.  By cranking it way up, they ensured that the hundreds of LSPs in the database would trickle in, taking hours to converge.
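
The arithmetic behind that is simple but brutal.  Here is a back-of-the-envelope sketch in Python; the LSP count and pacing interval are hypothetical numbers chosen only to show the scale, not the actual values from the case:

# Hypothetical numbers -- not the actual values from the case.
lsp_count = 500          # LSPs in a mid-sized ISIS level-2 database
pacing_seconds = 30      # one LSP released every 30 seconds after the mistake

hours = lsp_count * pacing_seconds / 3600
print("full database floods in about %.1f hours" % hours)   # ~4.2 hours

Pace the flooding gently enough and a database that should synchronize in seconds takes the better part of a shift to converge.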

Why the 0x0000 entries then?  It turns out that in the initial exchange, the ISIS routers share with each other what LSPs they have, without sending the full LSP.  Thus, in Cisco ISIS databases, the 0x0000 entry acts as a placeholder until complete LSP data is received.  Normally this period is short and you don’t see the entry.  We probably would have found the person who knew that eventually, but we didn’t find him that night and our database of cases, newsgroup postings, and bugs turned up nothing to point us in the right direction.
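
Here is a rough sketch of that placeholder behavior, in plain Python rather than anything resembling real IOS internals; the data structure and field names are invented for illustration:

from dataclasses import dataclass

@dataclass
class IsisDbEntry:
    lspid: str
    seq: int = 0x00000000     # placeholder values until the real LSP arrives
    checksum: int = 0x0000

database = {}

def on_csnp(lsp_ids):
    # A CSNP tells us which LSPs the neighbor has, without the LSP contents;
    # unknown LSPIDs get a placeholder entry so we know to request them.
    for lspid in lsp_ids:
        database.setdefault(lspid, IsisDbEntry(lspid))

def on_lsp(lspid, seq, checksum):
    # The full LSP arrives later and fills in the real values.
    database[lspid] = IsisDbEntry(lspid, seq, checksum)

on_csnp(["0651.8412.7001.00-00"])          # learned from the neighbor's CSNP
print(database["0651.8412.7001.00-00"])    # seq 0x0, checksum 0x0: the ghost entry
on_lsp("0651.8412.7001.00-00", 0x0000041E, 0xF97D)
print(database["0651.8412.7001.00-00"])    # now a normal-looking entry

Slow the flooding to a crawl, as the pacing misconfiguration did, and those placeholders hang around long enough to look like corruption.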

I touched a couple thousand cases in my time at TAC, but this one I remember even 10 years later because of the seeming complexity, the simplicity of the resolution, the weirdness of the symptoms, and distractors like the PRP upgrade.  Often a major outage sends you in a lot of directions and down many rat holes.  I don’t think we could have done much differently, since the config error was totally invisible to us.  Anyway, if Juniper and Cisco can work together to solve a customer issue, maybe we should have hope for world peace.