crash

All posts tagged crash

I’ve mentioned before that, despite being on the Routing Protocols team, I spent a lot of time handling crash cases in TAC.  At the time, my queue was just a dumping ground for cases that didn’t fit into any other bucket in the High Touch structure.  Backbone TAC had a much more granular division of teams, including a team entirely dedicated to crash.  But in HTTS, we did it all.

Some crashes are minor, like a (back then) 2600-series router reloading due to a bus error.  Some were catastrophic, particularly crashes on large chassis-type routing systems in service provider networks.  These could have hundreds of interfaces, and with sub-interfaces, potentially thousands of customers affected by a single outage.  Chassis platforms vary in their architecture, but many of the platforms we ran at the time used a distributed architecture in which the individual line cards ran a subset of IOS.  Thus, unlike a 2600 which had “dumb” WIC cards for interface connections, on chassis systems line cards themselves could crash in addition to the route processors.  Oftentimes, when a line card crashed, the effect would cascade through the box, with multiple line cards crashing, which would result in a massive meltdown.

The 7500 was particularly prone to these.  A workhorse of Cisco’s early product line, the 7500 line cards ran IOS but forwarded packets between each other by placing them into special queues on the route processor.  This was quite unlike later products, such as the Gigabit Switch Router (GSR), which had a fabric architecture enabling line cards to communicate directly.  On the 7500, oftentimes a line card having a problem would write bad data into the shared queues, which the subsequent line cards would read and then crash, causing a cascading failure.

One of our big customers, a Latin American telecommunications company I’ll call LatCom, was a heavy user of 7500’s.  They were a constant source of painful cases, and for some reason had a habit of opening P1 cases on Fridays at 5:55pm.  Back then HTTS day-shift engineers’ shifts ended at 6pm, at which point the night shift took over, but once we accepted a P1 or P2 case, unlike backbone TAC, we had to work it until resolution.  LatCom drove us nuts.  Five minutes was the difference between going home for the weekend and potentially being stuck on the phone until 10pm on a Friday night.  The fact that LatCom’s engineers barely spoke English also proved a challenge and drew out the cases–occasionally we had to work through non-technical translators, and getting them to render “there was a CEF bug causing bad data to be placed into the queue on the RP” into Spanish was problematic.

After years of nightmare 7500 crashes, LatCom finally did what we asked:  they dropped a lot of money to upgrade their routers to GSRs with PRPs, at that time our most modern box.  All the HTTS RP engineers breathed a sigh of relief knowing that the days of nightmare cascading line card failures on 7500’s were coming to an end.  We never had a seen a single case of such a failure on a GSR.

That said, we knew that if anything bad was going to happen, it would happen to these guys.  And sure enough, one day I got a case with…you guessed it, a massive cascading line card failure on a GSR!  The first one I had seen.  In the case notes I described the failure as follows:

  1. Six POS (Packet over Sonet) interfaces went down at once
  2. Fifteen seconds later, slots 1 and 15 started showing CPUHOG messages followed by tracebacks
  3. Everything stabilized until a few hours later, when the POS interfaces go down again
  4. Then, line cards in slots 0, 9, 10, 11, and 13 crashed
  5. Fifteen seconds later, line cards in slots 6 and 2 crash
  6. And so forth

My notes said: “basically we had a meltdown of the box.”  To make matters worse, 4 days later they had an identical crash on another GSR!

When faced with a this sort of mess, TAC agents usually would send the details to an internal mailer, which is exactly what I did.  The usual attempt by some on the mailer to throw hardware at the problem didn’t go far as we saw the exact same crash on another router.  This seemed to be a CEF bug.

Re-reading the rather extensive case notes bring up a lot of pain.  Because the customer had just spent millions of dollars to replace their routers with a new platform that, we assured them, would not be susceptible to the same problem, this went all the way to their top execs and ours.  We were under tremendous pressure to find a solution, and frankly, we all felt bad because we were sure the new platform would be an end to their problems.

There are several ways for a TAC engineer to get rid of a case:  resolve the problem, tell the customer it is not reproducible, wait for it to get re-queued to another engineer.  But after two long years at TAC, two years of constant pressure, a relentless stream of cases, angry customers, and problem after problem, my “dream job” at Cisco was taking a toll.  When my old friend Mike, who had hired me at the San Francisco Chronicle, my first network engineering job, called me and asked me to join him at a gold partner, the call wasn’t hard to make.  And so I took the easiest route to getting rid of cases, a lot of them all at once, and quit.  LatCom would be someone else’s problem.  My newest boss, the fifth in two years, looked at me with disappointment when I gave him my two weeks notice.

I can see the case notes now that I work at Cisco again, and they solved the case, as TAC does.  A bug was filed and the problem fixed.  Still, I can tell you how much of a relief it was to turn in my badge and walk out of Cisco for what I wrongly thought would be the last time.  I felt, in many ways, like a failure in TAC, but at my going away party, our top routing protocols engineer scoffed at my choice to leave.  “Cisco needs good engineers,” he said.  “I could have gotten you any job you wanted here!”  True or not, it was a nice comment to hear.

I started writing these TAC tales back in 2013, when I still worked at Juniper.  I didn’t expect they’d attract much interest, but they’ve been one of the most consistently popular features of this blog. I’ve cranked out 20 of these covering a number of subjects, but I’m afraid my reservoir of stories is running dry.  I’ve decided that number 20 will be the last TAC Tale on my blog.  There are plenty of other stories to tell, of course, but I’m finished with TAC, as I was back in 2007.  My two years in TAC were some of the hardest in my career, but also incredibly rewarding.  I have so much respect for my fellow TAC engineers, past, present, and future, who take on these complex problems without fear, and find answers for our customers.

 

The case came into the routing protocols queue, even though it was simply a line card crash.  The RP queue in HTTS was the dumping ground for anything that did not fit into one of the few other specialized queues we had.  A large US service provider had a Packet over SONET (PoS) line card on a GSR 12000-series router crashing over and over again.

Problem Details: 8 Port ISE Packet Over SONET card continually crashing due to

SLOT 2:Aug  3 03:58:31: %EE48-3-ALPHAERR: TX ALPHA: error: cpu int 1 mask 277FFFFF
SLOT 2:Aug  3 03:58:31: %EE48-4-GULF_TX_SRAM_ERROR: ASIC GULF: TX bad packet header detected. Details=0x4000

A previous engineer had the case, and he did what a lot of TAC engineers do when faced with an inexplicable problem:  he RMA’d the line card.  As I have said before, RMA is the default option for many TAC engineers, and it’s not a bad one.  Hardware errors are frequent and replacing hardware often is a quick route to solving the problem.  Unfortunately the RMA did not fix the problem, the case got requeued to another engineer, and he…RMA’d the line card.  Again.  When that didn’t work, he had them try the card in a different slot, but it continued to generate errors and crash.

The case bounced through two other engineers before getting to me.  Too bad the RMA option was out.  But the simple line card crash and error got even weirder.  The customer had two GSR routers in two different cities that were crashing with the same error.  Even stranger:  the crash was happening at precisely the same time in both cities, down to the second.  It couldn’t be a coincidence, because each crash on the first router was mirrored by a crash at exactly the same time on the second.

The conversation with my fellow engineers ranged from plausible to ludicrous.  There was a legend in TAC, true or not, that solar flares cause parity errors in memory and hence crashes.  Could a solar flare be triggering the same error on both line cards at the same time?  Some of my colleagues thought it was likely, but I thought it was silly.

Meanwhile, internal emails were going back and forth with the business unit to figure out what the errors meant.  Even for experienced network engineers, Cisco internal emails can read like a foreign language.  “The ALPHA errors are side-effects the GULF errors,” one development engineer commented, not so helpfully.  “Engine is feeding invalid packets to GULF and that causes the bad header error being detected on GULF,” another replied, only slightly more helpfully.

The customer, meanwhile, had identified a faulty fabric card on a Juniper router in their core.  Apparently the router was sending malformed packets to multiple provider edge (PE) routers all at once, which explained the simultaneous crashing.  Because all the PEs were in the US, forwarding was a matter of milliseconds, and thus there was very little variation in the timing.  How did the packets manage to traverse the several hops of the provider network without crashing any GSRs in between?  Well, the customer was using MPLS, and the corruption was in the IP header of the packets.  The intermediate hops forwarded the packets, without ever looking at the IP header, to the edge of the network, where the MPLS labels get stripped, and IP forwarding kicks in.  It was at that point that the line card crashed due to the faulty IP headers.  That said, when a line card receives a bad packet, it should drop it, not crash.  We had a bug.

The development engineers could not determine why the line card was crashing based on log info.  By this time, the customer had already replaced the faulty Juniper module and the network was stable.  The DEs wanted us to re-introduce the faulty line card into the core, and load up an engineering special debug image on the GSRs to capture the faulty packet.  This is often where we have a gulf, pun intended, between engineering and TAC.  No major service provider or customer wants to let Cisco engineering experiment on their network.  The customer decided to let it go.  If it came back, at least we could try to blame the issue on sunspots.

No customer is happy if they have to reboot one of their Internet-facing routers periodically, and this was one of our biggest customers.  (At HTTS, they were all big customers.)  This customer had a GSR connecting to the Internet, with partial BGP routes, and he kept getting this error:

%RP-3-ENCAP: Failure to allocate encap table entry, exceeded max number of entries, slot 2

Eventually the router would stop passing traffic and when this happened, he had to reload it.  Needless to say, he wasn’t happy.

The error came with a traceback, which shows what functions the code was executing when the error was generated.  The last function was this:

arp_background(0x5053d290)+0x140

Well, this was obviously some sort of ARP issue.  But why was ARP causing the router to stop forwarding traffic?

Looking up the error, I found that it meant the route processor was unable to allocate a rewrite entry for the slot 2 line card.  As a packet leaves the fabric of a large router like the GSR, the headers are re-written with the destination layer 2 info.  The rewrite table used for this was full.  I had the customer run a hidden command a few times, and we could see the table entries incrementing quickly:

Adjacency Table has 3167 adjacencies

Adjacency Table has 3291 adjacencies

Adjacency Table has 3322 adjacencies

Adjacency Table has 3410 adjacencies

Scrolling through the config, I looked for something that could be the culprit.  Then I saw it.  I remembered a router architecture course I had to take when I first became a TAC agent.  One of the escalation engineers told the story of his first P1 case.  It was a router that kept needing a reload.  He went to another senior escalation engineer, and after looking at the config she said to him, “What are you a f*cking idiot?”  He was quite shocked to be addressed in this manner.  “There is a static route pointed to a broadcast interface!”  she yelled, and then proceeded to chew him out for wasting her time.  This lady was famous in TAC for using bad language in nearly every sentence, and our trainer was able to laugh about it in retrospect.  “Now that I know her I don’t even care when she talks to me like that,” he reported.

Well, I wasn’t going to be called anything like that.  I looked in the config and found this:

ip route 0.0.0.0 0.0.0.0 GigabitEthernet2/0 100

A default route, pointed out a broadcast interface.  With partial BGP routes, this meant that the router was generating an ARP entry for every single destination address on the Internet that was not in the partial BGP table.  Whoops.  There are millions of destinations on the Internet, so it’s no surprise he was filling the capacity on the re-write table on his line card.

He removed the route and replaced it with a static route to the next hop.  The adjacency table immediately dropped below 100.  Problem solved.

Some TAC cases were mind-bogglingly difficult, involving multiple layers of help from engineering, hours in the lab, and major frustration.  Some, like this one, are major problems with major customers that end quickly and easily.  I closed the case with this note:

Customer was seeing RP-3-ENCAP error messages on one of his GSR LC’s. The card would eventually stop passing traffic, requiring reload of the router. Customer had a static default route to the Internet pointed out a broadcast interface–this was causing the router to ARP out that interface and create CEF adjacencies for each destination on the Internet. This was overloading the rewrite table on the LC. Customer removed static route, pointed to next hop address instead. Rewrite table entries went back to normal.