

I’ve mentioned before that, despite being on the Routing Protocols team, I spent a lot of time handling crash cases in TAC.  At the time, my queue was just a dumping ground for cases that didn’t fit into any other bucket in the High Touch structure.  Backbone TAC had a much more granular division of teams, including a team entirely dedicated to crash.  But in HTTS, we did it all.

Some crashes were minor, like a (back then) 2600-series router reloading due to a bus error.  Others were catastrophic, particularly crashes on large chassis-type routing systems in service provider networks.  These could have hundreds of interfaces, and with sub-interfaces, potentially thousands of customers affected by a single outage.  Chassis platforms vary in their architecture, but many of the platforms we ran at the time used a distributed architecture in which the individual line cards ran a subset of IOS.  Thus, unlike a 2600, which had “dumb” WIC cards for interface connections, on chassis systems the line cards themselves could crash in addition to the route processors.  Oftentimes, when a line card crashed, the effect would cascade through the box: multiple line cards would crash in turn, resulting in a massive meltdown.

The 7500 was particularly prone to these.  A workhorse of Cisco’s early product line, the 7500 had line cards that ran IOS but forwarded packets between each other by placing them into special queues on the route processor.  This was quite unlike later products, such as the Gigabit Switch Router (GSR), which had a fabric architecture enabling line cards to communicate directly.  On the 7500, a line card having a problem would often write bad data into the shared queues, which other line cards would then read and crash on, causing a cascading failure.

One of our big customers, a Latin American telecommunications company I’ll call LatCom, was a heavy user of 7500’s.  They were a constant source of painful cases, and for some reason had a habit of opening P1 cases on Fridays at 5:55pm.  Back then HTTS day-shift engineers’ shifts ended at 6pm, at which point the night shift took over, but unlike backbone TAC, once we accepted a P1 or P2 case we had to work it until resolution.  LatCom drove us nuts.  Five minutes was the difference between going home for the weekend and potentially being stuck on the phone until 10pm on a Friday night.  The fact that LatCom’s engineers barely spoke English also proved a challenge and drew out the cases–occasionally we had to work through non-technical translators, and getting them to render “there was a CEF bug causing bad data to be placed into the queue on the RP” into Spanish was problematic.

After years of nightmare 7500 crashes, LatCom finally did what we asked:  they dropped a lot of money to upgrade their routers to GSRs with PRPs, at that time our most modern box.  All the HTTS RP engineers breathed a sigh of relief knowing that the days of nightmare cascading line card failures on 7500’s were coming to an end.  We had never seen a single case of such a failure on a GSR.

That said, we knew that if anything bad was going to happen, it would happen to these guys.  And sure enough, one day I got a case with…you guessed it, a massive cascading line card failure on a GSR!  The first one I had seen.  In the case notes I described the failure as follows:

  1. Six POS (Packet over SONET) interfaces went down at once
  2. Fifteen seconds later, slots 1 and 15 started showing CPUHOG messages followed by tracebacks
  3. Everything stabilized until a few hours later, when the POS interfaces went down again
  4. Then, line cards in slots 0, 9, 10, 11, and 13 crashed
  5. Fifteen seconds later, line cards in slots 6 and 2 crashed
  6. And so forth

My notes said: “basically we had a meltdown of the box.”  To make matters worse, 4 days later they had an identical crash on another GSR!

When faced with this sort of mess, TAC agents would usually send the details to an internal mailer, which is exactly what I did.  The usual attempt by some on the mailer to throw hardware at the problem didn’t go far, since we had seen the exact same crash on another router.  This seemed to be a CEF bug.

Re-reading the rather extensive case notes brings up a lot of pain.  Because the customer had just spent millions of dollars to replace their routers with a new platform that, we assured them, would not be susceptible to the same problem, this went all the way to their top execs and ours.  We were under tremendous pressure to find a solution, and frankly, we all felt bad because we were sure the new platform would be an end to their problems.

There are several ways for a TAC engineer to get rid of a case:  resolve the problem, tell the customer it is not reproducible, or wait for it to get re-queued to another engineer.  But after two long years at TAC, two years of constant pressure, a relentless stream of cases, angry customers, and problem after problem, my “dream job” at Cisco was taking a toll.  When my old friend Mike, who had hired me at the San Francisco Chronicle, my first network engineering job, called me and asked me to join him at a gold partner, the call wasn’t hard to make.  And so I took the easiest route to getting rid of cases, a lot of them all at once, and quit.  LatCom would be someone else’s problem.  My newest boss, the fifth in two years, looked at me with disappointment when I gave him my two weeks’ notice.

I can see the case notes now that I work at Cisco again, and they solved the case, as TAC does.  A bug was filed and the problem fixed.  Still, I can tell you how much of a relief it was to turn in my badge and walk out of Cisco for what I wrongly thought would be the last time.  I felt, in many ways, like a failure in TAC, but at my going away party, our top routing protocols engineer scoffed at my choice to leave.  “Cisco needs good engineers,” he said.  “I could have gotten you any job you wanted here!”  True or not, it was a nice comment to hear.

I started writing these TAC tales back in 2013, when I still worked at Juniper.  I didn’t expect they’d attract much interest, but they’ve been one of the most consistently popular features of this blog. I’ve cranked out 20 of these covering a number of subjects, but I’m afraid my reservoir of stories is running dry.  I’ve decided that number 20 will be the last TAC Tale on my blog.  There are plenty of other stories to tell, of course, but I’m finished with TAC, as I was back in 2007.  My two years in TAC were some of the hardest in my career, but also incredibly rewarding.  I have so much respect for my fellow TAC engineers, past, present, and future, who take on these complex problems without fear, and find answers for our customers.

 

This one falls into the category of, “I probably shouldn’t post this, especially now that I’m at Cisco again,” but what the heck.

I’ve often mentioned, in this series, the different practices of “backbone TAC” (or WW-TAC) and High Touch Technical Support (HTTS), the group I was a part of.  WW-TAC was the larger TAC organization, where the vast majority of the cases landed.  HTTS was (and still is) a specialized TAC group dedicated to Cisco’s biggest customers, who generally pay for the additional service.  HTTS was supposed to provide a deeper knowledge of the specifics of customer networks and practices, but generally worked the same as TAC.  We had our own queues, and when a high-touch customer would open a case, Cisco’s entitlement tool would automatically route their case to HTTS based on the contract number.

Unlike WW-TAC, HTTS did not use the “follow the sun” model.  Under follow the sun, regular TAC cases would be picked up by a region where it was currently daytime, and when a TAC agent’s shift ended, they would find another agent in the next timezone over to pick up a live (P1/P2) case.  At HTTS, we had US-based employees only at the time, and we had to work P1/P2 cases to resolution.  This meant if your shift ended at 6pm, and a P1 case came in at 5:55, you might be stuck in the office for hours until you resolved it.  We did have a US-based night shift that came on at 6pm, but they only accepted new cases–we couldn’t hand off a live one to the night shift.

Weekends were covered by a model I hated, called “BIC”.  I asked my boss what it stood for and he explained it was either “Butt In Chair” or “Bullet In the Chamber.”  The HTTS managers would publish a schedule (quarterly if I recall) assigning each engineer one or two 6 hour shifts during the weekends of that quarter.  During those 6 hours, we had to be online and taking cases.

Why did I hate it?  First, I hated working weekends, of course.  Second, the caseload was high.  A normal day on my queue might see 4 cases per engineer, but on BIC you typically took seven or eight.  Third, you had to take cases on every topic.  During the week, only a voice engineer would pick up a voice case.  But on BIC, I, a routing protocols engineer, might pick up a voice case, a firewall case, a switching case…or whatever.  Fourth, because BIC took place on a weekend, normal escalation channels were not available.  If you had a major P1 outage, you couldn’t get help easily.

Remember that a lot of the cases you accepted took weeks or even months to resolve.  Part of a TAC engineer’s day is working his backlog of cases:  researching, working in the lab to recreate a problem, talking to engineering, etc., all to resolve these cases.  When you picked up seven cases on a weekend, you were slammed for weeks after that.

We did get paid extra for BIC, although I don’t remember how much.  It was hundreds of dollars per shift, if I recall.  Because of this, a number of engineers loaded up on BIC shifts and earned thousands of dollars per quarter.  Thankfully, this meant there were plenty of willing recipients when I wanted to give away my shifts, which I did almost always.  (I worked two during my two years at TAC.)  However, sometimes I could not find anyone to take my shift, and in that case I would actually pay to offload it, offering an additional hundred dollars of my own if someone would take the shift.  That’s how much I hated BIC.  Of course, this was done without the company knowing about it, as I’m sure they wouldn’t have approved of me paying someone else to do my work!

We had one CSE on our team, I’ll call him Omar, who loaded up on BICs.  Then he would come into his week so overloaded with cases from the weekend that he would hardly take a case during the week.  We’d all get burdened with extra load because Omar was off working his weekend cases.  Finally, as team lead, I called him out on it in our group chat and Omar blew up on me.  Well, I was right of course but I had to let it go.

I don’t know if HTTS still does BIC, although I suspect it’s gone away.  I still work almost every weekend I have, but it’s to stay on top of work rather than taking on more.

The case came into the routing protocols queue, even though it was simply a line card crash.  The RP queue in HTTS was the dumping ground for anything that did not fit into one of the few other specialized queues we had.  A large US service provider had a Packet over SONET (PoS) line card on a GSR 12000-series router crashing over and over again.

Problem Details: 8 Port ISE Packet Over SONET card continually crashing due to

SLOT 2:Aug  3 03:58:31: %EE48-3-ALPHAERR: TX ALPHA: error: cpu int 1 mask 277FFFFF
SLOT 2:Aug  3 03:58:31: %EE48-4-GULF_TX_SRAM_ERROR: ASIC GULF: TX bad packet header detected. Details=0x4000

A previous engineer had the case, and he did what a lot of TAC engineers do when faced with an inexplicable problem:  he RMA’d the line card.  As I have said before, RMA is the default option for many TAC engineers, and it’s not a bad one.  Hardware errors are frequent and replacing hardware often is a quick route to solving the problem.  Unfortunately the RMA did not fix the problem, the case got requeued to another engineer, and he…RMA’d the line card.  Again.  When that didn’t work, he had them try the card in a different slot, but it continued to generate errors and crash.

The case bounced through two other engineers before getting to me.  Too bad the RMA option was out.  But the simple line card crash and error got even weirder.  The customer had two GSR routers in two different cities that were crashing with the same error.  Even stranger:  the crash was happening at precisely the same time in both cities, down to the second.  It couldn’t be a coincidence, because each crash on the first router was mirrored by a crash at exactly the same time on the second.

The conversation with my fellow engineers ranged from plausible to ludicrous.  There was a legend in TAC, true or not, that solar flares cause parity errors in memory and hence crashes.  Could a solar flare be triggering the same error on both line cards at the same time?  Some of my colleagues thought it was likely, but I thought it was silly.

Meanwhile, internal emails were going back and forth with the business unit to figure out what the errors meant.  Even for experienced network engineers, Cisco internal emails can read like a foreign language.  “The ALPHA errors are side-effects of the GULF errors,” one development engineer commented, not so helpfully.  “Engine is feeding invalid packets to GULF and that causes the bad header error being detected on GULF,” another replied, only slightly more helpfully.

The customer, meanwhile, had identified a faulty fabric card on a Juniper router in their core.  Apparently the router was sending malformed packets to multiple provider edge (PE) routers all at once, which explained the simultaneous crashing.  Because all the PEs were in the US, forwarding was a matter of milliseconds, and thus there was very little variation in the timing.  How did the packets manage to traverse the several hops of the provider network without crashing any GSRs in between?  Well, the customer was using MPLS, and the corruption was in the IP header of the packets.  The intermediate hops forwarded the packets, without ever looking at the IP header, to the edge of the network, where the MPLS labels get stripped, and IP forwarding kicks in.  It was at that point that the line card crashed due to the faulty IP headers.  That said, when a line card receives a bad packet, it should drop it, not crash.  We had a bug.
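For readers who haven’t stared at MPLS encapsulation in a while, the packet layout shows why the corruption could hide.  The label is a 4-byte shim sitting between the layer 2 header and the IP header, and label-switching routers only ever look at that shim:

| L2 header | MPLS shim: 20-bit label, 3-bit EXP, 1-bit bottom-of-stack, 8-bit TTL | IP header | payload |

The P routers in the middle swap the 20-bit label and move on; the damaged IP header just rides along until the label is popped at (or one hop before) the edge, and only then does an IP lookup force the line card to parse it.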

The development engineers could not determine why the line card was crashing based on log info.  By this time, the customer had already replaced the faulty Juniper module and the network was stable.  The DEs wanted us to re-introduce the faulty line card into the core, and load up an engineering special debug image on the GSRs to capture the faulty packet.  This is often where we have a gulf, pun intended, between engineering and TAC.  No major service provider or customer wants to let Cisco engineering experiment on their network.  The customer decided to let it go.  If it came back, at least we could try to blame the issue on sunspots.

When you open a TAC case, how exactly does the customer support engineer (CSE) figure out how to solve the case?  After all, CSEs are not super-human.  Just like any engineer, in TAC you have a range of brilliant to not-so-brilliant, and everything in between.  Let me give an example:  I worked at HTTS, or high-touch TAC, serving customers who paid a premium for higher levels of support.  When a top engineer at AT&T or Verizon opened a case, how was it that I, who had never worked professionally in a service provider environment, was able to help them at all?  Usually when those guys opened a case, it was something quite complex and not a misconfigured route map!

TAC CSEs have an arsenal of tools at their disposal that customers, and even partners, do not.  One of the most powerful is well known to anyone who has ever worked in TAC:  Topic.  Topic is an internal search engine.  It can do more now, but at the time I was in TAC, Topic could search bugs, TAC cases, and internal mailers.  If you had a weird error message or were seeing inexplicable behavior, popping the message or symptoms into Topic frequently resulted in a bug.  Failing that, it might pull up another TAC case, which would show the best troubleshooting steps to take.

Topic also searches internal mailers, the email lists used internally by Cisco employees.  TAC agents, sales people, TMEs, product managers, and engineering all exchange emails on these mailers, which are then archived.  Oftentimes a problem would show up in the mailer archives and engineering had already provided an answer.  Sometimes, if Topic failed, we would post the symptoms to the mailers in hopes engineering, a TME, or any expert would have a suggestion.  I was always careful in doing so, as if you posted something that was already answered, or asked too often, flames would be coming your way.

TAC engineers have the ability to file bugs across the Cisco product portfolio.  This is, of course, a powerful way to get engineering attention.  Customer-found defects are taken very seriously, and any bug that is opened will get a development engineer (DE) assigned to it quickly.  We were judged on the quality of bugs we filed, since TAC does not like to abuse the privilege and waste engineering time.  If a bug is filed for something that is not really a bug, it gets marked “J” for Junk, and you don’t want to have too many junked bugs.  That said, on one or two occasions, when I needed engineering help and the mailers weren’t working, I knowingly filed a Junk bug to get some help from engineering.  Fortunately, I also filed a few real bugs that got fixed.

My team was the “routing protocols” team for HTTS, but we were a dumping ground for all sorts of cases.  RP often got crash cases, cable modem problems, and other issues, even though these weren’t strictly RP.  Even within the technical limits of RP, there is a lot of variety among cases.  Someone who knows EIGRP cold may not have a clue about MPLS.  A lot of times, when stuck on a case, we’d go find the “guy who knows that” and ask for help.  We had a number of cases on Asynchronous Transfer Mode (ATM) when I worked at TAC, an old WAN protocol (more or less).  We had one guy who knew ATM, and his job was basically just to help with ATM cases.  He had a desk at the office but almost never came in, never worked a shift, and frankly I don’t know what he did all day.  But when an ATM case came in, day or night, he was on it, and I was glad we had him, since I knew little about the subject.

Some companies have NOCs with tier 1, 2, and 3 engineers, but we just had CSEs.  While we had different pay grades, TAC engineers were not tiered in HTTS.  “Take the case and get help” was the motto.  Backbone (non-HTTS) TAC had an escalation team, with some high-end CSEs who jumped in on the toughest cases.  HTTS did not, and while backbone TAC didn’t always like us pulling on their resources, at the end of the day we were all about killing cases, and a few times I had backbone escalation engineers up in my cube helping me.

The more heated a case gets, the higher the impact, the longer the time to resolve, the more attention it gets.  TAC duty managers can pull in more CSEs, escalation, engineering, and others to help get a case resolved.  Occasionally, a P1 would come in at 6pm on a Friday and you’d feel really lonely.  But Cisco being Cisco, if they need to put resources on an issue, there are a lot of talented and smart people available.

There’s nothing worse than the sinking feeling a CSE gets when realizing he or she has no clue what to do on a case.  When the Topic searches fail, when escalation engineers are stumped, when the customer is frustrated, you feel helpless.  But eventually, the problem is solved, the case is closed, and you move on to the next one.

I’ve mentioned before that EIGRP SIA was my nightmare case at TAC, but there was one other type of case that I hated–QoS problems.  Routing protocol problems tend to be binary.  Either the route is there or it isn’t;  either the pings go through or they don’t.  Even when a route is flapping, that’s just an extreme version of the binary problem.  QoS is different.  QoS cases often involved traffic that passed fine at certain rates or packet sizes but ran into trouble at others, or traffic that dropped at some seemingly arbitrary rate.  Thus, the routes could be perfectly fine, pings could pass, and yet QoS was behaving incorrectly.

In TAC, we would regularly get cases where the customer claimed traffic was dropping on a QoS policy below the configured rate.  For example, if they configured a policing profile of 1000 Mbps, sometimes the customer would claim the policer was dropping traffic at, say, 800 Mbps.  The standard response for a TAC agent struggling to figure out a QoS policy issue like this was to say that the link was experiencing “microbursting.”  If a link is showing an 800 Mbps traffic rate, this is actually an average rate, meaning the link could be experiencing short bursts above this rate that exceed the policing rate but are averaged out in the interface counters.  “Microbursting” was a standard response to this problem for two reasons:  first, it was most often the problem;  second, it was an easy way to close the case without an extensive investigation.  The second reason is not as lazy as it may sound, as microbursts are common and are usually the cause of these symptoms.
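A quick worked example, with made-up numbers rather than anything from a real case, shows how this plays out.  Interface counters are an average over the load interval (30 seconds or 5 minutes, typically).  If traffic arrives at 1600 Mbps for half of every second and is idle the other half, the counters report a tidy 800 Mbps, yet during every burst the offered rate is 60% over a 1000 Mbps policer and the excess gets dropped.  The usual remedies are to give the policer a generous burst allowance or to shape instead of police.  A minimal sketch in MQC syntax, with illustrative names and numbers (exact syntax varies by platform and release):

policy-map POLICE-GIG
 class class-default
  ! 1 Gbps CIR with a 12,500,000-byte burst bucket, roughly 100 ms of traffic at line rate
  police cir 1000000000 bc 12500000 conform-action transmit exceed-action drop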

Thus, when one of our large service provider customers opened a case stating that their LLQ policy was dropping packets before the configured threshold, I was quick to suspect microbursts.  However, working in high-touch TAC, you learn that your customers aren’t pushovers and don’t always accept the easy answer.  In this case, the customer started pushing back, claiming that the call center which was connected to this circuit generated a constant stream of traffic and that he was not experiencing microbursts.  So much for that.

This being the 2000’s, the customer had four T1’s connected in a single multi-link PPP (MLPPP) bundle.  The LLQ policy was dropping traffic at one quarter of the threshold it was configured for.  Knowing I wouldn’t get much out of a live production network, I reluctantly opened a lab case for the recreate, asking for two routers connected back-to-back with the same line cards, a four-link T1 interconnection, and a traffic generator.  As always, I made sure my lab had exactly the same IOS release as the customer.

Once the lab was set up I started the traffic flowing, and much to my surprise, I saw traffic dropping at one quarter of the configured LLQ policy.  Eureka!  Anyone who has worked in TAC will tell you that more often than not, lab recreates fail to recreate the customer problem.  I removed and re-applied the service policy, and the problem went away.  Uh oh.  The only thing worse than not recreating a problem is recreating it and then losing it again before developers get a chance to look at it.

I spent some time playing with the setup, trying to get the problem back.  Finally, I reloaded the router to start over and, sure enough, I got the traffic loss again.  So, the problem occurred at start-up, but when the policy was removed and re-applied, it corrected itself.  I filed a bug and sent it to engineering.

Because it was so easy to recreate, it didn’t take long to find the answer.  The customer was configuring their QoS policy using bandwidth percentages instead of absolute bandwidth numbers.  This meant that the policy bandwidth would be determined dynamically by the router based on the links it was applied to.  It turned out that IOS was calculating the bandwidth numbers before the MLPPP bundle was fully up, and hence was using only a single T1 as the reference for the calculation instead of all four.  The fix was to change the priority of operations in IOS, so that the MLPPP bundle came up before the QoS policy was applied.
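The arithmetic makes the “one quarter” symptom obvious in hindsight.  A T1 is roughly 1.5 Mbps, so a four-link MLPPP bundle is about 6 Mbps.  A policy along the lines of the sketch below (class and policy names are hypothetical, not the customer’s config) should reserve about 1.5 Mbps for the priority queue when applied to the bundle; if IOS does the percentage math while only one member link is up, it reserves roughly 385 kbps instead, exactly a quarter of what was intended.

class-map match-any VOICE
 match ip dscp ef
!
policy-map CALL-CENTER
 class VOICE
  ! 25% of interface bandwidth: ~1.5 Mbps on the full bundle, ~385 kbps if the
  ! calculation runs against a single T1 before the bundle has come up
  priority percent 25
!
interface Multilink1
 service-policy output CALL-CENTER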

So much for microbursts.  The moral(s) of the story?  First, the most obvious cause is often not the cause at all.  Second, determined customers are often right.  And third:  even intimidating QoS cases can have an easy fix.

I’ve mentioned in previous TAC Tales that I started on a TAC team dedicated to enterprise, which made sense given my background.  Shortly after I came to Cisco the enterprise team was broken up and its staff distributed among the routing protocols team and LAN switch team.  The RP team at that time consisted of service provider experts with little understanding of LAN switching issues, but deep understanding of technologies like BGP and MPLS.  This was back before the Ethernet-everywhere era, and SP experts had never really spent a lot of time with LAN switches.

This created a big problem with case routing.  Anyone who has worked more than 5 minutes in TAC knows that when you have a routing protocol problem, usually it’s not the protocol itself but some underlying layer 2 issue.  This is particularly the case when adjacencies are resetting.  The call center would see “OSPF adjacencies resetting” and immediately send the case to the protocols team, when in fact the issue was with STP or perhaps a faulty link.  With all enterprise RP issues suddenly coming into the same queue as SP cases, our SP-centric staff were constantly getting into stuff they didn’t understand.

One such case came in to us, priority 1, from a service provider that ran “cell sites”, which are concrete bunkers with radio equipment for cellular transmissions.  “Now wait,” you’re saying, “I thought you just said enterprise RP cases were a problem, but this was a service provider!”  Well, it was a service provider but they ran LAN switches at the cell site, so naturally when OSPF started going haywire it came in to the RP team despite obviously being a switching problem!

A quick look at the logs confirmed this:

Jun 13 01:52:36 LSW38-0 3858130: Jun 13 01:52:32.347 CDT:
%C4K_EBM-4-HOSTFLAPPING: Host 00:AB:DA:EE:0A:FF in vlan 74 is flapping
between port Fa2/37 and port Po1

Here we could see a host MAC address moving between a front-panel port on the switch and a core-facing port channel.  Something’s not right there.  There were tons of messages like these in the logs.

Digging a little further I determined that Spanning Tree was disabled.  Ugh.

Spanning Tree Protocol (STP) is not popular, and it’s definitely flawed.  With all due respect to the (truly) great Radia Perlman, the inventor of STP, choosing the lowest bridge identifier (usually the MAC address of the switch) as the root, when priorities are set to the default, is a bad idea.  It means that if customers deploy STP with default values, the oldest switch in the network becomes root.  Bad idea, as I said.  However, STP also gets a bad reputation undeservedly.  I cannot tell you how many times there was a layer 2 loop in a customer network, where STP was disabled, and the customer referred to it as a “Spanning Tree loop”.  STP stops layer 2 loops, it does not create them.  And a layer 2 loop out of control is much worse than a 50-second spanning tree outage, which is what you got with the original protocol spec.  When there is no loop in the network, STP doesn’t do anything at all except send out BPDUs.
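The usual advice, then and now, is not to run STP with defaults:  pin the root deliberately and use the rapid variant, which converges in a few seconds rather than the classic 30 to 50.  A minimal sketch for a Catalyst-style IOS switch (the VLAN range and priority value are just illustrative):

spanning-tree mode rapid-pvst
! make this switch the root for its VLANs instead of whichever box has the lowest MAC
spanning-tree vlan 1-4094 priority 4096

With the rapid variant in place, the “it’s too slow” objection mostly evaporates, and you keep the safety net for when someone patches a loop into the network.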

As I suspected, the customer had disabled spanning tree due to concerns about the speed of failover.  They had also managed to patch a layer 2 loop into their network during a minor change, causing an unchecked loop to circulate frames out of control, bringing down their entire cell site.

I explained to them the value of STP, and why any outage caused by it would be better than the out of control loop they had.  I was told to mind my own business.  They didn’t want to enable spanning tree because it was slow.  Yes, I said, but only when there is a loop!  And in that case, a short outage is better than a meltdown.  Then I realized the customer and I were in a loop, which I could break by closing the case.

Newer technologies (such as SD-Access) obviate the need for STP, but if you’re doing classic Layer 2, please, use it.

Everyone who’s worked in TAC can tell you their nightmare case–the type of case that, when they see it in the queue, makes them want to run away, take an unexpected lunch break, and hope some other engineer grabs it.  The nightmare case is the case you know you’ll get stuck on for hours, on a conference bridge, escalating to other engineers, trying to find a solution to an impossible problem.  For some it’s unexplained packet loss.  For others, it’s multicast.  For me, it was EIGRP Stuck-in-Active (SIA).

Some customer support engineers (CSEs) thought SIA cases were easy.  Not me.  A number of times I had a network in total meltdown due to SIA with no clue as to where the problem was.  Often the solution required a significant redesign of the network.

As a review, EIGRP is more-or-less a distance-vector routing protocol, which uses an algorithm called DUAL to achieve better performance than a traditional DV protocol like RIP.  I don’t want to get into all the fun CCIE questions on the protocol details, but what matters for this article is how querying works.  When an EIGRP router loses a route, it sets the route as “Active” and then queries its neighbors as to where the route went.  Then, if the neighbors don’t have it, they set it active and query their neighbors.  If those neighbors don’t have the route either, they of course mark it active and query their neighbors.  And so forth.

It should be obvious from this process that in a large network, the queries can multiply quite quickly.  If a router has a lot of neighbors, and its neighbors have a lot of neighbors, the queries multiply exponentially, and can get out of control.  Meanwhile, when a router sets a route active, it sets a timer.  If it doesn’t get a reply before the timer expires, the router marks the route “Stuck In Active” and resets the entire EIGRP adjacency.  In a large network with a lot of neighbors, even if the route is present, the time lag between sending a query and getting a response can be so long that the adjacency gets reset before the response makes it to the original querying router.

I’ve glossed over some of the details here, since obviously an EIGRP router can lose a route entirely without going SIA.  For details, see this article.  The main point to remember is that SIA happens when the querying router just doesn’t get a response back.
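For context, and not specific to the case that follows, the standard way to keep queries from ranging across a big EIGRP network is to shrink the query domain with summarization and stub routers.  A minimal sketch, with an illustrative AS number and addressing:

! on a hub or distribution router: advertise a summary toward the rest of the network;
! routers beyond this point only know 172.16.0.0/16, so queries for a more specific
! route are answered immediately instead of propagating
interface Serial0/0
 ip summary-address eigrp 55555 172.16.0.0 255.255.0.0
!
! on remote or spoke routers: advertise only connected and summary routes, and
! tell neighbors not to send queries here at all
router eigrp 55555
 eigrp stub connected summary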

Back in my TAC days, I of course wasn’t happy to see an SIA drop in the queue.  I waited to see if one of my colleagues would take the case and alleviate the burden, but the case turned blue after 20 minutes, meaning someone had to take it.  Darn.

Now I can show my age, because the customer had adjacencies resetting on Token Ring interfaces.  I asked the customer for a topology diagram, some debugs, and to check whether there was packet loss across the network.  Sometimes, if packets are getting dropped, the query responses don’t make it back to the original router, causing SIA.  The logs from the resets looked like this:

rtr1 - 172.16.109.118 - TokenRing1/0
Sep 1 16:58:06: %DUAL-3-SIA: Route 172.16.161.58/32 stuck-in-active state in IP-EIGRP(0) 55555. Cleaning up
Sep 1 16:58:06: %DUAL-5-NBRCHANGE: IP-EIGRP(0) 55555: Neighbor 172.16.109.124 (TokenRing1/0) is down: stuck in active
Sep 1 16:58:07: %DUAL-5-NBRCHANGE: IP-EIGRP(0) 55555: Neighbor 172.16.109.124 (TokenRing1/0) is up: new adjacency

This is typical of SIA.  The adjacency flapped, but the logs showed no particular reason why.

I thought back to my first troubleshooting experience as a network engineer.  I had brought up a new branch office but it couldn’t talk back to HQ.  Mike, my friend and mentor, showed up and started pinging hop-by-hop until he found a missing route.  “That’s how I learned it,” he said, “just go one hop at a time.”  The big clue I had in the SIA case was the missing route:  172.16.161.58/32.  I started tracing it back, hop-by-hop.

I found that the route originated from a router on the edge of the customer network, which had an ISDN PRI connected.  (Showing my age again!)  They had a number of smaller offices that would dial into the ISDN on-demand, and then drop off.  ISDN had per-minute charges and thus, in this pre-VPN era, it was common to set up ISDN in on-demand mode.  ISDN was a digital dial-up technology with very short call setup times.  I discovered that, as these calls were going up and down, the router was generating /32 peer routes for the neighbors and injecting them into EIGRP.  They had a poorly designed network with a huge query domain, and so as these dial peers were going up and down, routers on the opposite side of the network were going active on the routes and not getting responses back.

They were advertising a /16 for the entire 172.16.x.x network, so sending a /32 per dial peer was totally unnecessary.  I recommended they enable “no peer neighbor-route” on the PRI to suppress the /32’s and the SIAs went away.
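For anyone who has mercifully never touched dial-up ISDN:  “peer neighbor-route” is, as I recall, on by default on PPP-style interfaces, and it installs a /32 host route to the peer whenever a call connects.  Suppressing it is one line under the interface that terminates the calls (the interface name here is hypothetical):

interface Dialer1
 no peer neighbor-route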

I hate to bite the hand that feeds me, but even though I work at Cisco I can say I really never liked EIGRP.  EIGRP is fast, and if the network is designed well, it works fine.  However, networks often grow organically, and the larger the domain, the more unstable EIGRP becomes.  I’ve never seen this sort of problem with OSPF or ISIS.  Fortunately, this case ended up being much less problematic than I expected, but often these cases were far nastier.  Oftentimes it was nearly impossible to find the route causing the problem and why it was going crazy.  Anyhow it’s always good to relive a case with both Token Ring and ISDN for a double case of nostalgia.

A common approach for TAC engineers and customers working on a tough case is to just “throw hardware at it.”  Sometimes this can be laziness:  why troubleshoot a complex problem when you can send an RMA, swap out a line card, and hope it works?  Other times it’s a legitimate step in a complex process of elimination.  RMA the card and if the problem still happens, well, you’ve eliminated the card as one source of the problem.

Hence, it was not an uncommon event when I got a P1 case from a major service provider, requeued (reassigned) after multiple RMAs.  The customer had a 12000-series GSR, top of the line back then, and was frustrated because ISIS wasn’t working.

“We just upgraded the GRP to a PRP to speed the router up,” he said, “but now it’s taking 4 hours for ISIS to converge.  Why did we pay all this money on a new route processor when it just slowed our box way down?!”

The GSR is a chassis-type router, with multiple line cards carrying ports of different types, a fabric interconnecting them, and a management module (route processor, or RP) acting as the brains of the device.  The original RP was called a GRP, but Cisco had released an improved version called the PRP.

The GSR 12000-series

The customer seemed to think the new PRP had performance issues, but this didn’t make sense.  Performance issues might cause some small delays or possibly packet loss for packets destined to the RP, but not delays of four hours.  Something else was amiss.  I asked the customer to send me the ISIS database, and it was full of LSPs like this:

#sh isis database

IS-IS Level-2 Link State Database
LSPID                 LSP Seq Num  LSP Checksum  LSP Holdtime      ATT/P/OL
0651.8412.7001.00-00  0x00000000   0x0000        193               0/0/0

ISIS routers periodically send CSNPs, or Complete Sequence Number PDUs, which contain a list of all the link state packets (LSPs) in the router database.  In this case, the GSR was directly attached to a Juniper router which was its sole ISIS adjacency.  It was receiving the entire ISIS database from this router.  Normally an ISIS database entry looks like this:

#sh isis database

IS-IS Level-2 Link State Database
LSPID                 LSP Seq Num  LSP Checksum  LSP Holdtime      ATT/P/OL
bb1-sjc.00-00         0x0000041E   0xF97D        65365             0/0/0

Note that instead of a router ID, we actually have a router name.  Note also that we have a sequence number and a checksum for each LSP.  As the previous output shows, something was wrong with the LSPs we were receiving.  Not only was the name not resolving, the sequence and checksum were zero.  How can we possibly have an LSP which has no sequence number at all?

Even weirder was that as I refreshed the ISIS outputs, the LSPs started resolving, suddenly popping up with names and non-zero sequences and checksums.  I stayed on the phone with the customer for several hours, before finally every LSP was resolved, and the customer had full reachability.  “Don’t do anything to the router until I get back to you,” I said before hanging up.  If only he had listened.

I was about to pack up for the day when I got called by our hotline.  The customer had called in and escalated to a P1 after reloading the router.  The entire link state database was zero’d out again, and the network was down.  He only had a short maintenance window in which to work, and now he had an outage.  It was 6pm.  I knew I wasn’t going home for a while.

Whatever was happening was well beyond my ISIS expertise.  Even in the routing protocols team, it was hard to find deep knowledge of ISIS.  I needed an expert, and Abe Martey, who sat across from me, literally wrote the book on ISIS.  The Cisco Press book, that is.  The only issue:  Abe had decided to take PTO that week.  Of course.  I pinged a protocols escalation engineer, one of our best BGP guys.  He didn’t want anything to do with it.  Finally I reached out to the duty manager and asked for help.  I also emailed our internal mailers for ISIS, but after 6pm I wasn’t too optimistic.

Why were we seeing what appeared to be invalid LSPs?  How could an LSP even have a zero checksum or sequence number?  Why did they seem to clear out, and why so slowly?  Did the upgrade to the PRP have anything to do with it?  Was it hardware?  A bug?  As a TAC engineer, you have to consider every single possibility, from A to Z.

The duty manager finally got Sanjeev, an “ISIS expert” from Australia on the call.  The customer may not realize this while a case is being handled, but if it’s complex and high priority, there is often a flurry of instant messaging going on behind the scenes.  We had a chat room up, and as the “expert” listened to the description of the problem and looked at the notes, he typed in the window:  “This is way over my head.”  Great, so much for expertise.  Our conversation was getting heated with the customer, as his frustration with the lack of progress escalated.  The so-called expert asked him to run a command, which another TAC engineer suggested.

“Fantastic,” said the customer, “Sanjeev wants us to run a command.  Sanjeev, tell us, why do you want to run this command?  What’s it going to do?”

“Uh, I’m not sure,” said Sanjeev, “I’ll have to get back to you on that.”

Not a good answer.

By 8:30 PM we also had a senior routing protocols engineer in the chat window.  He seemed to think it was a hardware issue and was scraping the error counters on the line cards. The dedicated Advanced Services NCE for the account also signed on and was looking at the errors. It’s a painful feeling knowing you and the customer are stranded, but we honestly had no idea what to do.  Because the other end of the problem was a Juniper router, JTAC came on board as well.  We may have been competitors, but we were professionals and put it aside to best help the customer.

Looking at the chat transcript, which I saved, is painful.  One person suggests physically cleaning the fiber connection.  Another thinks it’s memory corruption.  Another believes it is packet corruption.  We schedule a circuit test with the customer to look for transmission errors.

All the while, the 0x0000 LSPs were re-populating with legitimate information until, by 9pm, the ISIS database was fully converged and routing was working again.  “This time,” I said, “DO NOT touch the router.”  The customer agreed.  I headed home at 9:12pm, secretly hoping they would reload the router so the case would get requeued to night shift and taken off my hands.

In the morning we got on our scheduled update call with the customer.  I was tired, and not happy to make the call.  We had gotten nowhere in the night, and had not gotten helpful responses to our emails.  I wasn’t sure what I was going to say.  I was surprised to hear the customer in a chipper mood.  “I’m happy to report Juniper has reproduced the problem in their lab and has identified the problem.”

There was a little bit of wounded pride knowing they found the fix before we did, but also a sense of relief to know I could close the case.

It turns out that the customer, around the same time they installed the PRP, had attempted to normalize the configs between the Juniper and Cisco devices.  They had mistakenly configured a timer called the “LSP pacing interval” on the Juniper side.  This controls the rate at which the Juniper box sends out LSPs.  They had thought they were configuring the same timer as the LSP refresh interval on the Cisco side, but they were two different things.  By cranking it way up, they ensured that the hundreds of LSPs in the database would trickle in, taking hours to converge.
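Going from memory, the naming is what invites the mistake.  On the Juniper side, lsp-interval is a per-interface pacing timer in milliseconds, controlling the gap between LSP transmissions.  On the Cisco side, the similarly named interface-level knob is the true counterpart, while lsp-refresh-interval is a per-process timer in seconds that controls how often the router regenerates its own LSPs.  A sketch of the two IOS knobs side by side (interface name and values are illustrative):

! interface-level pacing, in milliseconds: the counterpart of the Junos lsp-interval
interface POS0/0
 isis lsp-interval 100
!
! process-level refresh timer, in seconds: how often this router refreshes the LSPs it originates
router isis
 lsp-refresh-interval 900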

Why the 0x0000 entries then?  It turns out that in the initial exchange, the ISIS routers share with each other what LSPs they have, without sending the full LSP.  Thus, in Cisco ISIS databases, the 0x0000 entry acts as a placeholder until complete LSP data is received.  Normally this period is short and you don’t see the entry.  We probably would have found the person who knew that eventually, but we didn’t find him that night and our database of cases, newsgroup postings, and bugs turned up nothing to point us in the right direction.

I touched a couple thousand cases in my time at TAC, but this case I remember even 10 years later because of the seeming complexity, the simplicity of the resolution, the weirdness of the symptoms, and the distractors like the PRP upgrade.  Often a major outage sends you in a lot of directions and down many rat holes.  I don’t think we could have done much differently, since the config error was totally invisible to us.  Anyway, if Juniper and Cisco can work together to solve a customer issue, maybe we should have hope for world peace.

When I first started at Cisco TAC, I was assigned to a team that handled only enterprise customers.  One of the first things my boss said to me when I started there was “At Cisco, if you don’t like your boss or your cubicle, wait three months.”  Three months later, they broke the team up and I had a new boss and a new cubicle.  My new team handled routing protocols for both enterprise and service provider customers, and I had a steep learning curve having just barely settled down in the first job.

A P1 case came into my queue for a huge cable provider.  Often P1’s are easy, requiring just an RMA, but this one was a mess.  It was a coast-to-coast BGP meltdown for one of the largest service provider networks in the country.  Ugh.  I was on the queue at the wrong time and took the wrong case.

The cable company was seeing BGP adjacencies reset across their entire network.  The errors looked like this:

Jun 16 13:48:00.313 EST: %BGP-5-ADJCHANGE: neighbor 172.17.249.17 Down BGP
Notification sent

Jun 16 13:48:00.313 EST: %BGP-3-NOTIFICATION: sent to neighbor 172.17.249.17
3/1 (update malformed) 8 bytes 41A41FFF FFFFFFFF

The cause seemed to be malformed BGP packets, but why?  The GSR routers they had were kind enough to give us a hex dump of the BGP packet when an adjacency reset.  I got out my trusty Doyle book and began decoding the packets on paper, until a colleague pointed me to an internal Cisco tool that would decode a BGP packet from hex.

We could see that, for some reason, the NLRI portion of the BGP message was getting cut off.  According to my calculations, it should have been 44 bytes, but we were only seeing 32 bytes of information.  NLRI is Network Layer Reachability Information, just a fancy BGP way of saying the paths that go into the routing update.  We also noticed a clue in the router logs:  TCP-6-TOOBIG messages showing up from time to time.
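Decoding by hand is mostly bookkeeping, because a BGP UPDATE carries no explicit NLRI length; you infer it from everything else.  Roughly, per the standard message format (nothing here is specific to this case):

19 bytes   header (16-byte marker + 2-byte length + 1-byte type)
 2 bytes   withdrawn routes length, followed by the withdrawn routes themselves
 2 bytes   total path attribute length, followed by the path attributes
remainder  NLRI: for each prefix, 1 length byte plus just enough octets to hold it
           (a /24 takes 4 bytes, a /16 takes 3, a /32 takes 5)

So the expected NLRI size is simply the message length minus the header, withdrawn-route, and path-attribute bytes.  When the arithmetic said 44 bytes should remain but only 32 were actually there, the decoder ran out of data partway through a prefix, which lines up with the UPDATE message error (the “3/1” in the notification above).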

Going over it with engineering, we realized something interesting.  The customer had enabled TCP selective acknowledgement on all their routers.  Also known as SACK, TCP selective acknowledgement is designed to circumvent an inefficiency in TCP.  If, say, 1 of 3 TCP segments gets dropped, the TCP protocol requires re-transmission of all 3 of the segments.  In other words, the receiver keeps ACKing the last segment it received, but it takes time for the sender to realize something is wrong.  When the sender finally realizes something is wrong, it goes back to the last known good segment and re-transmits everything after it.  SACK allows TCP to acknowledge and re-transmit specific segments.  If we are only missing segments 2, 3, and 5, then we can ask for just those to be re-transmitted.  SACK is stored as an option in the TCP header.

The problem is, there is a finite amount of space in the TCP header, and the SACK field can get rather long.  It just so happens that BGP also stores its MD5 authentication hash in the TCP header.  If SACK gets too long, it can crowd the MD5 header and cause BGP errors.  Based on our analysis, this was exactly what had happened.  Thus, the malformed packets.  We had the customer remove the SACK option from all routers and the problem stopped.
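The numbers explain why this is even possible.  The TCP data offset field caps the header at 60 bytes, 20 of which are fixed, leaving 40 bytes for all options combined.  A rough budget (these are the standard option sizes from the RFCs; whether timestamps were in play on these routers I don’t recall):

40 bytes   total TCP option space (60-byte maximum header minus the 20-byte fixed part)
18 bytes   MD5 signature option carrying the BGP authentication hash
10 bytes   timestamps, if enabled
10 bytes   SACK with a single block (2-byte option header + 8 bytes per block)

With MD5 in place there is room for roughly one, maybe two, SACK blocks; advertise any more and something has to be squeezed out, which is what the options-packing bug came down to.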

We were left with a couple questions.  Why did SACK get so long, and why would it be allowed to overwrite other important values in the TCP header?  In answer to the first question, there was a bug which was causing some linecards to send out malformed packets on occasion, thus causing SACKs.  In answer to the second question, there was a bug in the TCP header options packing that allowed one field (SACK) to crowd out another field (MD5 authentication).  I knew the case wouldn’t close for a long time.  Multiple bugs needed to be filed, and new code qualified and installed.  Fortunately the customer had a workaround (disable SACK) and an HTE.  An HTE was a TAC engineer dedicated to their account.  He grabbed the case from me for babysitting and I moved onto my next case.

In my TAC tales I often make fun of the occasional mistakes of TAC engineers.  However, TAC is a tough job, and the organization is staffed by some top engineers.  Many cases, like this one, required hard core engineering and knowledge that spans protocol details and ASIC-level hardware debugging.  It’s not a job for the faint of heart.  This case required digging into the TCP header, understanding how options are packed, and figuring out how to stop a major meltdown of a service provider network.  A high-stress situation, to be sure, but these cases often were the most rewarding.

 

No customer is happy if they have to reboot one of their Internet-facing routers periodically, and this was one of our biggest customers.  (At HTTS, they were all big customers.)  This customer had a GSR connecting to the Internet, with partial BGP routes, and he kept getting this error:

%RP-3-ENCAP: Failure to allocate encap table entry, exceeded max number of entries, slot 2

Eventually the router would stop passing traffic and when this happened, he had to reload it.  Needless to say, he wasn’t happy.

The error came with a traceback, which shows what functions the code was executing when the error was generated.  The last function was this:

arp_background(0x5053d290)+0x140

Well, this was obviously some sort of ARP issue.  But why was ARP causing the router to stop forwarding traffic?

Looking up the error, I found that it meant the route processor was unable to allocate a rewrite entry for the slot 2 line card.  As a packet leaves the fabric of a large router like the GSR, the headers are re-written with the destination layer 2 info.  The rewrite table used for this was full.  I had the customer run a hidden command a few times, and we could see the table entries incrementing quickly:

Adjacency Table has 3167 adjacencies

Adjacency Table has 3291 adjacencies

Adjacency Table has 3322 adjacencies

Adjacency Table has 3410 adjacencies

Scrolling through the config, I looked for something that could be the culprit.  Then I saw it.  I remembered a router architecture course I had to take when I first became a TAC agent.  One of the escalation engineers told the story of his first P1 case.  It was a router that kept needing a reload.  He went to another senior escalation engineer, and after looking at the config she said to him, “What are you, a f*cking idiot?”  He was quite shocked to be addressed in this manner.  “There is a static route pointed to a broadcast interface!” she yelled, and then proceeded to chew him out for wasting her time.  This lady was famous in TAC for using bad language in nearly every sentence, and our trainer was able to laugh about it in retrospect.  “Now that I know her I don’t even care when she talks to me like that,” he reported.

Well, I wasn’t going to be called anything like that.  I looked in the config and found this:

ip route 0.0.0.0 0.0.0.0 GigabitEthernet2/0 100

A default route, pointed out a broadcast interface.  With partial BGP routes, this meant that the router was generating an ARP entry for every single destination address on the Internet that was not in the partial BGP table.  Whoops.  There are millions of destinations on the Internet, so it’s no surprise he was exhausting the capacity of the rewrite table on his line card.

He removed the route and replaced it with a static route to the next hop.  The adjacency table immediately dropped below 100.  Problem solved.
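The change itself was a one-liner.  A sketch of the before and after (the next-hop address here is made up; it would be whatever the provider hands you):

! before: the router ARPs for every off-net destination the default route covers
no ip route 0.0.0.0 0.0.0.0 GigabitEthernet2/0 100
! after: recurse to the provider's next-hop address, which needs a single ARP entry
ip route 0.0.0.0 0.0.0.0 192.0.2.1 100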

Some TAC cases were mind-bogglingly difficult, involving multiple layers of help from engineering, hours in the lab, and major frustration.  Some, like this one, are major problems with major customers that end quickly and easily.  I closed the case with this note:

Customer was seeing RP-3-ENCAP error messages on one of his GSR LC’s. The card would eventually stop passing traffic, requiring reload of the router. Customer had a static default route to the Internet pointed out a broadcast interface–this was causing the router to ARP out that interface and create CEF adjacencies for each destination on the Internet. This was overloading the rewrite table on the LC. Customer removed static route, pointed to next hop address instead. Rewrite table entries went back to normal.