The case came into the routing protocols queue, even though it was simply a line card crash.  The RP queue in HTTS was the dumping ground for anything that did not fit into one of the few other specialized queues we had.  A large US service provider had a Packet over SONET (PoS) line card on a GSR 12000-series router crashing over and over again.

Problem Details: 8 Port ISE Packet Over SONET card continually crashing due to

SLOT 2:Aug  3 03:58:31: %EE48-3-ALPHAERR: TX ALPHA: error: cpu int 1 mask 277FFFFF
SLOT 2:Aug  3 03:58:31: %EE48-4-GULF_TX_SRAM_ERROR: ASIC GULF: TX bad packet header detected. Details=0x4000

A previous engineer had the case, and he did what a lot of TAC engineers do when faced with an inexplicable problem:  he RMA’d the line card.  As I have said before, RMA is the default option for many TAC engineers, and it’s not a bad one.  Hardware errors are frequent and replacing hardware often is a quick route to solving the problem.  Unfortunately the RMA did not fix the problem, the case got requeued to another engineer, and he…RMA’d the line card.  Again.  When that didn’t work, he had them try the card in a different slot, but it continued to generate errors and crash.

The case bounced through two other engineers before getting to me.  Too bad the RMA option was out.  But the simple line card crash and error got even weirder.  The customer had two GSR routers in two different cities that were crashing with the same error.  Even stranger:  the crash was happening at precisely the same time in both cities, down to the second.  It couldn’t be a coincidence, because each crash on the first router was mirrored by a crash at exactly the same time on the second.

The conversation with my fellow engineers ranged from plausible to ludicrous.  There was a legend in TAC, true or not, that solar flares cause parity errors in memory and hence crashes.  Could a solar flare be triggering the same error on both line cards at the same time?  Some of my colleagues thought it was likely, but I thought it was silly.

Meanwhile, internal emails were going back and forth with the business unit to figure out what the errors meant.  Even for experienced network engineers, Cisco internal emails can read like a foreign language.  “The ALPHA errors are side-effects the GULF errors,” one development engineer commented, not so helpfully.  “Engine is feeding invalid packets to GULF and that causes the bad header error being detected on GULF,” another replied, only slightly more helpfully.

The customer, meanwhile, had identified a faulty fabric card on a Juniper router in their core.  Apparently the router was sending malformed packets to multiple provider edge (PE) routers all at once, which explained the simultaneous crashing.  Because all the PEs were in the US, forwarding was a matter of milliseconds, and thus there was very little variation in the timing.  How did the packets manage to traverse the several hops of the provider network without crashing any GSRs in between?  Well, the customer was using MPLS, and the corruption was in the IP header of the packets.  The intermediate hops forwarded the packets, without ever looking at the IP header, to the edge of the network, where the MPLS labels get stripped, and IP forwarding kicks in.  It was at that point that the line card crashed due to the faulty IP headers.  That said, when a line card receives a bad packet, it should drop it, not crash.  We had a bug.
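The forwarding behavior above can be sketched in a few lines of entirely hypothetical, non-router Python: core LSRs switch on the label alone, so a corrupted IP header is invisible until the egress PE pops the label and parses it.

```python
# Toy model (not router code) of MPLS forwarding with a corrupted IP header.
# Core LSRs treat the IP packet as opaque payload and swap labels only;
# the egress PE pops the label and is the first hop to parse the header.

def core_forward(packet):
    # Label swap only -- the IP header is never inspected here.
    return dict(packet, label=packet["label"] + 1, hops=packet["hops"] + 1)

def edge_forward(packet):
    # Label popped; IP forwarding kicks in and the header finally gets parsed.
    packet = dict(packet, label=None, hops=packet["hops"] + 1)
    if not packet["ip_header_valid"]:
        return None  # correct behavior: drop the bad packet (the GSR crashed instead)
    return packet

def traverse(packet, core_hops=3):
    for _ in range(core_hops):
        packet = core_forward(packet)
    return edge_forward(packet)

# A packet with a corrupted IP header crosses every core hop untouched
# and is only caught (or, in this case, crashes the line card) at the edge.
bad = {"label": 100, "hops": 0, "ip_header_valid": False}
assert traverse(bad) is None
```

A healthy line card simply drops the bad packet at that last step; the bug was that the GSR's line card crashed at exactly that point instead.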

The development engineers could not determine why the line card was crashing based on log info.  By this time, the customer had already replaced the faulty Juniper module and the network was stable.  The DEs wanted us to re-introduce the faulty line card into the core, and load up an engineering special debug image on the GSRs to capture the faulty packet.  This is often where we have a gulf, pun intended, between engineering and TAC.  No major service provider or customer wants to let Cisco engineering experiment on their network.  The customer decided to let it go.  If it came back, at least we could try to blame the issue on sunspots.

In the last article on technical interviewing, I told the story of how I got my first networking job.  The interview was chaotic and disorganized, and yet it resulted in me getting the job and being quite successful.  In this post, I’d like to start with a very basic question:  Why is it that we interview job candidates in the first place?

This may seem like an obvious question, but if you think about it, face-to-face interviewing is not necessarily the best way to assess a candidate for a networking position.  To evaluate their technical credentials, why don’t we administer a test?  Or have network engineering candidates configure a small network?  (Some places do!)  What exactly do we hope to achieve by sitting down for an hour and talking to this person face-to-face?

Interviewing is fundamentally a subjective process.  Even when an interviewer attempts to bring objectivity to the interview by, say, asking right/wrong questions, interviews are just not structured as objective tests.  The interviewer feedback is usually derived from gut reactions and feelings as much as it is from any objective criteria.  The interviewer has a narrow window into the candidate’s personality and achievements, and frequently an interviewer will make an incorrect assessment in either direction:

  • By turning down a candidate who is qualified for the job.  When I worked at TAC, I remember declining a candidate who didn’t answer some questions about OSPF correctly.  Because he was a friend of a TAC engineer, he got a second chance and did better in his second interview.  He got hired and was quite successful.
  • By hiring a candidate who is unqualified for the job.  This happens all the time.  We pass people through interviews who end up being terrible at the job.  Sometimes we just assess their personality wrong and they end up being complete jerks.  Sometimes, they knew enough technical material to skate through the interview.

Having interviewed hundreds of people in my career, I think I’m a very good judge of people.  I was on the interview team for TAC, and everyone we hired was a successful engineer.  Every TME I’ve hired as a manager has been top notch.  That said, it’s tricky to assess someone in such a short amount of time. As the interviewee, you need to remember that you only have an hour or so to convince this person you are any good, and one misplaced comment could torpedo you unfairly.

I remember when I interviewed for the TME job here at Cisco.  I did really well, and had one final interview with the SVP at the time.  He was very personable, and I felt at ease with him.  He asked me for my proudest accomplishment in my career.  I mentioned how I had hated TAC when I started, but I managed to persevere and left TAC well respected and successful.  He looked at me quizzically.  I realized it was a stupid answer.  I was interviewing for a director-level position.  He wanted to hear about initiative and drive, not that I stuck it out at a crappy job.  I should have told him about how I started the Juniper on Juniper project, for example.  Luckily I got through, but that one answer gave him an impression that took me down a notch.

When you are interviewing, you really need to think about the impression you create.  You need empathy.  You need to feel how your interviewer feels, or at least be self-aware enough to know the impression you are creating.  That’s because this is a subjective process.

I remember a couple of years back I was interviewing a candidate for an open position.  I asked him why he was interested in the job.  The candidate proceeded to give me a depressing account of how bad things were in his current job.  “It’s miserable here,” he said.  “Nobody’s going anywhere in this job.  I don’t like the team;  they’re not motivated.”  And so forth.  He claimed he had programming capabilities, so I asked him what his favorite programming language was.  “I hate them all,” he said.  I actually think he was technically fairly competent, but in my opinion working with this guy would’ve been such a downer that I didn’t hire him.

In my next article I’ll take a look at different things hiring managers and interviewers are looking for in a candidate, and how they assess them in an interview.


When you open a TAC case, how exactly does the customer support engineer (CSE) figure out how to solve the case?  After all, CSEs are not super-human.  Just like any engineer, in TAC you have a range of brilliant to not-so-brilliant, and everything in between.  Let me give an example:  I worked at HTTS, or high-touch TAC, serving customers who paid a premium for higher levels of support.  When a top engineer at AT&T or Verizon opened a case, how was it that I, who had never worked professionally in a service provider environment, was able to help them at all?  Usually when those guys opened a case, it was something quite complex and not a misconfigured route map!

TAC CSEs have an arsenal of tools at their disposal that customers, and even partners, do not.  One of the most powerful is well known to anyone who has ever worked in TAC:  Topic.  Topic is an internal search engine.  It can do more now, but at the time I was in TAC, Topic could search bugs, TAC cases, and internal mailers.  If you had a weird error message or were seeing inexplicable behavior, popping the message or symptoms into Topic frequently resulted in a bug.  Failing that, it might pull up another TAC case, which would show the best troubleshooting steps to take.

Topic also searches internal mailers, the email lists used internally by Cisco employees.  TAC agents, sales people, TMEs, product managers, and engineering all exchange emails on these mailers, which are then archived.  Oftentimes a problem would show up in the mailer archives and engineering had already provided an answer.  Sometimes, if Topic failed, we would post the symptoms to the mailers in hopes engineering, a TME, or any expert would have a suggestion.  I was always careful in doing so, as if you posted something that was already answered, or asked too often, flames would be coming your way.

TAC engineers have the ability to file bugs across the Cisco product portfolio.  This is, of course, a powerful way to get engineering attention.  Customer-found defects are taken very seriously, and any bug that is opened will get a development engineer (DE) assigned to it quickly.  We were judged on the quality of the bugs we filed, since TAC does not like to abuse the privilege and waste engineering time.  If a bug is filed for something that is not really a bug, it gets marked “J” for Junk, and you don’t want to have too many junked bugs.  That said, on one or two occasions, when I needed engineering help and the mailers weren’t working, I knowingly filed a Junk bug to get some help from engineering.  Fortunately, I also filed a few real bugs that got fixed.

My team was the “routing protocols” team for HTTS, but we were a dumping ground for all sorts of cases.  RP often got crash cases, cable modem problems, and other issues, even though these weren’t strictly RP.  Even within the technical limits of RP, there is a lot of variety among cases.  Someone who knows EIGRP cold may not have a clue about MPLS.  A lot of times, when stuck on a case, we’d go find the “guy who knows that” and ask for help.  We had a number of cases on Asynchronous Transfer Mode (ATM) when I worked at TAC, which was an old WAN (more or less) protocol.  We had one guy who knew ATM, and his job was basically just to help with ATM cases.  He had a desk at the office but almost never came in, never worked a shift, and frankly I don’t know what he did all day.  But when an ATM case came in, day or night, he was on it, and I was glad we had him, since I knew little about the subject.

Some companies have NOCs with tier 1, 2, and 3 engineers, but we just had CSEs.  While we had different pay grades, TAC engineers were not tiered in HTTS.  “Take the case and get help” was the motto.  Backbone (non-HTTS) TAC had an escalation team, with some high-end CSEs who jumped in on the toughest cases.  HTTS did not, and while backbone TAC didn’t always like us pulling on their resources, at the end of the day we were all about killing cases, and a few times I had backbone escalation engineers up in my cube helping me.

The more heated a case gets, the higher the impact, the longer the time to resolve, the more attention it gets.  TAC duty managers can pull in more CSEs, escalation, engineering, and others to help get a case resolved.  Occasionally, a P1 would come in at 6pm on a Friday and you’d feel really lonely.  But Cisco being Cisco, if they need to put resources on an issue, there are a lot of talented and smart people available.

There’s nothing worse than the sinking feeling a CSE gets when realizing he or she has no clue what to do on a case.  When the Topic searches fail, when escalation engineers are stumped, when the customer is frustrated, you feel helpless.  But eventually, the problem is solved, the case is closed, and you move on to the next one.

I’ve mentioned before that EIGRP SIA was my nightmare case at TAC, but there was one other type of case that I hated–QoS problems.  Routing protocol problems tend to be binary.  Either the route is there or it isn’t;  either the pings go through or they don’t.  Even when a route is flapping, that’s just an extreme version of the binary problem.  QoS is different.  QoS cases often involved traffic that was passing sometimes or in certain amounts, but would start having problems when different sizes of traffic went through, or possibly traffic was dropping at a certain rate.  Thus, the routes could be perfectly fine, pings could pass, and yet QoS was behaving incorrectly.

In TAC, we would regularly get cases where the customer claimed traffic was dropping on a QoS policy below the configured rate.  For example, if they configured a policing profile of 1000 Mbps, sometimes the customer would claim the policer was dropping traffic at, say, 800 Mbps.  The standard response for a TAC agent struggling to figure out a QoS policy issue like this was to say that the link was experiencing “microbursting.”  If a link is showing an 800 Mbps traffic rate, that is actually an average, meaning the link could be experiencing short bursts that exceed the policing rate but are smoothed out in the interface counters.  “Microbursting” was a standard response to this problem for two reasons:  first, it was most often the actual problem;  second, it was an easy way to close the case without an extensive investigation.  The second reason is not as lazy as it may sound, as microbursts are common and usually are the cause of these symptoms.
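The averaging effect is easy to show with some toy arithmetic.  The numbers below are invented for illustration, and the “policer” here ignores the burst allowances a real token-bucket policer would have:

```python
# Invented numbers: 100 one-millisecond samples averaging 800 Mbps,
# but with 20 ms of bursting at 2000 Mbps against a 1000 Mbps policer.
POLICER_MBPS = 1000
slices = [2000] * 20 + [500] * 80  # instantaneous rate per 1 ms slice, in Mbps

average = sum(slices) / len(slices)  # what the interface counters report
excess = sum(max(0, r - POLICER_MBPS) for r in slices)  # what the policer cuts

assert average == 800  # the link "only" runs at 800 Mbps on average...
assert excess > 0      # ...yet the 1000 Mbps policer still drops traffic
```

The interface counter faithfully reports 800 Mbps, and the policer faithfully drops traffic, and both are telling the truth.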

Thus, when one of our large service provider customers opened a case stating that their LLQ policy was dropping packets before the configured threshold, I was quick to suspect microbursts.  However, working in high-touch TAC, you learn that your customers aren’t pushovers and don’t always accept the easy answer.  In this case, the customer started pushing back, claiming that the call center which was connected to this circuit generated a constant stream of traffic and that he was not experiencing microbursts.  So much for that.

This being the 2000s, the customer had four T1s connected in a single multilink PPP (MLPPP) bundle.  The LLQ policy was dropping traffic at one quarter of its configured threshold.  Knowing I wouldn’t get much out of a live production network, I reluctantly opened a lab case for the recreate, asking for two routers connected back-to-back with the same line cards, a four-link T1 interconnection, and a traffic generator.  As always, I made sure my lab had exactly the same IOS release as the customer.

Once the lab was set up I started the traffic flowing, and much to my surprise, I saw traffic dropping at one quarter of the configured LLQ policy.  Eureka!  Anyone who has worked in TAC will tell you that more often than not, lab recreates fail to recreate the customer problem.  I removed and re-applied the service policy, and the problem went away.  Uh oh.  The only thing worse than not recreating a problem is recreating it and then losing it again before developers get a chance to look at it.

I spent some time playing with the setup, trying to get the problem back.  Finally, I reloaded the router to start over and, sure enough, I got the traffic loss again.  So, the problem occurred at start-up, but when the policy was removed and re-applied, it corrected itself.  I filed a bug and sent it to engineering.

Because it was so easy to recreate, it didn’t take long to find the answer.  The customer was configuring their QoS policy using bandwidth percentages instead of absolute bandwidth numbers.  This means that the policy bandwidth would be determined dynamically by the router based on the links it was applied to.  It turns out that IOS was calculating the bandwidth numbers before the MLPPP bundle was fully up, and hence was using only a single T1 as the reference for the calculation instead of all four.  The fix was to change the priority of operations in IOS, so that the MLPPP bundle came up before the QoS policy was applied.
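The arithmetic of the bug can be sketched as follows.  The 50% priority figure and the helper function are hypothetical, but the shape of the miscalculation matches the case:

```python
# Hypothetical reconstruction of the miscalculation: a percentage-based
# priority queue sized from the interface bandwidth at policy-attach time.
T1_KBPS = 1544          # bandwidth of one T1
PRIORITY_PERCENT = 50   # invented "priority percent 50" policy

def llq_bandwidth_kbps(links_up):
    # IOS derives the absolute rate from the bandwidth it sees when the
    # policy is applied -- in the bug, before the MLPPP bundle was fully up.
    return links_up * T1_KBPS * PRIORITY_PERCENT / 100

intended = llq_bandwidth_kbps(links_up=4)  # all four T1s counted
buggy = llq_bandwidth_kbps(links_up=1)     # only one T1 up at attach time

assert buggy == intended / 4  # drops kicked in at one quarter of the threshold
```

Removing and re-applying the policy recomputed the rate against the fully-formed bundle, which is why the problem vanished until the next reload.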

So much for microbursts.  The moral(s) of the story?  First, the most obvious cause is often not the cause at all.  Second, determined customers are often right.  And third:  even intimidating QoS cases can have an easy fix.

I’ve been in this industry a while now, and I’ve done a lot of jobs.  Certainly not every job, but a lot.  My first full time network engineering job came in 2000, but I was doing some networking for a few years before that.

I often see younger network engineers posting in public forums asking about the pros and cons of different job roles.  I’ve learned over the years that you have to take people’s advice with a grain of salt.  Jobs in one category may have some common characteristics, but a huge amount is dependent on the company, your manager, and the people you work with.  However, we all have a natural tendency to try to figure out the situation at a potential job in advance, and the experience of others can be quite helpful in sizing up a role.  Therefore, I’ve decided to post a summary of the jobs I’ve done, and the advantages/disadvantages of each.

IT Network Engineer

Summary:
This is an in-house engineer at a corporation, government agency, educational institution, or really any entity that runs a network.  The typical job tasks vary depending on level, but they usually involve a large amount of day-to-day network management.  This can be responding to complaints about network performance, patching in network connectivity (less so these days because of wireless), upgrading and maintaining devices, working with carriers, etc.  Larger scale projects could be turning up new buildings and sites, planning for adding new functionality (e.g. multicast), etc.

Pros:

  • Stable and predictable work environment.  You show up at the same place and know the people, unlike consulting.
  • You know the network.  You’re not showing up a new place trying to figure out what’s going on.
  • It can be a great chance to learn if the company is growing and funds new projects.

Cons:

  • You only get to see one network and one way of doing things.
  • IT is a cost center, so there is a constant desire to cut personnel/expenses.
  • Automation is reducing the type of on-site work that was once a staple for these engineers.
  • Your fellow employees often hate IT and blame you for everything.
  • Occasionally uncomfortable hours due to maintenance windows.

Key Takeaway:
I often tell people that if you want to do an in-house IT job, try to find an interesting place to work.  Being an IT guy at a law firm can be kind of boring.  Being an IT guy at the Pentagon could be quite interesting.  I worked for a major metropolitan newspaper for five years (when there was such a thing) and it was fascinating to see how newspapers work.  Smaller companies can be better in that you often get to touch more technologies, but the work can be less interesting.  Larger companies can pigeonhole you into a particular area.  You might work only on the WAN and never touch the campus or data center side of things, for example.

Technical Support Engineer

Summary:
Work at a vendor like Cisco or Juniper taking cases when things go wrong.  Troubleshoot problems, recreate them in the lab, file bugs, find solutions for customers.  Help resolve outages.  See my TAC Tales for the gory details.

Pros:

  • Great way to get a vast amount of experience by taking lots of tough cases.
  • Huge support organization to help you through trouble.
  • Short-term work for the most part–when you close a case you’re done with it and move on to something new.
  • Usually works on a shift schedule, with predictable hours.  Maintenance windows can often be handed off.

Cons:

  • Nearly every call involves someone who is unhappy.
  • Complex and annoying technical problems.  Your job is 100% troubleshooting and it gets old.
  • Usually a high case volume which means a mountain of work.

Key Takeaway:
Technical Support is a tough and demanding environment, but a great way to get exposure to a constant stream of major technical issues.  Some people actually like tech support and make a career out of it, but most I’ve known can burn out after a while.  I wouldn’t trade my TAC years for anything despite the difficulties, as it was an incredible learning experience for me.

Sales Engineer

Summary:
I’ve only filled this role at a partner, so I cannot speak directly to the experience inside a company like Cisco (although I constantly work with Cisco SE’s).  This is a pre-sales technical role, generally partnered with a less-technical account manager.  SE’s ultimately are responsible for generating sales, but act as a consultant or adviser to the customer to ensure they are selling something that fits.  SE’s do initial architecture of a given solution, work on developing the Bill of Materials (BoM), and in the case of partners, help to write the Statement of Work (SoW) for deployment.  SE’s are often involved in deployment of the solutions they sell but it is not their primary job.

Pros:

  • Architectural work is often very rewarding;  great chance to partner with customer and build networks.
  • Often allows working on a broad range of technologies and customers.
  • Because it involves sales, usually good training on the latest technologies.
  • Unlike pure sales (account managers in Cisco lingo), a large amount of compensation is salary so better financial stability.
  • Often very lucrative.

Cons:

  • Like any account-based job, success/enjoyability is highly dependent on the account(s) you are assigned to.
  • Compensation tied to sales, so while there are good opportunities to make money, there is also a chance to lose a lot of discretionary income.
  • Often take the hit for poor product quality from the company whose products you are selling.
  • Because it is a pre-sales role, often don’t get as much hands-on as post-sales engineers.
  • For some products, building BoM’s can be tedious.
  • Sales pressure.  Your superiors have numbers to make and if you’re not seen to be helping, you could be in trouble.

Key Takeaway:
Pre-sales at a partner or vendor can be a well-paying and enjoyable job.  Working on architecture is rewarding and interesting, and a great chance to stay current on the latest technologies.  However, like any sales/account-based job, the financial and career success of SE’s is highly dependent on the customers they are assigned to and the quality of the sales team they are working with.  Generally SE’s don’t do technical support, but often can get pulled into late-night calls if a solution they sell doesn’t work.  SEs are often the face of the company and can take a lot of hits for things that they sell which don’t work as expected.  Overall I enjoyed being a partner SE for the most part, although the partner I worked for had some problems.

Post-Sales/Advanced Services

Summary:
I’m including both partner post-sales, which I have done, and advanced services at a vendor like Cisco, which are similar.  A post-sales engineer is responsible for deploying a solution once the customer has purchased it, and oftentimes the AS/deployment piece is a part of the sale.  Occasionally these engineers are used for non-project-based work, more so at partners.  In this case, the engineer might be called to be on site to do some regular maintenance, fill in for a vacationing engineer, etc.

Pros:

  • Hands-on network engineering.  This is what we all signed up for, right?  Getting into real networks, setting stuff up, and making things happen.
  • Unlike IT network engineers, this job is more deployment focused so you don’t have to spend as much time on day-to-day administrative tasks.
  • Unlike sales, the designs you work on are lower-level and more detailed, so again, this is a great nuts-and-bolts engineering role.

Cons:

  • As with sales, the quality and enjoyability is highly dependent on the customers you end up with.
  • You can get into some nasty deployment scenarios with very unhappy customers.
  • Often these engagements are short-term, so less of a chance to learn a customer/network.  Often it is get in, do the deployment, and move on to the next one.
  • Can involve a lot of travel.
  • Frequently end up assisting technical support with deployments you have done.
  • Can have odd hours.
  • Often left scrambling when sales messed up the BoM and didn’t order the right gear or parts.

Key Takeaway:
I definitely enjoyed many of my post-sales deployments at the VAR.  Being on-site and doing a live deployment with a customer is great.  I remember one time when I did a total network refresh and VoIP deployment up at St. Helena Unified School District in Napa, CA.  It was a small school district, but over a week in the summer we went building-by-building replacing the switches and routers and setting up the new system.  The customer was totally easygoing, gave us 100% free rein to do it how we wanted, was understanding of complications, and was satisfied with the result.  Plus, I enjoyed spending a week up in Napa, eating well and loving the peace.  However, I also had some nightmare customers who micromanaged me or where things just went south.  It’s definitely a great job to gain experience on a variety of live customer networks.

Technical Marketing Engineer

Summary:
I’m currently a Principal TME and a manager of TMEs.  This is definitely my favorite job in the industry.  I give more details in my post on what a TME does, but generally we work in a business unit of a vendor, on a specific product or product family, both guiding the requirements for the product and doing outbound work to explain the product to others, via white papers, videos, presentations, etc.

Pros:

  • Working on the product side allows a network engineer to actually guide products and see the results.  It’s exciting to see a new feature, CLI, etc., added to a product because you drove it.
  • Get to attend at least several trade shows a year.  Everyone likes getting a free pass to a conference like Cisco Live, but to be a part of making it happen is exhilarating.
  • Great career visibility.  Because the nature of the job requires producing content related to your product, you have an excellent body of work to showcase when you decide to move on.
  • Revenue side.  I didn’t mention this in the sales write-up, but it’s true there too.  Being close to revenue is generally more fun than being in a cost center like IT, because you can usually spend more money.  This means getting new stuff for your lab, etc.
  • Working with products before they are ever released to the public is a lot of fun too.
  • Mostly you don’t work on production networks so not as many maintenance windows and late nights as IT or AS.

Cons:

  • Relentless pace of work.  New software releases are constantly coming;  as soon as one trade show wraps up it’s time to prepare for the next one.  I often say TME work is as relentless as TAC.
  • Can be heavy on the travel.  That could, of course, be a good thing but it gets old.
  • Difficulty of influencing engineering without them reporting to you.  Often it’s a fight to get your ideas implemented when people don’t agree.
  • If you don’t like getting up in front of an audience, or writing documents, this job may not be for you.
  • For new products, often the TMEs are the only resources who are customer facing with a knowledge of the product, so you can end up working IT/AS-type hours anyways.  Less an issue with established/stable products.

Key Takeaway:
As I said, I love this job but it is a frenetic pace.  Most of the posts I manage to squeeze in on the blog are done in five minute intervals over a course of weeks.  But I have to say, I like being a TME more than anything else I’ve done.  Being on the product side is fascinating, especially if you have been on the consumer side.  Going to shows is a lot of fun.  If you like to teach and explain, and mess around with new things in your lab, this is for you.

It’s not a comprehensive list of the jobs you can do as a network engineer, but it covers some of the main ones.  I’m certainly interested in your thoughts on other jobs you’ve done, or if you’ve done one of the above, whether you agree with my assessment.  Please drop a comment–I don’t require registration but do require an email address just to keep spam down.

Everyone who’s worked in TAC can tell you their nightmare case–the type of case that, when they see it in the queue, makes them want to run away, take an unexpected lunch break, and hope some other engineer grabs it.  The nightmare case is the case you know you’ll get stuck on for hours, on a conference bridge, escalating to other engineers, trying to find a solution to an impossible problem.  For some it’s unexplained packet loss.  For others, it’s multicast.  For me, it was EIGRP Stuck-in-Active (SIA).

Some customer support engineers (CSEs) thought SIA cases were easy.  Not me.  A number of times I had a network in total meltdown due to SIA with no clue as to where the problem was.  Often the solution required a significant redesign of the network.

As a review, EIGRP is more-or-less a distance-vector routing protocol, which uses an algorithm called DUAL to achieve better performance than a traditional DV protocol like RIP.  I don’t want to get into all the fun CCIE questions on the protocol details, but what matters for this article is how querying works.  When an EIGRP router loses a route, it marks the route “Active” and then queries its neighbors as to where the route went.  If the neighbors don’t have it, they mark it Active and query their own neighbors.  If those neighbors don’t have the route either, they of course mark it Active and query their neighbors in turn.  And so forth.

It should be obvious from this process that in a large network, the queries can multiply quite quickly.  If a router has a lot of neighbors, and its neighbors have a lot of neighbors, the queries multiply exponentially and can get out of control.  Meanwhile, when a router sets a route Active, it starts a timer.  If it doesn’t get a reply before the timer expires, the router marks the route “Stuck In Active” and resets the entire EIGRP adjacency.  In a large network with a lot of neighbors, even if the route is present somewhere, the lag between sending a query and getting a response can be so long that the timer expires, and the adjacency is reset, before the response makes it back to the original querying router.
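As a rough illustration, if every router in a query domain has the same number of downstream neighbors, the query count grows geometrically with the depth of the domain.  This is a toy model, ignoring route summarization, stub routers, and the other query-bounding techniques a good design would use:

```python
# Toy model: each router at tier d queries `fanout` routers at tier d+1,
# so one lost route at the top of the domain triggers a geometric cascade.

def total_queries(fanout, depth):
    return sum(fanout ** d for d in range(1, depth + 1))

assert total_queries(2, 3) == 2 + 4 + 8
assert total_queries(10, 4) == 11110  # deep, wide domains explode quickly
```

Every one of those queries has to be answered before the Active timer at the top expires, which is why large, flat EIGRP designs are so prone to SIA.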

I’ve glossed over some of the details here, since obviously an EIGRP router can lose a route entirely without going SIA.  For details, see this article.  The main point to remember is that an SIA happens when the querying router just doesn’t get a response back.

Back in my TAC days, I of course wasn’t happy to see an SIA drop in the queue.  I waited to see if one of my colleagues would take the case and alleviate the burden, but the case turned blue after 20 minutes, meaning someone had to take it.  Darn.

Now I can show my age, because the customer had adjacencies resetting on Token Ring interfaces.  I asked the customer for a topology diagram, some debugs, and to check whether there was packet loss across the network.  Sometimes, if packets are getting dropped, the query responses don’t make it back to the original router, causing SIA.  The logs from the resets looked like this:

rtr1 - 172.16.109.118 - TokenRing1/0
Sep 1 16:58:06: %DUAL-3-SIA: Route 172.16.161.58/32 stuck-in-active state in IP-EIGRP(0) 55555. Cleaning up
Sep 1 16:58:06: %DUAL-5-NBRCHANGE: IP-EIGRP(0) 55555: Neighbor 172.16.109.124 (TokenRing1/0) is down: stuck in active
Sep 1 16:58:07: %DUAL-5-NBRCHANGE: IP-EIGRP(0) 55555: Neighbor 172.16.109.124 (TokenRing1/0) is up: new adjacency

This is typical of SIA.  The adjacency flapped, but the logs showed no particular reason why.

I thought back to my first troubleshooting experience as a network engineer.  I had brought up a new branch office but it couldn’t talk back to HQ.  Mike, my friend and mentor, showed up and started pinging hop-by-hop until he found a missing route.  “That’s how I learned it,” he said, “just go one hop at a time.”  The big clue I had in the SIA case was the missing route:  172.16.161.58/32.  I started tracing it back, hop-by-hop.

I found that the route originated from a router on the edge of the customer network, which had an ISDN PRI connected.  (Showing my age again!)  They had a number of smaller offices that would dial into the ISDN on-demand, and then drop off.  ISDN had per-minute charges and thus, in this pre-VPN era, it was common to set up ISDN in on-demand mode.  ISDN was a digital dial-up technology with very short call setup times.  I discovered that, as these calls were going up and down, the router was generating /32 peer routes for the neighbors and injecting them into EIGRP.  They had a poorly designed network with a huge query domain, and so as these dial peers were going up and down, routers on the opposite side of the network were going active on the route and not getting responses back.

They were advertising a /16 for the entire 172.16.x.x network, so sending a /32 per dial peer was totally unnecessary.  I recommended they enable “no peer neighbor-route” on the PRI to suppress the /32’s and the SIAs went away.
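For reference, the fix amounted to a single interface command.  A hedged sketch, with hypothetical interface numbering (the original configs are long gone):

```
! Hypothetical D-channel interface numbering for a T1 PRI.
interface Serial0/0:23
 no peer neighbor-route   ! suppress the per-caller host (/32) route
```

With the per-peer /32s suppressed, nothing was injected into (or withdrawn from) EIGRP as calls connected and dropped, and the existing /16 advertisement already covered reachability to the dial-in peers.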

I hate to bite the hand that feeds me, but even though I work at Cisco I can say I never really liked EIGRP.  EIGRP is fast, and if the network is designed well, it works fine.  However, networks often grow organically, and the larger the query domain, the more unstable EIGRP becomes.  I’ve never seen this sort of problem with OSPF or ISIS.  Fortunately, this case ended up being much less problematic than I expected, but often these cases were far nastier.  Oftentimes it was nearly impossible to find the route causing the problem, let alone figure out why it was going crazy.  Anyhow, it’s always good to relive a case with both Token Ring and ISDN for a double dose of nostalgia.

A common approach for TAC engineers and customers working on a tough case is to just “throw hardware at it.”  Sometimes this can be laziness:  why troubleshoot a complex problem when you can send an RMA, swap out a line card, and hope it works?  Other times it’s a legitimate step in a complex process of elimination.  RMA the card and if the problem still happens, well, you’ve eliminated the card as one source of the problem.

Hence, it was nothing unusual the day I got a P1 case from a major service provider, requeued (reassigned) to me after multiple RMAs.  The customer had a 12000-series GSR, top of the line back then, and was frustrated because ISIS wasn’t working.

“We just upgraded the GRP to a PRP to speed the router up,” he said, “but now it’s taking 4 hours for ISIS to converge.  Why did we pay all this money on a new route processor when it just slowed our box way down?!”

The GSR router is a chassis-type router, with multiple line cards with ports of different types, a fabric interconnecting them, and a management module (route processor, or RP) acting as the brains of the device.  The original RP was called a GRP, but Cisco had released an improved version called the PRP.

The GSR 12000-series

The customer seemed to think the new PRP had performance issues, but this didn’t make sense.  Performance issues might cause some small delays or possibly packet loss for packets destined to the RP, but not delays of four hours.  Something else was amiss.  I asked the customer to send me the ISIS database, and it was full of LSPs like this:

#sh isis database

IS-IS Level-2 Link State Database
LSPID                 LSP Seq Num  LSP Checksum  LSP Holdtime
0651.8412.7001.00-00  0x00000000   0x0000        193               0/0/0

ISIS routers periodically send CSNPs, or Complete Sequence Number PDUs, which contain a list of all the link state packets (LSPs) in the router database.  In this case, the GSR was directly attached to a Juniper router which was its sole ISIS adjacency.  It was receiving the entire ISIS database from this router.  Normally an ISIS database entry looks like this:

#sh isis database

IS-IS Level-2 Link State Database
LSPID                 LSP Seq Num  LSP Checksum  LSP Holdtime
bb1-sjc.00-00         0x0000041E   0xF97D        65365             0/0/0

Note that instead of a router ID, we actually have a router name.  Note also that we have a sequence number and a checksum for each LSP.  As the first output shows, something was wrong with the LSPs we were receiving.  Not only was the name not resolving, but the sequence number and checksum were zero.  How can we possibly have an LSP with no sequence number at all?

Even weirder, as I refreshed the ISIS outputs, the LSPs started resolving, suddenly popping up with names and non-zero sequence numbers and checksums.  I stayed on the phone with the customer for several hours until finally every LSP had resolved and the customer had full reachability.  “Don’t do anything to the router until I get back to you,” I said before hanging up.  If only he had listened.

I was about to pack up for the day when I got a call from our hotline.  The customer had called in and escalated to a P1 after reloading the router.  The entire link state database was zeroed out again, and the network was down.  He only had a short maintenance window in which to work, and now he had an outage.  It was 6pm.  I knew I wasn’t going home for a while.

Whatever was happening was well beyond my ISIS expertise.  Even in the routing protocols team, it was hard to find deep knowledge of ISIS.  I needed an expert, and Abe Martey, who sat across from me, literally wrote the book on ISIS.  The Cisco Press book, that is.  The only issue:  Abe had decided to take PTO that week.  Of course.  I pinged a protocols escalation engineer, one of our best BGP guys.  He didn’t want anything to do with it.  Finally I reached out to the duty manager and asked for help.  I also emailed our internal mailers for ISIS, but after 6pm I wasn’t too optimistic.

Why were we seeing what appeared to be invalid LSPs?  How could an LSP even have a zero checksum or sequence number?  Why did they seem to clear out, and why so slowly?  Did the upgrade to the PRP have anything to do with it?  Was it hardware?  A bug?  As a TAC engineer, you have to consider every single possibility, from A to Z.

The duty manager finally got Sanjeev, an “ISIS expert” from Australia, on the call.  The customer may not realize this while a case is being handled, but if it’s complex and high priority, there is often a flurry of instant messaging going on behind the scenes.  We had a chat room up, and as the “expert” listened to the description of the problem and looked at the notes, he typed in the window:  “This is way over my head.”  Great, so much for expertise.  Our conversation with the customer was getting heated as his frustration with the lack of progress grew.  The so-called expert asked him to run a command, which another TAC engineer had suggested.

“Fantastic,” said the customer, “Sanjeev wants us to run a command.  Sanjeev, tell us, why do you want to run this command?  What’s it going to do?”

“Uh, I’m not sure,” said Sanjeev, “I’ll have to get back to you on that.”

Not a good answer.

By 8:30 PM we also had a senior routing protocols engineer in the chat window.  He seemed to think it was a hardware issue and was scraping the error counters on the line cards. The dedicated Advanced Services NCE for the account also signed on and was looking at the errors. It’s a painful feeling knowing you and the customer are stranded, but we honestly had no idea what to do.  Because the other end of the problem was a Juniper router, JTAC came on board as well.  We may have been competitors, but we were professionals and put it aside to best help the customer.

Looking at the chat transcript, which I saved, is painful.  One person suggests physically cleaning the fiber connection.  Another thinks it’s memory corruption.  Another believes it is packet corruption.  We schedule a circuit test with the customer to look for transmission errors.

All the while, the 0x0000 LSPs were re-populating with legitimate information, until, by 9pm, the ISIS database had fully converged and routing was working again.  “This time,” I said, “DO NOT touch the router.”  The customer agreed.  I headed home at 9:12pm, secretly hoping they would reload the router so the case would get requeued to night shift and taken off my hands.

In the morning we got on our scheduled update call with the customer.  I was tired, and not happy to make the call.  We had gotten nowhere in the night, and had not gotten helpful responses to our emails.  I wasn’t sure what I was going to say.  I was surprised to hear the customer in a chipper mood.  “I’m happy to report Juniper has reproduced the problem in their lab and has identified the problem.”

There was a little bit of wounded pride knowing they found the fix before we did, but also a sense of relief to know I could close the case.

It turns out that the customer, around the same time they installed the PRP, had attempted to normalize the configs between the Juniper and Cisco devices.  They had mistakenly configured a timer called the “LSP pacing interval” on the Juniper side.  This controls the rate at which the Juniper box sends out LSPs.  They had thought they were configuring the same timer as the LSP refresh interval on the Cisco side, but they were two different things.  By cranking it way up, they ensured that the hundreds of LSPs in the database would trickle in, taking hours to converge.

Why the 0x0000 entries then?  It turns out that in the initial exchange, the ISIS routers share with each other what LSPs they have, without sending the full LSP.  Thus, in Cisco ISIS databases, the 0x0000 entry acts as a placeholder until complete LSP data is received.  Normally this period is short and you don’t see the entry.  We probably would have found the person who knew that eventually, but we didn’t find him that night and our database of cases, newsgroup postings, and bugs turned up nothing to point us in the right direction.
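For the curious, the two timers that got conflated look nothing alike once you see them side by side.  A hedged sketch, with hypothetical interface names and values:

```
# Junos: lsp-interval paces how fast LSPs are transmitted out an
# interface, in milliseconds. Crank it up and flooding slows to a
# trickle -- hundreds of LSPs can take hours to arrive.
set protocols isis interface so-0/0/0.0 lsp-interval 5000

! IOS: lsp-refresh-interval is how often a router regenerates its
! own LSPs before they age out, in seconds -- a different timer
! entirely, and harmless to raise.
router isis
 lsp-refresh-interval 65000
```

Similar-sounding names, completely different behavior: one throttles flooding to neighbors, the other schedules self-origination.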

I touched a couple thousand cases in my time at TAC, but this case I remember even 10 years later because of the seeming complexity, the simplicity of the resolution, the weirdness of the symptoms, and the distractors like the PRP upgrade.  Often a major outage sends you in a lot of directions and down many rat holes.  I don’t think we could have done much differently, since the config error was totally invisible to us.  Anyway, if Juniper and Cisco can work together to solve a customer issue, maybe we should have hope for world peace.

When you work at TAC, you are required to be “on-shift” for 4 hours each day.  This doesn’t mean that you work four hours a day, just that you are actively taking cases only four hours per day.  The other four (or more) hours you work on your existing backlog, calling customers, chasing down engineering for bug fixes, doing recreates, and, if you’re lucky, doing some training on the side.  While you were on shift, you would still work on the other stuff, but you were responsible for monitoring your “queue” and taking cases as they came in.

On our queue we generally liked to have four customer support engineers (CSE’s) on shift at any time.  Occasionally we had more or fewer, but never fewer than two.  We didn’t like to run with two engineers for very long;  if a P1 comes in, a CSE can be tied up for hours, unable to deal with the other cases that come in, and the odds are not low that more than one P1 comes in.  With all on-shift CSE’s tied up, it was up to the duty manager to start paging off-shift engineers as cases came in, never a good thing.  If you were ever on hold for a long time with a P1, there is a good chance the call center agent was simply unable to find a CSE because they were all tied up.  Sometimes it was due to bad planning, sometimes lack of staff.  Sometimes you would start a shift with five CSE’s on the queue and they’d all get on P1’s in the first five minutes.  The queue was always unpredictable.

At TAC, when you were on-shift, you could never be far from your desk.  You were expected to stay put, and if you had to get up to use the bathroom or go to the lab, you notified your fellow on-shift engineers so they knew you wouldn’t be available.  Since I preferred the 10am-2pm shift, in 2 years I took lunch away from my desk maybe 5 times.  Most days I told the other guys I was stepping out, ran to the cafeteria, and ran back to my desk to eat while taking cases.

Thus, I was quite happy one day when I had a later shift and my colleague Delvin showed up at my desk and asked if I wanted to go to a nice Chinese lunch.  Eddy Lau, one of our new CSE’s who had recently emigrated from China, had found an excellent and authentic restaurant.  We hopped into Eddy’s car and drove over to the restaurant, where we proceeded to have a two-hour long Chinese feast.  “It’s so great to actually go to lunch,” I said to Delvin, “since I eat at my desk every day.”  Eddy was happy to help his new colleagues out.

As we were driving back, Delvin asked Eddy, “When are you on shift?”

“Right now,” said Eddy.

“You’re on shift now?!” Delvin asked incredulously.  “Dude, you can’t leave for two hours if you’re on shift.  Who are you on shift with?”

“Just me and Sarah,” Eddy said, not really comprehending the situation.

“You left Sarah on shift by herself?!” Delvin asked.  “What if a P1 comes in?  What if she gets swamped by cases?  You can’t leave someone alone on the queue!”

We hurried back to the office and pulled up WebMonitor, which showed not only active cases, but who had taken cases that shift and how many.  Sarah had taken a single case.  By some amazing stroke of luck, it had been a very quiet shift.

I walked by Eddy’s desk and he gave me a thumbs up and a big smile.  I figured he wouldn’t last long.  A couple months later, after blowing a case, Eddy got put on RMA duty and subsequently quit.

If you ever wonder why you had to wait so long on the phone, it could be a busy day.  Or it could be your CSE’s decided to take a long lunch without telling anyone.

When I was still a new engineer, a fellow customer support engineer (CSE) asked a favor of me. I’ll call him Andy.

“I’m going on PTO, could you cover a case for me? I’ve filed a bug, and while I’m gone there will be a conference call. Just jump on it and tell them that the bug has been filed and engineering is working on it.” The case was with one of our largest service provider clients. I won’t say which, but they were a household name.

When you’re new and want to make a good impression, you jump on chances like this. It was a simple request and would prove I’m a team player. Of course I accepted the case and went about my business with the conference call on my calendar for the next week.

Before I got on the call I took a brief look at the case notes and the DDTS (what Cisco calls a bug). Everything seemed to be in order. The bug was filed and in engineering’s hands. Nothing to do but hop on the call and report that the bug was filed and we were working on it.

I dialed the bridge and after I gave my name the automated conference bridge said “there are 20 other parties in the conference.” Uh oh. Why did they need so many?

After I joined, someone asked for introductions. As they went around the call, there were a few engineers, several VP’s, and multiple senior directors. Double uh oh.

“Jeff is calling from Cisco,” the leader of the call said. “He is here to report on the P1 outage we had last week affecting multiple customers. I’m happy to tell you that Cisco has been working diligently on the problem and is here to report their findings and their solution. Cisco, take it away.”

I felt my heart in my throat. I cleared my voice, and sheepishly said: “Uh, we’ve, uh, filed a bug for your problem and, uh, engineering is looking into it.”

It was dead silence, followed by a VP chiming in: “That’s it?”

I was then chewed out thoroughly for not doing enough and wasting everyone’s time.

When Andy got back he grabbed the case back from me. “How’d the call go?” he asked.

I told him how it went horribly, how they were expecting more than I delivered, and how I took a beating for him.

Andy just smiled. Welcome to TAC.

My job as a customer support engineer (CSE) at TAC was the most quantified I’ve ever had.  Every aspect of our job performance was tracked and measured.  We live in the era of big data, and while numbers can be helpful, they can also mislead.  In TAC, there were many examples of that.

Take, for example, our customer satisfaction rating, known as a “bingo” score.  Every time a customer filled out a survey at the end of a TAC case, the engineer was notified and the bingo score recorded and averaged with all his previous scores.  While this would seem to be an effective measure of an engineer’s performance, it often wasn’t.

In TAC, we often ended up taking cases that were “requeues.”  These were cases that were previously worked by another engineer.  Imagine you got a requeue of a case that another CSE had handled terribly.  You close the case quickly, but the customer is still angry at the first CSE, so he gives a low bingo score.  That score was credited against the CSE who closed the case, so even though you took care of it, you got stuck with the low numbers.

This also happened with create-to-close numbers.  We were measured on how quickly we closed cases.  Imagine another CSE had been sitting on a case for six months doing nothing.  The customer requeues it, and you end up with the case, closing it immediately.  You end up with a six-month create-to-close number even though the delay wasn’t your fault.

Even worse, if you think about it, the create-to-close number discouraged engineers from taking hard cases.  Easy cases close quickly, but hard ones stay open while recreates are done and bugs are filed.  The engineers who took the hardest cases and were very skilled often had terrible create-to-close numbers.

The bottom line is that you need more than data to understand a person.  Most things in life don’t lend themselves to easy quantification.  Numbers always need to be in context.  Corporate managers are obsessed with quantification, and the Googles of the world are helping to drive our number-love even further.  Meanwhile, reducing people to numbers is a great way to treat them less humanely.