tac tales

All posts tagged tac tales

When I first started at Cisco TAC, I was assigned to a team that handled only enterprise customers.  One of the first things my boss said to me when I started there was “At Cisco, if you don’t like your boss or your cubicle, wait three months.”  Three months later, they broke the team up and I had a new boss and a new cubicle.  My new team handled routing protocols for both enterprise and service provider customers, and I had a steep learning curve having just barely settled down in the first job.

A P1 case came into my queue for a huge cable provider.  Often P1’s are easy, requiring just an RMA, but this one was a mess.  It was a coast-to-coast BGP meltdown for one of the largest service provider networks in the country.  Ugh.  I was on the queue at the wrong time and took the wrong case.

The cable company was seeing BGP adjacencies reset across their entire network.  The errors looked like this:

Jun 16 13:48:00.313 EST: %BGP-5-ADJCHANGE: neighbor 172.17.249.17 Down BGP
Notification sent

Jun 16 13:48:00.313 EST: %BGP-3-NOTIFICATION: sent to neighbor 172.17.249.17
3/1 (update malformed) 8 bytes 41A41FFF FFFFFFFF

The cause seemed to be malformed BGP packets, but why?  The GSR routers they had were kind enough to give us a hex dump of the BGP packet when an adjacency reset.  I got out my trusty Doyle book and began decoding the packets on paper, when a colleague was kind enough to point me to an internal Cisco tool that would decode a BGP packet from hex.

We could see that, for some reason, the NLRI portion of the BGP message was getting cut off.  According to my calculations, it should have been 44 bytes, but we were only seeing 32 bytes of information.  NLRI is Network Layer Reachability Information, just a fancy BGP way of saying the paths that go into the routing update.  We also noticed a clue in the router logs:  TCP-6-TOOBIG messages showing up from time to time.

Going over it with engineering, we realized something interesting.  The customer had enabled TCP selective acknowledgement on all their routers.  Also known as SACK, TCP selective acknowledgement is designed to circumvent an inefficiency in TCP.  If, say, 1 of 3 TCP segments gets dropped, the TCP protocol requires re-transmission of all 3 of the segments.  In other words, the receiver keeps ACKing the last segment it received, but it takes time for the sender to realize something is wrong.  When the sender finally realizes something is wrong, it goes back to the last known good segment and re-transmits everything after it.  SACK allows TCP to acknowledge and re-transmit specific segments.  If we are only missing segments 2, 3, and 5, then we can ask for just those to be re-transmitted.  SACK is stored as an option in the TCP header.

The problem is, there is a finite amount of space in the TCP header, and the SACK field can get rather long.  It just so happens that BGP also stores its MD5 authentication hash in the TCP header.  If SACK gets too long, it can crowd the MD5 header and cause BGP errors.  Based on our analysis, this was exactly what had happened.  Thus, the malformed packets.  We had the customer remove the SACK option from all routers and the problem stopped.

We were left with a couple questions.  Why did SACK get so long, and why would it be allowed to overwrite other important values in the TCP header?  In answer to the first question, there was a bug which was causing some linecards to send out malformed packets on occasion, thus causing SACKs.  In answer to the second question, there was a bug in the TCP header options packing that allowed one field (SACK) to crowd out another field (MD5 authentication).  I knew the case wouldn’t close for a long time.  Multiple bugs needed to be filed, and new code qualified and installed.  Fortunately the customer had a workaround (disable SACK) and an HTE.  An HTE was a TAC engineer dedicated to their account.  He grabbed the case from me for babysitting and I moved onto my next case.

In my TAC tales I often make fun of the occasional mistakes of TAC engineers.  However, TAC is a tough job, and the organization is staffed by some top engineers.  Many cases, like this one, required hard core engineering and knowledge that spans protocol details and ASIC-level hardware debugging.  It’s not a job for the faint of heart.  This case required digging into the TCP header, understanding how options are packed, and figuring out how to stop a major meltdown of a service provider network.  A high-stress situation, to be sure, but these cases often were the most rewarding.

 

My job as a customer support engineer (CSE) at TAC was the most quantified I’ve ever had.  Every aspect of our job performance was tracked and measured.  We live in the era of big data, and while numbers can be helpful, they can also mislead.  In TAC, there were many examples of that.

Take, for example, our customer satisfaction rating, known as a “bingo” score.  Every time a customer filled out a survey at the end of a TAC case, the engineer was notified and the bingo score recorded and averaged with all his previous scores.  While this would seem to be an effective measure of an engineer’s performance, it often wasn’t.

In TAC, we often ended up taking cases that were “requeues.”  These were cases that were previously worked by another engineer.  Imagine you got a requeue of a case that another CSE had handled terribly.  You close the case quickly, but the customer is still angry at the first CSE, so he gives a low bingo score.  That score was credited against the CSE who closed the case, so even though you took care of it, you got stuck with the low numbers.

This also happened with create-to-close numbers.  We were measured on how quickly we closed cases.  Imagine another CSE had been sitting on a case for six months doing nothing.  The customer requeues it, and you end up with the case, closing it immediately.  You end up with a six month create-to-close number even though it wasn’t your fault.

Even worse, if you think about it, the create-to-close number discouraged engineers from taking hard cases.  Easy cases close quickly, but hard ones stay open while recreates are done and bugs are filed.  The engineers who took the hardest cases and were very skilled often had terrible create-to-close numbers.

The bottom line is that you need more than data to understand a person.  Most things in life don’t lend themselves to easy quantification.  Numbers always need to be in context.  Corporate managers are obsessed with quantification, and the Google’s of the world are helping to drive our number-love even further.  Meanwhile, reducing people to numbers is a great way to treat them less humanly.

When I first started at TAC, I wasn’t allowed to take cases by myself.  If I grabbed a case, I had to get an experienced engineer to help me out.  One day I grabbed a case on a Catalyst 6k power supply, and asked Veena (not her real name) to help me on the case.

We got the customer on the phone.  He was an engineer at a New York financial institution, and sounded like he came from Brooklyn.  I lived in Williamsburg for a while with my mom back in the 1980’s before it was cool, and I know the accent.  He explained that he had a new 6k, and it wasn’t recognizing the power supply he had bought for it.  All of the modules had a “power denied” message on them.

I put the customer on speaker phone in my cube and Veena looked at the case notes.  As was often the case in TAC, we put the customer on mute while discussing the issues.  Veena thought it was a bad connection between the power supply and the switch.

“Here’s what I want you to do,” Veena said to the customer, un-muting the phone.  “I used to work in the BU, and a lot of times these power supplies don’t connect to the backplane.  You need to put it in hard.  Pull the power supply out and slam it in to the chassis.  I want to hear it crack!”

The customer seemed surprised.  “You want me to do what?!” he bristled.

“Slam it in!  Just slam it in as hard as you can!  We saw this in the BU all the time!”

“Hey lady,” he responded, “we paid a couple hundred grand for this box and I don’t want to break it.”

“It’s ok,” she said, “it’ll be fine.  I want to hear the crack!”

“Well, ok,” he said with resignation.  He put the phone down and we heard him shuffle off to the switch.  Meanwhile Veena looked at me and said “Pull up the release notes.”  I pulled up the notes, and we saw that the power supply wasn’t supported in his version of Catalyst OS.

Meanwhile in the background:  CRACK!!!

The customer came back on the line.  “Lady, I slammed that power supply into the chassis as hard as I could.  I think I broke something on it, and it still doesn’t work!”

“Yes,” Veena replied.  “We’ve discovered that your software doesn’t support the power supply and you will need to do an upgrade…”