
When I first started at Cisco TAC, I was assigned to a team that handled only enterprise customers.  One of the first things my boss said to me when I started there was “At Cisco, if you don’t like your boss or your cubicle, wait three months.”  Three months later, they broke the team up and I had a new boss and a new cubicle.  My new team handled routing protocols for both enterprise and service provider customers, and I had a steep learning curve, having just barely settled into the first job.

A P1 case came into my queue for a huge cable provider.  Often P1s are easy, requiring just an RMA, but this one was a mess.  It was a coast-to-coast BGP meltdown for one of the largest service provider networks in the country.  Ugh.  I was on the queue at the wrong time and took the wrong case.

The cable company was seeing BGP adjacencies reset across their entire network.  The errors looked like this:

Jun 16 13:48:00.313 EST: %BGP-5-ADJCHANGE: neighbor Down BGP Notification sent

Jun 16 13:48:00.313 EST: %BGP-3-NOTIFICATION: sent to neighbor 3/1 (update malformed) 8 bytes 41A41FFF FFFFFFFF

The cause seemed to be malformed BGP packets, but why?  The GSR routers they had were kind enough to give us a hex dump of the BGP packet when an adjacency reset.  I got out my trusty Doyle book and began decoding the packets on paper, when a colleague was kind enough to point me to an internal Cisco tool that would decode a BGP packet from hex.
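For the curious, the "3/1" in that log line is the NOTIFICATION error code and subcode.  What follows is not the internal Cisco tool, just a minimal Python sketch of the lookup using the names listed in RFC 4271.  The function name and its string-based input are my own choices for illustration.

```python
# Error codes and UPDATE subcodes as named in RFC 4271, section 4.5.
ERROR_CODES = {
    1: "Message Header Error",
    2: "OPEN Message Error",
    3: "UPDATE Message Error",
    4: "Hold Timer Expired",
    5: "Finite State Machine Error",
    6: "Cease",
}

UPDATE_SUBCODES = {
    1: "Malformed Attribute List",
    2: "Unrecognized Well-known Attribute",
    3: "Missing Well-known Attribute",
    4: "Attribute Flags Error",
    5: "Attribute Length Error",
    6: "Invalid ORIGIN Attribute",
    8: "Invalid NEXT_HOP Attribute",
    9: "Optional Attribute Error",
    10: "Invalid Network Field",
    11: "Malformed AS_PATH",
}


def describe_notification(code_subcode, data_hex=""):
    """Turn a 'code/subcode' pair and an optional hex dump into readable parts.

    Hypothetical helper: the input format simply mirrors the log line above.
    """
    code, subcode = (int(part) for part in code_subcode.split("/"))
    error = ERROR_CODES.get(code, f"Unknown error code {code}")
    if code == 3:
        detail = UPDATE_SUBCODES.get(subcode, f"Unknown subcode {subcode}")
    else:
        # Subcode tables for the other error codes are left out of this sketch.
        detail = f"subcode {subcode}"
    data = bytes.fromhex(data_hex.replace(" ", ""))
    return error, detail, data


error, detail, data = describe_notification("3/1", "41A41FFF FFFFFFFF")
print(error, "/", detail, "/", len(data), "data bytes")
```

That matches the "update malformed" hint the router printed, and the data field is the 8 bytes from the hex dump.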

We could see that, for some reason, the NLRI portion of the BGP message was getting cut off.  According to my calculations, it should have been 44 bytes, but we were only seeing 32 bytes of information.  NLRI is Network Layer Reachability Information, just a fancy BGP way of saying the prefixes being advertised in the routing update.  We also noticed a clue in the router logs:  TCP-6-TOOBIG messages showing up from time to time.
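To show what a cut-off NLRI looks like to a parser, here is a minimal Python sketch.  It assumes IPv4 NLRI as laid out in RFC 4271, where each entry is one length byte (in bits) followed by just enough bytes to hold the prefix.  The function name and the sample bytes are made up for illustration; they are not the bytes from this case.

```python
import ipaddress


def parse_nlri(field):
    """Parse an IPv4 NLRI field into a list of prefix strings.

    Raises ValueError if the field ends in the middle of an entry,
    which is what a truncated NLRI looks like to a decoder.
    """
    prefixes = []
    offset = 0
    while offset < len(field):
        bits = field[offset]
        if bits > 32:
            raise ValueError(f"invalid prefix length {bits} at offset {offset}")
        nbytes = (bits + 7) // 8  # bytes needed to hold `bits` bits
        chunk = field[offset + 1 : offset + 1 + nbytes]
        if len(chunk) < nbytes:
            raise ValueError(
                f"NLRI truncated at offset {offset}: "
                f"need {nbytes} prefix bytes, have {len(chunk)}"
            )
        # Pad the prefix out to 4 bytes so it can be read as an address.
        address = ipaddress.IPv4Address(chunk + bytes(4 - nbytes))
        network = ipaddress.ip_network(f"{address}/{bits}", strict=False)
        prefixes.append(str(network))
        offset += 1 + nbytes
    return prefixes


# Hypothetical sample: 10.0.0.0/8 followed by 192.168.1.0/24.
complete = bytes.fromhex("080a18c0a801")
print(parse_nlri(complete))

# The same field with its last byte missing fails to parse.
try:
    parse_nlri(complete[:-1])
except ValueError as exc:
    print("decode failed:", exc)
```

A receiver that hits this kind of error has little choice but to send a NOTIFICATION and drop the session, which lines up with the resets in the logs.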

Going over it with engineering, we realized something interesting.  The customer had enabled TCP selective acknowledgement on all their routers.  Also known as SACK, TCP selective acknowledgement is designed to circumvent an inefficiency in TCP.  Plain TCP acknowledgements are cumulative:  if, say, the first of 3 segments gets dropped, the receiver can only keep ACKing the last in-order segment it received, and it takes time for the sender to realize something is wrong.  When the sender finally does, it goes back to the last known good segment and can end up re-transmitting everything after it, including segments that arrived fine.  SACK allows the receiver to report exactly which segments did arrive, so the sender re-transmits only the gaps.  If we are only missing segments 2, 3, and 5, then just those need to be re-transmitted.  SACK blocks are carried as an option in the TCP header.
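Here is the "missing segments 2, 3, and 5" example as a minimal Python sketch.  It is deliberately simplified:  real TCP works in byte sequence numbers and SACK blocks carry left and right edges, whereas this uses whole segment numbers with inclusive ranges.  Both function names are my own.

```python
def sack_report(received):
    """Return (cumulative_ack, sack_blocks) for a set of received segments.

    cumulative_ack is the last segment received in order starting from 1.
    sack_blocks lists the contiguous runs received beyond that point,
    as inclusive (first, last) pairs.
    """
    have = set(received)
    cumulative_ack = 0
    while cumulative_ack + 1 in have:
        cumulative_ack += 1

    blocks = []
    for segment in sorted(have):
        if segment <= cumulative_ack:
            continue
        if blocks and segment == blocks[-1][1] + 1:
            blocks[-1][1] = segment  # extend the current run
        else:
            blocks.append([segment, segment])  # start a new run
    return cumulative_ack, [tuple(block) for block in blocks]


def missing_segments(cumulative_ack, sack_blocks):
    """What the sender can infer it needs to re-transmit."""
    missing = []
    expected = cumulative_ack + 1
    for first, last in sack_blocks:
        missing.extend(range(expected, first))
        expected = last + 1
    return missing


# Segments 1, 4, and 6 arrived; 2, 3, and 5 were lost.
ack, blocks = sack_report([1, 4, 6])
print(ack, blocks, missing_segments(ack, blocks))
```

Notice that every separate gap adds another block to the report, so a lossy path makes the SACK option grow.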

The problem is, there is a finite amount of space in the TCP header (at most 40 bytes for all options combined), and the SACK option can get rather long.  It just so happens that BGP also stores its MD5 authentication hash in a TCP option.  If SACK gets too long, it can crowd out the MD5 option and cause BGP errors.  Based on our analysis, this was exactly what had happened.  Thus, the malformed packets.  We had the customer disable SACK on all routers and the problem stopped.
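To put numbers on the squeeze, here is a minimal Python sketch of the option-space arithmetic.  The sizes come from the RFCs:  a TCP header tops out at 60 bytes, 20 of which are fixed, the MD5 signature option (RFC 2385) is 18 bytes, and a SACK option (RFC 2018) is 2 bytes plus 8 per block.  This is just the arithmetic, not the router's actual options-packing code, and it ignores padding.  The helper names are mine.

```python
MAX_TCP_HEADER = 60    # 4-bit data offset field, counted in 32-bit words
BASE_TCP_HEADER = 20   # fixed part of the header
OPTION_SPACE = MAX_TCP_HEADER - BASE_TCP_HEADER  # 40 bytes for all options

MD5_OPTION_LEN = 18        # kind + length + 16-byte digest (RFC 2385)
TIMESTAMP_OPTION_LEN = 10  # kind + length + two 4-byte timestamps (RFC 7323)


def sack_option_len(blocks):
    """Size of a SACK option carrying `blocks` blocks (RFC 2018)."""
    return 2 + 8 * blocks


def max_sack_blocks(other_options_len):
    """How many SACK blocks fit next to the other options."""
    free = OPTION_SPACE - other_options_len
    return max(0, (free - 2) // 8)


def fits(option_lens):
    """True if the listed options fit in the header together."""
    return sum(option_lens) <= OPTION_SPACE


print(max_sack_blocks(0))               # SACK alone
print(max_sack_blocks(MD5_OPTION_LEN))  # SACK sharing space with MD5
print(fits([MD5_OPTION_LEN, sack_option_len(3)]))
```

With MD5 in the header there is room for only two SACK blocks.  A third pushes the total to 44 bytes, 4 more than the header can hold, so something has to give.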

We were left with a couple questions.  Why did SACK get so long, and why would it be allowed to overwrite other important values in the TCP header?  In answer to the first question, there was a bug which was causing some linecards to send out malformed packets on occasion, thus causing SACKs.  In answer to the second question, there was a bug in the TCP header options packing that allowed one option (SACK) to crowd out another (MD5 authentication).  I knew the case wouldn’t close for a long time.  Multiple bugs needed to be filed, and new code qualified and installed.  Fortunately, the customer had a workaround (disable SACK) and an HTE.  An HTE was a TAC engineer dedicated to their account.  He grabbed the case from me for babysitting and I moved on to my next case.

In my TAC tales I often make fun of the occasional mistakes of TAC engineers.  However, TAC is a tough job, and the organization is staffed by some top engineers.  Many cases, like this one, required hard-core engineering and knowledge that spanned protocol details and ASIC-level hardware debugging.  It’s not a job for the faint of heart.  This case required digging into the TCP header, understanding how options are packed, and figuring out how to stop a major meltdown of a service provider network.  A high-stress situation, to be sure, but these cases often were the most rewarding.


When you worked at TAC, you were required to be “on-shift” for four hours each day.  This didn’t mean that you worked four hours a day, just that you were actively taking cases only four hours per day.  The other four (or more) hours you worked on your existing backlog, calling customers, chasing down engineering for bug fixes, doing recreates, and, if you were lucky, doing some training on the side.  While you were on shift, you would still work on the other stuff, but you were responsible for monitoring your “queue” and taking cases as they came in.  On our queue we generally liked to have four customer support engineers (CSEs) on shift at any time.  Occasionally we had more or fewer, but never fewer than two.  We didn’t like to run with two engineers for very long;  if a P1 came in, a CSE could be tied up for hours, unable to deal with the other cases that came in, and the odds were not low that more than one P1 would come in.  With all on-shift CSEs tied up, it was up to the duty manager to start paging off-shift engineers as cases came in, never a good thing.  If ever you were on hold for a long time with a P1, there was a good chance the call center agent was simply unable to find a CSE because they were all tied up.  Sometimes it was due to bad planning, sometimes lack of staff.  Sometimes you would start a shift with five CSEs on the queue and they’d all get on P1s in the first five minutes.  The queue was always unpredictable.

At TAC, when you were on-shift, you could never be far from your desk.  You were expected to stay put, and if you had to get up to use the bathroom or go to the lab, you notified your fellow on-shift engineers so they knew you wouldn’t be available.  Since I preferred the 10am-2pm shift, in 2 years I took lunch away from my desk maybe 5 times.  Most days I told the other guys I was stepping out, ran to the cafeteria, and ran back to my desk to eat while taking cases.

Thus, I was quite happy one day when I had a later shift and my colleague Delvin showed up at my desk and asked if I wanted to go to a nice Chinese lunch.  Eddy Lau, one of our new CSEs who had recently emigrated from China, had found an excellent and authentic restaurant.  We hopped into Eddy’s car and drove over to the restaurant, where we proceeded to have a two-hour-long Chinese feast.  “It’s so great to actually go to lunch,” I said to Delvin, “since I eat at my desk every day.”  Eddy was happy to help his new colleagues out.

As we were driving back, Delvin asked Eddy, “When are you on shift?”

“Right now,” said Eddy.

“You’re on shift now?!” Delvin asked incredulously.  “Dude, you can’t leave for two hours if you’re on shift.  Who are you on shift with?”

“Just me and Sarah,” Eddy said, not really comprehending the situation.

“You left Sarah on shift by herself?!” Delvin asked.  “What if a P1 comes in?  What if she gets swamped by cases?  You can’t leave someone alone on the queue!”

We hurried back to the office and pulled up WebMonitor, which showed not only active cases, but who had taken cases that shift and how many.  Sarah had taken a single case.  By some amazing stroke of luck, it had been a very quiet shift.

I walked by Eddy’s desk and he gave me a thumbs up and a big smile.  I figured he wouldn’t last long.  A couple months later, after blowing a case, Eddy got put on RMA duty and subsequently quit.

If you ever wonder why you had to wait so long on the phone, it could have been a busy day.  Or it could be that your CSEs decided to take a long lunch without telling anyone.