When I first started at Cisco TAC, I was assigned to a team that handled only enterprise customers. One of the first things my boss said to me when I started there was “At Cisco, if you don’t like your boss or your cubicle, wait three months.” Three months later, they broke the team up and I had a new boss and a new cubicle. My new team handled routing protocols for both enterprise and service provider customers, and I had a steep learning curve, having only just settled into the first job.
A P1 case came into my queue for a huge cable provider. Often P1s are easy, requiring just an RMA, but this one was a mess. It was a coast-to-coast BGP meltdown for one of the largest service provider networks in the country. Ugh. I was on the queue at the wrong time and took the wrong case.
The cable company was seeing BGP adjacencies reset across their entire network. The errors looked like this:
Jun 16 13:48:00.313 EST: %BGP-5-ADJCHANGE: neighbor 172.17.249.17 Down BGP Notification sent
Jun 16 13:48:00.313 EST: %BGP-3-NOTIFICATION: sent to neighbor 172.17.249.17 3/1 (update malformed) 8 bytes 41A41FFF FFFFFFFF
The cause seemed to be malformed BGP packets, but why? The GSR routers they had were kind enough to give us a hex dump of the BGP packet when an adjacency reset. I got out my trusty Doyle book and began decoding the packets on paper, when a colleague was kind enough to point me to an internal Cisco tool that would decode a BGP packet from hex.
We could see that, for some reason, the NLRI portion of the BGP message was getting cut off. According to my calculations, it should have been 44 bytes, but we were only seeing 32 bytes of information. NLRI is Network Layer Reachability Information, just a fancy BGP way of saying the prefixes that go into the routing update. We also noticed a clue in the router logs: TCP-6-TOOBIG messages showing up from time to time.
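For those who want to see what that paper arithmetic looks like, here is a rough Python sketch of walking a BGP UPDATE from a hex dump, along the lines of what the internal tool automated. It is purely illustrative (the decoder and the sample hex are mine, not Cisco’s tool or the customer’s actual dump): the BGP header declares a total length, and after accounting for the withdrawn routes and path attributes, whatever remains should be NLRI. When the packet is cut off, the bytes actually captured come up short of what the declared length promises.

```python
import struct

def decode_bgp_update(hex_dump: str) -> None:
    """Walk a BGP UPDATE from a hex dump and report how much NLRI is present."""
    data = bytes.fromhex(hex_dump)

    # BGP header: 16-byte marker, 2-byte declared length, 1-byte type (2 = UPDATE)
    declared_len = struct.unpack("!H", data[16:18])[0]
    msg_type = data[18]
    print(f"declared length={declared_len}, type={msg_type}, bytes captured={len(data)}")

    # UPDATE body: withdrawn routes (2-byte length + data), then path
    # attributes (2-byte length + data); everything after that is NLRI.
    offset = 19
    withdrawn_len = struct.unpack("!H", data[offset:offset + 2])[0]
    offset += 2 + withdrawn_len
    attr_len = struct.unpack("!H", data[offset:offset + 2])[0]
    offset += 2 + attr_len

    expected_nlri = declared_len - offset
    actual_nlri = len(data) - offset
    print(f"NLRI expected={expected_nlri} bytes, present={actual_nlri} bytes")

# A made-up, simplified UPDATE whose declared length (27) implies 4 bytes of
# NLRI (one /24 prefix), but with the last two bytes missing from the capture.
decode_bgp_update("FF" * 16 + "001B" + "02" + "0000" + "0000" + "180A")
# declared length=27, type=2, bytes captured=25
# NLRI expected=4 bytes, present=2 bytes
```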
Going over it with engineering, we realized something interesting. The customer had enabled TCP selective acknowledgement on all their routers. Also known as SACK, TCP selective acknowledgement is designed to work around an inefficiency in TCP. If, say, the first of three TCP segments gets dropped, plain TCP requires re-transmission of all three, even though two of them arrived intact. The receiver keeps ACKing the last in-order segment it received, and it takes time for the sender to realize something is wrong. When the sender finally does, it goes back to the last known good segment and re-transmits everything after it. SACK allows TCP to acknowledge and re-transmit specific segments. If we are only missing segments 2, 3, and 5, then we can ask for just those to be re-transmitted. SACK is stored as an option in the TCP header.
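To make the option concrete, here is a minimal sketch of how a SACK option is laid out in the TCP header per RFC 2018: a kind byte of 5, a length byte, and a pair of 32-bit sequence numbers for each contiguous block of data the receiver is holding beyond the hole. The sequence numbers below are invented for illustration.

```python
import struct

def build_sack_option(blocks: list[tuple[int, int]]) -> bytes:
    """Encode a TCP SACK option (kind 5, RFC 2018). Each block is a
    (left_edge, right_edge) pair of sequence numbers the receiver already
    holds; each block costs 8 bytes plus the 2-byte kind/length overhead."""
    option = struct.pack("!BB", 5, 2 + 8 * len(blocks))
    for left_edge, right_edge in blocks:
        option += struct.pack("!II", left_edge, right_edge)
    return option

# Invented sequence numbers: three islands of received data beyond the hole
# means three SACK blocks, which already costs 26 bytes of option space.
print(len(build_sack_option([(2000, 3000), (4000, 5000), (6000, 7000)])))  # 26
```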
The problem is, there is a finite amount of space for options in the TCP header, and the SACK option can get rather long. It just so happens that BGP also stores its MD5 authentication hash in the TCP header, as another option. If SACK gets too long, it can crowd out the MD5 option and cause BGP errors. Based on our analysis, this was exactly what had happened. Thus, the malformed packets. We had the customer remove the SACK option from all routers and the problem stopped.
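Some back-of-the-envelope arithmetic shows how tight the fit is. The TCP header tops out at 60 bytes, 20 of which are fixed, leaving 40 bytes for options. The MD5 signature option BGP uses (RFC 2385) takes 18 of them, timestamps take another 10, and each SACK block costs 8 bytes on top of the 2-byte SACK kind and length. The sketch below just does that subtraction; the option sizes come from the RFCs, but the packing logic is my own illustration, not the IOS TCP stack.

```python
TCP_OPTION_SPACE = 40      # 60-byte maximum TCP header minus 20-byte fixed part
MD5_SIG_OPTION   = 18      # RFC 2385: kind (19), length, 16-byte MD5 digest
TIMESTAMP_OPTION = 10      # RFC 7323: kind, length, two 4-byte timestamps
SACK_OVERHEAD    = 2       # kind (5) plus length byte
SACK_BLOCK       = 8       # two 32-bit sequence numbers per block

def sack_blocks_that_fit(md5: bool, timestamps: bool) -> int:
    """How many SACK blocks fit alongside the other options on an
    established connection? (Padding to 32-bit boundaries ignored.)"""
    space = TCP_OPTION_SPACE
    if md5:
        space -= MD5_SIG_OPTION
    if timestamps:
        space -= TIMESTAMP_OPTION
    return max(0, (space - SACK_OVERHEAD) // SACK_BLOCK)

print(sack_blocks_that_fit(md5=True,  timestamps=True))   # 1
print(sack_blocks_that_fit(md5=False, timestamps=True))   # 3
```

With MD5 in play there is barely room for a single SACK block, so any packing logic that does not cap the number of blocks accordingly is living dangerously.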
We were left with a couple of questions. Why did SACK get so long, and why was it allowed to overwrite other important values in the TCP header? In answer to the first question, there was a bug causing some linecards to send out malformed packets on occasion; the resulting drops and re-transmissions triggered the long SACKs. In answer to the second question, there was a bug in the TCP header options packing that allowed one option (SACK) to crowd out another (MD5 authentication). I knew the case wouldn’t close for a long time. Multiple bugs needed to be filed, and new code qualified and installed. Fortunately the customer had a workaround (disable SACK) and an HTE, a TAC engineer dedicated to their account. He grabbed the case from me for babysitting and I moved on to my next case.
In my TAC tales I often make fun of the occasional mistakes of TAC engineers. However, TAC is a tough job, and the organization is staffed by some top engineers. Many cases, like this one, required hard-core engineering and knowledge spanning everything from protocol details to ASIC-level hardware debugging. It’s not a job for the faint of heart. This case required digging into the TCP header, understanding how options are packed, and figuring out how to stop a major meltdown of a service provider network. A high-stress situation, to be sure, but these cases were often the most rewarding.