I’ve mentioned in previous TAC Tales that I started on a TAC team dedicated to enterprise, which made sense given my background. Shortly after I came to Cisco, the enterprise team was broken up and its staff distributed between the routing protocols (RP) team and the LAN switching team. The RP team at that time consisted of service provider (SP) experts with a deep understanding of technologies like BGP and MPLS but little understanding of LAN switching issues. This was before the Ethernet-everywhere era, and SP experts had never really spent much time with LAN switches.
This created a big problem with case routing. Anyone who has worked more than 5 minutes in TAC knows that when you have a routing protocol problem, usually it’s not the protocol itself but some underlying layer 2 issue. This is particularly the case when adjacencies are resetting. The call center would see “OSPF adjacencies resetting” and immediately send the case to the protocols team, when in fact the issue was with STP or perhaps a faulty link. With all enterprise RP issues suddenly coming into the same queue as SP cases, our SP-centric staff were constantly getting into stuff they didn’t understand.
One such case came in to us, priority 1, from a service provider that ran “cell sites”, which are concrete bunkers with radio equipment for cellular transmissions. “Now wait,” you’re saying, “I thought you just said enterprise RP cases were a problem, but this was a service provider!” Well, it was a service provider but they ran LAN switches at the cell site, so naturally when OSPF started going haywire it came in to the RP team despite obviously being a switching problem!
A quick look at the logs confirmed this:
Jun 13 01:52:36 LSW38-0 3858130: Jun 13 01:52:32.347 CDT: %C4K_EBM-4-HOSTFLAPPING: Host 00:AB:DA:EE:0A:FF in vlan 74 is flapping between port Fa2/37 and port Po1
Here we could see a host MAC address moving between a front-panel port on the switch and a core-facing port channel. Something’s not right there. There were tons of messages like these in the logs.
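When there are tons of these messages, it helps to tally them per host rather than read them one by one. Here is a minimal sketch in Python, assuming every message follows the format of the single sample line above (other platforms and software versions may word it differently). The function names `parse_flap` and `count_flaps` are my own, purely for illustration.

```python
import re
from collections import Counter

# Pattern modeled on the one sample message shown above; this is an
# assumption about the format, not a complete grammar for the message.
FLAP_RE = re.compile(
    r"%C4K_EBM-4-HOSTFLAPPING: Host (?P<mac>[0-9A-Fa-f:]+) "
    r"in vlan (?P<vlan>\d+) is flapping between port (?P<port_a>\S+) "
    r"and port (?P<port_b>\S+)"
)


def parse_flap(line):
    """Return the host, VLAN and port pair from one log line, or None."""
    m = FLAP_RE.search(line)
    if m is None:
        return None
    return {
        "mac": m["mac"].lower(),
        "vlan": int(m["vlan"]),
        "ports": (m["port_a"], m["port_b"]),
    }


def count_flaps(lines):
    """Count flap messages per (vlan, mac) across an iterable of log lines."""
    counts = Counter()
    for line in lines:
        flap = parse_flap(line)
        if flap is not None:
            counts[(flap["vlan"], flap["mac"])] += 1
    return counts
```

Feeding a saved log file through `count_flaps` and looking at `most_common()` shows quickly whether one host is bouncing or, as in a real loop, nearly everything is.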
Digging a little further I determined that Spanning Tree was disabled. Ugh.
Spanning Tree Protocol (STP) is not popular, and it’s definitely flawed. With all due respect to the (truly) great Radia Perlman, the inventor of STP, choosing the lowest bridge identifier as the root is a bad idea when priorities are left at the default, because the tiebreaker then becomes the switch’s MAC address. It means that if customers deploy STP with default values, the switch with the lowest MAC address, often the oldest switch in the network, becomes root.
However, STP also gets a bad reputation undeservedly. I cannot tell you how many times there was a layer 2 loop in a customer network where STP was disabled, and the customer referred to it as a “Spanning Tree loop”. STP stops layer 2 loops; it does not create them. And a layer 2 loop out of control is much worse than a 50-second spanning tree outage, which is what you got with the original protocol spec. When there is no loop in the network, STP doesn’t do anything at all except send out BPDUs.
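To make the root election and the 50-second figure concrete, here is a rough sketch in Python. It is a simplification: it treats the bridge ID as a plain (priority, MAC) pair and ignores the per-VLAN extended system ID. The switch names and MAC addresses are made up for illustration; the priority and timer values are the IEEE 802.1D defaults.

```python
DEFAULT_PRIORITY = 32768  # 802.1D default bridge priority
MAX_AGE = 20              # seconds, 802.1D default
FORWARD_DELAY = 15        # seconds, 802.1D default (listening, then learning)


def bridge_id(priority, mac):
    """Bridge IDs compare on priority first, then on MAC address."""
    return (priority, int(mac.replace(":", ""), 16))


def elect_root(switches):
    """Return the name of the switch with the lowest bridge ID.

    `switches` maps a name to a (priority, mac) tuple.
    """
    return min(switches, key=lambda name: bridge_id(*switches[name]))


def worst_case_convergence():
    """Max age expiry plus one forward delay each for listening and learning."""
    return MAX_AGE + 2 * FORWARD_DELAY


# Hypothetical network: both switches left at the default priority.
switches = {
    "old-closet-switch": (DEFAULT_PRIORITY, "00:01:42:aa:bb:cc"),
    "new-core-switch": (DEFAULT_PRIORITY, "00:ab:da:ee:0a:00"),
}
root_with_defaults = elect_root(switches)

# Lowering the priority on the intended root takes MAC addresses
# out of the decision.
switches["new-core-switch"] = (4096, "00:ab:da:ee:0a:00")
root_with_priority = elect_root(switches)
```

With defaults, the lower MAC address wins and the closet switch becomes root. Once the core switch is given a lower priority, it wins regardless of its MAC address.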
As I suspected, the customer had disabled spanning tree due to concerns about the speed of failover. They had also managed to patch a layer 2 loop into their network during a minor change. With nothing to block it, frames circulated out of control and brought down their entire cell site.
I explained to them the value of STP, and why any outage caused by it would be better than the out-of-control loop they had. I was told to mind my own business. They didn’t want to enable spanning tree because it was slow. Yes, I said, but only when there is a loop! And in that case, a short outage is better than a meltdown. Then I realized the customer and I were in a loop, which I could break by closing the case.
Newer technologies (such as SD-Access) obviate the need for STP, but if you’re doing classic Layer 2, please, use it.