
I’ve mentioned my first job as a network engineer several times on this blog.  I worked at the San Francisco Chronicle, the biggest newspaper in Northern California.   I was brought in to manage the network as a Cisco-certified engineer, having just passed a four-day CCNA bootcamp.  Right before the dot-bomb economic crash, network engineers were in short supply.

The Chronicle’s network had recently been completely re-engineered, and the vendor selected was Foundry Networks.  Foundry was an up-and-coming vendor famous for selling high-speed switches to internet service providers.  They weren’t known for selling into enterprises, but they had convinced the previous network manager to install their hardware in nearly all of the Chronicle’s wiring closets.

It didn’t go very well.  The network had become incredibly unstable.  No company wants an unstable network, but newspapers are a particularly high-pressure environment since they have tight deadlines in order to get the paper out every single day, without fail.  Management of the data network was taken away from the previous manager and assigned to the head of the telecom department.  The plan was to rip out the Foundry and replace it with Cisco.

Foundry, of course, had other ideas.  Their account manager, whom I’ll call Bill, was quite aggressive in trying to restore the good name of Foundry.  I’ll give him credit for his doomed mission.

We had several problems.  The first was that we had only a single core router.  The router had two management modules, but failover between them was not fast, and our reporters and advertising people used Tandem systems which were sensitive to even slight network outages.  Foundry was well known for their fast IP switches, but we used AppleTalk and IPX as well, and Foundry’s stacks for those protocols were not well implemented.  The BigIron 8000 was prone to crashing and taking out a lot of our users.  We had only one because the previous manager had been trying to save money.

The second problem was not Foundry’s fault entirely, although I do blame the SE in part.  Nobody ever set the spanning tree bridge priority on the core box.  By default, STP selects the bridge with the lowest bridge identifier as root.  The BID consists of a user-configurable priority followed by the MAC address, so if nobody configures a priority, every switch ties at the default value and the lowest MAC address wins.  Since MAC addresses and OUIs are generally assigned sequentially, that usually means the oldest switch in the network becomes the root bridge.
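To make the election rule concrete, here is a minimal sketch in Python.  It is not how any switch implements STP; it only models the comparison of bridge identifiers.  The switch names and MAC addresses are made up for illustration, and 32768 is the standard 802.1D default priority.

```python
from dataclasses import dataclass

DEFAULT_PRIORITY = 32768  # standard 802.1D default bridge priority


@dataclass(frozen=True)
class Bridge:
    name: str   # hypothetical label, for illustration only
    mac: str    # colon-separated hex; made-up addresses below
    priority: int = DEFAULT_PRIORITY

    def bridge_id(self) -> tuple:
        """Return (priority, MAC as an integer); the lower tuple is the better BID."""
        return (self.priority, int(self.mac.replace(":", ""), 16))


def elect_root(bridges):
    """Pick the bridge with the numerically lowest bridge identifier."""
    return min(bridges, key=lambda b: b.bridge_id())


core = Bridge("core", "00:e0:52:aa:bb:cc")
closet = Bridge("closet", "00:e0:52:aa:cc:01")
old_switch = Bridge("old-portable", "00:00:1d:12:34:56")

# Everything at default priority: the lowest MAC wins, here the old portable switch.
print(elect_root([core, closet, old_switch]).name)

# With a lower priority set on the core, the core wins regardless of MAC.
tuned_core = Bridge("core", "00:e0:52:aa:bb:cc", priority=4096)
print(elect_root([tuned_core, closet, old_switch]).name)
```

The priority is compared first and the MAC address only breaks ties, which is why one line of configuration on the core box would have kept any stray switch from taking over as root.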

It turned out our Windows guys had been hauling around an ancient Cabletron switch to multiplex switch ports when working on end users’ computers.  (This was before wireless.)  They would plug in, the Cabletron would dutifully assume STP root, and the entire network would reconverge for 50 seconds, spanning tree roots not being sticky.  I remember once paying a bill at a nearby restaurant before we were finished and running with the other engineers back to the office, hoping to catch an outage in progress after our pagers went off.  Foundry’s logs were not very good and we didn’t know why the network kept going down.  Eventually I figured it out, though I don’t remember how.

The third problem was that the Foundry FastIron switches we used in the wiring closets had bad optics.  The Molex optics Foundry had selected for its management modules were flaky, and so we had to replace every single one with modules using Finisar optics.  I remember Bill, our account manager, coming in for our middle-of-the-night maintenance window several weekends in a row, blades in tow, and helping us to swap out the cards.

All of these problems created a bad reputation for Foundry within the Chronicle.  I remember Bill walking out of the front door carrying a Foundry box with an RMA’d management module.  A non-technical employee, perhaps a reporter or advertising salesman, saw the box and shouted, “Hey, they’re getting rid of Foundry!”  People in the lobby started cheering.  Bill looked at me and said, “Soon they’ll be cheering when I come into the building with a Foundry box.”

It never happened.  We ripped out Foundry and replaced everything with Cisco Catalyst 4k and 6k switches.

The fact of the matter is, had we added a second BigIron in the core, fixed the root bridge problem, and replaced all the faulty modules, we probably would have had a solid network.  But there often comes a point when a vendor has destroyed their reputation with a customer.  It takes a multitude of factors to reach this point, but there is definitely a point of no return.  Once that line is crossed, the customer will often allow cordial meetings, listen with sympathy to the account team and execs, and then go their separate ways.

A few years later I was laid off from my job at a Gold Partner, and was interviewing with another Gold Partner.  The technical interviewer looked at my resume and said, “I see you worked at the San Francisco Chronicle.”

“Yes,” I said, “I was brought in to replace the Foundry network they had with Cisco.  The whole thing was a disaster, poorly designed and bad products.”

“I designed that network,” he replied, “when I worked for another partner.  I also installed it.”

I didn’t get the job.

I’ve mentioned in previous TAC Tales that I started on a TAC team dedicated to enterprise, which made sense given my background.  Shortly after I came to Cisco the enterprise team was broken up and its staff distributed among the routing protocols team and LAN switch team.  The RP team at that time consisted of service provider experts with little understanding of LAN switching issues, but deep understanding of technologies like BGP and MPLS.  This was back before the Ethernet-everywhere era, and SP experts had never really spent a lot of time with LAN switches.

This created a big problem with case routing.  Anyone who has worked more than 5 minutes in TAC knows that when you have a routing protocol problem, usually it’s not the protocol itself but some underlying layer 2 issue.  This is particularly the case when adjacencies are resetting.  The call center would see “OSPF adjacencies resetting” and immediately send the case to the protocols team, when in fact the issue was with STP or perhaps a faulty link.  With all enterprise RP issues suddenly coming into the same queue as SP cases, our SP-centric staff were constantly getting into stuff they didn’t understand.

One such case came in to us, priority 1, from a service provider that ran “cell sites”, which are concrete bunkers with radio equipment for cellular transmissions.  “Now wait,” you’re saying, “I thought you just said enterprise RP cases were a problem, but this was a service provider!”  Well, it was a service provider but they ran LAN switches at the cell site, so naturally when OSPF started going haywire it came in to the RP team despite obviously being a switching problem!

A quick look at the logs confirmed this:

Jun 13 01:52:36 LSW38-0 3858130: Jun 13 01:52:32.347 CDT:
%C4K_EBM-4-HOSTFLAPPING: Host 00:AB:DA:EE:0A:FF in vlan 74 is flapping
between port Fa2/37 and port Po1

Here we could see a host MAC address moving between a front-panel port on the switch and a core-facing port channel.  Something’s not right there.  There were tons of messages like these in the logs.
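With tons of these messages in a log, it helps to pull the fields out and count them.  Below is a rough sketch in Python, assuming the message format is exactly what appears in the excerpt above, including the line wrapping.  The regular expression and the helper name find_flaps are my own, not anything from a Cisco tool.

```python
import re
from collections import Counter

# Pattern written against the single message shown above; other software
# versions may format the message differently.
FLAP_RE = re.compile(
    r"%C4K_EBM-4-HOSTFLAPPING: Host (?P<mac>\S+) in vlan (?P<vlan>\d+) "
    r"is flapping between port (?P<port_a>\S+) and port (?P<port_b>\S+)"
)


def find_flaps(log_text):
    """Return one dict per host-flapping message found in log_text."""
    # Collapse line wraps and repeated spaces so wrapped messages still match.
    flat = " ".join(log_text.split())
    return [m.groupdict() for m in FLAP_RE.finditer(flat)]


sample = """Jun 13 01:52:36 LSW38-0 3858130: Jun 13 01:52:32.347 CDT:
%C4K_EBM-4-HOSTFLAPPING: Host 00:AB:DA:EE:0A:FF in vlan 74 is flapping
between port Fa2/37 and port Po1
"""

flaps = find_flaps(sample)
# Count flaps per (MAC, VLAN) to see which hosts are bouncing the most.
counts = Counter((f["mac"], f["vlan"]) for f in flaps)

for (mac, vlan), n in counts.most_common():
    print(f"{mac} vlan {vlan}: {n} flap message(s)")
```

Many different MACs all flapping between an access port and the uplink point to a loop rather than to one misbehaving host.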

Digging a little further I determined that Spanning Tree was disabled.  Ugh.

Spanning Tree Protocol (STP) is not popular, and it’s definitely flawed.  With all due respect to the (truly) great Radia Perlman, the inventor of STP, choosing the lowest bridge identifier as the root is a bad idea when priorities are left at the default, because the election then comes down to the lowest MAC address.  It means that if customers deploy STP with default values, the oldest switch in the network usually becomes root.  Bad idea, as I said.  However, STP also gets a bad reputation undeservedly.  I cannot tell you how many times there was a layer 2 loop in a customer network where STP was disabled, and the customer referred to it as a “Spanning Tree loop”.  STP stops layer 2 loops; it does not create them.  And a layer 2 loop out of control is much worse than a 50-second spanning tree outage, which is what you got with the original protocol spec.  When there is no loop in the network, STP doesn’t do anything at all except send out BPDUs.
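For anyone wondering where the 50 seconds comes from, here is the arithmetic as a small Python sketch.  It assumes the standard 802.1D default timers (max age of 20 seconds, forward delay of 15 seconds) and ignores vendor tuning and Rapid STP entirely.

```python
MAX_AGE = 20        # seconds before a stale BPDU is aged out (802.1D default)
FORWARD_DELAY = 15  # seconds spent in each of listening and learning (802.1D default)


def worst_case_outage(max_age=MAX_AGE, forward_delay=FORWARD_DELAY):
    """Age out the old information, then pass through listening and learning."""
    return max_age + 2 * forward_delay


print(worst_case_outage())  # 50
```

A port that detects the failure directly skips the max age wait, so it is back in about 30 seconds instead.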

As I suspected, the customer had disabled spanning tree due to concerns about the speed of failover.  They had also managed to patch a layer 2 loop into their network during a minor change, and with nothing to block it, frames circulated out of control and brought down their entire cell site.

I explained to them the value of STP, and why any outage caused by it would be better than the out-of-control loop they had.  I was told to mind my own business.  They didn’t want to enable spanning tree because it was slow.  Yes, I said, but only when there is a loop!  And in that case, a short outage is better than a meltdown.  Then I realized the customer and I were in a loop, which I could break by closing the case.

Newer technologies (such as SD-Access) obviate the need for STP, but if you’re doing classic Layer 2, please, use it.