radia perlman


There’s a lot of talk about networking simplicity these days.  There’s been a lot of talk about networking simplicity, in fact, for as long as I can remember.  The drive to simplify networking has certainly been the catalyst for many new products, most (but not all) of them unsuccessful.  Sometimes we forget that networking has some inherent complexities (it is a large distributed system with multiple operating systems, protocols, and media types), but much of the complexity can be attributed to humans and their choices.  IPv4 is a good example of this.

When I got into network engineering, I had assumed that network protocols were handed down from God and were immaculate in their perfection.  Reading Radia Perlman’s classic book Interconnections changed my understanding.  Aside from her ability to explain complex topics with utter clarity, Perlman also exposed the human side of protocol development.  Protocols are the result of committees, power politics, and the limitations of human personality.  Some protocols are obviously flawed.  Some flaws get fixed, but widely deployed protocols, like IPv4, are hard to fix.  Of course, v6 does remedy many of the problems of v4, but it’s still IP.

My vote for simplest protocol goes to AppleTalk.  When I was a young network guy, I mostly worked on Mac networks.  This was in the beige-box era before Jobs made Apple “cool” again.  The computers may have been lame, but Apple really had the best networking available in the 1990s.  I’ve written in the past about my love for LocalTalk and its eminently flexible alternative, PhoneNet.  But the AppleTalk protocol suite was phenomenal as well.

N.B.  My description of AppleTalk protocol mechanics is largely from memory.  Even the Wikipedia article is a bit sparse on details.  So please don’t shoot me if I misremember something.

In the first place, you didn’t need to do anything to set up an AppleTalk network.  You just connected the computers together and switched either the printer or modem port into a network port.  Auto-configuration was flawless.  Without any DHCP server, AppleTalk devices figured out what network they were on and acquired an address.  This was done by first probing for a router on the network, and then randomly grabbing an address.  The host then broadcast its address, and if another host was already using it, it would back off and try another one.  AppleTalk addresses consisted of a two-byte network address, equivalent to the “network” portion of an IP address, and a one-byte host address (equivalent to the “host” portion of an IP address).  If the host portion of the address was only one byte, weren’t you limited to 255 (or so) addresses?  No!  AppleTalk (Phase 2) allowed aggregation of contiguous networks into “cable ranges”.  So I could have a cable range of 60001-60011, multiple networks on the same media, and now I could have roughly 2,780 end stations (11 networks at about 253 usable host addresses each), at least in theory.
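To make the pick, probe, and back-off idea concrete, here is a rough Python sketch.  It is a toy simulation, not the real protocol: the function names are mine, the `in_use` set stands in for the hosts that would answer a probe, and I am assuming 253 usable host addresses per network number in Phase 2 (1 through 253).  Under that assumption, an 11-network range such as 60001-60011 works out to 2,783 addresses and a 10-network range to 2,530.

```python
import random

# Assumption: node IDs 1-253 are usable on each network number in Phase 2
# (0, 254 and 255 are treated as reserved).
USABLE_NODE_IDS = range(1, 254)


def cable_range_capacity(first_net, last_net):
    """Theoretical number of end stations in an inclusive cable range."""
    return (last_net - first_net + 1) * len(USABLE_NODE_IDS)


def acquire_address(first_net, last_net, in_use, rng=random):
    """Pick a random (network, node) pair, 'probe' it, retry on conflict.

    in_use is a set of addresses already claimed on this cable range; it is
    a stand-in for real hosts replying to the probe.
    """
    if len(in_use) >= cable_range_capacity(first_net, last_net):
        raise RuntimeError("cable range is full")
    while True:
        candidate = (rng.randint(first_net, last_net),
                     rng.choice(USABLE_NODE_IDS))
        if candidate not in in_use:
            # Nobody objected to the probe, so the address is ours.
            in_use.add(candidate)
            return candidate
        # Another host already owns it: back off and try a different one.


claimed = set()
my_address = acquire_address(60001, 60011, claimed)
print(my_address, "out of", cable_range_capacity(60001, 60011), "possible")
```

The random choice is what makes this work without a server: with a mostly empty range, the first guess almost always succeeds, and a collision only costs one more probe.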

Routers did need some minimal configuration, and support for dynamic routing protocols was a bit light.  Once the router was up and running, it would create “zones” that showed up on the end user’s computer in an application called “Chooser”.  Users might see “1st floor”, “2nd floor”, “3rd floor”, for example, or “finance”, “HR”, “accounting”.  However you chose to divide things.  If they clicked on a zone, they would see all of the AppleTalk file shares and printers in it.  You didn’t need to point end stations at their “default gateway”.  They simply discovered their router by broadcasting for it upon startup.

AppleTalk networks were a breeze to set up and simple to administer.  Were there downsides?  The biggest one was the chattiness of the protocols.  Auto-configuration was accomplished by using a lot of broadcast traffic, and in those days bandwidth was at a premium.  (I believe PhoneNet ran at the LocalTalk rate of roughly 230 Kbps.)  Still, I administered several large AppleTalk networks and was never able to quantify any performance hit from the broadcasts.  Like any network, it required at least some thinking to contain network (cable range) sizes.

AppleTalk was done away with as the Internet arose and IP became the dominant protocol.  For hosts on LocalTalk/PhoneNet networks, which did not support IP natively, we initially tunneled IP over AppleTalk.  Ethernet-connected Macs had a native IP stack.  The worst thing about AppleTalk was the flaky protocol stack (called Open Transport) in System 7.5, but this was a flaw in implementation, not protocol design.

I’ll end with my favorite Radia Perlman quote:  “We need more people in this industry who hate computers.”  If we did, more protocols might look like AppleTalk, and industry MBAs would need something else to talk about.

I’ve mentioned in previous TAC Tales that I started on a TAC team dedicated to enterprise, which made sense given my background.  Shortly after I came to Cisco the enterprise team was broken up and its staff distributed among the routing protocols team and LAN switch team.  The RP team at that time consisted of service provider experts with little understanding of LAN switching issues, but deep understanding of technologies like BGP and MPLS.  This was back before the Ethernet-everywhere era, and SP experts had never really spent a lot of time with LAN switches.

This created a big problem with case routing.  Anyone who has worked more than 5 minutes in TAC knows that when you have a routing protocol problem, usually it’s not the protocol itself but some underlying layer 2 issue.  This is particularly the case when adjacencies are resetting.  The call center would see “OSPF adjacencies resetting” and immediately send the case to the protocols team, when in fact the issue was with STP or perhaps a faulty link.  With all enterprise RP issues suddenly coming into the same queue as SP cases, our SP-centric staff were constantly getting into stuff they didn’t understand.

One such case came in to us, priority 1, from a service provider that ran “cell sites”, which are concrete bunkers with radio equipment for cellular transmissions.  “Now wait,” you’re saying, “I thought you just said enterprise RP cases were a problem, but this was a service provider!”  Well, it was a service provider but they ran LAN switches at the cell site, so naturally when OSPF started going haywire it came in to the RP team despite obviously being a switching problem!

A quick look at the logs confirmed this:

Jun 13 01:52:36 LSW38-0 3858130: Jun 13 01:52:32.347 CDT:
%C4K_EBM-4-HOSTFLAPPING: Host 00:AB:DA:EE:0A:FF in vlan 74 is flapping
between port Fa2/37 and port Po1

Here we could see a host MAC address moving between a front-panel port on the switch and a core-facing port channel.  Something’s not right there.  There were tons of messages like these in the logs.
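When the logs are full of these, it helps to tally them rather than scroll.  Here is a small Python sketch that counts flaps per host and port pair.  The regular expression is modeled only on the one message quoted above, so treat the pattern (and the `count_flaps` name) as my assumption; other platforms or software versions may word the message differently.

```python
import re
from collections import Counter

# Pattern based on the single sample message above (an assumption, not an
# official format).  \s+ also matches newlines, so a message wrapped across
# lines still matches.
FLAP_RE = re.compile(
    r"%C4K_EBM-4-HOSTFLAPPING:\s+Host\s+(?P<mac>[0-9A-Fa-f:]{17})\s+in\s+vlan\s+"
    r"(?P<vlan>\d+)\s+is\s+flapping\s+between\s+port\s+(?P<a>\S+)\s+and\s+"
    r"port\s+(?P<b>\S+)"
)


def count_flaps(log_text):
    """Return a Counter keyed by (mac, vlan, port_a, port_b)."""
    counts = Counter()
    for m in FLAP_RE.finditer(log_text):
        key = (m["mac"].upper(), int(m["vlan"]), m["a"], m["b"])
        counts[key] += 1
    return counts


sample = """Jun 13 01:52:36 LSW38-0 3858130: Jun 13 01:52:32.347 CDT:
%C4K_EBM-4-HOSTFLAPPING: Host 00:AB:DA:EE:0A:FF in vlan 74 is flapping
between port Fa2/37 and port Po1
"""

for (mac, vlan, port_a, port_b), n in count_flaps(sample).most_common():
    print(f"{mac} vlan {vlan}: {n} flap(s) between {port_a} and {port_b}")
```

A host that bounces between an access port and an uplink over and over is a strong hint that frames are looping back around, which is exactly what this case turned out to be.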

Digging a little further I determined that Spanning Tree was disabled.  Ugh.

Spanning Tree Protocol (STP) is not popular, and it’s definitely flawed.  With all due respect to the (truly) great Radia Perlman, the inventor of STP, choosing the lowest bridge identifier as the root is a bad idea when priorities are left at the default, because the tie is then broken by the switch’s MAC address.  It means that if customers deploy STP with default values, the oldest switch in the network, which tends to have the lowest MAC address, often becomes root.  Bad idea, as I said.  However, STP also gets a bad reputation undeservedly.  I cannot tell you how many times there was a layer 2 loop in a customer network, where STP was disabled, and the customer referred to it as a “Spanning Tree loop”.  STP stops layer 2 loops, it does not create them.  And a layer 2 loop out of control is much worse than a 50 second spanning tree outage, which is what you got with the original protocol spec.  When there is no loop in the network, STP doesn’t do anything at all except send out BPDUs.
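The root election complaint is easy to see in a few lines of Python.  This is only a sketch of the comparison rule (lowest priority wins, lowest MAC breaks the tie), not a working STP implementation.  The switch names and MAC addresses are made up, and the timer values are the classic defaults as I remember them (max age 20 seconds, forward delay 15 seconds).

```python
from typing import NamedTuple


class Bridge(NamedTuple):
    name: str
    priority: int
    mac: str


DEFAULT_PRIORITY = 32768  # classic default bridge priority


def bridge_id(bridge):
    """Priority is compared first, then the MAC address as a number."""
    return (bridge.priority, int(bridge.mac.replace(":", ""), 16))


def elect_root(bridges):
    """The bridge with the lowest bridge ID becomes root."""
    return min(bridges, key=bridge_id)


# Hypothetical switches: a new core box and an old closet switch.
switches = [
    Bridge("new-core", DEFAULT_PRIORITY, "00:1B:54:AA:00:01"),
    Bridge("old-closet", DEFAULT_PRIORITY, "00:00:0C:12:34:56"),
]

# With defaults everywhere, the lower MAC wins: the old closet switch.
root_default = elect_root(switches)

# Lowering the priority on the core switch fixes the election.
tuned = [switches[0]._replace(priority=4096), switches[1]]
root_tuned = elect_root(tuned)

# Worst-case convergence with the original timers: max age plus listening
# and learning (each one forward delay).
MAX_AGE, FORWARD_DELAY = 20, 15
worst_case_seconds = MAX_AGE + 2 * FORWARD_DELAY

print(root_default.name, root_tuned.name, worst_case_seconds)
```

The takeaway is the same as in the prose: nothing in the election knows which switch is the best root, so if you care, set the priority yourself.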

As I suspected, the customer had disabled spanning tree due to concerns about the speed of failover.  They had also managed to patch a layer 2 loop into their network during a minor change, causing an unchecked loop to circulate frames out of control, bringing down their entire cell site.

I explained to them the value of STP, and why any outage caused by it would be better than the out-of-control loop they had.  I was told to mind my own business.  They didn’t want to enable spanning tree because it was slow.  Yes, I said, but only when there is a loop!  And in that case, a short outage is better than a meltdown.  Then I realized the customer and I were in a loop, which I could break by closing the case.

Newer technologies (such as SD-Access) obviate the need for STP, but if you’re doing classic Layer 2, please, use it.