Tag Archives: cisco

I’ve mentioned my first job as a network engineer several times on this blog.  I worked at the San Francisco Chronicle, the biggest newspaper in Northern California.   I was brought in to manage the network as a Cisco-certified engineer, having just passed a four-day CCNA bootcamp.  Right before the dot-bomb economic crash, network engineers were in short supply.

The Chronicle’s network had recently been completely re-engineered, and the vendor selected was Foundry Networks.  Foundry was an up-and-coming vendor famous for selling high-speed switches to internet service providers.  They weren’t known for selling into enterprises, but they had convinced the previous network manager to install their hardware in nearly all of the Chronicle’s wiring closets.

It didn’t go very well.  The network had become incredibly unstable.  No company wants an unstable network, but newspapers are a particularly high-pressure environment since they have tight deadlines in order to get the paper out every single day, without fail.  Management of the data network was taken away from the previous manager and assigned to the head of the telecom department.  The plan was to rip out the Foundry and replace it with Cisco.

Foundry, of course, had other ideas.  Their account manager, whom I’ll call Bill, was quite aggressive in trying to restore the good name of Foundry.  I’ll give him credit for his doomed mission.

We had several problems.  The first was that we had only a single core router.  The router had two management modules, but failover between them was not fast, and our reporters and advertising people used Tandem systems which were sensitive to even slight network outages.  Foundry was well known for their fast IP switches, but we used AppleTalk and IPX as well, and Foundry’s stacks for those protocols were not well implemented.  The BigIron 8000 was prone to crashing and taking out a lot of our users.  We had only one because the previous manager had been trying to save money.

The second problem was not Foundry’s fault entirely, although I do blame the SE in part.  Nobody ever set the spanning tree bridge priority on the core box.  By default, STP selects the bridge with the lowest bridge identifier as root.  Since the BID is composed of a user-configured priority and the MAC address, if no priority is configured, every switch ties on the default priority and the lowest MAC address wins.  In practice that usually means the oldest switch in the network becomes the root bridge, since OUIs and MAC addresses are assigned more or less sequentially.
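For anyone who wants to see the election rule spelled out, here is a minimal Python sketch of it.  The switch names and MAC addresses are made up for illustration, and the model ignores everything about STP except the bridge ID comparison (priority first, then MAC address, lowest wins).

```python
from dataclasses import dataclass

DEFAULT_PRIORITY = 32768  # the usual 802.1D default bridge priority


@dataclass(frozen=True)
class Bridge:
    name: str   # hypothetical label, for illustration only
    mac: str    # made-up MAC address
    priority: int = DEFAULT_PRIORITY

    @property
    def bridge_id(self) -> tuple[int, int]:
        # Priority is compared first, then the MAC address as a number.
        return (self.priority, int(self.mac.replace(":", ""), 16))


def elect_root(bridges):
    """Return the bridge with the numerically lowest bridge ID."""
    return min(bridges, key=lambda b: b.bridge_id)


core = Bridge("core", "00:e0:52:10:20:30")
closet = Bridge("closet", "00:e0:52:44:55:66")
old_switch = Bridge("old-portable-switch", "00:00:1d:01:02:03")

# Everyone at the default priority: the lowest MAC wins.
print(elect_root([core, closet, old_switch]).name)        # old-portable-switch

# Lower the priority on the core and it wins regardless of MAC.
tuned_core = Bridge("core", "00:e0:52:10:20:30", priority=4096)
print(elect_root([tuned_core, closet, old_switch]).name)  # core
```

The second call is the one-line fix nobody had applied: once the core has a lower priority than the default, no stray switch can take over as root just by having an older MAC address.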

It turned out our Windows guys had been hauling around an ancient Cabletron switch to multiplex switch ports when working on end users’ computers.  (This was before wireless.)  They would plug in, the Cabletron would dutifully assume STP root, and the entire network would reconverge for 50 seconds, spanning tree roots not being sticky.  I remember once paying a bill at a nearby restaurant before we were finished and running with the other engineers back to the office, hoping to catch an outage in progress after our pagers went off.  Foundry’s logs were not very good and we didn’t know why the network kept going down.  Eventually I figured it out, though I don’t remember how.

The third problem was that the Foundry FastIron switches we used in the wiring closets had bad optics.  The Molex optics Foundry had selected for its management modules were flaky, and so we had to replace every single one with modules using Finisar optics.  I remember Bill, our account manager, coming in for our middle-of-the-night maintenance window several weekends in a row, blades in tow, and helping us to swap out the cards.

All of these problems created a bad reputation for Foundry within the Chronicle.  I remember Bill walking out of the front door carrying a Foundry box with an RMA’d management module.  A non-technical employee, perhaps a reporter or advertising salesman, saw the box and shouted, “Hey, they’re getting rid of Foundry!”  People in the lobby started cheering.  Bill looked at me and said, “Soon they’ll be cheering when I come into the building with a Foundry box.”

It never happened.  We ripped out Foundry and replaced everything with Cisco Catalyst 4k and 6k switches.

The fact of the matter is, had we added a second BigIron in the core, fixed the root bridge problem, and replaced all the faulty modules, we probably would have had a solid network.  But there often comes a point when a vendor has destroyed their reputation with a customer.  It takes a multitude of factors to reach this point, but there is definitely a point of no return.  Once that line is crossed, the customer will often allow cordial meetings, listen with sympathy to the account team and execs, and then go their separate way.

A few years later I was laid off from my job at a Gold Partner, and was interviewing with another Gold Partner.  The technical interviewer looked at my resume and said, “I see you worked at the San Francisco Chronicle.”

“Yes,” I said, “I was brought in to replace the Foundry network they had with Cisco.  The whole thing was a disaster, poorly designed and bad products.”

“I designed that network,” he replied, “when I worked for another partner.  I also installed it.”

I didn’t get the job.

I wrote this post on Feb 20, 2020, and I always thought it was an entertaining episode.  FBI Special Agent Elvis Chan, who features prominently in the post, has been in the news lately as he played a major role in the Twitter Files.  I will stay out of politics, except to note that Elvis was indeed a liaison to the business community, as seen here.

I was working at Juniper when the CIO asked me to apply for a government security clearance.  There were a number of hacking attempts on our network, and a security clearance would make me eligible for briefings from the government on the nature and scope of the threats against the United States’ networks.  Being one of the few US citizens in our department, and having a security background, it made sense.

I met with our “FSO”, the on-site liaison to the clearance-granting agency, in this case the Department of Defense.  I’ll call him Billy.  Billy pointed me to the government web site which housed the application, called “OPM”.  The OPM application was extensive, requiring me to input huge amounts of information about myself and my family.  It required a bit of work to track down some of the information, and when I printed the PDF copy of the application it totaled around eighty pages.

One day Billy called me into his office and told me I had been awarded a secret clearance.  He let me know that I could be subject to the death penalty if I divulged any classified information.  I signed some documents, and that was it. “Don’t I get a card for my wallet or anything?” I asked Billy.  He just smiled.

Shortly after getting my clearance, one of our other cleared employees brought me into a secure office in one of Juniper’s buildings where we could look at classified information.  He pulled a secured laptop out of a locked drawer, and a password out of a sealed envelope.  We began perusing classified information.  None of it was relevant to us, and none of it was particularly memorable.  For example, we read an article about several criminal gangs, the existence of which was unclassified.  The only classified information in the article happened to be the names of particular gangs.  They didn’t mean much to me, and I probably forgot them within a day or two.

One day I was invited to the San Francisco FBI office, to receive a classified briefing.  Billy had to fax the clearance over, because the DoD and FBI didn’t have an electronic way to exchange clearances.  I showed up, excited, to the federal building in San Francisco and proceeded up to the floor where the briefing was to take place.  Nobody was there.  I wandered around the white hallway with locked doors unable to make contact with anyone.  The elevator opened after a few minutes, and another equally confused attendee emerged.  We were wandering around for several minutes before someone showed up and told us to go to a different floor.

On the new floor a couple of young-looking FBI agents set up a table, checked our IDs, and then took our cell phones.  The security did not seem very rigorous.  They then admitted us to the SCIF, or Sensitive Compartmented Information Facility.  The room we were led into was just a conference room, with a low ceiling and no windows.  Another young-looking FBI agent approached me, wearing a tie but no coat.  “Hi, I’m Elvis,” he said.

“I’m a special agent and the coordinator of the briefing today.  We’re very excited to have you here.”

We had a brief conversation about my job and role, and then I asked to use the bathroom.

“Go out the back door of the SCIF and hang a right,” he said.

I did this, and found myself walking with a wall on my right, and a row of waist-level cubicles on my left.  Nobody was in the cubes, but paperwork was sitting on most of the desks.  I wanted to peer at the paperwork as I walked by.  I have a clearance, I figured, so I had a right to at least take a peek and see if the names of anyone I knew appeared.  Unfortunately, without pausing and staring, a chance I didn’t want to take, I couldn’t read anything.

I found the bathroom, and as I was participating in nature’s call, a couple of guys came in wearing ties but no sport coats.  They each had side-arms on their belts.  I wondered why these agents, who are basically office workers, needed to walk around armed.

As I came out of the bathroom, a female FBI agent was standing there, tapping her foot in anticipation of my emergence.  She looked like my school librarian.  Diminutive in stature, she had a side-arm that looked as big as she was.

“Are you FBI?” she asked pointedly.

“No,” I replied, thinking the answer was obvious.

She let out a long sigh, looking like a satisfied cop who has caught a perp.  “You can’t be here without an escort,” she scolded me.

“But Elvis told me I could!” was my retort.  I had a sudden realization that, in a large FBI office like San Francisco’s, it was entirely possible that not every FBI agent knew every other FBI agent, and that my host agent may have been entirely unknown to her.  Here I was, by myself, in the inner sanctum of an FBI office, explaining to an armed federal agent that I happened to be there because Elvis had sent me.

Fortunately, a glimmer of recognition flashed across her stern countenance.  “Oh, Elvis!” she said, exasperated.  “Come on,” she snapped, and led me back to the SCIF.

Back in the SCIF, the briefing began.  The first presenter was an FBI agent wearing a tie, with a coat this time.  Whatever he had learned at the FBI training center in Quantico, VA apparently did not include the fundamentals of haberdashery.  Anyone who buys a suit knows that you immediately have it tailored, as the pant legs are way too long.  Apparently this agent bought his cream-colored suit, with piping, and never sent it for alterations.  The trouser legs were so long he was actually walking on the bottom of his pant legs.  His presentation was no better than his tailoring.  Presenting on computer security, it was clear this was not somebody with even a basic knowledge of computing.

After him, two Homeland Security analysts presented.  They wore rumpled khakis with jacket and tie, and sported similar pyramid mustaches.  They presented on SCADA systems, a subject I couldn’t care less about.  Almost all of it was unclassified.

Shortly after my briefing, I learned that the OPM database had been hacked by the Chinese military.  All the personal information about myself and my family is in their hands now.  When I left Juniper, Cisco declined to renew my security clearance.

Some people hide that they have/had a clearance, as they can be targeted by foreign governments.  Personally, I don’t care.  What little classified information I saw, I can’t remember.  You could waterboard me and I wouldn’t be able to tell you a thing.

When I first started at Cisco (the second time), I remember being in a customer meeting where I had no idea what was going on.  As is typical for vendor meetings, Cisco employees outnumbered the customer by 3 to 1.  Someone from our side was presenting, though I don’t really remember about what.  I didn’t say anything because I was still pretty new, and frankly didn’t have much to say.  A Distinguished Engineer, who I know opposed my hiring, pulled me aside after the meeting and said to me with a smile:  “You know at Cisco we judge people on how much they speak in meetings.”  He was obviously implying that by keeping my mouth shut, I was proving my lack of value to the company.  I never really held it against that engineer, who wasn’t a bad guy.  But he reminded me of a problem in the corporate world, this belief that you have to always be talking.

I default to keeping my mouth shut when I don’t have anything important to say.  This probably is the result of being a child of divorce, but regardless I’ve always hated how much noise and talk are valued in our society.  Twitter is just a permanent gripe session, talk radio (whether the in-your-face conservative or sedate NPR variety) is just ceaseless hot air, and the more channels of communication we open up, the worse it becomes.  In the corporate world, decisions are often made in meetings based on the opinions of the most verbally aggressive in the room.  There is an underlying assumption to this approach to decision making:  that the loudest have the most valuable opinions, and the quietest the least.  But isn’t the opposite the case?  How many loudmouths have you known who spout nonsense, and how many super-intelligent quiet people do you know?  Some of the smartest people I’ve met are introverts.  And yet we seem to think if you’re willing to express an opinion loudly, then you’re worth following.

The problem with meeting culture in particular is that it doesn’t value thought.  I once had a VP complain to me that someone couldn’t “think on his feet”.  Most of the time, thinking on your feet doesn’t mean thinking at all.  It is simply reaction.  Thought takes time.  It requires reflection.  In corporate culture, we often prize how quickly you react, not how deeply you think.

This is not to say introverts should never try to overcome their shyness, nor that vocal people are always less intelligent.  However, I think as leaders in the corporate community, we can take steps to improve our meeting culture so that unheard but important voices have their chance to contribute.  This can be done a few ways:

  • Ensure equal air time for participants in meetings.  If someone is talking too much, limit his time.  Call on the quiet folks and introverts explicitly to get their opinions.
  • Don’t make major decisions in meetings unless there is legitimate time pressure to do so.  At the end of a meeting, allow people to go back and reflect on what was discussed, possibly run through it in a chat room, and reconvene later when people have had time to think about the subject at hand.
  • Stop evaluating people simply on how much they speak in meetings.  Realize, particularly if you are a vocal type, that people contribute in different ways.
  • Try to minimize interruptions in presentations and to save questions for the end.  I like to hear a presenter lay out a story in a logical fashion, and when presenters are constantly interrupted, it disturbs my ability to follow.  For those of us who are more contemplative thinkers, our ability to participate is hampered when presenters can never finish a thought.

Part of the problem is that many, if not most, who rise to leadership positions in the corporate world are the verbally aggressive and highly vocal type.  They often cannot understand how anyone could possibly approach things in a different way, and take quietness as a sign of weakness, indecision, or unintelligence.  For those leaders, recognizing the value of quieter individual contributors and leaders will help them and their organization.

Now that I’m done spouting off it’s time to log off for some silence of my own.

It seems to be rank heresy for someone working in the valley to say it, but let me say it anyways.  I don’t agree with the axiom of the technology industry which states that all technological progress is always good.  Many in our society instinctively realize this, which is why they oppose genetic engineering and plastics.  Still, the technology industry is so persistently in love with itself, and so optimistic about its potential to solve every human problem, that when anyone points out the consequences of technological progress, we quickly respond with AI’s potential to solve the problems it’s bound to create.  (Sorry for the long sentence, but I’m going to quote Plato in this essay, and by Platonic standards that last sentence is short.)  AI is the solution to everything.  AI will unlock the mysteries of human existence.  AI will allow human beings to live forever.  AI will cure cancer.  AI will solve the dangers of, well, genetic engineering and plastics.

An example of this is the extraordinarily concerning essay in the Wall Street Journal a few weeks ago by computer scientist and 60’s icon Jerry Kaplan.  Dr. Kaplan reviews the recent accomplishments of functional brain imaging technologies, which are starting to become more precise in identifying how people are feeling, and even which words they are thinking.  “With improved imaging technology, it may become possible to ‘eavesdrop’ on a person’s internal dialogue, to the extent that they are thinking in words,” says Kaplan.  With a predictable dose of technological optimism, Kaplan sees nothing concerning in the possibility of machines being able to read people’s minds.  Instead, he thinks it opens up a world of possibilities.  For example, in civil lawsuits it’s difficult to ascertain how much pain and suffering an individual has undergone, and hence to assign damages.  Why, we could use brain imaging and AI to calculate precisely how much somebody was harmed!

I may not hold a doctorate, but I spend a lot of time working with computers in the real world, not the world of researchers.  I’m skeptical that functional brain imaging will be able to read people’s minds, but the possibility is alarming.  In today’s era of instant social media “viral” lynching, we all have to be quite careful what we say.  Even with innocent intentions, a slip of the tongue can set off Twitter mobs that will destroy your life and career.  Now, even guarding your speech won’t help you.  You may walk by an AI mind-reading machine and have your life ruined for thought-crime.  And we’re celebrating this?  Even Dr. Kaplan’s scenario of determining pain and suffering in lawsuits is ludicrous.  How quickly will people learn to game the machine, produce artificial emotional trauma, and reap the rewards?

And now to Plato.  In his Phaedrus, Plato tells the story of an inventor named Theuth who came to the Egyptian king and was showing off some of his creations.  This was in the time before writing existed.  After showing the king a number of inventions, Theuth showed him letters and writing:

And when he was talking about writing, Theuth said:  “King, this learning will make the Egyptians wiser and give them better memories.  For, I have found a medicine of both memory and wisdom.”

The King said:  “Oh most artful Theuth!  While one person has the ability to create things skillfully, it takes another person to judge those things, and whether their use brings harm or help.  Now you, being the father of letters, through your love of them, have stated the opposite of their capability.  For, in the minds of those who learn this art, it will produce forgetfulness, by neglect of the memory, inasmuch as they have faith in writing, which consists of inscriptions outside of themselves, rather than remembering for themselves.”

Phaedrus 274E-275A.  (My own admittedly rough translation)

In other words, the inventor thinks writing will help memory, whereas the king points out it will hinder it!  I love this quote because it shows how the arrogance of inventors clouds their perception of their own inventions.  This is particularly true in Silicon Valley, where the pressure to always innovate removes any clear thinking about the consequences of the inventions. When confronted with the possibility of huge swaths of jobs being eliminated by their inventions, the lame response of the Silicon Valley innovators is to propose a universal basic income, hence making the movie Wall-E appear all too real.

This is a blog about network engineering, so how is this related?  Aren’t I involved in the automation of network systems?  Isn’t Cisco bringing AI to the world of networking?

Indeed we are, but I like to think that we’re a bit more realistic about it.  As a network engineer, and former TAC guy, I’ve spent countless hours doing nasty troubleshooting that, frankly, was hard and not particularly enjoyable.  Having executives looking over your shoulder, with the possibility of getting fired, with countless users freaking out, while trying to hunt down why the Internet just doesn’t work…  Well, if ML and AI will help me to locate the problem faster and restore operation to my network, I’m all for that.  If AI starts reading minds, I’m breaking out the tinfoil hat.

When I first started at Cisco TAC, I was assigned to a team that handled only enterprise customers.  One of the first things my boss said to me when I started there was “At Cisco, if you don’t like your boss or your cubicle, wait three months.”  Three months later, they broke the team up and I had a new boss and a new cubicle.  My new team handled routing protocols for both enterprise and service provider customers, and I had a steep learning curve having just barely settled down in the first job.

A P1 case came into my queue for a huge cable provider.  Often P1’s are easy, requiring just an RMA, but this one was a mess.  It was a coast-to-coast BGP meltdown for one of the largest service provider networks in the country.  Ugh.  I was on the queue at the wrong time and took the wrong case.

The cable company was seeing BGP adjacencies reset across their entire network.  The errors looked like this:

Jun 16 13:48:00.313 EST: %BGP-5-ADJCHANGE: neighbor 172.17.249.17 Down BGP Notification sent

Jun 16 13:48:00.313 EST: %BGP-3-NOTIFICATION: sent to neighbor 172.17.249.17 3/1 (update malformed) 8 bytes 41A41FFF FFFFFFFF

The cause seemed to be malformed BGP packets, but why?  The GSR routers they had were kind enough to give us a hex dump of the BGP packet when an adjacency reset.  I got out my trusty Doyle book and began decoding the packets on paper, when a colleague was kind enough to point me to an internal Cisco tool that would decode a BGP packet from hex.

We could see that, for some reason, the NLRI portion of the BGP message was getting cut off.  According to my calculations, it should have been 44 bytes, but we were only seeing 32 bytes of information.  NLRI is Network Layer Reachability Information, just a fancy BGP way of saying the prefixes that go into the routing update.  We also noticed a clue in the router logs:  TCP-6-TOOBIG messages showing up from time to time.
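To give a feel for what decoding NLRI by hand involves, here is a rough Python sketch.  It is not the internal Cisco tool, and the sample bytes are invented rather than taken from the case.  It relies only on the standard IPv4 NLRI encoding: one byte giving the prefix length in bits, followed by just enough bytes to hold that many bits.

```python
def parse_nlri(data: bytes) -> list[str]:
    """Decode IPv4 NLRI entries: 1 length byte (in bits) + the prefix bytes."""
    prefixes = []
    i = 0
    while i < len(data):
        bits = data[i]
        if bits > 32:
            raise ValueError(f"invalid prefix length {bits} at offset {i}")
        nbytes = (bits + 7) // 8              # bytes needed to hold 'bits' bits
        chunk = data[i + 1 : i + 1 + nbytes]
        if len(chunk) < nbytes:
            raise ValueError(
                f"NLRI truncated at offset {i}: need {nbytes} bytes, have {len(chunk)}"
            )
        padded = chunk + bytes(4 - nbytes)    # zero-fill to a full IPv4 address
        prefixes.append(".".join(str(b) for b in padded) + f"/{bits}")
        i += 1 + nbytes
    return prefixes


# Invented sample: 10.1.2.0/24, 172.16.0.0/16, 192.0.0.0/8
full = bytes.fromhex("180a0102" "10ac10" "08c0")
print(parse_nlri(full))  # ['10.1.2.0/24', '172.16.0.0/16', '192.0.0.0/8']

# Chop off the last byte, as if the message had been cut short.
try:
    parse_nlri(full[:-1])
except ValueError as err:
    print("malformed:", err)
```

A receiver that runs out of bytes in the middle of a prefix has no sane way to continue, which is why a truncated update ends in a notification and a reset adjacency rather than a partial route install.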

Going over it with engineering, we realized something interesting.  The customer had enabled TCP selective acknowledgement on all their routers.  Also known as SACK, TCP selective acknowledgement is designed to circumvent an inefficiency in TCP.  If, say, 1 of 3 TCP segments gets dropped, plain cumulative ACKs can force re-transmission of all 3 of the segments.  In other words, the receiver keeps ACKing the last in-order segment it received, but it takes time for the sender to realize something is wrong.  When the sender finally realizes something is wrong, it goes back to the last known good segment and re-transmits everything after it.  SACK allows the receiver to report exactly which segments arrived, so the sender can re-transmit only the gaps.  If we are only missing segments 2, 3, and 5, then just those need to be re-transmitted.  SACK is carried as an option in the TCP header.
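Here is a small Python sketch of the idea, using the same "missing 2, 3, and 5" example.  It is a simplification: real TCP numbers bytes rather than whole segments, and the function names are mine, not anything from a real stack.

```python
def sack_blocks(received: set[int], cumulative_ack: int) -> list[tuple[int, int]]:
    """Group segments received beyond the cumulative ACK into (first, last) blocks."""
    blocks: list[tuple[int, int]] = []
    for seg in sorted(s for s in received if s > cumulative_ack):
        if blocks and seg == blocks[-1][1] + 1:
            blocks[-1] = (blocks[-1][0], seg)   # extend the current block
        else:
            blocks.append((seg, seg))           # start a new block
    return blocks


def missing_segments(blocks: list[tuple[int, int]], cumulative_ack: int) -> list[int]:
    """What the sender can infer is missing: the gaps before and between blocks."""
    missing: list[int] = []
    expected = cumulative_ack + 1
    for first, last in blocks:
        missing.extend(range(expected, first))
        expected = last + 1
    return missing


received = {1, 4, 6, 7}            # segments 2, 3 and 5 were lost
blocks = sack_blocks(received, cumulative_ack=1)
print(blocks)                      # [(4, 4), (6, 7)]
print(missing_segments(blocks, 1)) # [2, 3, 5]
```

Note that every separate gap costs another block in the option, so the more scattered the loss, the longer the SACK option gets.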

The problem is, there is a finite amount of space in the TCP header (options are capped at 40 bytes), and the SACK field can get rather long.  It just so happens that BGP also carries its MD5 authentication hash as a TCP option.  If SACK gets too long, it can crowd out the MD5 option and cause BGP errors.  Based on our analysis, this was exactly what had happened.  Thus, the malformed packets.  We had the customer remove the SACK option from all routers and the problem stopped.
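The arithmetic behind the space problem is easy to sketch in Python.  The option sizes below are the standard ones from the RFCs; which options were actually in the packets from this case is not something I have, so treat the combinations as illustrative, and note that the sketch ignores padding between options.

```python
MAX_TCP_HEADER = 60    # largest header the 4-bit data offset field can describe
BASE_TCP_HEADER = 20   # fixed part of the TCP header
MAX_OPTIONS = MAX_TCP_HEADER - BASE_TCP_HEADER   # 40 bytes for all options

MD5_OPTION = 18        # RFC 2385: kind + length + 16-byte digest
TIMESTAMP_OPTION = 10  # RFC 7323: kind + length + two 4-byte timestamps


def sack_option_len(blocks: int) -> int:
    """RFC 2018: kind + length bytes, then two 32-bit edges per block."""
    return 2 + 8 * blocks


def max_sack_blocks(other_options: int) -> int:
    """How many SACK blocks fit alongside other options (padding ignored)."""
    remaining = MAX_OPTIONS - other_options
    return max(0, (remaining - 2) // 8)


print(max_sack_blocks(0))                              # 4
print(max_sack_blocks(MD5_OPTION))                     # 2
print(max_sack_blocks(MD5_OPTION + TIMESTAMP_OPTION))  # 1
print(sack_option_len(3) + MD5_OPTION)                 # 44, over the 40-byte cap
```

With MD5 in the header there is room for only two SACK blocks, so an implementation that packs a third without checking has to overrun something.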

We were left with a couple of questions.  Why did SACK get so long, and why would it be allowed to overwrite other important values in the TCP header?  In answer to the first question, there was a bug which was causing some linecards to send out malformed packets on occasion, thus causing SACKs.  In answer to the second question, there was a bug in the TCP header options packing that allowed one field (SACK) to crowd out another field (MD5 authentication).  I knew the case wouldn’t close for a long time.  Multiple bugs needed to be filed, and new code qualified and installed.  Fortunately the customer had a workaround (disable SACK) and an HTE.  An HTE was a TAC engineer dedicated to their account.  He grabbed the case from me for babysitting and I moved on to my next case.

In my TAC tales I often make fun of the occasional mistakes of TAC engineers.  However, TAC is a tough job, and the organization is staffed by some top engineers.  Many cases, like this one, required hard-core engineering and knowledge that spans protocol details and ASIC-level hardware debugging.  It’s not a job for the faint of heart.  This case required digging into the TCP header, understanding how options are packed, and figuring out how to stop a major meltdown of a service provider network.  A high-stress situation, to be sure, but these cases often were the most rewarding.

 

There is one really nice thing about having a blog whose readership consists mainly of car insurance spambots:  I don’t have to feel guilty when I don’t post anything for a while.  I had started a series on programmability, but I managed to get sidetracked by the inevitable runup to Cisco Live that consumes Cisco TME’s, and so that thread got a bit neglected.

Meanwhile, an old article by the great Ivan Pepelnjak got me out of post-CL recuperation and back onto the blog.  Ivan’s article talks about how vendor lock-in is inevitable.  Thank you, Ivan.  Allow me to go further, and write a paean in praise of vendor lock-in.  Now this might seem predictable given that I work at Cisco, and previously worked at Juniper.  Of course, lock-in is very good for the vendor who gets the lock.  However, I also spent many years in IT, and also worked at a partner, and I can say from experience that I prefer to manage single vendor networks.  At least, as single vendor as is possible in a modern network.  Two stories will help to illustrate this.

In my first full-fledged network engineer job, I managed the network for a large metropolitan newspaper (back when such a thing existed.)  The previous network team had installed a bunch of Foundry gear.  They also had a fair amount of Cisco.  It was all first generation, and the network was totally unstable.  Foundry actually had some decent hardware, but their early focus was IP.  We were running a typical 1990’s multi-protocol network, with AppleTalk, IPX, SNA, and a few other things thrown in.  The AppleTalk/IPX stack on the Foundry was particularly bad, and when it interacted with Cisco devices we had a real mess.

We ended up tossing the Foundry and going 100% Cisco.  We managed to stabilize the network, and now we were dealing with a single vendor instead of two.  This made support and maintenance contract management far easier.

Second story:  When I worked for the partner, I had to do a complete retrofit of the network for a small school district.  They had a ton of old HP, and were upgrading their PBX to a Cisco VoIP solution.  This was in the late 2000’s.  I did the data network, and my partner did the voice setup.  The customer didn’t have enough money to replace all their switches, so a couple of classrooms were left with HP.

Well, guess what.  In all the Cisco-based rooms, I plugged in the phones and they came up fine.  The computers hanging off the internal phone switch port also came up fine, on the correct data VLAN.  But in the classrooms with the HP switches, I spent hours trying to get the phones and switches working together.

There is a point here which is obvious, but needs restating.  If Cisco makes the switches, the routers, the firewalls, and the phones, the chances of them all working together are much higher than if several vendors are in the mix.  Even with Cisco’s internal BU structure, it is far easier to call a meeting with different departments within a company than to fix problems that occur between vendors.  Working on Software Defined-Access, I learned very quickly how well we can pull together a team from different product groups, since our product involves switching (my BU), ISE, wireless, and APIC-EM.

As I mentioned above, the other advantage is easier management of the non-technical side of things.  Managing support contracts, and simply having one throat to choke when things go wrong are big advantages of a single-vendor environment.

All this being said, from a programmability perspective we are committed to open standards.  We realize that many customers want a multi-vendor environment and tools like OpenConfig with which to manage it.  Despite Cisco’s reputation, we’re here to make our customers happy and not force them into anything.  From my point of view, however, if I ever go back to managing a network I hope it is a single-vendor network and not a Franken-network.

Meanwhile, if you’d like to hear my podcast with Ivan, click here.

Since I finished my series of articles on the CCIE, I thought I would kick off a new series on my current area of focus:  network programmability.  The past year at Cisco, programmability and automation have been my focus, first on Nexus and now on Catalyst switches.  I did do a two-part post on DCNM, a product which I am no longer covering, but it’s well worth a read if you are interested in learning the value of automation.

One thing I’ve noticed about this topic is that many of the people working on and explaining programmability have a background in software engineering.  I, on the other hand, approach the subject from the perspective of a network engineer.  I did do some programming when I was younger, in Pascal (showing my age here) and C.  I also did a tiny bit of C++ but not enough to really get comfortable with object-oriented programming.  Regardless, I left programming (now known as “coding”) behind for a long time, and the field has advanced in the meantime.  Because of this, when I explain these concepts I don’t bring the assumptions of a professional software engineer, but assume you, the reader, know nothing either.

Thus, it seems logical that in starting out this series, I need to explain what exactly programmability means in the context of network engineering, and what it means to do something programmatically.

Programmability simply means the capacity for a network device to be configured and managed by a computer program, as opposed to being configured and managed directly by humans.  This is a broad definition; technically, even using an interface like Expect (really just CLI) or SNMP qualifies as a type of programmability.  Thus, we can narrow this by saying that programmability in today’s world includes the design of interfaces that are optimized for machine-to-machine control.

To manage a network device programmatically really just means using a computer program to control that network device.  However, when we consider a computer program, it has certain characteristics over and above simply controlling a device.  Essential to programming is the design of control structures that make decisions based on certain pieces of information.

Thus, we could use NETCONF to simply push configuration to a router or switch, but this isn’t the best reason to use it.  It would be a far more effective use of NETCONF if we read some piece of data from the device (say, interface errors) and took an action based on that data (say, shutting the interface down when the counters got too high).  The other advantage of programmability is the ability to tie together multiple systems.  For example, we could read a device inventory out of APIC-EM, and then push config to devices based on the device type.  In other words, the decision-making capability of programmability is what matters most.
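As a minimal sketch of this read-then-decide pattern, the Python below keeps the decision logic separate from the transport.  The interface names, counter values, and the threshold of 1000 are made up for illustration; in practice the counters would be read from the device over NETCONF or a similar interface, and the resulting actions pushed back the same way.

```python
# Hypothetical sketch: decide which interfaces to shut down based on error counters.
# The counters dict stands in for data read from a device (for example via NETCONF).

ERROR_THRESHOLD = 1000  # assumed value for illustration, not a recommendation


def interfaces_to_shut(counters, threshold=ERROR_THRESHOLD):
    """Return the sorted names of interfaces whose error count exceeds the threshold."""
    return sorted(name for name, errors in counters.items() if errors > threshold)


def build_actions(counters, threshold=ERROR_THRESHOLD):
    """Pair each offending interface with the action a program would push to the device."""
    return [(name, "shutdown") for name in interfaces_to_shut(counters, threshold)]


if __name__ == "__main__":
    # Made-up sample data; only Ethernet1/2 is over the threshold.
    sample = {"Ethernet1/1": 12, "Ethernet1/2": 4821, "Ethernet1/3": 0}
    for name, action in build_actions(sample):
        print(f"{name}: {action}")
```

The point is the control structure: the program reads data, compares it against a rule, and only then acts, which is what distinguishes it from a simple config push.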

Network programmability encompasses a number of technologies:

  • Day 0 technologies to bring network devices up with an appropriate software version and configuration, with little to no human intervention.  Examples:  ZTP, PoAP, PnP.
  • Technologies to push and read configuration and operational data from devices.  Examples:  SNMP, NETCONF.
  • Automation systems such as Puppet, Chef, and Ansible, which are not strictly programming languages, but allow for configuration of numerous devices based on the role of the device.
  • The use of external programming languages, such as Ruby and Python, to interact with network devices.
  • The use of on-box programming technologies, such as on-box Python and EEM, to control network devices.

In this series of articles we will cover all of these topics as well as the mysteries of data models, YANG, YAML, JSON, XML, etc., all within the context of network engineering.  I know when I first encountered YANG and data models, I was quite confused and I hope I clear up some of this confusion.


Introduction

My role at Cisco is transitioning to enterprise so I won’t be working on Nexus switches much any more.  I figured it would be a good time to finish this article on DCNM.  In my previous article, I talked about DCNM’s overlay provisioning capabilities, and explained the basic structure DCNM uses to describe multi-tenant data centers.  In this article, we will look at the details of 802.1q-triggered auto-configuration, as well as VMTracker-based triggered auto-configuration.  Please be aware that the types of triggers and their behaviors depend on the platform you are using.  For example, you cannot do dot1q-based triggers on Nexus 9k, and on Nexus 5k, while you can use VMTracker, it will not prune unneeded VLANs.  If you have not read my previous article, please review it so the terminology is clear.

Have a look at the topology we will use:

autoconfig

The spine switches are not particularly relevant, since they are just passing traffic and not actively involved in the auto-configuration.  The Nexus 5K leaves are involved, of course, and attached to each is an ESXi server.  The one on the left has two VMs in two different VLANs, 501 and 502.  The 5k will learn about the active hosts via 802.1q triggering.  The rightmost host has only one VM, and in this case the switch will learn about the host via VMTracker.  In both cases the switches will provision the required configuration in response to the workloads, without manual intervention, pulling their configs from DCNM as described in part 1.

Underlay

Because we are focused on overlay provisioning, I won’t go through the underlay piece in detail.  However, when you set up the underlay, you need to configure some parameters that will be used by the overlay.  Since you are using DCNM, I’m assuming you’ll be using the Power-on Auto-Provision feature, which allows a switch to get its configuration on bootup without human intervention.

config-fabric

Recall that a fabric is the highest level construct we have in DCNM.  The fabric is a collection of switches running an encapsulation like VXLAN or FabricPath together.  Before we create any PoAP definitions, we need to set up a fabric.  During the definition of the fabric, we choose the type of provisioning we want.  Since we are doing auto-config, we choose this option as our Fabric Provision Mode.  The previous article describes the Top Down option.

Next, we need to build our PoAP definitions.  Each switch that is configured via PoAP needs a definition, which tells DCNM what software image and what configuration to push.  This is done from the Configure->PoAP->PoAP Definitions section of DCNM.  Because generating a lot of PoAP defs for a large fabric is tedious, DCNM10 also allows you to build a fabric plan, where you specify the overall parameters for your fabric and then DCNM generates the PoAP definitions automatically, incrementing variables such as management IP address for you.  We won’t cover fabric plans here, but if you go that route the auto-config piece is basically the same.

config-poap-defs
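
To make the idea of generating per-switch definitions from overall parameters concrete, here is a small Python sketch.  It is not DCNM’s API or data format; the function name, the dictionary keys, the image file name, and the documentation-range addresses are all hypothetical.  It only illustrates incrementing the management IP address for each switch.

```python
import ipaddress


def generate_poap_definitions(name_prefix, first_mgmt_ip, count, image):
    """Build one definition per switch, incrementing the management IP each time.

    Hypothetical structure for illustration; DCNM stores its definitions differently.
    """
    start = ipaddress.ip_address(first_mgmt_ip)
    return [
        {
            "switch": f"{name_prefix}-{index + 1}",
            "mgmt_ip": str(start + index),  # address objects support integer addition
            "image": image,
        }
        for index in range(count)
    ]


if __name__ == "__main__":
    # Made-up plan: three leaf switches starting at 192.0.2.11.
    for definition in generate_poap_definitions("leaf", "192.0.2.11", 3, "example-image.bin"):
        print(definition)
```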

Once we are in the PoAP definition for the individual switch, we can enable auto-configuration and select the type we want.

poap-def

In this case I have only enabled the 802.1q trigger.  If I want to enable VMTracker, I just check the box for it and enter my vCenter server IP address and credentials in the box below.  I won’t show the interface configuration, but please note that it is very important that you choose the correct access interfaces in the PoAP defs.  As we will see, DCNM will add some commands under the interfaces to make the auto-config work.

Once the switch has been powered on and has pulled down its configuration, you will see the relevant config under the interfaces:

n5672-1# sh run int e1/33
interface Ethernet1/33
switchport mode trunk
encapsulation dynamic dot1q
spanning-tree port type edge trunk

If the encapsulation command is not there, auto-config will not work.

Overlay Definition

Remember from the previous article that, after we define the Fabric, we need to define the Organization (Tenant), the Partition (VRF), and then the Network.  Defining the organization is quite easy: just navigate to the organizations screen, click the plus button, and give it a name.  You may only have one tenant in your data center, but if you have more than one you can define them here.  (I am using extremely creative and non-trademark-violating names here.)  Be sure to pick the correct Fabric name in the drop-down at the top of the screen;  often when you don’t see what you are expecting in DCNM, it is because you are not on the correct fabric.

config-organization

Next, we need to add the partition, which is DCNM’s name for a VRF.  Remember, we are talking about multitenancy here.  Not only do we have the option to create multiple tenants, but each tenant can have multiple VRFs.  Adding a VRF is just about as easy as adding an organization.  DCNM does have a number of profiles that can be used to build the VRFs, but for most VXLAN fabrics, the default EVPN profile is fine.  You only need to enter the VRF name.  The partition ID is already populated for you, and there is no need to change it.

partition

There is something important to note in the above screen shot.  The name given to the VRF is prepended with the name of the organization.  This is because the switches themselves have no concept of organization.  By prepending the org name to the VRF, you can easily reuse VRF names in different organizations without risk of conflict on the switch.

Finally, let’s provision the network.  This is where most of the configuration happens.  Under the same LAN Fabric Automation menu we saw above, navigate to Networks.  As before, we need to pick a profile, but the default is fine for most layer 3 cases.

network

Once we specify the organization and partition that we already created, we tell DCNM the gateway address.  This is the Anycast gateway address that will be configured on any switch that has a host in this VLAN.  Remember that in VXLAN/EVPN, each leaf switch acts as a default gateway for the VLANs it serves.  We also specify the VLAN ID, of course.

Once this is saved, the profile is in DCNM and ready to go.  Unlike with the underlay config, nothing is actually deployed on the switch at this point.  The config is just sitting in DCNM, waiting for a workload to become active that requires it.  If no workload requires the configuration we specified, it will never make it to a switch.  And, if switch-1 requires the config while switch-2 does not, well, switch-2 will never get it.  This is the power of auto-configuration.  It’s entirely likely that when you are configuring your data center switches by hand, you don’t configure VLANs on switches that don’t require them, but you have to figure that out yourself.  With auto-config, we just deploy as needed.
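
The deploy-on-demand behavior can be modeled in a few lines of Python.  This is a toy simulation under my own assumptions, not how DCNM or NX-OS is implemented; the class and method names are hypothetical.  It shows that a defined network lands only on the switch where a trigger fires.

```python
class AutoConfigFabric:
    """Toy model: profiles are held centrally and deployed only when a workload appears."""

    def __init__(self):
        self.profiles = {}  # VLAN ID -> network profile held centrally
        self.deployed = {}  # switch name -> set of VLAN IDs configured on that switch

    def define_network(self, vlan_id, vrf, gateway):
        """Store a profile; nothing is pushed to any switch at this point."""
        self.profiles[vlan_id] = {"vrf": vrf, "gateway": gateway}

    def workload_seen(self, switch, vlan_id):
        """Simulate a trigger firing. Return True only if config was newly deployed."""
        if vlan_id not in self.profiles:
            return False  # no profile defined for this VLAN
        vlans = self.deployed.setdefault(switch, set())
        if vlan_id in vlans:
            return False  # already configured on this switch
        vlans.add(vlan_id)
        return True


if __name__ == "__main__":
    fabric = AutoConfigFabric()
    fabric.define_network(501, "ABCCorp:VRF1", "192.168.1.1/24")
    print(fabric.deployed)                        # nothing deployed yet
    print(fabric.workload_seen("switch-1", 501))  # trigger on switch-1
    print(fabric.deployed)                        # only switch-1 has VLAN 501
```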

Let’s take a step back and review what we have done:

  1. We have told DCNM to enable 802.1q triggering for switches that are configured with auto-provisioning.
  2. We have created an organization and partition for our new network.
  3. We have told DCNM what configuration that network requires to support it.

Auto-Config in Action

Now that we’ve set DCNM up, let’s look at the switches.  First of all, I verify that there is no VRF or SVI configuration for this partition and network:


jemclaug-hh14-n5672-1# sh vrf all
VRF-Name VRF-ID State Reason
default 1 Up --
management 2 Up --

jemclaug-hh14-n5672-1# sh ip int brief vrf all | i 192.168
jemclaug-hh14-n5672-1#

We can see here that there is no VRF other than the default and management VRFs, and there are no SVIs with the 192.168.x.x prefix. Now I start a ping from my VM1, which you will recall is connected to this switch:

jeffmc@ABC-VM1:~$ ping 192.168.1.1
PING 192.168.1.1 (192.168.1.1) 56(84) bytes of data.
64 bytes from 192.168.1.1: icmp_seq=9 ttl=255 time=0.794 ms
64 bytes from 192.168.1.1: icmp_seq=10 ttl=255 time=0.741 ms
64 bytes from 192.168.1.1: icmp_seq=11 ttl=255 time=0.683 ms

Notice from the output that the first ping I get back is sequence #9. Back on the switch:

jemclaug-hh14-n5672-1# sh vrf all
VRF-Name VRF-ID State Reason
ABCCorp:VRF1 4 Up --
default 1 Up --
management 2 Up --
jemclaug-hh14-n5672-1# sh ip int brief vrf all | i 192.168
Vlan501 192.168.1.1 protocol-up/link-up/admin-up
jemclaug-hh14-n5672-1#

Now we have a VRF and an SVI! As I stated before, the switch itself has no concept of organization, which is really just a tag DCNM applies to the front of the VRF. If I had created a VRF1 in the XYZCorp organization, the switch would not see it as a conflict because it would be XYZCorp:VRF1 instead of ABCCorp:VRF1.
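
The naming scheme is simple enough to sketch in Python. The function name is mine, and the colon separator is taken from the switch output above; XYZCorp is the hypothetical second tenant.

```python
def switch_vrf_name(organization, partition):
    """Build the VRF name as it appears on the switch: organization, colon, partition."""
    return f"{organization}:{partition}"


if __name__ == "__main__":
    abc = switch_vrf_name("ABCCorp", "VRF1")
    xyz = switch_vrf_name("XYZCorp", "VRF1")
    print(abc, xyz)
    # Same partition name in two organizations, two distinct names on the switch.
    print(abc != xyz)
```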

If we want to look at the SVI configuration, we need to use the expand-port-profile option. The profile pulled down from DCNM is not shown in the running config:


jemclaug-hh14-n5672-1# sh run int vlan 501 expand-port-profile

interface Vlan501
no shutdown
vrf member ABCCorp:VRF1
ip address 192.168.1.1/24 tag 12345
fabric forwarding mode anycast-gateway

VMTracker

Let’s have a quick look at VMTracker. As I mentioned in this article and the previous one, dot1q triggering requires the host to actually send data before the switch is auto-configured. The nice thing about VMTracker is that it will configure the switch when a VM becomes active, regardless of whether it is actually sending data. The switch itself is configured with the address of and credentials for your vCenter server, so it becomes aware when a workload is active.

Note: Earlier I said you have to configure the vCenter address and credentials in DCNM. Don’t be confused! DCNM is not talking to vCenter; the Nexus switch is. You only put them in DCNM if you are using Power-on Auto-Provisioning. In other words, DCNM will not establish a connection to vCenter, but will push the address and credentials down to the switch, and the switch establishes the connection.

We can see the VMTracker configuration on the second Nexus 5K:


jemclaug-hh14-n5672-2(config-vmt-conn)# sh run | sec vmtracker
feature vmtracker
encapsulation dynamic vmtracker
vmtracker fabric auto-config
vmtracker connection vc
remote ip address 172.26.244.120
username administrator@vsphere.local password 5 Qxz!12345
connect

The feature is enabled, and the “encapsulation dynamic vmtracker” command is applied to the relevant interfaces. (You can see the command here, but because I used the “| sec” option to view the config, you cannot see what interface it is applied under.) We can see that I also supplied the vCenter IP address and login credentials. (The password is sort-of encrypted.) Notice also the connect statement. The Nexus will not connect to the vCenter server until this is applied. Now we can look at the vCenter connection:

jemclaug-hh14-n5672-2(config-vmt-conn)# sh vmtracker status
Connection Host/IP status
-----------------------------------------------------------------------------
vc 172.26.244.120 Connected

We have connected successfully!

As with dot1q triggering, there is no VRF or SVI configured yet for our host:

jemclaug-hh14-n5672-2# sh vrf all
VRF-Name VRF-ID State Reason
default 1 Up --
management 2 Up --
jemclaug-hh14-n5672-2# sh ip int brief vrf all | i 192.168

We now go to vSphere (or vCenter) and power up the VM connected to this switch:

 

vcenter-power-on

Once we bring up the VM, we can see the switch has been notified, and the VRF has been automatically provisioned, along with the SVI.

jemclaug-hh14-n5672-2# sh vmtracker event-history | i VM2
761412 Dec 21 2016 13:43:02:572793 ABC-VM2 on 172.26.244.177 in DC4 is powered on
jemclaug-hh14-n5672-2# sh vrf all | i ABC
ABCCorp:VRF1 4 Up --
jemclaug-hh14-n5672-2# sh ip int brief vrf all | i 192
Vlan501 192.168.1.1 protocol-up/link-up/admin-up
jemclaug-hh14-n5672-2#

Thus, we had the same effect as with dot1q triggering, but we didn’t need to wait for traffic!
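
Since this series is heading toward programmability, here is a hedged Python sketch that pulls the fields out of an event-history line like the one above. The field layout is inferred from that single sample line, so treat the pattern as an assumption. The meaning of the leading number is not stated in the output, so it is kept as-is under a neutral name.

```python
import re

# Pattern inferred from one sample line of "sh vmtracker event-history"; an assumption,
# not a documented format.
EVENT_PATTERN = re.compile(
    r"^(?P<leading_number>\d+)\s+"
    r"(?P<date>\w{3}\s+\d{1,2}\s+\d{4})\s+"
    r"(?P<time>[\d:]+)\s+"
    r"(?P<vm>\S+) on (?P<host>\S+) in (?P<datacenter>\S+) is (?P<state>.+)$"
)


def parse_vmtracker_event(line):
    """Return a dict of fields from one event line, or None if the line does not match."""
    match = EVENT_PATTERN.match(line.strip())
    return match.groupdict() if match else None


if __name__ == "__main__":
    sample = "761412 Dec 21 2016 13:43:02:572793 ABC-VM2 on 172.26.244.177 in DC4 is powered on"
    print(parse_vmtracker_event(sample))
```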

I hope these articles have been helpful. Much of the documentation on DCNM right now is not in the form of a walk-through, and while I don’t offer as much detail, these articles should get you started. Remember, with DCNM, you get advanced features free for 30 days, so go ahead and download and play with it.

When I was still a new engineer, a fellow customer support engineer (CSE) asked a favor of me. I’ll call him Andy.

“I’m going on PTO, could you cover a case for me? I’ve filed a bug and while I’m gone there will be a conference call. Just jump on it and tell them that the bug has been filed and engineering is working on it.” The case was with one of our largest service provider clients. I won’t say which, but they were a household name.

When you’re new and want to make a good impression, you jump on chances like this. It was a simple request and would prove I’m a team player. Of course I accepted the case and went about my business with the conference call on my calendar for the next week.

Before I got on the call I took a brief look at the case notes and the DDTS (what Cisco calls a bug.) Everything seemed to be in order. The bug was filed and in engineering’s hands. Nothing to do but hop on the call and report that the bug was filed and we were working on it.

I dialed the bridge and after I gave my name the automated conference bridge said “there are 20 other parties in the conference.” Uh oh. Why did they need so many?

After I joined, someone asked for introductions. As they went around the call, there were a few engineers, several VPs, and multiple senior directors. Double uh oh.

“Jeff is calling from Cisco,” the leader of the call said. “He is here to report on the P1 outage we had last week affecting multiple customers. I’m happy to tell you that Cisco has been working diligently on the problem and is here to report their findings and their solution. Cisco, take it away.”

I felt my heart in my throat. I cleared my throat, and sheepishly said: “Uh, we’ve, uh, filed a bug for your problem and, uh, engineering is looking into it.”

It was dead silence, followed by a VP chiming in: “That’s it?”

I was then chewed out thoroughly for not doing enough and wasting everyone’s time.

When Andy got back he grabbed the case back from me. “How’d the call go?” he asked.

I told him how it went horribly, how they were expecting more than I delivered, and how I took a beating for him.

Andy just smiled. Welcome to TAC.

Introduction

I’ve been side-tracked for a while doing personal articles, so I thought it would be a good time to get back to some technical explanations.  Seeing that I work for Cisco now, I thought it would be a good time to cover some Cisco technology.  My focus here has been on programmability and automation.  Some of this work has involved using tools like Puppet and Ansible to configure switches, as well as Python and NETCONF.  I also recently had a chance to present BRKNMS-2002 at Cisco Live in Las Vegas, on LAN management with Data Center Network Manager 10.  It was my first Cisco Live breakout, and of course I had a few problems, from projector issues to live demo failures.  Ah well.  But for those of you who don’t have access to the CL library, or the time to watch the breakout, I thought I’d cover an important DCNM concept for you here on my blog.
