automation


I have to give AWS credit for posting a fairly detailed technical description of the cause of their recent outage.  Many companies rely on crisis PR people to phrase vague and uninformative announcements that do little to inform customers and put their minds at ease.  I must admit, having read the AWS post-mortem a couple times, I don’t fully understand what happened, but it seems my previous article on automation running wild was not far off.  Of course, the point of the article was not to criticize automation.  An operation the size of AWS would be simply impossible without it.  The point was to illustrate the unintended consequences of automation systems.  As a pilot and aviation buff, I can think of several examples of airplanes crashing due to out-of-control automation as well.

AWS tells us that “an automated activity to scale capacity of one of the AWS services hosted in the main AWS network triggered an unexpected behavior from a large number of clients inside the internal network.”  What’s interesting here is that the automation event was not itself a provisioning of network devices.  Rather, the capacity increase caused “a large surge of connection activity that overwhelmed the networking devices between the internal network and the main AWS network…”  This is just the old problem of overwhelming link capacity.  I remember one time, when I was at Juniper, a lab device started sending a flood of traffic to the Internet, crushing the Internet-facing firewalls.  It’s nice to know that an operation like Amazon faces the same challenges.  At the end of the day, bandwidth is finite, and enough traffic will ruin any network engineer’s day.

“This congestion immediately impacted the availability of real-time monitoring data for our internal operations teams, which impaired their ability to find the source of congestion and resolve it.”  This is the age-old problem, isn’t it?  Monitoring our networks requires network connectivity.  How else do we get logs, telemetry, traps, and other information from our devices?  And yet, when our network is down, we can’t get this data.  Most large-scale customers do maintain a separate out-of-band network just for monitoring.  I would assume Amazon does the same, but perhaps somehow this got crushed too?  Or perhaps what they refer to as their “internal network” was the OOB network?  I can’t tell from the post.

“Operators continued working on a set of remediation actions to reduce congestion on the internal network including identifying the top sources of traffic to isolate to dedicated network devices, disabling some heavy network traffic services, and bringing additional networking capacity online. This progressed slowly…”  I don’t want to take pleasure in others’ pain, but this makes me smile.  I’ve spent years telling networking engineers that no matter how good their tooling, they are still needed, and they need to keep their skills sharp.  Here is Amazon, with presumably the best automation and monitoring capabilities of any network operator, and they were trying to figure out top talkers and shut them down.  This reminds me of the first broadcast storm I faced, in the mid-1990’s.  I had to walk around the office unplugging things until I found the source.  Hopefully it wasn’t that bad for AWS!
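Finding top talkers is conceptually simple once you have flow data in hand.  Here is a minimal sketch in Python, assuming flow records are available as (source, bytes) pairs.  The addresses and byte counts are made up for illustration, and this says nothing about how AWS actually did it.

```python
from collections import Counter

def top_talkers(flows, n=3):
    """Return the n sources that sent the most bytes, largest first."""
    totals = Counter()
    for src, byte_count in flows:
        totals[src] += byte_count
    return totals.most_common(n)

# Made-up flow records: (source address, bytes sent)
sample_flows = [
    ("10.0.0.5", 1200),
    ("10.0.0.9", 90000),
    ("10.0.0.5", 800),
    ("10.0.0.7", 40000),
    ("10.0.0.9", 10000),
]

print(top_talkers(sample_flows, n=2))
```

The hard part during an outage is not the counting but getting the flow data at all when the monitoring path is congested.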

Outages happen, and Amazon has maintained a high level of service with AWS since the beginning.  The resiliency of such a complex environment should be astounding to anyone who has built and managed complex systems.  Still, at the end of the day, no matter how much you automate (and you should), no matter how much you assure (and you should), sometimes you have to dust off the packet sniffer and figure out what’s actually going down the wire.  For network engineers, that should be a reminder that you’re still relevant in a software-defined world.

As I write this, a number of sites out on the Internet are down because of an outage at Amazon Web Services.  Delta Airlines is suffering a major outage.  On a personal note, my wife’s favorite radio app and my Lutron lighting system are not operating correctly.  Of course, this outage is a reminder of the simple principle of not putting one’s eggs in a single basket.  AWS became the dominant web provider early on, but there are multiple viable alternatives now.  Long before the modern cloud emerged, I regularly ran disaster recovery exercises to ensure business continuity when a data center or service provider failed.  Everyone who uses a cloud provider better have a backup, and you better figure out a way to periodically test that backup.  A few startups have emerged to make this easier.

While the cause of the outage is yet unknown, there was an interesting comment in a Newsweek article on the outage.  Doug Madory, director of internet analysis at Kentik Inc, said:  “More and more these outages end up being the product of automation and centralization of administration…”  I’ve been involved in automation in some form or another for my entire six years at Cisco, and one aspect of automation is not talked about enough:  automation gone wild.  Let me give a non-computer example.

Back when I worked at the San Francisco Chronicle, the production department installed a new machine in our Union City printing plant.  The Sunday paper, back then, had a large number of inserts with advertisements and circulars that needed to be stuffed into the paper.  They were doing this manually, if you can believe it.

The new machine had several components.  One part of the process involved grabbing the inserts and carrying them in a conveyor system high above the plant floor, before dropping them down into the inserter.  It’s hard to visualize, so I’ve included a picture of a similar machine.

You can see the inserts coming in via the conveyor, hanging vertically.  This conveyor extended quite far.  One day I was in the plant, working on some networking thing or other, and the insert machine was running.  I looked back and saw the conveyor glitch somehow, and then a giant ball of paper started to form in the corner of the room, before finally exploding and raining paper down on the floor of the plant.  There was a commotion and one of the workers had to shut the machine down.

The point is, automation is great until it doesn’t work.  When it fails, it fails big.  You don’t just get a single problem, but a compounding problem.  It wasn’t just a single insert that got hit by the glitch, but dozens of them, if not more.  When you use manual processes, failures are contained.

Let’s tie this back to networking.  Say you need to configure hundreds of devices with some new code, perhaps adding a new routing protocol.  If you do it by hand in one device, and suddenly routes start dropping out of the routing table, chances are you won’t proceed with the other devices.  You’ll check your config to see what happened and why.  But if you set up, say, a Python script to run around and do this via NETCONF to 100 devices, suddenly you might have a massive outage on your hands.  The same could happen using a tool like Ansible, or even a vendor network management platform.

There are ways to combat this problem, of course.  Automated checking and validation after changes is an important one, but the problem with this approach is you cannot predict every failure.  If you program 10 checks, it’s going to fail in way #11, and you’re out of luck.
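One way to picture automated checks is a staged rollout that imitates what a careful engineer does by hand:  change a device, validate, and stop at the first sign of trouble.  The sketch below is a simulation only.  The device names, the `apply_change` callable and the `has_routes` check are hypothetical stand-ins for whatever NETCONF or Ansible call you would really make.

```python
def staged_rollout(devices, apply_change, checks, batch_size=1):
    """Apply a change in small batches, stopping at the first failed check.

    Returns (done, failed): the devices changed successfully, and the
    device that failed a check (or None if the whole rollout passed).
    """
    done = []
    for start in range(0, len(devices), batch_size):
        batch = devices[start:start + batch_size]
        for device in batch:
            apply_change(device)
        for device in batch:
            if not all(check(device) for check in checks):
                return done, device  # halt: later batches are never touched
            done.append(device)
    return done, None


# Simulated network: device name -> number of routes in its table.
routes = {f"sw{i}": 100 for i in range(1, 6)}

def add_protocol(device):
    # Pretend the new config breaks routing on sw3 only.
    if device == "sw3":
        routes[device] = 0

def has_routes(device):
    return routes[device] >= 100

done, failed = staged_rollout(list(routes), add_protocol, [has_routes])
print(done, failed)
```

This limits the blast radius to one batch, but it only catches the failures the checks look for.  Failure #11 still gets through.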

As I said, I’ve spent years promoting automation.  You simply couldn’t build a network like Amazon’s without it.  And it’s critical for network engineers to continue developing skills in this area.  We, as vendors and promoters of automation tools, need to be careful how we build and sell these tools to limit customer risk.

Eventually they got the inserter running again.  Whatever the cause of Amazon’s outage, let’s hope it’s not automation gone wild.

“Progress might have been alright once, but it has gone on too long.”
–  Ogden Nash

The book The Innovator’s Dilemma appears on the desk of a lot of Silicon Valley executives.  Its author, Clayton Christensen, is famous for having coined the term “disruptive innovation.”  The term has always bothered me, and I keep waiting for the word “disruption” to die a quiet death.  I have the disadvantage of having studied Latin quite a bit.  The word “disrupt” comes from the Latin verb rumpere, which means to “break up”, “tear”, “rend”, “break into pieces.”  The word, as does our English derivative, connotes something quite bad.  If you think “disruption” is good, what would you think if I disrupted a presentation you were giving?  What if I disrupted the electrical system of your heart?

Side note:  I’m fascinated with the tendency of modern English to use “bad” words to connote something good.  In the 1980’s the word “bad” actually came to mean its opposite.  “Wow, that dude is really bad!” meant he was good.  Cool people use the word “sick” in this way.  “That’s a sick chopper” does not mean the motorcycle is broken.

The point, then, of disruption is to break up something that already exists, and this is what lies beneath the b-school usage of it.  If you innovate, in a disruptive way, then you are destroying something that came before you–an industry, a way of working, a technology.  We instantly assume this is a good thing, but what if it’s not?  Beneath any industry, way of working, or technology are people, and disruption is disruption of them, personally.

The word “innovate” also has a Latin root.  It comes from the word novus, which means “new”.  In industry in general, but particularly the tech industry, we positively worship the “new”.  We are constantly told we have to always be innovating.  The second one technology is invented and gets established, we need to replace it.  Frame Relay gave way to MPLS, MPLS is giving way to SD-WAN, and now we’re told SD-WAN has to give way…  The life of a technology professional, trying to understand all of this, is like a man trying to walk on quicksand.  How do you progress when you cannot get a firm footing?

We seem to have forgotten that a journey is worthless unless you set out on it with an end in mind.  One cannot simply worship the “new” because it is new–this is self-referential pointlessness.  There has to be a goal, or an end–a purpose, beyond simply just cooking up new things every couple years.

Most tech people and b-school people have little philosophical education outside of, perhaps (and unfortunately) Atlas Shrugged.  Thus, some of them, realizing the pointlessness of endless innovation cycles, have cooked up ludicrous ideas about the purpose of it all.  Now we have transhumanists telling us we’ll merge our brains with computers and evolve into some sort of new God-species, without apparently realizing how ridiculous they sound.  COVID-19 should disabuse us of any notion that we’re not actually human beings, constrained by human limitations.

On a practical level, the furious pace of innovation, or at least what is passed off as such, has made the careers of technology people challenging.  Lawyers and accountants can master their profession and then worry only about incremental changes.  New laws are passed every year, but fundamentally the practice of their profession remains the same.  For us, however, we seem to face radical disruption every couple of years.  Suddenly, our knowledge is out-of-date.  Technologies and techniques we understood well are yesterday’s news, and we have to re-invent ourselves yet again.

The innovation imperative is driven by several factors:  Wall Street constantly pushes public companies to “grow”, thus disparaging companies that simply figure out how to do something and do it well.  Companies are pressured into expanding to new industries, or into expanding their share of existing industries, and hence need to come up with ways to differentiate themselves.  On an individual level, many technologists are enamored of innovation, and constantly seek to invent things for personal satisfaction or for professional gain.  Wall Street seems to have forgotten the natural law of growth.  Name one thing in nature that can grow forever.  Trees, animals, stars…nothing can keep growing indefinitely.  Why should a company be any different?  Will Amazon simply take over every industry and then take over governing the planet?  Then what?

This may seem a strange article coming from a leader of a team in a tech company that is handling bleeding edge technologies.  And indeed it would seem to be a heresy for someone like me to say these things.  But I’m not calling for an end to inventing new products or technologies.  Having banged out CLI for thousands of hours, I can tell you that automating our networks is a good thing.  Overlays do make sense in that they can abstract complexity out of networks.  TrustSec/Scalable Group Tags are quite helpful, and something like this should have been in IP from the beginning.

What I am saying is that innovation needs a purpose other than just…innovation.  Executives need to stop waxing eloquent about “disrupting” this or that, or our future of fusing our brains with an AI Borg.  Wall Street needs to stop promoting growth at all costs.  And engineers need time to absorb and learn new things, so that they can be true professionals and not spend their time chasing ephemera.

Am I optimistic?  Well, it’s not in my nature, I’m afraid.  As I write this we are in the midst of the Coronavirus crisis.  I don’t know what the world will look like a year from now.  Business as usual, with COVID a forgotten memory?  Perhaps.  Great Depression due to economic shutdown?  Perhaps.  Total societal, governmental, and economic collapse, with rioting in the streets?  I hope not, but perhaps.  Whatever happens, I do hope we remember that the word “novel”, as in “novel Coronavirus”, comes from the same Latin root as the word “innovation”.  New isn’t always the best.

I recently replied to a comment that I think warrants a full blog post.

I’ve been here at Cisco working on programmability for a few years.  Brian Turner wrote in to say, essentially:  Hang on!  I became a network engineer precisely because I don’t want to be a coder!  I tried programming and hated it!  Now you’re telling me to become a programmer!

As I said in my reply, I have a lot of sympathy for him.  It reminds me of a story.

Back when I was at Juniper, I met with the IT department’s head of automation to discuss using some of his tools for network automation.  Jeremy was an expert in all things Puppet and Ansible, and a rather enthusiastic promoter of these tools on the server/app side of the house.  He had also managed to get Puppet running on a Junos device.  I was meeting with him because, frankly, the wind seemed to be blowing in his direction.  That said, I did not share his enthusiasm.  He told me about a server guy he had worked with, Stephane.  When Jeremy proposed to Stephane that he should use automation tools to make his life easier, Stephane vehemently rejected the idea, and the meeting ended with Stephane banging his fists on the table and shouting “I am not a coder!”

Flash forward a couple years and Stephane ended up the head of automation for a major company.  Apparently he finally bought into the idea.

Frankly I had no desire to become a coder either.  When I interviewed at Cisco, most of my discussions were around the controllers I was working with at the time, data center fabrics, etc.  When I arrived, my new boss assigned me as his Principal TME for programmability.  I never claimed to be an expert in this area.  Two months later I was presenting to Tech Field Day, and to experienced automation guys like Jason Edelman and Matt Oswalt, on how to run Puppet on a Nexus switch.  Three years later and I’m known as a NETCONF/YANG guy.  I’d barely heard of them when I started.

As I replied to Brian, Cisco doesn’t want him or anyone to learn Python or YANG or whatever.  Think about it from my perspective in product management.  Implementing YANG models for all of IOS XE is a massive undertaking.  Engineering devoted a huge amount of effort to pull this off.  Huge.  Mandating YANG models for their ongoing development burns cycles.  Product marketing and engineering would never prioritize this unless we thought there was a high probability someone would use it.  In other words, we don’t want people to use it so much as customers want us to develop it.  We have demand for programmable interfaces for network devices, and hence we’ve delivered on it.   My job as a TME is not to push NETCONF/YANG on anyone, but to provide the enablement to make it easier for someone to use this technology if they themselves want to.

As I often say in my presentations, the why is important.  Why do some customers demand these interfaces?  Well, because they know Notepad is a horrible automation tool, and it’s what 90% of network engineers use.  If you want to configure 50 switches, you’re going to configure one, paste the config into Notepad, tweak a few values, and then paste it into the next switch.  Do this 48 more times and tell me if this is the best use of your time as a highly skilled network engineer.  You can write a script to do this and save yourself a lot of trouble.  Or use Ansible to do it.  Or Cisco DNAC.  Whatever you want.  But if you want any of these tools to work efficiently, you need a machine interface, which CLI is not.  If you don’t believe me, try writing a script to do regular expression-based parsing of CLI outputs.  It’s a lot easier with YANG.
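To make the screen-scraping point concrete, here is a small Python comparison.  The CLI text is illustrative, and the structured version uses made-up JSON key names rather than any real YANG model.  Treat it as a sketch of the difference in effort, not as a real device API.

```python
import json
import re

# Illustrative CLI-style text; column spacing and wording vary by platform.
cli_output = """\
VRF-Name      VRF-ID  State  Reason
default       1       Up     --
management    2       Up     --
"""

def vrfs_from_cli(text):
    """Screen-scrape VRF names and states with a regular expression."""
    row = re.compile(r"^(\S+)\s+(\d+)\s+(\S+)\s+(\S+)\s*$")
    result = {}
    for line in text.splitlines()[1:]:  # skip the header row
        match = row.match(line)
        if match:
            name, _vrf_id, state, _reason = match.groups()
            result[name] = state
    return result

# The same facts as structured data. The key names are hypothetical,
# not taken from any real YANG model.
structured_output = json.dumps(
    {"vrfs": [{"name": "default", "state": "Up"},
              {"name": "management", "state": "Up"}]}
)

def vrfs_from_structured(text):
    """Read VRF names and states from a structured document."""
    return {v["name"]: v["state"] for v in json.loads(text)["vrfs"]}

print(vrfs_from_cli(cli_output))
print(vrfs_from_structured(structured_output))
```

Both functions return the same dictionary, but the regex version breaks as soon as a column is added, a header changes, or a value contains a space.  The structured version does not care about layout at all.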

The point is not for network engineers to become programmers.  The point is to add some tools to your toolbox to help you focus on what you do well.  One weekend spent with a Python course and one more weekend with a DevNet course on YANG will give you a tool you can use to make your life easier.  That’s it.  Some customers may take it a lot further, of course, and go way into CI/CD workflows and that’s fine.  If you want to do 95% of your work in CLI and write a few scripts to do the other 5%, that’s fine.  If you want to use Cisco DNAC to do almost everything, knock yourself out.  It’s about what works best for you, as a network engineer.

I often point out how lousy my code quality is.  I’m sometimes ashamed to show the code for some of the scripts I’ve written.  I’m not a coder!  That’s a point I often make.  I don’t want to be a full-time software developer.  I’m a network engineer.  So for Brian and all the other CCIE’s out there, keep doing what you do best, but don’t close yourself off to some additional tools that will make your life easier.

An old networking friend whom I mentored for his CCIE a long time ago wrote me an email:  I’ve been a CCIE for 10 years now, he said, and I’m feeling like a dinosaur.  Everyone wants people who know AWS and automation and they don’t want old-school CLI guys.

It takes me back to a moment in my career that has always stuck with me.  I was in my early twenties at my first job as a full-time network engineer.  I was working at the San Francisco Chronicle, at the time (early 2000’s) a large newspaper with a wide circulation.  The company had a large newsroom, a huge advertising call center, three printing plants, and numerous circulation offices across the Bay Area.  We had IP, IPX, AppleTalk and SNA on the network, typical of the multi-protocol environments of the time.

My colleague Tony and I were up in the MIS area on the second floor of the old Chronicle building on 5th and Mission St. in downtown San Francisco.  The area we were in contained armies of mainframe programmers, looking at the black screens of COBOL code that were the backbone of the newspaper systems in those days.  Most of the programmers were in their fifties, with gray hair and beards.  Tony and I were young, and TCP/IP networking was new to these guys.

I was telling Tony how I always wanted to be technical.  I loved CLI, and I was good at it.  I was working on my first CCIE.  I was at the top of my game, and if any weird problem cropped up on our network I dove in and got it fixed, no matter how hard.  As I explained to Tony, this was all I wanted to do in my career, to be a CLI guy, working with Cisco routers and switches.

Tony gestured at the mainframe programmers, sitting in their cubes typing their COBOL.  “Is this what you want to be when you’re in your fifties,” he said under his breath, “a dinosaur?  Do you just want to be typing obscure code into systems that are probably going to be one step away from being shut down?  How long do you think these guys will have their jobs anyways?”

Well, I haven’t been to the Chronicle in a while but those jobs are almost certainly gone.  Fortunately for the COBOL guys, they’re all retirement age anyways.

We live in a world and an industry that worships the young and the new.  If you’re in your twenties, and totally current on the latest DevOps tools, be warned:  someday you’ll be in your forties and people will think DevOps is for dinosaurs.  The tech industry is under constant pressure to innovate, and innovating usually means getting machines to do things people used to do.  This is why some tech titans are pushing for universal basic income.  They realize that their innovations eliminate jobs at such a rate that people won’t be able to afford to live anymore.  I think it’s a terrible idea, but that’s a subject for another post.  The point is, in this industry, when you think you’ve mastered something and are relevant, be ready:  your obsolescence cometh.

This is an inversion of the natural respect for age and experience we’ve had throughout human history.  I don’t say this as a 40-something feeling some bitterness for the changes to his industry;  in fact, I actually had this thought when I was much younger.  In the West, at least, in the 1960’s there developed a sense that, to paraphrase Hunter Thompson, old is evil.  This was of course born from legitimately bad things that were perpetrated by previous generations, but it’s interesting to see how the attitude has taken hold in every aspect of our culture.  If you look at medieval guilds, the idea was that the young spent years going through apprentice and journeyman stages before being considered a master of their craft.  This system is still in place in many trades that do not experience innovation at the rate of our industry, and there is a lot to be said for it.  The older members of the trade get security and the younger get experience.

I’ve written a bit about the relevance of the CCIE, and of networking skills in general, in the new age.  Are we becoming the COBOL programmers of the early 2000’s?  Is investing in networking skills about the same as studying mainframe programming back then, a waste of cycles on dying systems?

I’ve made the point many times on this blog that I don’t think that’s (yet) the case.  At the end of the day, we still need to move packets around, and we’re still doing it in much the same way as we did in 1995.  Most of the protocols are the same, and even the newer ones like VXLAN are not that different from the old ones.  Silicon improves, speeds increase, but fundamentally we’re still doing the same thing.  What’s changing is how we’re managing those systems, and as I say in my presentations, that’s not a bad thing.  Using Notepad to copy/paste across a large number of devices is not a good use of network engineers’ time.  Automating can indeed help us to do things better and focus on what matters.

I’ve often used the example of airline pilots.  A modern airplane cockpit looks totally different from a cockpit in the 1980’s or even 1990’s.  The old dials and switches have been replaced by LCD panels and much greater automation.  And yet we still have pilots, and the pilot today still needs to understand engine systems, weather, aerodynamics, and navigation.  What’s changed is how that pilot interacts with the machine.  As a pilot myself, I can tell you how much better a glass cockpit is than the old dials.  I get better information presented in a much more useful way and don’t have to waste my time on unnecessary tasks.  This is how network automation should work.

When I raised this point to some customer execs at a recent briefing, one of them said that the pilots could be eliminated since automation is so good now.  I’m skeptical we will ever reach that level of automation, despite the futurists’ love of making such predictions.  The pilots aren’t there for the 99% of the time when things work as expected, but for the 1% when they don’t, and it will be a long time, if ever, before AI can make judgement calls like a human can.  And in order to make those 1% of calls, the pilots need to be flying the 99% of the time when it’s routine, so they know what to do.

So, are we dinosaurs?  Are we the COBOL programmers of the late 2010’s, ready to be eliminated in the next wave of layoffs?  I don’t think so, but we have to adapt.  We need to learn the glass cockpit.  We need to stay on top of developments, and learn how those developments help us to manage the systems we know well how to manage.  Mainframes and operating systems will come and go, but interconnecting those systems will still be relevant for a long time.

Meanwhile, an SVP at Cisco told me he saw someone with a ballcap at Cisco Live:  “Make CLI Great Again”.  Gotta love that.  Some dinosaurs don’t want to go extinct.

Introduction

My role at Cisco is transitioning to enterprise so I won’t be working on Nexus switches much any more.  I figured it would be a good time to finish this article on DCNM.  In my previous article, I talked about DCNM’s overlay provisioning capabilities, and explained the basic structure DCNM uses to describe multi-tenancy data centers.  In this article, we will look at the details of 802.1q-triggered auto-configuration, as well as VMtracker-based triggered auto-configuration.  Please be aware that the types of triggers and their behaviors depend on the platform you are using.  For example, you cannot do dot1q-based triggers on Nexus 9k, and on Nexus 5k, while I can use VMTracker, it will not prune unneeded VLANs.  If you have not read my previous article, please review it so the terminology is clear.

Have a look at the topology we will use:

[Figure: auto-config lab topology]

The spine switches are not particularly relevant, since they are just passing traffic and not actively involved in the auto-configuration.  The Nexus 5K leaves are, of course, and attached to each is an ESXi server.  The one on the left has two VMs in two different VLANs, 501 and 502.  The 5k will learn about the active hosts via 802.1q triggering.  The rightmost host has only one VM, and in this case the switch will learn about the host via VMtracker.  In both cases the switches will provision the required configuration in response to the workloads, without manual intervention, pulling their configs from DCNM as described in part 1.

Underlay

Because we are focused on overlay provisioning, I won’t go through the underlay piece in detail.  However, when you set up the underlay, you need to configure some parameters that will be used by the overlay.  Since you are using DCNM, I’m assuming you’ll be using the Power-on Auto-Provision feature, which allows a switch to get its configuration on bootup without human intervention.

[Screenshot: fabric configuration]

Recall that a fabric is the highest level construct we have in DCNM.  The fabric is a collection of switches running an encapsulation like VXLAN or FabricPath together.  Before we create any PoAP definitions, we need to set up a fabric.  During the definition of the fabric, we choose the type of provisioning we want.  Since we are doing auto-config, we choose this option as our Fabric Provision Mode.  The previous article describes the Top Down option.

Next, we need to build our PoAP definitions.  Each switch that is configured via PoAP needs a definition, which tells DCNM what software image and what configuration to push.  This is done from the Configure->PoAP->PoAP Definitions section of DCNM.  Because generating a lot of PoAP defs for a large fabric is tedious, DCNM10 also allows you to build a fabric plan, where you specify the overall parameters for your fabric and then DCNM generates the PoAP definitions automatically, incrementing variables such as management IP address for you.  We won’t cover fabric plans here, but if you go that route the auto-config piece is basically the same.

[Screenshot: PoAP definitions]
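To illustrate the idea behind a fabric plan, here is a small Python sketch that stamps out one definition per switch and increments the management IP.  This is not DCNM’s API or data format.  The field names, switch names, addresses and image name are all hypothetical.

```python
import ipaddress

def build_poap_definitions(count, first_mgmt_ip, image, name_prefix="leaf"):
    """Generate one definition per switch, incrementing the management IP."""
    start = ipaddress.ip_address(first_mgmt_ip)
    return [
        {
            "switch": f"{name_prefix}-{i + 1}",
            "mgmt_ip": str(start + i),
            "image": image,
        }
        for i in range(count)
    ]

for definition in build_poap_definitions(3, "10.1.1.11", "example-image.bin"):
    print(definition)
```

The point is simply that the per-switch values are derived from a handful of fabric-wide parameters rather than typed in one at a time.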

Once we are in the PoAP definition for the individual switch, we can enable auto-configuration and select the type we want.

[Screenshot: PoAP definition for a single switch]

In this case I have only enabled the 802.1q trigger.  If I want to enable VMTracker, I just check the box for it and enter my vCenter server IP address and credentials in the box below.  I won’t show the interface configuration, but please note that it is very important that you choose the correct access interfaces in the PoAP defs.  As we will see, DCNM will add some commands under the interfaces to make the auto-config work.

Once the switch has been powered on and has pulled down its configuration, you will see the relevant config under the interfaces:

n5672-1# sh run int e1/33
interface Ethernet1/33
  switchport mode trunk
  encapsulation dynamic dot1q
  spanning-tree port type edge trunk

If the encapsulation command is not there, auto-config will not work.

Overlay Definition

Remember from the previous article that, after we define the Fabric, we need to define the Organization (Tenant), the Partition (VRF), and then the Network.  Defining the organization is quite easy: just navigate to the organizations screen, click the plus button, and give it a name.  You may only have one tenant in your data center, but if you have more than one you can define them here.  (I am using extremely creative and non-trademark-violating names here.)  Be sure to pick the correct Fabric name in the drop-down at the top of the screen;  often when you don’t see what you are expecting in DCNM, it is because you are not on the correct fabric.

[Screenshot: organization configuration]

Next, we need to add the partition, which is DCNM’s name for a VRF.  Remember, we are talking about multitenancy here.  Not only do we have the option to create multiple tenants, but each tenant can have multiple VRFs.  Adding a VRF is just about as easy as adding an organization.  DCNM does have a number of profiles that can be used to build the VRFs, but for most VXLAN fabrics, the default EVPN profile is fine.  You only need to enter the VRF name.  The partition ID is already populated for you, and there is no need to change it.

[Screenshot: partition configuration]

There is something important to note in the above screen shot.  The name given to the VRF is prepended with the name of the organization.  This is because the switches themselves have no concept of organization.  By prepending the org name to the VRF, you can easily reuse VRF names in different organizations without risk of conflict on the switch.

Finally, let’s provision the network.  This is where most of the configuration happens.  Under the same LAN Fabric Automation menu we saw above, navigate to Networks.  As before, we need to pick a profile, but the default is fine for most layer 3 cases.

[Screenshot: network configuration]

Once we specify the organization and partition that we already created, we tell DCNM the gateway address.  This is the Anycast gateway address that will be configured on any switch that has a host in this VLAN.  Remember that in VXLAN/EVPN, each leaf switch acts as a default gateway for the VLANs it serves.  We also specify the VLAN ID, of course.

Once this is saved, the profile is in DCNM and ready to go.  Unlike with the underlay config, nothing is actually deployed on the switch at this point.  The config is just sitting in DCNM, waiting for a workload to become active that requires it.  If no workload requires the configuration we specified, it will never make it to a switch.  And, if switch-1 requires the config while switch-2 does not, well, switch-2 will never get it.  This is the power of auto-configuration.  It’s entirely likely that when you are configuring your data center switches by hand, you don’t configure VLANs on switches that don’t require them, but you have to figure that out yourself.  With auto-config, we just deploy as needed.
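The deploy-as-needed behavior can be modeled in a few lines of Python.  This is a toy simulation of the concept, not how DCNM or NX-OS is implemented.  The class names and the `frame_received` hook are invented for illustration, and the VLAN, VRF and gateway values are example values.

```python
class Controller:
    """Stands in for the central store of network profiles."""
    def __init__(self):
        self.profiles = {}  # VLAN ID -> profile

    def define_network(self, vlan, vrf, gateway):
        self.profiles[vlan] = {"vrf": vrf, "gateway": gateway}


class LeafSwitch:
    def __init__(self, name, controller):
        self.name = name
        self.controller = controller
        self.deployed = {}  # VLAN ID -> profile actually configured here

    def frame_received(self, vlan):
        """Called when tagged traffic for a VLAN first shows up on a port."""
        if vlan in self.deployed:
            return  # already configured
        profile = self.controller.profiles.get(vlan)
        if profile is not None:
            self.deployed[vlan] = profile  # pull config only when needed


controller = Controller()
controller.define_network(501, "ABCCorp:VRF1", "192.168.1.1")

switch1 = LeafSwitch("switch-1", controller)
switch2 = LeafSwitch("switch-2", controller)

switch1.frame_received(501)  # a host in VLAN 501 becomes active on switch-1

print(switch1.deployed)
print(switch2.deployed)
```

Only switch-1 ends up with the network configured.  switch-2 stays empty until a workload in that VLAN shows up behind it.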

Let’s take a step back and review what we have done:

  1. We have told DCNM to enable 802.1q triggering for switches that are configured with auto-provisioning.
  2. We have created an organization and partition for our new network.
  3. We have told DCNM what configuration that network requires to support it.

Auto-Config in Action

Now that we’ve set DCNM up, let’s look at the switches.  First of all, I verify that there is no VRF or SVI configuration for this partition and network:


jemclaug-hh14-n5672-1# sh vrf all
VRF-Name VRF-ID State Reason
default 1 Up --
management 2 Up --

jemclaug-hh14-n5672-1# sh ip int brief vrf all | i 192.168
jemclaug-hh14-n5672-1#

We can see here that there is no VRF other than the default and management VRFs, and there are no SVIs with the 192.168.x.x prefix. Now I start a ping from my VM1, which you will recall is connected to this switch:

jeffmc@ABC-VM1:~$ ping 192.168.1.1
PING 192.168.1.1 (192.168.1.1) 56(84) bytes of data.
64 bytes from 192.168.1.1: icmp_seq=9 ttl=255 time=0.794 ms
64 bytes from 192.168.1.1: icmp_seq=10 ttl=255 time=0.741 ms
64 bytes from 192.168.1.1: icmp_seq=11 ttl=255 time=0.683 ms

Notice from the output that the first ping I get back is sequence #9. Back on the switch:

jemclaug-hh14-n5672-1# sh vrf all
VRF-Name VRF-ID State Reason
ABCCorp:VRF1 4 Up --
default 1 Up --
management 2 Up --
jemclaug-hh14-n5672-1# sh ip int brief vrf all | i 192.168
Vlan501 192.168.1.1 protocol-up/link-up/admin-up
jemclaug-hh14-n5672-1#

Now we have a VRF and an SVI! As I stated before, the switch itself has no concept of organization, which is really just a tag DCNM applies to the front of the VRF. If I had created a VRF1 in the XYZCorp organization, the switch would not see it as a conflict because it would be XYZCorp:VRF1 instead of ABCCorp:VRF1.
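Going back to the ping output for a moment: the first reply being sequence #9 gives a rough idea of how long auto-configuration took. The sketch below estimates it, assuming the Linux ping defaults of a one-second interval and sequence numbers starting at 1. Neither assumption is stated in the output itself, and things like ARP resolution account for part of the gap, so treat the result as a ballpark figure.

```python
import re

PING_OUTPUT = """\
PING 192.168.1.1 (192.168.1.1) 56(84) bytes of data.
64 bytes from 192.168.1.1: icmp_seq=9 ttl=255 time=0.794 ms
64 bytes from 192.168.1.1: icmp_seq=10 ttl=255 time=0.741 ms
64 bytes from 192.168.1.1: icmp_seq=11 ttl=255 time=0.683 ms
"""


def first_reply_seq(output: str) -> int:
    """Return the lowest icmp_seq value that received a reply."""
    seqs = [int(s) for s in re.findall(r"icmp_seq=(\d+)", output)]
    if not seqs:
        raise ValueError("no ICMP replies found in ping output")
    return min(seqs)


def estimated_delay_seconds(output: str, interval: float = 1.0, first_seq: int = 1) -> float:
    """Estimate the time before the gateway answered.

    Assumptions: probes are sent every `interval` seconds and numbering
    starts at `first_seq` (the usual Linux ping behavior).
    """
    lost_probes = first_reply_seq(output) - first_seq
    return lost_probes * interval


print(first_reply_seq(PING_OUTPUT))          # 9
print(estimated_delay_seconds(PING_OUTPUT))  # 8.0
```

So roughly eight probes went unanswered while the switch detected the traffic, pulled the profile, and brought up the VRF and SVI.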

If we want to look at the SVI configuration, we need to use the expand-port-profile option. The profile pulled down from DCNM is not shown in the running config:


jemclaug-hh14-n5672-1# sh run int vlan 501 expand-port-profile

interface Vlan501
no shutdown
vrf member ABCCorp:VRF1
ip address 192.168.1.1/24 tag 12345
fabric forwarding mode anycast-gateway
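It can help to see how the handful of values entered in DCNM map onto this config. The Python sketch below simply reproduces the five lines shown above from four parameters. It is an illustration only, not DCNM's actual profile template, and the function name is hypothetical.

```python
import ipaddress


def render_svi(vlan_id: int, vrf: str, gateway_cidr: str, tag: int) -> str:
    """Render an SVI config like the one shown above from its parameters.

    Illustration only: this mirrors the expanded output in this post and is
    not the template DCNM actually uses.
    """
    # Raises ValueError if the gateway is not a valid address/prefix.
    gateway = ipaddress.ip_interface(gateway_cidr)
    lines = [
        f"interface Vlan{vlan_id}",
        "no shutdown",
        f"vrf member {vrf}",
        f"ip address {gateway} tag {tag}",
        "fabric forwarding mode anycast-gateway",
    ]
    return "\n".join(lines)


print(render_svi(501, "ABCCorp:VRF1", "192.168.1.1/24", 12345))
```

Every leaf that serves VLAN 501 would get the same gateway address, which is what makes it an anycast gateway.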

VMTracker

Let’s have a quick look at VMTracker. As I mentioned in this post and the previous one, dot1q triggering requires the host to actually send data before it auto-configures the switch. The nice thing about VMTracker is that it will configure the switch when a VM becomes active, regardless of whether it is actually sending data. The switch itself is configured with the address of and credentials for your vCenter server, so it becomes aware when a workload is active.

Note: Earlier I said you have to configure the vCenter address and credentials in DCNM. Don’t be confused! DCNM is not talking to vCenter; the Nexus switch is. You only put them in DCNM if you are using Power-on Auto-Provisioning. In other words, DCNM will not establish a connection to vCenter, but will push the address and credentials down to the switch, and the switch establishes the connection.

We can see the VMTracker configuration on the second Nexus 5K:


jemclaug-hh14-n5672-2(config-vmt-conn)# sh run | sec vmtracker
feature vmtracker
encapsulation dynamic vmtracker
vmtracker fabric auto-config
vmtracker connection vc
remote ip address 172.26.244.120
username administrator@vsphere.local password 5 Qxz!12345
connect

The feature is enabled, and the “encapsulation dynamic vmtracker” command is applied to the relevant interfaces. (You can see the command here, but because I used the “| sec” option to view the config, you cannot see what interface it is applied under.) We can see that I also supplied the vCenter IP address and login credentials. (The password is sort-of encrypted.) Notice also the connect statement. The Nexus will not connect to the vCenter server until this is applied. Now we can look at the vCenter connection:

jemclaug-hh14-n5672-2(config-vmt-conn)# sh vmtracker status
Connection Host/IP status
-----------------------------------------------------------------------------
vc 172.26.244.120 Connected

We have connected successfully!

As with dot1q triggering, there is no VRF or SVI configured yet for our host:

jemclaug-hh14-n5672-2# sh vrf all
VRF-Name VRF-ID State Reason
default 1 Up --
management 2 Up --
jemclaug-hh14-n5672-2# sh ip int brief vrf all | i 192.168

We now go to vSphere (or vCenter) and power up the VM connected to this switch:

 

[Screenshot: vcenter-power-on]

Once we bring up the VM, we can see the switch has been notified, and the VRF has been automatically provisioned, along with the SVI.

jemclaug-hh14-n5672-2# sh vmtracker event-history | i VM2
761412 Dec 21 2016 13:43:02:572793 ABC-VM2 on 172.26.244.177 in DC4 is powered on
jemclaug-hh14-n5672-2# sh vrf all | i ABC
ABCCorp:VRF1 4 Up --
jemclaug-hh14-n5672-2# sh ip int brief vrf all | i 192
Vlan501 192.168.1.1 protocol-up/link-up/admin-up
jemclaug-hh14-n5672-2#

Thus, we had the same effect as with dot1q triggering, but we didn’t need to wait for traffic!
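If you want to script this before-and-after check rather than eyeball it, a minimal Python sketch follows. It assumes you have already captured the text of "sh vrf all" somehow (how you collect it is up to you), and that the output looks like the samples in this post. The function name is hypothetical.

```python
def parse_vrf_names(show_vrf_output: str) -> set:
    """Extract VRF names from 'sh vrf all' output like the samples in this post."""
    names = set()
    for line in show_vrf_output.splitlines():
        fields = line.split()
        # Data rows carry a numeric VRF-ID in the second column; this skips
        # the header row, prompts, and blank lines.
        if len(fields) >= 3 and fields[1].isdigit():
            names.add(fields[0])
    return names


BEFORE = """\
VRF-Name VRF-ID State Reason
default 1 Up --
management 2 Up --
"""

AFTER = """\
VRF-Name VRF-ID State Reason
ABCCorp:VRF1 4 Up --
default 1 Up --
management 2 Up --
"""

new_vrfs = parse_vrf_names(AFTER) - parse_vrf_names(BEFORE)
print(new_vrfs)  # {'ABCCorp:VRF1'}
```

The set difference tells you exactly which VRFs auto-config added, whichever trigger was used.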

I hope these articles have been helpful. Much of the documentation on DCNM right now is not in the form of a walk-through, and while I don’t offer as much detail, I hope these articles get you started. Remember, with DCNM, you get advanced features free for 30 days, so go ahead and download it and play with it.

In the final post in my “Ten Years a CCIE” series, I take a look at the age-old question: Is a CCIE really worth it? I conclude the series with some thoughts on the value of this journey.

I’ve written this series of posts in the hope that others considering the pursuit of a CCIE  would have some idea of the process, the struggles, and the reward of passing this notorious exam.  I would like to finish this series with a final article examining the age-old question:  What is the value of a CCIE?  In other words, was it worth it?  All of the other articles in the series were written a couple of years ago, back when I worked for Juniper in IT, with only some slight revisions before publishing.  However, this article is being written now (late 2016), in very different circumstances.  I work at Cisco now, not Juniper.  I work in the switching business unit, not in IT.  I work in product management, and as a Principal Engineer have direct influence on the direction of our products.  I specifically work on programmability, automation, and SDN solutions.  As I write this, I see many potential CCIE candidates wondering if it is even worth pursuing a career in networking any more.  After all, won’t SDN and automation just eliminate their jobs? Wouldn’t they be better off learning Python?  Isn’t Cisco a company in its death throes, facing extinction at the hands of upstarts like Arista and Viptela?

Well, I can’t predict the future.  But I can look at the issues from the perspective of someone who has been in the industry a long time now, and at least set your mind at ease.  The short answer:  it’s worth it.  The longer answer requires us to examine the question from a few different angles.

Technical Knowledge

As I’ve pointed out in previous articles, during the CCIE preparation process, you will master a vast amount of material, if you approach the test honestly.  But, as anyone in this industry knows, the second you pass, a timer starts counting down the value of the material you learned.  On my R&S exam (2004), I studied DLSw+, ATM, ISDN, and Frame Relay.  On my Security (2008) exam, I studied the VPN 3k concentrator, PIX, and NAC Framework.  How much of that is valuable now?

True, I studied many things that are still in use today.  OSPF, BGP, EIGRP, and ISIS were all heavily tested on the R&S exam even back in 2004.  But how much of it do I remember?  I certainly have a good knowledge of each of them, but even after taking the JNCIE-SP exam less than two years ago, I find my knowledge of the details of routing protocols fading away.  If I were to take the JNCIE exam today, I would certainly fail.

Thus, some technologies I learned are obsolete, and some are not, but I have forgotten a lot.  However, there are also technologies that I never learned.  For example, ISE and Cisco TrustSec are huge topics that I am just starting to play with.  Despite my newness to these technologies, I do have the same right to call myself a “Security CCIE” as a guy who passed it yesterday, and knows those subjects cold!

So, then, what about the value of the technical knowledge?  Is it worth it?

First, it is worth it because no matter what becomes obsolete, and no matter what you forget, the intensity of your study will guarantee you still have a core of knowledge that will be with you for a long time.  Fortunately, despite the rapidity of change our industry is supposedly undergoing, networking is conservative by nature.  The Internet is a large distributed system consisting of many systems running disparate operating systems, and they all have to work together.  Just ask the guys pushing IPv6.  The core of networking doesn’t change very much.  You will also learn that even new technologies, like VXLAN, build on concepts you already know.

Second, even obsolete knowledge is valuable.  Newer engineers often don’t realize how certain technologies developed, or why we do things the way that we do.  They don’t know the things that have been tried and which failed in the past.  Despite the premium our industry places on youth, old-timers do have important insight gleaned from playing with things like, say, DLSw+.  It’s important for us not to simply wax nostalgic (as I am here) about the glorious days of IPX, but to explain to younger engineers how these dinosaurs actually worked, so that they can understand what decisions protocol designers made and why, and why some technologies fell into disuse or were replaced.  And, millennials, it is your responsibility to sit up and listen, if not to seek out such information.

Technical “soft skills”

In addition to specific technical knowledge, CCIE preparation simply makes you better at configuring and troubleshooting in general.  The hours of frustration in the lab and the broken configs that have you pulling your hair out ultimately help you to learn what makes a network break, and how to go about systematically debugging it.  Certainly you can learn those skills elsewhere, but when you are faced with the challenge of building a lab in a short amount of time, you have to be able to think quickly on your feet.  Even troubleshooting a live network outage doesn’t quite compare to the intensity of the CCIE crucible, the eight-hour slog during which one mistake can cost you months of studying.  The exams where I took multiple attempts required me to seriously up my game between tries, and each time I’ve come out a stronger thinker.

Non-Technical Skills

Passing the CCIE exam honestly is a challenge, plain and simple.  It requires discipline, persistence, extreme attention to detail, resourcefulness, ability to read documentation carefully, and time management skills.  While you won’t pass the CCIE without having those skills to begin with, there is no question that the rigors of CCIE preparation will help to hone them.  You will be a better person all around if you submit yourself to the discipline of passing the exam honestly.

It’s a bit like studying the martial arts.  Most martial artists will tell you that, in addition to being able to deliver a mean axe kick to the head, they have achieved self-discipline and confidence from their practice of martial arts, and that they find these skills apply elsewhere in life.  It’s true of any rigorous pursuit, really, and definitely true of the CCIE.  Personally I have no doubt that the many hours of laborious study have made me a more detail-oriented and better engineer.

Employment value:  Employee’s perspective

In my own case, my CCIE certification was directly responsible for my getting hired at Cisco TAC in 2005.  It was absolutely essential for me to move to a VAR in 2007.  Believe it or not, it was critical in my getting a job at Juniper in 2009, and having two CCIE’s and a JNCIE helped open the door to be re-hired at Cisco in 2015.  Obviously, employers see value in it.

You still see CCIE certification listed as a requirement in many job descriptions.  Without one you are cut out of those positions.  Many employers see it as a key differentiator and qualification for a senior position in network engineering.  VARs are still required to have a number of CCIEs on staff, so for some positions they will only talk to someone who has a CCIE.

Rarely is it the only qualification, however, and rarely is it enough, on its own, to get you hired.  Shortly after I passed my second CCIE I got laid off from the VAR where I worked.  I ended up interviewing at Nexus IS (now Dimension Data), another VAR, a couple months later.  Having been out of work I was rusty, and the technical interviewer grilled me mercilessly.  I’ve stated in the past that I don’t find this sort of thing productive, and I was uncomfortable, and didn’t have a great performance.  Despite the fact that I had two CCIE’s, and despite the value of such credentials for a reseller, I got a call from HR in a few days telling me that they chose not to hire me.  (These things often work out in the end;  I am in a good place now, and I wouldn’t have made it here if I had stayed in the VAR world.)  The entire experience reinforces a point I made in my Cheaters post:  don’t think an ugly plaque and a number are a guarantee of employment.

That said, there is little doubt that a person with X skill-set and a CCIE is in a better place than a person with just X skill-set.  As I said above, the certification simply opens doors.  You cannot rely on it alone, but combined with good experience, the certification is invaluable.  I would never have made it this far without one.  Now it’s true that there are some very talented and knowledgeable engineers who don’t have, and in some cases disdain, the CCIE.  Many of these folks do quite well and have successful careers.  But again, add the famous number and you will always do better in the end.

Employment value:  Employer’s perspective

Let’s turn it around now, and look at it from the perspective of a hiring manager.  When I have the stack of resumes in front of me, how important is it for me to look for that famous CCIE number?

This is a much harder question to answer, and I’ve touched on some of the problems of the CCIE above.  If you are looking at somebody who has a CCIE #1xxx, you know you are getting someone with age and experience.  But if you’re looking for someone to do hands-on implementation work, will this be the right candidate?  Maybe, of course, if Mr. #1xxx has been doing a lot of that lately.  The point is, you can’t tell from the number alone.

Now let’s say you are looking at CCIE #50xxx.  Does her CCIE number mean she will be the perfect fit for the hands-on job?  Quite possibly, because she has a lot of recent hands-on experience.  Will she be unqualified for the management role because she lacks the experience of CCIE #1xxx?  Who knows?

The point is, from a hiring manager’s perspective, the CCIE is simply one piece of data in the overall picture of the candidate.  It tells you something, something important, but it’s not nearly enough to be sure you are picking the right person.  If you are hiring for a VAR and just need a number to get your Gold status, then the value of the CCIE jumps up a little higher.  If you are just hiring a network engineer for IT, you need to take a host of other factors into account.

At the end of the day, I hope that people don’t make hiring decisions based on that one criterion.  I had one job where we hired in a CCIE (against my advice) because the hiring manager believed anyone in possession of a CCIE number was a genius.  (See “The CCIE Mystique”, here.)  The guy was a disaster.  Did he pass by cheating, or was he just good at taking tests?  I don’t know.  I’ve also known many, many extremely smart and talented network engineers who never got their CCIE, including several who just couldn’t pass it.  Keep that in mind when you are getting overly impressed with the famous four letters.

SDN and the New World Order

I would like to conclude this post, and this series, with a few thoughts on the relevance of CCIE certification in the future.  Various people in the industry, most of whom hold MBA’s in finance or marketing, tell us that CLI is dead and that the Ciscos and Junipers will be replaced by cheap, “white-label” hardware.  Google and Facebook are managing thousands of network devices with a handful of staff, using scripting and automation.  A CCIE certification is a waste of time, according to this thinking, because Cisco is spiraling down into oblivion.  Soon, it will all be code.  Learn Python instead.

I’d like to present my thoughts with a caveat.  I’m not always a great technology prognosticator.  When a friend showed me a web browser for the first time in 1994, I told him this Web thing would never take off.  Oops.  However, I have a lot of experience in the industry, and right now my focus at Cisco is programmability and automation.  I think I have a right to an opinion here.

First of all, there is no question that interest in automation and programmability is increasing.  All of the vendors, including my employer, are devoting a lot of resources towards developing and increasing their programmability and automation capabilities.  I’ve spent my first year here at Cisco working on automating network devices with Puppet, Ansible, Python, and our own tools like DCNM.  We’ve been working hard to build out YANG data models for our features.  I barely touched Expect scripting before I came here!  Customers are interested in managing their networks more efficiently, sadly, sometimes with fewer people.

Let me look at this from a wider perspective.  As computing power increases, are humans redundant?  For example, as a pilot I know that it’s entirely possible to replace human pilots with computers.  Air Traffic Control could relay instructions digitally to the computers that now control most airplanes.  With ILS and precision approaches, airplanes could land themselves at most airports.  But would you get into an airplane that had no human pilots?

The problem is, even in the age of Watson, computers react predictably to predictable circumstances, but predictably badly to unpredictable circumstances.  Many of the alleged errors that are introduced with CLI will still be introduced with NETCONF when the operator puts in the wrong data.  Scripts and automation tools have the power to replicate errors across huge numbers of devices faster than CLI.
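One practical consequence is that automation should check its input once, before fanning a change out to every device. Here is a minimal Python sketch of that idea. The function names and the `push` callable are hypothetical placeholders for whatever transport you use (NETCONF, an API, or plain CLI), and the checks shown are only examples of the sanity tests you might run.

```python
import ipaddress


def validate_network_input(vlan_id: int, gateway_cidr: str) -> list:
    """Return a list of problems; an empty list means the input looks sane."""
    problems = []
    if not 1 <= vlan_id <= 4094:
        problems.append(f"VLAN ID {vlan_id} is outside 1-4094")
    try:
        iface = ipaddress.ip_interface(gateway_cidr)
    except ValueError:
        problems.append(f"{gateway_cidr!r} is not a valid address/prefix")
    else:
        if iface.ip == iface.network.network_address:
            problems.append(f"{gateway_cidr} is a network address, not a host address")
    return problems


def push_to_all(devices, vlan_id, gateway_cidr, push):
    """Validate once, then call the caller-supplied `push` for each device.

    `push` is a hypothetical callable (device, vlan_id, gateway_cidr); this
    sketch makes no assumption about the transport behind it.
    """
    problems = validate_network_input(vlan_id, gateway_cidr)
    if problems:
        # Refuse to touch any device rather than replicate a bad value everywhere.
        raise ValueError("; ".join(problems))
    for device in devices:
        push(device, vlan_id, gateway_cidr)


print(validate_network_input(501, "192.168.1.1/24"))  # []
print(validate_network_input(501, "192.168.1.0/24"))
```

A check like this does not replace knowing what the network is supposed to do, but it keeps one typo from becoming a fabric-wide outage.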

At the end of the day, you still need to know what it is that you’re automating.  Networks cannot go away.  If all the Cisco and Juniper boxes out there vanished suddenly, our digital world would come to a screeching halt.  We still need people who understand what switches, routers, and firewalls do, regardless of whether they are managing them with CLI or Python.  Look at the cockpit of a modern airplane.  The old dial gauges have been replaced by flat-panel displays.  Do you think pilots no longer need to study weather systems, aerodynamics, and engine operation?  Of course they do!  Just the same, we need network engineers, a lot of them, to make networks work correctly.

As to whether Cisco itself goes away, who knows?  Obviously returning here, I have a great deal of faith in the company and its ability to execute.  I think that a lot of the industry hype is just hype.  Sure, some of the largest network operators will displace our hardware with Open Compute boxes.  Sure, it will hurt us.  But we are still in most networks and I don’t see that changing for a long time.

Closing thoughts, and a postscript

I began this series with an article on what I called “The CCIE Mystique.”  I don’t know if that quite exists the same way it did in 2001 when I started this journey, but I suspect it is still there at least a little bit.  Those of us who have been through it have had the veil pulled back a bit, and we see it for what it is:  a very hard, but ultimately very surmountable challenge.  A test that measures some, but not even close to all, of the skill required to be a network engineer.  A ticket to a better career, but not a guarantee of one.

Being back at Cisco has been an amazing experience.  This is the company that started my career, and that has built an entire industry.  I’d be happy to finish things out here, but I have no idea in this age of mass layoffs whether that will be possible.

I had an interesting chat with the head of the CCIE route/switch lab curriculum not long ago.  I must say that, for whatever complaints I have about the program, I was quite impressed with him and his awareness of where the program is strong and where it is lacking.  I have no doubt that those guys work very hard on the CCIE program, and it’s always easier to sit on the sidelines and complain.  Ten years ago, when I started this journey, I wouldn’t have believed I would be at Cisco, helping to develop and market products, with a lab stocked full of gear, chatting with a CCIE proctor.  Everyone’s path is different, and I certainly cannot promise my readers that investing in one certification will land them a pot of gold.  There were a lot of other factors in my career trajectory.  But at the end of a decade (actually 12 years as I write this), I can say that the time, money, and effort spent in pursuit of the elusive CCIE was well worth it.

As I close out this series I hope that my little autobiographical excursion was not too self-indulgent.  I hope that those of you who are new to the industry or studying for the exam got some insight from the articles.  I hope that any other old CCIE’s who stopped by relived a few memories.  Thanks for reading the series, and good luck with your studies!