
I think it’s fair to say that all technical marketing engineers are excited for Cisco Live, and happy when it’s over.  Cisco Live is always a lot of fun–I heard one person say “it’s like a family reunion except I like everyone!”  It’s a great chance to see a lot of folks you don’t get to see very often, to discuss technology that you’re passionate about with other like-minded people, to see and learn new things, and, for us TMEs, an opportunity to get up in front of a room full of hundreds of people and teach them something.  We all now wait anxiously for our scores, which are used to judge how well we did, and even whether we get invited back.

It always amazes me that it comes together at all.  In my last post, I mentioned all the work we do to pull together our sessions.  A lot of my TMEs did not do sessions, instead spending their Cisco Live on their feet at demo booths.  I’m also amazed that World of Solutions comes together at all.  Here is a shot of what it looked like at 5:30 PM the night before it opened (at 10 AM).  How the staff managed to clear out the garbage and get the booths together in that time I can’t imagine, but they did.

The WoS mess…

My boss, Carl Solder, got to do a demo in the main keynote.  There were something like 20,000 people in the room and the CEO was sitting there.  I think I would have been nervous, but Carl is ever-smooth and managed it without looking the least bit uncomfortable.

My boss (left) on the big stage!

The CCIE party was at the air and space museum, a great location for aviation lovers such as myself.  A highlight was seeing an actual Apollo capsule.  It seemed a lot smaller than I would have imagined.  I don’t think I would ever have gotten in that thing to go to the moon.  The party was also a great chance to see some of the legends of the CCIE program, such as Bruce Caslow, who wrote the first major book on passing the CCIE exam, and Terry Slattery, the first person to actually pass it.

CCIE Party

I delivered two breakouts this year:  The CCIE in an SDN World, and Scripting the Catalyst.  The first one was a lot of fun because it was on Monday and the crowd was rowdy, but also because the changes to the program were just announced and folks were interested in knowing what was going on.  The second session was a bit more focused and deeper, but the audience was attentive and seemed to like it.  If you want to know what it feels like to be a Cisco Live presenter, see the photo below.

My view from the stage

I closed out my week with another interview with David Bombal, as well as the famous Network Chuck.  This was my first time meeting Chuck, who is a bit of a celebrity around Cisco Live and stands out because of his beard.  David and I had already done a two-part interview (part 1, part 2) when he was in San Jose visiting Cisco a couple months back.  We had a good chat about what is going on with the CCIE, and it should be out soon.

As I said, we love CL but we’re happy when it’s over.  This will be the first weekend in a long time I haven’t worked on CL slides.  I can relax, and then…Cisco Live Barcelona!

 

While I’m thinking about another TAC Tale, I’m quite busy working on slides for Cisco Live.  I figured this makes for another interesting “inside Cisco” post, since most people who have been to the show don’t know much about how it comes together.  A couple years back I asked a customer if I could schedule a meeting with him after Cisco Live, since I was working on slides.  “I thought the Cisco Live people made the slides and you just showed up and presented them!” he said.  Wow, I wish that was the case.  With hundreds of sessions I’m not sure how the CL team could accomplish that, but it would sure be nice for me.  Unfortunately, that’s not the case.

If you haven’t been, Cisco Live is a large trade show for network engineers which happens four times globally: in Europe, Australia, the US, and Mexico.  The US event is the largest, but Europe is rather large as well.  Australia and Mexico are smaller but still draw a good crowd.  The Europe and US shows move around.  The last two years Europe was in Barcelona, as it will be next year, but it was in Berlin two years before that.  The US show is in San Diego this year, was in Orlando last year, and was in Las Vegas for two years before that.  Australia is always in Melbourne, and Mexico is always in Cancun.  I went to Cisco Live US twice when I worked for a partner, and I’ve been to every event at least once since I’ve worked at Cisco as a TME.

The show has a number of attractions.  There is a large show floor with booths from Cisco and partners.  There are executive and celebrity keynotes.  The deepest content is delivered in sessions–labs, techtorials, and breakout sessions which can have anywhere from 20 to several hundred attendees.  The sessions are divided into different tracks:  collaboration, security, certification, routing and switching, etc., so attendees can focus on one or more areas.

Most CL sessions are delivered by technical marketing engineers like myself, who work in a business unit, day in and day out, with their given product.  As far as I know anyone in Cisco can submit a session, so some are delivered by people in sales, IT, CX (TAC or AS), and other organizations.  Some are even delivered by partners and customers.

Six months before a given event, a “call for papers” goes out.  I’m always amused that they pulled this term from academia, as the “papers” are mostly PowerPoint decks and not exactly academic.  If you want to do a session, you need to figure out what you want to present and then write up an abstract, which contains not only the description, but also explains why the session is relevant to attendees, what they can hope to get out of it, and what the prerequisites are.  Each track has a group of technical experts who manage it, called “Session Group Managers”, or SGMs.  They come from anywhere in the business, but have the technical expertise to review the abstracts and sessions to ensure they are relevant and well-delivered.  For about a year, the SGM for the track in which I usually presented actually reported to me.  They have a tough job, because they receive far more applications for sessions than they have slots.  They look at the topic, quality of the abstract, quality of the speaker, available slots, and other factors in figuring out which sessions get the green light.

Once you have an approved session, you can start making slides.  Other than a standard template, there is not much guidance on how to build a deck for Cisco Live.  My old SGM liked to review each new presentation live, although some SGMs don’t.  Most of us end up making our slides quite close to the event, partly because we are busy, but also because we want to have the latest and most current info in our decks.  It’s actually hard to write up a session abstract six months before the event.  Things change rapidly in our industry, and often your original plan for a session gets derailed by changes in the product or organization.  More than once I’ve had a TME on my team presenting on a topic he is no longer working on!  One of my TMEs was presenting on Nexus switches several months after our team switched to Catalyst only.

At Cisco Live you may run into the “speaker ready room.”  It’s a space for speakers to work on slides, stocked with coffee and food, but there is also a small army of graphic design experts in there who will review the speakers’ slides one last time before they are presented.  They won’t comment on your design choices; they simply check that the slides are consistent with the template formatting.  We’re required to submit our final deck 24 hours before our session, which gives the CL staff time to post the slides for the attendees.

Standing up in front of a room full of engineers is never easy, especially when they are grading you.  If you rate in the top 10% of speakers, you win a “Distinguished Speaker” award.  If you score below 4.2 you need to take remedial speaker training.  If your score is low more than a couple times, the SGMs might ask you not to come back.  Customers pay a lot of money to come to CL and we don’t want them disappointed.  For a presenter, being scored, with high stakes attached to the number you receive, makes a CL presentation even more stressful.  One thing I’ve had to accept is that some people just won’t like me.  I’ve won Distinguished Speaker before, but I’ve had some sessions with less-than-stellar comments too.

The stress aside, CL is one of the most rewarding things we do.  Most of the audience is friendly and wants to learn.  It’s a fun event, and we make great contacts with others who are passionate about their field.  For my readers who are not Cisco TMEs (most of you, I suspect), I hope you have a chance to experience Cisco Live at least once in your career.  Now you know the amount of work that goes into it.

It seems to be rank heresy for someone working in the valley to say it, but let me say it anyways.  I don’t agree with the axiom of the technology industry which states that all technological progress is always good.  Many in our society instinctively realize this, which is why they oppose genetic engineering and plastics.  Still, the technology industry is so persistently in love with itself, and so optimistic about its potential to solve every human problem, that when anyone points out the consequences of technological progress, we quickly respond with AI’s potential to solve the problems it’s bound to create.  (Sorry for the long sentence, but I’m going to quote Plato in this essay, and by Platonic standards that last sentence is short.)  AI is the solution to everything.  AI will unlock the mysteries of human existence.  AI will allow human beings to live forever.  AI will cure cancer.  AI will solve the dangers of, well, genetic engineering and plastics.

An example of this is the extraordinarily concerning essay in the Wall Street Journal a few weeks ago by computer scientist and 60’s icon Jerry Kaplan.  Dr. Kaplan reviews the recent accomplishments of functional brain imaging technologies, which are starting to become more precise in identifying how people are feeling, and even which words they are thinking.  “With improved imaging technology, it may become possible to ‘eavesdrop’ on a person’s internal dialogue, to the extent that they are thinking in words,” says Kaplan.  With a predictable dose of technological optimism, Kaplan sees nothing concerning in the possibility of machines being able to read people’s minds.  Instead, he thinks it opens up a world of possibilities.  For example, in civil lawsuits it’s difficult to ascertain how much pain and suffering an individual has undergone, and hence to assign damages.  Why, we could use brain imaging and AI to calculate precisely how much somebody was harmed!

I may not hold a doctorate, but I spend a lot of time working with computers in the real world, not the world of researchers.  I’m skeptical that functional brain imaging will be able to read people’s minds, but the possibility is alarming.  In today’s era of instant social media “viral” lynching, we all have to be quite careful what we say.  Even with innocent intentions, a slip of the tongue can set off Twitter mobs that will destroy your life and career.  Now, even guarding your speech won’t help you.  You may walk by an AI mind-reading machine and have your life ruined for thought-crime.  And we’re celebrating this?  Even Dr. Kaplan’s scenario of determining pain and suffering in lawsuits is ludicrous.  How quickly will people learn to game the machine, produce artificial emotional trauma, and reap the rewards?

And now to Plato.  In his Phaedrus, Plato tells the story of an inventor named Theuth who came to the Egyptian king and was showing off some of his creations.  This was in the time before writing existed.  After showing the king a number of inventions, Theuth showed him letters and writing:

And when he was talking about writing, Theuth said:  “King, this learning will make the Egyptians wiser and give them better memories.  For, I have found a medicine of both memory and wisdom.”

The King said:  “Oh most artful Theuth!  While one person has the ability to create things skillfully, it takes another person to judge those things, and whether their use brings harm or help.  Now you, being the father of letters, through your love of them, have stated the opposite of their capability.  For, in the minds of those who learn this art, it will produce forgetfulness, by neglect of the memory, inasmuch as they have faith in writing, which consists of inscriptions outside of themselves, rather than remembering for themselves.”

Phaedrus 274E-275A.  (My own admittedly rough translation)

In other words, the inventor thinks writing will help memory, whereas the king points out it will hinder it!  I love this quote because it shows how the arrogance of inventors clouds their perception of their own inventions.  This is particularly true in Silicon Valley, where the pressure to always innovate removes any clear thinking about the consequences of the inventions. When confronted with the possibility of huge swaths of jobs being eliminated by their inventions, the lame response of the Silicon Valley innovators is to propose a universal basic income, hence making the movie Wall-E appear all too real.

This is a blog about network engineering, so how is this related?  Aren’t I involved in the automation of network systems?  Isn’t Cisco bringing AI to the world of networking?

Indeed we are, but I like to think that we’re a bit more realistic about it.  As a network engineer, and former TAC guy, I’ve spent countless hours doing nasty troubleshooting that, frankly, was hard and not particularly enjoyable.  Having executives looking over your shoulder, with the possibility of getting fired, with countless users freaking out, while trying to hunt down why the Internet just doesn’t work…  Well, if ML and AI will help me to locate the problem faster and restore operation to my network, I’m all for that.  If AI starts reading minds, I’m breaking out the tinfoil hat.

A lot of the blog posts I write begin with “I’m just too busy to blog these days!”  Luckily, I have dozens of drafts, so often blogging is just a question of cleaning up something I wrote a long time ago.  Still, I’d like to keep things going here even as life becomes more hectic at Cisco.  (I don’t know how things can get more hectic, but somehow they do each day!)

I don’t get many comments on this blog.  I think this is largely due to the fact that most of my readers are spambots.  However, I know there are a few out there who actually read and enjoy some of the posts.  For years I’ve required users to enter a name and email address to post a comment, and while many users just fill out fake information there, I’ve always thought it kept spam down.  This policy probably keeps genuine comments low too.  So, I’ve flipped the setting to allow anonymous comments.  I’ll test it for a few days, and if the spam is out of control I’ll flip it back.  My spam filtering software catches the vast majority of spam comments, so I hope it will continue to do its job with anonymous commenting.

The blog’s performance is also slow.  I’m looking at moving to a more fully managed offering from my hosting provider, since I don’t have time to muck around with WordPress trying to make it faster.  I also need to get a certificate installed, because people aren’t as comfortable with insecure web sites these days.

So, a few things going on here!

I’ve wanted to kick off a series for a while now on technical interviewing. Let me begin with a story.

My first job interview for a full network engineering role was at the San Francisco Chronicle in 2000. I had been working for five years in IT, mostly doing desktop and end-user support. I then decided to get a master’s degree in telecommunications management, which didn’t help at all, followed by a CCNA certification, which got me the interview.

My first interview was with the man who would be my boss. Henry was a manager who had almost no technical knowledge about networking, but I didn’t know that at the time. “Do you know Foundry switches at all?” Henry asked.

“No.” I was already worried.

“I doubted you would. That’s ok because we want to replace them all with Cisco and you know Cisco.” He pulled out a network diagram and handed it to me. “If you look at this, do you see a problem?” he asked.

I had never worked on a network larger than a couple switches, and now I was staring at a convoluted diagram depicting the network of the largest newspaper in Northern California. I was looking at subnet masks, link speeds, and hostnames, trying to find something wrong.

“I’m not sure,” I had to reply meekly.

He pointed at the main core switch for the network. There was only one, with no redundancy.  “There’s a huge single point of failure,” he said. I felt stupid missing the forest for the trees.

Henry brought me upstairs to interview with Tom, who was an on-site project management contractor from Lucent. I was extremely nervous–Lucent (later Avaya) was a big name in the industry and this guy worked for them! Henry left me with Tom. Tom pulled out a copy of the same diagram Henry had shown me earlier.

“Do you notice anything wrong with this?” he asked.

“Wow, that’s a huge single point of failure,” I replied.

He nodded his head in approval. “That’s right–very good.” He asked me a technical question about supernetting. I answered nervously, although it quickly became clear I knew more than he did.

The door flew open and another guy named Vincent walked in. He was the desktop support contractor, but again I didn’t know that. “Ask Jeff a few technical questions,” Tom said.

“Question number one,” said Vincent. “If you were running a network this size, would you subnet it?”

Now the answer seemed obviously to be “yes”, but I was trying to figure out if this was a trick. “Yes,” I answered, deciding to play it safe.

“Good! Next question: Can you route NetBIOS?” My desktop years were almost exclusively dedicated to Macs and I didn’t even know what NetBIOS was. I figured it was a 50/50 shot, and the way he asked it seemed to suggest the answer.

“No,” I said, trying to sound confident.

“He’s good,” said Vincent.

Next, the door flew open again, and in walked Bing. Bing was carrying some sort of network device with her. She handed it to me. “Is this a switch or a hub?” she asked. There was no obvious labeling on it, and as I turned the device over and over again in my hands, I had a sinking feeling.

“I don’t know,” I replied.

“Look at this,” Bing said. She pointed to a collision light. “Since there is one on each port, you can tell this is a switch.”

We don’t have collision lights on switches any more, but at the time we did and she had a valid point. Realizing this, I explained to her that since a hub has a single collision domain, it would only have one collision light. I explained the concept of a collision domain, and how a switch worked versus a hub. It turns out she was a project manager for desktop support and she didn’t know any of that. Someone had just shown her the collision light thing and she thought it would be a good question.

“He’s good,” said Bing.

Tom had told me my next interview would be with a CCIE from Lucent. Now that was definitely intimidating. I knew of the reputation of CCIEs, and I didn’t expect to do well. The CCIE guy never showed up. As Tom was walking me to the elevator, however, we ran into him in the hallway. It turns out that Mike, who is still a friend of mine, and who later got three CCIE’s, had not passed the exam yet. We ended up talking about his home lab for a few minutes.

“He’s good,” said Mike. And a good thing too, as I’ve been in a couple of interviews with Mike and I’ve seen him grill people mercilessly.

I got a call with a job offer a few days later, and ended up working there five years.

For a while now I’ve had several posts in my drafts folder on the subject of technical interviewing. As you can see from the above story, interviews are often chaotic, disorganized, and conducted by unqualified people who have no plan. In the case of the San Francisco Chronicle, they made the right decision on me, and I don’t think anybody there would dispute that. I was thankful to begin my career in network engineering.

That said, I’ve had other interviews that didn’t go so well. Over the next few articles, I’d like to cover technical interviewing. Why do we interview people? How can we distinguish good candidates from bad ones? How worthwhile are the typical technical questions? Are gotchas worth throwing out “just to see how the candidate reacts”? Are interviews purely subjective, or can we make them data-driven and objective?

I’ll throw out a few more anecdotes from my own experience to illustrate my points–feel free to comment with some of your own!

I worked for two years at a Cisco Gold Partner.  The first year was great.  We were trying to start up a Cisco practice in San Francisco (they were primarily a Citrix partner before), so my buddy and I wined and dined Cisco channel account managers, trying to impress them with our CCIE’s and get them to steer business our way.  Eventually, the 2009 financial crisis hit and business started to dry up.  The jobs became fewer and less interesting.  I had two CCIE’s, and at one point I drove out to Mare Island near San Francisco to install a single switch for a customer whose entire network consisted of–a single switch.  I always recommend that people not stay in jobs like this too long, as it hurts your prospects for future employment.

Potential Employer:  “So what kind of jobs have you done lately?”

You:  “Uh, I installed one switch at a customer.”

Anyhow, we had one other customer that managed to keep me surprisingly busy, considering their network was quite small as well.  They were a local builder with three small offices connected together by ASAs and VPN tunnels.  The owner was filthy rich and also paranoid about security, which meant I was out there a lot changing passwords, tightening up ACLs, and cleaning up the mess the last network engineer had left.

The owner had a ranch near Willits, CA which was reputed to be the size of the city of Concord, CA.  He also had two jets to take him to his private landing strip at the ranch.  Being a pilot myself, the prospect of a trip in a small jet to his ranch made me wish for some sort of network problem up there.  However, there wasn’t much up there for me to work on.  He had a single ASA 5505 connected to a satellite uplink, which he primarily used to connect to the cameras (which he had everywhere) at the ranch.

One day, my contact at the builder told me the cameras weren’t reachable.  Yes!  Finally a trip in the jet.  We set a date and I spent my time wondering whether I’d get the Lear or the Citation.

Unfortunately, when the day rolled around, the weather was hideous.  A Lear jet can handle most any weather, but the little airstrip had no instrument approaches.  Instead, my contact gave me an alternative:  I was to drive up there with her in-house cabling contractor (I’ll call him “Tim”) to do the job.  (I never understood why a business this small had an in-house cabling contractor.  As far as I knew he didn’t work on the actual construction projects associated with the company.)  Now, from San Francisco, the drive to Willits is about 2.5 hours.  However, the ranch was only near Willits.  After driving 2.5 hours to Willits, we had another hour’s drive over dirt roads to the middle of nowhere.

The cabling contractor was exactly the sort of person with whom I have nothing in common, and spending 3.5 hours in a car with him, in the era before smartphones were around as a handy distraction, was painful.  Tim loved fishtailing his truck as we drove on dirt roads on the side of a mountain.  I think he also liked just scaring the white collar guy.  It worked.

We arrived at the ranch and Tim opened up the back of his pickup.  “Can you give me a hand here?” he asked.  In the bed of his truck were several large carpet rolls and piles of dry cleaning.  I grabbed one end of a carpet roll and began the backbreaking work.  My company was billing me out at $250/hour to haul some lady’s dry-cleaning into her ranch.

The ASA itself was mounted on a pole in the middle of the property, which had a satellite dish on top.  I was amazed the ASA 5505 even functioned out there, given that the external temperature could reach over 100 degrees Fahrenheit.  The metal box housing the ASA was like an oven.  I consoled into it and immediately saw a problem.  Latency on the link was over one second round-trip.  There was no way he was going to get real-time video streaming with this slow satellite uplink.  I reported my findings to Tim and, after eating lunch with the ranch hands, we hopped back in the truck.  Tim put on a song called “You piss me off, f*cking jerk” while we drove.  I guess he didn’t like me.

When I mentor people, I often tell them you have to know the right time to quit a job.  There were several signs in this story that it was time for a change.  With two CCIEs, installing a single switch or working on a single ASA 5505 was not really a good use of my skills.  Neither was moving in carpet rolls and dresses for $250/hour.  Luckily I had enough big jobs at the partner that I managed to get through my interviews at Juniper without trouble.

Meanwhile, a few years later I read about the FBI raiding the builder who was my customer.  I guess he had good reasons for cameras.

 

I recently replied to a comment that I think warrants a full blog post.

I’ve been here at Cisco working on programmability for a few years.  Brian Turner wrote in to say, essentially:  Hang on!  I became a network engineer precisely because I don’t want to be a coder!  I tried programming and hated it!  Now you’re telling me to become a programmer!

As I said in my reply, I have a lot of sympathy for him.  It reminds me of a story.

Back when I was at Juniper, I met with the IT department’s head of automation to discuss using some of his tools for network automation.  Jeremy was an expert in all things Puppet and Ansible, and a rather enthusiastic promoter of these tools on the server/app side of the house.  He had also managed to get Puppet running on a Junos device.  I was meeting with him because, frankly, the wind seemed to be blowing in his direction.  That said, I did not share his enthusiasm.  He told me about a server guy he had worked with, Stephane.  When Jeremy proposed to Stephane that he should use automation tools to make his life easier, Stephane vehemently rejected the idea, and the meeting ended with Stephane banging his fists on the table and shouting “I am not a coder!”

Flash forward a couple years and Stephane ended up the head of automation for a major company.  Apparently he finally bought into the idea.

Frankly, I had no desire to become a coder either.  When I interviewed at Cisco, most of my discussions were around the controllers I was working with at the time, data center fabrics, etc.  When I arrived, my new boss assigned me as his Principal TME for programmability.  I never claimed to be an expert in this area.  Two months later I was presenting to Tech Field Day, and to experienced automation guys like Jason Edelman and Matt Oswalt, on how to run Puppet on a Nexus switch.  Three years later, I’m known as a NETCONF/YANG guy.  I’d barely heard of them when I started.

As I replied to Brian, Cisco doesn’t want him or anyone to learn Python or YANG or whatever.  Think about it from my perspective in product management.  Implementing YANG models for all of IOS XE is a massive undertaking.  Engineering devoted a huge amount of effort to pull this off.  Huge.  Mandating YANG models for their ongoing development burns cycles.  Product marketing and engineering would never prioritize this unless we thought there was a high probability someone would use it.  In other words, we don’t want people to use it so much as customers want us to develop it.  We have demand for programmable interfaces for network devices, and hence we’ve delivered on it.   My job as a TME is not to push NETCONF/YANG on anyone, but to provide the enablement to make it easier for someone to use this technology if they themselves want to.

As I often say in my presentations, the why is important.  Why do some customers demand these interfaces?  Well, because they know Notepad is a horrible automation tool, and it’s what 90% of network engineers use.  If you want to configure 50 switches, you’re going to configure one, paste the config into Notepad, tweak a few values, and then paste it into the next switch.  Do this 48 more times and tell me if this is the best use of your time as a highly skilled network engineer.  You can write a script to do this and save yourself a lot of trouble.  Or use Ansible to do it.  Or Cisco DNAC.  Whatever you want.  But if you want any of these tools to work efficiently, you need a machine interface, which CLI is not.  If you don’t believe me, try writing a script to do regular expression-based parsing of CLI outputs.  It’s a lot easier with YANG.
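To make that last point concrete, here is a minimal sketch of what a machine interface looks like in practice, using the open-source ncclient library and the standard ietf-interfaces YANG model.  The management address and credentials below are placeholders, and the device would need NETCONF enabled (on IOS XE, the netconf-yang command); treat it as an illustration rather than a recipe.

# Minimal sketch: pull interface data over NETCONF instead of screen-scraping CLI.
# The address and credentials below are placeholders, not a real device.
from ncclient import manager

INTERFACES_FILTER = """<filter>
  <interfaces xmlns="urn:ietf:params:xml:ns:yang:ietf-interfaces">
    <interface>
      <name/>
      <description/>
    </interface>
  </interfaces>
</filter>"""

with manager.connect(
    host="192.0.2.10",       # placeholder management address
    port=830,                # standard NETCONF-over-SSH port
    username="admin",        # placeholder credentials
    password="admin",
    hostkey_verify=False,
) as m:
    # The reply is structured XML keyed to the ietf-interfaces YANG model,
    # so there is no regular-expression parsing of human-formatted output.
    reply = m.get_config(source="running", filter=INTERFACES_FILTER)
    print(reply.xml)

The same connection can push configuration back with edit-config, which is what makes templated, repeatable changes across 50 switches practical.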

The point is not for network engineers to become programmers.  The point is to add some tools to your toolbox to help you focus on what you do well.  One weekend spent with a Python course and one more weekend with a DevNet course on YANG will give you a tool you can use to make your life easier.  That’s it.  Some customers may take it a lot further, of course, and go way into CI/CD workflows and that’s fine.  If you want to do 95% of your work in CLI and write a few scripts to do the other 5%, that’s fine.  If you want to use Cisco DNAC to do almost everything, knock yourself out.  It’s about what works best for you, as a network engineer.

I often point out how lousy my code quality is.  I’m sometimes ashamed to show the code for some of the scripts I’ve written.  I’m not a coder!  That’s a point I often make.  I don’t want to be a full-time software developer.  I’m a network engineer.  So for Brian and all the other CCIE’s out there, keep doing what you do best, but don’t close yourself off to some additional tools that will make your life easier.

I’ve mentioned before that EIGRP SIA was my nightmare case at TAC, but there was one other type of case that I hated–QoS problems.  Routing protocol problems tend to be binary.  Either the route is there or it isn’t;  either the pings go through or they don’t.  Even when a route is flapping, that’s just an extreme version of the binary problem.  QoS is different.  QoS cases often involved traffic that passed some of the time, or in certain amounts, but ran into trouble when traffic of a different size or volume came through, or traffic that dropped at some particular rate.  Thus, the routes could be perfectly fine, pings could pass, and yet QoS was behaving incorrectly.

In TAC, we would regularly get cases where the customer claimed traffic was dropping on a QoS policy below the configured rate.  For example, if they configured a policing profile of 1000 Mbps, sometimes the customer would claim the policer was dropping traffic at, say, 800 Mbps.  The standard response for a TAC agent struggling to figure out a QoS policy issue like this was to say that the link was experiencing “microbursting.”  If a link is showing an 800 Mbps traffic rate, this is actually an average rate, meaning the link could be experiencing short bursts above this rate that exceed the policing rate, but are averaged out in the interface counters.  “Microbursting” was a standard response to this problem for two reasons:  first, it was most often the problem;  second, it was an easy way to close the case without an extensive investigation.  The second reason is not as lazy as it may sound, as microbursts are common and are usually the cause of these symptoms.
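To picture it with made-up numbers: suppose the one-second counter shows 800 Mbps against a 1000 Mbps policer, but the traffic actually arrives in short bursts.  A quick back-of-the-envelope sketch:

# Toy illustration with made-up numbers: a healthy-looking one-second average
# can hide 10 ms bursts that blow past the policer.
policer_rate_mbps = 1000

# Hypothetical per-10ms rates: mostly quiet, with short bursts well above the
# policing rate. 100 buckets of 10 ms = 1 second of traffic.
bucket_rates_mbps = [200] * 80 + [3200] * 20

average = sum(bucket_rates_mbps) / len(bucket_rates_mbps)
buckets_over = sum(1 for rate in bucket_rates_mbps if rate > policer_rate_mbps)

print(f"one-second average: {average:.0f} Mbps")          # 800 Mbps: looks fine
print(f"10 ms buckets over the policer: {buckets_over}")  # 20: these see drops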

Thus, when one of our large service provider customers opened a case stating that their LLQ policy was dropping packets before the configured threshold, I was quick to suspect microbursts.  However, working in high-touch TAC, you learn that your customers aren’t pushovers and don’t always accept the easy answer.  In this case, the customer started pushing back, claiming that the call center which was connected to this circuit generated a constant stream of traffic and that he was not experiencing microbursts.  So much for that.

This being the 2000’s, the customer had four T1’s connected in a single multi-link PPP (MLPPP) bundle.  The LLQ policy was dropping traffic at one quarter of the threshold it was configured for.  Knowing I wouldn’t get much out of a live production network, I reluctantly opened a lab case for the recreate, asking for two routers connected back-to-back with the same line cards, a four-link T1 interconnection, and a traffic generator.  As always, I made sure my lab had exactly the same IOS release as the customer.

Once the lab was set up I started the traffic flowing, and much to my surprise, I saw traffic dropping at one quarter of the configured LLQ policy.  Eureka!  Anyone who has worked in TAC will tell you that more often than not, lab recreates fail to recreate the customer problem.  I removed and re-applied the service policy, and the problem went away.  Uh oh.  The only thing worse than not recreating a problem is recreating it and then losing it again before developers get a chance to look at it.

I spent some time playing with the setup, trying to get the problem back.  Finally, I reloaded the router to start over and, sure enough, I got the traffic loss again.  So, the problem occurred at start-up, but when the policy was removed and re-applied, it corrected itself.  I filed a bug and sent it to engineering.

Because it was so easy to recreate, it didn’t take long to find the answer.  The customer was configuring their QoS policy using bandwidth percentages instead of absolute bandwidth numbers.  This meant that the policy bandwidth would be determined dynamically by the router, based on the links the policy was applied to.  It turns out that IOS was calculating the bandwidth numbers before the MLPPP bundle was fully up, and hence was using only a single T1 as the reference for the calculation instead of all four.  The fix was to change the order of operations in IOS, so that the MLPPP bundle came up before the QoS policy was applied.
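For illustration, here is the back-of-the-envelope math (the priority percentage is hypothetical; the customer’s exact policy values don’t matter).  With a percentage-based policy, the absolute rate depends entirely on the reference bandwidth at the moment the policy is attached:

# Back-of-the-envelope sketch of the bug. The priority percentage here is
# hypothetical; what matters is the reference bandwidth used for the math.
T1_KBPS = 1544
LINKS_IN_BUNDLE = 4
PRIORITY_PERCENT = 25  # hypothetical LLQ percentage

intended_ref = T1_KBPS * LINKS_IN_BUNDLE   # full MLPPP bundle: 6176 kbps
early_ref = T1_KBPS                        # bundle not yet up: a single T1

intended_rate = intended_ref * PRIORITY_PERCENT / 100  # 1544 kbps
buggy_rate = early_ref * PRIORITY_PERCENT / 100        # 386 kbps

print(f"intended LLQ rate: {intended_rate:.0f} kbps")
print(f"rate computed before the bundle was up: {buggy_rate:.0f} kbps "
      f"({buggy_rate / intended_rate:.0%} of intended)")

Whatever the percentage, the early calculation comes out to exactly one quarter of the intended rate, which is precisely what the customer was seeing.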

So much for microbursts.  The moral(s) of the story?  First, the most obvious cause is often not the cause at all.  Second, determined customers are often right.  And third:  even intimidating QoS cases can have an easy fix.

I was doing well on the blog for a few months, but lately I’ve fallen behind.  With (now) 12 people reporting to me, and three major areas of responsibility (SD-Access, Assurance, and Programmability), it’s not easy to find time to write up a blog post.  I have about five drafts needing work but I cannot seem to find the will to finish them.  Sometimes, however, it just takes a spark to get me going.  That spark came in my inbox from Ivan Pepelnjak.  I like Ivan’s blog posts, which, while often not favorable to Cisco, are nonetheless fair and balanced and raise some very important points.

“Why Is Every SDN Vendor Bashing Networking Engineers?” asks Ivan in the form email I received.  “[T]he vendors know they wouldn’t be able to sell their latest concoctions to people who actually understand how networking works and why some architectures have no chance of ever working in real life,” answers Ivan.  “The only way to sell the warez is to try to convince everyone else how to get rid of the pesky ossified CLI jockeys.”

Now I work for a vendor, and since I deal with the aforementioned products, I guess I am an SDN vendor.  That would seem to qualify me to speak on this subject.  (With, of course, the usual disclaimer that the opinions here are my own and do not represent Cisco officially.)

Selling Concoctions

I must admit, I do want to sell our products.  Everyone at Cisco should want our products to sell.  Just about all of us have a personal, financial stake in the matter, whether through stock grants or the ESPP.  We would be insane not to want people to buy our products.  I, and most of my co-workers, are driven by far more than money, however.  We all want to know that our work means something, and that we are coming up with innovative solutions to problems.  Otherwise, why show up in the office every day?

We operate in a highly competitive environment, which means if we are not constantly innovating and coming up with better ways to do things, we will all suffer.  You can complain about the macroeconomic system, and believe me, I’m not a Randian, objectivist believer in unbridled capitalism.  But, at the end of the day, a public company needs to create the perception of future value in the eyes of the stock market, and that’s a motivating factor for all of us.

These things being said, I’ve been in product management for a few years now and I have never heard anyone, ever, talk about trying to put one over on our customers.  I’m not saying that’s what Ivan means here, but it’s an accusation I’ve heard before.  In the first place, our customers are network engineers who are quite smart.  Whenever I’ve presented to a customer and wasn’t crystal clear about what I was talking about and what advantage it would bring them, they let me know it.  We’re constantly trying to find ways to do things better and make our customers’ lives easier.  As somebody who worked in IT for more years than I’ve worked in product management, I’m very interested in this subject.  There were a lot of things that frustrated me back then, and I want to fix the things that used to annoy me.  You can argue about whether we’ve come up with the right ideas, but I hope nobody questions our motivations.

CLI Jockeys

Do I bash CLI jockeys in order to sell my products?  I should hope not, given that most of my customers are CLI jockeys, as I am myself!  I have two CCIEs and a JNCIE.  I spent a couple years in routing protocols TAC and many years in IT.  I spent a long time learning my trade and I have a lot of respect for those who have put the time and effort into learning it as well.  It’s not easy.

However, I don’t operate under the delusion that network engineers do a good job of managing configuration through the CLI.  When I was at Juniper, I had designed a new NGMVPN system for our WAN.  I handed it off to the implementation team with some sample configs and asked them to come back to me with their plan.  I think we were touching about 20 devices the first go around.  The engineer came back with 20 Word documents.  He took my sample config, copied and pasted it into Word, and then modified the config in a separate Word doc for each CE/PE he was touching.  The CLI itself isn’t the problem; how we manage it is.  This is where programmability and automation tools come in.  At the very least, templating, whether with Ansible or a simple script, would have made this easier.  Software-Defined Networking (a very loose term, for what it’s worth) is not about replacing ossified CLI jockeys but about getting them to focus on what they should be doing (network engineering) and avoiding what they should not (pasting stuff into Word docs).
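To give a rough idea of what templating buys you, here is a minimal sketch in Python using Jinja2 (the same templating engine Ansible uses under the hood).  The hostnames, interfaces, and addressing are invented for illustration; the point is that one template plus a small table of per-device values replaces twenty Word documents.

# A minimal sketch of config templating with Python and Jinja2. Hostnames,
# interfaces, and addressing are invented for illustration.
from jinja2 import Template

CONFIG_TEMPLATE = Template("""\
hostname {{ hostname }}
interface {{ uplink }}
 description Uplink to {{ upstream }}
 ip address {{ ip }} {{ mask }}
""")

# One row per device: the only things that actually differ between them.
devices = [
    {"hostname": "pe-site1", "uplink": "GigabitEthernet0/0",
     "upstream": "core-1", "ip": "10.0.1.1", "mask": "255.255.255.252"},
    {"hostname": "pe-site2", "uplink": "GigabitEthernet0/0",
     "upstream": "core-1", "ip": "10.0.1.5", "mask": "255.255.255.252"},
]

for device in devices:
    print(CONFIG_TEMPLATE.render(**device))
    print("!")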

SD-Access takes this quite a bit further than Ansible, NETCONF, and other device-level tools.  Rather than saying “I want this device to be a LISP MS/MR” and so forth, you just say “I want this device to be a control plane node” and the system figures out what you need.  Theoretically we could change from LISP to some other protocol and the end-user shouldn’t even notice.  The idea here is somewhat like a fly-by-wire system.  A pilot’s controls used to be coupled directly to the control surfaces via hydraulics.  Now, the pilot is operating what is essentially a joystick, providing control inputs to a computer, which then computes the best way to move the control surfaces given the conditions.  This is then relayed to servo motors in the wings, tail, etc.  The complexity of a fly-by-wire system is much higher than an old hydraulic system, but the complexity is hidden from the pilot in order to provide a better experience.  Likewise, with SD-Access, we’ve made the details more complex in order to deliver a better experience (TrustSec, layer 3 routed backbone, etc.) while hiding the complexity from the user.  It’s a different approach, for sure, but the idea is to allow engineers to focus on the right problems, like how to design their network, and not worry so much about configuration.
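If you want a feel for the abstraction idea (and only the idea; this is not how SD-Access or DNA Center is actually implemented), here is a toy sketch in Python.  The role names and mappings are invented; the point is that the operator expresses intent, and the mapping to underlying protocol machinery stays an internal detail that can change underneath.

# Toy sketch of intent-style abstraction. Illustrative only: this is NOT how
# SD-Access or DNA Center is implemented, and the role-to-function mapping
# below is invented. The operator states a role; the underlying protocol
# machinery is an internal detail that can change without the operator noticing.
ROLE_TO_FUNCTIONS = {
    "control-plane-node": ["lisp-map-server", "lisp-map-resolver"],
    "border-node": ["lisp-pxtr", "bgp-handoff"],
    "edge-node": ["lisp-xtr", "anycast-gateway"],
}

def render_intent(device, role):
    """Expand a high-level role into the functions to enable on a device."""
    return [f"{device}: enable {fn}" for fn in ROLE_TO_FUNCTIONS[role]]

for line in render_intent("fabric-switch-1", "control-plane-node"):
    print(line)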

A New Era?

I’ve written extensively (see, for example, here, and here) about the role for CLI-jockey network engineers in the future.  When airplanes switched from the old dials and gauges to sleek, modern computerized (glass) cockpits, I’m sure some old timers threw up their hands, retired, and got their old Piper Super Cubs out of the hangar to do some “real” flying.  But most adapted, and in the end, saw how the new automation systems helped them do their jobs better.  That’s an era I’m looking forward to.  And as I always, always say, the pilots who fly the new cockpits still need to understand weather systems, engines, navigation, etc.  We still need network engineers who know how networks operate.

Meanwhile, I won’t bash any CLI jockeys and I hope nobody else here does either.

My first full-time networking job was at the San Francisco Chronicle.  Now there isn’t much to the Chronicle anymore, but in the early 2000’s the newspaper was still going strong.  It was the beginning of the decline, but most people still took their local newspaper as their primary source of news.  Being a network engineer at a major metropolitan newspaper was fascinating.  It is a massive operation to print and distribute a newspaper every single day, and you can never, ever, miss.  There is no slippage of production deadlines.  It has to be out every day, and every day you start all over, with a blank page.

As the lead network engineer, I touched everything from editorial (the news and photography content of the paper) to advertising, pre-press, production systems, and circulation.  Every one of these was critical.  If editorial content didn’t make it through, there was nothing to go into the paper.  If advertising didn’t make it in, we didn’t earn revenue.  If pre-press or production had problems, the paper wasn’t printed.  If circulation wasn’t working, nobody could get their paper.

The Chronicle owned and operated three printing plants in the Bay Area.  One was on Army Street in San Francisco, while the other two were in Union City and Richmond in the East Bay.  The main office was on Fifth and Mission in downtown SF, so the paper was prepared in San Francisco and then sent to the plants via microwave.  That’s where I came in.

Our microwave system used a dish on the clock tower of our building.  From 5th and Mission we sent a signal up to Roundtop Mountain in the East Bay hills. At Roundtop we leased space in a little concrete bunker that was used for various kinds of radio communication including cellular.  From Roundtop we bounced the signal back to the three printing plants.

Chronicle building with the microwave visible on the clock tower

The microwave presented itself to us as T1 lines.  I had the T1 lines connected to dual routers at the main site and each of the plants.  In addition to the microwave, we had two additional backup T1’s to each plant, which were landlines from different carriers with diverse paths into the buildings.  We kept the microwave and the first backup T1 plugged into the routers, with the second backup on manual standby in case we needed it.  You don’t take chances with production in a newspaper, and we had triple redundancy on everything.  I used OSPF for redundancy between the microwave and the #1 backup circuit on the routers, and HSRP for gateway redundancy.  With only four sites it was a simple enough topology and it never gave me much trouble.

Until, that is, the day I got a call from our operations center that the primary circuits were all down.  We were running on backups.  I immediately called up the production systems engineer who managed the microwave and told him his circuits were down.  “Impossible!” he said, “that microwave is five-nines reliable.  Check your router!”  I tried a few of the usual tricks:  a shut/no shut on the interface, changing the line encoding, etc.  No go.  He wanted me to start swapping hardware, which was a big deal in a live newspaper environment, and seemed pointless.  If it was hardware, why would all of the circuits be down?

We bickered a bit before I moved to have the tertiary backup circuits swapped in so we had automatic failover while we worked on the microwave.  I got out our old T-berd tester to see if I could find any indication of the problem.  Then the systems engineer called:  “We need to meet at the clock tower, I’ve found the problem,” he said.  It’s always a relief to hear that when finger pointing is going around.

T-berd T1 Tester

I showed up at the entrance to the tower and followed the systems guy up a rusty ladder mounted to the wall.  Up in the tower there were bird droppings and as I climbed higher I fought the urge to look down.  I’ve never much liked heights and being out of shape and relying on my own strength to keep from falling several stories onto concrete was not promising.  Once I got to the top there was a large separation between the ladder and the floor, and I fought the urge to panic as I flung my leg way over to climb onto the concrete flooring.  From there we went outside and I saw the problem right away.

If you’ve ever been to a convention in San Francisco, chances are it took place in the Moscone Center.  In the early 2000’s, the city decided to expand Moscone by building a new Moscone Center West on 4th and Howard streets.  And from up on the clock tower it was plain as day:  they had built a cooling tower on the roof right in the path of our microwave beam.  I looked at the systems guy and said, “Well, I guess you could make popcorn in that cooling tower.  Anyways, there goes your five nines.”

We hastily called meetings together to decide what to do.  Sue the city?  Call the FCC?  Find another building to bounce the microwave off of?  Those were long term solutions but we had an immediate problem.  Two circuits might seem like enough, but they were telco circuits and not as reliable as the microwave was, at least when its path wasn’t blocked.

Getting the city to cut the cooling tower off Moscone West was a non-starter, especially when it was the newspaper asking, a newspaper that made its money being critical of city officials.  So, we decided to lease roof space from another building and add an additional repeater.  However, this was a long process.  We needed to negotiate with the landlord, replan the radio deployment, license it and obtain permits, add the new repeater, and re-point the old dish to the new building.  That last item was not as simple as it sounded, since this wasn’t a DirecTV dish.  It was welded to the tower, so we needed to hire ironworkers to cut it off and re-position it.

Meantime, we ordered T1’s from downtown SF up to Roundtop to bypass the segment that wasn’t working.  We’d go hard-wired to Roundtop, then microwave the rest of the way.  This was not, by any means, an ideal solution, nor was it an overnight solution, but we could at least get some redundancy faster than it would take to add the repeater.  I’m glad we did, because shortly after the microwave went down we started having terrible problems with the landlines and needed the triple redundancy.

If you drive by Fifth and Mission now, the microwave dish is gone from the clock tower.  The Chronicle, a shadow of its former self, no longer operates its own printing plants, and has a circulation far smaller than it did in 2004, when I left.  As I said in my last post, it’s great to have a sense of purpose when you work in IT.  It wasn’t about fixing a microwave, but about getting that paper into the hands of our readers.  I’m thankful I got to be a part of that for a few years, even if it cost me some vertigo and sleepless nights.