
I’ve been in this industry a while now, and I’ve done a lot of jobs.  Certainly not every job, but a lot.  My first full-time network engineering job came in 2000, but I was doing some networking for a few years before that.

I often see younger network engineers posting in public forums asking about the pros and cons of different job roles.  I’ve learned over the years that you have to take people’s advice with a grain of salt.  Jobs in one category may have some common characteristics, but a huge amount is dependent on the company, your manager, and the people you work with.  However, we all have a natural tendency to try to figure out the situation at a potential job in advance, and the experience of others can be quite helpful in sizing up a role.  Therefore, I’ve decided to post a summary of the jobs I’ve done, and the advantages/disadvantages of each.

IT Network Engineer

Summary:
This is an in-house engineer at a corporation, government agency, educational institution, or really any entity that runs a network.  The typical job tasks vary depending on level, but they usually involve a large amount of day-to-day network management.  This can be responding to complaints about network performance, patching in network connectivity (less so these days because of wireless), upgrading and maintaining devices, working with carriers, etc.  Larger scale projects could be turning up new buildings and sites, planning for adding new functionality (e.g. multicast), etc.

Pros:

  • Stable and predictable work environment.  You show up at the same place and know the people, unlike consulting.
  • You know the network.  You’re not showing up a new place trying to figure out what’s going on.
  • It can be a great chance to learn, if the company is growing and funding new projects.

Cons:

  • You only get to see one network and one way of doing things.
  • IT is a cost center, so there is a constant desire to cut personnel/expenses.
  • Automation is reducing the type of on-site work that was once a staple for these engineers.
  • Your fellow employees often hate IT and blame you for everything.
  • Occasionally uncomfortable hours due to maintenance windows.

Key Takeaway:
I often tell people that if you want to do an in-house IT job, try to find an interesting place to work.  Being an IT guy at a law firm can be kind of boring.  Being an IT guy at the Pentagon could be quite interesting.  I worked for a major metropolitan newspaper for five years (when there was such a thing) and it was fascinating to see how newspapers work.  Smaller companies can be better in that you often get to touch more technologies, but the work can be less interesting.  Larger companies can pigeonhole you into a particular area.  You might work only on the WAN and never touch the campus or data center side of things, for example.

Technical Support Engineer

Summary:
Work at a vendor like Cisco or Juniper taking cases when things go wrong.  Troubleshoot problems, recreate them in the lab, file bugs, find solutions for customers.  Help resolve outages.  See my TAC Tales for the gory details.

Pros:

  • Great way to get a vast amount of experience by taking lots of tough cases
  • Huge support organization to help you through trouble
  • Short-term work for the most part–when you close a case you’re done with it and move on to something new
  • Usually works on a shift schedule, with predictable hours.  Maintenance windows can often be handed off.

Cons:

  • Nearly every call involves someone who is unhappy.
  • Complex and annoying technical problems.  Your job is 100% troubleshooting and it gets old.
  • Usually a high case volume which means a mountain of work.

Key Takeaway:
Technical Support is a tough and demanding environment, but a great way to get exposure to a constant stream of major technical issues.  Some people actually like tech support and make a career out of it, but most I’ve known burn out after a while.  I wouldn’t trade my TAC years for anything despite the difficulties, as it was an incredible learning experience for me.

Sales Engineer

Summary:
I’ve only filled this role at a partner, so I cannot speak directly to the experience inside a company like Cisco (although I constantly work with Cisco SEs).  This is a pre-sales technical role, generally partnered with a less-technical account manager.  SEs are ultimately responsible for generating sales, but they act as consultants or advisers to the customer to ensure they are selling something that fits.  SEs do the initial architecture of a given solution, work on developing the Bill of Materials (BoM), and in the case of partners, help to write the Statement of Work (SoW) for deployment.  SEs are often involved in the deployment of the solutions they sell, but it is not their primary job.

Pros:

  • Architectural work is often very rewarding; it’s a great chance to partner with customers and build networks.
  • Often allows working on a broad range of technologies and customers.
  • Because it involves sales, usually good training on the latest technologies.
  • Unlike pure sales (account managers in Cisco lingo), a large amount of compensation is salary so better financial stability.
  • Often very lucrative.

Cons:

  • Like any account-based job, success/enjoyability is highly dependent on the account(s) you are assigned to.
  • Compensation tied to sales, so while there are good opportunities to make money, there is also a chance to lose a lot of discretionary income.
  • Often take the hit for poor product quality from the company whose products you are selling.
  • Because it is a pre-sales role, often don’t get as much hands-on as post-sales engineers.
  • For some products, building BoM’s can be tedious.
  • Sales pressure.  Your superiors have numbers to make and if you’re not seen to be helping, you could be in trouble.

Key Takeaway:
Pre-sales at a partner or vendor can be a well-paying and enjoyable job.  Working on architecture is rewarding and interesting, and a great chance to stay current on the latest technologies.  However, like any sales/account-based job, the financial and career success of SE’s is highly dependent on the customers they are assigned to and the quality of the sales team they are working with.  Generally SE’s don’t do technical support, but often can get pulled into late-night calls if a solution they sell doesn’t work.  SEs are often the face of the company and can take a lot of hits for things that they sell which don’t work as expected.  Overall I enjoyed being a partner SE for the most part, although the partner I worked for had some problems.

Post-Sales/Advanced Services

Summary:
I’m including both partner post-sales, which I have done, and advanced services at a vendor like Cisco, which are similar.  A post-sales engineer is responsible for deploying a solution once the customer has purchased it, and oftentimes the AS/deployment piece is a part of the sale.  Occasionally these engineers are used for non-project-based work, more so at partners.  In this case, the engineer might be called to be on site to do some regular maintenance, fill in for a vacationing engineer, etc.

Pros:

  • Hands-on network engineering.  This is what we all signed up for, right?  Getting into real networks, setting stuff up, and making things happen.
  • Unlike IT network engineers, this job is more deployment focused so you don’t have to spend as much time on day-to-day administrative tasks.
  • Unlike sales, the designs you work on are lower-level and more detailed, so again, this is a great nuts-and-bolts engineering role.

Cons:

  • As with sales, the quality and enjoyability is highly dependent on the customers you end up with.
  • You can get into some nasty deployment scenarios with very unhappy customers.
  • Often these engagements are short-term, so less of a chance to learn a customer/network.  Often it is get in, do the deployment, and move on to the next one.
  • Can involve a lot of travel.
  • Frequently end up assisting technical support with deployments you have done.
  • Can have odd hours.
  • Often left scrambling when sales messed up the BoM and didn’t order the right gear or parts.

Key Takeaway:
I definitely enjoyed many of my post-sales deployments at the VAR.  Being on-site and doing a live deployment with a customer is great.  I remember one time when I did a total network refresh and VoIP deployment up at St. Helena Unified School District in Napa, CA.  It was a small school district, but over a week in the summer we went building-by-building replacing the switches and routers and setting up the new system.  The customer was totally easygoing, gave us 100% free rein to do it how we wanted, was understanding of complications, and was satisfied with the result.  Plus, I enjoyed spending a week up in Napa, eating well and loving the peace.  However, I also had some nightmare customers who micromanaged me or where things just went south.  It’s definitely a great job to gain experience on a variety of live customer networks.

Technical Marketing Engineer

Summary:
I’m currently a Principal TME and a manager of TMEs.  This is definitely my favorite job in the industry.  I give more details in my post on what a TME does, but generally we work in a business unit of a vendor, on a specific product or product family, both guiding the requirements for the product and doing outbound work to explain the product to others, via white papers, videos, presentations, etc.

Pros:

  • Working on the product side allows a network engineer to actually guide products and see the results.  It’s exciting to see a new feature, CLI, etc., added to a product because you drove it.
  • Get to attend several trade shows a year.  Everyone likes getting a free pass to a conference like Cisco Live, but to be a part of making it happen is exhilarating.
  • Great career visibility.  Because the nature of the job requires producing content related to your product, you have an excellent body of work to showcase when you decide to move on.
  • Revenue side.  I didn’t mention this in the sales write-up, but it’s true there too.  Being close to revenue is generally more fun than being in a cost center like IT, because you can usually spend more money.  This means getting new stuff for your lab, etc.
  • Working with products before they are ever released to the public is a lot of fun too.
  • Mostly you don’t work on production networks so not as many maintenance windows and late nights as IT or AS.

Cons:

  • Relentless pace of work.  New software releases are constantly coming;  as soon as one trade show wraps up it’s time to prepare for the next one.  I often say TME work is as relentless as TAC.
  • Can be heavy on the travel.  That could, of course, be a good thing but it gets old.
  • Difficulty of influencing engineering without them reporting to you.  Often it’s a fight to get your ideas implemented when people don’t agree.
  • If you don’t like getting up in front of an audience, or writing documents, this job may not be for you.
  • For new products, often the TMEs are the only resources who are customer facing with a knowledge of the product, so you can end up working IT/AS-type hours anyways.  Less an issue with established/stable products.

Key Takeaway:
As I said, I love this job but it is a frenetic pace.  Most of the posts I manage to squeeze in on the blog are done in five-minute intervals over the course of weeks.  But I have to say, I like being a TME more than anything else I’ve done.  Being on the product side is fascinating, especially if you have been on the consumer side.  Going to shows is a lot of fun.  If you like to teach and explain, and mess around with new things in your lab, this is for you.

It’s not a comprehensive list of the jobs you can do as a network engineer, but it covers some of the main ones.  I’m certainly interested in your thoughts on other jobs you’ve done, or if you’ve done one of the above, whether you agree with my assessment.  Please drop a comment–I don’t require registration but do require an email address just to keep spam down.

I was hoping to do a few technical posts but my lab is currently being moved, so I decided to kick off another series of posts I call “NetStalgia”.  The TAC tales continue to be popular, but I only spent two years in TAC and most cases are pretty mundane and not worthy of a blog post.  What about all those other stories I have from various times and places working on networks?  I think there is some value in those stories, not the least because they show where we’ve come from, but also I think there are some universal themes.  So, allow me to take you back to 1995, to a now-defunct company where I first ventured to work on a computer network.

I graduated college with a liberal arts degree, and like most liberal arts majors, I ended up working as an administrative assistant.  I was hired on at a company that both designed and built museum exhibits.  It was a small company, with around 60 people, half of whom worked as fabricators, building the exhibits, while the other half worked as designers and office personnel.  The fabricators consisted of carpenters, muralists, large and small model builders, and a number of support staff.  The designers were architects, graphic designers, and museum design specialists.  Only the office workers/designers had their own computers, so it was a quite small network of 30 machines, all Macs.

When the lead designer started spending too much time maintaining the computer network, the VP of ops called me in and asked me to take over, since I seemed to be pretty good with computers and technical stuff, like fixing the fax machine.

Back then, believe it or not, PCs did not come with networking capabilities built in.  You had to install a NIC if you wanted to connect to a network.  Macs actually did come with an Apple-proprietary interface called LocalTalk.  The LocalTalk interface consisted of a round serial port, and with the appropriate connectors and cables, you could connect your Macs in a daisy-chain topology.  Using thick serial cables with short lengths to network office computers was a big limitation, so an enterprising company named Farallon came up with a better solution, called PhoneNet.  PhoneNet plugged into the rear LocalTalk port, but instead of using serial cables it converted the LocalTalk signal so that it ran on a single twisted pair of wires.  The brilliance of this was that most offices had phone jacks at every desk, and PhoneNet could use the spare wires in the jacks to carry its signal.  In our case, we had a digital phone system that consumed two pairs of our four-pair Cat 3 cables, so we could dedicate one to PhoneNet/LocalTalk and call it good.

PhoneNet connector with resistor

We used an internal email system called SnapMail from Cassidy and Greene.  SnapMail was great for small companies because it could run in a peer-to-peer mode, without the need for an expensive server.  In this mode, an email you sent to a colleague went directly to their machine.  The obvious problem with this is that if I work the day shift, and you work the night shift, our computers will never be on at the same time and you won’t get my email.  Thankfully, C&G also offered a server option for store-and-forward messaging, but even with the server enabled it would still attempt a peer-to-peer delivery if both sender and receiver were online.
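For the curious, the delivery logic worked something like the little sketch below.  SnapMail is long gone, so the names and behavior here are my reconstruction from memory, not the real product’s internals.

# Rough sketch of SnapMail-style delivery: try peer-to-peer first, and fall
# back to a store-and-forward server only when the recipient is offline.
# Everything here (names, behavior) is hypothetical, for illustration only.
from collections import defaultdict

online = {"alice": True, "bob": False}      # who is powered on right now
server_queue = defaultdict(list)            # the store-and-forward mailbox

def send(sender, recipient, message, server_enabled=True):
    if online.get(recipient):
        return f"delivered {message!r} directly to {recipient}"    # peer-to-peer path
    if server_enabled:
        server_queue[recipient].append(message)                    # held until they log in
        return f"queued {message!r} on the server for {recipient}"
    return f"lost: {recipient} is offline and there is no server"  # the day-shift/night-shift problem

print(send("alice", "bob", "Press check at 6pm"))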

One day I started getting complaints about the reliability of the email system.  Messages were being sent but not getting delivered.  Looking at some of the trouble devices, I could see that they were only partially communicating with each other and the failed messages were not being queued in the server.  This was because each peer seemed to think the other was online, when in fact there was some communication breakdown.

Determining a cause for the problem was tough.  Our network used the AppleTalk protocol suite and not IP.  There was no ping to test connectivity.  I had little idea what to do.

As I mentioned, PhoneNet used a single pair of phone wiring, and as we expanded, the way I added new users was as follows:  when a new hire came on board, I would connect a new phone jack for him, and then go to the 66 punch-down block in a closet in the cafeteria and tie the wires into another operative jack.  Then I would plug a little RJ11 with a resistor on it into the empty port of the LocalTalk dongle, because the dongle had a second port for daisy-chaining and this is what we were supposed to do if it was not in use.  This was a supported configuration known in PhoneNet terminology as a “passive star”.  Passive, because there was nothing in between the stations.  This being before Google, I didn’t know that Farallon only supported 4 branches on a passive star.  I had 30.  Not only did we have too many stations and too much cable length, but the loading on this giant circuit was huge because of all the termination resistors.

I had a walkthrough with our incredulous “systems integrator”, who refused to believe we had connected so many devices without a hub, which was called a “Star Controller” in Farallon terminology.  When he figured out what I had done, we came up with a plan to remove some of the resistors and migrate the designers off of the LocalTalk network.

Some differences between now and then:

  • Networking capability wasn’t built in on PCs, but it was on Macs.
  • I was directly wiring together computers on a punch-down block.
  • There was no Google to figure out why things weren’t working.
  • We used peer-to-peer email systems.

Some lessons that stay the same:

  • Understand thoroughly the limitations of your system.
  • Call an expert when you need help.
  • And of course:  don’t put resistors on your network unless you really need to!


In a previous post I had mentioned I co-authored a book on IOS XE Programmability with some colleagues of mine.  For those who are interested, the book is available here.

The book is not a comprehensive how-to, but a summary of the IOS XE features along with a few samples.  It should provide a good overview of the capabilities of IOS XE.  For those who were on my CCIE webinar, it should be more than adequate to get you up to speed on CCIE written programmability topics.

As with any technical book, there could be some errata, so please feel free to pass them along and I can get them corrected in the next edition.

I’ve mentioned in previous TAC Tales that I started on a TAC team dedicated to enterprise, which made sense given my background.  Shortly after I came to Cisco the enterprise team was broken up and its staff distributed among the routing protocols team and LAN switch team.  The RP team at that time consisted of service provider experts with little understanding of LAN switching issues, but deep understanding of technologies like BGP and MPLS.  This was back before the Ethernet-everywhere era, and SP experts had never really spent a lot of time with LAN switches.

This created a big problem with case routing.  Anyone who has worked more than 5 minutes in TAC knows that when you have a routing protocol problem, usually it’s not the protocol itself but some underlying layer 2 issue.  This is particularly the case when adjacencies are resetting.  The call center would see “OSPF adjacencies resetting” and immediately send the case to the protocols team, when in fact the issue was with STP or perhaps a faulty link.  With all enterprise RP issues suddenly coming into the same queue as SP cases, our SP-centric staff were constantly getting into stuff they didn’t understand.

One such case came in to us, priority 1, from a service provider that ran “cell sites”, which are concrete bunkers with radio equipment for cellular transmissions.  “Now wait,” you’re saying, “I thought you just said enterprise RP cases were a problem, but this was a service provider!”  Well, it was a service provider but they ran LAN switches at the cell site, so naturally when OSPF started going haywire it came in to the RP team despite obviously being a switching problem!

A quick look at the logs confirmed this:

Jun 13 01:52:36 LSW38-0 3858130: Jun 13 01:52:32.347 CDT:
%C4K_EBM-4-HOSTFLAPPING: Host 00:AB:DA:EE:0A:FF in vlan 74 is flapping
between port Fa2/37 and port Po1

Here we could see a host MAC address moving between a front-panel port on the switch and a core-facing port channel.  Something’s not right there.  There were tons of messages like these in the logs.

Digging a little further I determined that Spanning Tree was disabled.  Ugh.

Spanning Tree Protocol (STP) is not popular, and it’s definitely flawed.  With all due respect to the (truly) great Radia Perlman, the inventor of STP, choosing the lowest bridge identifier (usually the MAC address of the switch) as the root, when priorities are set to the default, is a bad idea.  It means that if customers deploy STP with default values, the oldest switch in the network becomes root.  Bad idea, as I said.  However, STP also gets a bad reputation undeservedly.  I cannot tell you how many times there was a layer 2 loop in a customer network, where STP was disabled, and the customer referred to it as a “Spanning Tree loop”.  STP stops layer 2 loops, it does not create them.  And a layer 2 loop out of control is much worse than a 50-second spanning tree outage, which is what you got with the original protocol spec.  When there is no loop in the network, STP doesn’t do anything at all except send out BPDUs.
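To make the root-election complaint concrete, here is a minimal sketch of the election logic (not the real protocol machinery):  the bridge ID is the priority followed by the MAC address, the numerically lowest ID wins, and with every priority left at the default the tie-break is the MAC, which usually belongs to the oldest switch.  The switch names and MACs below are made up.

# Minimal sketch of 802.1D root election.  With priorities at the default
# (32768), the comparison falls through to the MAC address, so the lowest
# (often the oldest) switch in the network becomes root.
switches = [
    ("old-closet-switch", 32768, "00:05:32:aa:01:02"),   # ancient box, lowest MAC
    ("new-core-1",        32768, "70:db:98:11:22:33"),
    ("new-core-2",        32768, "70:db:98:44:55:66"),
]

def bridge_id(switch):
    name, priority, mac = switch
    return (priority, int(mac.replace(":", ""), 16))

root = min(switches, key=bridge_id)
print(f"Root bridge: {root[0]}")   # the old closet switch, unless you lower a core switch's priority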

As I suspected, the customer had disabled spanning tree due to concerns about the speed of failover.  They had also managed to patch a layer 2 loop into their network during a minor change, causing an unchecked loop to circulate frames out of control, bringing down their entire cell site.

I explained to them the value of STP, and why any outage caused by it would be better than the out of control loop they had.  I was told to mind my own business.  They didn’t want to enable spanning tree because it was slow.  Yes, I said, but only when there is a loop!  And in that case, a short outage is better than a meltdown.  Then I realized the customer and I were in a loop, which I could break by closing the case.

Newer technologies (such as SD-Access) obviate the need for STP, but if you’re doing classic Layer 2, please, use it.

An old networking friend whom I mentored for his CCIE a long time ago wrote me an email:  I’ve been a CCIE for 10 years now, he said, and I’m feeling like a dinosaur.  Everyone wants people who know AWS and automation and they don’t want old-school CLI guys.

It takes me back to a moment in my career that has always stuck with me.  I was in my early twenties at my first job as a full-time network engineer.  I was working at the San Francisco Chronicle, at the time (early 2000’s) a large newspaper with a wide circulation.  The company had a large newsroom, a huge advertising call center, three printing plants, and numerous circulation offices across the bay area.  We had IP, IPX, AppleTalk and SNA on the network, typical of the multi-protocol environments of the time.

My colleague Tony and I were up in the MIS area on the second floor of the old Chronicle building on 5th and Mission St. in downtown San Francisco.  The area we were in contained armies of mainframe programmers, looking at the black screens of COBOL code that were the backbone of the newspaper systems in those days.  Most of the programmers were in their fifties, with gray hair and beards.  Tony and I were young, and TCP/IP networking was new to these guys.

I was telling Tony how I always wanted to be technical.  I loved CLI, and I was good at it.  I was working on my first CCIE.  I was at the top of my game, and if any weird problem cropped up on our network I dove in and got it fixed, no matter how hard.  As I explained to Tony, this was all I wanted to do in my career, to be a CLI guy, working with Cisco routers and switches.

Tony gestured at the mainframe programmers, sitting in their cubes typing their COBOL.  “Is this what you want to be when you’re in your fifties,” he said under his breath, “a dinosaur?  Do you just want to be typing obscure code into systems that are probably going to be one step away from being shut down?  How long do you think these guys will have their jobs anyways?”

Well, I haven’t been to the Chronicle in a while but those jobs are almost certainly gone.  Fortunately for the COBOL guys, they’re all retirement age anyways.

We live in a world and an industry that worships the young and the new.  If you’re in your twenties, and totally current on the latest DevOps tools, be warned:  someday you’ll be in your forties and people will think DevOps is for dinosaurs.  The tech industry is under constant pressure to innovate, and innovating usually means getting machines to do things people used to do.  This is why some tech titans are pushing for universal basic income.  They realize that their innovations eliminate jobs at such a rate that people won’t be able to afford to live anymore.  I think it’s a terrible idea, but that’s a subject for another post.  The point is, in this industry, when you think you’ve mastered something and are relevant, be ready:  your obsolescence cometh.

This is an inversion of the natural respect for age and experience we’ve had throughout human history.  I don’t say this as a 40-something feeling some bitterness for the changes to his industry;  in fact, I actually had this thought when I was much younger.  In the West, at least,  in the 1960’s there developed a sense that, to paraphrase Hunter Thompson, old is evil.  This was of course born from legitimately bad things that were perpetuated by previous generations, but it’s interesting to see how the attitude has taken hold in every aspect of our culture.  If you look at medieval guilds, the idea was that the young spent years going through apprentice and journeyman stages before being considered a master of their craft.  This system is still in place in many trades that do not experience innovation at the rate of our industry, and there is a lot to be said for it.  The older members of the trade get security and the younger get experience.

I’ve written a bit about the relevance of the CCIE, and of networking skills in general, in the new age.  Are we becoming the COBOL programmers of the early 2000’s?  Is investing in networking skills about the same as studying mainframe programming back then, a waste of cycles on dying systems?

I’ve made the point many times on this blog that I don’t think that’s (yet) the case.  At the end of the day, we still need to move packets around, and we’re still doing it in much the same way as we did in 1995.  Most of the protocols are the same, and even the newer ones like VXLAN are not that different from the old ones.  Silicon improves, speeds increase, but fundamentally we’re still doing the same thing.  What’s changing is how we’re managing those systems, and as I say in my presentations, that’s not a bad thing.  Using Notepad to copy/paste across a large number of devices is not a good use of network engineers’ time.  Automation can indeed help us to do things better and focus on what matters.

I’ve often used the example of airline pilots.  A modern airplane cockpit looks totally different from a cockpit in the 1980’s or even 1990’s.  The old dials and switches have been replaced by LCD panels and much greater automation.  And yet we still have pilots, and the pilot today still needs to understand engine systems, weather, aerodynamics, and navigation.  What’s changed is how that pilot interacts with the machine.  As a pilot myself, I can tell you how much better a glass cockpit is than the old dials.  I get better information presented in a much more useful way and don’t have to waste my time on unnecessary tasks.  This is how network automation should work.

When I raised this point to some customer execs at a recent briefing, one of them said that the pilots could be eliminated since automation is so good now.  I’m skeptical we will ever reach that level of automation, despite the futurists’ love of making such predictions.  The pilots aren’t there for the 99% of the time when things work as expected, but for the 1% when they don’t, and it will be a long time, if ever, before AI can make judgement calls like a human can.  And in order to make those 1% of calls, the pilots need to be flying the 99% of the time when it’s routine, so they know what to do.

So, are we dinosaurs?  Are we the COBOL programmers of the late 2010’s, ready to be eliminated in the next wave of layoffs?  I don’t think so, but we have to adapt.  We need to learn the glass cockpit.  We need to stay on top of developments, and learn how those developments help us manage the systems we already know so well.  Mainframes and operating systems will come and go, but interconnecting those systems will still be relevant for a long time.

Meanwhile, an SVP at Cisco told me he saw someone with a ballcap at Cisco Live:  “Make CLI Great Again”.  Gotta love that.  Some dinosaurs don’t want to go extinct.

Cisco Live Orlando has wrapped up, at least for me, and I can relax until Cisco Live Europe in January.  I never realized how much work goes into Cisco Live until I became a TME.  Building labs, working on slides, preparing demos, and arranging customer meetings is a months-long process and always a scramble at the end.  It’s a great show, and I can say that having attended as a customer.  It’s more fun and less work to be an attendee, but for technical marketing engineers, it’s still a blast and the highlight of the year.

Orlando had a special significance for me because it was at CL Orlando in 2007 that I decided I really wanted to be a TME.  I attended several breakouts and thought that I’d love to be up in front of the room, teaching folks about technology.  The only problem:  I was terrified of public speaking.

It took years of trainings, including many as a Toastmaster, before I became comfortable in front of an audience.  That’s a story for another time.  It also took years before the right job opened up, and there were a couple near moves into technical marketing that didn’t work out.  I have to say, I’m glad I have this job and love (almost) every minute of it.

Still, getting up in front of a bunch of your (rather smart) peer network engineers and claiming some sort of expertise is nerve-wracking.  Wanting to do well in front of an audience can lead to frustration.  My main breakout session, BRKCRS-2451, Scripting Catalyst Switches, won me two distinguished speaker awards in a row.  This year, however, the scores are looking quite a bit lower.

It didn’t help that the start time was 8am.  I’m not a morning person, and 8am in Orlando was 5am for me.  The old neurons just weren’t firing for the first 30-45 minutes of the presentation, and in front of 400 people that just isn’t good.

A dose of humility is a good thing, though.  I know TMEs who would kill for my “disappointing” score, so it wasn’t that bad.  And the comments were quite helpful, in fact; they made clear what people are looking for and where they didn’t think I delivered.

I structured BRKCRS-2451 as a journey through developing a script on IOS XE.  The session begins with a demo of a fairly simple script, which pulls some data down from a switch and then formats it and sends it to a Webex Teams (formerly Spark) room.  Then, I break down the script starting with installing Python, and some of the tools needed, like Git and Virtual Environments.  Then I move on to YANG/NETCONF, talk about REST, and then wrap it up by showing how it all fits together to build the script I demoed.
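For readers who haven’t seen the session, the demo flow is conceptually something like the sketch below:  pull operational data from the switch over NETCONF, format it, and post it to a Webex Teams room over REST.  This is a simplified stand-in, with placeholder credentials, a basic ietf-interfaces filter, and none of the error handling the real script has.

# Simplified sketch of the demo flow: NETCONF get -> format -> REST post.
# Host, credentials, token, and room ID are placeholders.
import requests
import xmltodict
from ncclient import manager

SWITCH = {"host": "10.1.1.1", "port": 830, "username": "admin",
          "password": "admin", "hostkey_verify": False}
WEBEX_TOKEN = "REPLACE_ME"
ROOM_ID = "REPLACE_ME"

INTERFACE_FILTER = """
<filter>
  <interfaces xmlns="urn:ietf:params:xml:ns:yang:ietf-interfaces"/>
</filter>
"""

def get_interfaces():
    # Retrieve the ietf-interfaces data from the switch over NETCONF.
    with manager.connect(**SWITCH) as m:
        reply = m.get(INTERFACE_FILTER)
    data = xmltodict.parse(reply.xml)
    interfaces = data["rpc-reply"]["data"]["interfaces"]["interface"]
    return interfaces if isinstance(interfaces, list) else [interfaces]

def post_to_webex(text):
    # Post a plain-text message to a Webex Teams (formerly Spark) room.
    requests.post("https://webexapis.com/v1/messages",
                  headers={"Authorization": f"Bearer {WEBEX_TOKEN}"},
                  json={"roomId": ROOM_ID, "text": text})

if __name__ == "__main__":
    summary = "\n".join(f"{i['name']}: enabled={i.get('enabled')}" for i in get_interfaces())
    post_to_webex(f"Interface report:\n{summary}")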

It was a winning formula for a while, but I’m suspecting network engineers have up-leveled their programmability skills in the last year or so.  When I used to explain what GitHub was, network engineers usually were relieved to have it explained to them.  Now I think they all know.

I have a few ideas for making the session more relevant.  Still, it was a great experience talking to 400 people, meeting customers around the show floor and halls, and visiting some of my colleagues’ sessions.  Hopefully my attendees got something out of the session, and I look forward to the next Cisco Live.

Two years ago I published my Ten Years a CCIE series.  Actually, I had written the series a couple years before I published it, but as I say in my introduction to the series, I felt it was a bit self-indulgent and uninteresting, so I scrapped it for a while.  The original pieces were dictated, and I’ve been meaning to go back and clean up some of the grammatical errors and grating phrases, but haven’t had the time.  Not a lot of people have read it, nor did I expect many to read it, since I generally don’t advertise the blog on social media, or anywhere really.  But the feedback from the few who have read it has been positive, and I’m gratified for that.

Things have changed a lot since I got into networking in 1995, and since I passed my CCIE in 2004.  But it’s also amazing how much has stayed the same.  TCP/IP, and in fact IPv4, is still the heart of the network.  Knowledge of OSPF and BGP is still key.  For the most part, new controllers and programmable interfaces represent a different way of managing fundamentally the same thing.

The obvious reasons for this are that networks work and are hard to change.  The old protocols have been sufficient for passing data from point A to point B for a long time.  They’re not perfect, but they are more than adequate.  They are hard to change because networks are heterogeneous.  There are so many different types of systems connecting to them that if we wanted to fundamentally alter the building blocks of networks, we’d have to upgrade a lot of systems.  This is why IPv6 adoption is so slow.

Occasionally I poke around at TechExams.net to see what newer network engineers are thinking, and where they are struggling.  I’m probably the only director-level employee of Cisco who reads or comments on that message board.  I started reading it back when I was still at Juniper and studying for my JNCIE, but I’ve continued to read it because I like the insights I get from folks prepping for their certifications.  People are occasionally concerned that the new world of controllers and automation will make their jobs obsolete.

I built the first part of my career on CLI.  Now I’m building it on controllers and programmability.  In this industry, we have to adapt, but we don’t have to die.  Cars have changed drastically, with on-board computer systems and so forth, but we still need mechanics.  We still need good network engineers.

To be honest, I was getting tired of my career by the time I left Juniper and came to Cisco.  I was bored.  I thought of going back to school and getting a Ph.D. in classical languages, my other passion.  Getting married helped put an end to that idea (Ph.D.’s in ancient Greek make a lot less than network engineers) but when I came back to Cisco, I felt revitalized.  I started learning new things.  Networking was becoming fun again.

I wrote the “Ten Years a CCIE” series both for people who had passed the exam and wanted to have some fun remembering the experience, as well as for people struggling to pass it.  Some things change, as I said, but a lot remains the same.  I still think, closing in on 15 years since I took the exam, that it’s still worth it.  I still think it’s a fantastic way to launch a career.  The exam curriculum will adapt, as it always does, with new technologies, but it’s an amazing learning experience if you do it honestly, and you will be needed when you make it through.

Everyone who’s worked in TAC can tell you their nightmare case–the type of case that, when they see it in the queue, makes them want to run away, take an unexpected lunch break, and hope some other engineer grabs it.  The nightmare case is the case you know you’ll get stuck on for hours, on a conference bridge, escalating to other engineers, trying to find a solution to an impossible problem.  For some it’s unexplained packet loss.  For others, it’s multicast.  For me, it was EIGRP Stuck-in-Active (SIA).

Some customer support engineers (CSEs) thought SIA cases were easy.  Not me.  A number of times I had a network in total meltdown due to SIA with no clue as to where the problem was.  Often the solution required a significant redesign of the network.

As a review, EIGRP is more-or-less a distance-vector routing protocol, which uses an algorithm called DUAL to achieve better performance than a traditional DV protocol like RIP.  I don’t want to get into all the fun CCIE questions on the protocol details, but what matters for this article is how querying works.  When an EIGRP router loses a route, it sets the route as “Active” and then queries its neighbors as to where the route went.  Then, if the neighbors don’t have it, they set it active and query their neighbors.  If those neighbors don’t have the route, they of course mark it active and query their neighbors.  And so forth.

It should be obvious from this process that in a large network, the queries can multiply quite quickly.  If a router has a lot of neighbors, and its neighbors have a lot of neighbors, the queries multiply exponentially, and can get out of control.  Meanwhile, when a router sets a route active, it sets a timer.  If it doesn’t get a reply before the timer expires, then the router marks the route “Stuck In Active”, and resets the entire EIGRP adjacency.  In a large network with a lot of neighbors, even if the route is present, the time lag between sending a query and getting a response can be so long that the route gets reset before the response makes it to the original querying router.
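A toy calculation shows how fast the fan-out gets away from you.  This is nothing like the real DUAL state machine; it is just the arithmetic of queries multiplying hop by hop, with the fan-out, diameter, and timer values assumed for illustration.

# Toy illustration of EIGRP query fan-out (not the real DUAL machinery).
# Each router with no alternate path re-queries its other neighbors, so in a
# dense, unsummarized domain the number of outstanding queries explodes.
NEIGHBORS_PER_ROUTER = 8     # assumed average fan-out
HOPS = 6                     # assumed query diameter of the domain

queries_this_hop, total = NEIGHBORS_PER_ROUTER, 0
for hop in range(1, HOPS + 1):
    total += queries_this_hop
    print(f"hop {hop}: ~{queries_this_hop:,} queries in flight, ~{total:,} sent so far")
    queries_this_hop *= NEIGHBORS_PER_ROUTER - 1   # each router re-queries its other neighbors

# A router cannot answer the query it received until every neighbor it queried
# has answered; if the whole tree does not settle before the active timer
# (3 minutes by default) expires, the route is declared SIA and the adjacency resets.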

I’ve glossed over some of the details here, since obviously an EIGRP router can lose a route entirely without going SIA.  For details, see this article.  The main point to remember is that the SIA happens when the querying router just doesn’t get a response back.

Back in my TAC days, I of course wasn’t happy to see an SIA drop in the queue.  I waited to see if one of my colleagues would take the case and alleviate the burden, but the case turned blue after 20 minutes, meaning someone had to take it.  Darn.

Now I can show my age, because the customer had adjacencies resetting on Token Ring interfaces.  I asked the customer for a topology diagram, some debugs, and to check whether there was packet loss across the network.  Sometimes, if packets are getting dropped, the query responses don’t make it back to the original router, causing SIA.  The logs from the resets looked like this:

rtr1 - 172.16.109.118 - TokenRing1/0
Sep 1 16:58:06: %DUAL-3-SIA: Route 172.16.161.58/32 stuck-in-active state in IP-EIGRP(0) 55555. Cleaning up
Sep 1 16:58:06: %DUAL-5-NBRCHANGE: IP-EIGRP(0) 55555: Neighbor 172.16.109.124 (TokenRing1/0) is down: stuck in active
Sep 1 16:58:07: %DUAL-5-NBRCHANGE: IP-EIGRP(0) 55555: Neighbor 172.16.109.124 (TokenRing1/0) is up: new adjacency

This is typical of SIA.  The adjacency flapped, but the logs showed no particular reason why.

I thought back to my first troubleshooting experience as a network engineer.  I had brought up a new branch office but it couldn’t talk back to HQ.  Mike, my friend and mentor, showed up and started pinging hop-by-hop until he found a missing route.  “That’s how I learned it,” he said, “just go one hop at a time.”  The big clue I had in the SIA case was the missing route:  172.16.161.58/32.  I started tracing it back, hop-by-hop.

I found that the route originated from a router on the edge of the customer network, which had an ISDN PRI connected.  (Showing my age again!)  They had a number of smaller offices that would dial into the ISDN on demand, and then drop off.  ISDN had per-minute charges and thus, in this pre-VPN era, it was common to set up ISDN in on-demand mode.  ISDN was a digital dial-up technology with very short call setup times.  I discovered that, as these calls were going up and down, the router was generating /32 peer routes for the neighbors and injecting them into EIGRP.  They had a poorly designed network with a huge query domain size, and so as these dial peers were going up and down, routers on the opposite side of the network were going active on the route and not getting responses back.

They were advertising a /16 for the entire 172.16.x.x network, so sending a /32 per dial peer was totally unnecessary.  I recommended they enable “no peer neighbor-route” on the PRI to suppress the /32s, and the SIAs went away.
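The logic is easy to sanity-check:  every one of those host routes already fell inside the summary the edge router was advertising, so suppressing them cost nothing.  A quick illustration (the prefixes are from the case; the check itself is just the standard library):

# The per-peer /32s were already covered by the advertised /16 summary, so
# suppressing them with "no peer neighbor-route" removed the churn without
# removing any reachability.
from ipaddress import ip_network

summary = ip_network("172.16.0.0/16")
peer_host_route = ip_network("172.16.161.58/32")

print(peer_host_route.subnet_of(summary))   # True: the /32 added nothing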

I hate to bite the hand that feeds me, but even though I work at Cisco I can say I really never liked EIGRP.  EIGRP is fast, and if the network is designed well, it works fine.  However, networks often grow organically, and the larger the domain, the more unstable EIGRP becomes.  I’ve never seen this sort of problem with OSPF or ISIS.  Fortunately, this case ended up being much less problematic than I expected, but often these cases were far nastier.  Oftentimes it was nearly impossible to find the route causing the problem and why it was going crazy.  Anyhow it’s always good to relive a case with both Token Ring and ISDN for a double case of nostalgia.


I’m somewhat recovered from an exhausting week.  I spent last week with a team of 10 others locked up in building 4 at Cisco writing a book using the book sprint methodology.

Several of the TMEs who report to me got together and wrote a book on Software-Defined Access earlier this year.  The PDF version of that book is available here.  Then, just over a month ago, some TMEs (including one member of my team) got together and wrote a book on the Catalyst 9000-series, available here.  Both of these were also produced with the book sprint methodology, and the quality is surprisingly good.

These books are written with the help of the Book Sprint company.  They send a facilitator who guides the team through writing a book from scratch in a week.  There is no preparation beforehand, and almost no work after the week is over.

The week begins with everyone writing their ideas on post-its, and then organizing them into the basic structure of the book.  By the second half of day one, we were assembled into small teams to outline our sections.  After outlining the sections, the sub-teams then break down and individuals start writing the book.

By the end of Tuesday, the book is written, but it doesn’t end there.  On Wednesday the entire book is reviewed by teams different from the ones that wrote it, and then on Thursday it is reviewed again.  Friday the entire book is reviewed by a sub-team to iron out the English and ensure the voice is consistent throughout.  While all this is going on, editors and illustrators are working on the book in the background.

As I mentioned, it’s exhausting.  We worked until midnight on Thursday and 10pm on Friday.  But we got it done and we’ll have some copies printed up for Cisco Live in Orlando in June.

I can’t say I agree with the approach of every part of the book, but that’s the idea.  It’s a team effort.  It’s not my book, nor the book of any other team member.  It’s our book.  I tend to write in a more conversational tone that works for blogs but is not as good for books.  I think that my occasionally excessive wordiness helps to draw the reader along, and gives them space to digest what I’m saying.  So, it was occasionally painful to see my prose hacked apart by other authors.  Still, at the end of the day, the process works and the result was good.

For any readers who might be attending CL Orlando, I’ll be happy to sign a copy for you.  For those who aren’t, when we have the PDF finalized I’ll link it on the blog.


A common approach for TAC engineers and customers working on a tough case is to just “throw hardware at it.”  Sometimes this can be laziness:  why troubleshoot a complex problem when you can send an RMA, swap out a line card, and hope it works?  Other times it’s a legitimate step in a complex process of elimination.  RMA the card and if the problem still happens, well, you’ve eliminated the card as one source of the problem.

Hence, it was nothing unusual the day I got a P1 case from a major service provider, requeued (reassigned) after multiple RMAs.  The customer had a 12000-series GSR, top of the line back then, and was frustrated because ISIS wasn’t working.

“We just upgraded the GRP to a PRP to speed the router up,” he said, “but now it’s taking 4 hours for ISIS to converge.  Why did we pay all this money on a new route processor when it just slowed our box way down?!”

The GSR router is a chassis-type router, with multiple line cards with ports of different types, a fabric interconnecting them, and a management module (route processor, or RP) acting as the brains of the device.  The original RP was called a GRP, but Cisco had released an improved version called the PRP.

The GSR 12000-series

The customer seemed to think the new PRP had performance issues, but this didn’t make sense.  Performance issues might cause some small delays or possibly packet loss for packets destined to the RP, but not delays of four hours.  Something else was amiss.  I asked the customer to send me the ISIS database, and it was full of LSPs like this:

#sh isis database

IS-IS Level-2 Link State Database
LSPID                 LSP Seq Num  LSP Checksum  LSP Holdtime
0651.8412.7001.00-00  0x00000000   0x0000        193               0/0/0

ISIS routers periodically send CSNPs, or Complete Sequence Number PDUs, which contain a list of all the link state packets (LSPs) in the router database.  In this case, the GSR was directly attached to a Juniper router which was its sole ISIS adjacency.  It was receiving the entire ISIS database from this router.  Normally an ISIS database entry looks like this:

#sh isis database

IS-IS Level-2 Link State Database
LSPID                 LSP Seq Num  LSP Checksum  LSP Holdtime
bb1-sjc.00-00         0x0000041E   0xF97D        65365             0/0/0

Note that instead of a router ID, we actually have a router name.  Note also that we have a sequence number and a checksum for each LSP.  As the previous output shows, something was wrong with the LSPs we were receiving.  Not only was the name not resolving, the sequence and checksum were zero.  How can we possibly have an LSP which has no sequence number at all?

Even weirder was that as I refreshed the ISIS outputs, the LSPs started resolving, suddenly popping up with names and non-zero sequences and checksums.  I stayed on the phone with the customer for several hours, before finally every LSP was resolved, and the customer had full reachability.  “Don’t do anything to the router until I get back to you,” I said before hanging up.  If only he had listened.

I was about to pack up for the day when I got called by our hotline.  The customer had called in and escalated to a P1 after reloading the router.  The entire link state database was zeroed out again, and the network was down.  He only had a short maintenance window in which to work, and now he had an outage.  It was 6pm.  I knew I wasn’t going home for a while.

Whatever was happening was well beyond my ISIS expertise.  Even in the routing protocols team, it was hard to find deep knowledge of ISIS.  I needed an expert, and Abe Martey, who sat across from me, literally wrote the book on ISIS.  The Cisco Press book, that is.  The only issue:  Abe had decided to take PTO that week.  Of course.  I pinged a protocols escalation engineer, one of our best BGP guys.  He didn’t want anything to do with it.  Finally I reached out to the duty manager and asked for help.  I also emailed our internal mailers for ISIS, but after 6pm I wasn’t too optimistic.

Why were we seeing what appeared to be invalid LSPs?  How could an LSP even have a zero checksum or sequence number?  Why did they seem to clear out, and why so slowly?  Did the upgrade to the PRP have anything to do with it?  Was it hardware?  A bug?  As a TAC engineer, you have to consider every single possibility, from A to Z.

The duty manager finally got Sanjeev, an “ISIS expert” from Australia, on the call.  The customer may not realize this while a case is being handled, but if it’s complex and high priority, there is often a flurry of instant messaging going on behind the scenes.  We had a chat room up, and as the “expert” listened to the description of the problem and looked at the notes, he typed in the window:  “This is way over my head.”  Great, so much for expertise.  Our conversation with the customer was getting heated, as his frustration with the lack of progress escalated.  The so-called expert asked him to run a command, which another TAC engineer had suggested.

“Fantastic,” said the customer, “Sanjeev wants us to run a command.  Sanjeev, tell us, why do you want to run this command?  What’s it going to do?”

“Uh, I’m not sure,” said Sanjeev, “I’ll have to get back to you on that.”

Not a good answer.

By 8:30 PM we also had a senior routing protocols engineer in the chat window.  He seemed to think it was a hardware issue and was scraping the error counters on the line cards. The dedicated Advanced Services NCE for the account also signed on and was looking at the errors. It’s a painful feeling knowing you and the customer are stranded, but we honestly had no idea what to do.  Because the other end of the problem was a Juniper router, JTAC came on board as well.  We may have been competitors, but we were professionals and put it aside to best help the customer.

Looking at the chat transcript, which I saved, is painful.  One person suggests physically cleaning the fiber connection.  Another thinks it’s memory corruption.  Another believes it is packet corruption.  We schedule a circuit test with the customer to look for transmission errors.

All the while, the 0x0000 LSPs are re-populating with legitimate information, until, by 9pm, the ISIS database was fully converged and routing was working again.  “This time,” I said, “DO NOT touch the router.”  The customer agreed.  I headed home at 9:12pm, secretly hoping they would reload the router so the case would get requeued to night shift and taken off my hands.

In the morning we got on our scheduled update call with the customer.  I was tired, and not happy to make the call.  We had gotten nowhere in the night, and had not gotten helpful responses to our emails.  I wasn’t sure what I was going to say.  I was surprised to hear the customer in a chipper mood.  “I’m happy to report Juniper has reproduced the problem in their lab and identified the cause.”

There was a little bit of wounded pride knowing they found the fix before we did, but also a sense of relief to know I could close the case.

It turns out that the customer, around the same time they installed the PRP, had attempted to normalize the configs between the Juniper and Cisco devices.  They had mistakenly configured a timer called the “LSP pacing interval” on the Juniper side.  This controls the rate at which the Juniper box sends out LSPs.  They had thought they were configuring the same timer as the LSP refresh interval on the Cisco side, but they were two different things.  By cranking it way up, they ensured that the hundreds of LSPs in the database would trickle in, taking hours to converge.

Why the 0x0000 entries then?  It turns out that in the initial exchange, the ISIS routers share with each other what LSPs they have, without sending the full LSP.  Thus, in Cisco ISIS databases, the 0x0000 entry acts as a placeholder until complete LSP data is received.  Normally this period is short and you don’t see the entry.  We probably would have found the person who knew that eventually, but we didn’t find him that night and our database of cases, newsgroup postings, and bugs turned up nothing to point us in the right direction.
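The arithmetic behind the hours-long convergence is simple once you know about the pacing timer.  The numbers below are illustrative assumptions, not the customer’s actual values.

# Illustrative arithmetic only -- assumed numbers, not the customer's.
# With a large database behind a single adjacency, a badly cranked-up pacing
# interval turns the initial database exchange into an hours-long trickle,
# with the 0x0000 placeholder entries lingering until each real LSP arrives.
LSP_COUNT = 800          # assumed size of the level-2 database
PACING_SECONDS = 20      # assumed (misconfigured) gap between LSP transmissions

convergence_seconds = LSP_COUNT * PACING_SECONDS
print(f"~{convergence_seconds / 3600:.1f} hours to receive the full database")   # ~4.4 hours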

I touched a couple thousand cases in my time at TAC, but this case I remember even 10 years later because of the seeming complexity, the simplicity of the resolution, the weirdness of the symptoms, and the distractors like the PRP upgrade.  Often a major outage sends you in a lot of directions and down many rat holes.  I don’t think we could have done much differently, since the config error was totally invisible to us.  Anyway, if Juniper and Cisco can work together to solve a customer issue, maybe we should have hope for world peace.