Cisco IOS XE Programmability Book

In a previous post I mentioned that I co-authored a book on IOS XE Programmability with some of my colleagues.  For those who are interested, the book is available here.

The book is not a comprehensive how-to, but a summary of the IOS XE features along with a few samples.  It should provide a good overview of the capabilities of IOS XE.  For those who were on my CCIE webinar, it should be more than adequate to get you up to speed on CCIE written programmability topics.

As with any technical book, there could be some errata, so please feel free to pass them along and I can get them corrected in the next edition.

TAC Tales #15: Loopy

I’ve mentioned in previous TAC Tales that I started on a TAC team dedicated to enterprise, which made sense given my background.  Shortly after I came to Cisco the enterprise team was broken up and its staff distributed among the routing protocols team and LAN switch team.  The RP team at that time consisted of service provider experts with little understanding of LAN switching issues, but deep understanding of technologies like BGP and MPLS.  This was back before the Ethernet-everywhere era, and SP experts had never really spent a lot of time with LAN switches.

This created a big problem with case routing.  Anyone who has worked more than 5 minutes in TAC knows that when you have a routing protocol problem, usually it’s not the protocol itself but some underlying layer 2 issue.  This is particularly the case when adjacencies are resetting.  The call center would see “OSPF adjacencies resetting” and immediately send the case to the protocols team, when in fact the issue was with STP or perhaps a faulty link.  With all enterprise RP issues suddenly coming into the same queue as SP cases, our SP-centric staff were constantly getting into stuff they didn’t understand.

One such case came in to us, priority 1, from a service provider that ran “cell sites”, which are concrete bunkers with radio equipment for cellular transmissions.  “Now wait,” you’re saying, “I thought you just said enterprise RP cases were a problem, but this was a service provider!”  Well, it was a service provider but they ran LAN switches at the cell site, so naturally when OSPF started going haywire it came in to the RP team despite obviously being a switching problem!

A quick look at the logs confirmed this:

Jun 13 01:52:36 LSW38-0 3858130: Jun 13 01:52:32.347 CDT:
%C4K_EBM-4-HOSTFLAPPING: Host 00:AB:DA:EE:0A:FF in vlan 74 is flapping
between port Fa2/37 and port Po1

Here we could see a host MAC address moving between a front-panel port on the switch and a core-facing port channel.  Something’s not right there.  There were tons of messages like these in the logs.

Digging a little further I determined that Spanning Tree was disabled.  Ugh.

Spanning Tree Protocol (STP) is not popular, and it’s definitely flawed.  With all due respect to the (truly) great Radia Perlman, the inventor of STP, choosing the lowest bridge identifier (usually the MAC address of the switch) as the root when priorities are left at the default is a bad idea.  It means that if customers deploy STP with default values, the oldest switch in the network becomes root.  Bad idea, as I said.  However, STP also gets an undeserved bad reputation.  I cannot tell you how many times there was a layer 2 loop in a customer network where STP was disabled, and the customer referred to it as a “Spanning Tree loop”.  STP stops layer 2 loops; it does not create them.  And an out-of-control layer 2 loop is much worse than a 50-second spanning tree outage, which is what you got with the original protocol spec.  When there is no loop in the network, STP doesn’t do anything at all except send out BPDUs.

As I suspected, the customer had disabled spanning tree due to concerns about the speed of failover.  They had also managed to patch a layer 2 loop into their network during a minor change, causing an unchecked loop to circulate frames out of control, bringing down their entire cell site.

I explained to them the value of STP, and why any outage caused by it would be better than the out of control loop they had.  I was told to mind my own business.  They didn’t want to enable spanning tree because it was slow.  Yes, I said, but only when there is a loop!  And in that case, a short outage is better than a meltdown.  Then I realized the customer and I were in a loop, which I could break by closing the case.

Newer technologies (such as SD-Access) obviate the need for STP, but if you’re doing classic Layer 2, please, use it.
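
If you’re wondering what “use it” looks like in practice, here’s a minimal sketch for a Catalyst switch running IOS; the interface range and priority value are illustrative, not from the customer’s network.  Rapid PVST+ converges in a second or two instead of the 50 seconds of the original spec, and explicitly lowering the priority on the switch you want as root keeps the oldest switch in the network from electing itself.

! Run Rapid PVST+ and pin the root bridge deliberately (priority is a multiple of 4096)
spanning-tree mode rapid-pvst
spanning-tree vlan 1-4094 priority 8192
!
! On access ports: start forwarding immediately, but err-disable if a BPDU ever shows up
interface range FastEthernet2/1 - 48
 spanning-tree portfast
 spanning-tree bpduguard enable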

On Being a Dinosaur

An old networking friend whom I mentored for his CCIE a long time ago wrote me an email:  I’ve been a CCIE for 10 years now, he said, and I’m feeling like a dinosaur.  Everyone wants people who know AWS and automation and they don’t want old-school CLI guys.

It takes me back to a moment in my career that has always stuck with me.  I was in my early twenties at my first job as a full-time network engineer.  I was working at the San Francisco Chronicle, at the time (early 2000’s) a large newspaper with a wide circulation.  The company had a large newsroom, a huge advertising call center, three printing plants, and numerous circulation offices across the Bay Area.  We had IP, IPX, AppleTalk and SNA on the network, typical of the multi-protocol environments of the time.

My colleague Tony and I were up in the MIS area on the second floor of the old Chronicle building on 5th and Mission St. in downtown San Francisco.  The area we were in contained armies of mainframe programmers, looking at the black screens of COBOL code that were the backbone of the newspaper systems in those days.  Most of the programmers were in their fifties, with gray hair and beards.  Tony and I were young, and TCP/IP networking was new to these guys.

I was telling Tony how I always wanted to be technical.  I loved CLI, and I was good at it.  I was working on my first CCIE.  I was at the top of my game, and if any weird problem cropped up on our network I dove in and got it fixed, no matter how hard.  As I explained to Tony, this was all I wanted to do in my career, to be a CLI guy, working with Cisco routers and switches.

Tony gestured at the mainframe programmers, sitting in their cubes typing their COBOL.  “Is this what you want to be when you’re in your fifties,” he said under his breath, “a dinosaur?  Do you just want to be typing obscure code into systems that are probably going to be one step away from being shut down?  How long do you think these guys will have their jobs anyways?”

Well, I haven’t been to the Chronicle in a while but those jobs are almost certainly gone.  Fortunately for the COBOL guys, they’re all retirement age anyways.

We live in a world and an industry that worships the young and the new.  If you’re in your twenties, and totally current on the latest DevOps tools, be warned:  someday you’ll be in your forties and people will think DevOps is for dinosaurs.  The tech industry is under constant pressure to innovate, and innovating usually means getting machines to do things people used to do.  This is why some tech titans are pushing for universal basic income.  They realize that their innovations eliminate jobs at such a rate that people won’t be able to afford to live anymore.  I think it’s a terrible idea, but that’s a subject for another post.  The point is, in this industry, when you think you’ve mastered something and are relevant, be ready:  your obsolescence cometh.

This is an inversion of the natural respect for age and experience we’ve had throughout human history.  I don’t say this as a 40-something feeling some bitterness about the changes to his industry; in fact, I actually had this thought when I was much younger.  In the West, at least, in the 1960’s there developed a sense that, to paraphrase Hunter Thompson, old is evil.  This was of course born from legitimately bad things that were perpetrated by previous generations, but it’s interesting to see how the attitude has taken hold in every aspect of our culture.  If you look at medieval guilds, the idea was that the young spent years going through apprentice and journeyman stages before being considered a master of their craft.  This system is still in place in many trades that do not experience innovation at the rate of our industry, and there is a lot to be said for it.  The older members of the trade get security and the younger get experience.

I’ve written a bit about the relevance of the CCIE, and of networking skills in general, in the new age.  Are we becoming the COBOL programmers of the early 2000’s?  Is investing in networking skills about the same as studying mainframe programming back then, a waste of cycles on dying systems?

I’ve made the point many times on this blog that I don’t think that’s (yet) the case.  At the end of the day, we still need to move packets around, and we’re still doing it in much the same way as we did in 1995.  Most of the protocols are the same, and even the newer ones like VXLAN are not that different from the old ones.  Silicon improves, speeds increase, but fundamentally we’re still doing the same thing.  What’s changing is how we’re managing those systems, and as I say in my presentations, that’s not a bad thing.  Using Notepad to copy/paste across a large number of devices is not a good use of network engineers’ time.  Automation can indeed help us do things better and focus on what matters.

I’ve often used the example of airline pilots.  A modern airplane cockpit looks totally different from a cockpit in the 1980’s or even 1990’s.  The old dials and switches have been replaced by LCD panels and much greater automation.  And yet we still have pilots, and the pilot today still needs to understand engine systems, weather, aerodynamics, and navigation.  What’s changed is how that pilot interacts with the machine.  As a pilot myself, I can tell you how much better a glass cockpit is than the old dials.  I get better information presented in a much more useful way and don’t have to waste my time on unnecessary tasks.  This is how network automation should work.

When I raised this point to some customer execs at a recent briefing, one of them said that the pilots could be eliminated since automation is so good now.  I’m skeptical we will ever reach that level of automation, despite the futurists’ love of making such predictions.  The pilots aren’t there for the 99% of the time when things work as expected, but for the 1% when they don’t, and it will be a long time, if ever, before AI can make judgement calls like a human can.  And in order to make those 1% of calls, the pilots need to be flying the 99% of the time when it’s routine, so they know what to do.

So, are we dinosaurs?  Are we the COBOL programmers of the late 2010’s, ready to be eliminated in the next wave of layoffs?  I don’t think so, but we have to adapt.  We need to learn the glass cockpit.  We need to stay on top of developments, and learn how those developments help us to manage the systems we know well how to manage.  Mainframes and operating systems will come and go, but interconnecting those systems will still be relevant for a long time.

Meanwhile, an SVP at Cisco told me he saw someone with a ballcap at Cisco Live:  “Make CLI Great Again”.  Gotta love that.  Some dinosaurs don’t want to go extinct.

Cisco Live US 2018

Cisco Live Orlando has wrapped up, at least for me, and I can relax until Cisco Live Europe in January.  I never realized how much work goes into Cisco Live until I became a TME.  Building labs, working on slides, preparing demos, and arranging customer meetings is a months-long process and always a scramble at the end.  It’s a great show, and I can say that having attended as a customer.  It’s more fun and less work to be an attendee, but for technical marketing engineers, it’s still a blast and the highlight of the year.

Orlando had a special significance for me because it was at CL Orlando in 2007 that I decided I really wanted to be a TME.  I attended several breakouts and thought that I’d love to be up in front of the room, teaching folks about technology.  The only problem:  I was terrified of public speaking.

It took years of training, including many sessions as a Toastmaster, before I became comfortable in front of an audience.  That’s a story for another time.  It also took years before the right job opened up, and there were a couple of near moves into technical marketing that didn’t work out.  I have to say, I’m glad I have this job and love (almost) every minute of it.

Still, getting up in front of a bunch of your (rather smart) peer network engineers and claiming some sort of expertise is nerve-wracking.  Wanting to do well in front of an audience can lead to frustration.  My main breakout session, BRKCRS-2451, Scripting Catalyst Switches, won me two distinguished speaker awards in a row.  This year, however, the scores are looking quite a bit lower.

It didn’t help that the start time was 8am.  I’m not a morning person, and 8am in Orlando was 5am for me.  The old neurons just weren’t firing for the first 30-45 minutes of the presentation, and in front of 400 people that just isn’t good.

A dose of humility is a good thing, though.  I know TMEs who would kill for my “disappointing” score, so it wasn’t that bad.  And the comments were quite helpful, in fact, and made clear what people are looking for and where they didn’t think I delivered.

I structured BRKCRS-2451 as a journey through developing a script on IOS XE.  The session begins with a demo of a fairly simple script, which pulls some data down from a switch, formats it, and sends it to a Webex Teams (formerly Spark) room.  Then I break down the script, starting with installing Python and some of the tools needed, like Git and virtual environments.  From there I move on to YANG/NETCONF, talk about REST, and wrap up by showing how it all fits together to build the script I demoed.
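
For anyone who couldn’t attend, here’s a rough sketch of the shape of that script.  It is not the code from the session, just an illustration of the flow described above: the switch address, credentials, Webex token, and room ID are placeholders, and the NETCONF filter simply pulls interface state from the standard IETF interfaces model.

# Rough sketch only: not the session code.  Host, credentials, token,
# and room ID below are placeholders.
import requests
from ncclient import manager

SWITCH = {
    "host": "192.0.2.1",
    "port": 830,
    "username": "admin",
    "password": "admin",
    "hostkey_verify": False,
}

# Subtree filter for operational interface data (IETF interfaces model)
INTERFACES_FILTER = (
    '<filter>'
    '<interfaces-state xmlns="urn:ietf:params:xml:ns:yang:ietf-interfaces"/>'
    '</filter>'
)

WEBEX_TOKEN = "REPLACE_ME"
WEBEX_ROOM = "REPLACE_ME"

def get_interface_state():
    """Pull interface state from the switch over NETCONF."""
    with manager.connect(**SWITCH) as m:
        return m.get(INTERFACES_FILTER).data_xml

def post_to_webex(text):
    """Post a message to a Webex Teams room over its REST API."""
    resp = requests.post(
        "https://api.ciscospark.com/v1/messages",  # Webex Teams endpoint at the time of writing
        headers={"Authorization": "Bearer " + WEBEX_TOKEN},
        json={"roomId": WEBEX_ROOM, "text": text},
    )
    resp.raise_for_status()

if __name__ == "__main__":
    # Truncate the XML so the message stays under Webex size limits
    post_to_webex("Interface state from the switch:\n" + get_interface_state()[:3000])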

It was a winning formula for a while, but I suspect network engineers have up-leveled their programmability skills in the last year or so.  When I used to explain what GitHub was, network engineers were usually relieved to have it explained to them.  Now I think they all know.

I have a few ideas for making the session more relevant.  Still, it was a great experience talking to 400 people, meeting customers around the show floor and halls, and visiting some of my colleagues’ sessions.  Hopefully my attendees got something out of the session, and I look forward to the next Cisco Live.

Two Years of Ten Years a CCIE

Two years ago I published my Ten Years a CCIE series.  Actually, I had written the series a couple of years before I published it, but as I say in my introduction to the series, I felt it was a bit self-indulgent and uninteresting, so I scrapped it for a while.  The original pieces were dictated, and I’ve been meaning to go back and clean up some of the grammatical errors or grating phrases, but haven’t had the time.  Not a lot of people have read it, nor did I expect many to read it, since I generally don’t advertise the blog on social media, or anywhere really.  But the feedback from the few who have read it has been positive, and I’m gratified for that.

Things have changed a lot since I got into networking in 1995, and since I passed my CCIE in 2004.  But it’s also amazing how much has stayed the same.  TCP/IP, and in fact IPv4, is still the heart of the network.  Knowledge of OSPF and BGP is still key.  For the most part, new controllers and programmable interfaces represent a different way of managing fundamentally the same thing.

The obvious reasons for this are that networks work and are hard to change.  The old protocols have been sufficient for passing data from point A to point B for a long time.  They’re not perfect, but they are more than adequate.  They are hard to change because networks are heterogeneous.  So many different types of systems connect to them that if we wanted to fundamentally alter the building blocks of networks, we’d have to upgrade a lot of those systems.  This is why IPv6 adoption is so slow.

Occasionally I poke around at TechExams.net to see what newer network engineers are thinking, and where they are struggling.  I’m probably the only director-level employee of Cisco who reads or comments on that message board.  I started reading it back when I was still at Juniper and studying for my JNCIE, but I’ve continued to read it because I like the insights I get from folks prepping for their certifications.  People are occasionally concerned that the new world of controllers and automation will make their jobs obsolete.

I built the first part of my career on CLI.  Now I’m building it on controllers and programmability.  In this industry, we have to adapt, but we don’t have to die.  Cars have changed drastically, with on-board computer systems and so forth, but we still need mechanics.  We still need good network engineers.

To be honest, I was getting tired of my career by the time I left Juniper and came to Cisco.  I was bored.  I thought of going back to school and getting a Ph.D. in classical languages, my other passion.  Getting married helped put an end to that idea (Ph.D.’s in ancient Greek make a lot less than network engineers) but when I came back to Cisco, I felt revitalized.  I started learning new things.  Networking was becoming fun again.

I wrote the “Ten Years a CCIE” series both for people who had passed the exam and wanted to have some fun remembering the experience, and for people struggling to pass it.  Some things change, as I said, but a lot remains the same.  Closing in on 15 years since I took the exam, I still think it’s worth it.  I still think it’s a fantastic way to launch a career.  The exam curriculum will adapt, as it always does, with new technologies, but it’s an amazing learning experience if you do it honestly, and you will be needed when you make it through.

TAC Tales #14: Stuck in Active

Everyone who’s worked in TAC can tell you their nightmare case–the type of case that, when they see it in the queue, makes them want to run away, take an unexpected lunch break, and hope some other engineer grabs it.  The nightmare case is the case you know you’ll get stuck on for hours, on a conference bridge, escalating to other engineers, trying to find a solution to an impossible problem.  For some it’s unexplained packet loss.  For others, it’s multicast.  For me, it was EIGRP Stuck-in-Active (SIA).

Some customer support engineers (CSEs) thought SIA cases were easy.  Not me.  A number of times I had a network in total meltdown due to SIA with no clue as to where the problem was.  Often the solution required a significant redesign of the network.

As a review, EIGRP is more-or-less a distance-vector routing protocol, which uses an algorithm called DUAL to achieve better performance than a traditional DV protocol like RIP.  I don’t want to get into all the fun CCIE questions on the protocol details, but what matters for this article is how querying works.  When an EIGRP router loses a route and has no backup path, it marks the route “Active” and queries its neighbors as to where the route went.  If the neighbors don’t have it either, they mark it active and query their neighbors.  If those neighbors don’t have the route, they of course mark it active and query their neighbors.  And so forth.

It should be obvious from this process that in a large network, the queries can multiply quite quickly.  If a router has a lot of neighbors, and its neighbors have a lot of neighbors, the queries multiply exponentially and can get out of control.  Meanwhile, when a router sets a route active, it starts a timer.  If it doesn’t get a reply before the timer expires, the router marks the route “Stuck In Active” and resets the adjacency with the neighbor that failed to respond.  In a large network with a lot of neighbors, even if the route is still present somewhere, the lag between sending a query and getting a response can be so long that adjacencies get reset before the replies make it back to the original querying router.

I’ve glossed over some of the details here, since obviously an EIGRP router can lose a route entirely without going SIA.  For details, see this article.  The main point to remember is that a route goes stuck-in-active when the querying router just doesn’t get a response back.
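
As an aside, the usual way to keep that query storm contained is to bound the query domain, either by summarizing at distribution points or by marking spoke routers as EIGRP stubs.  A hedged sketch, with made-up AS number, interface, and prefix:

! On a hub's WAN interface: send one summary toward the spokes, so queries
! for anything inside 10.1.0.0/16 are answered here and stop propagating
interface Serial0/0
 ip summary-address eigrp 100 10.1.0.0 255.255.0.0
!
! On a spoke: declare it a stub so the hub never queries it for lost routes
router eigrp 100
 eigrp stub connected summary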

Back in my TAC days, I of course wasn’t happy to see an SIA drop in the queue.  I waited to see if one of my colleagues would take the case and alleviate the burden, but the case turned blue after 20 minutes, meaning someone had to take it.  Darn.

Now I can show my age, because the customer had adjacencies resetting on Token Ring interfaces.  I asked the customer for a topology diagram, some debugs, and to check whether there was packet loss across the network.  Sometimes, if packets are getting dropped, the query responses don’t make it back to the original router, causing SIA.  The logs from the resets looked like this:

rtr1 - 172.16.109.118 - TokenRing1/0
Sep 1 16:58:06: %DUAL-3-SIA: Route 172.16.161.58/32 stuck-in-active state in IP-EIGRP(0) 55555. Cleaning up
Sep 1 16:58:06: %DUAL-5-NBRCHANGE: IP-EIGRP(0) 55555: Neighbor 172.16.109.124 (TokenRing1/0) is down: stuck in active
Sep 1 16:58:07: %DUAL-5-NBRCHANGE: IP-EIGRP(0) 55555: Neighbor 172.16.109.124 (TokenRing1/0) is up: new adjacency

This is typical of SIA.  The adjacency flapped, but the logs showed no particular reason why.

I thought back to my first troubleshooting experience as a network engineer.  I had brought up a new branch office but it couldn’t talk back to HQ.  Mike, my friend and mentor, showed up and started pinging hop-by-hop until he found a missing route.  “That’s how I learned it,” he said, “just go one hop at a time.”  The big clue I had in the SIA case was the missing route:  172.16.161.58/32.  I started tracing it back, hop-by-hop.

I found that the route originated from a router on the edge of the customer network, which had an ISDN PRI connected.  (Showing my age again!)  They had a number of smaller offices that would dial into the ISDN on-demand, and then drop off.  ISDN had per-minute charges and thus, in this pre-VPN era, it was common to set up ISDN in on-demand mode.  ISDN was a digital dial-up technology with very short call setup times.  I discovered that, as these calls were going up and down, the router was generating /32 peer routes for the neighbors and injecting them into EIGRP.  They had a poorly designed network with a huge query domain, and so as these dial peers were going up and down, routers on the opposite side of the network were going active on these routes and not getting responses back.

They were advertising a /16 for the entire 172.16.x.x network, so sending a /32 per dial peer was totally unnecessary.  I recommended they enable “no peer neighbor-route” on the PRI to suppress the /32s, and the SIAs went away.
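
For anyone curious what the fix looked like, it was roughly the following; the interface numbering is illustrative rather than from the customer’s actual config:

! Applied to the PRI D-channel: stop installing a /32 host route for each
! dial-in peer, so the flapping calls no longer inject routes into EIGRP
interface Serial0/0:23
 no peer neighbor-route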

I hate to bite the hand that feeds me, but even though I work at Cisco I can say I really never liked EIGRP.  EIGRP is fast, and if the network is designed well, it works fine.  However, networks often grow organically, and the larger the domain, the more unstable EIGRP becomes.  I’ve never seen this sort of problem with OSPF or ISIS.  Fortunately, this case ended up being much less problematic than I expected, but often these cases were far nastier.  Oftentimes it was nearly impossible to find the route causing the problem and why it was going crazy.  Anyhow it’s always good to relive a case with both Token Ring and ISDN for a double case of nostalgia.

Book Sprint


I’m somewhat recovered from an exhausting week.  I spent last week with a team of 10 others locked up in building 4 at Cisco writing a book using the book sprint methodology.

Several of the TMEs who report to me got together and wrote a book on Software-Defined Access earlier this year.  The PDF version of that book is available here.  Then, just over a month ago, some TMEs (including one member of my team) got together and wrote a book on the Catalyst 9000-series, available here.  Both of these were also produced with the book sprint methodology, and the quality is surprisingly good.

These books are written with the help of the Book Sprint company.  They send a facilitator who guides the team through writing a book from scratch in a week.  There is no preparation beforehand, and almost no work after the week is over.

The week begins with everyone writing their ideas on post-its and then organizing them into the basic structure of the book.  By the second half of day one, everyone is organized into small teams to outline their sections.  Once the outlines are done, the sub-teams break up and individuals start writing the book.

By the end of Tuesday, the book is written, but it doesn’t end there.  On Wednesday the entire book is reviewed by teams different from the ones that wrote it, and then on Thursday it is reviewed again.  Friday the entire book is reviewed by a sub-team to iron out the English and ensure the voice is consistent throughout.  While all this is going on, editors and illustrators are working on the book in the background.

As I mentioned, it’s exhausting.  We worked until midnight on Thursday and 10pm on Friday.  But we got it done and we’ll have some copies printed up for Cisco Live in Orlando in June.

I can’t say I agree with the approach of every part of the book, but that’s the idea.  It’s a team effort.  It’s not my book, nor the book of any other team member.  It’s our book.  I tend to write in a more conversational tone that works for blogs but is not as good for books.  I think that my occasionally excessive wordiness helps to draw the reader along, and gives them space to digest what I’m saying.  So, it was occasionally painful to see my prose hacked apart by other authors.  Still, at the end of the day, the process works and the result was good.

For any readers who might be attending CL Orlando, I’ll be happy to sign a copy for you.  For those who aren’t, when we have the PDF finalized I’ll link it on the blog.


TAC Tales #13: All Zeros

A common approach for TAC engineers and customers working on a tough case is to just “throw hardware at it.”  Sometimes this can be laziness:  why troubleshoot a complex problem when you can send an RMA, swap out a line card, and hope it works?  Other times it’s a legitimate step in a complex process of elimination.  RMA the card and if the problem still happens, well, you’ve eliminated the card as one source of the problem.

Hence, it was nothing unusual when a P1 case from a major service provider landed in my queue, requeued (reassigned) after multiple RMAs.  The customer had a 12000-series GSR, top of the line back then, and was frustrated because ISIS wasn’t working.

“We just upgraded the GRP to a PRP to speed the router up,” he said, “but now it’s taking 4 hours for ISIS to converge.  Why did we pay all this money for a new route processor when it just slowed our box way down?!”

The GSR router is a chassis-type router, with multiple line cards with ports of different types, a fabric interconnecting them, and a management module (route processor, or RP) acting as the brains of the device.  The original RP was called a GRP, but Cisco had released an improved version called the PRP.

The GSR 12000-series

The customer seemed to think the new PRP had performance issues, but this didn’t make sense.  Performance issues might cause some small delays or possibly packet loss for packets destined to the RP, but not delays of four hours.  Something else was amiss.  I asked the customer to send me the ISIS database, and it was full of LSPs like this:

#sh isis database

IS-IS Level-2 Link State Database
LSPID                 LSP Seq Num  LSP Checksum  LSP Holdtime      ATT/P/OL
0651.8412.7001.00-00  0x00000000   0x0000        193               0/0/0

ISIS routers periodically send CSNPs, or Complete Sequence Number PDUs, which contain a list of all the link-state PDUs (LSPs) in the router’s database.  In this case, the GSR was directly attached to a Juniper router, which was its sole ISIS adjacency.  It was receiving the entire ISIS database from this router.  Normally an ISIS database entry looks like this:

#sh isis database

IS-IS Level-2 Link State Database
LSPID                 LSP Seq Num  LSP Checksum  LSP Holdtime      ATT/P/OL
bb1-sjc.00-00         0x0000041E   0xF97D        65365             0/0/0

Note that instead of a router ID, we actually have a router name.  Note also that we have a sequence number and a checksum for each LSP.  As the customer’s output above shows, something was wrong with the LSPs we were receiving.  Not only was the name not resolving, but the sequence number and checksum were zero.  How can we possibly have an LSP with no sequence number at all?

Even weirder was that as I refreshed the ISIS outputs, the LSPs started resolving, suddenly popping up with names and non-zero sequence numbers and checksums.  I stayed on the phone with the customer for several hours, until finally every LSP was resolved and the customer had full reachability.  “Don’t do anything to the router until I get back to you,” I said before hanging up.  If only he had listened.

I was about to pack up for the day when I got a call from our hotline.  The customer had called in and escalated to a P1 after reloading the router.  The entire link state database was zeroed out again, and the network was down.  He only had a short maintenance window in which to work, and now he had an outage.  It was 6pm.  I knew I wasn’t going home for a while.

Whatever was happening was well beyond my ISIS expertise.  Even in the routing protocols team, it was hard to find deep knowledge of ISIS.  I needed an expert, and Abe Martey, who sat across from me, literally wrote the book on ISIS.  The Cisco Press book, that is.  The only issue:  Abe had decided to take PTO that week.  Of course.  I pinged a protocols escalation engineer, one of our best BGP guys.  He didn’t want anything to do with it.  Finally I reached out to the duty manager and asked for help.  I also emailed our internal mailers for ISIS, but after 6pm I wasn’t too optimistic.

Why were we seeing what appeared to be invalid LSPs?  How could an LSP even have a zero checksum or sequence number?  Why did they seem to clear out, and why so slowly?  Did the upgrade to the PRP have anything to do with it?  Was it hardware?  A bug?  As a TAC engineer, you have to consider every single possibility, from A to Z.

The duty manager finally got Sanjeev, an “ISIS expert” from Australia, on the call.  The customer may not realize this while a case is being handled, but if it’s complex and high priority, there is often a flurry of instant messaging going on behind the scenes.  We had a chat room up, and as the “expert” listened to the description of the problem and looked at the notes, he typed in the window:  “This is way over my head.”  Great, so much for expertise.  Our conversation with the customer was getting heated as his frustration with the lack of progress escalated.  The so-called expert asked him to run a command, which another TAC engineer had suggested.

“Fantastic,” said the customer, “Sanjeev wants us to run a command.  Sanjeev, tell us, why do you want to run this command?  What’s it going to do?”

“Uh, I’m not sure,” said Sanjeev, “I’ll have to get back to you on that.”

Not a good answer.

By 8:30 PM we also had a senior routing protocols engineer in the chat window.  He seemed to think it was a hardware issue and was scraping the error counters on the line cards. The dedicated Advanced Services NCE for the account also signed on and was looking at the errors. It’s a painful feeling knowing you and the customer are stranded, but we honestly had no idea what to do.  Because the other end of the problem was a Juniper router, JTAC came on board as well.  We may have been competitors, but we were professionals and put it aside to best help the customer.

Looking at the chat transcript, which I saved, is painful.  One person suggests physically cleaning the fiber connection.  Another thinks it’s memory corruption.  Another believes it is packet corruption.  We schedule a circuit test with the customer to look for transmission errors.

All the while, the 0x0000 LSPs were re-populating with legitimate information, until, by 9pm, the ISIS database was fully converged and routing was working again.  “This time,” I said, “DO NOT touch the router.”  The customer agreed.  I headed home at 9:12pm, secretly hoping they would reload the router so the case would get requeued to night shift and taken off my hands.

In the morning we got on our scheduled update call with the customer.  I was tired, and not happy to make the call.  We had gotten nowhere in the night, and had not gotten helpful responses to our emails.  I wasn’t sure what I was going to say.  I was surprised to hear the customer in a chipper mood.  “I’m happy to report Juniper has reproduced the problem in their lab and has identified the problem.”

There was a little bit of wounded pride knowing they found the fix before we did, but also a sense of relief to know I could close the case.

It turns out that the customer, around the same time they installed the PRP, had attempted to normalize the configs between the Juniper and Cisco devices.  They had mistakenly configured a timer called the “LSP pacing interval” on the Juniper side.  This controls the rate at which the Juniper box sends out LSPs.  They had thought they were configuring the same timer as the LSP refresh interval on the Cisco side, but they were two different things.  By cranking it way up, they ensured that the hundreds of LSPs in the database would trickle in, taking hours to converge.
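
To make the mix-up concrete: on the Juniper side, lsp-interval is a per-interface pacing timer, the minimum gap in milliseconds between transmitted LSPs, while the Cisco knob they thought they were matching, lsp-refresh-interval, controls how often a router regenerates its own LSPs, in seconds.  Something along these lines, with illustrative values and interface names rather than the customer’s actual config:

! Cisco side: how often this router refreshes the LSPs it originates (seconds)
router isis
 lsp-refresh-interval 900

# Juniper side: pacing between LSP transmissions on an interface (milliseconds)
set protocols isis interface so-0/0/0.0 lsp-interval 100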

Why the 0x0000 entries then?  It turns out that in the initial exchange, the ISIS routers share with each other what LSPs they have, without sending the full LSP.  Thus, in Cisco ISIS databases, the 0x0000 entry acts as a placeholder until complete LSP data is received.  Normally this period is short and you don’t see the entry.  We probably would have found the person who knew that eventually, but we didn’t find him that night and our database of cases, newsgroup postings, and bugs turned up nothing to point us in the right direction.

I touched a couple thousand cases in my time at TAC, but this case I remember even 10 years later because of the seeming complexity, the simplicity of the resolution, the weirdness of the symptoms, and the distractors like the PRP upgrade.  Often a major outage sends you in a lot of directions and down many rat holes.  I don’t think we could have done much differently, since the config error was totally invisible to us.  Anyway, if Juniper and Cisco can work together to solve a customer issue, maybe we should have hope for world peace.

Where I’ve been, and what a TME is

Jesse, a recent commenter, asked why I haven’t been posting much lately.  In fact, my last post was August of 2017.  Well, there are several reasons I don’t post much these days.  In part, I’m not convinced anyone is reading.  It’s nice to see a comment now and again to realize it’s not just spambots looking at SZ.

The other major reason was a job change.  I moved to Cisco over two years ago, and I came in as an individual contributor (IC).  I liked to joke that I had never been so busy since…the last time I worked at Cisco.  However, as an IC, I had no idea how easy I had it.

Someone got the crazy idea to make me a manager.  So now, not only do I have the Principal Technical Marketing Engineer title, I also manage a team of 10 TMEs.  The team happens to be driving Software-Defined Access, currently Cisco’s flagship product.  So, the time for blogging is a bit limited.  I’m still working on programmability in my spare time, and I’m continuing to do Cisco Live sessions at least twice a year.  My hair is turning white and I don’t think it’s just my age.

That said, I cannot imagine a better job or place to be than this job at this time.  It’s an exciting company to work for, and an exciting time.  The team that reports to me includes some of the smartest and hardest-working TMEs in Cisco.  These guys are legendary.  (For me “guys” is gender-neutral, for those of you who worry about such things.)  And my boss is considered by many to be one of the best who ever did the TME job.

A quick primer on exactly what a TME does, for those who don’t know:  We are (usually) attached to a business unit within Cisco, and we are really an interface between sales and engineering.  We also work directly with customers, but generally when sales pulls us in.  TMEs are technical (the “T” in “TME”), so they are expected to know their product/technology in detail.  They are, however, marketers (the “M” in “TME”), so they need to be able to explain what their product does.

On the inbound side, we learn the requirements for products from sales and customers and communicate those requirements to engineering.  We work closely with Product Managers (PMs) to develop Product Requirement Documents (PRDs) and meet regularly with engineering to ensure that they are building their products in a way that satisfies marketing requirements.

On the outbound side, we develop collateral, which could be white papers, videos, slide decks, etc.  (We do not write the documentation you see on CCO, but we certainly review it.)  We present to our sales team at twice-a-year events, explaining the latest developments in our products and collecting their feedback on what we could do better.  We travel on site to meet with customers in support of sales, or else meet the customers here, at the Executive Briefing Center (EBC).  The most enjoyable part of the year, for most of us, is Cisco Live, our major trade show.

We have four CL events each year:  Europe, South America, Australia, and US.  The US event is the largest of all.  I generally attend both Europe (we were in Barcelona this year) and the US event (Orlando this time).  These events are a blast, but I never realized how much hard work goes into planning the event and developing the content.  It’s also stressful.  I’ve been fortunate to win distinguished speaker two events in a row, which means I was rated in the top 10%.  However, standing in front of an audience of hundreds is always a bit nerve-wracking, and getting ready requires a ton of preparation.  Still, it’s a great opportunity to meet with customers and have a good time.

The pace of work for TMEs is relentless.  I used to say TAC was relentless, because the second you close a case, you take another.  Well, with two SEVTs (sales events) and four Cisco Lives to prepare for, plus a constant and never-ending series of product/software releases…well, it never stops here either.

So that’s why the blogs have fallen away.  I do think I can find 10-15 minutes to post updates at least every week, so I’m going to try to do it.  I wouldn’t mind actually continuing the series on programmability I started.  I’d like to clean up and revise the Ten Years a CCIE series.  I also have another TAC tale to write up, one of my all-time favorites, so look for that soon.

And to Jesse:  thanks for getting me going!

TAC Tales #12: SACK of trouble

When I first started at Cisco TAC, I was assigned to a team that handled only enterprise customers.  One of the first things my boss said to me when I started there was “At Cisco, if you don’t like your boss or your cubicle, wait three months.”  Three months later, they broke the team up and I had a new boss and a new cubicle.  My new team handled routing protocols for both enterprise and service provider customers, and I had a steep learning curve having just barely settled down in the first job.

A P1 case came into my queue for a huge cable provider.  Often P1’s are easy, requiring just an RMA, but this one was a mess.  It was a coast-to-coast BGP meltdown for one of the largest service provider networks in the country.  Ugh.  I was on the queue at the wrong time and took the wrong case.

The cable company was seeing BGP adjacencies reset across their entire network.  The errors looked like this:

Jun 16 13:48:00.313 EST: %BGP-5-ADJCHANGE: neighbor 172.17.249.17 Down BGP
Notification sent

Jun 16 13:48:00.313 EST: %BGP-3-NOTIFICATION: sent to neighbor 172.17.249.17
3/1 (update malformed) 8 bytes 41A41FFF FFFFFFFF

The cause seemed to be malformed BGP packets, but why?  The GSR routers they had were kind enough to give us a hex dump of the BGP packet when an adjacency reset.  I got out my trusty Doyle book and began decoding the packets on paper, until a colleague pointed me to an internal Cisco tool that would decode a BGP packet from hex.

We could see that, for some reason, the NLRI portion of the BGP message was getting cut off.  According to my calculations, it should have been 44 bytes, but we were only seeing 32 bytes of information.  NLRI is Network Layer Reachability Information, just a fancy BGP way of describing the prefixes that go into a routing update.  We also noticed a clue in the router logs:  TCP-6-TOOBIG messages showing up from time to time.

Going over it with engineering, we realized something interesting.  The customer had enabled TCP selective acknowledgement on all their routers.  Also known as SACK, TCP selective acknowledgement is designed to circumvent an inefficiency in TCP.  If, say, the first of three TCP segments gets dropped, plain TCP requires re-transmission of all three segments.  In other words, the receiver keeps ACKing the last in-order segment it received, and it takes time for the sender to realize something is wrong.  When the sender finally realizes it, it goes back to the last known good segment and re-transmits everything after it.  SACK allows TCP to acknowledge and re-transmit specific segments.  If we are only missing segments 2, 3, and 5, we can ask for just those to be re-transmitted.  SACK is carried as an option in the TCP header.

The problem is, there is a finite amount of space for options in the TCP header, and the SACK option can get rather long.  It just so happens that BGP also stores its MD5 authentication hash as a TCP header option.  If SACK gets too long, it can crowd out the MD5 option and cause BGP errors.  Based on our analysis, this was exactly what had happened; thus, the malformed packets.  We had the customer remove the SACK option from all routers and the problem stopped.
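
The numbers make the collision easy to see: a TCP header has at most 40 bytes of option space, the MD5 signature option takes 18 of them, and each SACK block costs 8 bytes on top of a 2-byte option header, so a few discontiguous losses are enough for SACK and MD5 to start competing for the same space.  The workaround itself was a single line of global configuration on each router:

! Global workaround: turn TCP selective acknowledgement back off
no ip tcp selective-ack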

We were left with a couple of questions.  Why did SACK get so long, and why would it be allowed to overwrite other important values in the TCP header?  In answer to the first question, there was a bug causing some linecards to occasionally send out malformed packets, which in turn triggered the SACKs.  In answer to the second question, there was a bug in the packing of TCP header options that allowed one option (SACK) to crowd out another (the MD5 authentication).  I knew the case wouldn’t close for a long time.  Multiple bugs needed to be filed, and new code qualified and installed.  Fortunately the customer had a workaround (disable SACK) and an HTE, a TAC engineer dedicated to their account.  He grabbed the case from me for babysitting and I moved on to my next case.

In my TAC tales I often make fun of the occasional mistakes of TAC engineers.  However, TAC is a tough job, and the organization is staffed by some top engineers.  Many cases, like this one, required hard core engineering and knowledge that spans protocol details and ASIC-level hardware debugging.  It’s not a job for the faint of heart.  This case required digging into the TCP header, understanding how options are packed, and figuring out how to stop a major meltdown of a service provider network.  A high-stress situation, to be sure, but these cases often were the most rewarding.