
“Progress might have been alright once, but it has gone on too long.”
–  Ogden Nash

The book The Innovator’s Dilemma appears on the desk of a lot of Silicon Valley executives.  Its author, Clayton Christensen, is famous for having coined the term “disruptive innovation.”  The term has always bothered me, and I keep waiting for the word “disruption” to die a quiet death.  I have the disadvantage of having studied Latin quite a bit.  The word “disrupt” comes from the Latin verb rumpere, which means to “break up”, “tear”, “rend”, “break into pieces.”  The Latin word, like our English derivative, connotes something quite bad.  If you think “disruption” is good, what would you think if I disrupted a presentation you were giving?  What if I disrupted the electrical system of your heart?

Side note:  I’m fascinated with the tendency of modern English to use “bad” words to connote something good.  In the 1980’s the word “bad” actually came to mean its opposite.  “Wow, that dude is really bad!” meant he was good.  Cool people use the word “sick” in this way.  “That’s a sick chopper” does not mean the motorcycle is broken.

The point, then, of disruption is to break up something that already exists, and this is what lies beneath the b-school usage of it.  If you innovate, in a disruptive way, then you are destroying something that came before you–an industry, a way of working, a technology.  We instantly assume this is a good thing, but what if it’s not?  Beneath any industry, way of working, or technology are people, and disruption is disruption of them, personally.

The word “innovate” also has a Latin root.  It comes from the word novus, which means “new”.  In industry in general, but particularly in the tech industry, we positively worship the “new”.  We are constantly told we have to always be innovating.  The second a technology is invented and gets established, we need to replace it.  Frame Relay gave way to MPLS, MPLS is giving way to SD-WAN, and now we’re told SD-WAN has to give way…  The life of a technology professional, trying to understand all of this, is like a man trying to walk on quicksand.  How do you progress when you cannot get a firm footing?

We seem to have forgotten that a journey is worthless unless you set out on it with an end in mind.  One cannot simply worship the “new” because it is new–this is self-referential pointlessness.  There has to be a goal, or an end–a purpose, beyond simply just cooking up new things every couple years.

Most tech people and b-school people have little philosophical education outside of, perhaps (and unfortunately) Atlas Shrugged.  Thus, some of them, realizing the pointlessness of endless innovation cycles, have cooked up ludicrous ideas about the purpose of it all.  Now we have transhumanists telling us we’ll merge our brains with computers and evolve into some sort of new God-species, without apparently realizing how ridiculous they sound.  COVID-19 should disabuse us of any notion that we’re not actually human beings, constrained by human limitations.

On a practical level, the furious pace of innovation, or at least what is passed off as such, has made the careers of technology people challenging.  Lawyers and accountants can master their profession and then worry only about incremental changes.  New laws are passed every year, but fundamentally the practice of their profession remains the same.  We, however, seem to face radical disruption every couple of years.  Suddenly, our knowledge is out-of-date.  Technologies and techniques we understood well are yesterday’s news, and we have to re-invent ourselves yet again.

The innovation imperative is driven by several factors:  Wall Street constantly pushes public companies to “grow”, thus disparaging companies that simply figure out how to do something and do it well.  Companies are pressured into expanding to new industries, or into expanding their share of existing industries, and hence need to come up with ways to differentiate themselves.  On an individual level, many technologists are enamored of innovation, and constantly seek to invent things for personal satisfaction or for professional gain.  Wall Street seems to have forgotten the natural law of growth.  Name one thing in nature that can grow forever.  Trees, animals, stars…nothing can keep growing indefinitely.  Why should a company be any different?  Will Amazon simply take over every industry and then take over governing the planet?  Then what?

This may seem a strange article coming from a leader of a team in a tech company that is handling bleeding edge technologies.  And indeed it would seem to be a heresy for someone like me to say these things.  But I’m not calling for an end to inventing new products or technologies.  Having banged out CLI for thousands of hours, I can tell you that automating our networks is a good thing.  Overlays do make sense in that they can abstract complexity out of networks.  TrustSec/Scalable Group Tags are quite helpful, and something like this should have been in IP from the beginning.

What I am saying is that innovation needs a purpose other than just…innovation.  Executives need to stop waxing eloquent about “disrupting” this or that, or our future of fusing our brains with an AI Borg.  Wall Street needs to stop promoting growth at all costs.  And engineers need time to absorb and learn new things, so that they can be true professionals and not spend their time chasing ephemera.

Am I optimistic?  Well, it’s not in my nature, I’m afraid.  As I write this, we are in the midst of the Coronavirus crisis.  I don’t know what the world will look like a year from now.  Business as usual, with COVID a forgotten memory?  Perhaps.  Great Depression due to economic shutdown?  Perhaps.  Total societal, governmental, and economic collapse, with rioting in the streets?  I hope not, but perhaps.  Whatever happens, I do hope we remember that the word “novel”, as in “novel Coronavirus”, comes from the same Latin root as the word “innovation”.  New isn’t always the best.

In my last post, I discussed the BBS and how it worked.  (It would be helpful to review, to understand the terminology.)  In this post, I have resurrected, in part, the BBS I used to run from 1988-1990.  It was called “The Tower”, for no particularly good reason except that it sounded cool to my teenage mind.

Now, bringing this back to life was no simple task, but it was aided by some foresight I had 20 years ago.  I had a Mac with a floppy drive, and realizing the floppy era was coming to a close, I decided to produce disk images of all the 3.5 inch floppies I had saved from my Apple II days.  Fortunately, my last Apple II, the IIGS, used 3.5″ drives instead of the 5.25″ drives that were more common on the Apple IIs.  The Macs that had floppy drives all had 3.5″ drives.  Additionally, Apple had included software in the pre-OS X Mac OS to read ProDOS (Apple II) disks.  Thus, in the year 2000, I could mount an Apple II floppy from a dozen years prior and make an image out of it.

I did not have a full working version of my GBBS, however, so I had to download a copy.  I also had to do a lot of work to bring it up under Macos (not Apple’s MacOS, but “Modified ACOS”), the modified form of the GBBS compiler I used at the time.  All of my source files required Macos rather than the stock GBBS software.  Believe me, even though I ran the BBS for a couple of years and wrote a lot of the code, remembering how to do any of this after 30 years was non-trivial.

Rather than hook up my old IIGS, which I still have, it made a lot more sense to use an emulator.  (It also enabled me to take screen shots.)  I used an emulator called Sweet16, which is a bit bare bones but does the trick.  In case you’re not familiar with the Apple II series, the early models were primarily text-driven.  They had graphics, of course, but they were not GUI machines.  After the Mac came out, there was a push to incorporate a GUI into the Apple II and the result was the Apple IIGS (Graphics and Sound).  While it had a GUI-driven OS (ProDOS 16 at first, replaced by GS/OS), it was backwards compatible with the old Apple II software.  The GBBS software I ran was classic Apple II, and thus it was a bit of a waste to run it on an Apple IIGS, but, well, that’s what I did.

In this screen shot (Figure 1), you can see the Apple IIGS finder from which I’m launching the BBS software, the only GUI shot you’ll see in the article:

Figure 1: The Apple IIGS ProDOS Finder

The next shot (Figure 2) shows the screen visible only to the sysop while waiting for a call.  As sysop, I had the option to hit a key and log in myself, but if a user dialed in, the system would beep and the user would begin the login process.  I’m not sure why we’re awaiting call 2, which will be call 1 today, but it looks like a bug I need to hunt down.  The screen helpfully tells me if new users have signed up for the BBS, and whether I have mail.

Figure 2: The landing page while waiting for a call

(If you want to know why I used the silly handle “Mad MAn”, please see the previous article.)

The next screen shows the BBS right after logon.  The inverse text block at the top was a local sysop-only view, showing user information including the user name and phone number, as well as the user’s flags.  These are interesting.  Some BBS software provided access levels for controlling what a user could and could not do.  Instead of sequential access levels, GBBS provided a series of binary flags the sysop could set.  Thus, I could give access to one area but not another, whereas sequential access levels mean that each level inherits the privileges of the level below it.  Very clever.  A few other stats are displayed that I won’t go into.  I’ll turn off the sysop bar for the remaining screen shots.
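To make the flag idea concrete, here is a tiny Python sketch of the concept (my own illustration, not the original ACOS code; the flag names are hypothetical).  Independent bits can be granted in any combination, whereas a single numeric access level cannot.

from enum import IntFlag

class Access(IntFlag):
    # Hypothetical flag names -- the real GBBS flags were defined by the sysop
    EMAIL  = 1 << 0
    BOARDS = 1 << 1
    FILES  = 1 << 2
    SYSOP  = 1 << 3

# Grant file transfers without the message boards: easy with flags,
# impossible with a strictly sequential access level.
user = Access.EMAIL | Access.FILES

print(bool(user & Access.FILES))   # True
print(bool(user & Access.BOARDS))  # False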

Figure 3: The main level prompt with sysop bar. Be sure to report error #20!

Note the prompt provided to the user in figure 3.  It tells you:

  • That the sysop is available
  • That the user has not paged the sysop
  • That the double colons (::) would normally display the time left on the system.  Since this was a dial-up system, I needed to limit the time users could spend on the BBS, but as sysop, I of course had unlimited time.
  • That the BBS had different areas; the prompt (like an IOS prompt) tells you where you are (“Main level”)

Next, in figure 4 you can see the main menu options for a user logged into the BBS.  This is the default/stock GBBS menu, as my original is lost.  Despite the limited options, this was like entering a new world in the days of 64K RAM.  You can see that a user could send/read mail, go to a file transfer section, chat (or attempt to chat) with the system operator, or read the public message boards.

Figure 4: The BBS main menu. This is the GBBS default, not the custom menu I built

Next, the user list.  I had 150 users on my BBS, not all of them active.  I blacked out the last names and phone numbers, but you can get a sense of the handles that were used at the time.  In addition to these names, there were a lot of Frodos and Gandalfs floating around.  Also note that most BBSing was local (to avoid long-distance charges).  Sadly, none of these users has logged on since 1989.  I wish they’d come back.  Oggman, whom I mentioned in my last post, was a user on my board.

Figure 5: My user list

Conclusions

I recently interviewed a new college grad who asked me how she could be successful at a company like Cisco.  My answer was that you have to understand where we came from in order to understand where we are.  You cannot understand, say, SD-WAN without understanding how we used to build WANs.  Go back to the beginning.  Learn what SneakerNet was.  Understand why we are where we are.  Even before SneakerNet, some of us were figuring out how to get computers to talk to each other over an already existing network–the analog telephone network.  As a side note, I love vintage computing.  It’s a lot of fun using emulators to resurrect the past, and I hope to do some physical restorations someday.  Trying to figure out how to boot up a long-defunct system like this BBS provides a great reminder of how easy we have it now.

With Coronavirus spreading, events shut down, the Dow crashing, and all the other bad news, how about a little distraction?  Time for some NetStalgia.

Back in the mid 1990’s, I worked at a computer consulting firm called Mann Consulting.  Mann’s clientele consisted primarily of small ad agencies, ranging from a dozen people to a couple hundred.  Most of my clients were on the small side, and I handled everything from desktop support to managing the small networks these customers had.  This was the time when the Internet took the world by storm–venture capitalists poured money into the early dotcoms, who in turn poured it into advertising.  San Francisco ad agencies were at the heart of this, and as they expanded they leaned on companies like Mann to build out their IT infrastructure.

I didn’t particularly like doing desktop support.  For office workers, a computer is the primary tool they use to do their job.  Any time you touch their primary tool, you have the potential to mess something up, and then you are dealing with angry end users.  I loved working on networks, however small they were.  For some of these customers, the network consisted of a single hub (a real hub, not a switch!), but for others it was more complicated, with switches and a router connecting them to the Internet.

Two of my customers went through DDoS episodes.  To understand them, it helps to look at the networks of the time.

Both customers had roughly the same topology.  A stack of switches was connected together via back-stacking.  The entire company, because of its size, was in a single layer 2/layer 3 domain.  No VLANs, no subnetting.  To be honest, at the time I had heard of VLANs but didn’t really understand what they were.  Today we all use private, RFC 1918 addressing for end hosts, except for DMZs.  Back then, our ISP assigned us a block of addresses and we simply applied the public addresses directly to the end stations themselves.  That’s right, your laptop had a public IP address on it.  We didn’t know a thing about security; both companies had routers connected directly to the Internet, without even a simple ACL.  I think most companies were figuring out the benefits of firewalls at the time, but we also had a false sense of security because we were Mac-based, and Macs were rarely hacked back then.

One day, I came into work at a now-defunct ad agency called Leagas Delaney.  Users were complaining that nothing was working–they couldn’t access the Internet and even local resources like printing were failing.  Macs didn’t even have ping available, so I tried hitting a few web sites and got the familiar hung browser.  Not good.

I went into Leagas’ server room.  The overhead lights were off, so the first thing I noticed was the lights on the switches.  Each port had a traffic light, and every one was solid, not blinking the way they usually did.  When they did occasionally blink, they all blinked in unison.  Not good either.  Something was amiss, but what?

Wireshark didn’t exist at the time.  There was a packet sniffer called Etherpeek available on the Mac, but it was pricey–very pricey.  Luckily, you could download it with a demo license.  It’s been over 20 years, so I don’t quite recall how I managed to acquire it with the Internet down and no cell phone tethering, but I did.  Plugging the laptop into one of the switches, I began a packet capture and immediately saw a problem.

The network was being aggressively inundated with packets destined for the subnet broadcast address.  For illustration, I’ll use one of Cisco’s reserved banks of public IP addresses.  If the subnet was 209.165.200.224/27, then the broadcast address would be 209.165.200.255.  Sending a packet to this address means it is received by every host in the subnet, just like a packet sent to the generic broadcast address of 255.255.255.255.  Furthermore, because this address was not generic but carried the subnet prefix, a packet sent to it could be routed through the Internet to our site.  This is known as a directed broadcast.  Now, imagine you spoof the source address to be somebody else’s.  You send a single packet to a network with, say, 100 hosts, and those 100 hosts all reply to the source address, which is actually not yours but belongs to your attack target.  This was known as a smurf attack, and they were quite common at the time.  There is really no good reason to allow these directed broadcasts, so after I called my ISP, I learned how to shut them down with the “no ip directed-broadcast” command.  Nowadays this sort of traffic isn’t allowed, most companies have firewalls, and end hosts don’t sit on public IP addresses, so the attack wouldn’t work anyhow.
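If you want to check the arithmetic yourself, here is a quick Python sketch (my own illustration, reusing the same documentation prefix) showing how the directed-broadcast address falls out of the subnet mask:

import ipaddress

# The /27 from the example: 27 mask bits leave 5 host bits, and setting
# all 5 host bits gives the subnet's directed-broadcast address.
net = ipaddress.ip_network("209.165.200.224/27")

print(net.broadcast_address)    # 209.165.200.255
print(net.num_addresses - 2)    # 30 usable hosts that would answer a spoofed ping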

My second story is similar.  While still working for Mann, I was asked to fill in for one of our consultants who was permanently stationed at an ad agency as their in-house support guy.  He was going on vacation, and my job was to sit in the server room/IT office and hopefully not do anything at all.  Unfortunately, the day after he left a panicked executive came into the server room complaining that the network was down.  So much for a quiet week.

As I walked around trying to assess the problem, of course I overheard people saying “see, Jon leaves, they send a substitute, and look what happens!”  People started asking me if I had “done” anything.

A similar emergency download of a packet sniffer immediately led me to the source of the problem.  The network was flooded with broadcast traffic from a single host, a large-format printer.  I tracked it down, unplugged it, and everything started working again.  And yet several employees still seemed suspicious I had “done” something.

Problems such as these led to the invention of new technologies to stop directed broadcasts and contain broadcast storms.  It’s good to remember that there was a time before these things existed, and before we even had free packet sniffers.  We had to improvise a lot back then, but we got the job done.

This one falls into the category of, “I probably shouldn’t post this, especially now that I’m at Cisco again,” but what the heck.

I’ve often mentioned, in this series, the different practices of “backbone TAC” (or WW-TAC) and High Touch Technical Support (HTTS), the group I was a part of.  WW-TAC was the larger TAC organization, where the vast majority of the cases landed.  HTTS was (and still is) a specialized TAC group dedicated to Cisco’s biggest customers, who generally pay for the additional service.  HTTS was supposed to provide a deeper knowledge of the specifics of customer networks and practices, but generally worked the same as TAC.  We had our own queues, and when a high-touch customer would open a case, Cisco’s entitlement tool would automatically route their case to HTTS based on the contract number.

Unlike WW-TAC, HTTS did not use the “follow the sun” model.  Under follow-the-sun, regular TAC cases were picked up by a region where it was currently daytime, and when a TAC agent’s shift ended, they would find another agent in the next timezone over to pick up a live (P1/P2) case.  At HTTS, we had US-based employees only at the time, and they had to work P1/P2 cases to resolution.  This meant if your shift ended at 6pm and a P1 case came in at 5:55, you might be stuck in the office for hours until you resolved it.  We did have a US-based nightshift that came on at 6pm, but they only accepted new cases–we couldn’t hand off a live one to nightshift.

Weekends were covered by a model I hated, called “BIC”.  I asked my boss what it stood for and he explained it was either “Butt In Chair” or “Bullet In the Chamber.”  The HTTS managers would publish a schedule (quarterly if I recall) assigning each engineer one or two 6 hour shifts during the weekends of that quarter.  During those 6 hours, we had to be online and taking cases.

Why did I hate it?  First, I hated working weekends, of course.  Second, the caseload was high.  A normal day on my queue might see 4 cases per engineer, but on BIC you typically took seven or eight.  Third, you had to take cases on every topic.  During the week, only a voice engineer would pick up a voice case.  But on BIC, I, a routing protocols engineer, might pick up a voice case, a firewall case, a switching case…or whatever.  Fourth, because BIC took place on a weekend, normal escalation channels were not available.  If you had a major P1 outage, you couldn’t get help easily.

Remember that a lot of the cases you accepted took weeks or even months to resolve.  Part of a TAC engineer’s day is working his backlog of cases:  researching, working in the lab to recreate a problem, talking to engineering, etc., all to resolve these cases.  When you picked up seven cases on a weekend, you were slammed for weeks after that.

We did get paid extra for BIC, although I don’t remember how much.  It was hundreds of dollars per shift, if I recall.  Because of this, a number of engineers loaded up on BIC shifts and earned thousands of dollars per quarter.  Thankfully, this meant there were plenty of willing recipients when I wanted to give away my shifts, which I did almost always.  (I worked two during my two years at TAC.)  However, sometimes I could not find anyone to take my shift, and in that case I would effectively pay someone to take it, offering an additional hundred dollars on top of the BIC pay.  That’s how much I hated BIC.  Of course, this was done without the company knowing about it, as I’m sure they wouldn’t have approved of me buying my way out of my shifts!

We had one CSE on our team, I’ll call him Omar, who loaded up on BICs.  Then he would come into his week so overloaded with cases from the weekend that he would hardly take a case during the week.  We’d all get burdened with extra load because Omar was off working his weekend cases.  Finally, as team lead, I called him out on it in our group chat and Omar blew up on me.  Well, I was right of course but I had to let it go.

I don’t know if HTTS still does BIC, although I suspect it’s gone away.  I still work almost every weekend I have, but it’s to stay on top of work rather than taking on more.

Two things can almost go without saying:

  1. If you start a blog, you need to commit time to writing it.
  2. When you move up in the corporate world, time becomes a precious commodity.

When I started this blog several years ago, I was a network architect at Juniper with a fair amount of time on my hands.  Then I came to Cisco as a Principal TME, with a lot less time on my hands.  Then I took over a team of TMEs.  And now I have nearly 40 people reporting to me, and responsibility for technical marketing for Cisco’s entire enterprise software portfolio.  That includes ISE, Cisco DNA Center, SD-Access, SD-WAN (Viptela), and more.  With that kind of responsibility and that many people depending on me, writing TAC Tales becomes a lower priority.

In addition, when you advance in the corporate hierarchy, expressing your opinions freely becomes more dangerous.  What if I say something I shouldn’t?  Or, do I really want to bare my soul on a blog when an employee is reading it?  Might they be offended, or afraid I would post something about them?  Such concerns don’t exist when you’re an individual contributor, even at the director level, which I was.

I can take some comfort in the fact that this blog is not widely read.  The handful of people who stumble across it probably will not cause me problems at work.  And, as for baring my soul, well, my team knows I am transparent.  But time is not something I have much of these days, and I cannot sacrifice work obligations for personal fulfillment.  And that’s definitely what the blog is.  I do miss writing it.

Is this a goodbye piece?  By no means.  The blog will stay, and if I can eke out 10 minutes here or there to write or polish an old piece, I will.  Meanwhile, be warned about corporate ladder climbing–it has a way of chewing up your time.

It’s inevitable as we get older that we look back on the past with a certain nostalgia.  Nostalgia or not, I do think that computing in the 1980’s was more fun and interesting than it is now.  Personal computers were starting to become common, but were not omnipresent as they are now.  They were quite mysterious boxes.  An error might throw you into a screen that displayed hexadecimal with no apparent meaning.  Each piece of software had its own unique interface, since there were no set standards.  For some, there might be a menu-driven interface.  For others you might use control keys to navigate.  Some programs required text commands.  Even working with devices that had only 64 Kilobytes of memory, there was always a sense of adventure.

I got my start in network engineering in high school.  Computer networks as we understand them today didn’t really exist back then.  (There was a rudimentary Internet in some universities and the Defense Department.)  Still, we found ways to connect computers together and get them to communicate, the most common of which was the Bulletin Board System, or BBS.

The BBS was an individual computer equipped with a modem, into which other computer users could dial.  For those who aren’t familiar with the concept of a modem, this was a device that enabled computer data to be sent over analog telephone lines.    Virtually all BBS’s had a single phone line and modem connecting to a single computer.  (A few could handle multiple modems and callers, but these were rare.)  The host computer ran special BBS software which received connections from anyone who might dial into it.  Once the user dialed in, then he or she could send email, post messages on public message boards, play text-based video games, and do file transfers/downloads.  (Keep in mind, the BBS was text-only, with no graphics, so you were limited in terms of what you could do.)  An individual operator of a BBS was called a System Operator or Sysop (“sis-op”).  The sysop was the master of his or her domain, and occasionally a petty tyrant.  The sysop could decide who was allowed to log into the board, what messages and files could be posted, and whether to boot a rude user.

Because a BBS had a single modem, dialing in was a pain.  That was especially true for popular BBS’s.  You would set your terminal software to dial the BBS phone number, and you would often get a busy signal because someone else was using the service.  Then you might set your software to auto re-dial the BBS until you heard the musical sound of a ring tone followed by modems chirping to each other.

How did you find the phone numbers for BBS’s in the era before Google?  You might get them from friends, but often you would find them posted as lists on other BBS’s.  When we first bought our modem for my Apple II+, we also bought a subscription to CompuServe, a public multi-user dial-in service.  On one of their message boards, I managed to find a list of BBS’s in the 415 area code where I resided.  I dialed into each of them.  Some BBS’s on the list had shut down, and I could hear someone saying “Hello??” through the modem speaker.  Others connected; I set up an account and, after perusing the board, I would download a list of more BBS numbers and go on to try them.

Each sysop configured the board however seemed best to them, so BBS’s tended to have a lot of variation.  The software I used–the most common among Apple II users–was called GBBS.  GBBS had its own proprietary programming language and compiler called ACOS, allowing heavy customization.  I re-wrote almost the entire stock bulletin board system in the years I ran mine.  It also allowed for easy exchange of modules.  I delegated a lot of the running of my board to volunteer co-sysops, and one of them wanted to run a fantasy football league.  He bought the software, I installed it, and we were good to go.  I had friends who ran BBS’s on other platforms without GBBS, and their boards were far less customizable.

A funny story about that fantasy football sysop.  Back then the software came on floppy disks, and while I insisted on him mailing it to me, he insisted on meeting me in person and handing it over.  I was terrified of meeting this adult and revealing that I was only 14 years old.  I wanted everyone on the board to think I was an adult, not a teenager.  It helped project authority.  He wouldn’t budge, so we agreed to meet at a local sandwich shop.  Imagine my surprise when a 12-year-old walked in carrying the disks!  We had a nice lunch and I at least knew I could be an authority figure for him.  I suspect most of my users were no older than seventeen.

Each user on a BBS had a handle, which was just a screen name.  I’m somewhat embarrassed to admit that mine was “Mad MAn”.  I don’t really recall how I thought of the name, but you always wanted to sound cool, and to a 15-year-old, “madman” sounded cool.  This was in the era before school violence, so it wasn’t particularly threatening.  I spelled it with two words because I didn’t know how to spell “madman”, and this was before every spelling mistake was underlined in red.  The second A was capitalized because I was a bad typist and couldn’t get my finger off the shift key fast enough.  Eventually I just adopted that as a quirk.  Because the BBS population consisted largely of nerdy teenage boys, a lot of the handles came from Lord of the Rings and other fantasy and sci-fi works.  I can’t tell you how many Gandalfs were floating around, but there were a lot.  I had a Strider for a co-sysop.  Other handles, like mine, attempted to sound tough.  I had another co-sysop whose handle was Nemesis.

Since each BBS was an island, if someone sent you an email on BBS1, you couldn’t see it on BBS2.  So, if you were active on five BBS’s, you had to log in to all five and check email separately.  At one point a sysop who went by the handle “Oggman” launched a system called OGG-Net.  (His BBS also had a cool name, “Infinity’s Edge”.)  Oggy’s BBS became a central repository for email, and subscribing boards would dial in at night to exchange emails they had queued up.  This of course meant that it could take an entire day for email to propagate from one BBS to another, but it was better than before.

I’m writing this post in my “NetStalgia” series for a couple reasons.  First, it’s always important to look back in order to know where you are going.  Second, I’ve resurrected my old BBS using an Apple II emulator, and in my next post I’m going to share a few screen shots of what this thing actually looked like.  I hope you’ll enjoy them.

The case came into the routing protocols queue, even though it was simply a line card crash.  The RP queue in HTTS was the dumping ground for anything that did not fit into one of the few other specialized queues we had.  A large US service provider had a Packet over SONET (PoS) line card on a GSR 12000-series router crashing over and over again.

Problem Details: 8 Port ISE Packet Over SONET card continually crashing due to

SLOT 2:Aug  3 03:58:31: %EE48-3-ALPHAERR: TX ALPHA: error: cpu int 1 mask 277FFFFF
SLOT 2:Aug  3 03:58:31: %EE48-4-GULF_TX_SRAM_ERROR: ASIC GULF: TX bad packet header detected. Details=0x4000

A previous engineer had the case, and he did what a lot of TAC engineers do when faced with an inexplicable problem:  he RMA’d the line card.  As I have said before, RMA is the default option for many TAC engineers, and it’s not a bad one.  Hardware errors are frequent and replacing hardware often is a quick route to solving the problem.  Unfortunately the RMA did not fix the problem, the case got requeued to another engineer, and he…RMA’d the line card.  Again.  When that didn’t work, he had them try the card in a different slot, but it continued to generate errors and crash.

The case bounced through two other engineers before getting to me.  Too bad the RMA option was out.  But the simple line card crash and error got even weirder.  The customer had two GSR routers in two different cities that were crashing with the same error.  Even stranger:  the crash was happening at precisely the same time in both cities, down to the second.  It couldn’t be a coincidence, because each crash on the first router was mirrored by a crash at exactly the same time on the second.

The conversation with my fellow engineers ranged from plausible to ludicrous.  There was a legend in TAC, true or not, that solar flares cause parity errors in memory and hence crashes.  Could a solar flare be triggering the same error on both line cards at the same time?  Some of my colleagues thought it was likely, but I thought it was silly.

Meanwhile, internal emails were going back and forth with the business unit to figure out what the errors meant.  Even for experienced network engineers, Cisco internal emails can read like a foreign language.  “The ALPHA errors are side-effects of the GULF errors,” one development engineer commented, not so helpfully.  “Engine is feeding invalid packets to GULF and that causes the bad header error being detected on GULF,” another replied, only slightly more helpfully.

The customer, meanwhile, had identified a faulty fabric card on a Juniper router in their core.  Apparently the router was sending malformed packets to multiple provider edge (PE) routers all at once, which explained the simultaneous crashing.  Because all the PEs were in the US, forwarding was a matter of milliseconds, and thus there was very little variation in the timing.  How did the packets manage to traverse the several hops of the provider network without crashing any GSRs in between?  Well, the customer was using MPLS, and the corruption was in the IP header of the packets.  The intermediate hops forwarded the packets, without ever looking at the IP header, to the edge of the network, where the MPLS labels get stripped, and IP forwarding kicks in.  It was at that point that the line card crashed due to the faulty IP headers.  That said, when a line card receives a bad packet, it should drop it, not crash.  We had a bug.
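To make the label-versus-IP point concrete, here is a toy Python sketch (purely my own illustration, with made-up tables, nothing resembling vendor code).  A core router switches on the MPLS label and never parses the IP header, so a corrupted header is only noticed at the edge where the label is popped:

# Toy model: label switching in the core vs. IP forwarding at the edge.
LFIB = {100: "to-P2", 200: "to-PE"}              # label -> next hop (made up)
FIB = {"198.51.100.0/24": "customer-facing"}     # prefix -> interface (made up)

def core_forward(packet):
    # A core (P) router looks only at the top MPLS label; the IP header
    # underneath is opaque payload, corrupted or not.
    return LFIB[packet["label"]]

def edge_forward(packet):
    # The egress PE pops the label and must parse the IP header; this is
    # where a malformed header should be dropped (not crash the line card).
    if packet["ip_header"] is None:              # stand-in for "malformed"
        raise ValueError("malformed IP header")
    return FIB[packet["ip_header"]["dst_prefix"]]

bad = {"label": 100, "ip_header": None}
print(core_forward(bad))                         # the core forwards it happily
try:
    edge_forward(bad)
except ValueError as err:
    print("edge must handle:", err)              # the GSR crashed here instead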

The development engineers could not determine why the line card was crashing based on log info.  By this time, the customer had already replaced the faulty Juniper module and the network was stable.  The DEs wanted us to re-introduce the faulty line card into the core, and load up an engineering special debug image on the GSRs to capture the faulty packet.  This is often where we have a gulf, pun intended, between engineering and TAC.  No major service provider or customer wants to let Cisco engineering experiment on their network.  The customer decided to let it go.  If it came back, at least we could try to blame the issue on sunspots.

In the last article on technical interviewing, I told the story of how I got my first networking job.  The interview was chaotic and disorganized, and yet it resulted in me getting the job and being quite successful.  In this post, I’d like to start with a very basic question:  Why is it that we interview job candidates in the first place?

This may seem like an obvious question, but if you think about it, face-to-face interviewing is not necessarily the best way to assess a candidate for a networking position.  To evaluate their technical credentials, why don’t we administer a test?  Or force network engineering candidates to configure a small network?  (Some places do!)  What exactly is it that we hope to achieve by sitting down for an hour and talking to this person face-to-face?

Interviewing is fundamentally a subjective process.  Even when an interviewer attempts to bring objectivity to the interview by, say, asking right/wrong questions, interviews are just not structured as objective tests.  The interviewer feedback is usually derived from gut reactions and feelings as much as it is from any objective criteria.  The interviewer has a narrow window into the candidate’s personality and achievements, and frequently an interviewer will make an incorrect assessment in either direction:

  • By turning down a candidate who is qualified for the job.  When I worked at TAC, I remember declining a candidate who didn’t answer some questions about OSPF correctly.  Because he was a friend of a TAC engineer, he got a second chance and did better in his second interview.  He got hired and was quite successful.
  • By hiring a candidate who is unqualified for the job.  This happens all the time.  We pass people through interviews who end up being terrible at the job.  Sometimes we just assess their personality wrong and they end up being complete jerks.  Sometimes, they knew enough technical material to skate through the interview.

Having interviewed hundreds of people in my career, I think I’m a very good judge of people.  I was on the interview team for TAC, and everyone we hired was a successful engineer.  Every TME I’ve hired as a manager has been top notch.  That said, it’s tricky to assess someone in such a short amount of time. As the interviewee, you need to remember that you only have an hour or so to convince this person you are any good, and one misplaced comment could torpedo you unfairly.

I remember when I interviewed for the TME job here at Cisco.  I did really well, and had one final interview with the SVP at the time.  He was very personable, and I felt at ease with him.  He asked me for my proudest accomplishment in my career.  I mentioned how I had hated TAC when I started, but I managed to persevere and left TAC well respected and successful.  He looked at me quizzically.  I realized it was a stupid answer.  I was interviewing for a director-level position.  He wanted to hear some initiative and drive, not that I stuck it out at a crappy job.  I should have told him about how I started the Juniper on Juniper project, for example.  Luckily I got through, but that one answer left an impression that took me down a notch.

When you are interviewing, you really need to think about the impression you create.  You need empathy.  You need to feel how your interviewer feels, or at least be self-aware enough to know the impression you are creating.  That’s because this is a subjective process.

I remember a couple of years back I was interviewing a candidate for an open position.  I asked him why he was interested in the job.  The candidate proceeded to give me a depressing account of how bad things were in his current job.  “It’s miserable here,” he said.  “Nobody’s going anywhere in this job.  I don’t like the team; they’re not motivated.”  And so forth.  He claimed he had programming capabilities, so I asked him what his favorite programming language was.  “I hate them all,” he said.  I actually think he was technically fairly competent, but in my opinion working with this guy would have been such a downer that I didn’t hire him.

In my next article I’ll take a look at different things hiring managers and interviewers are looking for in a candidate, and how they assess them in an interview.


When you open a TAC case, how exactly does the customer support engineer (CSE) figure out how to solve it?  After all, CSEs are not super-human.  Just like any engineering organization, TAC has a range from brilliant to not-so-brilliant, and everything in between.  Let me give an example:  I worked at HTTS, or high-touch TAC, serving customers who paid a premium for higher levels of support.  When a top engineer at AT&T or Verizon opened a case, how was it that I, who had never worked professionally in a service provider environment, was able to help them at all?  Usually when those guys opened a case, it was something quite complex and not a misconfigured route map!

TAC CSEs have an arsenal of tools at their disposal that customers, and even partners, do not.  One of the most powerful is well known to anyone who has ever worked in TAC:  Topic.  Topic is an internal search engine.  It can do more now, but at the time I was in TAC, Topic could search bugs, TAC cases, and internal mailers.  If you had a weird error message or were seeing inexplicable behavior, popping the message or symptoms into Topic frequently turned up a known bug.  Failing that, it might pull up another TAC case, which would show the best troubleshooting steps to take.

Topic also searches the internal mailers, the email lists used internally by Cisco employees.  TAC agents, sales people, TMEs, product managers, and engineering all exchange emails on these mailers, which are then archived.  Oftentimes a problem would show up in the mailer archives and engineering had already provided an answer.  Sometimes, if Topic failed, we would post the symptoms to the mailers in hopes that engineering, a TME, or some other expert would have a suggestion.  I was always careful in doing so:  if you posted something that had already been answered, or asked too often, flames would come your way.

TAC engineers have the ability to file bugs across the Cisco product portfolio.  This is, of course, a powerful way to get engineering attention.  Customer-found defects are taken very seriously, and any bug that is opened will get a development engineer (DE) assigned to it quickly.  We were judged on the quality of the bugs we filed, since TAC does not like to abuse the privilege and waste engineering time.  If a bug is filed for something that is not really a bug, it gets marked “J” for Junk, and you don’t want to have too many junked bugs.  That said, on one or two occasions, when I needed engineering help and the mailers weren’t working, I knowingly filed a Junk bug to get some help from engineering.  Fortunately, I filed a few real bugs that got fixed.

My team was the “routing protocols” team for HTTS, but we were a dumping ground for all sorts of cases.  RP often got crash cases, cable modem problems, and other issues, even though these weren’t strictly RP.  Even within the technical limits of RP, there is a lot of variety among cases.  Someone who knows EIGRP cold may not have a clue about MPLS.  A lot of times, when stuck on a case, we’d go find the “guy who knows that” and ask for help.  We had a number of cases on Asynchronous Transfer Mode (ATM) when I worked at TAC, which was an old WAN (more or less) protocol.  We had one guy who knew ATM, and his job was basically just to help with ATM cases.  He had a desk at the office but almost never came in, never worked a shift, and frankly I don’t know what he did all day.  But when an ATM case came in, day or night, he was on it, and I was glad we had him, since I knew little about the subject.

Some companies have NOCs with tier 1, 2, and 3 engineers, but we just had CSEs.  While we had different pay grades, TAC engineers were not tiered in HTTS.  “Take the case and get help” was the motto.  Backbone (non-HTTS) TAC had an escalation team, with some high-end CSEs who jumped in on the toughest cases.  HTTS did not, and while backbone TAC didn’t always like us pulling on their resources, at the end of the day we were all about killing cases, and a few times I had backbone escalation engineers up in my cube helping me.

The more heated a case gets, the higher the impact, the longer the time to resolve, the more attention it gets.  TAC duty managers can pull in more CSEs, escalation, engineering, and others to help get a case resolved.  Occasionally, a P1 would come in at 6pm on a Friday and you’d feel really lonely.  But Cisco being Cisco, if they need to put resources on an issue, there are a lot of talented and smart people available.

There’s nothing worse than the sinking feeling a CSE gets when realizing he or she has no clue what to do on a case.  When the Topic searches fail, when escalation engineers are stumped, when the customer is frustrated, you feel helpless.  But eventually, the problem is solved, the case is closed, and you move on to the next one.

There were quite a few big announcements at Cisco Live this year.  One of the big ones was the overhaul of the certification program.  A number of new certifications were introduced (such as the DevNet CCNA/CCNP), and the existing ones were revamped.  I wanted to do a post about this because I was involved with the certification program for quite a while in launching these.  I’m posting this on my personal blog, so my thoughts here are, of course, personal and not official.

First, the history.  Back when I was at Juniper, I had the opportunity to write questions for the service provider written exams.  It was a great experience, and I got thorough training from the cert program on how to properly write exam questions.  I don’t really remember how I got invited to do it, but it was a good opportunity, as a certified (certifiable?) individual, to give back to the program.  When I came to Cisco, I quickly connected with the cert program here, offering my services as a question writer. I had the training from Juniper, and was an active CCIE working on programmability.  It was a perfect fit, and a nice chance to recertify without taking the test, as writing/reviewing questions gets your CCIE renewed.

As I was managing a team within the business unit that was working on Software-Defined Access and programmability, it seemed logical for me to talk to the program about including those topics on the test.  I can assure you there was a lot of internal debate about this, as the CCIE exam is notoriously complex, and the point of our Intent-Based Networking products is simplicity.  One product manager even suggested a separate CCIE track for SD-Access, an idea I rejected immediately for that very reason.

Still, as I often point out here and elsewhere, SDN technologies do not eliminate the need for network engineers.  SDN products, all SDN products, are complex precisely because they are automated.  Automation enables us to build more complex things, in general.  You wouldn’t want to configure all the components of SD-Access by hand.  But we need engineers who understand what the automation tools are doing, and how to work with all the components that comprise a complex solution like SD-Access.  Network engineers aren’t going to disappear.

For this reason, we wanted SD-Access, SD-WAN, and also device programmability (NETCONF/YANG, for example) to be on the lab.  We want engineers who know and understand these technologies, and the certification program is a fantastic way to help people learn them.  I, and some members of my team, spent several months working with the CCIE program to build a new blueprint, which became the CCIE Enterprise Infrastructure.  The storied CCIE Routing and Switching will be no more.

At the end of the day, the CCIE exam has always adapted to changes in networking.  The R/S exam no longer has ISDN or IPX on it, nor should it.  Customers are looking for more automated solutions, and the exam is keeping pace.  If you’re studying for this exam, the new blueprint may be intimidating.  That said, CCIE exams have always been intimidating.  But think about this:  if you pass this exam, your resume will have skills on it that will make you incredibly marketable.

The new CCIE-EI (we always abbreviate stuff, right?) breaks down like this:

  • 60% is classic networking, the core routing protocols we all know and love.
  • 25% is SDx:  SD-Access and SD-WAN, primarily.
  • 15% is programmability.  NETCONF/YANG, controller APIs, Ansible, etc.

How do you study for this?  Like you study for anything.  Read about it and lab it.  There is quite a bit of material out there on all these subjects, but let me make some suggestions:

Programmability

You are not expected to be a programming expert for this section of the exam.  It’s not about seeing if you can write complex programs, but whether you know the basics well enough to execute some tasks via script/Ansible/etc instead of CLI.  DevNet is replete with examples of how to send NETCONF messages, or read data off a router or switch with programmable interfaces.  Download them, play with them, spend some time learning the fundamentals of Python, and relax about it.

  • Learn:  DevNet is a phenomenal resource.  Hank Preston, an evangelist for DevNet, has put out a wealth of material on programmability.  In addition, there is the book on IOS XE programmability I wrote with some colleagues.
  • Lab:  You can lab programmability stuff easily on your laptop.  Python and ncclient are free, as is Ansible.  If you have any sort of lab setup already, all you need to do is set up a Linux VM or install some tools onto your laptop; a minimal example follows below.
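For instance, here is roughly what a first NETCONF experiment with ncclient looks like (just a sketch; the host address and credentials are placeholders for whatever lab device you have):

# pip install ncclient
from ncclient import manager

# Placeholder device details -- substitute your own lab router or switch.
with manager.connect(
    host="10.0.0.1",
    port=830,
    username="admin",
    password="admin",
    hostkey_verify=False,
) as m:
    # Pull the running configuration over NETCONF and dump the raw XML.
    reply = m.get_config(source="running")
    print(reply.xml)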

Software-Defined

This is, as I said before, a tough one to test on.  After all, to add a device to an SD-Access fabric, you select it and click “Add to Fabric.”  What’s there to test?  Well, since these are new products you of course need to understand the components of SD-Access/SDWAN and how they interoperate.  How does policy work?  How do fabric domains talk to non-fabric domains?  There is plenty to study here.

  • Learn:  Again, we’ve written books on SD-Access and SD-WAN.  Also, we are moving a lot of documentation into Cisco Communities.
  • Lab:  Well, this is harder.  We’re working on getting SD-Access into the hands of learning partners, so you’ll have a place to get your hands on it.  We’re also working on virtualizing SD-Access as much as possible, to make it easier for people to run it in labs.  I don’t have a timeframe on the latter, but hopefully we can do the former soon.

These are huge but exciting changes. I’ve been very lucky to have landed at a job where I am at the forefront of the changes in the industry, but this new exam will give others the opportunity to move themselves in that direction as well.  Happy labbing!