TAC

All posts tagged TAC

In 2007, I left Cisco after two brutal years in high-touch TAC.  I honestly hated the job, but it was an amazing learning experience.  I draw on my TAC experience every single day.  A buddy of mine got a job at a Gold Partner, offered to bring me in, and I jumped on the opportunity.  Things didn’t go so well, and in 2009, I was laid off and looking for a job again.  That’s when another buddy (buddies help!) called me and told me of an opportunity at Juniper.

I knew little about Juniper.  We had a Juniper SSL box in the network I used to manage, but the routers were mostly for service provider networks.  When I was at TAC, I had one case where a major outage was caused by misconfiguration of a Juniper BGP peer.  But otherwise, I didn’t know a thing.

The opportunity was to be the “network architect” for Juniper’s corporate network.  In other words, to work in internal IT at a network vendor.  It seemed like a good career move, but little did I know I would be thrust into corporate politics at the director level instead of technical challenges.  I ended up spending six tumultuous years there, with several highlights:

  • My boss disappeared on medical leave on my very first day.
  • I was re-assigned to a Sr. Director who was an applications person and not knowledgeable in networking.  He viewed the network a bit like Col. Kendrick, the Marine, viewed the Navy in the movie A Few Good Men:  “Every time we gotta go some place to fight, you fellas always give us a ride.”
  • I proposed and got buy-off for a program to ensure we actually ran our own gear internally and to ensure we built solid network architectures.
  • I subsequently had the program taken away from me.
  • I found out a job posting with the identical title and JD to mine was listed on Juniper’s public site without my knowledge.
  • My manager was changed to a person two pay grades below me in another country without even informing me.  (Someone noticed it in the directory and told me.)
  • I quit in disgust, without any other job.
  • I was talked into staying.
  • After another year of misery, I was demoted two pay grades myself.
  • I focused on doing the best job I could, ended up getting re-promoted to director, and left on good terms.

Some of the above was my own fault, much of it was dysfunctional management, and some of it was the stupidity we all know lurks in every good-sized company.  I actually bear Juniper no resentment at all.

I worked at Juniper in the pre-Mist days, and in the midst of the fiscal crisis that began in 2008.  We went from CEO Kevin Johnson’s rah-rah “Mission10” pep rallies telling us we would be the “next $10B company” (uh, no), to draconian OpEx cuts when a pump-and-dump “activist investor” took over our board.

At the time I was there, Juniper made some mistakes.  NetScreen firewalls had done well for us, but then we made the decision to kill the NetScreen in favor of the JunOS-based SRX.  This is the classic mistake of product management–replacing a successful, popular product with a made-from-scratch product that has no feature parity.  There were some good arguments for doing the SRX, but it was done so abruptly that it signaled EOL to NetScreen customers, and the SRX didn’t even have a WebUI.

We also did QFabric while I was there.  We installed one of these beasts in a data center on campus.  I have no idea if they improved it, but the initial versions took a full day to upgrade.  Imagine taking a day-long outage on your data center just to do an upgrade!

Another product that didn’t work out was Space.  JunOS Space came out when the iPhone was still new, and Juniper borrowed the app-store idea.  Instead of building an NMS product, we’d build a platform, and then software developers could build apps on top of it.  Cisco might be able to get away with that approach, but Juniper didn’t have enough of the networking market to attract developers.

In addition, a bunch of other acquisitions fizzled out, including Trapeze, our WAN accelerator, and our load balancer.

All that said, Juniper had some fine products when I worked there.  (And believe me, my current employer has had many failures too.)  I got my JNCIE-SP, working on MX routers, which were a really good platform.  I thought the EX switches were decent.  And the operating system was nicely done.  Funnily enough, I worked a solid year on the JNCIE and promptly went to Cisco.  I never renewed it and now it’s expired.

I left after meeting with a strategy VP and explaining our mission to use Juniper’s corporate network to demonstrate how to build an enterprise network to our customers.  She looked at me (and the CIO) and said, “Juniper is done with enterprise networking.  I’m not interested.”  I left after that.  In her defense, Mist was years off and she couldn’t have seen it coming.

She was right, in that Juniper certainly had a core SP market.  Juniper came about at the time when Cisco was still selling 7500’s and 12000’s to its service provider customers, dated platforms running a dated OS.  Juniper did such a nice job with their platform that Cisco had to turn around and build the CRS-1 and IOS-XR, both of which had, ehm, similarities to Juniper’s products.  Juniper really couldn’t crack the enterprise market while I was there.  The lack of a credible wireless solution was always a problem.  Obviously Mist changed the game for them.

Juniper always felt like a scrappy anti-Cisco when I was there, but it was fast becoming corporatized and taken over by the MBAs.  Many old-schoolers would tell me how different things were in the startup days.  Still, it never lost the anti-Cisco attitude.  One of our engineers ALWAYS referred to Cisco devices as “Crisco boxes”, and when I announced I was returning to Cisco, a long-time IT guy called me an “asshole”.  A couple of funny stories around this:

A customer came in to our office for training and looked in the window of one of the data centers nearby.  He saw it was packed with Cisco gear and subsequently published a video on social media captioned “Juniper uses Cisco.”  He didn’t realize that we leased the building from another company called Ariba, and the data center was theirs, not ours.  In fact, we worked very hard to not run Cisco in our internal network.  Juniper subsequently asked Ariba to block out the window.

One time we solicited a proposal from one of our largest service provider customers to host a data center for us.  The SP came back to us with an architecture which was 100% Cisco.  Cisco switches, Cisco routers, Cisco firewalls.  I told the SP I would never deploy our DC on Cisco gear.  What if a major bug hit Cisco devices causing outages and our data center went down too?  What if we got hacked due to a Cisco PSIRT and it became public?

The SP didn’t care.  We were their customer, but they were also ours.  They used Cisco in their data center, and had no desire to re-tool for another vendor.  I escalated all the way to the CEO, who agreed with me, and the deal was scuttled.  Ironically, I used this story in my Cisco interviews when asked for an example of a time when I had taken a strong stand on something.

I work at Cisco now, and even ran the competitive team for a while.  Competition is healthy and makes us all better.  I actually value our competition.  Obviously my job is to win deals against them, but I have friends who work at Juniper and I have friends who work at HPE.  We’re all engineers doing our jobs, and I wish them no ill will.  I always respected Juniper, their engineering, and their scrappy attitude.  While I know some of this will be retained as they get absorbed into a large corporation, it’s definitely the end of an era, for the industry and for me.


I haven’t written anything for a while, because of the simple fact that I had nothing to say.  The problem with being a writer is that sometimes you have nothing to write.  I also have a day job, and sometimes it can keep me quite busy.  Finally, an afternoon drive provided some inspiration.

There’s a funny thing about the buildings I work in–they all tend to be purchased by Google.  When I started at Juniper, I worked in their Ariba campus in Mountain View, several buildings they rented from that software company.  We were moved to Juniper’s (old) main campus, on Mathilda Drive, and the old Ariba buildings were bought and re-purposed by Google.  Then the Mathilda campus was bought by Google.

When I worked in TAC, from 2005-2007, I worked in building K on Tasman Drive in San Jose.  Back then, the meteoric growth of Cisco was measured by the size of its campus, which stretched all along Tasman, down Cisco Way, and even extended into Milpitas.

Cisco’s campus has been going the opposite direction for a while now.  The letter buildings (on Tasman, West of Zanker Street) started closing before the COVID lockdowns changed everything.  Now a lot of buildings sit empty and will certainly be sold, including quite possibly the ones in Milpitas, where I work.

Building K closed, if not sometime during the lockdowns, then shortly after.  I hadn’t driven by it in months, and when I did yesterday, lo and behold, it was now a Google building!

What used to be building K

It’s funny how our memories can be so strongly evoked by places.  Building K was, for a long time, the home of Cisco TAC.  I vividly remember parking on Champion Drive, reviewing all of my technical notes before going in to be panel-interviewed by four tough TAC engineers.  I remember getting badged in the day I started, after passing the interview, and being told by my mentor that he wouldn’t be able to put me “on the queue” for three months, because I had so much to learn.

Two weeks later I was taking cases.  Not because I was a quick study, but because they needed a body.

I worked in High Touch Technical Support, dealing with Cisco’s largest customers.  The first team I was on was called ESO.  Nobody knew what it stood for.  The team specialized in taking all route/switch cases for Cisco’s large financial customers like Goldman Sachs and JPMC.  Most of the cases involved the Cat 6k, although we supported a handful of other enterprise platforms.

When a priority 1 case came in, the Advanced Services Hotline (ASH) call center agents would call a special number that would cause all of the phones on the ESO team to play a special ring tone.  I grew to develop a visceral hatred of that ring tone.  Hearing it today would probably trigger PTSD.  I’d wait and wait for another TAC engineer (we were called CSEs) to answer it.  If nobody did, I’d swallow hard and grab the phone.

The first time I did it was for a massive multicast meltdown disrupting operations on the NYSE trading floor.  I had just gotten my CCIE, but I had only worked previously as a network engineer in a small environment.  Now I was dealing with a major outage, and it was the first time I had to handle real-world multicast.  Luckily, my mentor showed up after 20 minutes or so and helped me work the case.

My first boss in HTTS told me on the day I started, “at Cisco, if you don’t like your boss or your cubicle, wait three months.”  Three months later I had a new boss and a new cubicle.  The ESO team was broken up, and its engineers dispersed to other teams.  I was given a choice:  LAN Switch or Routing Protocols.  I chose the latter.

I joined the RP-LSA team as a still-new TAC engineer.  The LSA stood for “Large Scale Architectures.”  The team was focused on service provider routing and platform issues.  Routing protocol cases were actually a minority of our workload.  We spent a lot of time dealing with platform issues on the GSR 12000-series router, the broadband aggregation 10000-series, and the 7500.  Many of the cases were crashes, others were ASIC issues.  I’d never even heard of the 12k and 10k, but now I was expected to take cases and speak with authority.  I leaned on my team a lot in the early days.

Fortunately for me, these were service provider guys, and they knew little about enterprise networking or LAN switching.  With the breakup of the ESO team, the large financials were now coming into the RP-LSA queue.  And as anyone who has worked in TAC can tell you, a routing protocols case is often not an RP case at all.  When the customer opens a case for flapping OSPF adjacencies, it’s usually just a symptom of a layer 2 issue.  The SP guys had no clue how to deal with these, but I did, so we ended up mutually educating each other.

In those days, most of the protocol cases were on Layer 3 MPLS VPNs.  I had never even heard of MPLS before I started there, but I did a one-week online course (with lab) and started taking cases like a champ.  MPLS cases were often easy because the technology was new, but usually when a large service provider like AT&T, Orange, or Verizon opens a case on something like BGP, it’s not because they misconfigured a route map.  They’ve looked at everything before opening the case, and so the CSE essentially becomes a middleman, coordinating between the customer and the developers.  In many cases the CSE is like a paramedic, stabilizing the patient before the doctor takes over to figure out what is wrong.  We often knew we were facing a bug, but our job was to find workarounds to bring the network back up so developers could find a fix.

I had my share of angry customers in those days, some even lividly angry.  But most customers were nice.  These were professional network engineers who understood that the machines we build don’t always act as we expect them to.  Nevertheless, TAC is a high-stress job.  It’s also relentless.  Close one case, and two more are waiting in the queue.  There is no break, no downtime.  The best thing about it was that when you went home, you were done.  If a call came in on an open case in your backlog, it would be routed to another engineer.  (Though sometimes they routed it back to you in the morning.)  In HTTS, we had the distinct disadvantage of having to work cases to resolution.  If the case came in at 5:55pm on Friday night, and your shift ended at 6pm, you might spend the next five hours in the office.  Backbone TAC engineers “followed the sun” and re-assigned cases as soon as their shift ended.

I make no secret of the fact that I hated the job.  My dream was to work at Cisco, but shortly after I started, I wanted out.  And yet the two years I spent in TAC are two of the most memorable of my career.  TAC was a crucible, a brutal environment dealing with nasty technical problems.  The fluff produced by marketeers has no place there.  There was no “actualize your business intent by optimizing and observing your network”-type nonsense.  Our emails were indecipherable jumbles of acronyms and code names.  “The CEF adjacency table is not being programmed because the SNOOPY ASIC has a fault.”  (OK, I made that up…  but you get the point.)  This was not a place for the weak-minded.

When things got too sticky for me, I could call in escalation engineers.  I remember one case where four backbone TAC escalation engineers and one from HTTS took over my cube, peering at my screen and trying to figure out what was going on during a customer meltdown.

Building K was constructed in the brutalist style of architecture so common in Silicon Valley.  One look at the concrete and glass and the nondescript offices and conference rooms is enough to drain one’s soul.  These buildings are pure function over form.  They are cheap to put up and operate, but emotionally crushing to work in.  There is no warmth, even on a winter day with the heat on.

Still, when I look at building K, or what’s become of it, I think of all the people I knew there.  I think of the battles fought and won, the cases taken and closed, the confrontational customers and the worthless responses from engineering, leaving us unable to close a case.  I think of the days I would approach that building in dread, not knowing what hell I would go through that day.  I also think of the incredible rush of closing a complex case, of finding a workaround, and of getting an all-5’s “bingo” (score) from a customer.  TAC is still here, but for those of us who worked in building K, its closure represents the end of an era.

It’s impossible to count how many people at my college wanted to be “writers”.  So many early-twenty-somethings here in the US think they are going to spend their lives as screenwriters or novelists.  My colleagues from India tell me most people there want to be doctors or engineers, which tells you something about the decline of the United States.

Back in the mid-2000’s, a popular buddy-comedy came out about a novelist and an actor and their adventures in the “California wine country”.  The writer behind the film is an LA novelist.  The only people he knew, and the only characters he could create, were writers and actors.  I read that his first novel was about a screenwriter.  The movie was popular, but I found the characters utterly boring.  Who cares about a novelist and his romantic adventures?  Herman Melville spent years at sea, giving him the material to write Moby Dick.  Fyodor Dostoevsky wanted to be a writer from an early age, but he spent years in a prison camp followed by years of forced military service, which gave him a view into nihilism and its effect on the human soul.  The point is, these great writers earned the right to talk about something;  they didn’t just go to college and come out geniuses with brilliant things to say.

I’ve been hearing a lot about “product management” lately.  I work in product management, in fact, and I’ve worked with product managers for many years.  However, I didn’t realize until recently that product management is the hot new field.  Everyone wants to major in PM in business school.  As one VP I know told me, “people want to be PMs because that’s where CEOs come from.”  Well, like 19-year-olds feeling entitled to be great novelists, b-school students are apparently expecting to become CEOs.  Missing from this sense of entitlement is the recognition that achievement has to be earned, and that it has to be earned by developing specific expertise.  A college student who wants to be a novelist thinks he or she simply deserves to be a novelist by virtue of his or her brilliance;  a b-school PM student apparently thinks the same way about being a CEO.

Back when I worked in TAC, one of my mentors was a TAC engineer who had previously been a product manager for GSR (12000-series) line cards.  He went back to TAC because he wanted to get into the new CRS-1 router and felt it was the best place to learn the new product quickly.  It made sense at the time, but it is inconceivable now that a PM would go to TAC.  The product manager career path is directed towards managing business, not technology, and it would be a step down for product managers to become technical again.

If you don’t work for a tech company, you may not know a lot about product management, but PMs are very important to the development of the products you use.  They decide which products are brought to market and what features those products will have, and they prioritize product roadmaps.  They are held accountable for the revenue (or lack thereof) of a product.

Imagine, now, that somebody with that responsibility for, say, a router has no direct experience as a network engineer, but instead has an MBA from Kellogg or Haas or Wharton.  They’ve studied product management as a discipline, but know nothing about the technology that they own.  Suppose this person has no particular interest in or passion for their field–they just want to succeed in business and be a CEO some day.  What do you think the roadmap will look like?  Do you think the product will take into account the needs of the customer?  When various technologists come to such a PM, will he be able to rationally sort through their competing proposals and select the correct technology?

To be clear, I am not criticizing any individual or my current employer here.  This problem extends industry-wide and explains why so many badly conceived products exist.  The problem of corporatism, which I’ve written about often, extends beyond product management too.  How often are decisions in IT departments made by business people who have little to no experience in the field they are responsible for?  I got into network engineering because I was fascinated by it and loved it.  I’m not the best engineer out there–I’ve worked with some brilliant people–but I do care about the industry and the products we make.  And most importantly, I care about network engineers because I’ve been one.

Corporatists believe generic management principles can be learned which apply to any business, and that they don’t really need domain-specific expertise.  They know business, so why would they?  True, there are some business-specific tasks, like finance, where generic business knowledge is really all that’s needed.  But the notion that generic business knowledge qualifies one to speak authoritatively on technical topics doesn’t hold up.  This is how tech CEOs end up running coffee companies–it’s just business, right?

I don’t mean to denigrate product management as a discipline.  PMs have an important role to play, and product management is the art of making decisions between different alternatives with constrained resources.  I am saying this:  if you want to become a product manager, spend the time to learn not just the business, but the actual thing you are product managing.  You’d be better off spending a couple of years in TAC out of business school than going straight into PM.  Not that many CEO-aspiring PMs would ever do that these days.

Now off to write my first novel.

I shall avoid naming names, but when I worked for Juniper we had a certain CEO who pumped us up as the next $10 billion company.  It never happened, and he left and became the CEO of Starbucks.  Starbucks has nothing to do with computer networking at all.  Why was he hired by Starbucks?  How did his (supposed) knowledge of technology translate into coffee?

Apparently it didn’t.  Howard Schultz, Starbucks’ former CEO, is back at the helm.  “I wasn’t here the last four years, but I’m here now,” he said, according to an article in the Wall Street Journal (paywall).  “I am not in business, as a shareholder of Starbucks, to make every single decision based on the stock price for the quarter…Those days, ladies and gentlemen, are over.”  Which, of course, implies that that was exactly what the previous CEO was doing.

What happened under the old CEO?  “Workers noticed an increasing focus on speed metrics, including the average time to prepare an order, by store.”  Ah, metrics, my old enemy.  There’s a reason one of my favorite books is called The Tyranny of Metrics and why I wrote a TAC Tales piece just about the use of metrics in TAC.  More on that in a bit.

As I look at what I refer to as “corporatism” and its effect on our industry, it often becomes apparent that the damage of this ethos extends beyond tech.  The central tenet of corporatism, as I define it, is that organizations are best run by people who have no particular expertise other than management itself.  That is, these individuals are trained and experienced in generic management principles, and this is what qualifies them to run businesses.  The generic management skills are transferable, meaning that if you become an expert in managing a company that makes paper clips, you can successfully use your management skills to run a company that makes, say, medical-device software.  Or pharmaceuticals.  Or airplanes.  Or whatever.  You are, after all, a manager, maybe even a leader, and you just know what to do without any deep expertise or hard-acquired industry-specific knowledge.

Those of us who spend years, even decades acquiring deep technical knowledge of our fields are, according to this ethos, the least qualified to manage and lead.  That’s because we are stuck in our old ways of doing things, and therefore we don’t innovate, and we probably make things complex, using funny acronyms like EIGRP, OSPF, BGP, STP, MPLS, L2VNI, etc., to confuse the real leaders.

Corporatists simply love metrics.  They may not understand, say, L2VNIs, but they look at graphs all day long.  Everything has to be measured in their world, because once it’s measured it can be graphed, and once it’s graphed it’s simply a matter of making the line go the right direction.  Anyone can do that!

Sadly, as Starbucks seems to be discovering, life is messier than a few graphs.  Management by metric usually leads to unintended consequences, and frequently those who operate in such systems resort to metric-gaming.  As I mentioned in the TAC Tale, measuring TAC agents on create-to-close numbers led to many engineers avoiding complex cases and sticking with RMAs to get their numbers looking good.  Tony Hsieh at Zappos, whatever problems he may have had, was totally right when he had his customer service reps stay on the phone as long as needed with customers, hours if necessary, to resolve an issue with a $20 pair of shoes.  That would never fly with the corporatists.  But he understood that customer satisfaction would make or break his business, and it’s often hard to put a number on that.

Corporatism of various sorts has been present in every company I’ve worked for.  The best, and most successful, leadership teams I’ve worked for have avoided it by employing leaders that grew up within the industry.  This doesn’t make them immune from mistakes, of course, but it allows them to understand their customers, something corporatists have a hard time with.

Unfortunately, we work in an industry (like many) in which the stock value of companies is determined by an army of non-technical “analysts” who couldn’t configure a static route, let alone explain what one is.  And yet somehow, their opinions on (e.g.) the router business move the industry.  They of course adhere to the ethos of corporatism.  And I’m sure they get paid better than I do.

Starbucks seems to be correcting a mistake by hiring back someone who actually knows their business.  Would that all corporations learned from Starbucks’ mistake and ensured their leaders know at least something about what they are leading.

The case came into the routing protocols queue, even though it was simply a line card crash.  The RP queue in HTTS was the dumping ground for anything that did not fit into one of the few other specialized queues we had.  A large US service provider had a Packet over SONET (PoS) line card on a GSR 12000-series router crashing over and over again.

Problem Details: 8 Port ISE Packet Over SONET card continually crashing due to

SLOT 2:Aug  3 03:58:31: %EE48-3-ALPHAERR: TX ALPHA: error: cpu int 1 mask 277FFFFF
SLOT 2:Aug  3 03:58:31: %EE48-4-GULF_TX_SRAM_ERROR: ASIC GULF: TX bad packet header detected. Details=0x4000

A previous engineer had the case, and he did what a lot of TAC engineers do when faced with an inexplicable problem:  he RMA’d the line card.  As I have said before, RMA is the default option for many TAC engineers, and it’s not a bad one.  Hardware errors are frequent and replacing hardware often is a quick route to solving the problem.  Unfortunately the RMA did not fix the problem, the case got requeued to another engineer, and he…RMA’d the line card.  Again.  When that didn’t work, he had them try the card in a different slot, but it continued to generate errors and crash.

The case bounced through two other engineers before getting to me.  Too bad the RMA option was out.  But the simple line card crash and error got even weirder.  The customer had two GSR routers in two different cities that were crashing with the same error.  Even stranger:  the crash was happening at precisely the same time in both cities, down to the second.  It couldn’t be a coincidence, because each crash on the first router was mirrored by a crash at exactly the same time on the second.

The conversation with my fellow engineers ranged from plausible to ludicrous.  There was a legend in TAC, true or not, that solar flares cause parity errors in memory and hence crashes.  Could a solar flare be triggering the same error on both line cards at the same time?  Some of my colleagues thought it was likely, but I thought it was silly.

Meanwhile, internal emails were going back and forth with the business unit to figure out what the errors meant.  Even for experienced network engineers, Cisco internal emails can read like a foreign language.  “The ALPHA errors are side-effects the GULF errors,” one development engineer commented, not so helpfully.  “Engine is feeding invalid packets to GULF and that causes the bad header error being detected on GULF,” another replied, only slightly more helpfully.

The customer, meanwhile, had identified a faulty fabric card on a Juniper router in their core.  Apparently the router was sending malformed packets to multiple provider edge (PE) routers all at once, which explained the simultaneous crashing.  Because all the PEs were in the US, forwarding was a matter of milliseconds, and thus there was very little variation in the timing.  How did the packets manage to traverse the several hops of the provider network without crashing any GSRs in between?  Well, the customer was using MPLS, and the corruption was in the IP header of the packets.  The intermediate hops forwarded the packets, without ever looking at the IP header, to the edge of the network, where the MPLS labels get stripped, and IP forwarding kicks in.  It was at that point that the line card crashed due to the faulty IP headers.  That said, when a line card receives a bad packet, it should drop it, not crash.  We had a bug.

The development engineers could not determine why the line card was crashing based on log info.  By this time, the customer had already replaced the faulty Juniper module and the network was stable.  The DEs wanted us to re-introduce the faulty line card into the core, and load up an engineering special debug image on the GSRs to capture the faulty packet.  This is often where we have a gulf, pun intended, between engineering and TAC.  No major service provider or customer wants to let Cisco engineering experiment on their network.  The customer decided to let it go.  If it came back, at least we could try to blame the issue on sunspots.

In the last article on technical interviewing, I told the story of how I got my first networking job.  The interview was chaotic and disorganized, and it resulted in me getting the job and being quite successful.  In this post, I’d like to start with a very basic question:  Why is it that we interview job candidates in the first place?

This may seem like an obvious question, but if you think about it, face-to-face interviewing is not necessarily the best way to assess a candidate for a networking position.  To evaluate their technical credentials, why don’t we administer a test?  Or force network engineering candidates to configure a small network?  (Some places do!)  What exactly is it that we hope to achieve by sitting down for an hour and talking to this person face-to-face?

Interviewing is fundamentally a subjective process.  Even when an interviewer attempts to bring objectivity to the interview by, say, asking right/wrong questions, interviews are just not structured as objective tests.  The interviewer feedback is usually derived from gut reactions and feelings as much as it is from any objective criteria.  The interviewer has a narrow window into the candidate’s personality and achievements, and frequently an interviewer will make an incorrect assessment in either direction:

  • By turning down a candidate who is qualified for the job.  When I worked at TAC, I remember declining a candidate who didn’t answer some questions about OSPF correctly.  Because he was a friend of a TAC engineer, he got a second chance and did better in his second interview.  He got hired and was quite successful.
  • By hiring a candidate who is unqualified for the job.  This happens all the time.  We pass people through interviews who end up being terrible at the job.  Sometimes we just assess their personality wrong and they end up being complete jerks.  Sometimes, they knew enough technical material to skate through the interview.

Having interviewed hundreds of people in my career, I think I’m a very good judge of people.  I was on the interview team for TAC, and everyone we hired was a successful engineer.  Every TME I’ve hired as a manager has been top notch.  That said, it’s tricky to assess someone in such a short amount of time. As the interviewee, you need to remember that you only have an hour or so to convince this person you are any good, and one misplaced comment could torpedo you unfairly.

I remember when I interviewed for the TME job here at Cisco.  I did really well, and had one final interview with the SVP at the time.  He was very personable, and I felt at ease with him.  He asked me for my proudest accomplishment in my career.  I mentioned how I had hated TAC when I started, but I managed to persevere and left TAC well respected and successful.  He looked at me quizzically.  I realized it was a stupid answer.  I was interviewing for a director-level position.  He wanted to hear about initiative and drive, not that I stuck it out at a crappy job.  I should have told him about how I started the Juniper on Juniper project, for example.  Luckily I got through, but that one answer left an impression that took me down a bit.

When you are interviewing, you really need to think about the impression you create.  You need empathy.  You need to feel how your interviewer feels, or at least be self-aware enough to know the impression you are creating.  That’s because this is a subjective process.

I remember a couple of years back I was interviewing a candidate for an open position.  I asked him why he was interested in the job.  The candidate proceeded to give me a depressing account of how bad things were in his current job.  “It’s miserable here,” he said.  “Nobody’s going anywhere in this job.  I don’t like the team;  they’re not motivated.”  And so forth.  He claimed he had programming capabilities, so I asked him what his favorite programming language was.  “I hate them all,” he said.  I actually think he was technically fairly competent, but in my opinion working with this guy would have been such a downer that I didn’t hire him.

In my next article I’ll take a look at different things hiring managers and interviewers are looking for in a candidate, and how they assess them in an interview.

 

When you open a TAC case, how exactly does the customer support engineer (CSE) figure out how to solve the case?  After all, CSEs are not super-human.  Just like any engineer, in TAC you have a range of brilliant to not-so-brilliant, and everything in between.  Let me give an example:  I worked at HTTS, or high-touch TAC, serving customers who paid a premium for higher levels of support.  When a top engineer at AT&T or Verizon opened a case, how was it that I, who had never worked professionally in a service provider environment, was able to help them at all?  Usually when those guys opened a case, it was something quite complex and not a misconfigured route map!

TAC CSEs have an arsenal of tools at their disposal that customers, and even partners, do not.  One of the most powerful is well known to anyone who has ever worked in TAC:  Topic.  Topic is an internal search engine.  It can do more now, but at the time I was in TAC, Topic could search bugs, TAC cases, and internal mailers.  If you had a weird error message or were seeing inexplicable behavior, popping the message or symptoms into Topic frequently resulted in a bug.  Failing that, it might pull up another TAC case, which would show the best troubleshooting steps to take.

Topic also searches internal mailers, the email lists used internally by Cisco employees.  TAC agents, sales people, TMEs, product managers, and engineering all exchange emails on these mailers, which are then archived.  Oftentimes a problem would show up in the mailer archives and engineering had already provided an answer.  Sometimes, if Topic failed, we would post the symptoms to the mailers in hopes that engineering, a TME, or some other expert would have a suggestion.  I was always careful in doing so:  if you posted something that had already been answered, or asked too often, flames would be coming your way.

TAC engineers have the ability to file bugs across the Cisco product portfolio.  This is, of course, a powerful way to get engineering attention.  Customer found defects are taken very seriously, and any bug that is opened will get a development engineer (DE) assigned to it quickly.  We were judged on the quality of bugs we filed since TAC does not like to abuse the privilege and waste engineering time.  If a bug is filed for something that is not really a bug, it gets marked “J” for Junk, and you don’t want to have too many junked bugs.  That said, on one or two occasions, when I needed engineering help and the mailers weren’t working, I knowingly filed a Junk bug to get some help from engineering.  Fortunately, I filed a few real bugs that got fixed.

My team was the “routing protocols” team for HTTS, but we were a dumping ground for all sorts of cases.  RP often got crash cases, cable modem problems, and other issues, even though these weren’t strictly RP.  Even within the technical limits of RP, there is a lot of variety among cases.  Someone who knows EIGRP cold may not have a clue about MPLS.  A lot of times, when stuck on a case, we’d go find the “guy who knows that” and ask for help.  We had a number of cases on Asynchronous Transfer Mode (ATM), an old WAN (more or less) protocol, when I worked at TAC.  We had one guy who knew ATM, and his job was basically just to help with ATM cases.  He had a desk at the office but almost never came in, never worked a shift, and frankly I don’t know what he did all day.  But when an ATM case came in, day or night, he was on it, and I was glad we had him, since I knew little about the subject.

Some companies have NOCs with tier 1, 2, and 3 engineers, but we just had CSEs.  While we had different pay grades, TAC engineers were not tiered in HTTS.  “Take the case and get help” was the motto.  Backbone (non-HTTS) TAC had an escalation team, with some high-end CSEs who jumped in on the toughest cases.  HTTS did not, and while backbone TAC didn’t always like us pulling on their resources, at the end of the day we were all about killing cases, and a few times I had backbone escalation engineers up in my cube helping me.

The more heated a case gets, the higher the impact, the longer the time to resolve, the more attention it gets.  TAC duty managers can pull in more CSEs, escalation, engineering, and others to help get a case resolved.  Occasionally, a P1 would come in at 6pm on a Friday and you’d feel really lonely.  But Cisco being Cisco, if they need to put resources on an issue, there are a lot of talented and smart people available.

There’s nothing worse than the sinking feeling a CSE gets when realizing he or she has no clue what to do on a case.  When the Topic searches fail, when escalation engineers are stumped, when the customer is frustrated, you feel helpless.  But eventually, the problem is solved, the case is closed, and you move on to the next one.

I’ve mentioned before that EIGRP SIA was my nightmare case at TAC, but there was one other type of case that I hated–QoS problems.  Routing protocol problems tend to be binary.  Either the route is there or it isn’t;  either the pings go through or they don’t.  Even when a route is flapping, that’s just an extreme version of the binary problem.  QoS is different.  QoS cases often involved traffic that passed sometimes, or in certain amounts, but ran into problems when different sizes or volumes of traffic went through, or traffic that dropped at a certain rate.  Thus, the routes could be perfectly fine, pings could pass, and yet QoS was behaving incorrectly.

In TAC, we would regularly get cases where the customer claimed traffic was dropping on a QoS policy below the configured rate.  For example, if they configured a policing profile of 1000 Mbps, sometimes the customer would claim the policer was dropping traffic at, say, 800 Mbps.  The standard response for a TAC agent struggling to figure out a QoS policy issue like this was to say that the link was experiencing “microbursting.”  If a link is showing an 800 Mbps traffic rate, this is actually an average rate, meaning the link could be experiencing short bursts above this rate that exceed the policing rate, but are averaged out in the interface counters.  “Microbursting” was a standard response to this problem for two reasons:  first, it was most often the problem;  second, it was an easy way to close the case without an extensive investigation.  The second reason is not as lazy as it may sound, as microbursts are common and are usually the cause of these symptoms.
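To make the averaging effect concrete, here is a minimal sketch of the kind of policer involved (the interface numbering, class and ACL names, and burst values are hypothetical, not the customer’s actual config).  The interface counter averages over its load interval, so it can read 800 Mbps even while the policer, which measures over a burst window of a few milliseconds, is dropping spikes that momentarily exceed 1 Gbps.

! Hypothetical policer illustrating averages vs. bursts
ip access-list extended CALL-CENTER-TRAFFIC
 permit ip 192.0.2.0 0.0.0.255 any
!
class-map match-all CALL-CENTER
 match access-group name CALL-CENTER-TRAFFIC
!
policy-map EDGE-POLICER
 class CALL-CENTER
  ! 1 Gbps CIR with a ~15 ms burst allowance (1,875,000 bytes at 1 Gbps).
  ! A spike above 1 Gbps that outlasts the burst window is dropped, even
  ! though the 30-second average on the interface reads only 800 Mbps.
  police cir 1000000000 bc 1875000 conform-action transmit exceed-action drop
!
interface GigabitEthernet0/1
 load-interval 30
 service-policy input EDGE-POLICER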

Thus, when one of our large service provider customers opened a case stating that their LLQ policy was dropping packets before the configured threshold, I was quick to suspect microbursts.  However, working in high-touch TAC, you learn that your customers aren’t pushovers and don’t always accept the easy answer.  In this case, the customer started pushing back, claiming that the call center which was connected to this circuit generated a constant stream of traffic and that he was not experiencing microbursts.  So much for that.

This being the 2000’s, the customer had four T1’s connected in a single multi-link PPP (MLPPP) bundle.  The LLQ policy was dropping traffic at one quarter of the threshold it was configured for.  Knowing I wouldn’t get much out of a live production network, I reluctantly opened a lab case for the recreate, asking for two routers connected back-to-back with the same line cards, a four-link T1 interconnection, and a traffic generator.  As always, I made sure my lab had exactly the same IOS release as the customer.

Once the lab was set up I started the traffic flowing, and much to my surprise, I saw traffic dropping at one quarter of the configured LLQ policy.  Eureka!  Anyone who has worked in TAC will tell you that more often than not, lab recreates fail to recreate the customer problem.  I removed and re-applied the service policy, and the problem went away.  Uh oh.  The only thing worse than not recreating a problem is recreating it and then losing it again before developers get a chance to look at it.

I spent some time playing with the setup, trying to get the problem back.  Finally, I reloaded the router to start over and, sure enough, I got the traffic loss again.  So, the problem occurred at start-up, but when the policy was removed and re-applied, it corrected itself.  I filed a bug and sent it to engineering.

Because it was so easy to recreate, it didn’t take long to find the answer.  The customer was configuring their QoS policy using bandwidth percentages instead of absolute bandwidth numbers.  This means the policy bandwidth is determined dynamically by the router, based on the links the policy is applied to.  It turned out that IOS was calculating the bandwidth numbers before the MLPPP bundle was fully up, and hence was using only a single T1 as the reference for the calculation instead of all four.  The fix was to change the order of operations in IOS, so that the MLPPP bundle came up before the QoS policy was applied.
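Here is a rough sketch of the kind of configuration involved (the interface numbering, class names, and percentages are made up for illustration, not taken from the customer’s network).  Because the policy uses a percentage rather than an absolute rate, the LLQ bandwidth is computed from whatever the bundle bandwidth happens to be when the policy is attached:  roughly 6 Mbps with all four T1 members up, but only about 1.5 Mbps with one, which matches the one-quarter behavior we saw.

class-map match-all VOICE
 match ip dscp ef
!
policy-map WAN-EDGE
 class VOICE
  ! Resolved against the bundle bandwidth at attach time:
  ! ~6176 kbps with all four T1s up, ~1544 kbps with only one member.
  priority percent 25
 class class-default
  fair-queue
!
interface Multilink1
 ip address 192.0.2.1 255.255.255.252
 ppp multilink
 service-policy output WAN-EDGE
!
interface Serial0/0/0:0
 encapsulation ppp
 ppp multilink
 ppp multilink group 1
! (the other three T1 members are configured the same way)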

So much for microbursts.  The moral(s) of the story?  First, the most obvious cause is often not the cause at all.  Second, determined customers are often right.  And third:  even intimidating QoS cases can have an easy fix.

I’ve been in this industry a while now, and I’ve done a lot of jobs.  Certainly not every job, but a lot.  My first full time network engineering job came in 2000, but I was doing some networking for a few years before that.

I often see younger network engineers posting in public forums asking about the pros and cons of different job roles.  I’ve learned over the years that you have to take people’s advice with a grain of salt.  Jobs in one category may have some common characteristics, but a huge amount is dependent on the company, your manager, and the people you work with.  However, we all have a natural tendency to try to figure out the situation at a potential job in advance, and the experience of others can be quite helpful in sizing up a role.  Therefore, I’ve decided to post a summary of the jobs I’ve done, and the advantages/disadvantages of each.

IT Network Engineer

Summary:
This is an in-house engineer at a corporation, government agency, educational institution, or really any entity that runs a network.  The typical job tasks vary depending on level, but they usually involve a large amount of day-to-day network management.  This can be responding to complaints about network performance, patching in network connectivity (less so these days because of wireless), upgrading and maintaining devices, working with carriers, etc.  Larger scale projects could be turning up new buildings and sites, planning for adding new functionality (e.g. multicast), etc.

Pros:

  • Stable and predictable work environment.  You show up at the same place and know the people, unlike consulting.
  • You know the network.  You’re not showing up a new place trying to figure out what’s going on.
  • It can be a great chance to learn if the company is growing and funds new projects.

Cons:

  • You only get to see one network and one way of doing things.
  • IT is a cost center, so there is a constant desire to cut personnel/expenses.
  • Automation is reducing the type of on-site work that was once a staple for these engineers.
  • Your fellow employees often hate IT and blame you for everything.
  • Occasionally uncomfortable hours due to maintenance windows.

Key Takeaway:
I often tell people that if you want to do an in-house IT job, try to find an interesting place to work.  Being an IT guy at a law firm can be kind of boring.  Being an IT guy at the Pentagon could be quite interesting.  I worked for a major metropolitan newspaper for five years (when there was such a thing) and it was fascinating to see how newspapers work.  Smaller companies can be better in that you often get to touch more technologies, but the work can be less interesting.  Larger companies can pigeonhole you into a particular area.  You might work only on the WAN and never touch the campus or data center side of things, for example.

Technical Support Engineer

Summary:
Work at a vendor like Cisco or Juniper taking cases when things go wrong.  Troubleshoot problems, recreate them in the lab, file bugs, find solutions for customers.  Help resolve outages.  See my TAC Tales for the gory details.

Pros:

  • Great way to get a vast amount of experience by taking lots of tough cases.
  • Huge support organization to help you through trouble.
  • Short-term work for the most part–when you close a case you’re done with it and move on to something new.
  • Usually works on a shift schedule, with predictable hours.  Maintenance windows can often be handed off.

Cons:

  • Nearly every call involves someone who is unhappy.
  • Complex and annoying technical problems.  Your job is 100% troubleshooting and it gets old.
  • Usually a high case volume which means a mountain of work.

Key Takeaway:
Technical Support is a tough and demanding environment, but a great way to get exposure to a constant stream of major technical issues.  Some people actually like tech support and make a career out of it, but most I’ve known can burn out after a while.  I wouldn’t trade my TAC years for anything despite the difficulties, as it was an incredible learning experience for me.

Sales Engineer

Summary:
I’ve only filled this role at a partner, so I cannot speak directly to the experience inside a company like Cisco (although I constantly work with Cisco SE’s).  This is a pre-sales technical role, generally partnered with a less-technical account manager.  SE’s ultimately are responsible for generating sales, but act as a consultant or adviser to the customer to ensure they are selling something that fits.  SE’s do initial architecture of a given solution, work on developing the Bill of Materials (BoM), and in the case of partners, help to write the Statement of Work (SoW) for deployment.  SE’s are often involved in deployment of the solutions they sell but it is not their primary job.

Pros:

  • Architectural work is often very rewarding;  great chance to partner with customer and build networks.
  • Often allows working on a broad range of technologies and customers.
  • Because it involves sales, usually good training on the latest technologies.
  • Unlike pure sales (account managers in Cisco lingo), a large amount of compensation is salary so better financial stability.
  • Often very lucrative.

Cons:

  • Like any account-based job, success/enjoyability is highly dependent on the account(s) you are assigned to.
  • Compensation tied to sales, so while there are good opportunities to make money, there is also a chance to lose a lot of discretionary income.
  • Often take the hit for poor product quality from the company whose products you are selling.
  • Because it is a pre-sales role, often don’t get as much hands-on as post-sales engineers.
  • For some products, building BoM’s can be tedious.
  • Sales pressure.  Your superiors have numbers to make and if you’re not seen to be helping, you could be in trouble.

Key Takeaway:
Pre-sales at a partner or vendor can be a well-paying and enjoyable job.  Working on architecture is rewarding and interesting, and a great chance to stay current on the latest technologies.  However, like any sales/account-based job, the financial and career success of SE’s is highly dependent on the customers they are assigned to and the quality of the sales team they are working with.  Generally SE’s don’t do technical support, but often can get pulled into late-night calls if a solution they sell doesn’t work.  SEs are often the face of the company and can take a lot of hits for things that they sell which don’t work as expected.  Overall I enjoyed being a partner SE for the most part, although the partner I worked for had some problems.

Post-Sales/Advanced Services

Summary:
I’m including both partner post-sales, which I have done, and advanced services at a vendor like Cisco, which are similar.  A post-sales engineer is responsible for deploying a solution once the customer has purchased it, and oftentimes the AS/deployment piece is a part of the sale.  Occasionally these engineers are used for non-project-based work, more so at partners.  In this case, the engineer might be called to be on site to do some regular maintenance, fill in for a vacationing engineer, etc.

Pros:

  • Hands-on network engineering.  This is what we all signed up for, right?  Getting into real networks, setting stuff up, and making things happen.
  • Unlike IT network engineers, this job is more deployment focused so you don’t have to spend as much time on day-to-day administrative tasks.
  • Unlike sales, the designs you work on are lower-level and more detailed, so again, this is a great nuts-and-bolts engineering role.

Cons:

  • As with sales, the quality and enjoyability is highly dependent on the customers you end up with.
  • You can get into some nasty deployment scenarios with very unhappy customers.
  • Often these engagements are short-term, so less of a chance to learn a customer/network.  Often it is get in, do the deployment, and move on to the next one.
  • Can involve a lot of travel.
  • Frequently end up assisting technical support with deployments you have done.
  • Can have odd hours.
  • Often left scrambling when sales messed up the BoM and didn’t order the right gear or parts.

Key Takeaway:
I definitely enjoyed many of my post-sales deployments at the VAR.  Being on-site and doing a live deployment with a customer is great.  I remember one time when I did a total network refresh and VoIP deployment up at St. Helena Unified School District in Napa, CA.  It was a small school district, but over a week in the summer we went building-by-building replacing the switches and routers and setting up the new system.  The customer was totally easygoing, gave us 100% free rein to do it how we wanted, was understanding of complications, and was satisfied with the result.  Plus, I enjoyed spending a week up in Napa, eating well and loving the peace.  However, I also had some nightmare customers who micromanaged me or where things just went south.  It’s definitely a great job to gain experience on a variety of live customer networks.

Technical Marketing Engineer

Summary:
I’m currently a Principal TME and a manager of TMEs.  This is definitely my favorite job in the industry.  I give more details in my post on what a TME does, but generally we work in a business unit of a vendor, on a specific product or product family, both guiding the requirements for the product and doing outbound work to explain the product to others, via white papers, videos, presentations, etc.

Pros:

  • Working on the product side allows a network engineer to actually guide products and see the results.  It’s exciting to see a new feature, CLI, etc., added to a product because you drove it.
  • Get to attend at least several trade shows a year.  Everyone likes getting a free pass to a conference like Cisco Live, but to be a part of making it happen is exhilarating.
  • Great career visibility.  Because the nature of the job requires producing content related to your product, you have an excellent body of work to showcase when you decide to move on.
  • Revenue side.  I didn’t mention this in the sales write-up, but it’s true there too.  Being close to revenue is generally more fun than being in a cost center like IT, because you can usually spend more money.  This means getting new stuff for your lab, etc.
  • Working with products before they are ever released to the public is a lot of fun too.
  • Mostly you don’t work on production networks so not as many maintenance windows and late nights as IT or AS.

Cons:

  • Relentless pace of work.  New software releases are constantly coming;  as soon as one trade show wraps up it’s time to prepare for the next one.  I often say TME work is as relentless as TAC.
  • Can be heavy on the travel.  That could, of course, be a good thing but it gets old.
  • Difficulty of influencing engineering without them reporting to you.  Often it’s a fight to get your ideas implemented when people don’t agree.
  • If you don’t like getting up in front of an audience, or writing documents, this job may not be for you.
  • For new products, often the TMEs are the only resources who are customer facing with a knowledge of the product, so you can end up working IT/AS-type hours anyways.  Less an issue with established/stable products.

Key Takeaway:
As I said, I love this job but it is a frenetic pace.  Most of the posts I manage to squeeze in on the blog are done in five minute intervals over a course of weeks.  But I have to say, I like being a TME more than anything else I’ve done.  Being on the product side is fascinating, especially if you have been on the consumer side.  Going to shows is a lot of fun.  If you like to teach and explain, and mess around with new things in your lab, this is for you.

It’s not a comprehensive list of the jobs you can do as a network engineer, but it covers some of the main ones.  I’m certainly interested in your thoughts on other jobs you’ve done, or if you’ve done one of the above, whether you agree with my assessment.  Please drop a comment–I don’t require registration but do require an email address just to keep spam down.

Everyone who’s worked in TAC can tell you their nightmare case–the type of case that, when they see it in the queue, makes them want to run away, take an unexpected lunch break, and hope some other engineer grabs it.  The nightmare case is the case you know you’ll get stuck on for hours, on a conference bridge, escalating to other engineers, trying to find a solution to an impossible problem.  For some it’s unexplained packet loss.  For others, it’s multicast.  For me, it was EIGRP Stuck-in-Active (SIA).

Some customer support engineers (CSEs) thought SIA cases were easy.  Not me.  A number of times I had a network in total meltdown due to SIA with no clue as to where the problem was.  Often the solution required a significant redesign of the network.

As a review, EIGRP is more-or-less a distance-vector routing protocol, which uses an algorithm called DUAL to achieve better performance than a traditional DV protocol like RIP.  I don’t want to get into all the fun CCIE questions on the protocol details, but what matters for this article is how querying works.  When an EIGRP router loses a route, it marks the route “Active” and then queries its neighbors as to where the route went.  Then, if the neighbors don’t have it, they mark it active and query their neighbors.  If those neighbors don’t have it either, they of course mark it active and query their neighbors.  And so forth.

It should be obvious from this process that in a large network, the queries can multiply quite quickly.  If a router has a lot of neighbors, and its neighbors have a lot of neighbors, the queries multiply exponentially, and can get out of control.  Meanwhile, when a router sets a route active, it sets a timer.  If it doesn’t get a reply before the timer expires, then the router marks the route “Stuck In Active”, and resets the entire EIGRP adjacency.  In a large network with a lot of neighbors, even if the route is present, the time lag between sending a query and getting a response can be so long that the route gets reset before the response makes it to the original querying router.

I’ve ironed out some of the details here, since obviously an EIGRP router can lose a route entirely without going SIA.  For details, see this article.  The main point to remember is that a route goes stuck-in-active when the querying router just doesn’t get a response back.
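Since the cure for SIA is usually design rather than a knob, here is a hedged sketch of the standard way to bound the query domain (the AS number, prefixes, and interface names below are examples only, not from the case that follows):  remote routers are marked as EIGRP stubs so the hubs never query them, and summaries are advertised toward the core so a query for a specific /32 stops at the summary boundary.

! On a remote/spoke router: hubs do not send queries to stubs, so queries stop at the hub.
router eigrp 55555
 network 172.16.0.0
 eigrp stub connected summary
!
! On a hub/distribution router: summarize toward the core.  Routers beyond this point
! only know the summary, so they reply immediately to queries for more-specific routes.
interface Serial0/0
 ip summary-address eigrp 55555 172.16.160.0 255.255.252.0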

Back in my TAC days, I of course wasn’t happy to see an SIA drop in the queue.  I waited to see if one of my colleagues would take the case and alleviate the burden, but the case turned blue after 20 minutes, meaning someone had to take it.  Darn.

Now I can show my age, because the customer had adjacencies resetting on Token Ring interfaces.  I asked the customer for a topology diagram, some debugs, and to check whether there was packet loss across the network.  Sometimes, if packets are getting dropped, the query responses don’t make it back to the original router, causing SIA.  The logs from the resets looked like this:

rtr1 - 172.16.109.118 - TokenRing1/0
Sep 1 16:58:06: %DUAL-3-SIA: Route 172.16.161.58/32 stuck-in-active state in IP-EIGRP(0) 55555. Cleaning up
Sep 1 16:58:06: %DUAL-5-NBRCHANGE: IP-EIGRP(0) 55555: Neighbor 172.16.109.124 (TokenRing1/0) is down: stuck in active
Sep 1 16:58:07: %DUAL-5-NBRCHANGE: IP-EIGRP(0) 55555: Neighbor 172.16.109.124 (TokenRing1/0) is up: new adjacency

This is typical of SIA.  The adjacency flapped, but the logs showed no particular reason why.

I thought back to my first troubleshooting experience as a network engineer.  I had brought up a new branch office but it couldn’t talk back to HQ.  Mike, my friend and mentor, showed up and started pinging hop-by-hop until he found a missing route.  “That’s how I learned it,” he said, “just go one hop at a time.”  The big clue I had in the SIA case was the missing route:  172.16.161.58/32.  I started tracing it back, hop-by-hop.

I found that the route originated from a router on the edge of the customer network, which had an ISDN PRI connected.  (Showing my age again!)  They had a number of smaller offices that would dial into the ISDN on demand and then drop off.  ISDN had per-minute charges, and thus, in this pre-VPN era, it was common to set up ISDN in on-demand mode.  ISDN was a digital dial-up technology with very short call setup times.  I discovered that, as these calls were going up and down, the router was generating /32 peer routes for the neighbors and injecting them into EIGRP.  They had a poorly designed network with a huge query domain, and so as these dial peers went up and down, routers on the opposite side of the network were going active on the route and not getting responses back.

They were advertising a /16 for the entire 172.16.x.x network, so sending a /32 per dial peer was totally unnecessary.  I recommended they enable “no peer neighbor-route” on the PRI to suppress the /32’s and the SIAs went away.
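For reference, a minimal sketch of the fix (the interface numbering is hypothetical):  suppressing the per-peer host route on the dial-in interface keeps the /32s out of EIGRP entirely, and the /16 they were already advertising covers reachability to the dial-in peers.

! D-channel of the ISDN PRI on the edge router
interface Serial0/0:23
 encapsulation ppp
 ! Do not install a /32 host route for each dial-in peer that completes IPCP
 no peer neighbor-route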

I hate to bite the hand that feeds me, but even though I work at Cisco, I can say I never really liked EIGRP.  EIGRP is fast, and if the network is designed well, it works fine.  However, networks often grow organically, and the larger the domain, the more unstable EIGRP becomes.  I’ve never seen this sort of problem with OSPF or IS-IS.  Fortunately, this case ended up being much less problematic than I expected, but often these cases were far nastier.  Oftentimes it was nearly impossible to find the route causing the problem and why it was going crazy.  Anyhow, it’s always good to relive a case with both Token Ring and ISDN, for a double dose of nostalgia.