HTTS

I haven’t written anything for a while, because of the simple fact that I had nothing to say.  The problem with being a writer is that sometimes you have nothing to write.  I also have a day job, and sometimes it can keep me quite busy.  Finally, an afternoon drive provided some inspiration.

There’s a funny thing about the buildings I work in–they all tend to be purchased by Google.  When I started at Juniper, I worked in their Ariba campus in Mountain View, several buildings they rented from that software company.  We were moved to Juniper’s (old) main campus, on Mathilda Drive, and the old Ariba buildings were bought and re-purposed by Google.  Then the Mathilda campus was bought by Google.

When I worked in TAC, from 2005 to 2007, I worked in building K on Tasman Drive in San Jose.  Back then, the meteoric growth of Cisco was measured by the size of its campus, which stretched all along Tasman, down Cisco Way, and even extended into Milpitas.

Cisco’s campus has been going the opposite direction for a while now.  The letter buildings (on Tasman, West of Zanker Street) started closing before the COVID lockdowns changed everything.  Now a lot of buildings sit empty and will certainly be sold, including quite possibly the ones in Milpitas, where I work.

Building K closed, if not sometime during the lockdowns, then shortly after.  I hadn’t driven by it in months, and when I did yesterday, lo and behold, it was now a Google building!

What used to be building K

It’s funny how our memories can be so strongly evoked by places.  Building K was, for a long time, the home of Cisco TAC.  I vividly remember parking on Champion Drive, reviewing all of my technical notes before going in to be panel-interviewed by four tough TAC engineers.  I remember getting badged in the day I started, after passing the interview, and being told by my mentor that he wouldn’t be able to put me “on the queue” for three months, because I had so much to learn.

Two weeks later I was taking cases.  Not because I was a quick study, but because they needed a body.

I worked in High Touch Technical Support, dealing with Cisco’s largest customers.  The first team I was on was called ESO.  Nobody knew what it stood for.  The team specialized in taking all route/switch cases for Cisco’s large financial customers like Goldman Sachs and JPMC.  Most of the cases involved the Cat 6k, although we supported a handful of other enterprise platforms.

When a priority 1 case came in, the Advanced Services Hotline (ASH) call center agents would call a special number that would cause all of the phones on the ESO team to play a special ring tone.  I grew to develop a visceral hatred of that ring tone.  Hearing it today would probably trigger PTSD.  I’d wait and wait for one of the other TAC engineers (we were called CSEs) to answer it.  If nobody did, I’d swallow hard and grab the phone.

The first time I grabbed it, the case was a massive multicast meltdown disrupting operations on the NYSE trading floor.  I had just gotten my CCIE, but I had only worked previously as a network engineer in a small environment.  Now I was dealing with a major outage, and it was the first time I had to handle real-world multicast.  Luckily, my mentor showed up after 20 minutes or so and helped me work the case.

My first boss in HTTS told me on the day I started, “At Cisco, if you don’t like your boss or your cubicle, wait three months.”  Three months later I had a new boss and a new cubicle.  The ESO team was broken up, and its engineers dispersed to other teams.  I was given a choice:  LAN Switch or Routing Protocols.  I chose the latter.

I joined the RP-LSA team as a still-new TAC engineer.  The LSA stood for “Large Scale Architectures.”  The team was focused on service provider routing and platform issues.  Routing protocol cases were actually a minority of our workload.  We spent a lot of time dealing with platform issues on the GSR 12000-series router, the broadband aggregation 10000-series, and the 7500.  Many of the cases were crashes; others were ASIC issues.  I’d never even heard of the 12k or 10k, yet now I was expected to take cases and speak with authority.  I leaned on my team a lot in the early days.

Fortunately for me, these were service provider guys, and they knew little about enterprise networking or LAN switching.  With the breakup of the ESO team, the large financials were now coming into the RP-LSA queue.  As anyone who has worked in TAC can tell you, a routing protocols case is often not an RP case at all.  When a customer opens a case for flapping OSPF adjacencies, it’s usually just a symptom of a layer 2 issue.  The SP guys had no clue how to deal with these, but I did, so we ended up mutually educating each other.

In those days, most of the protocol cases were on Layer 3 MPLS.  I had never even heard of MPLS before I started there, but I did a one-week online course (with lab) and started taking cases like a champ.  MPLS cases were often easy because the technology was new, but usually when a large service provider like AT&T, Orange, or Verizon opens a case on something like BGP, it’s not because they misconfigured a route map.  They’ve looked at everything before opening the case, and so the CSE becomes essentially a middleman, coordinating between the customer and the developers.  In many cases the CSE is like a paramedic, stabilizing the patient before the doctor takes over to figure out what is wrong.  We often knew we were facing a bug, but our job was to find workarounds to bring the network back up so developers could find a fix.

I had my share of angry customers in those days, some even lividly angry.  But most customers were nice.  These were professional network engineers who understood that the machines we build don’t always act as we expect them to.  Nevertheless, TAC is a high-stress job.  It’s also relentless.  Close one case, and two more are waiting in the queue.  There is no break, no downtime.  The best thing about it was that when you went home, you were done.  If a call came in on an open case in your backlog, it would be routed to another engineer.  (Though sometimes they routed it back to you in the morning.)  In HTTS, we had the distinct disadvantage of having to work cases to resolution.  If the case came in at 5:55pm on Friday night, and your shift ended at 6pm, you might spend the next five hours in the office.  Backbone TAC engineers “followed the sun” and re-assigned cases as soon as their shift ended.

I make no secret of the fact that I hated the job.  My dream was to work at Cisco, but shortly after I started, I wanted out.  And yet the two years I spent in TAC are two of the most memorable of my career.  TAC was a crucible, a brutal environment dealing with nasty technical problems.  The fluff produced by marketeers has no place there.  There was no “actualize your business intent by optimizing and observing your network”-type nonsense.  Our emails were indecipherable jumbles of acronyms and code names.  “The CEF adjacency table is not being programmed because the SNOOPY ASIC has a fault.”  (OK, I made that up…  but you get the point.)  This was not a place for the weak-minded.

When things got too sticky for me, I could call in escalation engineers.  I remember one case where four backbone TAC escalation engineers and one from HTTS took over my cube, peering at my screen and trying to figure out what was going on during a customer meltdown.

Building K was constructed in the brutalist style of architecture so common in Silicon Valley.  One look at the concrete and glass and the nondescript offices and conference rooms is enough to drain one’s soul.  These buildings are pure function over form.  They are cheap to put up and operate, but emotionally crushing to work in.  There is no warmth, even on a winter day with the heat on.

Still, when I look at building K, or what’s become of it, I think of all the people I knew there.  I think of the battles fought and won, the cases taken and closed, the confrontational customers and the worthless responses from engineering, leaving us unable to close a case.  I think of the days I would approach that building in dread, not knowing what hell I would go through that day.  I also think of the incredible rush of closing a complex case, of finding a workaround, and of getting an all-5’s “bingo” (score) from a customer.  TAC is still here, but for those of us who worked in building K, its closure represents the end of an era.

I’ve mentioned before that, despite being on the Routing Protocols team, I spent a lot of time handling crash cases in TAC.  At the time, my queue was just a dumping ground for cases that didn’t fit into any other bucket in the High Touch structure.  Backbone TAC had a much more granular division of teams, including a team entirely dedicated to crash.  But in HTTS, we did it all.

Some crashes were minor, like a (back then) 2600-series router reloading due to a bus error.  Others were catastrophic, particularly crashes on large chassis-type routing systems in service provider networks.  These could have hundreds of interfaces, and with sub-interfaces, potentially thousands of customers affected by a single outage.  Chassis platforms vary in their architecture, but many of the platforms we ran at the time used a distributed architecture in which the individual line cards ran a subset of IOS.  Thus, unlike a 2600, which had “dumb” WIC cards for interface connections, on chassis systems the line cards themselves could crash in addition to the route processors.  Oftentimes, when a line card crashed, the effect would cascade through the box, with multiple line cards crashing, resulting in a massive meltdown.

The 7500 was particularly prone to these.  A workhorse of Cisco’s early product line, the 7500 had line cards that ran IOS but forwarded packets between each other by placing them into special queues on the route processor.  This was quite unlike later products, such as the Gigabit Switch Router (GSR), which had a fabric architecture enabling line cards to communicate directly.  On the 7500, a line card having a problem would often write bad data into the shared queues, which subsequent line cards would then read and crash on, causing a cascading failure.
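
To make the cascade concrete, here is a toy sketch in Python, purely my own illustration and nothing like actual IOS internals (the card names, queue structure, and “corrupt” flag are all invented), of why a shared-queue design can take down card after card:

from collections import deque

class LineCardCrash(Exception):
    pass

def sick_card_writes(queue):
    # A misbehaving card keeps dumping corrupt entries into the shared
    # queues that live on the route processor.
    for _ in range(3):
        queue.append({"corrupt": True})

def healthy_card_reads(slot, queue):
    # Every other card reads from that same shared structure, so each one
    # that picks up a poisoned entry goes down in turn.
    pkt = queue.popleft()
    if pkt["corrupt"]:
        raise LineCardCrash("slot %d crashed reading bad queue data" % slot)

shared_queue = deque()
sick_card_writes(shared_queue)          # one bad writer...

for slot in (1, 4, 7):                  # ...many victims, one after another
    try:
        healthy_card_reads(slot, shared_queue)
    except LineCardCrash as err:
        print(err)

# On a fabric-based box like the GSR, cards exchange packets directly across
# the switch fabric, so there is no single shared structure to poison this way.

The real mechanism was far more involved, of course, but the shape of the failure is the same: one bad writer, many crashed readers.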

One of our big customers, a Latin American telecommunications company I’ll call LatCom, was a heavy user of 7500’s.  They were a constant source of painful cases, and for some reason had a habit of opening P1 cases on Fridays at 5:55pm.  Back then HTTS day-shift engineers’ shifts ended at 6pm, at which point the night shift took over, but once we accepted a P1 or P2 case, unlike backbone TAC, we had to work it until resolution.  LatCom drove us nuts.  Five minutes was the difference between going home for the weekend and potentially being stuck on the phone until 10pm on a Friday night.  The fact that LatCom’s engineers barely spoke English also proved a challenge and drew out the cases–occasionally we had to work through non-technical translators, and getting them to render “there was a CEF bug causing bad data to be placed into the queue on the RP” into Spanish was problematic.

After years of nightmare 7500 crashes, LatCom finally did what we asked:  they dropped a lot of money to upgrade their routers to GSRs with PRPs, at that time our most modern box.  All the HTTS RP engineers breathed a sigh of relief, knowing that the days of nightmare cascading line card failures on 7500’s were coming to an end.  We had never seen a single case of such a failure on a GSR.

That said, we knew that if anything bad was going to happen, it would happen to these guys.  And sure enough, one day I got a case with…you guessed it, a massive cascading line card failure on a GSR!  The first one I had seen.  In the case notes I described the failure as follows:

  1. Six POS (Packet over SONET) interfaces went down at once
  2. Fifteen seconds later, slots 1 and 15 started showing CPUHOG messages followed by tracebacks
  3. Everything stabilized until a few hours later, when the POS interfaces went down again
  4. Then, line cards in slots 0, 9, 10, 11, and 13 crashed
  5. Fifteen seconds later, line cards in slots 6 and 2 crashed
  6. And so forth

My notes said: “basically we had a meltdown of the box.”  To make matters worse, 4 days later they had an identical crash on another GSR!

When faced with this sort of mess, a TAC engineer would usually send the details to an internal mailer, which is exactly what I did.  The usual attempt by some on the mailer to throw hardware at the problem didn’t go far, since we had seen the exact same crash on a second router.  This seemed to be a CEF bug.

Re-reading the rather extensive case notes brings up a lot of pain.  Because the customer had just spent millions of dollars to replace their routers with a new platform that, we assured them, would not be susceptible to the same problem, this went all the way to their top execs and ours.  We were under tremendous pressure to find a solution, and frankly, we all felt bad because we had been sure the new platform would be an end to their problems.

There are several ways for a TAC engineer to get rid of a case:  resolve the problem, tell the customer it is not reproducible, or wait for it to get re-queued to another engineer.  But after two long years at TAC, two years of constant pressure, a relentless stream of cases, angry customers, and problem after problem, my “dream job” at Cisco was taking a toll.  When my old friend Mike, who had hired me at the San Francisco Chronicle, my first network engineering job, called and asked me to join him at a gold partner, the decision wasn’t hard to make.  And so I took the easiest route to getting rid of cases, a lot of them all at once, and quit.  LatCom would be someone else’s problem.  My newest boss, the fifth in two years, looked at me with disappointment when I gave him my two weeks’ notice.

Now that I work at Cisco again, I can see the case notes, and they solved the case, as TAC does.  A bug was filed and the problem fixed.  Still, I can tell you how much of a relief it was to turn in my badge and walk out of Cisco for what I wrongly thought would be the last time.  I felt, in many ways, like a failure in TAC, but at my going-away party, our top routing protocols engineer scoffed at my choice to leave.  “Cisco needs good engineers,” he said.  “I could have gotten you any job you wanted here!”  True or not, it was a nice thing to hear.

I started writing these TAC tales back in 2013, when I still worked at Juniper.  I didn’t expect they’d attract much interest, but they’ve been one of the most consistently popular features of this blog. I’ve cranked out 20 of these covering a number of subjects, but I’m afraid my reservoir of stories is running dry.  I’ve decided that number 20 will be the last TAC Tale on my blog.  There are plenty of other stories to tell, of course, but I’m finished with TAC, as I was back in 2007.  My two years in TAC were some of the hardest in my career, but also incredibly rewarding.  I have so much respect for my fellow TAC engineers, past, present, and future, who take on these complex problems without fear, and find answers for our customers.

 

This one falls into the category of, “I probably shouldn’t post this, especially now that I’m at Cisco again,” but what the heck.

I’ve often mentioned, in this series, the different practices of “backbone TAC” (or WW-TAC) and High Touch Technical Support (HTTS), the group I was a part of.  WW-TAC was the larger TAC organization, where the vast majority of the cases landed.  HTTS was (and still is) a specialized TAC group dedicated to Cisco’s biggest customers, who generally pay for the additional service.  HTTS was supposed to provide a deeper knowledge of the specifics of customer networks and practices, but generally worked the same as TAC.  We had our own queues, and when a high-touch customer would open a case, Cisco’s entitlement tool would automatically route their case to HTTS based on the contract number.

Unlike WW-TAC, HTTS did not use the “follow the sun” model.  Under follow the sun, regular TAC cases would be picked up in a region where it was currently daytime, and when a TAC agent’s shift ended, they would find another agent in the next time zone over to pick up a live (P1/P2) case.  HTTS, at the time, had US-based employees only, and we had to work P1/P2 cases to resolution.  This meant that if your shift ended at 6pm and a P1 case came in at 5:55, you might be stuck in the office for hours until you resolved it.  We did have a US-based night shift that came on at 6pm, but they only accepted new cases–we couldn’t hand off a live one to night shift.

Weekends were covered by a model I hated, called “BIC”.  I asked my boss what it stood for and he explained it was either “Butt In Chair” or “Bullet In the Chamber.”  The HTTS managers would publish a schedule (quarterly, if I recall) assigning each engineer one or two six-hour shifts during the weekends of that quarter.  During those six hours, we had to be online and taking cases.

Why did I hate it?  First, I hated working weekends, of course.  Second, the caseload was high.  A normal day on my queue might see four cases per engineer, but on BIC you typically took seven or eight.  Third, you had to take cases on every topic.  During the week, only a voice engineer would pick up a voice case.  But on BIC, I, a routing protocols engineer, might pick up a voice case, a firewall case, a switching case…or whatever.  Fourth, because BIC took place on a weekend, normal escalation channels were not available.  If you had a major P1 outage, you couldn’t get help easily.

Remember that a lot of the cases you accepted took weeks or even months to resolve.  Part of a TAC engineer’s day is working his backlog of cases:  researching, working in the lab to recreate a problem, talking to engineering, etc., all to resolve these cases.  When you picked up seven cases on a weekend, you were slammed for weeks after that.

We did get paid extra for BIC, although I don’t remember how much.  It was hundreds of dollars per shift, if I recall.  Because of this, a number of engineers loaded up on BIC shifts and earned thousands of dollars per quarter.  Thankfully, this meant there were plenty of willing recipients when I wanted to give away my shifts, which I did almost always.  (I worked two during my two years at TAC.)  However, sometimes I could not find anyone to take my shift, and in those cases I would actually sell it off, throwing in an additional hundred dollars of my own for whoever would take it.  That’s how much I hated BIC.  Of course, this was done without the company knowing about it, as I’m sure they wouldn’t have approved of me selling my work!

We had one CSE on our team, I’ll call him Omar, who loaded up on BICs.  Then he would come into his week so overloaded with cases from the weekend that he would hardly take a case during the week.  We’d all get burdened with extra load because Omar was off working his weekend cases.  Finally, as team lead, I called him out on it in our group chat, and Omar blew up at me.  I was right, of course, but I had to let it go.

I don’t know if HTTS still does BIC, although I suspect it’s gone away.  I still work almost every weekend, but it’s to stay on top of my own workload rather than to take on more cases.

The case came into the routing protocols queue, even though it was simply a line card crash.  The RP queue in HTTS was the dumping ground for anything that did not fit into one of the few other specialized queues we had.  A large US service provider had a Packet over SONET (PoS) line card on a GSR 12000-series router crashing over and over again.

Problem Details: 8 Port ISE Packet Over SONET card continually crashing due to

SLOT 2:Aug  3 03:58:31: %EE48-3-ALPHAERR: TX ALPHA: error: cpu int 1 mask 277FFFFF
SLOT 2:Aug  3 03:58:31: %EE48-4-GULF_TX_SRAM_ERROR: ASIC GULF: TX bad packet header detected. Details=0x4000

A previous engineer had the case, and he did what a lot of TAC engineers do when faced with an inexplicable problem:  he RMA’d the line card.  As I have said before, RMA is the default option for many TAC engineers, and it’s not a bad one.  Hardware errors are frequent and replacing hardware often is a quick route to solving the problem.  Unfortunately the RMA did not fix the problem, the case got requeued to another engineer, and he…RMA’d the line card.  Again.  When that didn’t work, he had them try the card in a different slot, but it continued to generate errors and crash.

The case bounced through two other engineers before getting to me.  Too bad the RMA option was out.  But the simple line card crash and error got even weirder.  The customer had two GSR routers in two different cities that were crashing with the same error.  Even stranger:  the crash was happening at precisely the same time in both cities, down to the second.  It couldn’t be a coincidence, because each crash on the first router was mirrored by a crash at exactly the same time on the second.

The conversation with my fellow engineers ranged from plausible to ludicrous.  There was a legend in TAC, true or not, that solar flares cause parity errors in memory and hence crashes.  Could a solar flare be triggering the same error on both line cards at the same time?  Some of my colleagues thought it was likely, but I thought it was silly.

Meanwhile, internal emails were going back and forth with the business unit to figure out what the errors meant.  Even for experienced network engineers, Cisco internal emails can read like a foreign language.  “The ALPHA errors are side-effects of the GULF errors,” one development engineer commented, not so helpfully.  “Engine is feeding invalid packets to GULF and that causes the bad header error being detected on GULF,” another replied, only slightly more helpfully.

The customer, meanwhile, had identified a faulty fabric card on a Juniper router in their core.  Apparently the router was sending malformed packets to multiple provider edge (PE) routers all at once, which explained the simultaneous crashing.  Because all the PEs were in the US, forwarding was a matter of milliseconds, and thus there was very little variation in the timing.  How did the packets manage to traverse the several hops of the provider network without crashing any GSRs in between?  Well, the customer was using MPLS, and the corruption was in the IP header of the packets.  The intermediate hops forwarded the packets, without ever looking at the IP header, to the edge of the network, where the MPLS labels get stripped, and IP forwarding kicks in.  It was at that point that the line card crashed due to the faulty IP headers.  That said, when a line card receives a bad packet, it should drop it, not crash.  We had a bug.
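
For readers who have not worked on a label-switched core, here is a toy Python sketch, purely illustrative and not real forwarding code (the packet fields and function names are invented), of why the corrupted packets sailed untouched through the middle of the network and only caused trouble at the edge:

def p_router(packet):
    # Core (P) routers forward on the MPLS label alone; whatever sits
    # underneath the label, corrupt or not, is never inspected.
    packet["label"] += 100              # stand-in for a label swap
    return packet

def egress_pe(packet):
    # At the edge the label is popped and IP forwarding kicks in; this is
    # the first place anything actually parses the IP header.
    packet.pop("label")
    if not packet["ip_header"]["valid"]:
        raise RuntimeError("malformed IP header reached the edge line card")
    return packet

pkt = {"label": 16, "ip_header": {"valid": False}}   # corrupted upstream by the bad fabric card
pkt = p_router(pkt)        # passes cleanly through every intermediate hop
pkt = p_router(pkt)

try:
    egress_pe(pkt)         # only here does the bad header get looked at
except RuntimeError as err:
    print(err)             # a healthy card should just drop it; ours crashed instead, hence the bug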

The development engineers could not determine why the line card was crashing based on the log info.  By this time, the customer had already replaced the faulty Juniper module and the network was stable.  The DEs wanted us to re-introduce the faulty module into the core and load up an engineering special debug image on the GSRs to capture the faulty packet.  This is often where we have a gulf, pun intended, between engineering and TAC.  No major service provider or customer wants to let Cisco engineering experiment on their network.  The customer decided to let it go.  If it came back, at least we could try to blame the issue on sunspots.

When you open a TAC case, how exactly does the customer support engineer (CSE) figure out how to solve the case?  After all, CSEs are not super-human.  Just like any engineer, in TAC you have a range of brilliant to not-so-brilliant, and everything in between.  Let me give an example:  I worked at HTTS, or high-touch TAC, serving customers who paid a premium for higher levels of support.  When a top engineer at AT&T or Verizon opened a case, how was it that I, who had never worked professionally in a service provider environment, was able to help them at all?  Usually when those guys opened a case, it was something quite complex and not a misconfigured route map!

TAC CSEs have an arsenal of tools at their disposal that customers, and even partners, do not.  One of the most powerful is well known to anyone who has ever worked in TAC:  Topic.  Topic is an internal search engine.  It can do more now, but at the time I was in TAC, Topic could search bugs, TAC cases, and internal mailers.  If you had a weird error message or were seeing inexplicable behavior, popping the message or symptoms into Topic frequently turned up a known bug.  Failing that, it might pull up another TAC case, which would show the best troubleshooting steps to take.

Topic also searches the internal mailers, the email lists used internally by Cisco employees.  TAC agents, salespeople, TMEs, product managers, and engineering all exchange emails on these mailers, which are then archived.  Oftentimes a problem would show up in the mailer archives, and engineering had already provided an answer.  Sometimes, if Topic failed, we would post the symptoms to the mailers in hopes that engineering, a TME, or some other expert would have a suggestion.  I was always careful when doing so:  if you posted something that had already been answered, or asked too often, flames would come your way.

TAC engineers have the ability to file bugs across the Cisco product portfolio.  This is, of course, a powerful way to get engineering attention.  Customer-found defects are taken very seriously, and any bug that is opened will get a development engineer (DE) assigned to it quickly.  We were judged on the quality of the bugs we filed, since TAC does not want to abuse the privilege and waste engineering time.  If a bug is filed for something that is not really a bug, it gets marked “J” for Junk, and you don’t want to have too many junked bugs.  That said, on one or two occasions, when I needed engineering help and the mailers weren’t working, I knowingly filed a Junk bug to get some help from engineering.  Fortunately, I also filed a few real bugs that got fixed.

My team was the “routing protocols” team for HTTS, but we were a dumping ground for all sorts of cases.  RP often got crash cases, cable modem problems, and other issues, even though these weren’t strictly RP.  Even within the technical limits of RP, there is a lot of variety among cases.  Someone who knows EIGRP cold may not have a clue about MPLS.  A lot of times, when stuck on a case, we’d go find the “guy who knows that” and ask for help.  We had a number of cases on Asynchronous Transfer Mode (ATM), an old WAN protocol (more or less), when I worked at TAC.  We had one guy who knew ATM, and his job was basically just to help with ATM cases.  He had a desk at the office but almost never came in, never worked a shift, and frankly I don’t know what he did all day.  But when an ATM case came in, day or night, he was on it, and I was glad we had him, since I knew little about the subject.

Some companies have NOCs with tier 1, 2, and 3 engineers, but we just had CSEs.  While we had different pay grades, TAC engineers were not tiered in HTTS.  “Take the case and get help” was the motto.  Backbone (non-HTTS) TAC had an escalation team, with some high-end CSEs who jumped in on the toughest cases.  HTTS did not, and while backbone TAC didn’t always like us pulling on their resources, at the end of the day we were all about killing cases, and a few times I had backbone escalation engineers up in my cube helping me.

The more heated a case gets, the higher the impact, the longer the time to resolve, the more attention it gets.  TAC duty managers can pull in more CSEs, escalation, engineering, and others to help get a case resolved.  Occasionally, a P1 would come in at 6pm on a Friday and you’d feel really lonely.  But Cisco being Cisco, if they need to put resources on an issue, there are a lot of talented and smart people available.

There’s nothing worse than the sinking feeling a CSE gets when realizing he or she has no clue what to do on a case.  When the Topic searches fail, when escalation engineers are stumped, when the customer is frustrated, you feel helpless.  But eventually, the problem is solved, the case is closed, and you move on to the next one.

When you work at TAC, you are required to be “on-shift” for four hours each day.  This doesn’t mean that you work four hours a day, just that you are actively taking cases only four hours per day.  The other four (or more) hours you work on your existing backlog, calling customers, chasing down engineering for bug fixes, doing recreates, and, if you’re lucky, doing some training on the side.  While you were on shift, you would still work on the other stuff, but you were responsible for monitoring your “queue” and taking cases as they came in.  On our queue we generally liked to have four customer support engineers (CSE’s) on shift at any time.  Occasionally we had more or fewer, but never fewer than two.  We didn’t like to run with two engineers for very long;  if a P1 came in, a CSE could be tied up for hours, unable to deal with the other cases coming in, and the odds were not low that more than one P1 would come in.  With all the on-shift CSE’s tied up, it was up to the duty manager to start paging off-shift engineers as cases came in, never a good thing.  If you were ever on hold for a long time with a P1, there is a good chance the call center agent was simply unable to find a CSE because they were all tied up.  Sometimes it was due to bad planning, sometimes lack of staff.  Sometimes you would start a shift with five CSE’s on the queue and they’d all get on P1’s in the first five minutes.  The queue was always unpredictable.

At TAC, when you were on-shift, you could never be far from your desk.  You were expected to stay put, and if you had to get up to use the bathroom or go to the lab, you notified your fellow on-shift engineers so they knew you wouldn’t be available.  Since I preferred the 10am-2pm shift, in 2 years I took lunch away from my desk maybe 5 times.  Most days I told the other guys I was stepping out, ran to the cafeteria, and ran back to my desk to eat while taking cases.

Thus, I was quite happy one day when I had a later shift and my colleague Delvin showed up at my desk and asked if I wanted to go to a nice Chinese lunch.  Eddy Lau, one of our new CSE’s who had recently emigrated from China, had found an excellent and authentic restaurant.  We hopped into Eddy’s car and drove over to the restaurant, where we proceeded to have a two-hour long Chinese feast.  “It’s so great to actually go to lunch,” I said to Delvin, “since I eat at my desk every day.”  Eddy was happy to help his new colleagues out.

As we were driving back, Delvin asked Eddy, “When are you on shift?”

“Right now,” said Eddy.

“You’re on shift now?!” Delvin asked incredulously.  “Dude, you can’t leave for two hours if you’re on shift.  Who are you on shift with?”

“Just me and Sarah,” Eddy said, not really comprehending the situation.

“You left Sarah on shift by herself?!” Delvin asked.  “What if a P1 comes in?  What if she gets swamped by cases?  You can’t leave someone alone on the queue!”

We hurried back to the office and pulled up WebMonitor, which showed not only active cases, but who had taken cases that shift and how many.  Sarah had taken a single case.  By some amazing stroke of luck, it had been a very quiet shift.

I walked by Eddy’s desk and he gave me a thumbs up and a big smile.  I figured he wouldn’t last long.  A couple months later, after blowing a case, Eddy got put on RMA duty and subsequently quit.

If you ever wonder why you had to wait so long on the phone, it could be a busy day.  Or it could be your CSE’s decided to take a long lunch without telling anyone.

When I was still a new engineer, a fellow customer support engineer (CSE) asked a favor of me. I’ll call him Andy.

“I’m going on PTO, could you cover a case for me? I’ve filed a bug, and while I’m gone there will be a conference call. Just jump on it and tell them that the bug has been filed and engineering is working on it.” The case was with one of our largest service provider clients. I won’t say which, but they were a household name.

When you’re new and want to make a good impression, you jump on chances like this. It was a simple request and would prove I was a team player. Of course I accepted the case and went about my business, with the conference call on my calendar for the next week.

Before I got on the call I took a brief look at the case notes and the DDTS (what Cisco calls a bug). Everything seemed to be in order. The bug was filed and in engineering’s hands. Nothing to do but hop on the call and report that the bug had been filed and we were working on it.

I dialed the bridge and after I gave my name the automated conference bridge said “there are 20 other parties in the conference.” Uh oh. Why did they need so many?

After I joined, someone asked for introductions. As they went around the call, there were a few engineers, several VP’s, and multiple senior directors. Double uh oh.

“Jeff is calling from Cisco,” the leader of the call said. “He is here to report on the P1 outage we had last week affecting multiple customers. I’m happy to tell you that Cisco has been working diligently on the problem and is here to report their findings and their solution. Cisco, take it away.”

I felt my heart in my throat. I found my voice and sheepishly said: “Uh, we’ve, uh, filed a bug for your problem and, uh, engineering is looking into it.”

There was dead silence, followed by a VP chiming in: “That’s it?”

I was then chewed out thoroughly for not doing enough and wasting everyone’s time.

When Andy got back he grabbed the case back from me. “How’d the call go?” he asked.

I told him how it went horribly, how they were expecting more than I delivered, and how I took a beating for him.

Andy just smiled. Welcome to TAC.

When I first started at TAC, I wasn’t allowed to take cases by myself.  If I grabbed a case, I had to get an experienced engineer to help me out.  One day I grabbed a case on a Catalyst 6k power supply, and asked Veena (not her real name) to help me on the case.

We got the customer on the phone.  He was an engineer at a New York financial institution, and sounded like he came from Brooklyn.  I lived in Williamsburg for a while with my mom back in the 1980’s before it was cool, and I know the accent.  He explained that he had a new 6k, and it wasn’t recognizing the power supply he had bought for it.  All of the modules had a “power denied” message on them.

I put the customer on speaker phone in my cube and Veena looked at the case notes.  As was often the case in TAC, we put the customer on mute while discussing the issues.  Veena thought it was a bad connection between the power supply and the switch.

“Here’s what I want you to do,” Veena said to the customer, un-muting the phone.  “I used to work in the BU, and a lot of times these power supplies don’t connect to the backplane.  You need to put it in hard.  Pull the power supply out and slam it into the chassis.  I want to hear it crack!”

The customer seemed surprised.  “You want me to do what?!” he bristled.

“Slam it in!  Just slam it in as hard as you can!  We saw this in the BU all the time!”

“Hey lady,” he responded, “we paid a couple hundred grand for this box and I don’t want to break it.”

“It’s ok,” she said, “it’ll be fine.  I want to hear the crack!”

“Well, ok,” he said with resignation.  He put the phone down and we heard him shuffle off to the switch.  Meanwhile Veena looked at me and said “Pull up the release notes.”  I pulled up the notes, and we saw that the power supply wasn’t supported in his version of Catalyst OS.

Meanwhile in the background:  CRACK!!!

The customer came back on the line.  “Lady, I slammed that power supply into the chassis as hard as I could.  I think I broke something on it, and it still doesn’t work!”

“Yes,” Veena replied.  “We’ve discovered that your software doesn’t support the power supply and you will need to do an upgrade…”

I’ve come back to Cisco recently, and I think I can say that I haven’t worked this hard since the last time I was at Cisco.  I remember my first manager at TAC telling me in an interview that “Cisco loves workaholics.”  In an attempt to get more organized, I’ve been taking a second crack at using OmniFocus and the GTD methodology.  To be honest, I haven’t had much luck with these systems in the past.  I usually end up entering a bunch of tasks into the system, and then quickly get behind on crossing them off.  I find that the tasks I really want to do, or need to do, I would do without the system, and the ones that I am putting off I keep putting off anyway.  I have so much to do now, however, that I need to track things more efficiently and I am hoping OmniFocus is the solution.

In this article in my “Ten Years a CCIE” series, I describe my experience going to work at Cisco as a CCIE.  Unlike many Cisco-employed CCIE’s, I earned my certification outside of Cisco.

A CCIE leads to a job at Cisco

I returned to my old job at the Chronicle and had my business cards reprinted with my CCIE number. I loved handing them out, particularly at meetings with telephone companies and Internet service providers whose salespeople were likely to know what such a certification meant.  I remember one such salesperson, duly impressed, saying “wow, on that test you can be forced to configure any feature on any Cisco product…I don’t know how anyone could prepare for that!”  (Uh, right.  See “The CCIE Mystique“.)

At that time, the most popular forum for aspiring CCIE’s was an email distribution list called groupstudy.com. I had been a subscriber to this mailing list, but prior to passing my exam, I didn’t feel adequate to post anything there. However, once I passed, I began posting regularly, beginning with a summary of my test preparation process. One day I got an email from a mysterious CCIE who told me that I sounded like I knew what I was talking about, asking me if I wanted to interview for a job. I thought his name and number sounded familiar, and when I got home I confirmed my suspicions by digging through my bookshelf. He was the author of one of my books about Catalyst Quality of Service.

A brutal interview

It turns out he was a manager at Cisco High Touch Technical Support, a group of TAC engineers who specialized in high profile customers. I scheduled an interview right away.

This interview was by far the most difficult I’ve had in my career. They brought me into a room with four CCIE’s, two of them double, all of them sharp. Each one of them had a different specialty. One of them was a security guy, another one was an expert on multicast, another was an expert on switching. When it came to Kumar, the voice guy, I figured I was scot-free. After all, I didn’t claim to know anything about voice over IP. Kumar looked over my resume, and then he looked up at me. “I see you have ISDN on your resume,” Kumar said. And then he began to grill me on ISDN. Darn, I should have thought of that!  Thankfully, I was well prepared.

Despite one or two mistakes in the interview, I got hired on and began my new job as a customer support engineer at HTTS. My first few months were in a group called ESO, which supported large enterprises and was very focused on Catalyst switching.  I won’t go into the details of the job here, but you can see my many TAC tales if you are interested.

Cisco’s San Jose campus

Little purple stickers everywhere!

One thing I quickly noticed when I got to Cisco was that a lot of the people in my department had a nickel-sized purple dot on their ID badges and cubicle nameplates. I found out that these purple dots were actually stickers with the CCIE logo. Cisco employees who had their CCIE’s stuck these purple dots on their badges and nameplates to show it off. Many of the CCIE’s who had passed multiple exams actually placed multiple dots on their badges and nameplates. I wanted one quite badly. The problem was, sheets of these purple stickers were sent out only to the early CCIE’s, and by the time I had passed, Cisco was no longer providing the sheets of stickers. I suppose I could’ve had some printed out, but I asked around looking for a CCIE who was generous enough to give me one of his dots. They were in scarce supply, however, and nobody was willing to part with one. It was just another way newer CCIE’s were getting shortchanged.

The real CCIE logo

Even though the exam had now switched to the one day format, you still didn’t meet too many CCIE’s outside of Cisco. It was thus quite a shock when I went to Cisco and saw purple dots everywhere. It seemed like fully half of the people I was working with in my new job had CCIE’s. And many of them had low numbers, in the 2000’s and even in the 1000’s. I was quite relieved to find that they all treated me with total respect; nobody ever challenged me on account of my one-day CCIE. Still, I always had (and always will have) a great deal of respect for those people who passed the test when it was a two day test, and the cottage industry devoted to minting CCIE’s had not yet come into existence.

Customers challenge the CCIE

Customers were another story. I remember one BGP case in particular. I looked at the customer’s configuration and immediately realized that it was a simple matter of misconfiguration. I fired up my lab, reproduced his configuration, and proved to him that it was indeed a configuration error on his part. I wrote it all up in an email and proudly signed it with my CCIE number. Within half an hour I got a call from the customer with one of his colleagues on the line. They proceeded to grill me rapidly on BGP, asking all sorts of questions that weren’t relevant to their case and stumping me several times. At that point I realized that when many people see you are a CCIE, they take it as a challenge. In some cases they have failed the test themselves, or they’ve met stupid CCIE’s in the past, and they feel themselves to be on a mission to discredit all CCIE’s. After that episode, I removed my CCIE number from my email signature. I had gained a feeling of self-importance after passing my exam, but working among so many people with the same certification, and dealing with such intelligent customers, I realized that the CCIE didn’t always carry the prestige I thought it did.  The mystique diminished even further.

Incidentally, I became friends with all of the guys who interviewed me, and I was on the interview team myself during my tenure at Cisco. One extremely sharp CCIE we hired told me our interview was so tough he had to “hit the bottle” afterwards. It was considered a rite of passage at TAC to go through a tough interview, but I have gotten a lot nicer in my interview style now, having been on the receiving end of a few grillings.

The value of a CCIE

One of the later posts in this series will examine the question of the value of a CCIE certification.  After all, this is one of the most common questions I see in forums dedicated to certification.  However, my experience getting hired into Cisco (the first time) has some lessons.

  • The immediate reason I got hired was my experience and my willingness to go out of my way on Groupstudy to help others get their CCIE.  However, I would not have gotten the position without a CCIE, so clearly it proved its value there.
  • Once you are at Cisco, although people commonly display their stickers and plaques, having a CCIE certification will not necessarily distinguish you.
  • There are many CCIE’s who have made a bad impression on others, whether they are only book-knowledgeable, or even cheaters.  Often people challenge you when you have a CCIE, instead of respecting you.

In the next article in the series, Multiple CCIEs, Multiple Attempts, I describe passing the CCIE Security exam.  I talk about my experience suffering the agony of defeat for the first time, and how I eventually conquered that test.