HTTS


I’ve mentioned before that, despite being on the Routing Protocols team, I spent a lot of time handling crash cases in TAC.  At the time, my queue was just a dumping ground for cases that didn’t fit into any other bucket in the High Touch structure.  Backbone TAC had a much more granular division of teams, including a team entirely dedicated to crash.  But in HTTS, we did it all.

Some crashes were minor, like a (back then) 2600-series router reloading due to a bus error.  Some were catastrophic, particularly crashes on large chassis-type routing systems in service provider networks.  These could have hundreds of interfaces, and with sub-interfaces, potentially thousands of customers affected by a single outage.  Chassis platforms vary in their architecture, but many of the platforms we ran at the time used a distributed architecture in which the individual line cards ran a subset of IOS.  Thus, unlike a 2600, which had “dumb” WIC cards for interface connections, on chassis systems the line cards themselves could crash in addition to the route processors.  Oftentimes, when one line card crashed, the failure would cascade through the box, taking down card after card in a massive meltdown.

The 7500 was particularly prone to these.  A workhorse of Cisco’s early product line, the 7500 had line cards that ran IOS but forwarded packets between each other by placing them into special queues on the route processor.  This was quite unlike later products, such as the Gigabit Switch Router (GSR), which had a fabric architecture enabling line cards to communicate directly.  On the 7500, a line card having a problem would often write bad data into the shared queues; other line cards would then read that data and crash in turn, causing a cascading failure.
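
To make the cascade mechanism a little more concrete, here is a toy simulation of the idea.  This is purely illustrative Python, not real IOS code; the shared queue, the “cards,” and the corruption check are all invented for the example:

from collections import deque

shared_queue = deque()            # stands in for the packet queues on the RP

class LineCard:
    def __init__(self, slot, misbehaving=False):
        self.slot = slot
        self.misbehaving = misbehaving
        self.crashed = False

    def transmit(self):
        # A faulty card writes a corrupt entry into the shared queue.
        entry = b"\xde\xad" if self.misbehaving else b"ok"
        shared_queue.append((self.slot, entry))

    def receive(self):
        if self.crashed or not shared_queue:
            return
        _, entry = shared_queue.popleft()
        if entry != b"ok":
            # A robust card would simply drop the bad entry; here it "crashes,"
            # which is roughly what happened on the 7500.
            self.crashed = True
            print(f"slot {self.slot}: crashed on corrupt queue entry")

cards = [LineCard(slot) for slot in range(4)] + [LineCard(4, misbehaving=True)]

# The faulty card keeps writing bad entries; each healthy card that reads one
# goes down, and the failure walks across the chassis.
for _ in range(4):
    cards[4].transmit()
for card in cards[:4]:
    card.receive()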

One of our big customers, a Latin American telecommunications company I’ll call LatCom, was a heavy user of 7500’s.  They were a constant source of painful cases, and for some reason had a habit of opening P1 cases on Fridays at 5:55pm.  Back then, HTTS day-shift engineers’ shifts ended at 6pm, at which point the night shift took over, but unlike backbone TAC, once we accepted a P1 or P2 case we had to work it until resolution.  LatCom drove us nuts.  Five minutes was the difference between going home for the weekend and potentially being stuck on the phone until 10pm on a Friday night.  The fact that LatCom’s engineers barely spoke English also proved a challenge and drew out the cases–occasionally we had to work through non-technical translators, and getting them to render “there was a CEF bug causing bad data to be placed into the queue on the RP” into Spanish was problematic.

After years of nightmare 7500 crashes, LatCom finally did what we asked:  they dropped a lot of money to upgrade their routers to GSRs with PRPs, at that time our most modern box.  All the HTTS RP engineers breathed a sigh of relief knowing that the days of nightmare cascading line card failures on 7500’s were coming to an end.  We had never seen a single case of such a failure on a GSR.

That said, we knew that if anything bad was going to happen, it would happen to these guys.  And sure enough, one day I got a case with…you guessed it, a massive cascading line card failure on a GSR!  The first one I had seen.  In the case notes I described the failure as follows:

  1. Six POS (Packet over SONET) interfaces went down at once
  2. Fifteen seconds later, slots 1 and 15 started showing CPUHOG messages followed by tracebacks
  3. Everything stabilized until a few hours later, when the POS interfaces went down again
  4. Then, line cards in slots 0, 9, 10, 11, and 13 crashed
  5. Fifteen seconds later, line cards in slots 6 and 2 crashed
  6. And so forth

My notes said: “basically we had a meltdown of the box.”  To make matters worse, 4 days later they had an identical crash on another GSR!

When faced with this sort of mess, TAC agents usually would send the details to an internal mailer, which is exactly what I did.  The usual attempt by some on the mailer to throw hardware at the problem didn’t go far, since we had seen the exact same crash on another router.  This seemed to be a CEF bug.

Re-reading the rather extensive case notes brings up a lot of pain.  Because the customer had just spent millions of dollars to replace their routers with a new platform that, we assured them, would not be susceptible to the same problem, this went all the way to their top execs and ours.  We were under tremendous pressure to find a solution, and frankly, we all felt bad because we were sure the new platform would put an end to their problems.

There are several ways for a TAC engineer to get rid of a case:  resolve the problem, tell the customer it is not reproducible, or wait for it to get re-queued to another engineer.  But after two long years at TAC, two years of constant pressure, a relentless stream of cases, angry customers, and problem after problem, my “dream job” at Cisco was taking a toll.  When my old friend Mike, who had hired me at the San Francisco Chronicle, my first network engineering job, called and asked me to join him at a gold partner, the call wasn’t hard to make.  And so I took the easiest route to getting rid of cases, a lot of them all at once, and quit.  LatCom would be someone else’s problem.  My newest boss, the fifth in two years, looked at me with disappointment when I gave him my two weeks’ notice.

Now that I work at Cisco again, I can see the case notes, and the team solved the case, as TAC does.  A bug was filed and the problem fixed.  Still, I can tell you how much of a relief it was to turn in my badge and walk out of Cisco for what I wrongly thought would be the last time.  I felt, in many ways, like a failure in TAC, but at my going away party, our top routing protocols engineer scoffed at my choice to leave.  “Cisco needs good engineers,” he said.  “I could have gotten you any job you wanted here!”  True or not, it was a nice comment to hear.

I started writing these TAC tales back in 2013, when I still worked at Juniper.  I didn’t expect they’d attract much interest, but they’ve been one of the most consistently popular features of this blog. I’ve cranked out 20 of these covering a number of subjects, but I’m afraid my reservoir of stories is running dry.  I’ve decided that number 20 will be the last TAC Tale on my blog.  There are plenty of other stories to tell, of course, but I’m finished with TAC, as I was back in 2007.  My two years in TAC were some of the hardest in my career, but also incredibly rewarding.  I have so much respect for my fellow TAC engineers, past, present, and future, who take on these complex problems without fear, and find answers for our customers.

 

This one falls into the category of, “I probably shouldn’t post this, especially now that I’m at Cisco again,” but what the heck.

I’ve often mentioned, in this series, the different practices of “backbone TAC” (or WW-TAC) and High Touch Technical Support (HTTS), the group I was a part of.  WW-TAC was the larger TAC organization, where the vast majority of the cases landed.  HTTS was (and still is) a specialized TAC group dedicated to Cisco’s biggest customers, who generally pay for the additional service.  HTTS was supposed to provide a deeper knowledge of the specifics of customer networks and practices, but generally worked the same as TAC.  We had our own queues, and when a high-touch customer would open a case, Cisco’s entitlement tool would automatically route their case to HTTS based on the contract number.

Unlike WW-TAC, HTTS did not use the “follow the sun” model.  Under that model, regular TAC cases would be picked up by a region where it was currently daytime, and when a TAC agent’s shift ended, they would find another agent in the next timezone over to pick up a live (P1/P2) case.  At HTTS, which at the time had only US-based employees, we had to work P1/P2 cases to resolution.  This meant if your shift ended at 6pm, and a P1 case came in at 5:55, you might be stuck in the office for hours until you resolved it.  We did have a US-based nightshift that came on at 6pm, but they only accepted new cases–we couldn’t hand off a live one to nightshift.

Weekends were covered by a model I hated, called “BIC”.  I asked my boss what it stood for and he explained it was either “Butt In Chair” or “Bullet In the Chamber.”  The HTTS managers would publish a schedule (quarterly, if I recall) assigning each engineer one or two six-hour shifts during the weekends of that quarter.  During those six hours, we had to be online and taking cases.

Why did I hate it?  First, I hated working weekends, of course.  Second, the caseload was high.  A normal day on my queue might see 4 cases per engineer, but on BIC you typically took seven or eight.  Third, you had to take cases on every topic.  During the week, only a voice engineer would pick up a voice case.  But on BIC, I, a routing protocols engineer, might pick up a voice case, a firewall case, a switching case…or whatever.  Fourth, because BIC took place on a weekend, normal escalation channels were not available.  If you had a major P1 outage, you couldn’t get help easily.

Remember that a lot of the cases you accepted took weeks or even months to resolve.  Part of a TAC engineer’s day is working his backlog of cases:  researching, working in the lab to recreate a problem, talking to engineering, etc., all to resolve these cases.  When you picked up seven cases on a weekend, you were slammed for weeks after that.

We did get paid extra for BIC, although I don’t remember how much.  It was hundreds of dollars per shift, if I recall.  Because of this, a number of engineers loaded up on BIC shifts and earned thousands of dollars per quarter.  Thankfully, this meant there were plenty of willing recipients when I wanted to give away my shifts, which I did almost always.  (I worked two during my two years at TAC.)  However, sometimes I could not find anyone to take my shift, and in that case I would actually pay to get rid of it, offering an extra hundred dollars of my own if someone would take the shift.  That’s how much I hated BIC.  Of course, this was done without the company knowing about it, as I’m sure they wouldn’t approve of me paying someone else to do my work!

We had one CSE on our team, I’ll call him Omar, who loaded up on BICs.  Then he would come in on Monday so overloaded with cases from the weekend that he would hardly take another case all week.  We’d all get burdened with extra load because Omar was off working his weekend cases.  Finally, as team lead, I called him out on it in our group chat and Omar blew up on me.  I was right, of course, but I had to let it go.

I don’t know if HTTS still does BIC, although I suspect it’s gone away.  I still work almost every weekend, but it’s to stay on top of my existing work rather than to take on more.

The case came into the routing protocols queue, even though it was simply a line card crash.  The RP queue in HTTS was the dumping ground for anything that did not fit into one of the few other specialized queues we had.  A large US service provider had a Packet over SONET (PoS) line card on a GSR 12000-series router crashing over and over again.

Problem Details: 8 Port ISE Packet Over SONET card continually crashing due to

SLOT 2:Aug  3 03:58:31: %EE48-3-ALPHAERR: TX ALPHA: error: cpu int 1 mask 277FFFFF
SLOT 2:Aug  3 03:58:31: %EE48-4-GULF_TX_SRAM_ERROR: ASIC GULF: TX bad packet header detected. Details=0x4000

A previous engineer had the case, and he did what a lot of TAC engineers do when faced with an inexplicable problem:  he RMA’d the line card.  As I have said before, RMA is the default option for many TAC engineers, and it’s not a bad one.  Hardware errors are frequent and replacing hardware often is a quick route to solving the problem.  Unfortunately the RMA did not fix the problem, the case got requeued to another engineer, and he…RMA’d the line card.  Again.  When that didn’t work, he had them try the card in a different slot, but it continued to generate errors and crash.

The case bounced through two other engineers before getting to me.  Too bad the RMA option was out.  But the simple line card crash and error got even weirder.  The customer had two GSR routers in two different cities that were crashing with the same error.  Even stranger:  the crash was happening at precisely the same time in both cities, down to the second.  It couldn’t be a coincidence, because each crash on the first router was mirrored by a crash at exactly the same time on the second.

The theories from my fellow engineers ranged from plausible to ludicrous.  There was a legend in TAC, true or not, that solar flares cause parity errors in memory and hence crashes.  Could a solar flare be triggering the same error on both line cards at the same time?  Some of my colleagues thought it was likely, but I thought it was silly.

Meanwhile, internal emails were going back and forth with the business unit to figure out what the errors meant.  Even for experienced network engineers, Cisco internal emails can read like a foreign language.  “The ALPHA errors are side-effects of the GULF errors,” one development engineer commented, not so helpfully.  “Engine is feeding invalid packets to GULF and that causes the bad header error being detected on GULF,” another replied, only slightly more helpfully.

The customer, meanwhile, had identified a faulty fabric card on a Juniper router in their core.  Apparently the router was sending malformed packets to multiple provider edge (PE) routers all at once, which explained the simultaneous crashing.  Because all the PEs were in the US, forwarding was a matter of milliseconds, and thus there was very little variation in the timing.  How did the packets manage to traverse the several hops of the provider network without crashing any GSRs in between?  Well, the customer was using MPLS, and the corruption was in the IP header of the packets.  The intermediate hops forwarded the packets, without ever looking at the IP header, to the edge of the network, where the MPLS labels were stripped and IP forwarding kicked in.  It was at that point that the line cards crashed due to the faulty IP headers.  That said, when a line card receives a bad packet, it should drop it, not crash.  We had a bug.
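
If you want the label-versus-IP distinction spelled out, here is a toy sketch in Python, invented purely for illustration (nothing here is the customer’s network or any Cisco or Juniper code): the P routers swap labels without ever parsing the IP header, and only the PE, which pops the label, finally reads the header and trips over the corruption.

def p_router(packet):
    packet["label"] += 1              # stand-in for a real label-swap lookup
    return packet                     # the IP header passes through untouched

def pe_router(packet):
    packet.pop("label")               # label popped at the edge
    version = packet["ip_header"][0] >> 4
    if version != 4:                  # the malformed header is finally noticed here
        # A correct forwarding path drops the packet; the bug in this story
        # was that the line card crashed instead.
        raise RuntimeError("malformed IP header hit IP forwarding")
    return packet

# A packet whose IP header was corrupted upstream (the version nibble is wrong).
packet = {"label": 100, "ip_header": bytes([0x00, 0x45, 0x00, 0x28])}

for _ in range(3):                    # several P hops: label switching only
    packet = p_router(packet)

try:
    pe_router(packet)                 # the PE is the first hop to read the header
except RuntimeError as err:
    print("PE would drop (or, per the bug, crash) here:", err)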

The development engineers could not determine why the line card was crashing based on log info.  By this time, the customer had already replaced the faulty Juniper module and the network was stable.  The DEs wanted us to re-introduce the faulty module into the core and load up an engineering special debug image on the GSRs to capture the faulty packet.  This is often where we have a gulf, pun intended, between engineering and TAC.  No major service provider or customer wants to let Cisco engineering experiment on their network.  The customer decided to let it go.  If it came back, at least we could try to blame the issue on sunspots.

When you open a TAC case, how exactly does the customer support engineer (CSE) figure out how to solve the case?  After all, CSEs are not super-human.  As in any engineering organization, in TAC you have a range from brilliant to not-so-brilliant, and everything in between.  Let me give an example:  I worked at HTTS, or high-touch TAC, serving customers who paid a premium for higher levels of support.  When a top engineer at AT&T or Verizon opened a case, how was it that I, who had never worked professionally in a service provider environment, was able to help them at all?  Usually when those guys opened a case, it was something quite complex and not a misconfigured route map!

TAC CSEs have an arsenal of tools at their disposal that customers, and even partners, do not.  One of the most powerful is well known to anyone who has ever worked in TAC:  Topic.  Topic is an internal search engine.  It can do more now, but at the time I was in TAC, Topic could search bugs, TAC cases, and internal mailers.  If you had a weird error message or were seeing inexplicable behavior, popping the message or symptoms into Topic frequently turned up a matching bug.  Failing that, it might pull up another TAC case, which would show the best troubleshooting steps to take.

Topic also searches internal mailers, the email lists used internally by Cisco employees.  TAC agents, sales people, TMEs, product managers, and engineering all exchange emails on these mailers, which are then archived.  Oftentimes a problem would show up in the mailer archives and engineering had already provided an answer.  Sometimes, if Topic failed, we would post the symptoms to the mailers in hopes that engineering, a TME, or some other expert would have a suggestion.  I was always careful in doing so:  if you posted something that had already been answered, or asked too often, flames would be coming your way.

TAC engineers have the ability to file bugs across the Cisco product portfolio.  This is, of course, a powerful way to get engineering attention.  Customer-found defects are taken very seriously, and any bug that is opened will get a development engineer (DE) assigned to it quickly.  We were judged on the quality of the bugs we filed, since TAC did not want to abuse the privilege and waste engineering time.  If a bug is filed for something that is not really a bug, it gets marked “J” for Junk, and you don’t want to have too many junked bugs.  That said, on one or two occasions, when I needed engineering help and the mailers weren’t working, I knowingly filed a Junk bug just to get engineering’s attention.  Fortunately, I also filed a few real bugs that got fixed.

My team was the “routing protocols” team for HTTS, but we were a dumping ground for all sorts of cases.  RP often got crash cases, cable modem problems, and other issues, even though these weren’t strictly RP.  Even within the technical limits of RP, there is a lot of variety among cases.  Someone who knows EIGRP cold may not have a clue about MPLS.  A lot of times, when stuck on a case, we’d go find the “guy who knows that” and ask for help.  When I worked at TAC we had a number of cases on Asynchronous Transfer Mode (ATM), an older WAN (more or less) protocol.  We had one guy who knew ATM, and his job was basically just to help with ATM cases.  He had a desk at the office but almost never came in, never worked a shift, and frankly I don’t know what he did all day.  But when an ATM case came in, day or night, he was on it, and I was glad we had him, since I knew little about the subject.

Some companies have NOCs with tier 1, 2, and 3 engineers, but we just had CSEs.  While we had different pay grades, TAC engineers were not tiered in HTTS.  “Take the case and get help” was the motto.  Backbone (non-HTTS) TAC had an escalation team, with some high-end CSEs who jumped in on the toughest cases.  HTTS did not, and while backbone TAC didn’t always like us pulling on their resources, at the end of the day we were all about killing cases, and a few times I had backbone escalation engineers up in my cube helping me.

The more heated a case got, the higher the impact, and the longer it took to resolve, the more attention it received.  TAC duty managers can pull in more CSEs, escalation, engineering, and others to help get a case resolved.  Occasionally, a P1 would come in at 6pm on a Friday and you’d feel really lonely.  But Cisco being Cisco, if they needed to put resources on an issue, there were a lot of talented and smart people available.

There’s nothing worse than the sinking feeling a CSE gets when realizing he or she has no clue what to do on a case.  When the Topic searches fail, when escalation engineers are stumped, when the customer is frustrated, you feel helpless.  But eventually, the problem is solved, the case is closed, and you move on to the next one.

When you work at TAC, you are required to be “on-shift” for four hours each day.  This doesn’t mean that you work four hours a day, just that you are actively taking cases only four hours per day.  The other four (or more) hours you work on your existing backlog, calling customers, chasing down engineering for bug fixes, doing recreates, and, if you’re lucky, doing some training on the side.  While you were on shift, you would still work on the other stuff, but you were responsible for monitoring your “queue” and taking cases as they came in.  On our queue we generally liked to have four customer support engineers (CSE’s) on shift at any time.  Occasionally we had more or less, but never fewer than two.  We didn’t like to run with two engineers for very long;  if a P1 comes in, a CSE can be tied up for hours, unable to deal with the other cases that come in, and the odds are not low that more than one P1 comes in.  With all on-shift CSE’s tied up, it was up to the duty manager to start paging off-shift engineers as cases came in, never a good thing.  If ever you were on hold for a long time with a P1, there is a good chance the call center agent was simply unable to find a CSE because they were all tied up.  Sometimes it was due to bad planning, sometimes lack of staff.  Sometimes you would start a shift with five CSE’s on the queue and they’d all get on P1’s in the first five minutes.  The queue was always unpredictable.

At TAC, when you were on-shift, you could never be far from your desk.  You were expected to stay put, and if you had to get up to use the bathroom or go to the lab, you notified your fellow on-shift engineers so they knew you wouldn’t be available.  Since I preferred the 10am-2pm shift, in 2 years I took lunch away from my desk maybe 5 times.  Most days I told the other guys I was stepping out, ran to the cafeteria, and ran back to my desk to eat while taking cases.

Thus, I was quite happy one day when I had a later shift and my colleague Delvin showed up at my desk and asked if I wanted to go to a nice Chinese lunch.  Eddy Lau, one of our new CSE’s who had recently emigrated from China, had found an excellent and authentic restaurant.  We hopped into Eddy’s car and drove over to the restaurant, where we proceeded to have a two-hour long Chinese feast.  “It’s so great to actually go to lunch,” I said to Delvin, “since I eat at my desk every day.”  Eddy was happy to help his new colleagues out.

As we were driving back, Delvin asked Eddy, “When are you on shift?”

“Right now,” said Eddy.

“You’re on shift now?!” Delvin asked incredulously.  “Dude, you can’t leave for two hours if you’re on shift.  Who are you on shift with?”

“Just me and Sarah,” Eddy said, not really comprehending the situation.

“You left Sarah on shift by herself?!” Delvin asked.  “What if a P1 comes in?  What if she gets swamped by cases?  You can’t leave someone alone on the queue!”

We hurried back to the office and pulled up WebMonitor, which showed not only active cases, but who had taken cases that shift and how many.  Sarah had taken a single case.  By some amazing stroke of luck, it had been a very quiet shift.

I walked by Eddy’s desk and he gave me a thumbs up and a big smile.  I figured he wouldn’t last long.  A couple months later, after blowing a case, Eddy got put on RMA duty and subsequently quit.

If you ever wonder why you had to wait so long on the phone, it could be a busy day.  Or it could be that your CSE’s decided to take a long lunch without telling anyone.

When I was still a new engineer, a fellow customer support engineer (CSE) asked a favor of me. I’ll call him Andy.

“I’m going on PTO, could you cover a case for me? I’ve filed a bug and while I’m gone there will be a conference call. Just jump on it and tell them that the bug has been filed and engineering is working on it.” The case was with one of our largest service provider clients. I won’t say which, but they were a household name.

When you’re new and want to make a good impression, you jump on chances like this. It was a simple request and would prove I’m a team player. Of course I accepted the case and went about my business with the conference call on my calendar for the next week.

Before I got on the call I took a brief look at the case notes and the DDTS (what Cisco calls a bug). Everything seemed to be in order. The bug was filed and in engineering’s hands. Nothing to do but hop on the call and report that the bug was filed and we were working on it.

I dialed the bridge and after I gave my name the automated conference bridge said “there are 20 other parties in the conference.” Uh oh. Why did they need so many?

After I joined, someone asked for introductions. As they went around the call, there were a few engineers, several VP’s, and multiple senior directors. Double uh oh.

“Jeff is calling from Cisco,” the leader of the call said. “He is here to report on the P1 outage we had last week affecting multiple customers. I’m happy to tell you that Cisco has been working diligently on the problem and is here to report their findings and their solution. Cisco, take it away.”

I felt my heart in my throat. I found my voice and sheepishly said: “Uh, we’ve, uh, filed a bug for your problem and, uh, engineering is looking into it.”

It was dead silence, followed by a VP chiming in: “That’s it?”

I was then chewed out thoroughly for not doing enough and wasting everyone’s time.

When Andy got back he grabbed the case back from me. “How’d the call go?” he asked.

I told him how it went horribly, how they were expecting more than I delivered, and how I took a beating for him.

Andy just smiled. Welcome to TAC.

When I first started at TAC, I wasn’t allowed to take cases by myself.  If I grabbed a case, I had to get an experienced engineer to help me out.  One day I grabbed a case on a Catalyst 6k power supply, and asked Veena (not her real name) to help me on the case.

We got the customer on the phone.  He was an engineer at a New York financial institution, and sounded like he came from Brooklyn.  I lived in Williamsburg for a while with my mom back in the 1980’s before it was cool, and I know the accent.  He explained that he had a new 6k, and it wasn’t recognizing the power supply he had bought for it.  All of the modules had a “power denied” message on them.

I put the customer on speaker phone in my cube and Veena looked at the case notes.  As was often the case in TAC, we put the customer on mute while discussing the issues.  Veena thought it was a bad connection between the power supply and the switch.

“Here’s what I want you to do,” Veena said to the customer, un-muting the phone.  “I used to work in the BU, and a lot of times these power supplies don’t connect to the backplane.  You need to put it in hard.  Pull the power supply out and slam it into the chassis.  I want to hear it crack!”

The customer seemed surprised.  “You want me to do what?!” he bristled.

“Slam it in!  Just slam it in as hard as you can!  We saw this in the BU all the time!”

“Hey lady,” he responded, “we paid a couple hundred grand for this box and I don’t want to break it.”

“It’s ok,” she said, “it’ll be fine.  I want to hear the crack!”

“Well, ok,” he said with resignation.  He put the phone down and we heard him shuffle off to the switch.  Meanwhile Veena looked at me and said “Pull up the release notes.”  I pulled up the notes, and we saw that the power supply wasn’t supported in his version of Catalyst OS.

Meanwhile in the background:  CRACK!!!

The customer came back on the line.  “Lady, I slammed that power supply into the chassis as hard as I could.  I think I broke something on it, and it still doesn’t work!”

“Yes,” Veena replied.  “We’ve discovered that your software doesn’t support the power supply and you will need to do an upgrade…”

I’ve come back to Cisco recently, and I think I can say that I haven’t worked this hard since the last time I was at Cisco.  I remember my first manager at TAC telling me in an interview that “Cisco loves workaholics.”  In an attempt to get more organized, I’ve been taking a second crack at using OmniFocus and the GTD methodology.  To be honest, I haven’t had much luck with these systems in the past.  I usually end up entering a bunch of tasks into the system, and then quickly get behind on crossing them off.  I find that the tasks I really want to do, or need to do, I would do without the system, and the ones that I am putting off I keep putting off anyway.  I have so much to do now, however, that I need to track things more efficiently and I am hoping OmniFocus is the solution.

In this article in my “Ten Years a CCIE” series, I describe my experience going to work at Cisco as a CCIE.  Unlike many Cisco-employed CCIE’s, I earned my certification outside of Cisco.

A CCIE leads to a job at Cisco

I returned to my old job at the Chronicle and had my business cards reprinted with my CCIE number. I loved handing it out, particularly at meetings with telephone companies and Internet service providers whose salespeople were likely to know what such a certification meant.  I remember one such salesperson, duly impressed, saying “wow, on that test you can be forced to configure any feature on any Cisco product…I don’t know how anyone could prepare for that!”  (Uh, right.  See “The CCIE Mystique”.)

At that time, the most popular forum for aspiring CCIE’s was an email distribution list called groupstudy.com. I had been a subscriber to this mailing list, but prior to passing my exam, I didn’t feel adequate to post anything there. However, once I passed, I began posting regularly, beginning with a summary of my test preparation process. One day I got an email from a mysterious CCIE who told me that I sounded like I knew what I was talking about, asking me if I wanted to interview for a job. I thought his name and number sounded familiar, and when I got home I confirmed my suspicions by digging through my bookshelf. He was the author of one of my books, about Catalyst Quality of Service.

A brutal interview

It turns out he was a manager at Cisco High Touch Technical Support, a group of TAC engineers who specialized in high profile customers. I scheduled an interview right away.

This interview was by far the most difficult I’ve had in my career. They brought me into a room with four CCIE’s, two of them double CCIE’s, all of them sharp. Each one of them had a different specialty. One of them was a security guy, another one was an expert on multicast, another was an expert on switching. When it came to Kumar, the voice guy, I figured I was in the clear. After all, I didn’t claim to know anything about voice over IP. Kumar looked over my resume, and then he looked up at me. “I see you have ISDN on your resume,” Kumar said. And then he began to grill me on ISDN. Darn, I should have thought of that!  Thankfully, I was well prepared.

Despite one or two mistakes in the interview, I got hired on and began my new job as a customer support engineer at HTTS. My first few months were in a group called ESO, which supported large enterprises and was very focused on Catalyst switching.  I won’t go into the details of the job here, but you can see my many TAC tales if you are interested.

Cisco’s San Jose campus

Little purple stickers everywhere!

One thing I quickly noticed when I got to Cisco was that a lot of the people in my department had a nickel-sized purple dot on their ID badges and cubicle nameplates. I found out that these purple dots were actually stickers with the CCIE logo. Cisco employees who had their CCIE’s stuck these purple dots on their badges and nameplates to show it off. Many of the CCIE’s who had passed multiple exams actually placed multiple dots on their badges and nameplates. I wanted one quite badly. The problem was, sheets of these purple stickers were sent out only to the early CCIE’s, and by the time I had passed, Cisco was no longer providing the sheets of stickers. I suppose I could’ve had some printed out, but I asked around looking for a CCIE who was generous enough to give me one of his dots. They were in scarce supply, however, and nobody was willing to part with one. It was just another way newer CCIE’s were getting shortchanged.

The real CCIE logo

Even though the exam had now switched to the one-day format, you still didn’t meet too many CCIE’s outside of Cisco. It was thus quite a shock when I went to Cisco and saw purple dots everywhere. It seemed like fully half of the people I was working with in my new job had CCIE’s. And many of them had low numbers, in the 2000’s and even in the 1000’s. I was quite relieved to find that they all treated me with total respect; nobody ever challenged me on account of my one-day CCIE. Still, I always had (and always will have) a great deal of respect for those people who passed the test when it was a two-day test, and the cottage industry devoted to minting CCIE’s had not yet come into existence.

A customer challenges the CCIE

Customers were another story. I remember one BGP case in particular. I looked at the customer’s configuration and immediately realized that it was a simple matter of misconfiguration. I fired up my lab, reproduced his configuration, and proved to him that it was indeed a configuration error on his part. I wrote it all up in an email and proudly signed it with my CCIE number. Within half an hour I got a call from the customer and one of his colleagues on the line. They proceeded to grill me rapidly on BGP, asking me all sorts of questions that weren’t relevant to their case and stumping me several times. At that point I realized that when many people see you are a CCIE, they take it as a challenge. In some cases they have failed the test themselves, or else they’ve met stupid CCIE’s in the past, and they feel themselves to be on a mission to discredit all CCIE’s. After that episode, I removed my CCIE number from my email signature. I gained a feeling of self-importance after I passed my exam, but working among so many people with the same certification, and dealing with such intelligent customers, I realized that the CCIE didn’t always carry the prestige I thought it did.  The mystique diminished even further.

Incidentally, I became friends with all of the guys who interviewed me, and I was on the interview team myself during my tenure at Cisco. One extremely sharp CCIE we hired told me our interview was so tough he had to “hit the bottle” afterwards. It was considered a rite of passage at TAC to go through a tough interview, but I have gotten a lot nicer in my interview style now, having been on the receiving end of a few grillings.

The value of a CCIE

One of the later posts in this series will examine the question of the value of a CCIE certification.  After all, this is one of the most common questions I see in forums dedicated to certification.  However, my experience getting hired into Cisco (the first time) has some lessons.

  • The immediate reason I got hired was my experience and my willingness to go out of my way on Groupstudy to help others get their CCIE.  However, I would not have gotten the position without a CCIE, so clearly it proved its value there.
  • Once you are at Cisco, although people commonly display their stickers and plaques, having a CCIE certification will not necessarily distinguish you.
  • There are many CCIE’s who have made a bad impression on others, whether they are only book-knowledgeable, or even cheaters.  Often people challenge you when you have a CCIE, instead of respecting you.

In the next article in the series, Multiple CCIEs, Multiple Attempts, I describe passing the CCIE Security exam.  I talk about my experience suffering the agony of defeat for the first time, and how I eventually conquered that test.

I feel a bit of guilt for letting this blog languish for a while. I can see from the response to my articles explaining confusing Juniper features that my work had some benefit outside my own edification, and so I hate to leave articles unfinished which might have been helpful. In addition, WordPress is not easy to maintain and I keep losing notifications of comments, which means that when I am not logging in, I miss the opportunity to respond to kind words and questions.

As it is, my work explaining Juniper to the masses will have to be put on hold, as I have left Juniper after six years and returned to my old employer Cisco! I worked at Juniper longer than I had anywhere else, and it’s amazing to consider that I just closed the door on more than half a decade. But even after attaining my JNCIE, I always felt like a Cisco guy at heart, and so here I am again. A few random thoughts then:

1. I interviewed for a number of jobs, and now that I am hired I can say that I really hate interviewing. My interviews at Cisco were very fair and reasonable. Just for the heck of it I did a phone screen with Google and completely bombed it. I’m not ashamed to admit that. I’m not supposed to reveal their questions, and I won’t, but they were mostly basic questions about TCP functionality, and MAC/ARP stuff, and it’s amazing how you forget some of the basics over the years. I wasn’t really interested in working there so I did no preparation, and in fact the recruiter warned me to brush up on basics. I just figured my work and blog show that I am at least somewhat technical. I plan to write some posts on the art of technical interviewing, but I was certainly underwhelmed by Google’s screening process, as I’m sure they were by my performance. I really wanted the Cisco job, and what a difference attitude makes! (Oh, and I completely munged an MPLS FRR/Node & Link protection question, less than a year after passing the JNCIE-SP. Uh, whoops.)

2. I bear Juniper no ill will. It was an interesting six years. When I came on board, during the Kevin Johnson years, it was all rah-rah pep talks about how we were going to be the next $10 billion company (errr, no…) followed by a plethora of product disasters. Killing off Netscreen handed the firewall market to Palo Alto and Fortinet, and amazingly resuscitated Checkpoint. Junos Space was a disaster, and Pulse slightly less so. QFabric was not a bad idea, but was far too complex. You needed to buy a professional services contract with the product because it was too complicated to install on your own. And yet it supposedly simplified the data center? There was a fiasco with our load balancer product. And then came the activist investors with their Integrated Operating Plan. I will permanently loathe activist investors. Juniper was hurting and they just magnified the hurt. There’s nothing worse than a bunch of generic business-types who wouldn’t know a router if they saw one trying to tell a router company how to do its business. They thought they could apply the same formula you learn in B-school to any company, no matter what it manufactures or does. Then we had the CEO revolving door.

Despite all of this, as I said, I like Juniper. I did ok there, and there are a lot of people I respect working there. Rami Rahim is a good choice for CEO. I left for personal reasons. They still have some good products and good ideas, and I think competition is always good for the marketplace. For the sake of my friends there, I hope Juniper does well.

3. If you read my bio, you will see that I was THE network architect for Juniper IT, meaning I covered everything. This included (in theory at least) campus LAN, WAN, data center, wireless, network security, etc. I did something in all of these spaces. It was a broad level of knowledge, but not deep. That’s why I did my JNCIE-SP–I was hungering to go deep on something. My new job at Cisco is principal technical strategy engineer for data center. This is an opportunity to go deep but not as broad, and I’m happy to be doing that. The data center space is where it’s at these days, and I can’t wait to get deeper into it.

4. Coming back to Cisco after an eight-year hiatus was bizarre. It was cool to pull up all my old bugs and postings to internal aliases to see what I was doing back then. Heck, I actually sounded like I knew a thing or two. I was thrilled to find out I am on the same team as Tim Stevenson, whose work as a Cat 6K TME I admired when I worked in TAC. Just for fun I walked through my old building and floor (K, floor 2) and nearly fell over when I saw that it looked identical. I mean, not only the cubes, but there were these giant signs for the different teams (e.g. “HTTS AT&T TEAM”) which were still hanging there as though the intervening eight years had never happened.

Unfortunately, I have to leave a few in-progress articles in the dustbin. First, I shouldn’t really be promoting Juniper now that I am working for Cisco. And second, I’ve lost access to VMM, the internal Juniper tool I used to spin up VM versions of Juniper routers. However, I hope to start posting on Cisco topics now that I have access to that gear. Cisco’s products are generally better documented than Juniper’s, but I promise to fill any gaps I might find. And I will leave my previous articles up in hopes that they will benefit future engineers who struggle with Junos.

Onwards!