TAC Tales #10: Out to Lunch

When you worked at TAC, you were required to be “on-shift” for four hours each day.  This didn’t mean you only worked four hours a day, just that you were actively taking cases for only four hours.  The other four (or more) hours you worked on your existing backlog: calling customers, chasing down engineering for bug fixes, doing recreates, and, if you were lucky, doing some training on the side.  While you were on shift, you would still work on the other stuff, but you were responsible for monitoring your “queue” and taking cases as they came in.  On our queue we generally liked to have four customer support engineers (CSE’s) on shift at any time.  Occasionally we had more or fewer, but never fewer than two.  We didn’t like to run with two engineers for very long; if a P1 came in, a CSE could be tied up for hours, unable to deal with the other cases coming in, and the odds were not low that more than one P1 would come in.  With all the on-shift CSE’s tied up, it was up to the duty manager to start paging off-shift engineers as cases came in, never a good thing.  If you were ever on hold for a long time with a P1, there is a good chance the call center agent was simply unable to find a CSE because they were all tied up.  Sometimes it was due to bad planning, sometimes lack of staff.  Sometimes you would start a shift with five CSE’s on the queue and they’d all get on P1’s in the first five minutes.  The queue was always unpredictable.

At TAC, when you were on-shift, you could never be far from your desk.  You were expected to stay put, and if you had to get up to use the bathroom or go to the lab, you notified your fellow on-shift engineers so they knew you wouldn’t be available.  Since I preferred the 10am-2pm shift, in 2 years I took lunch away from my desk maybe 5 times.  Most days I told the other guys I was stepping out, ran to the cafeteria, and ran back to my desk to eat while taking cases.

Thus, I was quite happy one day when I had a later shift and my colleague Delvin showed up at my desk and asked if I wanted to go to a nice Chinese lunch.  Eddy Lau, one of our new CSE’s who had recently emigrated from China, had found an excellent and authentic restaurant.  We hopped into Eddy’s car and drove over to the restaurant, where we proceeded to have a two-hour long Chinese feast.  “It’s so great to actually go to lunch,” I said to Delvin, “since I eat at my desk every day.”  Eddy was happy to help his new colleagues out.

As we were driving back, Delvin asked Eddy, “When are you on shift?”

“Right now,” said Eddy.

“You’re on shift now?!” Delvin asked incredulously.  “Dude, you can’t leave for two hours if you’re on shift.  Who are you on shift with?”

“Just me and Sarah,” Eddy said, not really comprehending the situation.

“You left Sarah on shift by herself?!” Delvin asked.  “What if a P1 comes in?  What if she gets swamped by cases?  You can’t leave someone alone on the queue!”

We hurried back to the office and pulled up WebMonitor, which showed not only active cases, but who had taken cases that shift and how many.  Sarah had taken a single case.  By some amazing stroke of luck, it had been a very quiet shift.

I walked by Eddy’s desk and he gave me a thumbs up and a big smile.  I figured he wouldn’t last long.  A couple months later, after blowing a case, Eddy got put on RMA duty and subsequently quit.

If you ever wonder why you had to wait so long on the phone, it could be a busy day.  Or it could be your CSE’s decided to take a long lunch without telling anyone.

TAC Tales #9: Left Hanging

When I was still a new engineer, a fellow customer support engineer (CSE) asked a favor of me. I’ll call him Andy.

“I’m going on PTO, could you cover a case for me? I’ve filed a bug and while I’m gone there will be a conference call. Just jump on it and tell them that the bug has been filed and engineering is working on it.” The case was with one of our largest service provider clients. I won’t say which, but they were a household name.

When you’re new and want to make a good impression, you jump on chances like this. It was a simple request and would prove I was a team player. Of course I accepted the case and went about my business, with the conference call on my calendar for the next week.

Before I got on the call I took a brief look at the case notes and the DDTS (what Cisco calls a bug). Everything seemed to be in order. The bug was filed and in engineering’s hands. Nothing to do but hop on the call and report that the bug was filed and we were working on it.

I dialed the bridge and after I gave my name the automated conference bridge said “there are 20 other parties in the conference.” Uh oh. Why did they need so many?

After I joined, someone asked for introductions. As they went around the call, there were a few engineers, several VP’s, and multiple senior directors. Double uh oh.

“Jeff is calling from Cisco,” the leader of the call said. “He is here to report on the P1 outage we had last week affecting multiple customers. I’m happy to tell you that Cisco has been working diligently on the problem and is here to report their findings and their solution. Cisco, take it away.”

I felt my heart in my throat. I cleared my voice, and sheepishly said: “Uh, we’ve, uh, filed a bug for your problem and, uh, engineering is looking into it.”

There was dead silence, followed by a VP chiming in: “That’s it?”

I was then chewed out thoroughly for not doing enough and wasting everyone’s time.

When Andy got back he grabbed the case back from me. “How’d the call go?” he asked.

I told him how it went horribly, how they were expecting more than I delivered, and how I took a beating for him.

Andy just smiled. Welcome to TAC.

TAC Tales #8: The problem with data

My job as a customer support engineer (CSE) at TAC was the most quantified I’ve ever had.  Every aspect of our job performance was tracked and measured.  We live in the era of big data, and while numbers can be helpful, they can also mislead.  In TAC, there were many examples of that.

Take, for example, our customer satisfaction rating, known as a “bingo” score.  Every time a customer filled out a survey at the end of a TAC case, the engineer was notified and the bingo score recorded and averaged with all his previous scores.  While this would seem to be an effective measure of an engineer’s performance, it often wasn’t.

In TAC, we often ended up taking cases that were “requeues.”  These were cases that were previously worked by another engineer.  Imagine you got a requeue of a case that another CSE had handled terribly.  You close the case quickly, but the customer is still angry at the first CSE, so he gives a low bingo score.  That score was credited against the CSE who closed the case, so even though you took care of it, you got stuck with the low numbers.

This also happened with create-to-close numbers.  We were measured on how quickly we closed cases.  Imagine another CSE had been sitting on a case for six months doing nothing.  The customer requeues it, and you end up with the case, closing it immediately.  You’re left with a six-month create-to-close number even though it wasn’t your fault.

Even worse, if you think about it, the create-to-close number discouraged engineers from taking hard cases.  Easy cases close quickly, but hard ones stay open while recreates are done and bugs are filed.  The engineers who took the hardest cases and were very skilled often had terrible create-to-close numbers.

The bottom line is that you need more than data to understand a person.  Most things in life don’t lend themselves to easy quantification.  Numbers always need to be in context.  Corporate managers are obsessed with quantification, and the Googles of the world are helping to drive our number-love even further.  Meanwhile, reducing people to numbers is a great way to treat them as less than human.

TAC Tales #7: “Just slam it in!”

When I first started at TAC, I wasn’t allowed to take cases by myself.  If I grabbed a case, I had to get an experienced engineer to help me out.  One day I grabbed a case on a Catalyst 6k power supply, and asked Veena (not her real name) to help me on the case.

We got the customer on the phone.  He was an engineer at a New York financial institution, and sounded like he came from Brooklyn.  I lived in Williamsburg for a while with my mom back in the 1980’s before it was cool, and I know the accent.  He explained that he had a new 6k, and it wasn’t recognizing the power supply he had bought for it.  All of the modules had a “power denied” message on them.

I put the customer on speaker phone in my cube and Veena looked at the case notes.  As was often the case in TAC, we put the customer on mute while discussing the issues.  Veena thought it was a bad connection between the power supply and the switch.

“Here’s what I want you to do,” Veena said to the customer, un-muting the phone.  “I used to work in the BU, and a lot of times these power supplies don’t connect to the backplane.  You need to put it in hard.  Pull the power supply out and slam it into the chassis.  I want to hear it crack!”

The customer seemed surprised.  “You want me to do what?!” he bristled.

“Slam it in!  Just slam it in as hard as you can!  We saw this in the BU all the time!”

“Hey lady,” he responded, “we paid a couple hundred grand for this box and I don’t want to break it.”

“It’s ok,” she said, “it’ll be fine.  I want to hear the crack!”

“Well, ok,” he said with resignation.  He put the phone down and we heard him shuffle off to the switch.  Meanwhile Veena looked at me and said “Pull up the release notes.”  I pulled up the notes, and we saw that the power supply wasn’t supported in his version of Catalyst OS.

Meanwhile in the background:  CRACK!!!

The customer came back on the line.  “Lady, I slammed that power supply into the chassis as hard as I could.  I think I broke something on it, and it still doesn’t work!”

“Yes,” Veena replied.  “We’ve discovered that your software doesn’t support the power supply and you will need to do an upgrade…”

TAC Tales #6: The best organization tool ever

I’ve come back to Cisco recently, and I think I can say that I haven’t worked this hard since the last time I was at Cisco.  I remember my first manager at TAC telling me in an interview that “Cisco loves workaholics.”  In an attempt to get more organized, I’ve been taking a second crack at using OmniFocus and the GTD methodology.  To be honest, I haven’t had much luck with these systems in the past.  I usually end up entering a bunch of tasks into the system, and then quickly get behind on crossing them off.  I find that the tasks I really want to do, or need to do, I would do without the system, and the ones that I am putting off I keep putting off anyway.  I have so much to do now, however, that I need to track things more efficiently and I am hoping OmniFocus is the solution.

TAC Tales #5: MWAM

New Year’s resolutions are made to be broken, and I haven’t been keeping up with my resolution to do more blog posts.  Now that I am back at Cisco, I am focusing on programmability and automation, and I do have a lot to say.  However, in honor of my return to Cisco, I thought I would post a new TAC Tales entry.  There is a moral to this story.

One day my boss came to me and said my team would be supporting the MWAM module in the Cat 6K.  I had done a lot of Cat 6K work at that point, but I had never even heard of an MWAM, and failed to see why cases on it would be sent to the routing protocols team.  My boss didn’t seem too concerned with my objections, and said, “Just go watch the VoD.”  VoD = Video on Demand.  So, I did.  I watched the VoD, and it started out by telling me how many processors were on the card and of what kind; what types of buses were used to transmit data; what kind of memory it had; and how it interfaced with the Catalyst and its backplane.  Never did the video tell me what the card actually did.  I had no idea why one would buy an MWAM or what one would do with it.  I hoped a case wouldn’t come in on the card, and when one did, I immediately escalated to engineering because I had no idea what to do.  Fortunately I only ever had one case on the MWAM.  (And the fun thing about coming back to Cisco after 10 years is that I can go look up all these cases I remember and read my notes.  Very cool!)

What is the moral, you ask?  Well, as a Technical Marketing Engineer, a big part of my role is communicating technical concepts clearly to others.  How often have you bought a book or looked at a web page to learn some new protocol, only to find that the description of it begins with packet header formats or state machines?  Fine, but tell me what it actually does before you tell me how it works.  Imagine if you went out into the jungle and encountered someone who has never seen a car.  You wouldn’t start explaining it to him by saying “it uses an internal combustion engine which has a four-stroke cycle of intake, compression, power, exhaust.”  You’d say, “it has wheels and takes me places very fast.”  Now, in defense of the MWAM VoD guy, he may have designed his video for people who already knew what the card was.  But I have often found that people make this assumption, and when I backtrack and start at the beginning when explaining something, people often say, “you know, I’ve always been afraid to ask about that, but thanks for explaining it.”

Meanwhile, my second try at Cisco is much more fun than the last.  And thankfully, no MWAMs.  TAC was a great experience and a period of growth, but it’s not a fun job.

TAC Tales #4: Airline Outage

I don’t advertise this blog so I’m always amazed that people even find it. I figured the least-read articles on this blog were my “TAC Tales,” but someone recently commented that they wanted to see more… Well, I’m happy to oblige.

The recent events at United reminded me of a case where operations were down for one of the major airlines at Miami International Airport. It didn’t directly impact flight operations, but ticketing and baggage handling systems were down. Naturally, it was a P1 and so I dialed into the conference bridge.

This airline had four Cat 6500’s acting as their core devices for the network. The four switches had vastly disparate configurations, both hardware and software. I seem to recall one of them was running a Supe 1 module, which was old even in 2007 when I took the case. There was a different software version on each of them.

EIGRP was acting funny. As a TAC engineer in the routing protocols team, I absolutely hated EIGRP. EIGRP Stuck-In-Active was my nightmare case: when EIGRP loses a route and has no feasible successor, it marks the route active and sends queries to its neighbors, and if a neighbor doesn’t answer before the active timer expires, the route is declared Stuck-In-Active and the adjacency with that neighbor is reset. It was always such a pain to track down the source, and meanwhile you’d have peers resetting all over the place. OSPF doesn’t do that, and neither does IS-IS. I once got in a debate on an internal Cisco alias with some EIGRP guys. Granted, I had insulted their life’s work, but I stated that EIGRP was fast, but unreliable and prone to meltdown. Their retort was that properly designed EIGRP networks do not melt down. Great, but when are networks ever properly designed? They are so often slapped together haphazardly, grow organically, and overall need to be resilient even when unplanned. Of course, those of us in design and architecture positions do our best to build highly available networks, but you don’t want to be running a protocol that flips out when a route at some far end of the network disappears. Anyhow…

The adjacencies on all four boxes were resetting constantly. It was totally unstable. Every five minutes or so, some manager from the airline would hop on the bridge to tell us that they were using handwritten tickets and baggage tags, that lines at the ticket counters were going out the door, etc, etc. Because that really helps me to concentrate. I tried to troubleshoot the way TAC engineers are trained to troubleshoot: collect logs, search for bugs in the relevant software, look for configuration issues. With routing adjacency flaps on switches, always check for STP issues. I couldn’t figure it out.

Finally some high-level engineer for the airline got on the phone and took over like a five-star general. He had his ops team systematically shut down and reset the switches, one at a time. The instability stopped. Wish I’d thought of that.

The standards for a routing protocol like OSPF are written by slow-moving committees, and hence don’t change much. These committees often have members from multiple competing vendors who disagree on exactly what should be done, and even when they do agree, nothing happens fast in IETF committees. Conversely, Cisco owns EIGRP, and they can change it as much as they want. Even their internal committees are nowhere near as bureaucratic as IETF. This means that there can be significant changes in the EIGRP code between IOS releases, much more so than for OSPF, and it is thus vital to keep code revisions amongst participating routers fairly close.

In this case, the consulting engineers for the airline helped them to standardize the hardware and software revisions. They never re-opened the case.

TAC Tales #3: GSR

Shortly after I went to work at TAC, my first team, which was dedicated to enterprise customers, was dissolved, and I ended up on the Routing Protocols team.  The RP team supported both enterprise and service provider customers, and I had zero experience on the SP side.  There was quite a learning curve ahead.

One day my phone rang with a P1.  I dreaded P1’s.  When a P1 came in, you were thrown head-first into a potentially huge outage with no knowledge of the case in advance.  Oftentimes a case that had been worked by another engineer got raised to P1 and you had to deal with someone else’s mess.

On this particular day, the case was a line card problem on a 12000-series, or GSR.  GSR was a service provider box and I knew nothing about it.  I didn’t even think it ran IOS (it does).  I had no idea where to start.  You would think they would have given me training on every product we covered, but at HTTS, at least when I worked there, the general approach was to throw you to the wolves.

So, I politely put the customer on hold and started running around the second floor of building K, where HTTS was located, shouting:  “Does anyone know GSR?  Anybody around to help me?!”  Finally I stumbled across my teammate Abe in the break room stirring a cup of coffee.  Abe had been a product manager for the GSR before he came to TAC.  “Abe, you gotta help me!  I took a P1 on a GSR and I’ve never touched one!”

“Way to go, grab the bull by the horns!” was Abe’s response.  We rushed back to my cube and Abe walked me through GSR line card troubleshooting.  Abe and I have been great friends ever since that day.

Remember, when you call for support and the person on the other end of the phone sounds like they might not know what they are doing, they probably don’t.  And if they put you on hold they may be running around screaming for help.  It happened a lot in TAC.

TAC Tales #2: How to troubleshoot

The case came in P1, and I knew it would be a bad one. One thing you learn as a TAC engineer is that P1 cases are often the easiest. A router is down, send an RMA. But I knew this P1 would be tough because it had been requeued three times. The last engineer who had it was good, very good. And it wasn’t solved. Our hotline gave me a bridge number and I dialed in.

The customer explained to me that he had a 7513 and a 7206, and they had a multilink PPP bundle between them with 8 T1 lines. The MLPPP interface had mysteriously gone down/down and they couldn’t get it back. The member links were all up/down. Why they were connecting them this way was not a question an HTTS engineer was allowed to ask. We were just there to troubleshoot. As I was on the bridge, they were systematically taking each T1 out of the bundle and putting HDLC encapsulation on it, pinging across, and then putting it back into the MLPPP bundle. This bought me time to look over the case notes.

There were multiple RMA’s in the notes. They had RMA’d the line cards and the entire chassis. The 7513 they were shipped had problems and so they RMA’d it a second time. RMA’ing an entire 7513 chassis is a real pain. I perused the configs to see if authentication was configured on the PPP interface, but it wasn’t. It looked like a PPP problem (up/down state) but the interface config was plain vanilla MLPPP.

They finished testing all of the T1’s individually. One of the engineers said “I think we need another RMA.” I told them to hang on. “Take all of the links out of the bundle and give me an MLPPP bundle with one T1,” I said. “But we tested them all individually!” they replied. “Yes, but you tested them with HDLC. I want to test one link with multilink PPP on it.” They agreed. And with a single link it was still down/down. Now we were getting somewhere. I had them switch which link was the active one. Same problem. Now disable multilink and just run straight PPP on a single link. Same thing.

“Can you turn on debug ppp with all options?” I asked. They were worried about doing it on the 7513, but I convinced them to do it on the 7206. They sent me the logs, and this stood out:

AAA/AUTHOR/LCP: Denied

Authorization failed. But why? Nothing was configured under the interface, but I looked at the top of the config, where the AAA commands are, and saw this:

aaa authorization network default

And there it was. “Guys, could you remove this one line from the config?” I asked. They did. The single PPP link came up. “Let’s do this slowly. Add the single link back into multilink mode.” Up/up. “Now add all the links back.” It was working.

It turned out they had a project to standardize their configs across all their routers and accidentally added that line. They had RMA’d an entire 7513 chassis (twice!) for a single line of config. Replacing a 7513 is a lot of work. I still can’t believe it got that far.
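For anyone who wants to picture it, here is a rough sketch of what the relevant pieces of the config might have looked like. Only the offending aaa authorization line comes from the case; everything else (the interface numbers, the addressing, the multilink group, even the presence of aaa new-model) is assumed for the sake of the example, using the older-style MLPPP syntax of that era of IOS.

! Global AAA section at the top of the config.
! The authorization line below is the one that broke PPP; removing it brought the bundle back.
aaa new-model
aaa authorization network default

! One of the eight T1 member links (interface numbering is made up for the example)
interface Serial1/0
 no ip address
 encapsulation ppp
 ppp multilink
 multilink-group 1

! The bundle interface that was stuck down/down
interface Multilink1
 ip address 192.0.2.1 255.255.255.252
 ppp multilink
 multilink-group 1

With network authorization turned on, IOS asks AAA for permission before bringing up PPP network services on each link, and since nothing in their setup was there to grant it, authorization failed on every member link, which is exactly what the AAA/AUTHOR/LCP: Denied debug message was telling us.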

Some lessons from this story: first, RMAs don’t always fix the problem. Second, even good engineers make stupid mistakes. Third, when troubleshooting, always limit the scope of the problem. Troubleshoot as little as you can. And finally, even hard P1’s can turn out easy.

TAC Tales #1: Case routing

Before I worked at TAC, I was pretty careless about how I filled in a TAC case online. For example, when I had to select the technology I was dealing with in the drop-down menu, if I didn’t see exactly what I had then I would go ahead and pick something at random and figure TAC would sort it out. And then I would get frustrated when I didn’t get an answer on my case for hours. Working in TAC showed me why.

When you open a TAC case and pick a particular technology, your choice determines which queue the case is routed into. For example, if you pick Catalyst 6500, the case ends up in a queue which is being monitored by engineers who are experts on that platform. Under TAC rules (assuming it is a priority 3 case) the engineers have 20 minutes to pick up the case. If they don’t, it turns blue in their display and their duty manager starts asking questions. (In high touch TAC where I worked, we didn’t have too many blue cases, but in backbone TAC it wasn’t uncommon to see a ton of blue and even black (> 1hr) cases sitting in a busy queue.)

If the customer categorized his case wrong, this meant it was sitting in the wrong queue. Now an engineer had to notice the case, review it, determine where it should go, and “punt” it to the appropriate queue, at which point the counters were reset and the case started waiting all over again.

Imagine for a moment that you are an overworked TAC engineer with 30 minutes left to go on your shift. You are supposed to clear out your queue and take any cases before the next crew comes on (at least that was the expectation in HTTS). You don’t want to take any more cases, however. There is a case sitting in your queue which has turned blue, and your colleagues may not be happy to see it sitting there when they come on shift. Well, you’re an experienced TAC engineer and you know what to do: punt the case to another queue, even if it’s the wrong one. If you pick a busy queue, it will take at least 30 minutes for the engineers on that queue to see the “mis-queue” and punt the case back to your queue, at which point you are off shift and it becomes the problem of your colleagues on the next shift.

My recommendation is to be very careful to select the right menu options when you open a case online with any tech support organization. Make sure you route the case to the right place the first time so you don’t have to wait for engineers and managers to look at it and re-categorize it.