Skip navigation

Tag Archives: TAC

The case came in P1, and I knew it would be a bad one. One thing you learn as a TAC engineer is that P1 cases are often the easiest. A router is down, send an RMA. But I knew this P1 would be tough because it had been requeued three times. The last engineer who had it was good, very good. And it wasn’t solved. Our hotline gave me a bridge number and I dialed in.

The customer explained to me that he had a 7513 and a 7206, and they had a multilink PPP bundle between them with 8 T1 lines. The MLPPP interface had mysteriously gone down/down and they couldn’t get it back. The member links were all up/down. Why they were connecting them this way was not a question an HTTS engineer was allowed to ask. We were just there to troubleshoot. As I was on the bridge, they were systematically taking each T1 out of the bundle and putting HDLC encapsulation on it, pinging across, and then putting it back into the MLPPP bundle. This bought me time to look over the case notes.

There were multiple RMA’s in the notes. They had RMA’d the line cards and the entire chassis. The 7513 they were shipped had problems and so they RMA’d it a second time. RMA’ing an entire 7513 chassis is a real pain. I perused the configs to see if authentication was configured on the PPP interface, but it wasn’t. It looked like a PPP problem (up/down state) but the interface config was plain MLPPP vanilla.

They finished testing all of the T1’s individually. One of the engineers said “I think we need another RMA.” I told them to hang on. “Take all of the links out of the bundle and give me an MLPPP bundle with one T1,” I said. “But we tested them all individually!” they replied. “Yes, but you tested them with HDLC. I want to test one link with multilink PPP on it.” They agreed. And with a single link it was still down/down. Now we were getting somewhere. I had them switch which link was the active one. Same problem. Now disable multilink and just run straight PPP on a single link. Same thing.

“Can you turn on debug ppp with all options?” I asked. They were worried about doing it on the 7513, but I convinced them to do it on the 7206. They sent me the logs, and this stood out:

AAA/AUTHOR/LCP: Denied

Authorization failed. But why? Nothing was configured under the interface, but I looked at the top of the config, where the AAA commands are, and saw this:

aaa authorization network default

And there it was. “Guys, could you remove this one line from the config?” I asked. They did. The single PPP link came up. “Let’s do this slowly. Add the single link back into multilink mode.” Up/up. “Now add all the links back.” It was working.

It turns out they had a project to standardize their configs across all their routers and accidentally added that line. They had RMA’d an entire 7513 chassis–twice!–for a single line of config. Replacing a 7513 is a lot of work. I still can’t believe it got that far.

Some lessons from this story: first, RMAs don’t always fix the problem. Second, even good engineers make stupid mistakes. Third, when troubleshooting, always limit the scope of the problem. Troubleshoot as little as you can. And finally, even hard P1’s can turn out easy.

Before I worked at TAC, I was pretty careless about how I filled in a TAC case online. For example, when I had to select the technology I was dealing with in the drop-down menu, if I didn’t see exactly what I had then I would go ahead and pick something at random and figure TAC would sort it out. And then I would get frustrated when I didn’t get an answer on my case for hours. Working in TAC showed me why.

When you open a TAC case, and you pick a particular technology, your choice determines into which queue the case is routed. For example, if you pick Catalyst 6500, the case ends up in a queue which is being monitored by engineers who are experts on that platform. Under TAC rules (assuming it is a priority 3 case) the engineers have 20 minutes to pick up the case. If they don’t, it turns blue in their display and their duty manager starts asking questions. (In high touch TAC where I worked, we didn’t have too many blue cases, but in backbone TAC it wasn’t uncommon to see a ton of blue and even black (> 1hr) cases sitting in a busy queue.)

If the customer categorized his case wrong, this meant it was sitting in the wrong queue. Now an engineer had to notice his case, review it, determine where it should go, and “punt” it to the appropriate queue, at which point the counters are reset and the case is sitting again.

Imagine for a moment that you are an overworked TAC engineer with 30 minutes left to go on your shift. You are supposed to clear out your queue and take any cases before the next crew comes on (at least we were in HTTS). You don’t want to take any more cases, however. There is a case sitting in your queue which has turned blue and your colleagues may not be happy to see it sitting there when they come on shift. Well, you’re an experienced TAC engineer and you know what to do: punt the case to another queue, even if it’s the wrong one. If you pick a busy queue, it will take at least 30 minutes for the engineers on that queue to see the “mis-queue” and punt the case back to your queue, at which point you are off shift and it becomes the problem of your colleagues on the next shift.

My recommendation is to be very careful to select the right menu options when you open a case online with any tech support organization. Make sure you route the case to the right place the first time so you don’t have to wait for engineers and managers to look at it and re-categorize it.