I’ve mentioned before that EIGRP SIA was my nightmare case at TAC, but there was one other type of case that I hated–QoS problems. Routing protocol problems tend to be binary. Either the route is there or it isn’t; either the pings go through or they don’t. Even when a route is flapping, that’s just an extreme version of the binary problem. QoS is different. QoS cases often involved traffic that passed fine at certain times or in certain amounts, but ran into trouble when the size or rate of the traffic changed, or traffic that simply dropped at some steady rate. Thus, the routes could be perfectly fine, pings could pass, and yet QoS was behaving incorrectly.

In TAC, we would regularly get cases where the customer claimed traffic was dropping on a QoS policy below the configured rate. For example, if they configured a policing profile of 1000 Mbps, sometimes the customer would claim the policer was dropping traffic at, say, 800 Mbps. The standard response for a TAC agent struggling to figure out a QoS policy issue like this was to say that the link was experiencing “microbursting.” If a link shows an 800 Mbps traffic rate, that is actually an average rate, meaning the link could be experiencing short bursts that exceed the policing rate but are averaged out in the interface counters. “Microbursting” was a standard response to this problem for two reasons: first, it was most often the problem; second, it was an easy way to close the case without an extensive investigation. The second reason is not as lazy as it may sound, as microbursts are common and are usually the cause of these symptoms.
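
For anyone who hasn’t configured one of these, the kind of policer involved looks roughly like the MQC sketch below. It is only a sketch: the policy name, interface, rate, and burst size are illustrative, not taken from any customer config.

! Token-bucket policer: cir is the long-term average rate in bps,
! bc is the burst allowance in bytes before the exceed action kicks in.
policy-map POLICE-1G
 class class-default
  police cir 1000000000 bc 12500000
   conform-action transmit
   exceed-action drop
!
interface GigabitEthernet0/1
 service-policy input POLICE-1G

The interface rate counters, by contrast, are decayed averages taken over the load interval (five minutes by default), while the token bucket reacts on a millisecond timescale. That mismatch is the whole microburst story: the counters can show 800 Mbps while individual bursts momentarily exceed the 1000 Mbps CIR and get dropped.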

Thus, when one of our large service provider customers opened a case stating that their LLQ policy was dropping packets before the configured threshold, I was quick to suspect microbursts. However, working in high-touch TAC, you learn that your customers aren’t pushovers and don’t always accept the easy answer. In this case, the customer pushed back, claiming that the call center connected to this circuit generated a constant stream of traffic and that he was not experiencing microbursts. So much for that.

This being the 2000s, the customer had four T1s connected in a single multilink PPP (MLPPP) bundle. The LLQ policy was dropping traffic at one quarter of the threshold it was configured for. Knowing I wouldn’t get much out of a live production network, I reluctantly opened a lab case for the recreate, asking for two routers connected back-to-back with the same line cards, a four-link T1 interconnection, and a traffic generator. As always, I made sure my lab ran exactly the same IOS release as the customer.

Once the lab was set up I started the traffic flowing, and much to my surprise, I saw traffic dropping at one quarter of the configured LLQ rate. Eureka! Anyone who has worked in TAC will tell you that more often than not, lab recreates fail to reproduce the customer problem. I removed and re-applied the service policy, and the problem went away. Uh oh. The only thing worse than not recreating a problem is recreating it and then losing it again before the developers get a chance to look at it.

I spent some time playing with the setup, trying to get the problem back.  Finally, I reloaded the router to start over and, sure enough, I got the traffic loss again.  So, the problem occurred at start-up, but when the policy was removed and re-applied, it corrected itself.  I filed a bug and sent it to engineering.

Because it was so easy to recreate, it didn’t take long to find the answer. The customer was configuring their QoS policy using bandwidth percentages instead of absolute bandwidth values. This meant that the policy bandwidth was determined dynamically by the router, based on the links the policy was applied to. It turned out that IOS was calculating the bandwidth numbers before the MLPPP bundle was fully up, and hence was using only a single T1 as the reference for the calculation instead of all four. The fix was to change the order of operations in IOS so that the MLPPP bundle came up before the QoS policy was applied.
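
For reference, a percent-based LLQ policy on a bundle looks something like the sketch below. The class names, the address, and the 25 percent figure are mine, purely for illustration.

class-map match-all VOICE
 match ip dscp ef
!
! priority percent reserves the LLQ as a percentage of the bandwidth
! of whatever interface the policy ends up attached to.
policy-map LLQ-EDGE
 class VOICE
  priority percent 25
 class class-default
  fair-queue
!
interface Multilink1
 ip address 192.0.2.1 255.255.255.252
 ppp multilink
 service-policy output LLQ-EDGE

A T1 is 1544 kbps, so on a healthy four-link bundle the percentage should be taken against roughly 6176 kbps; if the policy is evaluated while only one member link is up, the same percentage is taken against a single 1544 kbps T1, which is exactly the one-quarter behavior we were seeing.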

So much for microbursts.  The moral(s) of the story?  First, the most obvious cause is often not the cause at all.  Second, determined customers are often right.  And third:  even intimidating QoS cases can have an easy fix.

The case came in P1, and I knew it would be a bad one. One thing you learn as a TAC engineer is that P1 cases are often the easiest. A router is down, send an RMA. But I knew this P1 would be tough because it had been requeued three times. The last engineer who had it was good, very good. And it wasn’t solved. Our hotline gave me a bridge number and I dialed in.

The customer explained to me that he had a 7513 and a 7206 with a multilink PPP bundle between them made up of 8 T1 lines. The MLPPP interface had mysteriously gone down/down and they couldn’t get it back. The member links were all up/down. Why they were connecting them this way was not a question an HTTS engineer was allowed to ask. We were just there to troubleshoot. While I was on the bridge, they were systematically taking each T1 out of the bundle, putting HDLC encapsulation on it, pinging across, and then putting it back into the MLPPP bundle. This bought me time to look over the case notes.

There were multiple RMAs in the notes. They had RMA’d the line cards and the entire chassis. The 7513 they were shipped had problems of its own, so they RMA’d it a second time. RMA’ing an entire 7513 chassis is a real pain. I perused the configs to see if authentication was configured on the PPP interface, but it wasn’t. It looked like a PPP problem (up/down state), but the interface config was plain-vanilla MLPPP.

They finished testing all of the T1s individually. One of the engineers said, “I think we need another RMA.” I told them to hang on. “Take all of the links out of the bundle and give me an MLPPP bundle with one T1,” I said. “But we tested them all individually!” they replied. “Yes, but you tested them with HDLC. I want to test one link with multilink PPP on it.” They agreed. And with a single link, the bundle was still down/down. Now we were getting somewhere. I had them switch which link was the active one. Same problem. Then I had them disable multilink and run straight PPP on a single link. Same thing.
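
If you want to picture what that isolation step looked like, a one-link bundle is roughly the following. The interface numbering is illustrative, and on IOS releases of that era the member-link command may be multilink-group rather than ppp multilink group.

! Bundle interface, now with a single member link.
interface Multilink1
 ip address 192.0.2.5 255.255.255.252
 ppp multilink
!
! The lone T1 member; the other seven stay out of the bundle.
interface Serial1/0
 no ip address
 encapsulation ppp
 ppp multilink
 ppp multilink group 1

Testing this way exercises PPP and multilink negotiation one link at a time, which is exactly what the HDLC ping tests never touched.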

“Can you turn on debug ppp with all options?” I asked. They were worried about doing it on the 7513, but I convinced them to do it on the 7206. They sent me the logs, and this stood out:

AAA/AUTHOR/LCP: Denied

Authorization failed. But why? Nothing was configured under the interface, but I looked at the top of the config, where the AAA commands are, and saw this:

aaa authorization network default

And there it was. “Guys, could you remove this one line from the config?” I asked. They did. The single PPP link came up. “Let’s do this slowly. Add the single link back into multilink mode.” Up/up. “Now add all the links back.” It was working.
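
If I remember the mechanics correctly, a default network authorization list makes IOS ask AAA for permission during PPP’s LCP and NCP negotiation; with no authenticated username and no method willing to say yes, the request is denied and the link never comes up, which matches the debug output above. If you genuinely need network authorization somewhere, the safer pattern is a named method list applied only to the interfaces that use it. The sketch below is generic, not the customer’s configuration; the list name, server group, and interface are made up.

aaa new-model
! A named list instead of "default", so PPP interfaces that never
! reference it are left alone.
aaa authorization network PPP-AUTHZ group radius
!
interface Serial2/0
 encapsulation ppp
 ppp authorization PPP-AUTHZ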

It turns out they had a project to standardize their configs across all their routers and accidentally added that line. They had RMA’d an entire 7513 chassis–twice!–for a single line of config. Replacing a 7513 is a lot of work. I still can’t believe it got that far.

Some lessons from this story: first, RMAs don’t always fix the problem. Second, even good engineers make stupid mistakes. Third, when troubleshooting, always limit the scope of the problem. Troubleshoot as little as you can. And finally, even hard P1s can turn out to be easy.