I’ve mentioned before that EIGRP SIA was my nightmare case at TAC, but there was one other type of case that I hated: QoS problems. Routing protocol problems tend to be binary. Either the route is there or it isn’t; either the pings go through or they don’t. Even when a route is flapping, that’s just an extreme version of the binary problem. QoS is different. QoS cases often involved traffic that passed some of the time or in certain amounts, but ran into problems when packets of different sizes went through, or traffic that was dropping at a certain rate. Thus, the routes could be perfectly fine, pings could pass, and yet QoS was behaving incorrectly.
In TAC, we would regularly get cases where the customer claimed traffic was dropping on a QoS policy below the configured rate. For example, if they configured a policing profile of 1000 Mbps, sometimes the customer would claim the policer was dropping traffic at, say, 800 Mbps. The standard response for a TAC agent struggling to figure out a QoS policy issue like this was to say that the link was experiencing “microbursting.” If a link is showing an 800 Mbps traffic rate, this is actually an average rate, meaning the link could be experiencing short bursts that exceed the policing rate but are averaged out in the interface counters. “Microbursting” was a standard response to this problem for two reasons: first, it was most often the problem; second, it was an easy way to close the case without an extensive investigation. The second reason is not as lazy as it may sound, precisely because microbursts are so often the real cause of these symptoms.
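To make the averaging effect concrete, here is a minimal Python sketch. The sample values are hypothetical numbers I picked so that the average lands on 800 Mbps, and the comparison against the policing rate is a simplification: a real policer uses a token bucket with a configured burst size, which this does not model.

```python
POLICE_RATE_MBPS = 1000  # policer rate from the example above


def summarize(samples_mbps, police_rate_mbps=POLICE_RATE_MBPS):
    """Return (average rate, number of samples above the policing rate)."""
    average = sum(samples_mbps) / len(samples_mbps)
    over = sum(1 for s in samples_mbps if s > police_rate_mbps)
    return average, over


# Hypothetical traffic: ten 100 ms samples covering one second.
# Eight quiet intervals at 600 Mbps and two short bursts at 1600 Mbps.
samples = [600] * 8 + [1600] * 2

average, over = summarize(samples)
print(f"average: {average:.0f} Mbps, intervals above policer: {over}")
```

The one-second average stays under the policing rate even though two of the ten intervals exceed it, which is why the interface counters alone can't rule microbursts in or out.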
Thus, when one of our large service provider customers opened a case stating that their LLQ policy was dropping packets before the configured threshold, I was quick to suspect microbursts. However, working in high-touch TAC, you learn that your customers aren’t pushovers and don’t always accept the easy answer. In this case, the customer started pushing back, claiming that the call center which was connected to this circuit generated a constant stream of traffic and that he was not experiencing microbursts. So much for that.
This being the 2000s, the customer had four T1s connected in a single multi-link PPP (MLPPP) bundle. The LLQ policy was dropping traffic at one quarter of the threshold it was configured for. Knowing I wouldn’t get much out of a live production network, I reluctantly opened a lab case for the recreate, asking for two back-to-back routers with the same line cards, a four-link T1 interconnection, and a traffic generator. As always, I made sure my lab had exactly the same IOS release as the customer.
Once the lab was set up I started the traffic flowing, and much to my surprise, I saw traffic dropping at one quarter of the configured LLQ policy. Eureka! Anyone who has worked in TAC will tell you that more often than not, lab recreates fail to recreate the customer problem. I removed and re-applied the service policy, and the problem went away. Uh oh. The only thing worse than not recreating a problem is recreating it and then losing it again before developers get a chance to look at it.
I spent some time playing with the setup, trying to get the problem back. Finally, I reloaded the router to start over and, sure enough, I got the traffic loss again. So, the problem occurred at start-up, but when the policy was removed and re-applied, it corrected itself. I filed a bug and sent it to engineering.
Because it was so easy to recreate, it didn’t take long to find the answer. The customer was configuring their QoS policy using bandwidth percentages instead of absolute bandwidth numbers. This means that the policy bandwidth would be determined dynamically by the router based on the links it was applied to. It turns out that IOS was calculating the bandwidth numbers before the MLPPP bundle was fully up, and hence was using only a single T1 as the reference for the calculation instead of all four. That is exactly why the drops began at one quarter of the expected rate. The fix was to change the order of operations in IOS, so that the MLPPP bundle came up before the QoS policy was applied.
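The arithmetic behind the bug is easy to sketch in Python. This is an illustration only, not how IOS computes anything internally: the function name and the 50 percent figure are hypothetical (the customer's actual percentage isn't something I'm quoting here), and I'm assuming the nominal T1 rate of 1544 kbps.

```python
T1_KBPS = 1544  # nominal T1 line rate in kbps


def priority_rate_kbps(percent, links_up):
    """Percent-based rate, computed against however many T1s are up."""
    reference_kbps = T1_KBPS * links_up
    return reference_kbps * percent / 100


PERCENT = 50  # hypothetical priority percentage

intended = priority_rate_kbps(PERCENT, links_up=4)  # bundle fully up
actual = priority_rate_kbps(PERCENT, links_up=1)    # only one member link up

print(f"intended: {intended:.0f} kbps")
print(f"actual:   {actual:.0f} kbps")
print(f"ratio:    {actual / intended:.2f}")
```

Whatever percentage is configured, the ratio comes out to one quarter, because the percentage cancels and only the reference bandwidth differs. Removing and re-applying the policy once all four links were up redid the calculation against the full bundle, which is why the problem vanished in the lab.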
So much for microbursts. The moral(s) of the story? First, the most obvious cause is often not the cause at all. Second, determined customers are often right. And third: even intimidating QoS cases can have an easy fix.