Everyone who’s worked in TAC can tell you their nightmare case: the type of case that, when they see it in the queue, makes them want to run away, take an unexpected lunch break, and hope some other engineer grabs it. The nightmare case is the case you know you’ll get stuck on for hours, on a conference bridge, escalating to other engineers, trying to find a solution to an impossible problem. For some it’s unexplained packet loss. For others, it’s multicast. For me, it was EIGRP Stuck-in-Active (SIA).
Some customer support engineers (CSEs) thought SIA cases were easy. Not me. A number of times I had a network in total meltdown due to SIA with no clue as to where the problem was. Often the solution required a significant redesign of the network.
As a review, EIGRP is more-or-less a distance-vector routing protocol, which uses an algorithm called DUAL to achieve better performance than a traditional DV protocol like RIP. I don’t want to get into all the fun CCIE questions on the protocol details, but what matters for this article is how querying works. When an EIGRP router loses a route, it sets the route as “Active” and then queries its neighbors as to where the route went. Then, if the neighbors don’t have it, they set it active and query their neighbors. If those neighbors don’t have the route either, they of course mark it active and query their neighbors. And so forth.
It should be obvious from this process that in a large network, the queries can multiply quite quickly. If a router has a lot of neighbors, and its neighbors have a lot of neighbors, the queries multiply exponentially, and can get out of control. Meanwhile, when a router sets a route active, it starts a timer. If it doesn’t get a reply before the timer expires, the router marks the route “Stuck In Active” and resets the entire EIGRP adjacency with the neighbor that failed to reply. In a large network with a lot of neighbors, even if the route is present somewhere, the time lag between sending a query and getting a response can be so long that the adjacency gets reset before the response makes it back to the original querying router.
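To make the fan-out concrete, here is a minimal Python sketch of the query flood described above. It is a toy model, not how IOS implements DUAL: it assumes a loop-free tree in which no router has an alternate path, a uniform per-hop delay, and a 180-second active timer (a commonly cited default, but configurable, so treat the number as an assumption). The names `build_tree`, `flood_query`, and `ACTIVE_TIMER_S` are made up for this illustration.

```python
ACTIVE_TIMER_S = 180  # assumed active timer in seconds; configurable on real routers


def build_tree(fanout, depth):
    """Build a loop-free topology: every router has `fanout` downstream neighbors."""
    adj = {0: []}
    frontier = [0]
    next_id = 1
    for _ in range(depth):
        new_frontier = []
        for node in frontier:
            for _ in range(fanout):
                adj[node].append(next_id)   # parent -> child
                adj[next_id] = [node]       # child -> parent
                new_frontier.append(next_id)
                next_id += 1
        frontier = new_frontier
    return adj


def flood_query(adj, origin, hop_delay_s):
    """Flood a query from `origin`, assuming nobody has an alternate route.

    Each router queries every neighbor except the one it heard the query from.
    Returns (queries sent at each hop, seconds until the origin has its replies).
    Only valid for loop-free topologies such as the one build_tree() returns.
    """
    sent_per_hop = []
    frontier = [(origin, None)]
    while frontier:
        next_frontier = []
        for node, heard_from in frontier:
            for nbr in adj[node]:
                if nbr != heard_from:
                    next_frontier.append((nbr, node))
        if next_frontier:
            sent_per_hop.append(len(next_frontier))
        frontier = next_frontier
    # A router cannot reply until its own queries are answered, so the origin
    # waits for the deepest branch: one query and one reply per hop.
    round_trip_s = 2 * hop_delay_s * len(sent_per_hop)
    return sent_per_hop, round_trip_s


if __name__ == "__main__":
    topology = build_tree(fanout=3, depth=4)
    sent, wait_s = flood_query(topology, origin=0, hop_delay_s=25)
    print("queries per hop:", sent)
    print("origin waits", wait_s, "s; SIA:", wait_s > ACTIVE_TIMER_S)
```

With three neighbors per router and four hops of depth, the query count triples at every hop, and a slow or lossy path (25 seconds per hop in this made-up run) pushes the wait past the assumed timer even though every router eventually answers.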
I’ve glossed over some of the details here, since obviously an EIGRP router can lose a route entirely without going SIA. For details, see this article. The main point to remember is that the SIA route happens when the querying router just doesn’t get a response back.
Back in my TAC days, I of course wasn’t happy to see an SIA drop in the queue. I waited to see if one of my colleagues would take the case and alleviate the burden, but the case turned blue after 20 minutes, meaning someone had to take it. Darn.
Now I can show my age, because the customer had adjacencies resetting on Token Ring interfaces. I asked the customer for a topology diagram, some debugs, and to check whether there was packet loss across the network. Sometimes, if packets are getting dropped, the query responses don’t make it back to the original router, causing SIA. The logs from the resets looked like this:
rtr1 - 172.16.109.118 - TokenRing1/0
Sep 1 16:58:06: %DUAL-3-SIA: Route 172.16.161.58/32 stuck-in-active state in IP-EIGRP(0) 55555. Cleaning up
Sep 1 16:58:06: %DUAL-5-NBRCHANGE: IP-EIGRP(0) 55555: Neighbor 172.16.109.124 (TokenRing1/0) is down: stuck in active
Sep 1 16:58:07: %DUAL-5-NBRCHANGE: IP-EIGRP(0) 55555: Neighbor 172.16.109.124 (TokenRing1/0) is up: new adjacency
This is typical of SIA. The adjacency flapped, but the logs showed no particular reason why.
I thought back to my first troubleshooting experience as a network engineer. I had brought up a new branch office, but it couldn’t talk back to HQ. Mike, my friend and mentor, showed up and started pinging hop-by-hop until he found a missing route. “That’s how I learned it,” he said, “just go one hop at a time.” The big clue I had in the SIA case was the missing route: 172.16.161.58/32. I started tracing it back, hop-by-hop.
I found that the route originated from a router on the edge of the customer network, which had an ISDN PRI connected. (Showing my age again!) They had a number of smaller offices that would dial into the ISDN on-demand, and then drop off. ISDN had per-minute charges and thus, in this pre-VPN era, it was common to set up ISDN in on-demand mode. ISDN was a digital dial-up technology with very short call setup times. I discovered that, as these calls were going up and down, the router was generating /32 peer routes for the dial-in neighbors and injecting them into EIGRP. They had a poorly designed network with a huge query domain, and so as these dial peers were going up and down, routers on the opposite side of the network were going active on the route and not getting responses back.
They were advertising a /16 for the entire 172.16.x.x network, so sending a /32 per dial peer was totally unnecessary. I recommended they enable “no peer neighbor-route” on the PRI to suppress the /32s, and the SIAs went away.
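The reasoning that the /32s were redundant can be checked with a few lines of Python. This is only a sketch: it uses the /16 and the /32 from this case, plus one made-up prefix (10.1.1.1/32) for contrast, and the helper name `covered_by_summary` is hypothetical.

```python
import ipaddress

# The /16 the customer was already advertising for the whole 172.16.x.x network.
summary = ipaddress.ip_network("172.16.0.0/16")

routes = [
    ipaddress.ip_network("172.16.161.58/32"),  # the dial peer route from the SIA log
    ipaddress.ip_network("10.1.1.1/32"),       # made-up prefix outside the summary
]


def covered_by_summary(summary, routes):
    """Return the more-specific routes that already fall inside the summary."""
    return [r for r in routes if r != summary and r.subnet_of(summary)]


if __name__ == "__main__":
    for route in covered_by_summary(summary, routes):
        print(route, "is already reachable via", summary)
```

Any route this returns adds no reachability beyond the summary, but in the network described here it still triggered a query every time the call dropped.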
I hate to bite the hand that feeds me, but even though I work at Cisco I can say I really never liked EIGRP. EIGRP is fast, and if the network is designed well, it works fine. However, networks often grow organically, and the larger the query domain, the more unstable EIGRP becomes. I’ve never seen this sort of problem with OSPF or IS-IS. Fortunately, this case ended up being much less problematic than I expected, but often these cases were far nastier. Oftentimes it was nearly impossible to find the route causing the problem and why it was going crazy. Anyhow, it’s always good to relive a case with both Token Ring and ISDN for a double case of nostalgia.
Jeff, thanks for sharing your TAC experiences. Each tale in the series resonates deeply with this CSE!
Thanks Olivier! Glad you enjoy them. I’m running out of ideas though, I only spent two years in TAC 🙂