No customer is happy if they have to reboot one of their Internet-facing routers periodically, and this was one of our biggest customers. (At HTTS, they were all big customers.) This customer had a GSR connecting to the Internet, with partial BGP routes, and he kept getting this error:
%RP-3-ENCAP: Failure to allocate encap table entry, exceeded max number of entries, slot 2
Eventually the router would stop passing traffic, and when this happened he had to reload it. Needless to say, he wasn’t happy.
The error came with a traceback, which shows what functions the code was executing when the error was generated. The last function was this:
arp_background(0x5053d290)+0x140
Well, this was obviously some sort of ARP issue. But why was ARP causing the router to stop forwarding traffic?
Looking up the error, I found that it meant the route processor was unable to allocate a rewrite entry for the slot 2 line card. As a packet leaves the fabric of a large router like the GSR, its headers are rewritten with the destination layer 2 info. The rewrite table used for this was full. I had the customer run a hidden command a few times, and we could see the table entries incrementing quickly:
Adjacency Table has 3167 adjacencies
Adjacency Table has 3291 adjacencies
Adjacency Table has 3322 adjacencies
Adjacency Table has 3410 adjacencies
Scrolling through the config, I looked for something that could be the culprit. Then I saw it. I remembered a router architecture course I had to take when I first became a TAC agent. One of the escalation engineers told the story of his first P1 case. It was a router that kept needing a reload. He went to another senior escalation engineer, and after looking at the config she said to him, “What are you, a f*cking idiot?” He was quite shocked to be addressed in this manner. “There is a static route pointed to a broadcast interface!” she yelled, and then proceeded to chew him out for wasting her time. This lady was famous in TAC for using bad language in nearly every sentence, and our trainer was able to laugh about it in retrospect. “Now that I know her, I don’t even care when she talks to me like that,” he reported.
Well, I wasn’t going to be called anything like that. I looked in the config and found this:
ip route 0.0.0.0 0.0.0.0 GigabitEthernet2/0 100
A default route, pointed out a broadcast interface. With partial BGP routes, this meant that the router was generating an ARP entry for every single destination address on the Internet that was not in the partial BGP table. Whoops. There are millions of destinations on the Internet, so it’s no surprise he was exhausting the capacity of the rewrite table on his line card.
He removed the route and replaced it with a static route to the next hop. The adjacency table immediately dropped below 100. Problem solved.
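The difference between the two routes can be sketched with a toy model. This is only an illustration of the reasoning, not how IOS or CEF is actually implemented: the function name, the destination addresses, and the next hop 192.0.2.1 are all hypothetical. The assumption is simply that a route pointing at a broadcast interface makes the router resolve each destination individually, while a route pointing at a next hop lets every destination share one layer 2 adjacency.

```python
from ipaddress import IPv4Address


def adjacencies_needed(destinations, next_hop=None):
    """Return the set of addresses the router must resolve at layer 2.

    Toy model (an assumption, not real IOS behavior in detail):
    - With a next hop, every destination matching the route shares the
      single adjacency for that next hop.
    - With only an outgoing broadcast interface, each destination is
      treated as directly connected and is ARPed for individually.
    """
    if next_hop is not None:
        return {IPv4Address(next_hop)} if destinations else set()
    return {IPv4Address(d) for d in destinations}


# Hypothetical traffic: 5000 distinct destinations that fall through
# to the default route because they are not in the partial BGP table.
base = IPv4Address("198.18.0.0")
destinations = [str(base + i) for i in range(5000)]

via_interface = adjacencies_needed(destinations)
via_next_hop = adjacencies_needed(destinations, next_hop="192.0.2.1")

print(len(via_interface))  # one adjacency per destination: 5000
print(len(via_next_hop))   # one shared adjacency: 1
```

In the model, the interface route grows the table with every new destination seen, which matches the steadily climbing adjacency counts above, while the next-hop route stays at a single entry no matter how much traffic arrives.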
Some TAC cases were mind-bogglingly difficult, involving multiple layers of help from engineering, hours in the lab, and major frustration. Some, like this one, were major problems with major customers that ended quickly and easily. I closed the case with this note:
Customer was seeing RP-3-ENCAP error messages on one of his GSR LC’s. The card would eventually stop passing traffic, requiring reload of the router. Customer had a static default route to the Internet pointed out a broadcast interface–this was causing the router to ARP out that interface and create CEF adjacencies for each destination on the Internet. This was overloading the rewrite table on the LC. Customer removed static route, pointed to next hop address instead. Rewrite table entries went back to normal.