I was reading a Reddit thread bashing Cisco today. It’s a few months old. Some of the commentary is fair enough. Some of it might be a bit unfair. But I do think all vendor execs should spend time reading customer Reddit threads to understand how they’re perceived and what they can do better.
That’s not the subject of this post. Rather, I was amused to read an offhand comment about “‘Cosmic rays’ taking down VIP2s.” The poster thought it was a “bullshit excuse” and I just had to laugh. I worked in TAC from 2005-2007, and cosmic rays were very much a thing back then.
I worked on a team called “Routing Protocols and Large Scale Architectures”. We were supposed to work on routing protocol cases, but this was not backbone TAC (WW-TAC), this was High-Touch Tech Support (HTTS). We didn’t have all the specialized teams of WW-TAC, so the RP queue ended up becoming the dumping ground for anything that didn’t fit into a specific queue. While I was on RP, maybe 20% of my cases were actual routing protocol issues. One of the things that went to our queue, for no good reason, was crash cases.
A customer’s router would reload and then come back up. The “show version” output would tell you the reason why it reloaded. Even if a router reloads and comes back up and is working fine, customers always want to know what happened. Decoding crashes is a pain in the neck, as it is a forensic exercise and an attempt to decode the past. TAC engineers prefer cases where the issue is still happening and they can troubleshoot it live.
Often the reloaded router (or line card) would helpfully provide a traceback, a list of the function pointers in the stack that had been executed up until the crash. The top of the stack was the function that had crashed, and the rest of the stack showed you the functions that had been called leading up to that. The traceback was a series of hexadecimal numbers that are meaningless to customers. TAC engineers have tools that will decode the hex and tell you what the function names are. TAC engineers aren’t software developers, so that’s only slightly more helpful. The next step was to plop the function names into Topic, our internal search tool, and look for bugs that were filed with the same function names, or perhaps another case with a similar traceback. If you were lucky, there was a bug filed and fixed, you recommended the new code, and closed the case.
Most of the time, you were not lucky. If you had good reason to believe this was a legitimate software bug, you could file one and try to get engineering to do the work. Usually, the bug was either junked or marked “unreproducible.”
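For readers who have never seen a traceback decoded, here is a rough sketch of the idea in Python. It is not the actual TAC tool, which I can't reproduce here. The symbol table, addresses, and function names below are all made up for illustration. The only assumption is that each hex value in the traceback falls somewhere inside a function whose start address is known.

```python
import bisect

# Hypothetical symbol table: (start address, function name), sorted by address.
# Real images have many thousands of entries; these are invented for the example.
SYMBOLS = [
    (0x60001000, "process_input"),
    (0x60001400, "route_lookup"),
    (0x60001A00, "update_table"),
    (0x60002000, "free_block"),
]
STARTS = [addr for addr, _ in SYMBOLS]


def decode(address):
    """Map one address to 'function+offset' using the nearest symbol at or below it."""
    i = bisect.bisect_right(STARTS, address) - 1
    if i < 0:
        return f"0x{address:X} (unknown)"
    start, name = SYMBOLS[i]
    return f"{name}+0x{address - start:X}"


def decode_traceback(line):
    """Decode a whitespace-separated list of hex addresses, top of stack first."""
    return [decode(int(token, 16)) for token in line.split()]


frames = decode_traceback("60002010 60001A44 60001408")
for frame in frames:
    print(frame)
```

The first name printed is the function that crashed, and the rest are its callers. Those names are what you would then go searching for.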
Sometimes we’d see crashes due to parity errors. A parity error means, essentially, that something which was put into memory is not being read back correctly. One explanation for this is bad memory. Another? Cosmic rays.
I didn’t say it was a good explanation. But it was widely used by TAC engineers. Depending on how credulous your customer was, it might work. You see, often in TAC you end up in a situation where you have no explanation at all, but you want to give the customer something. Often the customer will not accept “I don’t know” as an answer. So, cosmic rays (or the related “sun spots”) was a TAC engineer’s attempt to close the case while providing some sort of explanation.
I remember when I first heard this. I was in my cubicle in building K, a newly minted customer support engineer, staring at a parity error case. Alex, one of the senior CSEs, was looking at the case. “Tell them it was sun spots,” Alex said.
“Sun spots?!” I remember asking in disbelief.
Alex smiled. “That’s what engineering says causes these kinds of errors.”
I’m not sure if Alex even believed what he was saying. I’m not even sure engineering had actually said that. Maybe at some point, somewhere, a hardware engineer had speculated on it. But the story of cosmic rays circulated and became a legend among TAC engineers. I myself never used it after a couple of attempts. I found it hard to believe, and our large, high-touch customers received the explanation coldly. Apparently, some of them still remember it and are bothered enough to post about it on Reddit.
P.S. I asked ChatGPT if cosmic rays can cause parity errors in RAM. It said “Yes, cosmic radiation can cause parity errors in RAM.” So there you go, we were right all along. Take that, Redditor!
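P.P.S. For anyone wondering what a parity check actually does, here is a toy sketch in Python. It assumes simple even parity over a single byte, which is the textbook scheme and not necessarily what any particular router or line card used. The function names are mine. The idea is that an extra bit is stored alongside the data, and if one bit gets flipped in memory (by bad RAM or, yes, a stray particle), the stored parity no longer matches what is read back.

```python
def parity_bit(byte):
    """Even parity: return 1 if the byte has an odd number of set bits, else 0."""
    return bin(byte & 0xFF).count("1") % 2


def store(byte):
    """Model a memory write: keep the data byte together with its parity bit."""
    return (byte & 0xFF, parity_bit(byte))


def check(word):
    """Model a memory read: recompute parity and compare it with the stored bit."""
    data, parity = word
    return parity_bit(data) == parity


def flip_bit(word, position):
    """Simulate a single-bit upset in the data portion of the stored word."""
    data, parity = word
    return (data ^ (1 << position), parity)


stored = store(0b10110010)
single_upset = flip_bit(stored, 3)
double_upset = flip_bit(single_upset, 5)

print(check(stored))        # intact word passes the check
print(check(single_upset))  # one flipped bit is detected
print(check(double_upset))  # two flipped bits cancel out and slip through
```

Note that parity only tells you something went wrong. It can't tell you which bit flipped or why, which is part of the reason these cases were so unsatisfying to close.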