What’s with the cosmic rays?

I was reading a Reddit thread bashing Cisco today.  It’s a few months old.  Some of the commentary is fair enough.  Some of it might be a bit unfair.  But I do think all vendor execs should spend time reading customer Reddit threads to understand how they’re perceived and what they can do better.

That’s not the subject of this post.  Rather, I was amused to read an offhand comment about “‘Cosmic rays’ taking down VIP2s.”  The poster thought it was a “bullshit excuse” and I just had to laugh.  I worked in TAC from 2005-2007, and cosmic rays were very much a thing back then.

I worked on a team called “Routing Protocols and Large Scale Architectures”.  We were supposed to work on routing protocol cases, but this was not backbone TAC (WW-TAC); this was High-Touch Tech Support (HTTS).  We didn’t have all the specialized teams of WW-TAC, so the RP queue ended up becoming the dumping ground for anything that didn’t fit into a specific queue.  While I was on RP, maybe 20% of my cases were actual routing protocol issues.  One of the things that went to our queue, for no good reason, was crashes.

A customer’s router would reload and then come back up.  The “show version” output would tell you the reason why it reloaded.  Even if a router reloads, comes back up, and works fine, customers always want to know what happened.  Decoding crashes is a pain in the neck: it is a forensic exercise, an attempt to reconstruct the past.  TAC engineers prefer cases where the issue is still happening and they can troubleshoot it live.

Often the reloaded router (or line card) would helpfully provide a traceback, a list of the addresses of the functions on the stack at the time of the crash.  The top of the stack was the function that crashed, and the rest of the stack showed the functions that had been called leading up to it.  The traceback was a series of hexadecimal numbers that are meaningless to customers.  TAC engineers have tools that decode the hex into function names.  TAC engineers aren’t software developers, so that’s only slightly more helpful.  The next step was to plop the function names into Topic, our internal search tool, and look for bugs that were filed with the same stack, or perhaps another case with a similar traceback.  If you were lucky, there was a bug filed and fixed, you recommended the new code, and closed the case.
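Under the hood, that decoding step is just a lookup: each hex address in the traceback falls within the address range of some function in the image’s symbol table.  Here is a toy sketch of the idea in Python (the addresses and function names below are invented for illustration; the real tools worked from the actual IOS image symbols):

```python
import bisect

# Hypothetical symbol table: (start_address, function_name), sorted by address.
symbols = [
    (0x60012000, "ip_input"),
    (0x60012400, "ip_route_lookup"),
    (0x60012A00, "fib_walk"),
    (0x60013200, "malloc_block"),
]
starts = [addr for addr, _ in symbols]

def symbolize(pc: int) -> str:
    """Return the name of the function whose range contains address pc."""
    i = bisect.bisect_right(starts, pc) - 1
    return symbols[i][1] if i >= 0 else "??"

# A raw traceback: top of stack first (the function that crashed).
traceback = [0x60012A3C, 0x6001241C, 0x60012008]
print([symbolize(pc) for pc in traceback])
# ['fib_walk', 'ip_route_lookup', 'ip_input']
```

The decoded names were then the search keys for Topic.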

Most of the time, you were not lucky.  If you had good reason to believe this was a legitimate software bug, you could file one and try to get engineering to do the work.  Usually, the bug was either junked or marked “unreproducible.”

Sometimes we’d see crashes due to parity errors.  A parity error means, essentially, that something written to memory is not being read back correctly.  One explanation for this is bad memory.  Another?  Cosmic rays.
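The mechanics are simple enough to sketch: memory stores a parity bit alongside each word, and if any single bit later flips, whatever the cause, the recomputed parity no longer matches.  A minimal illustration in Python (the values are made up, and this is just the idea, not how router hardware implements it):

```python
def parity(word: int) -> int:
    """Return 1 if the word has an odd number of set bits, else 0."""
    p = 0
    while word:
        p ^= word & 1
        word >>= 1
    return p

# Store a word together with its parity bit.
stored_word = 0b10110100
stored_parity = parity(stored_word)   # four bits set -> 0

# A single bit flip (cosmic or mundane) changes the bit count...
corrupted = stored_word ^ 0b00001000  # flip bit 3

# ...so the check on read-back detects the mismatch.
assert parity(stored_word) == stored_parity   # clean read: OK
assert parity(corrupted) != stored_parity     # parity error!
```

Parity can tell the hardware that a bit flipped but not which one, so all the router can do is log the error or crash; ECC memory goes further and can correct single-bit flips.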

I didn’t say it was a good explanation.  But it was widely used by TAC engineers.  Depending on how credulous your customer was, it might work.  You see, often in TAC you end up in a situation where you have no explanation at all, but you want to give the customer something.  Often the customer will not accept “I don’t know” as an answer.  So, cosmic rays (or the related “sun spots”) was a TAC engineer’s attempt to close the case while providing some sort of explanation.

I remember when I first heard this.  I was in my cubicle in building K, a newly minted customer support engineer, staring at a parity error case.  Alex, one of the senior CSEs, was looking at the case.  “Tell them it was sun spots,” Alex said.

“Sun spots?!” I remember asking in disbelief.

Alex smiled.  “That’s what engineering says causes these kinds of errors.”

I’m not sure if Alex even believed what he was saying.  I’m not even sure engineering had actually said that.  Maybe at some point, somewhere, a hardware engineer had speculated on it.  But the story of cosmic rays circulated and became a legend among TAC engineers.  I myself never used it after a couple of attempts.  I found it hard to believe, and our large, high-touch customers received the explanation coldly.  Apparently, some of them still remember it and are bothered enough to post about it on Reddit.

P.S.  I asked ChatGPT if cosmic rays can cause parity errors in RAM.  It said “Yes, cosmic radiation can cause parity errors in RAM.”  So there you go, we were right all along.  Take that, Redditor!

TAC Tales #17: Escalations

When you open a TAC case, how exactly does the customer support engineer (CSE) figure out how to solve it?  After all, CSEs are not superhuman.  Like any engineering organization, TAC has a range from brilliant to not-so-brilliant, and everything in between.  Let me give an example:  I worked at HTTS, or high-touch TAC, serving customers who paid a premium for higher levels of support.  When a top engineer at AT&T or Verizon opened a case, how was it that I, who had never worked professionally in a service provider environment, was able to help them at all?  Usually when those guys opened a case, it was something quite complex, not a misconfigured route map!

TAC CSEs have an arsenal of tools at their disposal that customers, and even partners, do not.  One of the most powerful is well known to anyone who has ever worked in TAC:  Topic.  Topic is an internal search engine.  It can do more now, but at the time I was in TAC, Topic could search bugs, TAC cases, and internal mailers.  If you had a weird error message or were seeing inexplicable behavior, popping the message or symptoms into Topic frequently turned up a bug.  Failing that, it might pull up another TAC case, which would show the best troubleshooting steps to take.

Topic also searched the internal mailers, the email lists used within Cisco.  TAC agents, salespeople, TMEs, product managers, and engineering all exchanged emails on these mailers, which were then archived.  Oftentimes a problem would show up in the mailer archives, and engineering had already provided an answer.  Sometimes, if Topic failed, we would post the symptoms to the mailers in hopes that engineering, a TME, or any expert would have a suggestion.  I was always careful in doing so: if you posted something that had already been answered, or asked too often, flames would be coming your way.

TAC engineers have the ability to file bugs across the Cisco product portfolio.  This is, of course, a powerful way to get engineering attention.  Customer-found defects are taken very seriously, and any bug that is opened gets a development engineer (DE) assigned to it quickly.  We were judged on the quality of the bugs we filed, since TAC does not like to abuse the privilege and waste engineering time.  If a bug is filed for something that is not really a bug, it gets marked “J” for Junk, and you don’t want to have too many junked bugs.  That said, on one or two occasions, when I needed engineering help and the mailers weren’t working, I knowingly filed a Junk bug to get some help from engineering.  Fortunately, I also filed a few real bugs that got fixed.

My team was the “routing protocols” team for HTTS, but we were a dumping ground for all sorts of cases.  RP often got crash cases, cable modem problems, and other issues, even though these weren’t strictly RP.  Even within the technical limits of RP, there is a lot of variety among cases.  Someone who knows EIGRP cold may not have a clue about MPLS.  A lot of times, when stuck on a case, we’d go find the “guy who knows that” and ask for help.  When I worked at TAC, we had a number of cases on Asynchronous Transfer Mode (ATM), an old WAN protocol (more or less).  We had one guy who knew ATM, and his job was basically just to help with ATM cases.  He had a desk at the office but almost never came in, never worked a shift, and frankly I don’t know what he did all day.  But when an ATM case came in, day or night, he was on it, and I was glad we had him, since I knew little about the subject.

Some companies have NOCs with tier 1, 2, and 3 engineers, but we just had CSEs.  While we had different pay grades, TAC engineers were not tiered in HTTS.  “Take the case and get help” was the motto.  Backbone (non-HTTS) TAC had an escalation team, with some high-end CSEs who jumped in on the toughest cases.  HTTS did not, and while backbone TAC didn’t always like us pulling on their resources, at the end of the day we were all about killing cases, and a few times I had backbone escalation engineers up in my cube helping me.

The more heated a case gets, the higher the impact, and the longer the time to resolve, the more attention it gets.  TAC duty managers can pull in more CSEs, escalation, engineering, and others to help get a case resolved.  Occasionally, a P1 would come in at 6pm on a Friday and you’d feel really lonely.  But Cisco being Cisco, if it needs to put resources on an issue, there are a lot of talented and smart people available.

There’s nothing worse than the sinking feeling a CSE gets when realizing he or she has no clue what to do on a case.  When the Topic searches fail, when escalation engineers are stumped, when the customer is frustrated, you feel helpless.  But eventually, the problem is solved, the case is closed, and you move on to the next one.