
Tag Archives: TAC

A common approach for TAC engineers and customers working on a tough case is to just “throw hardware at it.”  Sometimes this can be laziness:  why troubleshoot a complex problem when you can send an RMA, swap out a line card, and hope it works?  Other times it’s a legitimate step in a complex process of elimination.  RMA the card and if the problem still happens, well, you’ve eliminated the card as one source of the problem.

Hence, it was nothing unusual when, one day, I got a P1 case from a major service provider, requeued (reassigned) after multiple RMAs. The customer had a 12000-series GSR, top of the line back then, and was frustrated because ISIS wasn’t working.

“We just upgraded the GRP to a PRP to speed the router up,” he said, “but now it’s taking 4 hours for ISIS to converge.  Why did we spend all this money on a new route processor when it just slowed our box way down?!”

The GSR router is a chassis-type router, with multiple line cards with ports of different types, a fabric interconnecting them, and a management module (route processor, or RP) acting as the brains of the device.  The original RP was called a GRP, but Cisco had released an improved version called the PRP.

The GSR 12000-series

The customer seemed to think the new PRP had performance issues, but this didn’t make sense.  Performance issues might cause some small delays or possibly packet loss for packets destined to the RP, but not delays of four hours.  Something else was amiss.  I asked the customer to send me the ISIS database, and it was full of LSPs like this:

#sh isis database

IS-IS Level-2 Link State Database
LSPID                 LSP Seq Num  LSP Checksum  LSP Holdtime      ATT/P/OL
0651.8412.7001.00-00  0x00000000   0x0000        193               0/0/0

ISIS routers periodically send CSNPs, or Complete Sequence Number PDUs, which contain a list of all the link-state PDUs (LSPs) in the router’s database.  In this case, the GSR was directly attached to a Juniper router, which was its sole ISIS adjacency.  It was receiving the entire ISIS database from this router.  Normally an ISIS database entry looks like this:

#sh isis database

IS-IS Level-2 Link State Database
LSPID                 LSP Seq Num  LSP Checksum  LSP Holdtime      ATT/P/OL
bb1-sjc.00-00         0x0000041E   0xF97D        65365             0/0/0

Note that instead of a router ID, we actually have a router name.  Note also that we have a sequence number and a checksum for each LSP.  As the earlier output shows, something was wrong with the LSPs we were receiving.  Not only was the name not resolving, but the sequence number and checksum were zero.  How can we possibly have an LSP with no sequence number at all?

Even weirder was that as I refreshed the ISIS outputs, the LSPs started resolving, suddenly popping up with names and non-zero sequences and checksums.  I stayed on the phone with the customer for several hours, before finally every LSP was resolved, and the customer had full reachability.  “Don’t do anything to the router until I get back to you,” I said before hanging up.  If only he had listened.

I was about to pack up for the day when I got a call from our hotline.  The customer had called in and escalated to a P1 after reloading the router.  The entire link state database was zeroed out again, and the network was down.  He had only a short maintenance window in which to work, and now he had an outage.  It was 6pm.  I knew I wasn’t going home for a while.

Whatever was happening was well beyond my ISIS expertise.  Even in the routing protocols team, it was hard to find deep knowledge of ISIS.  I needed an expert, and Abe Martey, who sat across from me, literally wrote the book on ISIS.  The Cisco Press book, that is.  The only issue:  Abe had decided to take PTO that week.  Of course.  I pinged a protocols escalation engineer, one of our best BGP guys.  He didn’t want anything to do with it.  Finally I reached out to the duty manager and asked for help.  I also emailed our internal mailers for ISIS, but after 6pm I wasn’t too optimistic.

Why were we seeing what appeared to be invalid LSPs?  How could an LSP even have a zero checksum or sequence number?  Why did they seem to clear out, and why so slowly?  Did the upgrade to the PRP have anything to do with it?  Was it hardware?  A bug?  As a TAC engineer, you have to consider every single possibility, from A to Z.

The duty manager finally got Sanjeev, an “ISIS expert” from Australia, on the call.  The customer may not realize this while a case is being handled, but if it’s complex and high priority, there is often a flurry of instant messaging going on behind the scenes.  We had a chat room up, and as the “expert” listened to the description of the problem and looked at the notes, he typed in the window:  “This is way over my head.”  Great, so much for expertise.  Our conversation with the customer was getting heated as his frustration with the lack of progress escalated.  The so-called expert asked him to run a command, which another TAC engineer had suggested.

“Fantastic,” said the customer, “Sanjeev wants us to run a command.  Sanjeev, tell us, why do you want to run this command?  What’s it going to do?”

“Uh, I’m not sure,” said Sanjeev, “I’ll have to get back to you on that.”

Not a good answer.

By 8:30 PM we also had a senior routing protocols engineer in the chat window.  He seemed to think it was a hardware issue and was scraping the error counters on the line cards. The dedicated Advanced Services NCE for the account also signed on and was looking at the errors. It’s a painful feeling knowing you and the customer are stranded, but we honestly had no idea what to do.  Because the other end of the problem was a Juniper router, JTAC came on board as well.  We may have been competitors, but we were professionals and put it aside to best help the customer.

Looking at the chat transcript, which I saved, is painful.  One person suggests physically cleaning the fiber connection.  Another thinks it’s memory corruption.  Another believes it is packet corruption.  We schedule a circuit test with the customer to look for transmission errors.

All the while, the 0x0000 LSPs were re-populating with legitimate information, until, by 9pm, the ISIS database was fully converged and routing was working again.  “This time,” I said, “DO NOT touch the router.”  The customer agreed.  I headed home at 9:12pm, secretly hoping they would reload the router so the case would get requeued to night shift and taken off my hands.

In the morning we got on our scheduled update call with the customer.  I was tired, and not happy to make the call.  We had gotten nowhere in the night, and had not gotten helpful responses to our emails.  I wasn’t sure what I was going to say.  I was surprised to hear the customer in a chipper mood.  “I’m happy to report Juniper has reproduced the problem in their lab and identified the cause.”

There was a little bit of wounded pride knowing they found the fix before we did, but also a sense of relief to know I could close the case.

It turns out that the customer, around the same time they installed the PRP, had attempted to normalize the configs between the Juniper and Cisco devices.  They had mistakenly configured a timer called the “LSP pacing interval” on the Juniper side.  This controls the rate at which the Juniper box sends out LSPs.  They had thought they were configuring the same timer as the LSP refresh interval on the Cisco side, but they were two different things.  By cranking it way up, they ensured that the hundreds of LSPs in the database would trickle in, taking hours to converge.
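To illustrate the mix-up, here is a minimal sketch of the two knobs, using a hypothetical interface name and default-ish values rather than the customer’s actual configuration.  If memory serves, the Junos pacing knob is the interface-level lsp-interval statement, a gap in milliseconds between successive LSP transmissions, while the Cisco timer is a per-process value in seconds governing how often a router regenerates its own LSPs:

! Cisco IOS: how often this router refreshes its own LSPs (seconds)
router isis
 lsp-refresh-interval 900

# Junos: minimum gap between LSP transmissions on an interface (milliseconds)
set protocols isis interface ge-0/0/0.0 lsp-interval 100

Crank the second value up by a few orders of magnitude and a database of hundreds of LSPs will trickle out one at a time, which is exactly the hours-long convergence the customer saw.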

Why the 0x0000 entries then?  It turns out that in the initial exchange, the ISIS routers share with each other what LSPs they have, without sending the full LSP.  Thus, in Cisco ISIS databases, the 0x0000 entry acts as a placeholder until complete LSP data is received.  Normally this period is short and you don’t see the entry.  We probably would have found the person who knew that eventually, but we didn’t find him that night and our database of cases, newsgroup postings, and bugs turned up nothing to point us in the right direction.

I touched a couple thousand cases in my time at TAC, but this case I remember even 10 years later because of the seeming complexity, the simplicity of the resolution, the weirdness of the symptoms, and the distractors like the PRP upgrade.  Often a major outage sends you in a lot of directions and down many rat holes.  I don’t think we could have done much differently, since the config error was totally invisible to us.  Anyway, if Juniper and Cisco can work together to solve a customer issue, maybe we should have hope for world peace.

When you work at TAC, you are required to be “on-shift” for four hours each day.  This doesn’t mean you work only four hours a day, just that you are actively taking new cases for only four of them.  The other four (or more) hours you work on your existing backlog: calling customers, chasing down engineering for bug fixes, doing recreates, and, if you’re lucky, doing some training on the side.  While on shift, you would still work on the other stuff, but you were responsible for monitoring your “queue” and taking cases as they came in.

On our queue we generally liked to have four customer support engineers (CSEs) on shift at any time.  Occasionally we had more or fewer, but never fewer than two.  We didn’t like to run with two engineers for very long; if a P1 comes in, a CSE can be tied up for hours, unable to deal with the other cases coming in, and the odds are not low that more than one P1 arrives.  With every on-shift CSE tied up, it was up to the duty manager to start paging off-shift engineers as cases came in, never a good thing.  If you were ever on hold for a long time with a P1, there is a good chance the call center agent simply couldn’t find a CSE because they were all tied up.  Sometimes it was due to bad planning, sometimes a lack of staff.  Sometimes you would start a shift with five CSEs on the queue and they’d all get on P1s in the first five minutes.  The queue was always unpredictable.

At TAC, when you were on-shift, you could never be far from your desk.  You were expected to stay put, and if you had to get up to use the bathroom or go to the lab, you notified your fellow on-shift engineers so they knew you wouldn’t be available.  Since I preferred the 10am-2pm shift, in 2 years I took lunch away from my desk maybe 5 times.  Most days I told the other guys I was stepping out, ran to the cafeteria, and ran back to my desk to eat while taking cases.

Thus, I was quite happy one day when I had a later shift and my colleague Delvin showed up at my desk and asked if I wanted to go to a nice Chinese lunch.  Eddy Lau, one of our new CSEs, who had recently emigrated from China, had found an excellent and authentic restaurant.  We hopped into Eddy’s car and drove over to the restaurant, where we proceeded to have a two-hour-long Chinese feast.  “It’s so great to actually go to lunch,” I said to Delvin, “since I eat at my desk every day.”  Eddy was happy to help his new colleagues out.

As we were driving back, Delvin asked Eddy, “When are you on shift?”

“Right now,” said Eddy.

“You’re on shift now?!” Delvin asked incredulously.  “Dude, you can’t leave for two hours if you’re on shift.  Who are you on shift with?”

“Just me and Sarah,” Eddy said, not really comprehending the situation.

“You left Sarah on shift by herself?!” Delvin asked.  “What if a P1 comes in?  What if she gets swamped by cases?  You can’t leave someone alone on the queue!”

We hurried back to the office and pulled up WebMonitor, which showed not only active cases, but who had taken cases that shift and how many.  Sarah had taken a single case.  By some amazing stroke of luck, it had been a very quiet shift.

I walked by Eddy’s desk and he gave me a thumbs up and a big smile.  I figured he wouldn’t last long.  A couple months later, after blowing a case, Eddy got put on RMA duty and subsequently quit.

If you ever wonder why you had to wait so long on the phone, it could be a busy day.  Or it could be that your CSEs decided to take a long lunch without telling anyone.

When I was still a new engineer, a fellow customer support engineer (CSE) asked a favor of me. I’ll call him Andy.

“I’m going on PTO, could you cover a case for me? I’ve filed a bug, and while I’m gone there will be a conference call. Just jump on it and tell them that the bug has been filed and engineering is working on it.” The case was with one of our largest service provider clients. I won’t say which, but they were a household name.

When you’re new and want to make a good impression, you jump on chances like this. It was a simple request and would prove I’m a team player. Of course I accepted the case and went about my business with the conference call on my calendar for the next week.

Before I got on the call I took a brief look at the case notes and the DDTS (what Cisco calls a bug). Everything seemed to be in order. The bug was filed and in engineering’s hands. Nothing to do but hop on the call and report that the bug was filed and we were working on it.

I dialed the bridge and after I gave my name the automated conference bridge said “there are 20 other parties in the conference.” Uh oh. Why did they need so many?

After I joined, someone asked for introductions. As they went around the call, there were a few engineers, several VPs, and multiple senior directors. Double uh oh.

“Jeff is calling from Cisco,” the leader of the call said. “He is here to report on the P1 outage we had last week affecting multiple customers. I’m happy to tell you that Cisco has been working diligently on the problem and is here to report their findings and their solution. Cisco, take it away.”

I felt my heart in my throat. I cleared my voice, and sheepishly said: “Uh, we’ve, uh, filed a bug for your problem and, uh, engineering is looking into it.”

It was dead silence, followed by a VP chiming in: “That’s it?”

I was then chewed out thoroughly for not doing enough and wasting everyone’s time.

When Andy got back he grabbed the case back from me. “How’d the call go?” he asked.

I told him how it went horribly, how they were expecting more than I delivered, and how I took a beating for him.

Andy just smiled. Welcome to TAC.

My job as a customer support engineer (CSE) at TAC was the most quantified I’ve ever had.  Every aspect of our job performance was tracked and measured.  We live in the era of big data, and while numbers can be helpful, they can also mislead.  In TAC, there were many examples of that.

Take, for example, our customer satisfaction rating, known as a “bingo” score.  Every time a customer filled out a survey at the end of a TAC case, the engineer was notified and the bingo score recorded and averaged with all his previous scores.  While this would seem to be an effective measure of an engineer’s performance, it often wasn’t.

In TAC, we often ended up taking cases that were “requeues.”  These were cases previously worked by another engineer.  Imagine you get a requeue of a case that another CSE handled terribly.  You close it quickly, but the customer is still angry at the first CSE, so he gives a low bingo score.  That score counts against the CSE who closed the case, so even though you took care of it, you get stuck with the low number.

This also happened with create-to-close numbers.  We were measured on how quickly we closed cases.  Imagine another CSE has been sitting on a case for six months doing nothing.  The customer requeues it, and you end up with the case, closing it immediately.  You still end up with a six-month create-to-close number even though it wasn’t your fault.

Even worse, if you think about it, the create-to-close number discouraged engineers from taking hard cases.  Easy cases close quickly, but hard ones stay open while recreates are done and bugs are filed.  The engineers who took the hardest cases and were very skilled often had terrible create-to-close numbers.

The bottom line is that you need more than data to understand a person.  Most things in life don’t lend themselves to easy quantification.  Numbers always need to be in context.  Corporate managers are obsessed with quantification, and the Googles of the world are helping to drive our number-love even further.  Meanwhile, reducing people to numbers is a great way to treat them less humanely.

I’ve come back to Cisco recently, and I think I can say that I haven’t worked this hard since the last time I was at Cisco.  I remember my first manager at TAC telling me in an interview that “Cisco loves workaholics.”  In an attempt to get more organized, I’ve been taking a second crack at using OmniFocus and the GTD methodology.  To be honest, I haven’t had much luck with these systems in the past.  I usually end up entering a bunch of tasks into the system, and then quickly get behind on crossing them off.  I find that the tasks I really want to do, or need to do, I would do without the system, and the ones that I am putting off I keep putting off anyway.  I have so much to do now, however, that I need to track things more efficiently and I am hoping OmniFocus is the solution.

In this article in my “Ten Years a CCIE” series, I describe my experience going to work at Cisco as a CCIE.  Unlike many Cisco-employed CCIEs, I earned my certification outside of Cisco.

A CCIE leads to a job at Cisco

I returned to my old job at the Chronicle and had my business cards reprinted with my CCIE number. I loved handing it out, particularly at meetings with telephone companies and Internet service providers whose salespeople were likely to know what such a certification meant.  I remember one such sales person, duly impressed, saying “wow, on that test you can be forced to configure any feature on any Cisco product…I don’t know how anyone could prepare for that!”  (Uh, right.  See “The CCIE Mystique“.)

At that time, the most popular forum for aspiring CCIEs was an email distribution list called groupstudy.com. I had been a subscriber to this mailing list, but prior to passing my exam, I didn’t feel adequate to post anything there. However, once I passed, I began posting regularly, beginning with a summary of my test preparation process. One day I got an email from a mysterious CCIE who told me that I sounded like I knew what I was talking about, and asked me if I wanted to interview for a job. I thought his name and number sounded familiar, and when I got home I confirmed my suspicions by digging through my bookshelf. He was the author of one of my books on Catalyst Quality of Service.

A brutal interview

It turns out he was a manager at Cisco High Touch Technical Support, a group of TAC engineers who specialized in high profile customers. I scheduled an interview right away.

This interview was by far the most difficult I’ve had in my career. They brought me into a room with four CCIEs, two of them double CCIEs, all of them sharp. Each one of them had a different specialty. One of them was a security guy, another was an expert on multicast, another was an expert on switching. When it came to Kumar, the voice guy, I figured I was scot-free. After all, I didn’t claim to know anything about voice over IP. Kumar looked over my resume, and then he looked up at me. “I see you have ISDN on your resume,” Kumar said. And then he began to grill me on ISDN. Darn, I should have thought of that!  Thankfully, I was well prepared.

Despite one or two mistakes in the interview, I got hired on and began my new job as a customer support engineer at HTTS. My first few months were in a group called ESO, which supported large enterprises and was very focused on Catalyst switching.  I won’t go into the details of the job here, but you can see my many TAC tales if you are interested.

Cisco’s San Jose campus

Little purple stickers everywhere!

One thing I quickly noticed when I got to Cisco was that a lot of the people in my department had a nickel-sized purple dot on their ID badges and cubicle nameplates. I found out that these purple dots were actually stickers with the CCIE logo. Cisco employees who had their CCIEs stuck these purple dots on their badges and nameplates to show it off. Many of the CCIEs who had passed multiple exams placed multiple dots on their badges and nameplates. I wanted one quite badly. The problem was, sheets of these purple stickers were sent out only to the early CCIEs, and by the time I had passed, Cisco was no longer providing them. I suppose I could’ve had some printed, but instead I asked around looking for a CCIE generous enough to give me one of his dots. They were in scarce supply, however, and nobody was willing to part with one. It was just another way newer CCIEs were getting shortchanged.

The real CCIE logo

Even though the exam had now switched to the one-day format, you still didn’t meet too many CCIEs outside of Cisco. It was thus quite a shock when I went to Cisco and saw purple dots everywhere. It seemed like fully half of the people I was working with in my new job had CCIEs. And many of them had low numbers, in the 2000s and even in the 1000s. I was quite relieved to find that they all treated me with total respect; nobody ever challenged me on account of my one-day CCIE. Still, I always had (and always will have) a great deal of respect for those people who passed the test when it was a two-day test, before the cottage industry devoted to minting CCIEs had come into existence.

Customers challenge the CCIE

Customers were another story. I remember one BGP case in particular. I looked at the customer’s configuration and immediately realized that it was a simple matter of misconfiguration. I fired up my lab, reproduced his configuration, and proved to him that it was indeed a configuration error on his part. I wrote it all up in an email and proudly signed it with my CCIE number. Within a half an hour I got a call from the customer and one of his colleagues. They proceeded to grill me rapidly on BGP, asking me all sorts of questions that weren’t relevant to their case and stumping me several times. At that point I realized that when many people see you are a CCIE, they take it as a challenge. In some cases they have failed the test themselves, or they have met stupid CCIEs in the past, and they feel themselves to be on a mission to discredit all CCIEs. After that episode, I removed my CCIE number from my email signature. I had gained a feeling of self-importance after passing my exam, but working among so many people with the same certification, and dealing with such intelligent customers, I realized that the CCIE didn’t always carry the prestige I thought it did.  The mystique diminished even further.

Incidentally, I became friends with all of the guys who interviewed me, and I was on the interview team myself during my tenure at Cisco. One extremely sharp CCIE we hired told me our interview was so tough he had to “hit the bottle” afterwards. It was considered a rite of passage at TAC to go through a tough interview, but I have gotten a lot nicer in my interview style now, having been on the receiving end of a few grillings.

The value of a CCIE

One of the later posts in this series will examine the question of the value of a CCIE certification.  After all, this is one of the most common questions I see in forums dedicated to certification.  However, my experience getting hired into Cisco (the first time) has some lessons.

  • The immediate reason I got hired was my experience and my willingness to go out of my way on Groupstudy helping others earn their CCIE.  However, I would not have gotten the position without a CCIE, so clearly it proved its value there.
  • Once you are at Cisco, although people commonly display their stickers and plaques, having a CCIE certification will not necessarily distinguish you.
  • There are many CCIEs who have made a bad impression on others, whether because they are only book-knowledgeable or because they cheated.  Often people challenge you when you have a CCIE instead of respecting you.

In the next article in the series, Multiple CCIEs, Multiple Attempts, I describe passing the CCIE Security exam.  I talk about my experience suffering the agony of defeat for the first time, and how I eventually conquered that test.

In my first article in the “Ten Years a CCIE” series, I discuss the mystique of the CCIE certification which made me want to attempt the test.

Learning about the CCIE

My first vague awareness of the CCIE certification came in 1999, while I was a Master’s student in Telecommunications Management at Golden Gate University in San Francisco. A family friend, an instructor in the PhD program in telecommunications at the University of Idaho, was staying at my father’s house. I was excited to meet him because I was thinking of pursuing further graduate studies, but I was a bit surprised by his response when I told him of my ambitions to be a network engineer and of my coursework at GGU. He told me not to waste my time in a graduate program if I wanted to be a network engineer. “You should get a Cisco certification instead,” he said. “They’re gold.” A disappointing comment, seeing that I was in my second year of the Master’s program, but I kept it in mind and completed my degree.

New Year’s resolutions are made to be broken, and I haven’t been keeping up with my resolution to do more blog posts.  Now that I am back at Cisco, I am focusing on programmability and automation, and I do have a lot to say.  However, in honor of my return to Cisco, I thought I would post a new TAC Tales entry.  There is a moral to this story.

One day my boss came to me and said my team would be supporting the MWAM module in the Cat 6K.  I had done a lot of Cat 6K work at that point, but I had never even heard of an MWAM, and failed to see why cases on it would be sent to the routing protocols team.  My boss didn’t seem too concerned with my objections, and said, “Just go watch the VoD.”  VoD = Video on Demand.  So, I did.  I watched the VoD, and it started out by telling me how many processors were on the card and of what kind;  what types of buses were used to transmit data;  what kind of memory it had;  and how it interfaced with the Catalyst and its backplane.  Never did the video tell me what the card actually did.  I had no idea why one would buy an MWAM or what one would do with it.  I hoped a case wouldn’t come in on the card, and when one did, I immediately escalated to engineering because I had no idea what to do.  Fortunately I only ever had one case on the MWAM.  (And the fun thing about coming back to Cisco after 10 years is that I can go look up all these cases I remember and read my notes.  Very cool!)

What is the moral, you ask?  Well, as a Technical Marketing Engineer, a big part of my role is communicating technical concepts clearly to others.  How often have you bought a book or looked at a web page to learn some new protocol, only to find that the description begins with packet header formats or state machines?  Fine, but tell me what it actually does before you tell me how it works.  Imagine if you went out into the jungle and encountered someone who has never seen a car.  You wouldn’t start explaining it to him by saying “it uses an internal combustion engine which has a four-stroke cycle of intake, compression, power, exhaust.”  You’d say, “it has wheels and takes me places very fast.”  Now, in defense of the MWAM VoD guy, he may have designed his video for people who already knew what the card was.  But I have often found that people make this assumption, and when I backtrack and start at the beginning when explaining something, people say, “you know, I’ve always been afraid to ask about that, but thanks for explaining it.”

Meanwhile, my second stint at Cisco is much more fun than the last.  And thankfully, no MWAMs.  TAC was a great experience and a period of growth, but it’s not a fun job.

I feel a bit of guilt for letting this blog languish for a while. I can see from the response to my articles explaining confusing Juniper features that my work had some benefit outside my own edification, and so I hate to leave articles unfinished which might have been helpful. In addition, WordPress is not easy to maintain and I keep losing notifications of comments, which means that when I am not logging in, I miss the opportunity to respond to kind words and questions.

As it is, my work explaining Juniper to the masses will have to be put on hold, as I have left Juniper after six years and returned to my old employer, Cisco! I worked at Juniper longer than I had anywhere else, and it’s amazing to consider that I just closed the door on more than half a decade. But even after attaining my JNCIE, I always felt like a Cisco guy at heart, and so here I am again. A few random thoughts, then:

1. I interviewed for a number of jobs, and now that I am hired I can say that I really hate interviewing. My interviews at Cisco were very fair and reasonable. Just for the heck of it I did a phone screen with Google and completely bombed it. I’m not ashamed to admit that. I’m not supposed to reveal their questions, and I won’t, but they were mostly basic questions about TCP functionality, and MAC/ARP stuff, and it’s amazing how you forget some of the basics over the years. I wasn’t really interested in working there so I did no preparation, and in fact the recruiter warned me to brush up on basics. I just figured my work and blog show that I am at least somewhat technical. I plan to write some posts on the art of technical interviewing, but I was certainly underwhelmed by Google’s screening process, as I’m sure they were by my performance. I really wanted the Cisco job, and what a difference attitude makes! (Oh, and I completely munged an MPLS FRR/Node & Link protection question, less than a year after passing the JNCIE-SP. Uh, whoops.)

2. I bear Juniper no ill will. It was an interesting six years. When I came on board, during the Kevin Johnson years, it was all rah-rah pep talks about how we were going to be the next $10 billion company (errr, no…), followed by a plethora of product disasters. Killing off Netscreen gave the firewall market to Palo Alto and Fortinet, and amazingly resuscitated Checkpoint. Junos Space was a disaster, and Pulse slightly less so. QFabric was not a bad idea, but it was far too complex. You needed to buy a professional services contract with the product, because it was too complex to install by itself. And yet it supposedly simplified the data center? There was a fiasco with our load balancer product. And then came the activist investors with their Integrated Operating Plan. I will permanently loathe activist investors. Juniper was hurting and they just magnified the hurt. There’s nothing worse than a bunch of generic business types who wouldn’t know a router if they saw one trying to tell a router company how to do its business. They thought they could apply the same formula you learn in B-school to any company, no matter what it manufactures or does. Then we had the CEO revolving door.

Despite all of this, as I said, I like Juniper. I did ok there, and there are a lot of people I respect working there. Rami Rahim is a good choice for CEO. I left for personal reasons. They still have some good products and good ideas, and I think competition is always good for the marketplace. For the sake of my friends there, I hope Juniper does well.

3. If you read my bio, you will see that I was THE network architect for Juniper IT, meaning I covered everything. This included (in theory at least) campus LAN, WAN, data center, wireless, network security, etc. I did something in all of these spaces. It was a broad level of knowledge, but not deep. That’s why I did my JNCIE-SP: I was hungering to go deep on something. My new job at Cisco is principal technical strategy engineer for data center. This is an opportunity to go deep but not as broad, and I’m happy to be doing that. The data center space is where it’s at these days, and I can’t wait to get deeper into it.

4. Coming back to Cisco after an eight-year hiatus was bizarre. It was cool to pull up all my old bugs and postings to internal aliases to see what I was doing back then. Heck, I actually sounded like I knew a thing or two. I was thrilled to find out I am on the same team as Tim Stevenson, whose work as a Cat 6K TME I admired when I worked in TAC. Just for fun I walked through my old building and floor (K, floor 2) and nearly fell over when I saw that it looked identical. I mean, not only the cubes, but there were these giant signs for the different teams (e.g. “HTTS AT&T TEAM”) which were still hanging there as though the intervening eight years had never happened.

Unfortunately, I have to leave a few in progress articles in the dustbin. First, I shouldn’t really be promoting Juniper now that I am working for Cisco. And second, I’ve lost access to VMM, the internal Juniper tool I used to spin up VM versions of Juniper routers. However, I hope to start posting on Cisco topics now that I have access to that gear. Cisco’s products are generally better documented than Juniper’s, but I promise to fill any gaps I might find. And I will leave my previous articles up in hopes that they will benefit future engineers who struggle with Junos.

Onwards!

I don’t advertise this blog so I’m always amazed that people even find it. I figured the least-read articles on this blog were my “TAC Tales,” but someone recently commented that they wanted to see more… Well, I’m happy to oblige.

The recent events at United reminded me of a case where ground operations were down for one of the major airlines at Miami International Airport. It didn’t directly impact flight operations, but ticketing and baggage handling systems were down. Naturally, it was a P1, and so I dialed into the conference bridge.

This airline had four Cat 6500s acting as the core devices for the network. The four switches had vastly disparate configurations, both hardware and software. I seem to recall one of them was running a Supervisor 1 module, which was old even in 2007 when I took the case. Each of them was running a different software version.

EIGRP was acting funny. As a TAC engineer in the routing protocols team, I absolutely hated EIGRP. EIGRP Stuck-In-Active was my nightmare case. It was always such a pain to track down the source, and meanwhile you’d have peers resetting all over the place. OSPF doesn’t do that, nor does ISIS. I once got into a debate on an internal Cisco alias with some EIGRP guys. Granted, I had insulted their life’s work, but I stated that EIGRP was fast yet unreliable and prone to meltdown. Their retort was that properly designed EIGRP networks do not melt down. Great, but when are networks ever properly designed? They are so often slapped together haphazardly, grow organically, and need to be resilient even when unplanned. Of course, those of us in design and architecture positions do our best to build highly available networks, but you don’t want to be running a protocol that flips out when a route at some far end of the network disappears.
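For what it’s worth, the usual starting point when chasing a Stuck-In-Active event comes down to a couple of show commands; this is a rough sketch of the approach, not a full procedure:

#sh ip eigrp topology active
#sh ip eigrp neighbors

The first lists the routes stuck in Active along with the neighbors whose replies are still outstanding, so you repeat the exercise on those neighbors, hop by hop, until you find where the query died; the second tells you whether adjacencies are bouncing underneath you the whole time. Anyhow, back to Miami.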

The adjacencies on all four boxes were resetting constantly. It was totally unstable. Every five minutes or so, some manager from the airline would hop on the bridge to tell us that they were using handwritten tickets and baggage tags, that lines at the ticket counters were going out the door, etc, etc. Because that really helps me to concentrate. I tried to troubleshoot the way TAC engineers are trained to troubleshoot: collect logs, search for bugs in the relevant software, look for configuration issues. With routing adjacency flaps on switches, always check for STP issues. I couldn’t figure it out.

Finally some high-level engineer for the airline got on the phone and took over like a five-star general. He had his ops team systematically shut down and reset the switches, one at a time. The instability stopped. Wish I’d thought of that.

The standards for a routing protocol like OSPF are written by slow-moving committees, and hence don’t change much. These committees often have members from multiple competing vendors who disagree on exactly what should be done, and even when they do agree, nothing happens fast in IETF committees. Conversely, Cisco owns EIGRP, and they can change it as much as they want. Even their internal committees are nowhere near as bureaucratic as IETF. This means that there can be significant changes in the EIGRP code between IOS releases, much more so than for OSPF, and it is thus vital to keep code revisions amongst participating routers fairly close.

In this case, the consulting engineers for the airline helped them to standardize the hardware and software revisions. They never re-opened the case.