All posts tagged t1

The first company where I worked as a “systems administrator” had no Internet connectivity at all when I started.  By the time I left, I had installed an analog phone line which was shared amongst several users with modems for dial-up service.  The connectivity options in 1995 were limited, and very expensive.  Our company operated on a shoestring budget and could not afford the costly dedicated service offerings from our ISP.

When I moved to a “consulting” company, I finally had the opportunity to work with real dedicated Internet service.  For the customers I worked with, we had two main options:  ISDN and T1 lines.

ISDN stood for Integrated Services Digital Network.  It came in two major flavors, but we exclusively used the lower-end Basic Rate Interface (BRI).  ISDN was a digital phone line, and like its analog counterpart it required dialing a phone number.  The BRI had two data channels of 64 Kbps each, but we usually bonded them together for a combined 128 Kbps.  At the time, this was quite speedy, more than double the speed of the modems we had, and our smaller customers loved ISDN.  Because it was a dial-up technology, however, per-minute rates applied.  This meant the line would time out and disconnect periodically to save costs.  When the line was down and an outbound packet arrived at the router with the ISDN interface, the router would dial up the ISP again.  Call setup was much faster than with analog modems, but it still added latency, which was annoying.

I don’t know if other regions did it, but we had a local hack to get around this.  Understanding the hack requires a little background on the phone systems of the time.  Two kinds of business phone systems were typically in use:  PBXs and key systems.  With a key system, every phone had an extension number, but no direct dial into it.  If you wanted to speak to the person at extension 302, you dialed the main phone number for the business, and either asked the receptionist to connect you, or else an automated system did it for you.  For outbound dialing, the user would either lift the handset and select an unused line from the pool of lines available, or perhaps dial 9 for the key system to connect them to the next available line.  PBXs, on the other hand, were used by large companies, gave each user their own phone line and allowed inter-office extension-to-extension calling, as well as direct dial from the outside world.  If my extension was 3202, I would have a direct dial phone number of, say, 415-555-3202.

Some companies instead opted for the phone company to do their internal switching.  This was known as a Centrex service.  The phone company provided hard wired analog phone lines to the customer, but enabled extension-to-extension direct dial.  Thus, if I was at extension 3202 and I needed to dial extension 3203, I could pick up the phone and just dial the four digits.  The phone company took care of routing it.

What does this have to do with ISDN?  We used to order Centrex service for our customers in the same Centrex group as their ISP.  Thus, the customer’s ISDN line became an “extension” of the ISP’s Centrex group.  Not only could the customer then dial the ISP with four digits (not a big deal when the router is doing the dialing), but there were no toll charges on Centrex lines.  We used to nail the line up so it would never disconnect, and if it did for any reason, it would auto-redial.  And then we had dedicated Internet service on a dial-up line!

T1’s were 1.544 Mbps (their E1 counterparts elsewhere ran at 2.048 Mbps), blazing fast at the time.  Unlike the single-pair ISDN line, T1’s were delivered on four wires, two for TX and two for RX.  I won’t get into the details of line coding on T1’s, which we all studied as junior network engineers.  T1 lines were truly dedicated, and provided a point-to-point connection from customer to ISP.  They were distance-priced, but I worked in San Francisco, which is a small city, so it wasn’t usually a factor.  Because 1.544 Mbps was expensive for some customers, we had the option of ordering fractional T1s:  fewer channels at a slower speed, but still faster than ISDN.  In the early days we had to terminate the T1 on an external CSU/DSU device and then run a serial cable to the router, but eventually the CSU/DSU came integrated on the router interface card.
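For a sense of where those numbers come from, here is some back-of-the-envelope math.  The channel counts are standard DS1 framing figures; the 12-channel fractional example is just an illustration, not a specific offering we sold.

    # Rough T1 arithmetic: 24 DS0 channels of 64 Kbps plus 8 Kbps of framing overhead.
    CHANNEL_KBPS = 64        # one DS0 channel
    CHANNELS = 24            # DS0s in a full T1
    FRAMING_KBPS = 8         # T1 framing overhead

    full_t1 = CHANNELS * CHANNEL_KBPS + FRAMING_KBPS   # 1544 Kbps = 1.544 Mbps
    fractional = 12 * CHANNEL_KBPS                     # a hypothetical 12-channel fractional T1
    isdn_bri = 2 * CHANNEL_KBPS                        # both BRI B channels, for comparison

    print(f"Full T1:           {full_t1} Kbps")        # 1544
    print(f"Fractional (12 ch): {fractional} Kbps")    # 768
    print(f"ISDN BRI (2B):     {isdn_bri} Kbps")       # 128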

When I worked at the San Francisco Chronicle, we were providing Internet service via a T1 line terminating on a 2500-series router.  (The same one disabled by a paint roller in this story.)  1000 users on a single T1 was painfully slow, and we made the decision to upgrade to a DS3 (T3), which ran at 45 Mbps.  The interface for DS3 used two BNC coax connections.  I remember being amazed that the phone company could deliver service over coax, but it turns out that service into the building used fiber optics.  Inside the building we ran coax.  The run of coax from the basement to our 2nd floor data center was expensive, but the result was phenomenal.  The DS3, which we terminated on a brand new 7200 VXR, was vastly superior to the crawling T1, and the effort paid off with our users.

DSL was groundbreaking.  Before DSL there was no consumer-grade dedicated service, and small companies often could not afford Internet connectivity at all.  I was one of the first home adopters of DSL.  The freshly-trained phone guy showed up at my apartment and installed a splitter box in the basement.  This was needed because residential service was ADSL, which multiplexed digital service onto an analog line.  Unlike ISDN, which converted the analog phone signal to digital, ADSL left the analog signal intact, adding the digital part of the signal onto the higher frequencies.  The splitter box took the incoming phone line from the street and peeled off the high frequencies, providing an analog signal for telephones.  It then passed the analog/digital mix intact to the modem, which simply ignored the analog frequencies.  The phone guy then sat down with his toolbelt and tried to configure TCP/IP on my computer.  He gave up because he had no idea what he was doing.  I told him to leave me the IP addresses and I’d do it myself.  Eventually the telco would just send you a small filter to plug into each analog phone jack yourself, and they could turn on the service without sending a phone guy to rewire things.  Once DSL in its various forms came out, the Internet was available to the masses.  Of course, cable modems came shortly after.

We take for granted instant connectivity from every location on portable devices.  Once upon a time, connectivity was only available at certain locations, often requiring dialing a service provider.  There was real excitement as new technologies emerged for making connectivity faster and easier.  Now, of course, we just expect things to work and get angry when they don’t.

My first IT job was at a small company in Novato, California, that designed and built museum exhibits.  At the time most companies either designed the exhibits or built them, but ours was the only one that did both.  You could separate the services, and just do one or the other, but our end-to-end model was the best offering because the fabricators and designers were in the same building and could collaborate easily.  The odd thing about separating the functions was that we could lose a bid to design a project, but win the bid to build it, and hence end up having to work closely with a competitor to deliver their vision.

A museum exhibit we designed and built

The company was small–only 60 employees.  Half of them were fabricators who did not have computers, whereas the other half were designers and office staff who did.  My original job was to be a “gopher” (or go-fer), who goes for stuff.  If someone needed paint, screws, a nail gun, fumigation of a stuffed tiger, whatever, I’d get in the truck and take care of it.  However, they quickly realized I was skilled with computers and they asked me to take over as their IT guy.  (Note to newbies:  When this happens, especially at a small company, people often don’t forget you had the old job.  One day I might be fixing a computer, then the next day I’d be hauling the stuffed tiger.)

This was in the mid-1990’s, so let me give you an idea of how Internet connectivity worked:  it didn’t.  We had none when I started.  We had a company-internal network using LocalTalk (which I described in a previous post), so users could share files, but they had no way to access the Internet at all.  We had an internal-only email system called SnapMail, but it had no ability to do SMTP or connect beyond our little company.

The users started complaining about this, and I had to brainstorm what to do when we had virtually no operating budget at all.  I pulled out the yellow pages, looked under “I”, and found a local ISP.  I called them, and they told me I could use Frame Relay, a T1, or ISDN.  I had no idea what they were talking about.  The salesperson faxed me a technical description of these technologies, and I still had no idea what they were talking about.  At that point I didn’t know the phone company could deliver anything other than, well, a phone line.  I wasn’t at the point where I needed to hear about framing formats and B8ZS line encoding.

We decided we could afford neither the ongoing expense nor the hardware, so we came up with a really bad solution.  We ordered modems for three of the computers in the office:  the receptionist’s, the CEO’s, and the science researcher’s.  For those of you too young to remember, modems allowed computers to communicate over an ordinary phone line.  We ordered a single phone line (all we could afford).  When one of the three wanted to use the Internet, they would run around the office to check with the other two whether the line was free.

A circa-1990’s Global Village modem

The reason we gave the receptionist a modem is amusing.  Our dial-up ISP allowed us to create public email addresses for all of our employees.  However, they all dumped into one mailbox.  The receptionist would dial in in the morning, download all the emails, and copy and paste them into the internal email system.  If somebody wanted to reply, they would send it to the receptionist via SnapMail, and she would dial up, paste it into the administrator account, and send it.  Brilliant.

Needless to say, customer satisfaction was not high, even in those days.  Sick of trying to run IT with no money, I bailed for a computer consulting company in San Francisco and started installing the aforementioned T1s and ISDN lines for customers, with actual routers.

If ever you’re annoyed with slow Wi-Fi, be glad you aren’t living in the 1990’s.

After I left TAC I worked for two years at a Gold Partner in San Francisco.  One of my first customers there was one of my most difficult, and it all came down to timing.

I was dispatched to perform a network assessment of a small real-estate SaaS company in the SF East Bay.  Having just spent two years in TAC, I had no idea how to perform a network assessment, and unfortunately nobody at the partner was helping me.  I had been told they had a dedicated laptop loaded with tools just for this purpose, but nobody could locate it.  I started downloading tools on my own, but I couldn’t find a good, free network analysis tool.  Another engineer recommended a product called “The Dude” from MikroTik, and since it was easy to install I decided to use it.  I needed to leave it collecting data for a few days, and since nobody had provided me an assessment laptop I had to leave my own computer there.  I distinctly remember the client asking me what tool I was using to collect data, and sheepishly answering “Uh, it’s called The Dude.”  He looked at me skeptically.  (Despite the name, the tool was actually quite decent.)

Without any format or examples for an assessment, I looked at bandwidth utilization, device security, IOS versions, and a host of other configuration items.  The network was very simple.  It was a small company with a handful of switches in their main office, and a T1 line connecting them to a satellite office in LA.  They used a Cisco VoIP system for phones, and the phones in the satellite office connected over the T1 back to the main campus.  I wrote up a document with recommendations and presented it to the customer.  Almost everything was minor, and they agreed to have me come back in and make a few upgrades.

One item I noted in the assessment was that the clocks on the routers and switches were set incorrectly.  The clock has absolutely nothing to do with the operation of the device, but having just come from TAC I knew how important device clocks are.  If there is a network-wide incident, one of the first things we look at is the logging messages across the network, and without accurate device clocks we cannot properly compare log messages across multiple network devices.  We need to know whether the log message on this router happened at the same time as that other log message on that switch, and if the clocks are set to some random time, that is difficult or impossible.
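To make the point concrete, here is a tiny sketch of merging logs by timestamp.  The device names and log messages are hypothetical, and the clock skew is exaggerated for effect:

    from datetime import datetime, timedelta

    # Hypothetical syslog entries from two devices; names and messages are made up.
    router_logs = [("2004-06-01 02:14:05", "rtr1", "%LINK-3-UPDOWN: Serial0/0 changed state to down")]
    switch_logs = [("2004-06-01 02:14:07", "sw1", "%LINEPROTO-5-UPDOWN: Vlan10 line protocol down")]

    def merge_by_time(*sources):
        entries = [entry for source in sources for entry in source]
        return sorted(entries, key=lambda e: datetime.strptime(e[0], "%Y-%m-%d %H:%M:%S"))

    # With synchronized clocks, the merged timeline shows the real sequence of events.
    for ts, device, msg in merge_by_time(router_logs, switch_logs):
        print(ts, device, msg)

    # If sw1's clock is off by even a few minutes, the same merge tells the wrong story.
    skew = timedelta(minutes=7)
    skewed_switch_logs = [
        ((datetime.strptime(ts, "%Y-%m-%d %H:%M:%S") - skew).strftime("%Y-%m-%d %H:%M:%S"), device, msg)
        for ts, device, msg in switch_logs
    ]
    for ts, device, msg in merge_by_time(router_logs, skewed_switch_logs):
        print(ts, device, msg)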

I proceeded to make my changes, including synchronizing the clocks to NTP and doing a few IOS upgrades.  Then I closed out the work order and moved on to other clients.  I thought I was done, but I sure wasn’t.

We started getting calls from the customer complaining that the T1 between the two offices wasn’t working right.  I came over and found that it had come back up, but soon they were calling again.  And again.  I had used up all the hours on our contract, and my VP was not keen on me providing services for free.  But our client insisted the problem began as a result of my work, and I had to fix it.  Nothing I had done was major, but I went back, reverted to the old IOS (in case of a bug), and restored the saved configs (which I had kept).

With the changes rolled back, the problem kept happening.  The LA office was not only losing its Internet connectivity, but was also dealing with repeated voice outages.  Tempers at the client were running hot.  I opened a TAC case and had an RMA sent out for the WIC, the card that the T1 connected to.  I replaced it and the problem persisted.  At this point I was insisting it could not have been my fault, since I had rolled back the changes, but the customer didn’t see it that way, and I don’t blame them.

The customer called up their SBC (now AT&T) rep, basically a salesperson, to complain as well.  He told her a consultant had been working on his network, and she asked what had been changed.  He said “the clock on the router” and she immediately flagged that as the problem.  Sadly, the rep confused the router clock, which has no effect on operations, with the T1 clocking, which does.  I never touched the T1 clocking.  I knew the sales rep, as she had been my sales rep years before at the San Francisco Chronicle, and I knew she was a non-technical salesperson who had no idea what she was talking about.  Alas, she had planted the seed in the customer’s mind that I had messed everything up by touching the clock.  I pleaded my two CCIE’s and two years of TAC experience to try to persuade this customer that the router clock has zero, nada, zilch to do with the T1, to no avail.

The customer then, being sick of us, hired another consultant who got on the phone with SBC.  It turns out there was an issue with the line encoding on the T1, which SBC fixed and the problem went away.  The new consultant looked like a hero, and the next we heard from the client was a letter from a lawyer.  They were demanding their money back.

It’s funny, I’ve never really had another charge of technical incompetence leveled at me.  In this case I hadn’t done anything wrong at all, but the telco messed up the T1 line around the same time as I made my changes.  So I guess in more than one way, you could say it was a matter of bad timing.

I’ve mentioned before that EIGRP SIA was my nightmare case at TAC, but there was one other type of case that I hated–QoS problems.  Routing protocol problems tend to be binary.  Either the route is there or it isn’t;  either the pings go through or they don’t.  Even when a route is flapping, that’s just an extreme version of the binary problem.  QoS is different.  QoS cases often involved traffic that passed sometimes or in certain amounts, but started having problems when different sizes of traffic went through, or traffic that dropped at a certain rate.  Thus, the routes could be perfectly fine, pings could pass, and yet QoS was behaving incorrectly.

In TAC, we would regularly get cases where the customer claimed traffic was dropping on a QoS policy below the configured rate.  For example, if they configured a policing profile of 1000 Mbps, sometimes the customer would claim the policer was dropping traffic at, say, 800 Mbps.  The standard response for a TAC agent struggling to figure out a QoS policy issue like this was to say that the link was experiencing “microbursting.”  If a link is showing an 800 Mbps traffic rate, this is actually an average rate, meaning the link could be experiencing short bursts above this rate that exceed the policing rate but are averaged out in the interface counters.  “Microbursting” was a standard response to this problem for two reasons:  first, it was most often the problem;  second, it was an easy way to close the case without an extensive investigation.  The second reason is not as lazy as it may sound, as microbursts are common and are usually the cause of these symptoms.
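As an illustration of how averaging hides bursts, here is a sketch with made-up numbers; the traffic pattern is purely hypothetical:

    # A 1-second counter interval averaging 800 Mbps can still hide bursts
    # that exceed a 1000 Mbps policer.  All numbers here are invented.
    POLICER_MBPS = 1000
    INTERVAL_SEC = 1.0

    # Traffic pattern within the interval: (duration in ms, rate in Mbps)
    bursts = [(200, 1500), (800, 625)]

    total_megabits = sum(ms / 1000 * rate for ms, rate in bursts)
    average_mbps = total_megabits / INTERVAL_SEC

    print(f"Average rate over the interval: {average_mbps:.0f} Mbps")      # 800 Mbps
    for ms, rate in bursts:
        if rate > POLICER_MBPS:
            print(f"  {ms} ms burst at {rate} Mbps exceeds the policer -> drops")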

Thus, when one of our large service provider customers opened a case stating that their LLQ policy was dropping packets before the configured threshold, I was quick to suspect microbursts.  However, working in high-touch TAC, you learn that your customers aren’t pushovers and don’t always accept the easy answer.  In this case, the customer started pushing back, claiming that the call center which was connected to this circuit generated a constant stream of traffic and that he was not experiencing microbursts.  So much for that.

This being the 2000’s, the customer had four T1’s connected in a single multilink PPP (MLPPP) bundle.  The LLQ policy was dropping traffic at one quarter of the threshold it was configured for.  Knowing I wouldn’t get much out of a live production network, I reluctantly opened a lab case for the recreate, asking for two routers connected back-to-back with the same line cards, a four-link T1 interconnection, and a traffic generator.  As always, I made sure my lab had exactly the same IOS release as the customer.

Once the lab was set up I started the traffic flowing, and much to my surprise, I saw traffic dropping at one quarter of the configured LLQ policy.  Eureka!  Anyone who has worked in TAC will tell you that more often than not, lab recreates fail to recreate the customer problem.  I removed and re-applied the service policy, and the problem went away.  Uh oh.  The only thing worse than not recreating a problem is recreating it and then losing it again before developers get a chance to look at it.

I spent some time playing with the setup, trying to get the problem back.  Finally, I reloaded the router to start over and, sure enough, I got the traffic loss again.  So, the problem occurred at start-up, but when the policy was removed and re-applied, it corrected itself.  I filed a bug and sent it to engineering.

Because it was so easy to recreate, it didn’t take long to find the answer.  The customer was configuring their QoS policy using bandwidth percentages instead of absolute bandwidth numbers.  This meant that the policy bandwidth would be determined dynamically by the router based on the links it was applied to.  It turned out that IOS was calculating the bandwidth numbers before the MLPPP bundle was fully up, and hence was using only a single T1 as the reference for the calculation instead of all four.  The fix was to change the order of operations in IOS, so that the MLPPP bundle came up fully before the QoS policy was applied.
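The quarter-rate symptom falls straight out of the arithmetic.  Here is a sketch using an assumed priority percentage; the 30 percent figure is an example, not the customer’s actual policy:

    # LLQ configured as a percentage of interface bandwidth on a 4 x T1 MLPPP bundle.
    T1_KBPS = 1544
    LINKS = 4
    PRIORITY_PERCENT = 30            # assumed example value

    intended_kbps = PRIORITY_PERCENT / 100 * T1_KBPS * LINKS  # bundle fully up: all four T1's counted
    programmed_kbps = PRIORITY_PERCENT / 100 * T1_KBPS        # bug: calculated against a single T1

    print(f"Intended LLQ rate:  {intended_kbps:.0f} Kbps")
    print(f"Programmed at boot: {programmed_kbps:.0f} Kbps "
          f"({programmed_kbps / intended_kbps:.0%} of intended)")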

So much for microbursts.  The moral(s) of the story?  First, the most obvious cause is often not the cause at all.  Second, determined customers are often right.  And third:  even intimidating QoS cases can have an easy fix.

My first full-time networking job was at the San Francisco Chronicle.  Now there isn’t much to the Chronicle anymore, but in the early 2000’s the newspaper was still going strong.  It was the beginning of the decline, but most people still took their local newspaper as their primary source of news.  Being a network engineer at a major metropolitan newspaper was fascinating.  It is a massive operation to print and distribute a newspaper every single day, and you can never, ever, miss.  There is no slippage of production deadlines.  It has to be out every day, and every day you start all over, with a blank page.

As the lead network engineer, I touched everything from editorial (the news and photography content of the paper) to advertising, pre-press, production systems, and circulation.  Every one of these was critical.  If editorial content didn’t make it through, there was nothing to go into the paper.  If advertising didn’t make it in, we didn’t earn revenue.  If pre-press or production had problems, the paper wasn’t printed.  If circulation wasn’t working, nobody could get their paper.

The Chronicle owned and operated three printing plants in the Bay Area.  One was on Army Street in San Francisco, while the other two were in Union City and Richmond in the East Bay.  The main office was on Fifth and Mission in downtown SF, so the paper was prepared in San Francisco and then sent to the plants via microwave.  That’s where I came in.

Our microwave system used a dish on the clock tower of our building.  From 5th and Mission we sent a signal up to Roundtop Mountain in the East Bay hills. At Roundtop we leased space in a little concrete bunker that was used for various kinds of radio communication including cellular.  From Roundtop we bounced the signal back to the three printing plants.

Chronicle building with the microwave visible on the clock tower

The microwave presented itself to us as T1 lines.  I had the T1 lines connected to dual routers at the main site and at each of the plants.  In addition to the microwave, we had two backup T1’s to each plant, which were landlines from different carriers with diverse paths into the buildings.  We kept the microwave and the first T1 plugged into the routers, with the third circuit on manual standby in case we needed it.  You don’t take chances with production at a newspaper, and we had triple redundancy on everything.  I used OSPF for failover between the microwave and the #1 backup circuit on the routers, and HSRP for gateway redundancy.  With only four sites it was a simple enough topology, and it never gave me much trouble.

Until, that is, the day I got a call from our operations center that the primary circuits were all down.  We were running on backups.  I immediately called up the production systems engineer who managed the microwave and told him his circuits were down.  “Impossible!” he said, “that microwave is five-nines reliable.  Check your router!”  I tried a few of the usual tricks:  a shut/no shut on the interface, changing the line encoding, and so on.  No go.  He wanted me to start swapping hardware, which was a big deal in a live newspaper environment, and it seemed pointless.  If it was hardware, why would all of the circuits be down?

We bickered a bit before I moved to have the tertiary backup circuits swapped in so we had automatic failover while we worked on the microwave.  I got out our old T-berd tester to see if I could find any indication of the problem.  Then the systems engineer called:  “We need to meet at the clock tower, I’ve found the problem,” he said.  It’s always a relief to hear that when finger pointing is going around.

T-berd T1 Tester

I showed up at the entrance to the tower and followed the systems guy up a rusty ladder mounted to the wall.  Up in the tower there were bird droppings, and as I climbed higher I fought the urge to look down.  I’ve never much liked heights, and being out of shape and relying on my own strength to keep from falling several stories onto concrete was not promising.  Once I got to the top there was a large gap between the ladder and the floor, and I fought down panic as I flung my leg way over to climb onto the concrete flooring.  From there we went outside and I saw the problem right away.

If you’ve ever been to a convention in San Francisco, chances are it took place in the Moscone Center.  In the early 2000’s, the city decided to expand Moscone by building a new Moscone Center West on 4th and Howard streets.  And from up on the clock tower it was plain as day:  they had built a cooling tower on the roof right in the path of our microwave beam.  I looked at the systems guy and said, “Well, I guess you could make popcorn in that cooling tower.  Anyways, there goes your five nines.”

We hastily called meetings together to decide what to do.  Sue the city?  Call the FCC?  Find another building to bounce the microwave off of?  Those were long term solutions but we had an immediate problem.  Two circuits might seem like enough, but they were telco circuits and not as reliable as the microwave was, at least when its path wasn’t blocked.

Getting the city to cut the cooling tower off Moscone West was a non-starter, especially when it was the newspaper asking, a newspaper that made its money being critical of city officials.  So, we decided to lease roof space from another building and add an additional repeater.  However, this was a long process.  We needed to negotiate with the landlord, replan the radio deployment, license it and obtain permits, add the new repeater, and re-point the old dish to the new building.  That last item was not as simple as it sounded, since this wasn’t a DirecTV dish.  It was welded to the tower, so we needed to hire ironworkers to cut it off and re-position it.

Meantime, we ordered T1’s from downtown SF up to Roundtop to bypass the segment that wasn’t working.  We’d go hardwired to Roundtop, then microwave the rest of the way.  This was not, by any means, an ideal solution, nor was it an overnight solution, but we could at least get some redundancy faster than it would take to add the repeater.  I’m glad we did, because shortly after the microwave went down we started having terrible problems with the landlines and needed the triple redundancy.

If you drive by Fifth and Mission now, the microwave dish is gone from the clock tower.  The Chronicle, a shadow of its former self, no longer operates its own printing plants, and has a circulation far smaller than it did in 2004, when I left.  As I said in my last post, it’s great to have a sense of purpose when you work in IT.  It wasn’t about fixing a microwave but about getting that paper in the hands of our readers.  I’m thankful I got to be a part of that for a few years, even if it cost me some vertigo and sleepless nights.