Quantitative Network Engineering with Fault Tree Analysis

This is a simple paper outlining how to use some risk analysis techniques, Fault Tree Analysis in particular, with network engineering to obtain quantitative results. It is an attempt to turn network engineering from an artful guessing game into a science.

Background

Names have been changed and/or left out to protect the guilty.

I've worked for a couple of shops that "performed" what they called Network Engineering. In reality there was very little actual engineering going on. Network architectures were little more than the so-called Engineers connecting boxes with lines. The Engineer's goal was to make everything redundant -- so no single point of failure would ever affect network operations. This consisted of drawing at least two lines between every box, and making sure that the box itself had dual-everything: dual power supplies, dual routing brains, dual fabrics, dual gizmos and dual whatchamacallits. The word dual was replaced with redundant, alternate, backup and diverse to make things look even more complicated and technical. How traffic actually routed and flowed was normally an afterthought, because if a box was connected to another box traffic could magically flow in that direction, and if there was another set of lines and boxes there was automagically an alternate route.

However, with all of the redundant everything, big network outages still occurred, and the results were always a surprise. To fix the problem, which was normally confused with the result, more alternate, redundant, diverse things were thrown at it. The best example I can think of to illustrate this involves power. The top tier managers decided that they needed to do what all of the trade magazines were touting and move everything from commercial AC power to a DC system, with batteries and a backup generator. This took a major rework of facility power at every site: installing rectifiers, the battery system, a generator (or perhaps two), new DC infrastructure in the facility to provide power to the network equipment, and new DC power supplies to supplant the nasty AC ones. This often meant new network equipment, or nasty inverters -- or both! So to implement all of this, a lot of funding was spent on construction and infrastructure. It also meant downtime for most devices when they were migrated from the AC system over to Edison's delight. This downtime was taken as a necessity, and written off as not counting against the official 99.99999% availability promised to everyone, since it was a planned outage.

So everything was now on a nice shiny DC system with battery backup and generators -- the facility could now survive any power catastrophe ever! But it had to be tested to make sure it worked. So the real fun begins by cutting the commercial power that had reliably fed the facility for years. Of course there were problems with the switch... so the entire DC facility went down. Fix the switch. Not quite -- the facility drops again. Fix the switch the proper way. Now there's something wrong with the batteries and they don't hold the load. Fix the batteries... after the fourth try. Then it's on to the generator, which finally took over after another three or four attempts. So now the system is finally working... but you have to make sure once a month that the emergency power will kick in. So monthly testing drops the DC facility about every other time (on average). This goes on for a few years. In the meantime the commercial power never even fluctuates.

But after a few years of installing all kinds of extra, oversized and super redundant networking equipment, the whole facility now needed a commercial power upgrade. In order to safely add the extra juice, the utility needed to shut down the power for an hour or two -- but no fear! The emergency power test had worked flawlessly for the past three months, and the facility could live on the generators. And it did! Well, not quite: the whole facility was still isolated, because the commercial provider's muxing equipment that was bringing in the nice fat pipes was on commercial power, and it dropped during the upgrade.

So in this example, a lot of time, money, effort and outages went into adding absolutely no value -- in fact they made the situation much worse!
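
To put a number on that claim, here is a minimal sketch of the kind of fault tree arithmetic this paper is about, applied to the power story. It is only a sketch: the Python below uses invented placeholder probabilities (nothing here was measured at that facility), and the fta_or and fta_and helpers are just the textbook gate formulas for independent events.

    # A minimal fault tree sketch for the power story above. Every probability
    # here is an invented placeholder -- none of these numbers were measured at
    # the facility in question. The two helpers are the standard gate formulas
    # for independent basic events.

    def fta_or(*probs):
        """OR gate: the top event occurs if ANY input event occurs."""
        p_none = 1.0
        for p in probs:
            p_none *= (1.0 - p)
        return 1.0 - p_none

    def fta_and(*probs):
        """AND gate: the top event occurs only if ALL input events occur."""
        p_all = 1.0
        for p in probs:
            p_all *= p
        return p_all

    # Assumed annual probabilities (all hypothetical):
    p_commercial = 0.02   # commercial feed drops at least once during the year
    p_switch     = 0.10   # switchgear fails to transfer the load when asked
    p_batteries  = 0.05   # battery string fails to hold the load
    p_generator  = 0.05   # generator fails to start and carry the load
    p_self_drop  = 0.30   # the DC plant drops the load on its own (bad switch,
                          # botched monthly test) at least once during the year

    # Before the conversion: facility power is lost only when the commercial
    # feed is lost.
    p_before = p_commercial

    # After the conversion: facility power is lost when the commercial feed is
    # lost AND the backup chain fails to carry the load, OR when the new plant
    # takes the facility down all by itself.
    p_backup_chain_fails = fta_or(p_switch, p_batteries, p_generator)
    p_after = fta_or(fta_and(p_commercial, p_backup_chain_fails), p_self_drop)

    print(f"P(facility power loss per year), commercial only: {p_before:.3f}")
    print(f"P(facility power loss per year), shiny DC plant:  {p_after:.3f}")

With those made-up numbers, the facility drops power roughly fifteen times more often after the conversion than before it, which is the anecdote in arithmetic form: the backup plant's own failure modes swamp the tiny risk it was supposed to cover.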

Another example is with respect to having two brains for the control plane of a router. The network I was working on was powered mostly by a slew of Cisco 7500 series routers. The control plane of these routers runs on the Route Switch Processor (RSP). If the RSP fails, the router is basically useless, as the rest of the hardware can't do anything without it. So to add to the reliability of the platform you could install a second, redundant RSP. Back when the 7500 was a hot piece of kit, the RSPs were not exactly cheap, so it was no small investment to have a second module to handle control plane (and punted forwarding plane) duties. You also needed to stock them both with the not-so-cheap SRAM flash cards.

These also did not function the way everyone expects a contemporary control-plane failover to work -- with sub-second response, and the rest of the network oblivious that a failover ever occurred. Nope, these were downright ugly. The backup RSP basically just sat there in a pre-boot stage, waiting for its chance to load its own copy of IOS should the master sputter and die somehow. In order for these to fail over, everything had to be configured properly, the IOS image referenced properly, and the IOS boot image placed gingerly upon the backup RSP's own flash card properly. If anything went wrong, it wouldn't boot up when it was supposed to. This could be fairly complicated, and was also prone to typos when referencing the not-so-intuitively named IOS boot file. It was even worse when virtually all of the network analysts were barely capable of the ping and show interface commands. From what I saw, the redundant RSP failed to boot and "save" the router more often than not. It also had a reasonable probability of hanging the whole box if not set up properly. A lot of money, time and resources were spent on these things, and they added almost no value... and often seemed to turn what would normally be a simple reboot of the router into a hung box that needed to be power cycled. The other issue with this architecture was what actually had to happen to get a failover: complete and utter death of the primary RSP. If the primary RSP hung, or didn't boot properly, the secondary RSP was never going to progress past its waiting phase. And what were the chances of the primary dying outright? Pretty low.

Once again, a lot of resources were spent on hardware that didn't provide any gain. It protected against a remote chance of one specific failure, and actually introduced some new failure paths that didn't exist without it.
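
The same gate arithmetic applies to the dual-RSP question. Again, every probability in the sketch below is an assumed placeholder rather than anything measured on that network; the point is only the shape of the comparison. A single RSP takes the router down when it dies or hangs. A dual-RSP chassis goes down when the primary dies and the failover flubs it, when the primary hangs (so the backup never wakes up), or when the redundant configuration itself wedges the box -- the new failure path that didn't exist before.

    # Another hedged sketch with invented placeholder probabilities -- not
    # measurements. Gate math for independent events, written out inline:
    # OR gate:  1 - product(1 - p_i),   AND gate:  product(p_i).

    p_rsp_dies      = 0.02  # primary RSP dies outright during the year
    p_rsp_hangs     = 0.04  # primary RSP hangs or misboots; no failover happens
    p_failover_ok   = 0.40  # backup RSP actually boots and takes over when needed
    p_dual_cfg_hang = 0.03  # the dual-RSP setup itself wedges the chassis

    # Single RSP: the control plane is lost if the RSP dies OR hangs.
    p_single = 1.0 - (1.0 - p_rsp_dies) * (1.0 - p_rsp_hangs)

    # Dual RSP: the control plane is lost if the primary dies AND the failover
    # fails, OR the primary hangs (backup never takes over), OR the dual setup
    # hangs the box on its own.
    p_die_no_failover = p_rsp_dies * (1.0 - p_failover_ok)
    p_dual = 1.0 - ((1.0 - p_die_no_failover)
                    * (1.0 - p_rsp_hangs)
                    * (1.0 - p_dual_cfg_hang))

    print(f"P(control plane lost per year), single RSP: {p_single:.3f}")
    print(f"P(control plane lost per year), dual RSPs:  {p_dual:.3f}")

With those particular guesses, the "redundant" configuration comes out worse than the single RSP, which matches what I watched happen. Change the numbers and the answer changes -- and that is exactly the point: the decision should hang on measured or estimated failure probabilities, not on how reassuring the second module looks in the chassis.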




      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |   Version 0   |       C       |            Plenty             |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |           Router ID - www.blackhole-networks.com              |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |           Area ID - FTA with Network Engineering              |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |          Checksum  OK         |         Construction          |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |                                                               |
      +-                                                             -+
      |                        PAGE STILL                             | 
      +-                         UNDER                               -+
      |                       CONSTRUCTION                            |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+