Quantitative Network Engineering with Fault Tree Analysis

This short paper outlines how to apply some risk analysis techniques to network engineering to obtain quantitative results. It is an attempt to turn network engineering from an artful guessing game into a science.

Causal Analysis

The next step is to review the top event and try to determine all possible causes of this event. These causes should be the first-level contributors to the top event. Take care not to skip ahead (which is easy to do). It's also easy to leave things out at first. Building a fault tree is usually not done in one fell swoop; it takes a lot of work and rework.

For the isolation of our system described earlier, there are three good contributing events to start with:

  1. Forwarding Plane Failure
  2. Control Plane Failure
  3. Power Failure

Looking at the diagram of the system, it's very easy to point a finger directly at the UPS and state early on that if the UPS fails, the router will be isolated. But be careful not to get ahead of yourself. Although the UPS definitely looks like a single point of failure in the system, there is a more general cause of the top event -- a simple power failure. A dying UPS may be a contributing factor towards a power failure, but it may not be the only cause. There are other events which could lead to a power failure, and they may or may not have a higher probability of occurring. So rather than jump ahead and start on plans to redesign the UPS connections, stay reserved and work your way down the causal tree. The UPS will come into play sooner or later, but it may not be as important to the top event as you might think. Be patient, and be analytical.

Once the contributing events have been identified, analyse the relationship between them and the top event. Will the top event occur if only one of the faults happens, or do several faults have to happen simultaneously? The contributing events can be linked to the top event with a logic gate. In our example, if any one of the three events occurs, the router is isolated. So we link these three faults to the top event with an OR gate.
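
As a rough sketch of what these gates mean numerically, the snippet below combines event probabilities under the assumption that the contributing events are statistically independent. The function names or_gate and and_gate and every probability value are hypothetical placeholders chosen for illustration, not measurements from the example system.

```python
from math import prod

def or_gate(probabilities):
    """Probability that at least one of the independent input events occurs."""
    # P(A or B or ...) = 1 - product of (1 - P(event)) over all inputs
    return 1.0 - prod(1.0 - p for p in probabilities)

def and_gate(probabilities):
    """Probability that all of the independent input events occur together."""
    return prod(probabilities)

# Hypothetical failure probabilities over some fixed period (illustration only).
contributing_events = {
    "Forwarding Plane Failure": 0.01,
    "Control Plane Failure": 0.02,
    "Power Failure": 0.05,
}

p_isolated = or_gate(contributing_events.values())
print(f"P(router isolated) = {p_isolated:.4f}")  # prints 0.0783
```

With an OR gate the top event is always at least as likely as its most likely contributor, which is why every first-level cause deserves a look before any single component is singled out.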

Selecting the undesirable event is the most important step, and it should always be the first step. Figure out what event you want to assess or avoid, and model downwards from that point which other factors could contribute to it. If you wind up working upwards from that point in your diagram, you really need to reassess exactly what the problem is.

When selecting events, make sure they are all plausible and don't expect any miracles to happen. Sure, a meteor striking the router or a zombie attack will cause the router to be isolated, but these are neither statistically important nor probable. If these were included, the fault tree diagram would never end and running the math would take more significant figures than the world record calculation of pi.
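
To illustrate why wildly improbable events can be left out, the sketch below compares an OR gate result with and without such an event, again assuming independent events. All of the probabilities are made up for illustration, and the names or_gate, plausible and meteor_strike are hypothetical.

```python
from math import prod

def or_gate(probabilities):
    """Probability that at least one of the independent input events occurs."""
    return 1.0 - prod(1.0 - p for p in probabilities)

plausible = [0.01, 0.02]   # hypothetical probabilities for credible causes
meteor_strike = 1e-12      # hypothetical, vanishingly small probability

without = or_gate(plausible)
with_meteor = or_gate(plausible + [meteor_strike])

print(f"without meteor: {without:.6f}")
print(f"with meteor:    {with_meteor:.6f}")
print(f"difference:     {with_meteor - without:.2e}")
```

Under these assumed numbers the two results agree to six decimal places, and the difference is on the order of 1e-12, far below any precision the input estimates could justify.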
