Quantitative Network Engineering with Fault Tree Analysis

This is a simple paper outlining how to apply some risk analysis techniques to network engineering in order to obtain quantitative results. It is an attempt to turn network engineering from an artful guessing game into a science.

Quantitative Analysis of the Minimal Cutsets

Although this qualitative analysis points to a lot of first and second order failure modes, it's not time to panic and start installing a building UPS, replacing all of the humans with robot controllers, and speccing out routers with redundant backplanes. These are single points of failure, but it is necessary to look at the probability of these failure events happening and the cost associated with redesigning and changing the system to eliminate them. This is where the quantitative analysis comes in, allowing one to assess the level of risk in the system and then properly decide whether each failure mode is worth fixing. There may actually be several second order failure modes that are more likely to occur than any of the first order modes, and they may be easier and less costly to work around.

To oversimplify, the most likely failure mode will be the cutset with the highest probability. Since a minimal cutset occurs only when all of its basic events occur together (an AND of those events), following the rules of boolean algebra the product of the probabilities of its events is taken to determine the probability of that failure mode, assuming the events are independent. The top event is the OR of all of the cutsets, so when the individual probabilities are small, the probability of the top event occurring is approximately the sum of the probabilities of all of the contributing failure modes (cutsets). Putting numbers to the failure modes allows for the assessment of the overall risk of the event occurring. The most likely failure modes become visible, and these may or may not be single points of failure. It also becomes much clearer how much of an effect any one failure mode has on the overall probability of the top event occurring.
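
The arithmetic above can be sketched in a few lines of Python. This is only an illustration: the event names, the probabilities, and the cutsets are made-up placeholders rather than values from a real network, the basic events are assumed to be independent, and the top event uses the small-probability approximation of summing the cutsets.

```python
from math import prod

# Hypothetical basic-event probabilities over one common period (illustrative only).
basic_events = {
    "building_power": 1e-3,
    "operator_error": 5e-3,
    "router_a": 5e-2,
    "router_b": 5e-2,
    "circuit_a": 1e-2,
    "circuit_b": 1e-2,
}

# Hypothetical minimal cutsets: two first order, two second order.
minimal_cutsets = [
    ("building_power",),
    ("operator_error",),
    ("router_a", "router_b"),
    ("circuit_a", "circuit_b"),
]


def cutset_probability(cutset, probs):
    """Product of the event probabilities (all events must occur; independence assumed)."""
    return prod(probs[event] for event in cutset)


def top_event_probability(cutsets, probs):
    """Sum of the cutset probabilities (approximation valid when they are small)."""
    return sum(cutset_probability(c, probs) for c in cutsets)


def rank_cutsets(cutsets, probs):
    """Cutsets ordered from most to least likely, as (probability, cutset) pairs."""
    scored = [(cutset_probability(c, probs), c) for c in cutsets]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


if __name__ == "__main__":
    total = top_event_probability(minimal_cutsets, basic_events)
    for p, cutset in rank_cutsets(minimal_cutsets, basic_events):
        print(f"{' AND '.join(cutset):30s} {p:.2e}  ({p / total:.0%} of top event)")
    print(f"top event (approx.)            {total:.2e}")
```

With these placeholder numbers the second order router pair ranks above the first order building power failure, which is exactly the kind of result the qualitative analysis alone cannot show.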

Where do these numbers come from? For any manufactured part, the reported Mean Time Between Failures (MTBF) can be used to calculate a probability. Likewise, any Service Level Agreements (SLAs) that are in place for a service such as power or a leased circuit can be used to figure a probability. For other failures, such as when a human is involved, the best way to calculate a probability is to analyze historical data and assume that history will repeat itself with regard to failures. Of course, if historical data is available for any acquired services, it may be prudent to use these numbers as well.
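
As a rough sketch of how each of these sources might be turned into a number, the helpers below make assumptions that are not part of the original text: failures of a manufactured part are taken to follow an exponential distribution (a constant failure rate of 1/MTBF), an SLA is read as a guaranteed availability percentage, and the historical estimate is a simple observed fraction. The function names are invented for illustration. Whichever sources are used, the resulting probabilities should all be expressed over the same period before they are combined in a cutset.

```python
import math


def failure_probability_from_mtbf(mtbf_hours, period_hours):
    """Probability of at least one failure within the period.

    Assumes a constant failure rate of 1/MTBF (exponential model).
    """
    return 1.0 - math.exp(-period_hours / mtbf_hours)


def unavailability_from_sla(availability_percent):
    """Fraction of time the service may be down under the SLA, e.g. 99.9 -> about 0.001."""
    return 1.0 - availability_percent / 100.0


def probability_from_history(periods_with_failure, periods_observed):
    """Observed fraction of periods in which the failure occurred."""
    if periods_observed <= 0:
        raise ValueError("need at least one observed period")
    return periods_with_failure / periods_observed


if __name__ == "__main__":
    # Hypothetical inputs: a 100,000 hour MTBF part over one year (8760 hours),
    # a 99.9% availability SLA, and 3 bad months out of the last 36.
    print(failure_probability_from_mtbf(100_000, 8760))
    print(unavailability_from_sla(99.9))
    print(probability_from_history(3, 36))
```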



