Quantitative Network Engineering with Fault Tree Analysis

This is a short paper outlining how to apply risk analysis techniques to network engineering to obtain quantitative results -- an attempt to turn network engineering from an artful guessing game into a science.

System Probabilities

Calculating a Probability from the MTBF

Example figures: MTBF = 50,000 hours; SLA uptime = 99.2%

The Mean Time Between Failures (MTBF) is a prediction of the elapsed time between failures of a system or component. Most manufacturers report MTBF values for systems and subsystems, usually in hours. The larger the MTBF, the more reliable the system. MTBF figures assume a constant failure rate during the operating life of the system. Infant mortality, the tendency of a system to have a higher failure rate when it is new, and wearout, the higher failure rate towards the end of a system's lifetime, are not taken into consideration. There are several accepted methods of calculating the MTBF based on lab test data, modeling, and actual failure data. As long as the MTBF numbers being reported use an accepted method, they should be fine for analysis.

To convert an MTBF into a probability, the number needs to be normalized over a period that matches any other probabilities garnered from other sources, like SLAs or historical data. A year is a good starting point, as it is the time basis for many other reports and figures. Without going into the derivation (search Google or Yandex for more), the probability P that a system will function for a time period t without failure is:

P(t) = exp(-t/MTBF)
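
As a concrete illustration, here is a minimal Python sketch of this conversion, using the 50,000-hour MTBF from the example figures above and normalizing over a 365-day year (the function name and the choice of year length are just assumptions for the example):

    import math

    HOURS_PER_YEAR = 24 * 365  # normalization period t (one year)

    def survival_probability(mtbf_hours, t_hours=HOURS_PER_YEAR):
        """P(t) = exp(-t/MTBF): probability the system runs for t hours
        without a failure, assuming a constant failure rate."""
        return math.exp(-t_hours / mtbf_hours)

    # Example figure from above: MTBF = 50,000 hours
    p_up = survival_probability(50_000)
    p_fail = 1 - p_up  # probability of at least one failure during the year

    print(f"P(up for a year)   = {p_up:.4f}")    # ~0.8393
    print(f"P(fails in a year) = {p_fail:.4f}")  # ~0.1607

Note that fault tree analysis usually takes the complement, 1 - P(t), the probability of at least one failure during t, as the basic-event probability.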

Calculating a Probability from a SLA

Availability is the ratio of uptime to total time. It is usually specified as a parameter somewhere in a service level agreement. Availability is normally reported as a percentage, so it's fairly easy to find its complement -- the downtime fraction -- by subtracting it from 100%. Converting that to a probability is as simple as changing it from a percentage to a decimal:

P(t) = (100% - Availability%)/100%

This holds once the availability figure is normalized over the same time period t as the other probabilities.
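
Continuing the sketch, this conversion is a one-liner in Python; the 99.2% uptime from the example figures above is assumed, and the function name is my own:

    def failure_probability_from_sla(availability_percent):
        """Convert an SLA availability percentage into the probability
        of the system being down: P(t) = (100% - Availability%) / 100%."""
        return (100.0 - availability_percent) / 100.0

    # Example figure from above: 99.2% uptime
    p_down = failure_probability_from_sla(99.2)
    print(f"P(down) = {p_down:.3f}")  # 0.008

The result, 0.008, is the expected fraction of downtime, which is the probability of finding the system down at any given moment during the covered period.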

Calculating a Probability from Historical Data

Analyzing historical data can come in handy in two ways:

  1. To come up with availability numbers
  2. To check the actual performance of systems against that promised in SLAs and the MTBFs reported by manufacturers.

When it comes to equipment and system outages, it is usually fairly straightforward to collect information about them -- but be careful with this data. It is often misreported, miscategorized, or simply not reported at all, which can badly skew the results. All that is needed to come up with a probability is the total system downtime over the normalized period t; from that, availability can be calculated, and a probability derived, as in the sketch below.
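
As a final sketch, with a made-up outage log purely for illustration, historical downtime can be turned into availability and then into a probability:

    # Hypothetical outage log: downtime per incident, in hours,
    # collected over one 365-day year (t = 8,760 hours).
    outage_hours = [2.5, 0.75, 6.0, 1.25]

    t_hours = 24 * 365
    downtime = sum(outage_hours)
    availability = (t_hours - downtime) / t_hours
    p_down = 1 - availability  # probability the system is down

    print(f"Availability = {availability:.5%}")  # ~99.88014%
    print(f"P(down)      = {p_down:.6f}")        # ~0.001199

Figures computed this way can then be compared against the SLA and MTBF-derived probabilities above.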



