Quantitative Network Engineering with Fault Tree Analysis
This paper outlines how to apply risk analysis techniques, specifically Fault Tree Analysis, to network engineering in order to obtain quantitative results. It is an attempt to turn network engineering from an artful guessing game into a science.
Fault trees can clearly become very large, very quickly. Once the size becomes unwieldy, transfer gates allow us
to break the tree down into smaller components.
Here is a further breakdown of each event and gate on the tree, which can be read to reinforce the process. It is a very good exercise to
go over a developed fault tree with another expert or group that knows the system being modeled. Outside input is essential for checking
logic, catching misinterpretations of system functions, and finding missing events. One noteworthy feature is the use
of a voting gate to determine the effect, further up the tree, of a failure of any of the switching fabrics. At least two of the three fabrics need
to fail in order for the gate to become true and contribute to the overall failure of the system.
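As a minimal sketch of how such a voting gate could be quantified, the Python below computes the probability that at least k of n independent inputs are true. The function name voting_gate and the per-fabric failure probability are hypothetical, chosen only for illustration, and the inputs are assumed to be statistically independent.

```python
from itertools import combinations
from math import prod

def voting_gate(probs, k):
    """Probability that at least k of the independent input events occur."""
    n = len(probs)
    total = 0.0
    for m in range(k, n + 1):
        # Sum over every way of choosing exactly m failed inputs.
        for failed in combinations(range(n), m):
            total += prod(
                probs[i] if i in failed else 1.0 - probs[i]
                for i in range(n)
            )
    return total

# Hypothetical failure probability of one switch fabric (not measured data).
p_fabric = 0.01

# 2-out-of-3 voting gate over the three fabrics.
p_switch_fabric = voting_gate([p_fabric] * 3, k=2)
print(f"P(switch fabric fails) = {p_switch_fabric:.6f}")
```

With k equal to 1 the same function behaves as an OR gate, and with k equal to n it behaves as an AND gate, so the voting gate can be seen as the general case.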
Detailed Analysis of the Whole Tree
The Forwarding Plane Branch of the Tree
Two events, either of which will cause the Forwarding Plane to fail, are linked together with an OR gate.
The Forwarding Plane Failure could be caused by the Network Links failing, which leaves no path over which to forward packets.
The other big event that could cause the forwarding plane to fail is a hardware failure in that system.
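The OR gate above can be quantified with a short sketch, assuming the two input events are independent. The helper name or_gate and both probabilities are hypothetical placeholders rather than values from the tree.

```python
from math import prod

def or_gate(probs):
    """P(at least one independent input event occurs) = 1 - prod(1 - p_i)."""
    return 1.0 - prod(1.0 - p for p in probs)

# Hypothetical probabilities, for illustration only.
p_links_fail = 0.002      # both network links are down
p_hardware_fail = 0.001   # forwarding plane hardware failure

p_forwarding_plane = or_gate([p_links_fail, p_hardware_fail])
print(f"P(forwarding plane fails) = {p_forwarding_plane:.6f}")
```

For small probabilities the result is close to the simple sum of the inputs, which is why OR gates tend to dominate the top event.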
There are two network links, one from Jittercom and one from Taildrop-a-Phonic, and both must fail in order for the event to cause the undesired effect.
These are linked together with an AND gate. Each network link has a nearly identical structure: a cable, a lease from the service provider, a modem,
and a forwarding engine and line card in the router. However, the contributing events are mostly different, as these are separate components (aside
from the forwarding engine), each of which has its own independent failure rate. Each circuit needs to be modeled independently.
Any one of the following events can cause the service provider lease to fail:
The node in the network that normally forms an adjacency stops responding. This could be due to a misconfiguration or fault with the other node.
Historical data would have to be used here unless there was already an analysis done on the other node in terms of what can be expected
from a reliability perspective.
The hardware in the router that supports each circuit could fail. This consists of the Forwarding Engine in Slot 1 and the Line card for each circuit. If either one of these
events occurs, the circuit will fail. Note that each circuit shares the same Forwarding Engine, so this is really the same event in two
separate locations.
The service provider portion of the lease could also fail. This could be caused by any one of the following events: an ordinary
service outage, a failure of the cable connecting the lease to the router, or the death of the Customer Premises Modem.
The CPE modem could fail in one of two major ways, each of which will provide input up the tree:
A hardware failure.
A commercial power outage.
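Because the Forwarding Engine appears under both circuits, the two inputs to the AND gate are not independent, and simply multiplying the circuit probabilities would understate the result. One way to handle a shared event is to condition on it, sketched below. Every probability and event key here is a hypothetical placeholder, and each circuit's own events (including modem power) are assumed to be independent of the other circuit's.

```python
from math import prod

def or_gate(probs):
    """P(at least one of several independent events occurs)."""
    return 1.0 - prod(1.0 - p for p in probs)

# All probabilities below are hypothetical placeholders, not measured data.
p_forwarding_engine = 0.001  # Slot 1 engine, shared by both circuits

circuit_events = {
    "Jittercom": {
        "adjacent_node": 0.002, "line_card": 0.001, "provider_outage": 0.003,
        "cable": 0.0005, "modem_hardware": 0.001, "modem_power": 0.002,
    },
    "Taildrop-a-Phonic": {
        "adjacent_node": 0.001, "line_card": 0.001, "provider_outage": 0.004,
        "cable": 0.0005, "modem_hardware": 0.002, "modem_power": 0.002,
    },
}

# Failure probability of each circuit from its own (non-shared) events.
q = {name: or_gate(events.values()) for name, events in circuit_events.items()}
q1, q2 = q["Jittercom"], q["Taildrop-a-Phonic"]

# Condition on the shared event: if the engine fails, both circuits fail;
# otherwise both circuits must fail through their own events.
p_both_exact = p_forwarding_engine + (1.0 - p_forwarding_engine) * q1 * q2

# Naive result that wrongly treats the engine as a separate event per circuit.
p_both_naive = or_gate([p_forwarding_engine, q1]) * or_gate([p_forwarding_engine, q2])

print(f"exact: {p_both_exact:.6f}  naive: {p_both_naive:.6f}")
```

The gap between the two figures equals p_fe(1 - p_fe)(1 - q1)(1 - q2), so the naive product always underestimates the chance of losing both links when a shared component exists.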
The failure of the hardware in the router that services the forwarding plane for the entire box is the final branch of the forwarding
plane failure section. The forwarding plane hardware consists of two major components, and a failure of either one will fail the system:
A failure of the router's backplane.
A failure of two of the Switch Fabric Cards, which will cause a failure of the entire switch fabric. These are linked together with a
voting gate labeled with the fraction 2/3 to indicate that two of the three inputs must be true to cause the gate to close.
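The logic of this branch can also be checked state by state before any probabilities are attached. The sketch below is a hypothetical Boolean model of the branch: an OR gate over the backplane event and the 2/3 voting gate. The function and argument names are illustrative only.

```python
def forwarding_hardware_failed(backplane_failed, fabrics_failed):
    """OR gate over the backplane event and the 2/3 voting gate."""
    # The voting gate closes when at least two of the three fabrics have failed.
    voting_gate_true = sum(fabrics_failed) >= 2
    return backplane_failed or voting_gate_true

# One fabric down, backplane healthy: the branch does not fail.
print(forwarding_hardware_failed(False, [True, False, False]))   # False

# Two fabrics down: the voting gate closes and the branch fails.
print(forwarding_hardware_failed(False, [True, True, False]))    # True

# Backplane down with all fabrics healthy: the branch fails.
print(forwarding_hardware_failed(True, [False, False, False]))   # True
```

Walking a few component states through the gates like this is a cheap way to confirm the tree matches how the hardware actually behaves.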
The Power Branch of the Tree
In order for the whole router to lose power, both power supplies must fail simultaneously, as each power supply is able to carry the load
of the entire platform by itself. These are linked together with an AND gate.
Each power supply is modeled as a separate tree, since the failure of each is an independent event. Additionally, power supply
1 is connected to a UPS, while power supply 2 is connected directly to commercial power.
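One common way to obtain an input probability for a gate like this is to convert reliability figures into a steady-state unavailability, MTTR / (MTBF + MTTR). The sketch below collapses each power supply's subtree into a single unavailability figure and multiplies the two for the AND gate, relying on the independence stated above. The MTBF and MTTR values are hypothetical and do not describe any real power supply.

```python
def unavailability(mtbf_hours, mttr_hours):
    """Steady-state unavailability: MTTR / (MTBF + MTTR)."""
    return mttr_hours / (mtbf_hours + mttr_hours)

# Hypothetical figures, for illustration only.
u_ps1 = unavailability(mtbf_hours=100_000, mttr_hours=4)  # supply 1, behind the UPS
u_ps2 = unavailability(mtbf_hours=50_000, mttr_hours=4)   # supply 2, on commercial power

# AND gate over independent events: both supplies are down at the same time.
p_power_loss = u_ps1 * u_ps2
print(f"P(router loses power) = {p_power_loss:.3e}")
```

In a fuller model each supply's figure would itself come from evaluating its own subtree rather than from a single MTBF and MTTR pair.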
The Control Plane Branch of the Tree
The first level of the control plane branch was already discussed, so this resumes with the contributing factors of the failure of the control plane
hardware, namely the two routing brains. Since there are two, either of which can control the system, both must fail before control is lost, so these are linked together with an AND gate.
Each routing brain consists of a hardware and a software component, both of which need to function in order for the system to work as it should. Thus the
failures of these two sections of the tree are linked with an OR gate.