Quantitative Network Engineering with Fault Tree Analysis
This short paper outlines how to apply risk analysis techniques to network engineering to obtain quantitative results. It is an attempt to turn network engineering from an artful guessing game into a science.
After putting together the events at each level, it's a good idea to take a closer look at each one of them. Check whether the tree being constructed is a good representation of the system. For any basic event, make sure you can assign a probability to it, even if you don't have a value for it at this point. If you can't assign a number to it, that may be a sign that you need to develop that event further. Also make sure that you are not repeating any events.
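The check for repeated events can be automated once the tree is written down as data. The following is a minimal sketch, not the method used to build the diagrams here: the `Event` class, the helper functions, and the top event name "Router Failure" are all hypothetical, and the draft tree is only a rough stand-in for the branches discussed in this paper.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Event:
    """One node of a fault tree. A node with no children is a basic event."""
    name: str
    children: list = field(default_factory=list)


def leaf_names(event):
    """Yield the name of every leaf event at or below `event`."""
    if not event.children:
        yield event.name
    for child in event.children:
        yield from leaf_names(child)


def repeated_events(top):
    """Return the sorted names of leaf events that appear more than once."""
    counts = Counter(leaf_names(top))
    return sorted(name for name, n in counts.items() if n > 1)


# Hypothetical draft tree, before any clean-up (names are assumptions).
draft = Event("Router Failure", [
    Event("Forwarding Plane Failure", [
        Event("Link Failure"),
        Event("Power Failure"),
        Event("Hardware Failure"),
    ]),
    Event("Control Plane Failure", [
        Event("Power Failure"),
        Event("Misconfiguration"),
        Event("Routing Brain Failure"),
    ]),
])

print(repeated_events(draft))  # ['Power Failure']
```

Matching on names only works if the same physical event is always given the same label, so consistent naming matters when drawing the tree.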
In the Forwarding Plane Failure branch of our tree, we take a closer look at each contributing event.
Our router is connected via two leased circuits, each with its own SLA and subsystems, so this event should be developed further.
Since we have two separate (and supposedly independent) network links, the first event should be further developed.
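To see why the independence of the two links matters, here is a small sketch of how such an event might later be quantified. It assumes the router only loses connectivity when both links are down (an AND gate) and that the link failures really are independent. The function name and the probability values are placeholders, not measurements.

```python
import math


def and_gate(probabilities):
    """Probability that all inputs fail, assuming independent input events."""
    return math.prod(probabilities)


# Placeholder values only; real figures would come from SLAs or history.
p_link_a = 0.01
p_link_b = 0.02

p_both_links_down = and_gate([p_link_a, p_link_b])
print(f"{p_both_links_down:.6f}")
```

If the two circuits share a conduit, a carrier, or a power feed, the independence assumption breaks and the simple product understates the risk, which is one more reason to develop the event further.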
We could probably assign a probability to power failing for the entire facility based on service level agreements and historical data, but many separate components make up this event, so it should be developed further. We will not treat it as a basic event.
The Blackhole Networks Megabox 2000 Turbo++ is a modular system, so it makes sense to break out each component that could cause a hardware failure. This event therefore needs to be developed further.
Inspecting the Control Plane Failure branch in more depth:
Looking at the power failure, this is the exact same power failure that could take out the forwarding plane, which makes it a repeated event. To avoid this repetition, the power failure event can be moved up one level to become a main contributing event of our top event, since it has the same consequences in both branches and sits at the same level of our tree.
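A short numeric sketch shows why the repetition is worth removing. It assumes every gate involved is an OR gate over independent inputs, and all probabilities are placeholders chosen only for illustration. Evaluating the tree with the power failure left in both branches treats one physical event as two independent ones and inflates the result.

```python
import math


def or_gate(probabilities):
    """Probability that at least one input fails, assuming independence."""
    return 1.0 - math.prod(1.0 - p for p in probabilities)


# Placeholder values, not measurements.
p_power = 0.1           # facility power failure (shared by both branches)
p_forwarding_rest = 0.2  # everything else under Forwarding Plane Failure
p_control_rest = 0.3     # everything else under Control Plane Failure

# Naive: power failure repeated inside each branch.
naive = or_gate([
    or_gate([p_power, p_forwarding_rest]),
    or_gate([p_power, p_control_rest]),
])

# Adjusted: power failure moved up as a contributor to the top event.
adjusted = or_gate([p_power, p_forwarding_rest, p_control_rest])

print(f"naive={naive:.4f} adjusted={adjusted:.4f}")
```

With these placeholder inputs the naive figure comes out higher than the adjusted one, because the shared power failure is counted twice.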
It may be difficult to model the misconfiguration event in any more depth, as a human is a pretty complex system, and humans come in lots of different flavors, from very stupid and foolish to utterly brilliant. Also, more than one human is probably involved in managing and maintaining the router, which makes it even more difficult to model. The only way to put a number on this event would be to look at historical data from reporting and ticketing systems, looking for past isolations that involved a misconfiguration, and assuming that history will repeat itself. Depending on the types of events that are recorded in the ticketing system, it may be possible to break this figure down further into human error that led to component or system failure. Hopefully the ticketing system is honest, and when an event led to a failure of some sort, the true cause was recorded. For the purposes of this short example, this will be treated as a basic event, which is denoted by a small circle underneath the event box.
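One possible way to turn ticket history into a number is sketched below. It treats the basic-event probability as the fraction of the observation window spent in outages attributed to a given cause, which is only one of several reasonable interpretations. The ticket fields (`cause`, `outage_hours`), the function name, and the sample records are all hypothetical and made up for illustration.

```python
def outage_fraction(tickets, cause, observation_hours):
    """Fraction of the observation window lost to outages with this cause."""
    downtime = sum(t["outage_hours"] for t in tickets if t["cause"] == cause)
    return downtime / observation_hours


# Made-up ticket records; a real export would look different.
tickets = [
    {"cause": "misconfiguration", "outage_hours": 2.0},
    {"cause": "hardware", "outage_hours": 4.0},
    {"cause": "misconfiguration", "outage_hours": 1.5},
]

hours_in_year = 365 * 24  # one year of ticket history
p_misconfiguration = outage_fraction(tickets, "misconfiguration", hours_in_year)
print(f"{p_misconfiguration:.6f}")
```

The estimate is only as good as the cause field in the tickets, so mislabeled or unrecorded incidents feed directly into the result.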
Since we have two Routing Brains running the control plane, this needs to be developed further.
Our cleaned up and adjusted fault tree diagram looks as follows: