It's common these days to re-purpose a quote or a platitude from one domain into another and assume it's applicable to the second domain. My favorite recent one is
"Layers of redundancy are the central risk management property of natural systems" - Taleb
Taleb is the author of The Black Swan, about long-tailed statistical processes in the financial domain. These Black Swans tend to bite you when you least expect it. Are there Black Swans in the software development domain? Only if you're not looking. Financial systems are rarely engineered to perform in specific ways. Software systems are, at least where I work, and I suspect everywhere someone is paying money for the system to be developed or acquired.
So let's look at the Taleb quote that is often re-quoted by agile people and especially those advocating no estimates.
First, some full disclosure. One of my graduate degrees is in Systems Management, which is a combination of Systems Engineering and Finance. I also work with systems engineers and support systems engineering processes in the aerospace and defense domain. So I'm predisposed to view the work through the eyes of Systems Engineering. Everything is a System is a good starting point for what we do.
Now let's look at the Taleb quote through the eyes of Systems Engineering and the software systems that are engineered in the domain where we work. There are many kinds of redundancy found in our systems. To avoid falling victim to the platitudes that abound in the agile and No estimates domains, let's start with a framing assumption.
Redundancy provides resiliency to the system to withstand disruption within acceptable degradation parameters and to recover within an acceptable time and composite costs and risks.
In Taleb's (financial trading systems) domain resilience is desirable, as it is in software intensive systems: the software systems that fly the airliner you ride on, manage the trains, process credit card transactions, control air traffic, and manage the moving parts of your car. Any system where software is the dominant component for the proper functioning of the product or service also requires resiliency.
But redundancy is not the only way to do this. And many times redundancy is very expensive, and creates less resiliency. - Foolproof: Why Safety Can Be Dangerous and How Danger Makes Us Safe, Greg Ip, Little, Brown, 2015
There are rules for assessing the resiliency that results from approaches beyond just redundancy. There are many other system design aspects that provide resiliency.
- Absorption rule - is a buffering characteristic that prevents overload of the system. Redundancy can provide this protection. The Microsoft Always On product provides this as well as other resiliency and redundancy capabilities.
- Limit Degradation support rule - provides a lower limit to which the system can degrade before failing. This is the circuit breaker for your home. Also the circuit breaker for the stock exchange.
- Margin Support Rule - margin is added to the system to protect from disruptions. This can be schedule margin, cost margin, technical performance margin, operational margin. Any kind of margin that allows the system to continue to operate properly inside the range of parameters.
This notion of margin is absent from Agile development. And the result is when things go wrong, you're late, over budget and the product doesn't work. To have margin we must be able to estimate how much margin is needed. Too much margin is a waste. Too little margin will not protect the system from disruption.
- Physical Redundancy rule - buy two in case one breaks was the request when I first started writing code for a Ballistic Missile Defense radar system. We were buying the original Sun I cards to replace legacy computers. I went on from there to work at a Triple Redundant process control startup as the Software Manager, where we developed a physically redundant computer and software system in the petro-chem and nuclear power domain. Fault-Tolerant System Reliability in the Presence of Imperfect Diagnostic Coverage describes how that triple redundancy was protected through real-time fault detection and dynamic reconfiguration of the hardware components.
- Functional Redundancy rule - is sometimes called design diversity and avoids the vulnerabilities of Physical Redundancy.
- Layers of Defense rule - states that for a failure to occur a disturbance has to penetrate a series of layers, similar to layers of Swiss cheese. The system has holes, like the holes in Swiss cheese, that allow a failure to penetrate to the next layer, where it can be handled.
So when we hear a platitude like "Layers of redundancy are the central risk management property of natural systems," ask what kind of redundancy, and what kind of fault handling and response processes. In fact, ask first: is that quote, used as a platitude, even applicable in the domain of interest? Or is it just a phrase picked up and repeated with little or no understanding of the principles, practices, or processes to which it CAN be applied?
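To make the Physical Redundancy rule a little more concrete, here is a minimal sketch of a majority voter across redundant channels, with a crude form of fault detection that flags the channels that disagree. It is an illustration only, not the design of any system mentioned above: the `vote` function and the three-channel readings are hypothetical, and a real voter would typically compare values within a tolerance and feed the suspects into a reconfiguration step.

```python
from collections import Counter
from typing import Hashable, Sequence


def vote(readings: Sequence[Hashable]) -> tuple[Hashable, list[int]]:
    """Return the majority value and the indexes of channels that disagree.

    Raises ValueError when no value is reported by a strict majority
    of the channels, since the voter then cannot mask the fault.
    """
    if not readings:
        raise ValueError("no channels to vote on")

    # Most frequently reported value and how many channels reported it.
    value, count = Counter(readings).most_common(1)[0]

    # A strict majority is required: 2 of 3, 3 of 5, and so on.
    if count * 2 <= len(readings):
        raise ValueError("no majority among channels")

    # Channels that disagree with the majority are fault suspects.
    suspects = [i for i, r in enumerate(readings) if r != value]
    return value, suspects


# Three hypothetical channels; channel 1 has failed and reports 17.
value, suspects = vote([42, 17, 42])
print(value, suspects)  # prints: 42 [1]
```

Even this toy version shows why redundancy alone is not the whole story. The voter masks one bad channel, but if it stays in service and a second channel fails, there is no majority left, which is where fault detection and reconfiguration come in.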
[1] The Theory and Practice of Reliable System Design, Daniel Siewiorek and Robert Swarz