There have been several posts recently about the use of the Normal Distribution for project management, risk, and related topics. I first came across this issue with Mike Cohn's Agile Estimating and Planning. My review pointed out the misuse of the "normal" distribution for project work. Recently Jurgen Appelo's The Normal Fallacy contained similar "errors in understanding," which John Goodpasture pointed out.
Here's the core problem, which seems to be misunderstood even in mature guidance like the GAO Cost Estimating and Assessment Guide, where the "normal" distribution is used to show the "return to the mean" for cost estimates.
First, some background - edited from a Herman Rubin newsgroup post on health diabetes.
The use of the normal distribution as a model for population characteristics comes from confusion, partially created by statistics books designed for those who do not know enough mathematics. Even those for actuaries are so designed.
The normal distribution might be appropriate for statistical mechanics, where the distribution of particle momenta is "shared" among 10^20, but it is rarely appropriate for real populations. Some statistical procedures based on normality work quite well without it; others definitely not.
Transforming to normality to get a scale or to set limits is pure ignorance. The assumptions in medicine should come from knowledge of biochemistry, or observation, but not in trying to use the normal distribution, nor to declare "normal" to be the middle 95%.
The next critical gap comes from misreading - or not reading - the generating function for the normal distribution.
- The distribution is symmetric - rarely are actual populations symmetric.
- The members of the population must be statistically independent - rarely is this the case in engineering or software development. It can be true in the statistical mechanics of gases.
- The "average" is almost never meaningful in absence of the variance. The Probability Management paper is an easy read on this topic.
- "Randomness" is not usually the case in practice. For a list of 1,000,000 random digits you can download a book RAND published in 1955. Or you can use a Monte Carlo Simulator for these. But most simple MCS in Excel or VBA do not generate sufficiently independent numbers for actual use.
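To make the symmetry and "average without variance" points concrete, here is a minimal Python sketch. It assumes a single task whose duration follows a right-skewed triangular distribution; the 8, 10, and 20 day figures and the function name are invented for illustration and do not come from any real project.

```python
import random
import statistics

def sample_task_durations(n, low=8.0, mode=10.0, high=20.0, seed=42):
    """Draw n durations from a right-skewed triangular distribution.

    The low/mode/high values are hypothetical, chosen only to illustrate skew.
    """
    rng = random.Random(seed)
    # Note the argument order of random.triangular: (low, high, mode).
    return [rng.triangular(low, high, mode) for _ in range(n)]

durations = sample_task_durations(100_000)

mean = statistics.fmean(durations)
median = statistics.median(durations)
stdev = statistics.stdev(durations)

# Fraction of outcomes that finish at or before the "most likely" value.
p_by_mode = sum(d <= 10.0 for d in durations) / len(durations)

print(f"most likely = 10.0, median = {median:.2f}, mean = {mean:.2f}")
print(f"standard deviation = {stdev:.2f}")
print(f"chance of finishing by the most likely value = {p_by_mode:.0%}")
```

In a symmetric distribution the most likely value, the median, and the mean coincide. Here they do not, and the "most likely" duration is met well under half the time, which is why quoting a single number without its spread says very little.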
Why is this important - other than to have the correct understanding of how statistics work? One place to look is the Columbia Accident Investigation Board report. The other place to look is at any project you are working on that does not use probabilistic estimates of cost, schedule, and technical performance. Any forecasts produced in the absence of this approach are likely wrong. Sometimes seriously wrong.
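As a rough illustration of what a probabilistic schedule estimate looks like, the sketch below runs a small Monte Carlo simulation over three sequential tasks. Everything in it is an assumption made for the example: the three-point estimates are invented, the triangular shape is just a convenient stand-in, and the tasks are sampled independently, which, as noted above, is rarely true on real projects.

```python
import random

# Hypothetical three-point estimates in days:
# (optimistic, most likely, pessimistic) for three sequential tasks.
TASKS = [(4, 5, 10), (8, 10, 20), (2, 3, 9)]

def simulate_totals(tasks, trials=20_000, seed=7):
    """Return sorted total durations from independent triangular draws."""
    rng = random.Random(seed)
    totals = []
    for _ in range(trials):
        # random.triangular takes (low, high, mode), so reorder the estimate.
        totals.append(sum(rng.triangular(lo, hi, ml) for lo, ml, hi in tasks))
    return sorted(totals)

def percentile(sorted_values, p):
    """Simple nearest-rank style percentile on an already sorted list."""
    index = min(len(sorted_values) - 1, int(p * len(sorted_values)))
    return sorted_values[index]

totals = simulate_totals(TASKS)

# The single-point forecast: add up the "most likely" values.
point_estimate = sum(ml for _, ml, _ in TASKS)

p50 = percentile(totals, 0.50)
p80 = percentile(totals, 0.80)
chance_of_point_estimate = sum(t <= point_estimate for t in totals) / len(totals)

print(f"sum of most likely values: {point_estimate} days")
print(f"50% confidence: {p50:.1f} days, 80% confidence: {p80:.1f} days")
print(f"chance of meeting the point estimate: {chance_of_point_estimate:.0%}")
```

Under these assumptions, the sum of the "most likely" values is met only a small fraction of the time. Correlated tasks, the more realistic case, would spread the totals further and widen the gap between the 50% and 80% figures.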
Harvard President Lawrence Lowell wrote in 1909 that statistics,
"...like veal pies, are good if you know the person that made them, and are sure of the ingredients."
Same goes for those using statistics in their attempt to convey information.