There was a recent post titled "there is no average." From the mathematical point of view, for any two numbers there is always an "average." This possible language gap aside, here's the problem with using the term "average," and it is probably what the poster is getting at. The "Normal (Gaussian) Fallacy" originates from a gap in speaking in statistical terms. This starts with the misunderstanding between Probability and Statistics - the statistics of earthquakes, blog posts, and the like. The samples that form the "sample space" of the probability density function (pdf) may or may not be symmetric, and in the absence of a very large number of samples they are most likely not "Gaussian" shaped.

As well, the Pareto concept does not speak about the shape of the curve - the shape of the pdf. It says only that 20% of the "value under the curve" has 80% of the impact. Pareto is not a statistical discussion; it is a discussion about the impact of the contents of the pdf.
Next comes the understanding that Mean, Mode, and Median are the words that should be used. Mean = Average. But as suggested, the average is not that useful without the other parts of the statistical model. The average is ALWAYS there, though. If you have two numbers it can't NOT be there.
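To keep the three straight, here is a minimal sketch using Python's standard library (the durations list is made up for illustration):

```python
import statistics

# Made-up, right-skewed sample of task durations in days
durations = [5, 5, 6, 6, 6, 7, 8, 9, 12, 20]

print(statistics.mean(durations))    # the "average" -> 8.4
print(statistics.median(durations))  # the middle value -> 6.5
print(statistics.mode(durations))    # the most frequent value -> 6
```

On a skewed sample the three tell different stories, which is exactly why the "average" alone isn't enough.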
For example, when task durations are mentioned in PERT calculations, the MODE is the same as the MOST LIKELY. This is the number that occurs most often when a sample of numbers is taken from the same sample space. There is a story we use when we are training Control Account Managers (CAMs) on the topic of programmatic risk.
I'm going to send you to Trinidad and Tobago for one year. You'll have a clipboard, a beach chair, sunscreen, a hat, and enough food and beer to last you 365 days. Every day you'll look at the temperature gauge on the beach shack and write down the high temp for the day.
Next you'll pick up your ticket for Cody, Wyoming. Bring your beach chair, sunscreen, hat, and some more clothes. You'll sit outside the Big Bear Motel and record the high temperature of the day for 365 days.
After these two years, bring your clipboard back to Denver and we'll do some statistical calculations. We'll find that the MOST LIKELY number, the number that occurs most often, is pretty close in both locations - somewhere between 78 and 81 degrees F.
It's the VARIANCE that is the issue. Don't go to Cody in January in your shorts.
The next "understanding" from this data is that "Single Point Estimates" are always wrong in the absence of the VARIANCE. Why? Here's why.
Each activity in a project, a blog, a sample of temperatures, or ANY physical process has a probability distribution associated with its behavior. The random variable could be duration, cost, reliability, or performance. Knowing the probability distribution, we can calculate the mean and the variance. From these two numbers we can have insight into what the numbers mean.
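As a sketch of why that matters, assume (purely for illustration) a triangular distribution standing in for the real, unknown one behind a task estimate:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical PERT-style estimates for one task, in days
optimistic, most_likely, pessimistic = 8.0, 10.0, 20.0

# Monte Carlo draws from a triangular distribution with that mode and those bounds
durations = rng.triangular(optimistic, most_likely, pessimistic, 100_000)

print(f"most likely (the single-point estimate): {most_likely:.1f} days")
print(f"mean of the process: {durations.mean():.1f} days")  # pulled to the right by the long tail
print(f"standard deviation:  {durations.std():.1f} days")   # the spread the single point hides
```

The "most likely" answer of 10 days says nothing about the mean sitting a couple of days above it, or about the spread around either number.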
We need to make sure we don't fall into the "fooled by randomness trap." The statement:
Yet, in a complex environment, all events are normal. Both the common ones and the uncommon ones.
Is simply NOT true, without several validation steps. Events are "normal" ONLY if the sample population has been "tested" for normal. This is true ONLY if the sample population follows the rule:
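The rule in question is, presumably, the Gaussian probability density function:

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^{2}}{2\sigma^{2}}}$$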
Please note, this standard "normal" condition is true ONLY when the mean μ is zero (a normalized, zero-mean variable) and the standard deviation σ is 1. This is almost NEVER the case in real-world probabilistic processes - stochastic processes. Earthquakes, blog posts, task durations, the quality of code checked into the server.
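Testing a sample for normality is straightforward if SciPy is at hand. A minimal sketch, where the lognormal sample is a made-up stand-in for real observations:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Stand-in data: 200 "task durations" drawn from a skewed (lognormal) process
sample = rng.lognormal(2.0, 0.5, 200)

# D'Agostino-Pearson test of the null hypothesis "this sample came from a normal distribution"
stat, p_value = stats.normaltest(sample)
print(f"p-value = {p_value:.4g}")  # a small p-value says: do not treat this sample as normal
```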
So let's look at customer demand. This notion of "demand" is a mature statistical field. It's called queuing theory. And the Poisson distribution is a good starting point. Siméon Denis Poisson was French, like many good mathematicians in the 19th century. The famous textbook application of his distribution - counting Prussian cavalry soldiers killed by horse kicks - came later, from Ladislaus Bortkiewicz. The distribution shows up in many "real world" situations. The number of times your porch light burns out over a period of time is Poisson - the longer you wait, the more failures you should expect. Poisson models are "arrival rate" models. The number of people in the bank line is Poisson, the failures of a pump in a paper mill are Poisson. Simple models of computer failure, or even software failure - remember I said SIMPLE - are Poisson.
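The arrival-rate idea can be written down directly. If events arrive independently at an average rate of λ per interval, the probability of seeing exactly k of them in that interval is

$$P(k) = \frac{\lambda^{k} e^{-\lambda}}{k!}$$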
When it is said...
If you assume that customer demand has an average, based on a limited sample of earlier events, you will inevitably be surprised that some future requests are outside of your expected range.
Yes you will be surprised if you assume the average without the variance and the arrival probability distribution. Remember:
Only Fools are Fooled by Randomness
You won't be surprised if you have a Poisson distribution generating function on hand. And hey, Excel has one.
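Here is a minimal sketch with SciPy standing in for the Excel function; the 12-requests-per-hour rate is an assumed example, not data from anywhere:

```python
from scipy.stats import poisson

lam = 12.0  # assumed "average" demand: 12 requests per hour

# Probability of seeing exactly the average number of arrivals in an hour
print(poisson.pmf(12, lam))

# Probability that demand in an hour exceeds twice the average
p_exceed = poisson.sf(24, lam)  # P(X > 24)
print(p_exceed)

# Over a year of around-the-clock operation, even a "rare" hour is expected to show up
hours_per_year = 24 * 365
print(p_exceed * hours_per_year)  # expected number of such hours per year
```

With that tail probability in hand, an hour of demand far above the "average" is a quantified expectation, not a surprise.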
So in the end when we speak about things that are "random" there are several preconditions that must be in place before we can extract any actionable information.
- Is there a known probability distribution for the observed process? If not, what tools can we use to make decisions in the absence of a probability distribution? There are lots of tools, but we first must acknowledge whether there is or is not a distribution.
- If there is a distribution, what does it look like? If we don't know, there are tools for that as well.
- With the distribution, can we determine some of its attributes? These are called cumulants. The first two cumulants turn out to be the expectation and the variance of the random variable, but higher-order shape measures like skew and kurtosis are useful as well. They depict the "shape" of the probability distribution (both are computed in the sketch after this list).
- Like all good things in statistics, there is a "test" to determine the attributes of a distribution. The Kolmogorov–Smirnov test (K–S test) is a nonparametric test for the equality of continuous, one-dimensional probability distributions. It can be used to compare a sample with a reference probability distribution (one-sample K–S test) or to compare two samples (two-sample K–S test). So if I have a sample and its pdf, I can ask, "is this like another known sample's distribution?" From there I can start to assess the impact of my new set of samples.
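A minimal sketch of the last two bullets, using SciPy and made-up gamma-distributed samples as stand-ins for real observations:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Stand-in observations: two batches of "task durations" from a skewed process
observed = rng.gamma(2.0, 3.0, 500)
reference = rng.gamma(2.0, 3.0, 500)

# The first two cumulants: expectation and variance; skew and kurtosis describe the shape
print(np.mean(observed), np.var(observed))
print(stats.skew(observed), stats.kurtosis(observed))

# One-sample K-S test: is the observed sample consistent with a normal of the same mean and std?
stat, p = stats.kstest(observed, "norm", args=(observed.mean(), observed.std()))
print(p)   # a small p-value -> not well described by that normal

# Two-sample K-S test: do the two empirical distributions look alike?
stat2, p2 = stats.ks_2samp(observed, reference)
print(p2)  # a large p-value -> no evidence they differ
```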
All this "stuff" is available in the seminal work of Leonard Kleinrock, Queueing Systems, Volume 1 and Volume 2. I still use my copies from grad school, circa 1980.
It's breathtaking what these books cost today. Mine from 1979/1980 were something like $27.00. On Amazon they are something like $120 each. But when you start to ask important questions about your customers, the arrival of requests for service on your web site, the throughput of a channel - electronic or the road - or packets traversing the ether, this is the place to start the conversation. There is a huge amount of "bad information" out there. Little's Law is one example, used by Lean and Agile proponents, usually in the wrong ways. So find these books, or similar ones, and become a "queuing geek."
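For reference, the law itself is short. For a stable system observed over the long run,

$$L = \lambda W$$

where L is the average number of items in the system, λ is the average arrival rate, and W is the average time an item spends in the system. It is a statement about long-run averages of a system in steady state; checking those conditions is part of using it correctly.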