In a recent post on forecasting for capacity planning, a time series of data was used as the basis of the discussion. Some static statistics were then presented, along with a discussion of the upper and lower ranges of the past data. The REAL question, though, is: what are the likely outcomes for data in the future, given the past performance data? That is, if we recorded what happened in the past, what is the likely data in the future?
The average and the upper and lower ranges from the past data are static statistics. That is, all the dynamic behavior of the past is wiped out in the averaging and deviation processes, so that information can no longer be used to forecast the possible outcomes of the future.
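To see why, here's a small R sketch (not from the original post): two series built from the same values have identical averages and standard deviations, yet completely different futures.
# Two series with the same static statistics but different dynamics.
set.seed(42)
trend = 1:100 + rnorm(100)        # a steadily rising series
shuffled = sample(trend)          # the same values in random order
c(mean(trend), mean(shuffled))    # identical means
c(sd(trend), sd(shuffled))        # identical standard deviations
# A forecast of trend keeps rising; a forecast of shuffled goes nowhere.
# The average and deviation alone cannot tell these two futures apart.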
This is one of the themes of The Flaw of Averages and How to Lie With Statistics, two books that should be on every manager's desk. That is, every manager tasked with making decisions in the presence of uncertainty when spending other people's money.
We now have a Time Series and can ask the question: what is the range of possible outcomes in the future, given the values in the past? This can easily be done with a free tool - R. R is a statistical programming language, available for free from the Comprehensive R Archive Network (CRAN). In R there are several functions that can be used to make these forecasts. That is, to estimate the values in the future from the past, along with their confidence intervals.
Let's start with some simple steps:
- Record all the data from the past. For example, make a text file of the values in the first chart. Name that file NE.Numbers.
- Start the R tool. Better yet, download an IDE for R - RStudio is one. That way you have a development environment for your statistical work. There are also many free R books on statistical forecasting - estimating outcomes in the future.
- OK, read the Time Series of raw data from the file of values and assign it to a variable:
- NETS=ts(NE.Numbers)
- The ts function converts the raw data into an object - a Time Series - that can be used by the next function.
- With the Time Series now in the right format, apply the ARIMA function. ARIMA is Autoregressive Integrated Moving Average, also known as the Box-Jenkins algorithm. That is George Box of the famously misused and blatantly abused quote all models are wrong, some models are useful. If you don't have the full paper that quote came from, and the book Time Series Analysis: Forecasting and Control, Box and Jenkins, please resist re-quoting it out of context. That quote has become the meme for those not having the background to do the math for time series analysis, and a mantra for willfully ignoring the math needed to actually make estimates of the future - forecasting - using time series of the past in ANY domain. ARIMA is the beginning basis of all statistical forecasting in science, engineering, and finance.
- The ARIMA algorithm has three parameters - p, d, q
- p is the order of the autoregressive model.
- d is the degree of the differencing.
- q is the order of the moving average model.
- The manual page for ARIMA is available in R via ?arima.
- With the original data turned into a Time Series and presented to the ARIMA function, we can now apply the Forecast function from the forecast package. That package provides methods and tools for displaying and analyzing univariate time series forecasts, including exponential smoothing via state space models and automatic ARIMA modelling. (A short sketch of the automatic approach follows this list.)
- When applied to the ARIMA output we get a Forecast series that can be plotted.
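Before walking through the exact commands, here's the automatic alternative just mentioned - a minimal sketch using the forecast package's auto.arima, which searches for p, d, and q itself (assuming the NE.Numbers file from the first step):
install.packages("forecast")     # one-time install from CRAN
library(forecast)
NE.Numbers = scan("NE.Numbers")  # read the raw values from the text file
NETS = ts(NE.Numbers)            # convert to a time series
fit = auto.arima(NETS)           # searches over p, d, q and picks the best fit by AICc
summary(fit)                     # report the chosen order and coefficients
plot(forecast(fit))              # forecast with 80% and 95% bands
Hand-picking the order, as done below, is still useful for understanding what the model is doing.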
Here's what all this looks like in RStudio:
library(forecast)                        # provides the forecast() function and its plot method
NE.Numbers = scan("NE.Numbers")          # read the raw values from the text file
NETS = ts(NE.Numbers)                    # convert the raw numbers to a time series
NETSARIMA = arima(NETS, order=c(0,1,1))  # fit an ARIMA(0,1,1) model
NEFORECAST = forecast(NETSARIMA)         # make a forecast using that model
plot(NEFORECAST)                         # plot it
Here's the plot, with the time series from the raw data and the 80% and 95% confidence bands on the possible outcomes in the future.
The Punch Line
You want to make decisions with other people's money when the 80% confidence interval on a possible outcome is itself a -56% to +68% variance? Really? Flipping coins gives a better probability of an outcome landing inside all the possible outcomes that happened in the past. The time series is essentially a random series with very low confidence of being anywhere near the mean. This is the basis of The Flaw of Averages.
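If you want those percentages as numbers rather than reading them off the plot, they can be pulled from the forecast object built above - a minimal sketch (the -56%/+68% figures themselves come from the original post's data):
# forecast() returns point forecasts in $mean and the interval bounds in
# $lower and $upper, one column per confidence level (column 1 is 80% by default).
rel.low = (NEFORECAST$lower[, 1] - NEFORECAST$mean) / NEFORECAST$mean
rel.high = (NEFORECAST$upper[, 1] - NEFORECAST$mean) / NEFORECAST$mean
round(100 * range(rel.low, rel.high))  # the 80% band as percent variance around the forecast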
Where I work, this would be a non-starter if we came to the Program Manager with a forecast of the Estimate to Complete based on an average with that wide a variance.
It's possible where there is low value at risk, a customer that has little concern for cost and schedule overruns, and maybe where the work is actually an experiment with no deadline, not-to-exceed budget, or any other real constraint. But if your project has a need date for the produced capabilities - a date when those capabilities need to start earning their keep and producing value that can be booked on the balance sheet - then a much higher confidence in what the future NEEDS to be is likely going to be the key to success.
The Primary Reason for Estimates
First, estimates are for the business. Yes, developers can use them too. But the business has a business goal: make money at some point in the future on the sunk costs of today - the breakeven date. Those costs are recoverable - hopefully - so we need to know when we'll be even with our investment. This is how business works. Businesses make decisions in the presence of uncertainty - not on the opinion of developers saying we recorded our past performance as an average and projected that into the future. No, they need a risk-adjusted, statistically sound level of confidence that they won't run out of money before breakeven. What this means in practice is a management reserve and cost and schedule margin to protect the project from the naturally occurring variances and the probabilistic events that derail all the best laid plans.
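As a hypothetical sketch of that question (all figures here are made up for illustration, not from any real project), here's the kind of Monte Carlo the CFO's office runs to size that reserve:
# Probability of running out of money before breakeven,
# given an uncertain monthly burn rate.
set.seed(1)
budget = 1200       # total funds available (illustrative units)
breakeven = 12      # month at which revenue is planned to cover cost
trials = 10000
burn = matrix(rnorm(trials * breakeven, mean = 90, sd = 20), nrow = trials)
spent = apply(burn, 1, cumsum)         # cumulative spend per trial (months x trials)
ran.out = apply(spent > budget, 2, any)
mean(ran.out)                          # estimated probability of running out before breakeven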
Now developers may not think like this. But someone, somewhere in a non-trivial business does. Usually in the Office of the CFO. This is called Managerial Finance, and it's how firms with serious money at risk manage.
So when you see time series like those in the original post, do your homework and show the confidence of the probability of the needed performance actually showing up. And by needed performance I mean the steering target used in the closed loop control system used to increase the probability that the planned value - which the Agilists so dearly treasure - actually appears somewhere near the planned need date and somewhere around the planned cost, so the Return on Investment for those paying for your work doesn't come up negative and get their spend labeled as underwater.
So What Does This Mean in the End?
Even when you're using past performance - one of the better ways of forecasting the future - you need to give careful consideration to those past numbers. Averages and simple variances - which wipe out the actual underlying time series dynamics - are not only naive, they are bad statistics used to make bad management decisions.
Add to that the poorly formed notion that decisions about future outcomes in the presence of uncertainty can be made in the absence of estimates about that future, and you've got the makings of management disappointment. The discipline of estimating future outcomes from past behavior is well developed. The mathematics, and especially the terms used in that mathematics, are well established. Here are some sources we use in our everyday work. These are not populist books; they are math and engineering. They have equations, algorithms, and code examples. They are used when the value at risk is sufficiently high that management is on the hook for meeting the performance goals in exchange for the money assigned to the project.
If you work on a project that doesn't care too much about deadlines, budget overages, or what gets produced beyond the minimal products, then these books and related papers are probably not for you. Most likely, Not Estimating the probability that you'll overspend, show up seriously late, or fail to produce the needed capabilities to meet the Business Plans will be just fine. But if you are expected to meet the business goals in exchange for the spend plan you've been assigned, these might be a good place to start to avoid being a statistic (a dead skunk in the middle of the road) in the next Chaos Report (no matter how poor its statistics are).
This, by the way, is an understanding I came to on the plane flight home this week. #NoEstimates is a credible way to run your project when those conditions are in place. Otherwise you may want to read how to make credible forecasts of what the cost and schedule are going to be for the value produced with your customer's money, assuming they actually care about not wasting it.
- Time Series, 3rd Edition, Sir Maurice Kendall and J. Keith Ord
- Applied Regression Analysis, 3rd Edition, Norman R. Draper and Harry Smith
- An Introduction to Probability Theory and Its Applications, William Feller
- Estimating Software-Intensive Systems: Projects, Products, and Processes, Richard D. Stutzke
- Forecasting: Methods and Applications, Spyros Makridakis, Steven C. Wheelwright, and Rob J. Hyndman