Confidence intervals are a means of estimating population parameters. A central concern in inferential statistics (making a prediction from a sample of data or from a model of that data) is the estimation of the population parameter from the sample statistic.
The sample statistic is calculated from the sampled data and the population parameter is estimated from this sample statistic.
- Statistics are calculated - the data we are looking at, for example the time series of values in a project, are used in a calculation.
- Parameters are estimated - a parameter of the underlying population is then estimated from those calculated statistics. This estimate has a confidence interval. From this estimate we can make inferences.
One issue in making inferences - estimating - is sample size determination. How large a sample do we need to make an accurate estimate? This matters because small sample sizes produce very unreliable inferences. For example, sampling 27 stories in an agile project and making an inference about how the remaining stories are going to behave is very sporty business.
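To make the sample size point concrete, here is a minimal R sketch of how the maximum error of a t-based interval for a mean shrinks as the sample grows. It assumes the sampled quantity is roughly normal. The function name `margin_of_error` and the standard deviation of 5 are illustrative assumptions, not numbers from any real project.

```r
# Half-width (maximum error) of a t-based confidence interval for a mean.
# Assumption: observations are roughly normal with sample standard deviation s.
margin_of_error <- function(n, s, conf = 0.95) {
  alpha <- 1 - conf
  qt(1 - alpha / 2, df = n - 1) * s / sqrt(n)
}

s <- 5                    # hypothetical standard deviation, e.g. days per story
n <- c(5, 27, 100, 400)   # candidate sample sizes

# One row per sample size, with the resulting maximum error.
data.frame(n = n, E = round(margin_of_error(n, s), 2))
```

The maximum error falls off roughly as 1/sqrt(n), so cutting it in half takes about four times as many observations.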
To have a good estimator, that is, to make good estimates from sampled or simulated data, the estimator must be:
- Unbiased - the expected value of the estimator must be equal to the true value of the parameter.
- Consistent - the value of the estimator approaches the value of the parameter as the sample size increases.
- Relatively Efficient - the estimator has the smallest variance of all the estimators that could be used.
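The first two properties can be checked by simulation. The sketch below assumes a made-up normal population with mean 10 and variance 4, and compares the usual variance estimator (dividing by n - 1) with a hypothetical helper `var_n` that divides by n. All names and numbers here are illustrative.

```r
set.seed(42)   # arbitrary seed so the run is repeatable

# Hypothetical population: normal with known mean 10 and variance 4.
true_mean <- 10
true_sd   <- 2

# Variance estimator that divides by n instead of n - 1 (biased).
var_n <- function(x) mean((x - mean(x))^2)

# Draw many small samples (size 5) and average each estimator over the draws.
reps    <- 20000
samples <- replicate(reps, rnorm(5, true_mean, true_sd))   # 5 x reps matrix

est_mean  <- mean(apply(samples, 2, mean))    # near 10: sample mean is unbiased
est_var   <- mean(apply(samples, 2, var))     # near 4: dividing by n - 1 is unbiased
est_var_n <- mean(apply(samples, 2, var_n))   # near 3.2: dividing by n underestimates
c(est_mean, est_var, est_var_n)

# Consistency: the sample mean settles toward the true mean as n grows.
sapply(c(10, 1000, 100000), function(n) mean(rnorm(n, true_mean, true_sd)))
```

With samples of size 5, dividing by n gives an expected value of (n - 1)/n times the true variance, which is why the third number sits near 3.2 rather than 4.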
The point estimate differs from the population parameter because of sampling error, and there is no way to know how close it is to the actual parameter. Because of this, statisticians give an interval estimate: a range of values used to estimate the parameter.
What's the cost of this project going to be when we're done with all our efforts, given we have done some work so far?
The confidence interval is an interval estimate with a specific level of confidence. The level of confidence is the probability that the interval estimate will contain the parameter. It is written 1 - α, where 1 - α is the area that lies within the confidence interval. The maximum error of the estimate, E, is half the width of the confidence interval.
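One way to see what the level of confidence means is to simulate many samples from a population whose mean we already know and count how often the interval contains it. This is a sketch under stated assumptions: the population (normal, mean 100, standard deviation 15), the sample size of 30, and the helper name `covers` are all invented for illustration.

```r
set.seed(2024)     # arbitrary seed so the run is repeatable

true_mean <- 100   # hypothetical population mean, known only because we simulate
alpha     <- 0.05  # level of confidence is 1 - alpha

# Draw one sample of size n and report whether its interval contains true_mean.
covers <- function(n) {
  x  <- rnorm(n, mean = true_mean, sd = 15)
  ci <- t.test(x, conf.level = 1 - alpha)$conf.int
  ci[1] <= true_mean && true_mean <= ci[2]
}

# Fraction of 10,000 simulated intervals that contain the true mean.
coverage <- mean(replicate(10000, covers(30)))
coverage   # should land near 1 - alpha
```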
For a symmetric distribution, the confidence interval says that the true population parameter lies between the point estimate minus the maximum error of the estimate and the point estimate plus the maximum error of the estimate:
point estimate - E < population parameter < point estimate + E
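As a worked instance of point estimate plus or minus E, the sketch below builds a 95% interval for a mean by hand and cross-checks it against R's `t.test`. It uses the `waiting` column of R's built-in `faithful` data set (the Old Faithful data discussed further down). The t-based formula for E is my assumption about which interval is wanted; the variable names are illustrative.

```r
# Built-in Old Faithful data: waiting time to the next eruption, in minutes.
x     <- faithful$waiting
alpha <- 0.05                    # level of confidence is 1 - alpha = 0.95

n         <- length(x)
point_est <- mean(x)             # point estimate of the population mean
E         <- qt(1 - alpha / 2, df = n - 1) * sd(x) / sqrt(n)   # maximum error

ci_manual <- c(lower = point_est - E, upper = point_est + E)
ci_manual

# Cross-check against R's built-in one-sample t interval.
ci_ttest <- t.test(x, conf.level = 1 - alpha)$conf.int
ci_ttest
```

The two intervals should agree, since `t.test` applies the same formula for a one-sample mean.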
An Example from Actual Observations
While staying at the Yellowstone Lodge during the Millennium (year 2000), our kids got sick with some type of flu going around the lodge. My wife lay in bed tending them all night long, and passed the time recording data about Old Faithful erupting outside our bedroom window.
The data looked something like this:
  Eruptions  Waiting
1     3.600       79
2     1.800       54
3     3.333       74
4     2.283       62
5     4.533       85
6     2.883       55
Eruptions is the duration of the eruption of Old Faithful in minutes, and Waiting is the waiting time in minutes before the next eruption. There is a correlation between these two pieces of data. This is due to the physical processes of expelling water at high temperature and the refilling of the caverns below the surface.
If we use R as our analysis tool, we can get a sense of what is happening statistically with Old Faithful. (R code below)
> attach(faithful) # attach the data frame
> eruption.lm = lm(eruptions ~ waiting)
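Then we create a new data frame that sets the waiting time value. A waiting time of 80 minutes reproduces the fitted value shown below.
> newdata = data.frame(waiting=80) # waiting time of 80 minutes
We now apply the predict function and set the predictor variable in the newdata argument. We also set the interval type as "confidence", and use the default 0.95 confidence level.
> predict(eruption.lm, newdata, interval="confidence")
     fit    lwr    upr
1 4.1762 4.1048 4.2476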
> detach(faithful) # clean up
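For anyone who wants to run the example in one piece, here is a self-contained sketch that passes the data frame to `lm` directly instead of using attach and detach. The waiting time of 80 minutes is my assumption; it is the value that reproduces the fit of 4.1762 shown above. The sketch also asks for a prediction interval, which is not part of the original example, to show the difference between estimating the mean eruption duration and estimating a single future eruption.

```r
# Fit the same regression without attach(): pass the data frame explicitly.
eruption.lm <- lm(eruptions ~ waiting, data = faithful)

# Assumed predictor value: a waiting time of 80 minutes.
newdata <- data.frame(waiting = 80)

# Interval for the MEAN eruption duration at that waiting time.
conf_int <- predict(eruption.lm, newdata, interval = "confidence", level = 0.95)

# Interval for a SINGLE future eruption at that waiting time.
pred_int <- predict(eruption.lm, newdata, interval = "prediction", level = 0.95)

round(rbind(confidence = conf_int[1, ], prediction = pred_int[1, ]), 4)
```

Both intervals share the same fitted value, but the prediction interval is wider because it also carries the scatter of individual eruptions around the regression line.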
What Does This Mean?
It means two things:
- When we say we have 80% confidence that a parameter will assume a value, we need to know how that parameter behaved in the past.
- When we hear that we are estimating the future from the past, we MUST know about the behaviours of those past values, the size of the population, and the sample size before we can determine the confidence in the possible future outcomes. Having an average value without this data is pretty much useless in our decision making process.
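A small sketch shows why the average alone is not enough. The two samples below are invented: both are rescaled to an average of exactly 50, but they differ in spread and in sample size, and their 95% intervals come out very different. The helper `rescale` and every number in it are illustrative assumptions.

```r
set.seed(1)   # arbitrary seed so the run is repeatable

# Force a sample to have a chosen mean m and standard deviation s.
rescale <- function(x, m, s) (x - mean(x)) / sd(x) * s + m

# Two hypothetical samples that share the same average of 50.
tight <- rescale(rnorm(200), m = 50, s = 2)    # many observations, small spread
loose <- rescale(rnorm(8),   m = 50, s = 20)   # few observations, large spread

c(mean(tight), mean(loose))            # the same average for both samples

ci_tight <- t.test(tight)$conf.int     # narrow 95% interval
ci_loose <- t.test(loose)$conf.int     # wide 95% interval
rbind(tight = ci_tight, loose = ci_loose)
```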
What Does This Really Mean?
Anyone suggesting we can make decisions about future outcomes in the presence of uncertainty, and at the same time in the absence of estimating those outcomes, is pretty much clueless about basic probability, statistics, and random processes.
All project variables - the statistical parameters - are random variables driven by underlying processes, so we must estimate them using the statistical processes available in R and our high school stats book.
Footnote
When someone says "I use Bayesian statistics" or "I use Real Options", ask if they are using something like the R Tutorial Resource with Bayesian Statistics, and of course ask for the source code for the statistical processes described above. Then ask to see their data. There seem to be a lot of people tossing around words like Bayesian, Real Options, Monte Carlo, and other buzzwords without actually being able to show their work or results that can be tested outside their personal anecdotes. Sad but true.