Today the Rally chart reappeared. This chart is from "The Impact of Agile Quantified."

This chart shows a *performance index*. But a VERY critical caveat of the report is...

> The findings in this document are extracted by looking for correlations between decisions or behaviors (keeping teams stable, setting your team sizes to between five and nine, keeping your work in process—WiP—low, etc.) and outcomes as measured by the dimensions of the SDPI. As long as the correlations meet certain statistical requirements, we report them here. However, correlation does not necessarily mean causation. For example, just because we show that teams with low average WiP have one-quarter as many defects as teams with high WiP, doesn’t necessarily mean that if you lower your WiP, you’ll reduce your defect density to one-quarter of what it is now. The effect may be partially or wholly related to some other underlying mechanism.

What's missing from the chart is a critical piece of understanding: for the projects sampled, the chart doesn't show the probability density function of the samples, only a single summary value per category.

As well, the report contains some other statistical errors. For example, when the responsiveness index is used, Little's Law is mentioned. Little's Law assumes that all the arriving work is IID, that is, *Independent and Identically Distributed*. This is rare in software development. It would mean the effort (duration) of each *job* arriving in the Kanban queue is drawn from the same underlying probability distribution and each *job* is *independent* of the other *jobs*. That is the case for people arriving in line at the Bank of America to deposit checks, but not likely true for software development *jobs* arriving in the Kanban queue.
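
To see what Little's Law (L = λ·W) looks like in the textbook setting, here's a minimal simulation sketch of a single-server FIFO queue. All the numbers and distributions are invented for illustration: arrivals and service times are drawn IID from exponential distributions, which is exactly the assumption that rarely holds for software work items.

```python
import random

# Hedged sketch: a single-server FIFO queue with IID exponential
# interarrival and service times, to illustrate Little's Law L = lambda*W.
random.seed(42)

N = 5_000                       # number of jobs (notional)
lam_in, mu = 0.8, 1.0           # arrival rate < service rate, so the queue is stable

# IID exponential interarrival times -> cumulative arrival instants
arrivals, t = [], 0.0
for _ in range(N):
    t += random.expovariate(lam_in)
    arrivals.append(t)

# FIFO single server: a job starts when it arrives or when the server frees up
finish_prev, finishes = 0.0, []
for a in arrivals:
    start = max(a, finish_prev)
    finish_prev = start + random.expovariate(mu)
    finishes.append(finish_prev)

horizon = arrivals[-1]
W = sum(f - a for a, f in zip(arrivals, finishes)) / N   # avg time in system
lam = N / horizon                                        # observed arrival rate

# L: time-average number of jobs in the system, sampled on a coarse grid
samples = 500
L = sum(
    sum(1 for a, f in zip(arrivals, finishes) if a <= s < f)
    for s in (horizon * k / samples for k in range(samples))
) / samples

print(f"L = {L:.2f}, lambda*W = {lam * W:.2f}")  # the two should be close
```

When the arrival and service processes are not IID (batched work, dependent jobs, shifting distributions), the stationarity behind this tidy relationship is no longer something you can take for granted.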

The next issue: correlation is NOT causation unless a Root Cause Analysis is done to show WHY that correlation should exist.

Another *statistical* issue is the chart showing #NoEstimates versus other types of estimates. The *Performance Index* of #NoEstimates in the chart above shows differences from Scrum and Lightweight Scrum.

But the sample space of projects with #NoEstimates as their estimating technique is only 3% of the total population. This creates a small-sample-size error when those numbers are included in the overall assessment of impact.
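
Here's a minimal sketch of why that 3% slice matters, using invented numbers (a notional population of 1,000 projects with made-up index scores): the average of a handful of samples bounces around far more than the average of a well-populated category.

```python
import random
import statistics

# Hedged illustration of the small-sample problem. The population and
# scores below are invented purely to show the effect of sample size.
random.seed(7)

population = [random.gauss(12, 4) for _ in range(1000)]   # notional index scores

def sample_means(n, trials=2000):
    """Means of repeated random samples of size n from the same population."""
    return [statistics.mean(random.sample(population, n)) for _ in range(trials)]

small = sample_means(6)     # a #NoEstimates-sized slice (~3% would be tiny)
large = sample_means(200)   # a well-populated category

print(f"spread of the mean, n=6:   {statistics.stdev(small):.2f}")
print(f"spread of the mean, n=200: {statistics.stdev(large):.2f}")
```

A bar for the small group sits on a far shakier average than the bars for the large groups, yet the chart draws them with identical visual authority.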

**How to Lie with Statistics**

Let's start with a book that must be on everyone's shelf: *How to Lie With Statistics*, Darrell Huff, 1954. My copy is an original printing (I wasn't doing statistics in 1954, but I collect original-edition math and science books; the paperback is at the bookstore). In Huff's book there are 10 ways to *Lie With Statistics*.

The first chart in this post has several - *Sample with a Built-in Bias* and *Well-Chosen Average* are two for starters.

But since the raw data is not available, it's not possible to tell much of anything about the first chart, other than:

- The #Noestimates sample projects are a minuscule collection from the total population.
- All the variances between the projects are wiped out, since none of the bars has a confidence band on it.

Let's add to that a few more *Statistical Processing Issues*.

Spurious Relation (or Correlation): (a) A situation in which measures of two or more variables are statistically related (they covary) but are not in fact causally linked—usually because the statistical relation is caused by a third variable. When the effects of the third variable are removed, they are said to have been partialed out.
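
Partialing out is easy to show with a minimal sketch. All the data below is synthetic: `x` and `y` are both driven by a hidden third variable `z`, so they correlate strongly, but once `z` is removed from each, the relation largely disappears.

```python
import random
import statistics

# Hedged illustration of a spurious correlation driven by a third variable.
random.seed(1)

n = 5000
z = [random.gauss(0, 1) for _ in range(n)]          # the hidden common cause
x = [zi + random.gauss(0, 0.5) for zi in z]
y = [zi + random.gauss(0, 0.5) for zi in z]

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length samples."""
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    cov = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    return cov / (statistics.pstdev(a) * statistics.pstdev(b) * len(a))

def residuals(a, c):
    """Residuals of a after removing its linear dependence on c."""
    beta = pearson(a, c) * statistics.pstdev(a) / statistics.pstdev(c)
    ma, mc = statistics.fmean(a), statistics.fmean(c)
    return [ai - (ma + beta * (ci - mc)) for ai, ci in zip(a, c)]

print(f"corr(x, y)            = {pearson(x, y):.2f}")   # strong
print(f"corr after removing z = "
      f"{pearson(residuals(x, z), residuals(y, z)):.2f}")  # near zero
```

Without the raw data, there is no way to even attempt this kind of partialing out on the Rally numbers.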

**But This is Not the Core Problem**

The real problem is that *without the raw data* (ignoring for the moment the disparate sample sizes in the second chart), the variances in this data cannot be determined. The top chart shows a number for *predictability* for No Estimates and a similar number for *predictability* for Lightweight Scrum. They look pretty close: maybe 11 to 12 for No Estimates and maybe something less than 15 for Lightweight Scrum, say 13.

But we can't see the *spread* of the range of these values for each of the classifications. Nor can we see how those numbers were arrived at. If those numbers are the *average* of all the performance indices for *predictability*, what are the Standard Deviation and Variance of that average? As well, how was each sample group *averaged*? Maybe that's the median or the mode of all the numbers, not the *average*.

Turns out the *median* is used in two indices:

- The responsiveness score for Work In Progress (WIP)
- Team Stability.

But the *median* is just the *middlemost* value and does not consider the *shape* or the *width* of the distribution. Look on page 4 of the Rally report.
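
A minimal sketch with invented cycle-time data makes the point: two samples can share the same median while describing very different delivery experiences.

```python
import statistics

# Hedged illustration: identical medians, very different distributions.
# The numbers are notional cycle times in days, invented for this example.
tight = [9, 10, 10, 11, 11, 12, 12]     # clustered around the median
wide  = [1, 2, 5, 11, 30, 45, 90]       # same median, long right tail

for name, data in (("tight", tight), ("wide", wide)):
    print(f"{name}: median={statistics.median(data)}, "
          f"stdev={statistics.stdev(data):.1f}, "
          f"range={max(data) - min(data)}")
```

Both samples report a median of 11 days, yet one team delivers within a narrow band and the other team's customers might wait anywhere from a day to three months. The median alone cannot distinguish them.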

So here are the killer problems:

- The sample sizes between the four types of projects have not been normalized for their size. For small sample sizes, normality tests have little power to reject the null hypothesis, and therefore small samples most often pass normality tests. In this case, the #NoEstimates samples are included in the same chart as the 3 other classes *as if they have equal meaning*.
- NO causation is defined to describe *why* we should see the numbers we see. This is just a correlation chart. Tyler Vigen's *Spurious Correlations* site is a great place to start. The CA report does call out this issue up front, but that means anyone using that chart must include those qualifications as well.
- The final problem, just like the Standish Reports, is that the samples may or may not represent the total population of projects in the Agile space. These are Rally numbers. How about numbers for Jira clients? VersionOne clients? Team Foundation Server clients?

But the killer problem, just like the killer problem with #NoEstimates, is that there is no causal factor stated as to why we should see the differences in the data that we see.

So in the end (or at least the end for today), there are likely differences between methods of developing software with agile, estimating that development effort, and other parameters. But simple charts, with all the variances wiped out and rolled up, are NOT the way to investigate these differences. A much deeper statistical inference analysis is needed. Since the raw data is not available, this is not likely.

The result is a very nice summary document that can be used or misused to convey pretty much any marketing message one wants.

But one more thing. These charts are the beginning of the real analysis of processes needed to improve the probability of project success. When funding returns to do the actual analysis, progress will be made.

**Here are a Few Books that Have Served Me Well**

- *How to Lie with Statistics*, Darrell Huff, W. W. Norton, 1954
- *Principles of Statistics*, M. G. Bulmer, Oliver and Boyd, 1965
- *Standard Deviations: Flawed Assumptions, Tortured Data, and Other Ways to Lie with Statistics*, Gary Smith, Overlook, 2014
- *Flaws and Fallacies in Statistical Thinking*, Stephen K. Campbell, Prentice-Hall, 1974
- *Hard Facts, Dangerous Half-Truths & Total Nonsense: Profiting from Evidence-Based Management*, Jeffrey Pfeffer and Robert Sutton, Harvard Business School Press, 2006
- *The Flaw of Averages: Why We Underestimate Risk in the Face of Uncertainty*, Sam L. Savage, John Wiley & Sons, 2012