The Rise and Fall of the Chaos Report Figures

J. Laurenz Eveleens and Chris Verhoef, Vrije Universiteit Amsterdam

Although the Standish Group's Chaos reports are often used to indicate problems in application software development project management, the reports contain major flaws.

For many years, researchers and practitioners have analyzed how to successfully manage IT projects. Among them is the Standish Group, which regularly publishes its findings in its Chaos reports. In 1994, Standish reported a shocking 16 percent project success rate, another 53 percent of the projects were challenged, and 31 percent failed outright.1 In subsequent reports Standish updated its findings, yet the figures remained troublesome. These reports, derived from the Standish Group's longitudinal data, suggest that many efforts and best practices to improve project management hardly help increase project success. Over the years, their figures have attracted tremendous attention.

However, we question the validity of their figures. Robert Glass2,3 and Magne Jørgensen and his colleagues4 indicated that the only way to assess the Chaos results' credibility is to use Standish's data and reiterate their analyses. But there's another way: obtain your own data and reproduce Standish's research to assess its validity. We applied the Standish definitions to our extensive data consisting of 5,457 forecasts of 1,211 real-world projects totaling hundreds of millions of euros. Our research shows that the Standish definitions of successful and challenged projects have four major problems: they're misleading, one-sided, pervert the estimation practice, and result in meaningless figures.

Misleading Definitions

The Standish Group published the first Chaos report in 1994, which summarized Standish's research findings and aimed to investigate causes of software project failure and find key ways to reduce such failures.1 The group also intended to identify the scope of software project failures by defining three project categories that we recall verbatim:

■ Resolution Type 1, or project success. The project is completed on time and on budget, offering all features and functions as initially specified.
■ Resolution Type 2, or project challenged. The project is completed and operational but over budget and over the time estimate, and offers fewer features and functions than originally specified.
■ Resolution Type 3, or project impaired. The project is cancelled at some point during the development cycle.1

To find answers to their research questions, Standish sent out questionnaires. Their total sample size was 365 respondents representing 8,380 applications. On the basis of the responses, Standish published overall percentages for each project category. Standish updated its figures in subsequent years (see Table 1). A number of authors published these figures in various white papers.1,5–7

Table 1. Standish project benchmarks over the years

  Year    Successful (%)    Challenged (%)    Failed (%)
  1994    16                53                31
  1996    27                33                40
  1998    26                46                28
  2000    28                49                23
  2004    29                53                18
  2006    35                46                19
  2009    32                44                24

The figures indicate large problems with software engineering projects and have had an enormous impact on application software development. They suggest that the many efforts and best practices put forward to improve how companies develop software are hardly successful. Scientific articles and media reports widely cite these numbers. Many authors use the figures to show that software development project management is in a crisis.
The numbers even found their way to a report for the President of the United States to substantiate the claim that US software products and processes are inadequate.8

The figures' impact and their widespread use indicate that thousands of authors have accepted the Standish findings. They're perceived as impeccable and unquestionable. However, the Standish definitions of successful and challenged projects are problematic. Standish defines a successful project solely by adherence to an initial forecast of cost, time, and functionality. The latter is defined only by the amount of features and functions, not functionality itself. Indeed, Standish discussed this in its report: "For challenged projects, more than a quarter were completed with only 25 percent to 49 percent of originally specified features and functions."1

So, Standish defines a project as a success based on how well it did with respect to its original estimates of the amount of cost, time, and functionality. Therefore, the Standish "successful" and "challenged" definitions are equivalent to the following:

■ Resolution Type 1, or project success. The project is completed, the forecast to actual ratios (f/a) of cost and time are ≥1, and the f/a ratio of the amount of functionality is ≤1.
■ Resolution Type 2, or project challenged. The project is completed and operational, but the f/a ratio of cost or time is <1, or the f/a ratio of the amount of functionality is >1.

The reformulated definitions illustrate that the definitions are only about estimation deviation. Jørgensen and his colleagues show that the definitions don't cover all possibilities.4 For instance, a project that's within budget and time but that has less functionality doesn't fit any category. In this article, we assume a project that doesn't comply with one or more of the success criteria belongs to the challenged-project category.

Standish calculates its success measure by counting the number of projects that have an initial forecast larger than the actual for cost and time, and one that's smaller for functionality. This count is divided by the total number of projects to calculate the success rate. In effect, the Standish Group defines its success measure as a measure of estimation accuracy of cost, time, and functionality.
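To make this counting rule concrete, the following sketch classifies a single project from its initial f/a ratios and computes the resulting success rate. It merely illustrates the reformulated definitions above; the function and field names are ours for illustration, not the Standish Group's instrument or our analysis scripts.

```python
# A minimal sketch of the reformulated Standish classification. The names are
# hypothetical; this illustrates the counting rule, not Standish's instrument.

def standish_resolution(fa_cost, fa_time, fa_functionality, completed=True):
    """Classify one project from its initial forecast-to-actual (f/a) ratios."""
    if not completed:
        return "impaired"      # Resolution Type 3: cancelled during development
    if fa_cost >= 1 and fa_time >= 1 and fa_functionality <= 1:
        return "successful"    # Resolution Type 1: no overrun, no functionality shortfall
    return "challenged"        # Resolution Type 2: any overrun or functionality shortfall

def standish_success_rate(projects):
    """projects: iterable of (fa_cost, fa_time, fa_functionality, completed) tuples."""
    labels = [standish_resolution(*p) for p in projects]
    return labels.count("successful") / len(labels)
```

Note that the rule is binary: a project with a 1 percent cost overrun (an f/a ratio of 0.99) counts as challenged just as much as a project that doubles its budget.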
In reality, the part of a project's success that's related to estimation deviation is highly context dependent. In some contexts, 25 percent estimation error does no harm and doesn't impact what we would normally consider project success. In other contexts, only 5 percent overrun would cause much harm and make the project challenged. In that sense, there's no way around including more context (or totally different definitions) when assessing successful and challenged projects. However, the Standish definitions don't consider a software development project's context, such as usefulness, profit, and user satisfaction.

This illustrates the first problem with the definitions. They're misleading because they're solely based on estimation accuracy of cost, time, and functionality. But Standish labels projects as successful or challenged, suggesting much more than deviations from their original estimates.

Unrealistic Rates

The next issue is whether the Standish estimation accuracy definitions are sound. They are not. The Standish Group's measures are one-sided because they neglect underruns for cost and time and overruns for the amount of functionality.

We assessed estimation accuracy with two tools. We derived the first from Barry Boehm's now-famous cone of uncertainty, a plot that depicts forecast to actual ratios against project progression.9 This plot shows how the forecasts are made, what deviations they contain, and whether institutional biases exist.

The second is Tom DeMarco's Estimation Quality Factor (EQF), a time-weighted estimation accuracy measure he proposed in 1982.10 The higher a forecast's EQF value, the higher its quality. An EQF value of 5 means the time-weighted forecasts of a single project deviate on average 1/5, or 20 percent, from the actual.
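In other words, EQF is the reciprocal of the time-weighted mean relative deviation of a project's forecasts from its actual. The sketch below computes it for a single project, treating each forecast as being in force from the moment it is made until the next one is issued; this step-function treatment and the data layout are simplifying assumptions for illustration, not DeMarco's exact formulation or our analysis code.

```python
# A minimal sketch of DeMarco's Estimation Quality Factor (EQF) for one project,
# computed as the reciprocal of the time-weighted mean relative deviation of the
# forecasts from the actual. Each forecast is assumed to hold until the next one.

def eqf(forecasts, actual):
    """forecasts: list of (completion, forecast) pairs sorted by completion in
    [0, 1); actual: the realized cost, duration, or amount of function points."""
    covered = 1.0 - forecasts[0][0]            # part of the project covered by forecasts
    weighted = 0.0
    for (t, f), (t_next, _) in zip(forecasts, forecasts[1:] + [(1.0, None)]):
        weighted += (t_next - t) * abs(f / actual - 1.0)
    mean_deviation = weighted / covered        # time-weighted mean relative deviation
    return float("inf") if mean_deviation == 0 else 1.0 / mean_deviation

# Example: a project estimated at 120, then re-estimated at 90 halfway through,
# against an actual of 100: eqf([(0.0, 120), (0.5, 90)], 100) is about 6.7.
```

Under this relation, the median EQF values that follow translate directly into deviations: for instance, a median EQF of 8.5 corresponds to a time-weighted deviation of roughly 12 percent, and 6.4 to roughly 16 percent.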
We applied Boehm's and DeMarco's work to our own data and detected large biases that the organizations weren't aware of. We introduce two data sets from an anonymous multinational corporation to prove that the one-sided Standish definitions lead to unrealistic rates.

Cost

The first case study concerns a large financial services provider. From this organization, Y, we obtained data on 140 software development projects conducted from 2004 to 2006. The organization made 667 forecasts for these projects' total costs. We divided the forecasted cost by the actual project cost and plotted the ratios as shown in Figure 1. The horizontal axis represents project progression. The figure depicts the start of a project at zero and represents project completion by 1.0. The vertical axis shows the f/a ratio's value. For instance, a data point at project completion 0.2 and an f/a ratio of 2 indicates a forecast was made when the project was one-fifth completed. This forecast was two times the actual, meaning the project turned out to be 50 percent of the estimated cost.
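Figures 1 through 4 all use this layout. For readers who want to inspect their own forecast data the same way, the following sketch draws such an f/a plot from a handful of invented ratios; it assumes matplotlib is available and is not the code behind our figures.

```python
# A minimal sketch (synthetic data) of an f/a plot in the style of Figures 1-4:
# forecast/actual ratios against project progression, on a logarithmic vertical
# axis so that over- and underestimates of the same relative size sit
# symmetrically around the line f/a = 1.
import matplotlib.pyplot as plt

# Each tuple is (project completion when the forecast was made, f/a ratio).
ratios = [(0.05, 1.8), (0.2, 0.7), (0.4, 1.3), (0.6, 0.9), (0.85, 1.05)]

completion, fa = zip(*ratios)
plt.scatter(completion, fa)
plt.yscale("log")
plt.axhline(1.0, linestyle="--")   # forecasts on this line equal the actual
plt.xlabel("Project completion")
plt.ylabel("Forecast/actual")
plt.show()
```

The logarithmic vertical axis keeps overestimates and underestimates of the same relative size at equal distances from the line f/a = 1, which is what makes an institutional bias visible at a glance.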
Figure 1. 667 f/a ratios for 140 project costs of organization Y, where f is forecast and a is actual. The ratios are spread equally below and above the horizontal line f/a = 1, indicating the forecasts are unbiased. The ratios also show that the quality of the forecasts is high compared to the literature.10,11

The f/a ratios in Figure 1 resemble Boehm's conical shape, with the forecasts centered around the actual value. A median f/a ratio of 1.0 supports this finding. The forecasts' quality is relatively high, with a median EQF value of 8.5. This indicates that half the projects have a time-weighted average deviation of 12 percent or less from the actual. Compared to results from the literature, this organization makes best-in-class forecasts.10,11

It turned out that an independent metrics group assessed this organization's forecasts. This group made its own cost calculations next to those of the project managers. If large discrepancies arose, these needed to be resolved before any budget was approved. This caused forecasts to aim at predicting the actual value. Yet, even though this organization's cost forecasts are accurate, when we apply the Standish definitions to the initial forecasts, we find only a 59 percent success rate.

Functionality

From the same organization Y, we obtained data for 83 software development projects from 2003 to 2005. In total, the organization's estimators made 100 forecasts for the projects' functionality, calculated in function points.12

Figure 2. 100 f/a ratios for 83 project function points of organization Y, where f is forecast and a is actual. The ratios are close to and centered around the horizontal line. This indicates the forecasts are unbiased and of high quality.

The functionality f/a plot in Figure 2 shows a situation similar to the f/a ratios for the costs. The bias is negligible based on the figure and a median f/a ratio of 1.0. Except for some outliers, the f/a ratios converge to the actual value. The functionality forecasts have a median EQF of 6.4. This means that the function-point forecasts of half the projects have a time-weighted average deviation of 16 percent or less from the actual amount.

Multiple experienced function-point counters calculated the projects' functionality. Because they weren't involved with the projects' execution, their only incentive was to predict the actual value. However, despite the forecasts' accuracy, when we apply the Standish definitions to the initial forecasts, we find only a 55 percent success rate.

Combined

Fifty-five software development projects contained forecasts and actuals of both cost and functionality. There were 231 cost forecasts and 69 functionality forecasts. Both cost and functionality forecasts were unbiased and converged to the actual value. The median EQF for the cost forecasts is 9.0; for the functionality forecasts, it's 5.0. So, half the projects have a time-weighted average deviation of 11 percent for cost and 20 percent for functionality.

We applied the reformulated Standish definitions to the initial forecasts of the combined data. Even without taking into account failed projects and the time dimension, the best-in-class organization Y obtains a success rate of 35 percent. Yet, the median EQF of both initial forecasts of costs and functionality is 6.5, showing that half the projects have an average time-weighted deviation of only 15 percent from the actuals. If this organization is already so unsuccessful in two dimensions according to Standish, it's hardly surprising that Standish found only a 16 percent success rate in its first report.1

These case studies show that organization Y obtains unrealistically low success rates for the individual cost and functionality forecasts owing to the definitions' one-sidedness. Combining these already low rates further degrades the success rate. Clearly, the Standish success rates don't give an accurate indication of true estimation accuracy of cost and functionality in the case of an unbiased best-in-class organization.

Perverting Accuracy

The third problem is that steering on the Standish definitions causes large cost and time overestimations (and large functionality underestimations), which perverts rather than improves estimation accuracy.

We obtained data from a large multinational organization, X, comprising 867 IT-intensive projects that it began and completed in 2005 or 2006. In total, the organization made 3,767 forecasts of the projects' costs.

The f/a ratios in Figure 3 show that the organization's forecasts were generally higher than the actuals. Also, the data doesn't show a conical shape as we'd expect from Boehm's cone of uncertainty. Projects even had surplus budget after completion. After discussion with the organization, we found it steered on Standish project success indicators. The organization adopted the Standish definitions to establish when projects were successful. This caused project managers to overstate budget requests to increase the safety margin for success. However, this practice perverted the forecasts' quality, making it low with a median EQF of 0.43. So, 50 percent of the projects have a time-weighted average deviation of 233 percent or more from the actual.

Figure 3. 3,767 f/a ratios for 867 project costs of organization X, where f is forecast and a is actual. The forecasts show large deviations and do not converge to the actuals over time. The figure shows that these forecasts are generally overestimated and of low quality.

Meaningless Figures

The fourth major problem is that the Standish figures are meaningless. Organization X showed that large biases occur in practice. Even if a company doesn't steer on Standish's key performance indicators, biases exist. We show this by introducing another case study from an earlier IEEE Software paper.13 Comparing all the case studies together, we show that without taking forecasting biases into account, it's almost impossible to make any general statement about estimation accuracy across institutional boundaries.

Time

Landmark Graphics is a commercial software vendor for oil and gas exploration and production. We obtained data from Todd Little of Landmark Graphics, which he reported in IEEE Software,13 consisting of 121 software development projects carried out from 1999 to 2002. Little provided 923 distinct forecasts that predict these 121 projects' duration. We performed the same analysis as before by plotting the forecast to actual ratios (see Figure 4).

Figure 4. 923 f/a ratios for 121 project durations of Landmark Graphics, where f is forecast and a is actual. The forecasts are reasonably close to the horizontal line, yet most f/a ratios are below it. The figure indicates the forecasts are biased toward underestimation.

Most forecasts this organization made are lower than the actual. So, projects take longer than initially anticipated. The median EQF is 4.7. This means that half the projects have a time-weighted average deviation of 21 percent or less from the actual. Landmark Graphics' institutional bias was to forecast the minimum value instead of the actual value. This caused most forecasts to be lower than the actuals.

Applying Standish's Definitions

In two of the three organizations, the forecasts were significantly biased. With organization Y, we determined that the institutional bias was negligible. In organization X, the forecasts were much higher than the actual values because estimators took large safety margins into account. With Landmark Graphics, most forecasts were lower than the actual values because the company predicted the minimal time required to finish the project.

To illustrate how forecasting biases introduced by different underlying estimation processes affect the Chaos report figures, we applied Standish's definitions to all the cases. Because Standish deals with initial forecasts, we also used the initial forecast of each project. This is a subset of all data points shown in the f/a plots in Figures 1–4.

Also, our resulting figures are an upper bound for the Chaos successful-project figures. First, our figures don't incorporate failed projects. If we took failed projects into account, our case studies' success rates would always be equal to or lower than the current percentages.
Second, in each case study, we present only cost, time, or functionality data, except in one instance where we present both cost and functionality. In our analysis, we assume that the remaining dimensions are 100 percent successful, meaning our percentages are influenced by only one or two dimensions. If data for all three dimensions (cost, time, and functionality) is available and taken into account, the success rates will always be equal to or lower than the success percentages calculated for only one or two dimensions. Still, these rates suffice to prove that Standish's success and challenge rates don't reflect reality.

Table 2 shows the numbers calculated according to Standish's definitions for our case studies along with those of a fictitious organization having the opposite bias of Landmark Graphics.

Table 2. Comparing Standish success to real estimation accuracy

  Source                         Successful (%)   Challenged (%)   Median EQF of initial forecasts
  Organization X                 67               33               1.1
  Landmark Graphics              5.8              94.2             2.3
  Organization Y cost            59               41               6.4
  Organization Y functionality   55               45               5.7
  Organization Y combined        35               65               6.5
  1/Landmark Graphics            94.2             5.8              2.3

The table provides an interesting insight into the Standish figures. Organization X is very successful compared to the other case studies. Nearly 70 percent of the projects are successful according to the Standish definitions. On the other end, Landmark Graphics has only 6 percent success. Organization Y is in between with 59 percent success for costs, 55 percent success for functionality, and 35 percent success for both.

However, the f/a plots and their median EQFs clearly show that this is far from reality. Landmark Graphics' and organization Y's initial forecasts deviate much less from their actuals than in the case of organization X, which overestimates from tenfold to a hundredfold, as Figure 3 shows. Also, the other organizations' estimation quality outperforms that of organization X, as the median EQF of their initial forecasts illustrates: 2.3 for Landmark Graphics, 6.4 for organization Y's costs, and 5.7 for organization Y's functionality, versus 1.1 for organization X. So, half of Landmark Graphics' initial forecasts deviate only 43 percent from the actual value, 16 percent for organization Y's costs and 18 percent for organization Y's functionality, versus 91 percent for organization X. Still, Standish considers organization X highly successful compared to the other organizations.

To further illustrate how easy it is to become highly successful in Standish's terms, we also presented 1/Landmark Graphics. This fictitious organization represents the opposite of Landmark Graphics. That is, the deviations to the actuals remain the same, but an overrun becomes an underrun and vice versa. Suddenly, 1/Landmark Graphics becomes highly successful with a 94 percent success rate. So, with the opposite institutional bias, Landmark Graphics would improve its Standish success rate from 6 percent to 94 percent.
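A natural reading of the name 1/Landmark Graphics is that each initial duration f/a ratio is replaced by its reciprocal, which turns every overrun into an underrun of the same size on the logarithmic scale of Figures 1 through 4. The sketch below, with invented ratios, shows how such a mirroring flips the Standish success rate; it illustrates the idea only and is not the actual transformation applied to Little's data.

```python
# A small sketch of the 1/Landmark Graphics thought experiment. The reciprocal
# transformation is one plausible reading of the construction; the ratios below
# are invented, not Little's 121 initial duration forecasts.

def duration_success_rate(fa_ratios):
    """Standish-style success on the time dimension alone: the initial
    duration forecast is at least the actual, that is, f/a >= 1."""
    return sum(1 for r in fa_ratios if r >= 1) / len(fa_ratios)

landmark_like = [0.6, 0.8, 0.9, 0.95, 1.1]   # mostly underestimated durations
mirrored = [1 / r for r in landmark_like]     # the "1/Landmark Graphics" counterpart

print(duration_success_rate(landmark_like))   # 0.2: few successes, as with Landmark's 6 percent
print(duration_success_rate(mirrored))        # 0.8: suddenly "highly successful"
```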
These case studies show that the Standish figures for individual organizations don't reflect reality and are highly influenced by forecasting biases. Because the underlying data has an unknown bias, any aggregation of that data is unreliable and meaningless.

The influence of biased forecasts on the Standish figures isn't just evident from our figures. Standish's Chairman Jim Johnson clearly indicates that manipulating the figures is easy:

  In 1998, they [the respondents] had changed their [estimating] process so that they were then taking their best estimate, and then doubling it and adding half again.14

Johnson made this statement with respect to the drop in the reported average cost overruns between 1996 (142 percent) and 1998 (69 percent). In the article, Johnson says that he doesn't believe this change of process is the cause of the drop. However, our case studies show that forecasting biases have a giant influence on such figures. So, we believe that the change in the estimating process is most likely the cause of the drop in the reported cost overruns.

We developed methods based on Boehm's and DeMarco's work that mathematically account for forecasting biases.15 Our other paper contains more information about the case studies in addition to another one (totaling 1,824 projects, 12,287 forecasts, and 1,059+ million euros).15 We propose bandwidths surrounding the actual value to determine whether forecasts are accurate. These bandwidths show that projects with relatively small underruns or overruns have accurate forecasts, whereas projects with relatively large underruns or overruns have inaccurate forecasts. The mathematical implications are manifold and out of the scope of this paper. But we were able to derive figures that were exactly in line with the reality of our case studies. We hope that Standish will adopt our proposed definitions and methods for the rise and resurrection of their reports.

By ignoring the potential bias and forecasting quality, the Standish Group's figures don't adequately indicate what, according to their definitions, constitutes a successful or challenged project. Some organizations tend to overestimate while others underestimate, so their success and challenge rates are meaningless because Standish doesn't account for these clearly present biases.

This article isn't the first to challenge the Chaos report figures' credibility; a number of authors also "questioned the unquestionable."2–4,16

For instance, Nicholas Zvegintzov placed low reliability on information where researchers keep the actual data and data sources hidden.16 He argued that because Standish hasn't explained, for instance, how it chose the organizations it surveyed, what survey questions it asked, or how many good responses it received, there's little to believe.

Also, Glass2,3 felt the figures don't represent reality. Without plenty of successful software projects, he asserted, the current computer age would be impossible.

Moreover, Jørgensen and his colleagues expressed doubt about the numbers.4 They unveiled a number of issues with Standish's definitions and argue that the resulting figures are therefore unusable.
For instance, they argued that the definitions of successful and challenged projects focus on overruns and discard underruns.

Despite the valid questions our predecessors raised, no one had previously been able to definitively refute the Standish figures' credibility. Our research shows that Standish's definitions suffer from four major problems that undermine their figures' validity.

We communicated our findings15 to the Standish Group, and Chairman Johnson replied: "All data and information in the Chaos reports and all Standish reports should be considered Standish opinion and the reader bears all risk in the use of this opinion."

We fully support this disclaimer, which to our knowledge was never stated in the Chaos reports.

Acknowledgments
This research received partial support from the Netherlands Organization for Scientific Research's Jacquard projects Equity and Symbiosis. We thank the anonymous reviewers and Nicholas Zvegintzov for commenting on this article.

References
1. Chaos, tech. report, Standish Group Int'l, 1994.
2. R. Glass, "IT Failure Rates—70% or 10–15%," IEEE Software, May 2005, pp. 110–112.
3. R. Glass, "The Standish Report: Does It Really Describe a Software Crisis?" Comm. ACM, vol. 49, no. 8, 2006, pp. 15–16.
4. M. Jørgensen and K. Moløkken, "How Large Are Software Cost Overruns? A Review of the 1994 Chaos Report," Information and Software Technology, vol. 48, no. 8, 2006, pp. 297–301.
5. D. Hartmann, "Interview: Jim Johnson of the Standish Group," 2006; www.infoq.com/articles/Interview-Johnson-Standish-CHAOS.
6. Chaos: A Recipe for Success, tech. report, Standish Group Int'l, 1999.
7. Extreme Chaos, tech. report, Standish Group Int'l, 2001.
8. B. Joy and K. Kennedy, Information Technology Research: Investing in Our Future, tech. report, President's Information Technology Advisory Committee, Feb. 1999.
9. B. Boehm, Software Engineering Economics, Prentice Hall, 1981.
10. T. DeMarco, Controlling Software Projects, Prentice Hall, 1982.
11. T. Lister, "Becoming a Better Estimator—An Introduction to Using the EQF Metric," StickyMinds.com, 2002; www.stickyminds.com/s.asp?F=S3392_ART_2.
12. D. Garmus and D. Herron, Function Point Analysis—Measurement Practices for Successful Software Projects, Addison-Wesley, 2001.
13. T. Little, "Schedule Estimation and Uncertainty Surrounding the Cone of Uncertainty," IEEE Software, vol. 23, no. 3, 2006, pp. 48–54.
14. J. Johnson, "Standish: Why Were Project Failures Up and Cost Overruns Down in 1998?" InfoQ.com, 2006; www.infoq.com/articles/chaos-1998-failure-stats.
15. J.L. Eveleens and C. Verhoef, "Quantifying IT Forecast Quality," Science of Computer Programming, vol. 74, nos. 11–12, 2009, pp. 934–988; www.cs.vu.nl/~x/cone/cone.pdf.
16. N. Zvegintzov, "Frequently Begged Questions and How to Answer Them," IEEE Software, vol. 20, no. 2, 1998, pp. 93–96.

About the Authors
J. Laurenz Eveleens is a PhD student at Vrije Universiteit Amsterdam's Department of Computer Science. His current research is aimed at quantifying the quality of IT forecasts. Eveleens has an MSc in business mathematics and informatics from VU University Amsterdam. Contact him at [email protected].

Chris Verhoef is a computer science professor at Vrije Universiteit Amsterdam and is a scientific advisor with IT-Innovator Info Support. His research interests are IT governance, IT economics, and software engineering, maintenance, renovation, and architecture.
He has been an industrial consultant in several software-intensive areas, notably hardware manufacturing, telecommunications, finance, government, defense, and large service providers. Verhoef has a PhD in mathematics and computer science from the University of Amsterdam. He's an executive board member of the IEEE Computer Society Technical Council on Software Engineering and the vice chair of conferences. Contact him at [email protected].