CMS began two Medicare ACO experiments in 2012 – the Pioneer program and the Medicare Shared Savings Program (MSSP). Data on these programs available at CMS’s website paints a discouraging picture of the programs’ ability to cut costs. But two papers published in the last two years in the Journal of the American Medical Association paint a much rosier picture. A paper written by David Nyweide et al. claimed to find that the Pioneer ACO program generated gross savings in 2012 roughly double those CMS reported, and slightly higher savings in 2013. Similarly, a paper by J. Michael McWilliams claimed to find that the MSSP program saved money in 2014 while CMS’s data says it lost money.
What explains the discrepancy? Answer: The JAMA papers examined simulated ACO programs, not the actual Pioneer and MSSP programs. Moreover, Nyweide et al. neglected to report that shared savings payments would have greatly reduced the gross savings, and both Nyweide et al. and McWilliams ignored the start-up and maintenance costs the ACOs incurred. (JAMA’s editors redeemed themselves somewhat by publishing a comment by former CMS administrator Mark McClellan which warned readers that Nyweide et al. failed to measure the “shared savings payments to the ACOs” and “the investments of time and money” made by the ACOs.)
In this essay I will describe how the simulations reported in the JAMA papers differed from the actual ACO programs, and I’ll question the ethics of conflating simulated with actual results.
Results from the real world
CMS does not make it easy to determine whether its ACO programs save money. In fact, it is fair to say CMS is routinely deceptive. When CMS releases ACO data, it announces only the total savings achieved by a minority of ACOs and ignores the costs CMS incurs. But CMS does post spreadsheets on its website that permit the more dogged among us to calculate net figures, that is, the savings CMS celebrates minus the losses CMS won’t talk about.
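The arithmetic behind those net figures is simple. Here is a minimal sketch; the spending figures below are invented for illustration, not taken from CMS’s spreadsheets:

```python
# Hypothetical illustration of the net-savings arithmetic described above.
# Each entry is (benchmark spending, actual spending) for one ACO, in
# millions of dollars. Real values come from CMS's spreadsheets.
acos = [
    (500.0, 490.0),  # spent below benchmark: a "saver" CMS celebrates
    (300.0, 304.0),  # spent above benchmark: a loss CMS ignores
    (200.0, 201.0),  # another small loss CMS ignores
]

total_benchmark = sum(benchmark for benchmark, _ in acos)
total_actual = sum(actual for _, actual in acos)

# Net savings = the celebrated savings minus the ignored losses,
# expressed as a percent of total benchmark spending.
net_savings_pct = 100 * (total_benchmark - total_actual) / total_benchmark

print(round(net_savings_pct, 2))  # 0.5
```

Note that the celebrated ACO alone “saved” 10 million dollars, but once the losers are counted the net figure shrinks to half a percent of benchmark spending.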
Here are the net savings for Medicare for each of the first four years of the Pioneer program.
2012: 0.2 percent;
2013: 0.5 percent;
2014: 0.7 percent;
2015: 0.1 percent.
Here are the net losses for the MSSP program for its first four years, presented as three “performance years” (because of the uneven start dates for this program in 2012):
2012-2013: -0.2 percent;
2014: -0.1 percent;
2015: -0.3 percent. 
To sum up, after four years of trying, the Pioneer ACOs cut Medicare’s costs by somewhere between one- and seven-tenths of a percent annually, while the MSSP ACOs raised Medicare’s costs by somewhere between one- and three-tenths of a percent annually. Note that these underwhelming results do not include the start-up and maintenance costs incurred by the ACOs, nor the costs CMS incurred to administer these complex programs. Note also that these results are entirely consistent with four decades of research on managed care experiments, including HMOs, “medical homes,” “coordinated care,” and the Physician Group Practice Demonstration. As Robert Laszewski put it in 2012, commenting on a CBO report on managed-care tactics, “(Almost) nothing works.”
But according to Nyweide et al., the Pioneer ACOs achieved gross savings of 4.0 percent in 2012 and 1.5 percent in 2013, well above the 1.2 and 1.3 percent gross savings figures for those years according to CMS data. Similarly, McWilliams’ JAMA paper found both gross and net savings for MSSP ACOs in 2014 (McWilliams reported net savings of 0.7 percent), while CMS’s data indicated a net loss of a tenth of a percent that year and net losses in all other years.
If at first you don’t succeed, simulate
The design of the simulated versions of the Pioneer and MSSP ACO programs that Nyweide et al. and McWilliams examined differed substantially from the design of the real programs. The single most fundamental difference is the comparison group used to determine ACO spending. The real-world programs determine the performance of the ACOs by comparing the Medicare expenditures on patients “attributed” to ACO doctors in “performance years” with expenditures on patients attributed prior to the performance year. Nyweide et al. and McWilliams chose a different set of providers and patients to serve as the comparison groups: They chose providers (and their attributed patients) who had not signed up with an ACO.
Other important differences between the simulated and real ACO programs include differences in methods used to attribute patients to doctors (for example, whether to count visits to specialists and primary care doctors or only primary care doctors, and whether to look back two years versus three years to attribute patients) and in calculating savings and losses generated by ACOs (how many years to look back to create the baseline expenditure, whether to trend the baseline forward using national or local inflation rates, and what risk-adjustment method to use to adjust the baseline and the performance year expenditures to reflect changes in patient health).
These design changes are significant. The change in attribution methods means, obviously, the experimental group of patients (those in ACOs) was not the same experimental group tested by the real-world ACO programs. The difference in the algorithms used to calculate savings and losses means the simulated ACOs experienced different rewards and penalties from the real-world ACOs, and that in turn should have caused doctors and hospitals in the simulated programs to behave differently from the providers in the real programs.
In short, the authors of the JAMA papers changed every important parameter. They changed the control group, the experimental group, the method for calculating how the ACOs affected costs and, with that, the size of and distribution of rewards and penalties.
Why simulate when you have real ACO programs right in front of you?
Simulations (as the word is used in science) can serve very useful purposes. Modeling or simulating an existing system under conditions that differ from real-world conditions can improve our knowledge of how the system works and suggest ways to improve it. As one expert on simulations puts it, “Another broad class of purposes to which computer simulations can be put is in telling us about how we should expect some system in the real world to behave under a particular set of circumstances.”
But what justification is there for studying a simulated version of existing ACO programs – a version in which the programs are subjected to “a particular set of circumstances” that have never been applied in the real world – and declaring the results of the simulation to be the results of the real-world program? I can’t think of any.
The rationale offered by Nyweide et al. and McWilliams for applying imaginary conditions to the real-world ACO programs was that the design of the Pioneer and MSSP programs was too crude to measure accurately the performance of the ACOs. They argued that their simulated design would make their measurements more accurate than CMS’s. This hypothesis (decoupled from the Alice-in-Wonderland claim that simulated results equal real-world results) is plausible. But it needs to be demonstrated, not merely declared. If it were proven, it would suggest ways CMS could improve the accuracy of the carrots and sticks it applies to ACOs.
But Nyweide et al. and McWilliams made no attempt to prove their claim that their simulated model could measure ACO performance more accurately than CMS does now. They merely asserted that claim, and JAMA’s editors let them get away with it.
In their own words
Because some readers may be having a hard time believing smart people could behave so irrationally, I will quote the authors. In the quotes I present below, the authors use the label “evaluation” to describe their allegedly more sophisticated design, and they characterize CMS’s method of calculating savings as a mere “payment formula” or “actuarial calculation,” as if CMS were obsessed with cost-cutting and cared not a whit whether the payments and penalties it administers to ACOs correspond to the impact ACOs are actually having on costs.
I begin with Nyweide et al. That paper was based on research reported by L&M Policy Research in their evaluation of the first two years of the Pioneer program. Here is how L&M justified examining a simulated version of the Pioneer program rather than the actual program:
Savings and losses under the payment formula [meaning the real ACO program] are calculated with the goal of establishing an incentive to reduce spending compared to a benchmark. The goal of the evaluation [i.e., L&M’s simulation] is to estimate what costs and other outcomes would have been in the absence of the Pioneer model, which necessitates employing different approaches than those used to calculate payment. (p. 2) 
Do you see how empty and manipulative that statement is? L&M asserts that their “goal” is different from and superior to CMS’s, that L&M’s goal is to “estimate what costs … would have been in the absence of the Pioneer model” whereas the cost-cutters at CMS have no interest in determining “what costs would have been in the absence of the Pioneer model.” CMS, we are to believe, adopted crude formulas for decreeing that savings and losses occurred with little or no regard for the actual impact ACOs had on expenditures. 
Nyweide et al. made the same argument in their JAMA paper: They portrayed themselves as sophisticated scientists interested in knowing the real truth, while the pencil-pushers who designed the actual programs didn’t really care whether they rewarded only those ACOs that really saved money. Here’s what Nyweide et al. wrote: “Between 2012 and 2013, Pioneer ACOs generated approximately $183 million in savings to the Medicare program relative to projected spending levels…. However, these results do not account for many factors that may confound the relationship between the model intervention and patient outcomes.” (p. 2153) There was no further discussion of what these “many factors” might have been nor any evidence introduced to support the claim that the bean counters at CMS never thought about confounders. The implication was just left hanging – the people at CMS who designed the original ACO programs didn’t care about confounders while Nyweide et al. do.
McWilliams made the same argument in his JAMA paper on MSSP ACOs and in the paper in which he laid out his methodology for his JAMA paper. He claimed CMS’s “actuarial calculations” of savings “may” not be accurate, and offered a single, biased illustration:
Savings based on [CMS’s] actuarial calculations, however, may differ from actual spending reductions. For example, the substantial geographic variation in Medicare spending growth calls into question the validity of savings estimated by comparing spending in an ACO with a benchmark derived from a national rate of spending growth. If an ACO is located in an area with high spending growth, its savings could be underestimated. (p. 2)
That’s it. That’s all McWilliams can say to back up his claim that CMS’s method of measuring ACO expenditures is crude and the method he used for his simulation is better. Note also the bias in McWilliams’ example of CMS’s alleged lack of sophistication – savings could be underestimated in high-cost areas. Well, yes, but if that’s true, then what about ACOs in low-cost areas? Won’t their savings be overestimated? McWilliams was silent on that possibility.
The claim that CMS didn’t think enough, or at all, about accurate measurement contradicts common sense and CMS’s own statements. For example, in its 2011 final MSSP rule (76 FR 19528), CMS devoted pages and pages to the question of how to measure savings. In that rule, CMS wrote, for example, “our proposed approach … will result in a more accurate benchmark.” (p. 67914)
Can we learn anything from the JAMA papers?
Let us, for the moment, forgive Nyweide et al. and McWilliams for announcing to the world that simulated results are real-world results, and let’s ask whether we can learn anything from their papers. These papers might be helpful if we could conclude that they demonstrate techniques for improving the accuracy with which CMS measures ACO savings and losses. Well-designed simulations should help us understand how to improve existing systems. Unfortunately, the papers don’t do that. In fact, it’s quite possible that the accuracy of measurement achieved by Nyweide et al.’s simulation was worse than it is in the real programs.
The most important confounder in any comparison of patients is differences in patient health and income. Adjusting for these differences is commonly called “risk adjustment.” The new comparison group that Nyweide et al. inserted into their simulated version made accurate risk adjustment even more essential and slightly more difficult. That’s because the method Nyweide et al. used to create their control group guaranteed that the control and experimental (ACO) patient pools would differ on at least one crucial dimension – continuity of care or, if you prefer, patient loyalty. Nyweide et al.’s method of assigning patients to the control group was to first select out from all Medicare patients in a given region those who “belong” to doctors in ACOs. They determined “belongingness” by assessing where patients generated a plurality of their primary care visits. All patients who didn’t make the cut – who didn’t get assigned to an ACO doctor by the plurality method – got thrown back into the pool of “control” patients.
By definition, then, the comparison group in Nyweide et al.’s simulation consisted of less loyal patients – patients who have less continuity of care than ACO patients. We don’t need to know which way the causality runs – healthier patients lead to greater continuity, or continuity leads to healthier patients – to know that this method of creating a control group only makes accurate risk adjustment more important. Yet despite decades of trying, neither CMS nor anyone else has come up with a risk adjuster that is remotely accurate. The fact that Nyweide et al. reported declines in utilization in every single category of medical service, including primary care, for the ACOs is circumstantial evidence that the ACOs got healthier patients and that Nyweide et al. were unable to adjust their expenditure data accurately. 
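The plurality-of-visits assignment described above, and the resulting pool of “leftover” control patients, can be sketched roughly as follows. This is a simplification of the actual algorithm, and the function and doctor names are mine, invented for illustration:

```python
from collections import Counter

def assign_patient(primary_care_visits, aco_doctors):
    """Assign a patient to the ACO group if the plurality of his or her
    primary care visits went to an ACO doctor; otherwise the patient
    falls back into the comparison ("control") pool.

    primary_care_visits: list of doctor IDs, one entry per visit.
    aco_doctors: set of doctor IDs participating in the ACO.
    """
    if not primary_care_visits:
        return "control"
    visit_counts = Counter(primary_care_visits)
    # The doctor who received the largest share of this patient's visits.
    top_doctor, _ = visit_counts.most_common(1)[0]
    return "aco" if top_doctor in aco_doctors else "control"

# A patient who concentrates visits with one ACO doctor -- high
# continuity of care -- lands in the ACO group; a patient who scatters
# visits mostly among non-ACO doctors lands in the control pool.
aco_docs = {"dr_a", "dr_b"}
loyal = ["dr_a", "dr_a", "dr_a"]      # high continuity with an ACO doctor
wanderer = ["dr_x", "dr_x", "dr_a"]   # mostly non-ACO care
print(assign_patient(loyal, aco_docs))     # aco
print(assign_patient(wanderer, aco_docs))  # control
```

The sketch makes the selection problem visible: loyalty to a single doctor is exactly what determines which arm of the comparison a patient falls into, so the two arms cannot be alike on continuity of care.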
We cannot conclude, therefore, that the simulations taught CMS or the rest of us anything useful about how to improve the measurement of ACO performance. All we can say with certainty is (a) Nyweide et al. and McWilliams presented no evidence for claiming their method of measuring ACO performance is superior to the method CMS has been using, and (b) they badly misled their readers by claiming the positive results their simulated ACOs achieved should be viewed as results of the real-world ACOs.
Here is how Ben Umansky describes an example of CMS subterfuge:
CMS emphasizes [in its August 2015 press release] the $806 million in savings generated by the 92 MSSP ACOs that qualified for a shared savings payment [in 2014]. It also acknowledges that 89 other ACOs held expenditures below their benchmarks, but not by enough to qualify for a shared savings payment. It makes no mention of the 152 ACOs whose expenditures were above benchmark. Fortunately, CMS has released ACO-level data that make it possible to reconstruct the full picture. As a complete group, the 333 MSSP ACOs kept spending only $291 million below benchmark—a cost savings to Medicare, yes, but one smaller than the $341 million in shared savings payments made to the 92 top performers.
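Umansky’s reconstruction reduces to simple arithmetic. A minimal sketch, using only the dollar figures from the quote above:

```python
# Reconstructing the MSSP 2014 bottom line from the figures Umansky cites.
# All amounts in millions of dollars.
gross_savings_below_benchmark = 291.0  # all 333 ACOs combined, net of losers
shared_savings_payments = 341.0        # paid out to the 92 qualifying ACOs

# What Medicare actually nets after paying the bonuses.
net_to_medicare = gross_savings_below_benchmark - shared_savings_payments
print(net_to_medicare)  # -50.0: the bonuses exceeded the total savings
```

In other words, once the shared savings payments are subtracted, the program cost Medicare roughly 50 million dollars in 2014 rather than saving it money.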
 In this footnote, I describe my sources for the savings and losses of the Pioneer and MSSP programs listed in the text. I have calculated the yearly results for both programs using CMS’s data, but rather than list myself as the source for those figures, I thought it would add to the credibility of my figures if I listed other sources.
Readers can find the 2012 and 2013 results for both programs at p. 4 of an April 2015 CMS document.
Readers can find the total savings and losses and total spending in dollar terms for 2014 for the MSSP program in a blog comment by David Introcaso and Gregory Berger, and total savings and losses for both programs for 2014 in the comment by Umansky I referred to in footnote 1. I calculated total benchmark spending for 2014 for the Pioneer ACOs from CMS data and divided Umansky’s net dollar savings by total spending.
Readers may find the 2015 figures for both programs in Table 2, p. 3, of this report from MedPAC, and at pp. 7-8 of the transcript of the morning session of MedPAC’s October 6, 2016 meeting.
The Physician Group Practice (PGP) Demonstration was the first test of the ACO concept conducted by CMS. Ten carefully selected PGPs tried to lower Medicare spending over a five-year period (2005 to 2010) and, as a group, failed. As the final evaluation of the demo put it, “Seven of the ten participants had currently or previously owned a health maintenance organization (HMO).” (p. ES-3) And yet for all their experience wielding managed care tools, the PGPs succeeded only in raising Medicare’s costs by 1.2 percent over the five years. According to the “final report” on the demo, “[T]he demonstration saved Medicare .3 percent of the claims amounts, while performance payments were 1.5 percent of the claims amounts,” for a net loss of 1.2 percent. (p. 64)
For a review of the inconclusive evidence that HMOs save money, see my 2001 Health Affairs article.
 These 4.0 and 1.5 percent figures are taken from the editorial by Mark McClellan that accompanied the Nyweide et al. paper.
After characterizing their simulation as an “evaluation” and CMS’s design as a mere “payment formula,” L&M went on to offer this summary of how their simulation differed from CMS’s actual Pioneer program:
The primary variances between the payment and evaluation approaches include different (1) baseline populations…; (2) comparison populations…; (3) approaches in trending methods…; and (4) risk-adjustment methods. As such, findings between the financial payment calculations and the evaluation necessarily differ, both at an aggregate level and for individual Pioneer ACOs. (p. 2)
 It is possible CMS forced or induced L&M to print the empty statement I quote. A peculiar “disclaimer” on the “acknowledgements” page of the L&M report suggests CMS demanded statements or commitments to methods that L&M was not comfortable with. The acknowledgement seems to say that L&M should not be held responsible for defects in their report caused by “constraints” imposed upon them by CMS.
 I’m not the only one who has noticed the fact that Nyweide et al.’s method of creating a control group aggravates patient differences. L&M noted the problem in its second-year evaluation in a section discussing patient “satisfaction” surveys: “[I]t is possible that these CAHPS [Consumer Assessment of Healthcare Providers and Systems] results are confounded, given that beneficiaries are aligned or assigned to an ACO because they receive regular care from ACO providers.” (pp. 31-32)
The Nyweide et al. paper contains evidence suggesting that this problem is real and not fixable. Nyweide et al. reported that the ACOs reduced utilization in every category of medical service, including primary care. Not only that, they did it in the space of just two years! Utilization had to go up somewhere, presumably in the primary care category, in order for utilization to fall elsewhere. The Medicare ACOs are literally built up from a base of primary care visits, for Pete’s sake! Why did primary care visits fall along with utilization of all other types of services? The most obvious explanation for the failure of utilization of at least some categories of medical care to rise, and for the immediate positive impact of ACOs, is that the ACOs were assigned healthier and wealthier patients on Day One and Nyweide et al. were unable to correct accurately for that fact.
I would find your posts more helpful, both past and present, if you solicited feedback from the PIs of the studies you cite. You raise interesting points. However, these researchers are plenty smart. I have read McWilliams for years and have heard him speak. He is a solid resource and someone to learn from. Go to the source.
You would serve the readership well by helping us to understand why investigators choose the methods they do and the upsides and downsides of their approach by going deeper on the backstory.
Kip, I would encourage you to take the time to boil down the obvious hard work you put into this into a letter to the editor to JAMA that challenges the authors to respond. One very important reason is so that your criticisms and their response become part of the “permanent record” of those who search PubMed and the literature on this issue.