Assessing the Validity of MACRA’s Risk Adjustment Methods


The feedback doctors will receive from CMS under its proposed MACRA rule will arrive in two forms: money (more or less of it) and data. Neither form of feedback will be accurate. For that reason, the behavior desired by Congress and CMS – “smarter care” (as CMS puts it) producing lower costs and higher quality – will not materialize.

As I noted in the first installment of this three-part series, the two most important sources of noise in CMS’s feedback will be CMS’s inability to determine which patients “belong” to which doctor (the attribution problem) and its inability to adjust cost and quality scores for factors outside physician control (the risk-adjustment problem). [1] In my first installment I showed that the method of attribution CMS will use is unacceptably sloppy. In this installment I review the risk-adjustment problem and CMS’s irresponsible claim that it can measure physician “merit” even with sample sizes as small as 20 patients.

Big expense and small samples

The purpose of risk adjustment is to adjust cost and quality scores for factors doctors cannot control. The patient’s health, socio-economic status, and quality of insurance coverage are the three most important confounders that must be accounted for in any pay-for-performance scheme (MACRA is, of course, one great big P4P scheme) or any report card that could steer patients toward or away from a clinic or hospital. If risk adjustment is not done, or is done poorly, the signals doctors receive from the P4P scheme or report card will be useless, and even worse than useless if doctors who treat sicker and poorer patients are punished unjustifiably. Dozens of studies have shown that P4P schemes and report cards are already harming sicker and poorer patients (see, for example, Werner et al., Dranove et al., Chien et al., and Friedberg et al.).

The greatest obstacles to accurate risk adjustment are insufficient sample sizes and the expense of collecting data on confounding factors. I will illustrate the severity and intractability of these problems by describing the coronary artery bypass graft (CABG) surgery report card published annually by the New York Department of Health, and the Hierarchical Condition Categories (HCC) method that CMS developed to adjust payments to Medicare Advantage plans. The CABG report card uses what is probably the world’s most accurate risk-adjustment method for a quality measure. CMS’s HCC method is probably the world’s most accurate method of adjusting cost measures.

Nearly 100 percent of CABG surgeons are “average”

Since 1991, New York’s Department of Health has annually released a report on hospitals and surgeons who perform CABG surgery. [2] The report presents 30-day mortality rates for the roughly 185 surgeons who perform CABGs in New York and the 40 New York hospitals where that procedure is performed. [3]

Dr. Ashish Jha, who comments regularly on THCB, and Arnold Epstein have described New York’s CABG report card as “arguably the gold standard” in quality measurement. I agree. There are two reasons to regard it as the gold standard. First, the Department of Health doesn’t have to engage in arbitrary attribution games to determine which surgeon operated on which patients. Second, the Department adjusts the quality measure – 30-day surgery mortality rates – for approximately 70 risk factors (the total number depends on how you bundle risk factors; see pp. 58-59 of Hannan et al.).

Precisely because so much effort is poured into collecting data on confounding factors, New York’s report card is very expensive. In the paper cited above, Hannan et al. reported that approximately 40 full-time staff were needed to produce the report card:

* five people at the Department of Health to maintain the state’s database;

* a sixth person at the Department of Health who functions as a “utilization review agent … to audit a sample of 50 cases from half the hospitals each year,” and;

* a “data coordinator” at each of the hospitals (in 1997, the date of Hannan et al.’s paper, 31 hospitals were doing CABG surgery).

But for all the money poured into this report card, it cannot distinguish the vast majority of hospitals and doctors from one another. The latest report card determined that just 8 percent of the hospitals (3 out of 40) had risk-adjusted mortality rates above the state average in 2012 (none were below) and just 2 percent of the surgeons (4 out of 185) were above the statewide average (none were below). [4]

The wild swings in some hospital and surgeon rates suggest there is still a lot of noise in this report card. Bellevue’s rate in 2011, for example, was only 1.19, below the statewide average of 1.24 that year (though not by a statistically significant margin). But in 2012 it leaped to 8.04, high enough to make Bellevue one of the three outliers listed that year.

Sample size is a severe constraint for this report card. The average New York hospital performed 204 CABGs during 2012. The average physician did no more than 47 a year over the three-year period 2010-2012. [5] New York’s Department of Health seeks to address the sample-size problem for surgeons by pooling three years of data. But even after pooling the data, the Department is still unable to distinguish 98 percent of the surgeons from each other.
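To make the sample-size problem concrete, here is a back-of-the-envelope sketch in Python. The 1.24 percent statewide mortality rate and the 47-cases-a-year surgeon volume come from the report card figures above; the normal approximation to the binomial is my own simplification, not the Department of Health’s actual method:

```python
from math import sqrt

# Rough sketch of the sample-size problem, using figures cited above:
# statewide 30-day CABG mortality of ~1.24% and an average surgeon volume
# of no more than 47 cases a year, i.e. roughly 141 cases over three years.
p_state = 0.0124
n = 47 * 3  # three years of pooled data

# Standard error of the observed mortality rate for an exactly average
# surgeon (normal approximation -- a simplification at rates this low)
se = sqrt(p_state * (1 - p_state) / n)
upper_95 = p_state + 1.96 * se

print(f"pooled cases per surgeon: {n}")
print(f"expected deaths if exactly average: {p_state * n:.1f}")
print(f"rough 95% upper bound for an average surgeon: {upper_95:.1%}")
# An average surgeon can plausibly post a rate up to ~3.1% by chance alone,
# so even a surgeon with twice the statewide mortality (2.48%) usually
# cannot be flagged as an outlier at this sample size.
```

Under these assumptions, a surgeon with double the statewide death rate still falls inside the range an average surgeon could produce by luck, which is why 98 percent of surgeons end up indistinguishable.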

Note finally that New York’s CABG quality measure (mortality rates) is not combined arbitrarily with a grossly inaccurate cost score to create so-called “total performance” or “value” scores, which is what CMS proposes to do under MACRA. And yet, with all of its advantages – no noise generated by sloppy attribution, unusually accurate adjustment of confounding factors, and no noise generated by crude cost scores mashed into quality scores – despite all those advantages, the New York report card simply cannot distinguish the vast, vast majority of doctors and hospitals from one another.

CMS’s crude HCC adjuster

CMS’s method of adjusting payments made to Medicare Advantage (MA) plans is probably the most sophisticated, most studied risk adjuster on the planet. And yet it can explain only 11 percent of the variation in expenditures between patients, and that’s using a very large (5 percent) sample of all FFS Medicare patients (see p. 126 of a 2004 evaluation of CMS’s method by Pope et al.). Because its risk-adjustment method is so crude, CMS (a) chronically overpays MA plans and (b) substantially overpays them for their healthier enrollees and substantially underpays for their sicker enrollees. [6]

It is extremely unlikely CMS can improve the accuracy of this method by more than a few percentage points. It’s possible CMS could raise the accuracy of its method somewhat if it were to require insurers, ACOs, “medical homes,” and individual doctors to submit, in addition to diagnoses, data from medical records plus data on patient income and education. But that would raise costs for everyone involved. The resistance CMS encountered just getting insurance companies to submit diagnostic data gives us reason to believe CMS will never require insurance companies or providers to submit medical records data or socioeconomic data for all or even a substantial portion of Medicare’s 55 million enrollees.

For most of Medicare’s history, CMS/HCFA adjusted payments to insurance companies using only demographic data. This method of adjustment explained a paltry 1 percent of the variation in expenditures. In 1997, Congress ordered CMS/HCFA to improve the accuracy of its risk adjuster. CMS responded by developing the HCC system.

CMS derived that system by collapsing the 15,000 codes in the old ICD-9-CM into 804 “diagnostic groups,” then collapsing those 804 groups down further into 189 “hierarchical condition categories” (HCCs). One of the ten criteria CMS used for determining the definition of each diagnostic group and HCC was sufficient sample size: Each group and HCC had to be defined loosely enough so that the pool of patients that fell into that diagnostic category had “adequate sample sizes to permit accurate and stable estimates of expenditures” (p. 121, Pope et al.).
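The two-stage collapse, and the “hierarchical” logic that gives HCCs their name, can be sketched in a few lines of Python. Every code and label below is invented for illustration; none of these are CMS’s real mappings:

```python
# Hypothetical sketch of the collapse described above: ICD codes map to
# diagnostic groups, groups roll up into condition categories, and a
# hierarchy keeps only the most severe category in each disease family.
# All codes and labels here are invented, not real CMS mappings.
ICD_TO_GROUP = {
    "250.00": "diabetes_no_complication",
    "250.60": "diabetes_neuro_complication",
    "428.0": "heart_failure",
}
GROUP_TO_HCC = {
    "diabetes_no_complication": "HCC_diabetes_mild",
    "diabetes_neuro_complication": "HCC_diabetes_severe",
    "heart_failure": "HCC_chf",
}
# Within a hierarchy, the severe category supersedes the mild one.
HIERARCHY = {"HCC_diabetes_severe": ["HCC_diabetes_mild"]}

def assign_hccs(diagnoses):
    """Map a patient's year of diagnoses to the HCCs that survive."""
    hccs = {GROUP_TO_HCC[ICD_TO_GROUP[d]] for d in diagnoses if d in ICD_TO_GROUP}
    for severe, superseded in HIERARCHY.items():
        if severe in hccs:
            hccs -= set(superseded)
    return sorted(hccs)

print(assign_hccs(["250.00", "250.60", "428.0"]))
# -> ['HCC_chf', 'HCC_diabetes_severe']  (mild diabetes is superseded)
print(assign_hccs(["999.99"]))
# -> []  (a weeded-out diagnosis contributes nothing, like the 43% of
#         enrollees Pope et al. describe below)
```

Note the last case: a patient whose only diagnoses were dropped from the mapping gets no HCC at all, which is exactly the situation of the enrollees discussed in the next paragraph.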

CMS then threw out an undisclosed number of HCCs because the insurance industry complained about the cost of reporting diagnoses (p. 127, Pope et al.). To cut the MA plans’ cost of reporting and for other reasons, CMS weeded out all but 70 of the 189 HCCs, which meant weeding out all but 3,000 of the 15,000 ICD-9 codes (appendicitis and osteoarthritis are examples of diagnoses CMS threw out). Pope et al. reported that this 70-HCC method could only be applied to 57 percent of all MA enrollees because the other 43 percent either had no diagnosis at all in the course of a year or had a diagnosis that CMS weeded out.

You should now have a good understanding of the reasons why CMS’s relatively sophisticated risk adjuster can’t explain 89 percent of the variation in expenditures among Medicare beneficiaries. Those reasons are:

* CMS uses no medical records data and no data on income and education; [7]

* only 70 of the 189 HCCs, containing just 20 percent of the ICD-9 codes, made the final cut; and

* the HCC adjuster can’t improve the accuracy of the estimates for nearly half of Medicare enrollees, because those enrollees don’t get a diagnosis in the course of a year that is covered by one of the HCCs.

Readers who harbor the hope that CMS could substantially improve the accuracy of its crude HCC adjuster either by making the HCCs more precise or by simply using more of the existing 189 HCCs must abandon that hope. If CMS were to shrink the scope of its HCCs to make them more precise, it would shrink the size of the pools of patients covered by the average HCC. On the other hand, if CMS were to add more HCCs to those they already use, CMS would gain almost no additional explanatory power.

To understand this last point, readers may want to look at Figure 4, p. 128, of the Pope paper. You will see there a graph showing that the first ten (most powerful) HCCs account for 74 percent of the “maximum explanatory power” of the HCC algorithm, and each HCC added after that adds very little. Adding more HCCs will only increase the cost of running the HCC system and do almost nothing to improve its accuracy. [8]
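The shape of that curve can be mimicked with a toy model. The dollar weights, the 5 percent prevalence, and the geometric decay below are all invented assumptions, tuned only so that the ten strongest HCCs carry roughly 74 percent of the explanatory power, as Figure 4 reports:

```python
# Toy model of diminishing returns from adding HCCs. The $1,000 top weight,
# 5% prevalence, and 0.935 decay are invented assumptions, tuned only to
# reproduce the qualitative shape of Figure 4 in Pope et al.
prevalence = 0.05
betas = [1000 * 0.935 ** k for k in range(70)]  # hypothetical payment weights

# For independent binary condition indicators, each HCC's contribution to
# explained variance is beta^2 * p * (1 - p).
explained = [b * b * prevalence * (1 - prevalence) for b in betas]
total = sum(explained)
cum10 = sum(explained[:10]) / total

print(f"share of explanatory power from the 10 strongest HCCs: {cum10:.0%}")
print(f"share from the 30 weakest HCCs combined: {sum(explained[40:]) / total:.1%}")
```

Under these assumptions the thirty weakest HCCs together buy about half a percent of additional explanatory power, which is the point of the figure: past the first handful of categories, extra HCCs add cost, not accuracy.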

CMS conflates risk adjustment with “reliability”

My purpose in examining the CABG report card and CMS’s HCC method is to give you a sense of how primitive even our most sophisticated risk-adjustment methods are and how unfixable that problem is. CMS, however, gives the readers of its MACRA rule no hint that risk-adjustment is still in its infancy and will never grow out of its infancy. To the contrary, CMS conveys the impression that CMS has already created risk adjustment methods sufficiently accurate to punish and reward physicians.

CMS conveys this impression two ways. First, it states repeatedly that measures already in use for existing programs (such as the ACO and VM programs) are risk-adjusted but does not explain how poorly those measures have been risk-adjusted. Second, CMS repeatedly claims its measures meet a newfangled test called a “reliability threshold” test. On the basis of this test, CMS thinks it’s just fine to judge physician “merit” using a sample size as small as 20 prospectively attributed patients. Attributed, mind you!

Here is an example of how CMS sells its “reliability” test: “[W]e are now proposing to institute a minimum reliability threshold for public reporting on Physician Compare. The reliability of a measure refers to the extent to which the variation in measure is due to variation in quality of care ….” (p. 433 of the MACRA rule). That is just false. All CMS’s vaunted “reliability” test does is determine that the factors influencing a doctor’s cost and quality scores are fairly stable – they don’t change much from one period to the next or from one sample of a clinic’s patients to another sample. The “reliability” test tells us nothing about which of those factors are outside the doctor’s control and how badly those factors are distorting CMS’s scores for that doctor.
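A toy simulation makes the distinction vivid. In the hypothetical model below (every parameter is invented), each doctor’s measured score is driven entirely by a stable case-mix effect outside the doctor’s control plus a little sampling noise; true skill never enters the score at all. The result is a measure that passes any reliability threshold with flying colors while containing zero information about physician merit:

```python
import random

random.seed(0)
n_docs = 500

# Hypothetical model: scores are case mix plus noise; skill plays no role.
skill = [random.gauss(0, 1) for _ in range(n_docs)]     # what we WANT to measure
case_mix = [random.gauss(0, 1) for _ in range(n_docs)]  # confounder, stable over time

def measure():
    """One year's 'quality scores': case mix plus a little sampling noise."""
    return [m + random.gauss(0, 0.3) for m in case_mix]

def corr(xs, ys):
    """Pearson correlation, written out to keep the sketch dependency-free."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

year1, year2 = measure(), measure()
print(f"'reliability' (year-to-year correlation): {corr(year1, year2):.2f}")
print(f"validity (correlation with true skill):   {corr(year1, skill):.2f}")
```

The year-to-year correlation comes out around 0.9 (highly “reliable”), while the correlation with true skill hovers near zero: a stable score and a valid score are simply different things.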

In a report entitled The Reliability of Provider Profiling: A Tutorial, the RAND Corporation said exactly what I’m saying. CMS is well aware of this report: I found it in a document on CMS’s Physician Compare website (see p. 25). RAND made it crystal clear that CMS has no business conflating its “reliability” test with accurate risk adjustment. RAND stated:

Validity is the most important property of a measurement system. In nontechnical terms, validity is whether the measure actually measures what it claims to measure. If the answer is yes, the measure is valid. This may be an important question for physician profiling. For example, what if a measure of quality of care is dominated by patient adherence to treatment rather than by physician actions? Labeling the measure as a quality of care measure does not necessarily make it so.

[H]ere are several important determinants of validity of physician performance measures:

Is the measure fully controllable by the physician?

Is the measure properly adjusted for variation in the case-mix of patients among physicians?

Is the measure partially controlled by some other level of the system?….

Reliability assumes validity as a necessary precondition. Although the reliability calculations can still be performed for measures that are not valid, subsequent interpretation is problematic. [p. 17]

I think we can go beyond “problematic” in criticizing CMS’s proposal to use patient pools as small as 20. I believe “reckless” is the appropriate word.

Bad feedback is worse than no feedback

MACRA’s congressional authors and CMS staff operate on the assumption that any feedback, no matter how inaccurate, is better than no feedback. That’s absurd. For any organism, be it a rat in a Skinner box or a doctor being trained to practice “smarter care” by CMS, feedback has to be accurate and intelligible to be useful. But CMS’s feedback will be neither. It will be grossly inaccurate and largely unintelligible for multiple reasons, the two most important of which are CMS’s sloppy attribution method and its intractably crude risk-adjustment method. In fact, CMS’s feedback could be worse than useless. It could have the net effect of raising costs and lowering quality, especially for the poor and the sick.

[1] Perhaps the third-most important source of noise is the unrepresentativeness of the physician activities CMS proposes to include in its overall quality score. The activities measured constitute only a very tiny fraction of all services physicians provide. The fourth most important source of noise might be the ridiculous vagueness of some of CMS’s proposed measures, such as, “Take steps to improve health status of communities….” (p. 948)

[2] The New York cardiac surgery report has since been expanded to include percutaneous coronary interventions and valve replacements.

[3] I derived 185 as the number of surgeons who did bypass operations in 2012 by counting the number of physicians listed by name in New York’s latest report. That number may be too high or too low, but it’s the only one any reader would be able to calculate. The total number of surgeons who are graded by the report card is higher than 185 if anonymous surgeons grouped into “all others” for some hospitals are counted. On the other hand, a minority of surgeons (a total of 33) are listed multiple times because they operated in more than one hospital (one surgeon operated in four hospitals, five operated in three, and 27 operated in two). If we count only one entry for each of these, the number of listed surgeons would fall by 26.

[4] The hospital mortality rates appear in Table 1, p. 16, of Adult Cardiac Surgery in New York, 2010-2012. The physician data appear in Table 6. The actual number of physician-hospital combinations reported as outliers in Table 6 was five, not four, but one of those surgeons operated on a single patient at one hospital (NYP-Weill Cornell) and lost that patient. When that single case was combined with the 130 CABGs the same doctor performed at another hospital (NY Methodist), the mortality rate for the 131 cases was average.

If we raise the bar just a notch and require that hospitals and doctors appear as outliers in two successive reports (the 2011 report and the 2012 report), the report card’s performance deteriorates even further: The percent of outlier hospitals falls to zero and the percent of outlier surgeons falls to 1 percent.

[5] The ambiguous phrase “no more than 47 a year” is necessary because the report does not list every CABG surgeon by name. For several hospitals, it lumps some unnamed surgeons into a category called “all others.” 

[6] As Pope et al. put it, “Research showed that the managed care program was increasing total Medicare Program expenditures, because its enrollees were healthier than FFS enrollees….” (p. 119) As MedPAC put it in its June 2014 report:

“We show that the CMS HCC model severely over-predicts the costs in the prediction year for beneficiaries who had relatively low costs in the base year and severely under-predicts the costs in the prediction year for beneficiaries who had relatively high costs in the base year. These results raise concerns about equity among MA plans because plans that have a relatively high share of high-cost beneficiaries may be disadvantaged.” (p. 26)

[7] CMS does use Medicaid status as a measure of poverty, but that is a very crude measure of poverty.

[8] A 2011 evaluation of CMS’s HCC method confirmed the 2004 evaluation. CMS’s method still only explains 11 percent of the variation in expenditures between patients (see Table 2-1 p. 6).

Kip Sullivan is an attorney with Physicians For a National Health Program Minnesota.


2 Comments on "Assessing the Validity of MACRA’s Risk Adjustment Methods"


realdoctor
Jun 15, 2016

“Don’t let the perfect be the enemy of the good.”
“you can’t change what you don’t measure.”
“we have to do something.”
“we’ve validated the results”
“we need to standardize care”
“we have to pay for quality, not quantity”
“fee for service can’t continue”

All these clichés have been used to trump the clear fact, well supported in this article:

Bad feedback is worse than no feedback