Scoring the Surgeon Scorecard

Mark W. Friedberg is a researcher at the RAND Corporation and a co-author of the recent RAND analysis of the Surgeon Scorecard. He posted this on THCB in response to Ashish Jha’s post “Misunderstanding ProPublica.”

I don’t disagree at all with the idea that providers should release their own performance data, to the extent that they have it. Free flow of accurate and understandable performance information is inherently good. If the ProPublica Surgeon Scorecard can create pressure for this to happen, fantastic.

But there is no tradeoff between recognizing the serious methodological problems in the Scorecard, improving the Scorecard, and encouraging providers to release their own data. All three can and should be done simultaneously.

Also, for frequenters of this blog, I think it’s important to clarify a few key things about the “RAND critique” (which I authored with individuals from many institutions, all of whom deserve credit for devoting considerable unpaid time to the effort).

1. Nowhere in the critique do we suggest that ProPublica – or anybody else for that matter – abandon efforts to generate and publicize reports that truly reflect provider performance. Far from it. If you look up the authors of our critique, you’ll see that all of us have devoted substantial time and effort to furthering the science and practice of performance measurement and transparency in health care.

2. What the critique does suggest is making methodological improvements and performing due diligence, based on current best practices in public reporting. Some of our suggested improvements are easy to make. Some are difficult and will require time and effort, but that isn’t an excuse for not doing them. We also explain why these improvements are necessary: they address specific methodological steps that have not been performed in a scientifically credible manner (e.g., validating the brand-new measure that is being reported, checking the accuracy of the source data) and some suboptimal statistical choices (most notably, suppressing hospital random effect estimates in the reported “adjusted complication rates”; see our critique for the multiple reasons why this is problematic).

We recommend calculating the reliability and the risk of random misclassification (i.e., measurement error) in the report, and disclosing these to report users. This is a key component of transparency about the limitations of any report. To be clear, though: all of the recommendations in the critique are doable, and they have been done for previous performance reports (which we cite in the critique).

The only truly insurmountable barrier would be to find that the underlying Medicare claims data are so jumbled that individual surgeons are wrongly assigned to surgeries at a very high rate. As disheartening as this would be, I think we can all agree that without reasonably accurate attribution of cases to providers, a public report cannot be useful to patients or providers, no matter how rigorous the statistical methods. And there is good reason to validate surgeon-surgery assignments carefully, given troubling prior findings about operating-surgeon NPI inconsistency between Part A and Part B claims (see Dowd et al., which we cite in the critique).
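The reliability and random-misclassification calculations recommended above can be illustrated with a minimal sketch. All numbers below are hypothetical, chosen only to show the mechanics; they do not come from the Scorecard or the critique:

```python
# Hypothetical sketch of two quantities a report card should disclose:
# (1) reliability of a provider-level rate, and (2) the chance that an
# exactly-average provider is flagged purely by sampling noise.
# All inputs are illustrative assumptions, not real Scorecard values.
from math import erf, sqrt

def reliability(signal_var: float, noise_var: float) -> float:
    """Share of observed variance reflecting true provider-to-provider
    differences rather than sampling noise (0 = pure noise, 1 = pure signal)."""
    return signal_var / (signal_var + noise_var)

def misclassification_risk(noise_sd: float, threshold: float) -> float:
    """P(a truly average provider lands beyond +/- `threshold` by chance),
    assuming normally distributed measurement error."""
    z = threshold / noise_sd
    return 1.0 - erf(z / sqrt(2))  # two-tailed normal tail probability

# A surgeon with n cases and event rate p has binomial sampling
# variance p(1-p)/n for the observed rate.
p, n = 0.04, 50                 # assumed: 4% complication rate, 50 attributed cases
noise_var = p * (1 - p) / n
signal_var = 0.0004             # assumed true between-surgeon variance

print(f"reliability ~ {reliability(signal_var, noise_var):.2f}")
print(f"chance of a +/-2pp flag by luck ~ {misclassification_risk(sqrt(noise_var), 0.02):.2f}")
```

Under these assumed numbers, most of the observed variation is sampling noise, which is exactly why the critique asks report makers to compute and disclose these quantities rather than leave users to guess.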

3. I for one do not doubt that ProPublica’s goals in creating the Surgeon Scorecard are good, and I share them. As strange as I find ProPublica’s promotional video and some other aspects of tone in ProPublica’s response to our critique, I chalk this up to “Journalists are from Venus / Researchers are from Mars” – a difference in professional culture. So the problem is not the aim of the effort. The problem is the execution.

4. This is the toughest thing to communicate clearly, and I recommend reading our critique for more detail: It is entirely possible for a performance report with poor or unknown validity and reliability (which together determine the degree to which reported data are true predictors of the care a patient will receive from a given provider) to cause harm to patients and providers, both in the short and long term. For the reasons detailed in our critique, we come to a pretty strong conclusion on this regarding the Scorecard: potential users should not consider it valid or reliable in its current form. Future versions can and should be better.

To be clear, though, our conclusion isn’t about P values or confidence intervals (except insofar as the confidence intervals tell us something about measurement reliability). It’s about the hard reality that the validity (i.e., truth) of a performance report is only as strong as its weakest methodological link. We highlight the weakest links in the critique. We aren’t asking readers to just take our word for it; we make logical arguments. And if anything isn’t clear, my coauthors and I are happy to explain the more esoteric, but very important, methodological points.

If others read our critique and still want to use the Scorecard to help choose a surgeon, they should by all means do so, hopefully with our caveats in mind. But I would give them this advice: Ask your prospective surgeons for their rates of short- and long-term mortality, morbidity (including the most common and severe complications), and operative success. Ask them how they know these rates. How do they track them, if at all? Ask these questions, and use any other information at your disposal, with equal vigor, for surgeons with the lowest and highest “adjusted complication rates” on the ProPublica Surgeon Scorecard.


4 replies

  1. Great questions. Regarding “How much rigor should we expect?” from report cards, rephrasing the question can be helpful: “How accurate do we expect performance reports to be, as predictions of the care a provider will deliver in the future?”

    If we don’t care much about accuracy, then report makers can skip difficult steps like validating a new measure (and its risk adjustment), checking the source data, and requiring a reasonable level of reliability.

    Who wouldn’t care much about accuracy? Maybe somebody who is just using the report as an initial screening tool to catch performance outliers, with the intent of performing a “second sweep” — i.e., getting gold-standard data to see which outliers really need some help. While only a small percentage of reported outliers will be true outliers, it’s still useful to be able to target such gold-standard data collection.

    But for high-stakes uses like payment and public reporting directly to patients, where second-sweep data collection is impractical, I think reasonable people will agree that accuracy matters quite a bit. Hence the need for strong methods. There’s lots of work to be done in this area.

    William’s questions illustrate some of the conceptual problems in trying to rate individual actors, when health care is truly delivered by systems. This is just one of the reasons it makes little sense to try to divorce individual surgeon performance from hospital performance. Even if these could be separated statistically (and they can’t be, as explained in our critique), what’s the point, from the patient perspective? If I needed surgery, I would care far more about my health outcomes than about who deserves credit or blame.

    To healthjourno’s point, I think Peter Pronovost put it nicely here [https://armstronginstitute.blogs.hopkinsmedicine.org/2015/09/28/the-surgeon-scorecard-and-the-need-for-measurement-standards/]: “When journalists act as scientists, they should be held to the standards of scientists.”
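The screening argument in the reply above, that a low-prevalence, noisy screen yields mostly false-positive "outliers" and therefore needs a gold-standard second sweep, is a standard positive-predictive-value calculation. A minimal sketch with assumed numbers (none drawn from any actual report):

```python
# Illustrative Bayes calculation: positive predictive value of an
# outlier flag. All three inputs are assumptions for illustration.
def flag_ppv(prevalence: float, sensitivity: float, false_positive_rate: float) -> float:
    """P(truly poor performer | flagged as outlier by the report)."""
    true_flags = prevalence * sensitivity
    false_flags = (1 - prevalence) * false_positive_rate
    return true_flags / (true_flags + false_flags)

# Say 5% of surgeons are true outliers, the report catches 70% of them,
# and a low-reliability measure flags 10% of non-outliers by chance.
print(f"PPV ~ {flag_ppv(0.05, 0.70, 0.10):.2f}")  # prints "PPV ~ 0.27"
```

Under these assumptions, only about a quarter of flagged surgeons are true outliers, which is tolerable for targeting follow-up data collection but not for a high-stakes report delivered directly to patients.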

  2. Mark, great piece. Thanks for the thoughtful take in the original RAND response and for taking further time to address this important issue here. This isn’t the first controversial report card, nor will it be the last. Recent policies (MACRA, Meaningful Use 3) imply that metrics will only receive more attention. With so much on the line and so many new metrics in play, what are your thoughts on how this will all unfold? Is it fair or realistic to expect (demand?) the level of rigor that researchers and health economists are used to in designing, validating, and meeting such metrics? With such variance in how (and how well) data such as ICD codes are assigned, will such validations have to occur on a per-institution basis?

  3. As a working journalist, I am highly skeptical of colleagues who reflexively charge critics with conflicts of interest when their reporting is called into question. This story is a big deal in medicine, yet the media has so far largely ignored it.

    Stop and think what the world will be like if this kind of research/reporting becomes the standard.

    1. Academics will be held to one standard and publicly thrashed in the media as irresponsible if they fail to meet that standard. Promising careers will end. Research will be defunded.

    2. Journalists will not be required to meet that standard, and peer review will not be possible. The Data Pulitzer will go to the group of journalists with the biggest, baddest data exposé. Data-driven reporting will become the norm.

  4. If this effort takes off, will we have shortages of surgeons?
    Will performance ratings soon become muddled, just as “best car” ratings are in Consumer Reports? (E.g., surgeon X was best for mortality and third for complications for hysterectomies in pre-menopausal women, age 25-40, but…blah blah)
    Shouldn’t performance ratings logically be applied to all providers, including nurses, PTs, OTs, echocardiographic technologists, etc.? And if this is done, won’t shortages be extensive and markets chaotic?
    Shouldn’t other health care stakeholders, like hospital board members, be evaluated too, with respect to patient beneficence as well as corporate effectiveness? (E.g., docs and nurses face painful shortages, all the time, in hospital activities. Good management is critical here.)
    Can your objective in surgeon evaluations be achieved earlier, in medical school, internship, and residency training? What are your plans for the surgeon who is fourth best in mortality and complications in a small community with seven surgeons of similar expertise?
    Are these evaluation efforts tied in any way to the output of physicians from medical schools and residencies? Isn’t it probable that these ratings will alter society’s need for these physicians?