
Misunderstanding ProPublica

Ashish Jha

In July the investigative journalists at ProPublica released an analysis of 17,000 surgeons and their complication rates. Known as the “Surgeon Scorecard,” it set off a firestorm. In the months since, the primary objections to the scorecard have become clearer, and they were best distilled in a terrific piece by Lisa Rosenbaum. As anyone who follows me on Twitter knows, I am a big fan of Lisa – she reliably takes on health policy groupthink and incisively reveals that it’s often driven by simplistic answers to complex problems.

So when Lisa wrote a piece eviscerating the ProPublica effort, I wondered – what am I missing? Why am I such a fan of the effort when so many people I admire – from Rosenbaum to Peter Pronovost and, most recently, the authors of a RAND report – are highly critical? When it comes to the surgeon scorecard, reasonable people see it differently because they begin from different perspectives. Here’s my effort to distill mine.

What is the value of transparency?

Everyone supports transparency. Even the most secretive of organizations call for it. But the value of transparency is often misunderstood. There’s strong evidence that most consumers haven’t, at least until now, used quality data when choosing providers. But that’s not what makes transparency important. It is valuable because it fosters a sense of accountability among physicians for delivering better care. We physicians have done a terrible job policing ourselves. We all know doctors who are “007s” – licensed to kill. We do nothing about it. If I need a surgeon tomorrow, I will find a way to avoid those doctors, but that’s little comfort to most Americans, who can’t simply call up their surgeon friends and get the real scoop. Even if patients won’t look at quality data, doctors should and usually do.

Data on performance changes the culture in which we work. Transparency conveys to patients that performance data is not privileged information that we physicians get to keep to ourselves. And it tells physicians that they are accountable. Over the long run, this has a profound impact on performance. In our study of cardiac surgery in New York, transparency drove many of the worst surgeons out of the system – they moved, stopped practicing, or got better. Not because consumers were using the data, but because once the culture and environment changed, poor performance became harder to justify.

Aren’t bad data worse than no data?

One important critique of ProPublica’s effort is that it represents “bad data” – that its misclassification of surgeons is so severe that it’s worse than having no data at all. Are ProPublica’s data really that flawed? I don’t think so. Claims data reliably identify who died or was readmitted. ProPublica used these two metrics – death and readmissions due to certain specific causes – as surrogates for complications. Are these metrics perfect measures of complications? Nope. As Karl Bilimoria and others have thoughtfully pointed out, if surgeon A discharges patients early, her complications are likely to show up as readmissions, whereas surgeon B, who keeps his patients in the hospital longer, will see the complications in-house. Surgeon A will look worse than surgeon B despite having the same complication rate. While this may be a bigger problem for some surgeries than others, the bottom line is that for the elective procedures examined by ProPublica, most complications are diagnosed after discharge.
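
To make this ascertainment problem concrete, here is a minimal simulation – all numbers hypothetical (a 10% true complication rate, complications surfacing uniformly over the first two post-operative weeks) – showing how discharge timing alone can separate two equally skilled surgeons on a readmission-based metric:

```python
# Hypothetical simulation of the ascertainment bias described above:
# two surgeons with the SAME true complication rate but different
# lengths of stay. A complication that surfaces before discharge is
# caught in-house; only one that surfaces afterward becomes a readmission.
import random

random.seed(0)

def measured_readmission_rate(length_of_stay_days, n_patients=100_000,
                              true_complication_rate=0.10):
    """Share of patients whose complication shows up as a readmission."""
    readmissions = 0
    for _ in range(n_patients):
        if random.random() < true_complication_rate:
            # Assume complications surface uniformly over days 1-14 post-op.
            onset_day = random.uniform(1, 14)
            if onset_day > length_of_stay_days:  # patient is already home
                readmissions += 1
    return readmissions / n_patients

# Surgeon A discharges on day 2; surgeon B keeps patients until day 6.
print(f"Surgeon A (early discharge): {measured_readmission_rate(2):.3f}")
print(f"Surgeon B (late discharge):  {measured_readmission_rate(6):.3f}")
# Both surgeons' true complication rate is 10%, yet A's measured rate
# (~0.09) comes out roughly 50% higher than B's (~0.06).
```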

Similarly, Peter Pronovost pointed out that if I am someone with a high propensity to admit, I am more likely to readmit a patient with a mild post-operative cellulitis than my colleague is, and while that might be good for my patients, I am likely to be dinged by ProPublica’s metrics for the same complication. But this is a problem for all readmissions measures. Are these issues limitations of the ProPublica approach? Yes. Is there an easy fix that could address either one of them? Not that I can think of.

But here’s the real question: are these two limitations, or any of the others listed in the RAND report, so problematic as to invalidate the entire effort? No. If you needed a surgeon for your mom’s gallbladder surgery and she lived in Tahiti (where you presumably don’t know anyone), and Surgeon A had a ProPublica “complication rate” of 20% while Surgeon B had one of 2%, without any other information, would you really say this is worthless? I wouldn’t.
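
For intuition, here is a back-of-the-envelope check under assumed volumes (50 cases per surgeon – a hypothetical number, not ProPublica’s): given observed rates of 20% and 2%, how sure could you be that Surgeon A’s true complication rate really exceeds Surgeon B’s? A simple beta-binomial Monte Carlo with flat priors:

```python
# Posterior probability that Surgeon A's true complication rate exceeds
# Surgeon B's, using Beta(1, 1) priors (case volumes are assumptions).
import random

random.seed(0)

cases_a, complications_a = 50, 10  # 20% observed
cases_b, complications_b = 50, 1   # 2% observed

draws = 100_000
a_worse = sum(
    random.betavariate(1 + complications_a, 1 + cases_a - complications_a)
    > random.betavariate(1 + complications_b, 1 + cases_b - complications_b)
    for _ in range(draws)
)
print(f"P(A's true rate > B's) ~ {a_worse / draws:.3f}")
# Even at these modest volumes the probability is ~0.999 -- noisy data,
# but hardly worthless to a patient choosing blind.
```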

A reality test for me came from that cardiac surgery study I mentioned from New York State. As part of the study, I spoke to about 30 surgeons with varying performance. Not one said that the report card had mislabeled a great surgeon as a bad one. I heard about how the surgeon had been under stress, or that transparency wasn’t fair, or that mortality wasn’t a good metric. I heard about the noise in the data, but no denials of the signal. In today’s debate over ProPublica, I see a similar theme: lots of complaints about methodology, but no evidence that the results aren’t valuable.

But let’s think about the alternative. What if the ProPublica reports really are so bad that they have negative value? I don’t believe that’s true, but suppose it were – what should our response be? It should create a strong impetus for getting the data right. When risk-adjustment fails to account for severity of illness, the right answer is to improve the risk-adjustment, not to abandon the entire effort. Bad data should lead us to better data.
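
To sketch what “improve the risk-adjustment” means in practice (the data and model below are simulated for illustration – this is not ProPublica’s actual method, and scikit-learn is just one convenient tool): fit a patient-level risk model, then judge each surgeon by an observed-to-expected (O/E) ratio rather than by raw rates:

```python
# Simulated example: surgeon 1 takes sicker patients, so raw complication
# rates make her look worse even though both surgeons are equally skilled.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 5_000
severity = rng.normal(size=n)         # stand-in severity-of-illness score
surgeon = rng.integers(0, 2, size=n)  # which surgeon operated (0 or 1)
severity[surgeon == 1] += 1.0         # surgeon 1's patients are sicker

# True risk depends on severity only; the surgeons are equally skilled.
p = 1 / (1 + np.exp(-(-3.0 + 1.2 * severity)))
complication = rng.binomial(1, p)

# Risk-adjustment: model expected complication risk from severity.
model = LogisticRegression().fit(severity.reshape(-1, 1), complication)
expected = model.predict_proba(severity.reshape(-1, 1))[:, 1]

for s in (0, 1):
    mask = surgeon == s
    raw = complication[mask].mean()
    oe = complication[mask].sum() / expected[mask].sum()
    print(f"surgeon {s}: raw rate {raw:.3f}, O/E ratio {oe:.2f}")
# Raw rates wrongly flag surgeon 1; the O/E ratios for both surgeons
# sit near 1.0 once severity is accounted for.
```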

Misunderstanding the value of p-values and confidence intervals


Another popular criticism of the ProPublica scorecard is that its confidence intervals are wide – a line of reasoning that I believe misunderstands p-values and confidence intervals. Let’s return to your mom, who lives in Tahiti and still needs gallbladder surgery. What if I told you that I was 80% sure that surgeon A was better than average, and 80% sure that surgeon B was worse than average? Would you say that is useless information? The critique – that the 95% confidence intervals in the ProPublica reports are wide – assumes that we must be 95% sure about an assertion before rejecting the null hypothesis. That threshold has a long historical context and is important when the goal is to avoid a type 1 error (don’t label someone a bad surgeon unless you are really sure he or she is bad). But if you want to avoid a type 2 error – which is what patients want: don’t get a bad surgeon, even if you might miss out on a good one – a p-value of 0.2 and 80% confidence intervals look pretty good. Of course, the critique about confidence intervals comes mostly from physicians, who can get to very high certainty by calling their surgeon friends and finding out who is good. It’s a matter of perspective. For surgeons worried about being mislabeled, 95% confidence intervals seem appropriate. But for the rest of the world, a p-value of 0.05 and 95% confidence intervals are far too conservative.
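
To see the trade-off in numbers, take a hypothetical surgeon with 4 complications in 40 cases (10%) for a procedure whose overall average is 5%. Wilson score intervals at the two confidence levels (computed here with statsmodels, one possible tool):

```python
# 80% vs 95% confidence intervals for a hypothetical surgeon's
# complication rate: 4 complications in 40 cases, against a 5% average.
from statsmodels.stats.proportion import proportion_confint

complications, cases = 4, 40
for alpha, label in ((0.05, "95%"), (0.20, "80%")):
    low, high = proportion_confint(complications, cases,
                                   alpha=alpha, method="wilson")
    print(f"{label} CI: ({low:.3f}, {high:.3f})")
# Approximate output:
#   95% CI: (0.040, 0.231)  <- includes the 5% average: no flag at p < 0.05
#   80% CI: (0.054, 0.177)  <- excludes it: 80% confident he is worse
# The 95% threshold protects surgeons from type 1 errors; the 80%
# threshold protects patients from type 2 errors.
```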

A final point about the Scorecard – and maybe the most important: this is fundamentally hard stuff, and ProPublica deserves credit for starting the process. The RAND report outlines a series of potential deficiencies, each of which is worth considering – and to the extent that it’s reasonable, ProPublica should address them in the next iteration. That said, a key value of the ProPublica effort is that it has launched an important debate about how we assess and report surgical quality. The old way – where all the information was privileged and known only among physicians – is gone. And it is not coming back. So here’s the question for the critics: how do we move forward constructively – in ways that build trust with patients, spur improvements among providers, and don’t hinder access for the sickest patients? I have no magic formula. But that’s the discussion we need to be having.

Ashish Jha, MD, MPH (@ashishkjha) is the C. Boyden Gray Associate Professor of Health Policy and Management at the Harvard School of Public Health. He blogs at An Ounce of Evidence, where this post originally appeared. He is also the Senior Editor-in-Chief for Healthcare: The Journal of Delivery Science and Innovation.

William Palmer MD

We have to remember that surgeons – and other providers – are factors of production for many other people: hospitals, health plans, insurers, governments, agencies, ACOs, IPAs, etc. By this I mean that they are input factors in production that often yields money for other people. And the money involved is tremendous. So it is no wonder that surgeons are examined and rated and vetted and graded, quite analogous to the productive activities of professional athletes for their sports teams and owners. To the extent that these rating systems have this specific goal, they are not patient-centered and we have a…

Steven Findlay

Terrific piece and conversation that advances the debate and the ball on physician performance metrics and public reporting. Just one quick point: Ashish correctly notes that quality measurement has mostly been about pushing providers to improve and not primarily about giving consumers information on which to base choice of provider or healthcare decisions. ProPublica’s entry into this arena, along with others, is rebalancing this framework. Giving consumers meaningful and actionable info/data is steadily gaining momentum. CMS/HHS and others are now poised, for example, to make the “Compare” sites more consumer friendly…and more about getting consumers to vote with their feet…

Millenson

Let me second what Ashish has said, add some background, and pose a pointed question. The background: in a 2002 peer-reviewed article, “Pushing the Profession,” which I wrote for the journal then called Quality and Safety in Healthcare (and which was excerpted in the BMJ), I reviewed the history of the press and the profession in terms of patient safety. Though you very rarely see this admitted, most of the advances in patient safety came because of public pressure from the news media. For example, the vaunted Harvard anesthesia guidelines were prompted by financial pressure and an exposé by NBC News. So, too, the…

ashishkjha

Mike — I love your post. For every hospital or surgeon that complains about problems with methodology, etc. — your response is right on. Please release your own data. I’m sure your own clinically-based data is far superior. Let’s see it.

MWFriedberg

I don’t disagree at all with the idea that providers should release their own performance data, to the extent that they have it. Free flow of accurate and understandable performance information is inherently good. If the ProPublica Surgeon Scorecard can create pressure for this to happen, fantastic. But there is no tradeoff between recognizing the serious methodological problems in the Scorecard, improving the Scorecard, and encouraging providers to release their own data. All three can and should be done simultaneously. Also, for frequenters of this blog, I think it’s important to clarify a few key things about the “RAND critique”…

leonardkish

We can’t know data is bad until it sees the light of day. I applaud ProPublica for their effort to get this data out and noticed. To your point, they got the ball rolling. Like science, this is a process. No theory is complete and right out of the gate; it is the process of refinement and discovery that matters. It’s a stepping stone, but the direction of the path is what matters – toward the light.

Margalit Gur-Arie

My problem with this line of thinking is that it lowers the bar on what is deemed acceptable research, down into the gutter. Just because we have some information, which by all accounts is insufficient for drawing accurate conclusions, doesn’t mean that we should call it “data” and bestow scientific meaning on it. And just because we don’t have anything better doesn’t mean that we should create a fear-mongering campaign, complete with ominous trailer videos, to promote it to an unsuspecting public. Personally, I question the motives of this “effort”…

Joe Flower

No, this does not “lower the bar on what is deemed acceptable research, down into the gutter.” The data is not “by all accounts … insufficient for drawing accurate conclusions.” To say that would be to say that the worst surgeon in the survey has a random chance of actually being the best, that patients and their families could learn nothing — or worse, learn things that are false — from looking at this survey. As Jha admits, the data is noisy and there are problems with the methodology … but it is the best we have at the moment…

Margalit Gur-Arie

I agree that patients need better information, and the sooner the better. I cannot agree with publishing inaccurate data analysis as a method of pressuring the delivery system into publishing its own analysis in response.
Why can’t we just fund the right research, instead of dumping a bunch of data into the lap of some news outlet and having them figure it out?

leonardkish

What motives do you suspect, exactly?

Margalit Gur-Arie

Mostly clicks, and also the very fashionable (and seemingly well funded) bandwagon of cutting the medical profession down to size. I wouldn’t oppose the latter if control were transferred to patients, but it is most definitely not, and I would rather have one million individual doctors making independent decisions than a handful of too-big-to-fail corporate monopolies calling the shots.

leonardkish

1. Data journalism is not better funded than the multitude of medical societies, certainly not for a few clicks.
2. Data for better decision-making isn’t transferring control to patients?
3. A million individual doctors should release their data.
4. What corporate monopolies?

To Mike’s point, the problem of “bad data” is easily solved by the even better-funded critics who lament the bad data. They can easily get it and share it… if they so choose.

But they won’t. Why? Because the bad data isn’t actually the issue; the measuring is.

Margalit Gur-Arie

I apologize for the lack of clarity in my previous comment, and for the length of what’s to follow… Data journalism is an interesting concept, but I am not entirely sure how this trailer for the surgeon “report card” fits in with objective data journalism, or any journalism: https://youtu.be/mdQJMeLnwYw Data for better decision making is great if the data is clean and if people are at liberty to make decisions. How do people pick surgeons? You don’t wake up one morning and decide to shop for someone to remove your gallbladder. Chances are another physician will refer you to a…

ashishkjha

This post makes very little sense to me. The data are pretty good. Imperfect, but pretty good. By the way, we’ve been using essentially the same claims-based data, with a very similar approach to risk-adjustment, for hospitals – docking hospitals’ Medicare payments when they have high readmission rates. That’s a much bigger deal. I have not seen a lot of concern about how terrible the data are or how terrible the models are. The ProPublica stuff is pretty good.

Margalit Gur-Arie

If my memory is correct, I believe people, including the author, did complain about readmissions penalties being unfair to hospitals that serve vulnerable populations. As to the data, this has been litigated extensively following the ProPublica publication, so I would assume that there is no need for me to repeat all the points and counterpoints here. Let me just say that I would not be surprised if, upon compilation of all surgeries (not just inpatient Medicare FFS) and proper assessment of what is or is not a complication avoidable by the surgeon, the “scorecards” would look significantly different. I do…

Brad F

Ashish
Why not dispense with point estimates, then, and use a less refined presentation style? Wait for better data and tune up the analysis for the next iteration.

If it’s the outliers we want (“the killers”), let’s begin there. We can at least moderate the type I and II effects (misunderstanding, etc.) by simplifying how we view the rankings. For now.

Brad

ashishkjha

Thanks Brad. The issue is — what better data do you want to wait for, and when exactly might we get it? I’m fine with just an outlier analysis — but remember, since we can never be perfect, are we OK with labeling a lot of people as bad outliers when they aren’t? That’s what consumers would want — they would much rather miss out on a good surgeon than get a bad one. Our current approach flips that.

Brad F

By better data, I mean better process (RA and RAND recs). On outliers, I think we can agree that regardless of approach, we won’t achieve perfection. I’m with you there. It’s just a question of how “less perfect” we can be. Would I be willing to sacrifice sensitivity for specificity to cite surgeons whose records we can agree flashed warning signals? For now, yes. However, that’s only a first step. And it’s progress. Given that the NYS CABG study captured a much smaller group of docs, you had an easier time analyzing. Has your thinking changed, though, on the transparency brought forth on them? Did you…

ashishkjha

Brad — to your last question of whether I feel comfortable revealing bad apples, I think I did an inadequate job explaining my thinking, so here it is: 1. Until now, you and I (as doctors) had the privilege of being the only ones who knew the bad apples. That era is over (and I’m not shedding tears). It’s irrelevant whether I feel comfortable. The world demands it. 2. In the new world, in response to this demand, lots of people will do their own ratings of what is a good doctor and what is a bad doctor. Some of…

Brad F

As always, Ashish, thanks for being a gent and weighing in. A quick response and a final question for you. Just to clarify: by feeling comfortable, I meant targeting the “bad apples” with your data precision, not so much temperamentally how you feel about the reveal. I’m with you, though – i.e., if we can identify a 007, so be it, and let the chips fall where they may. I won’t win over any fans for offering this view, but nonetheless, it’s one worth mentioning. The overarching theme of report cards is to protect the public. The common good rules. Playing…