Misunderstanding ProPublica

Ashish Jha

In July the investigative journalists at ProPublica released an analysis of 17,000 surgeons and their complication rates. Known as the “Surgeon Scorecard,” it set off a firestorm. In the months following, the primary objections to the scorecard have become clearer and were best distilled in a terrific piece by Lisa Rosenbaum. As anyone who follows me on Twitter knows, I am a big fan of Lisa – she reliably takes on health policy groupthink and incisively reveals that it’s often driven by simplistic answers to complex problems.

So when Lisa wrote a piece eviscerating the ProPublica effort, I wondered – what am I missing? Why am I such a fan of the effort when so many people I admire – from Rosenbaum to Peter Pronovost and, most recently, the authors of a RAND report – are highly critical? When it comes to views on the surgeon scorecard, reasonable people see it differently because they begin with differing perspectives. Here’s my effort to distill mine.

What is the value of transparency?

Everyone supports transparency. Even the most secretive of organizations call for it. But the value of transparency is often misunderstood. There’s strong evidence that most consumers haven’t, at least until now, used quality data when choosing providers.  But that’s not what makes transparency important. It is valuable because it fosters a sense of accountability among physicians for better care. We physicians have done a terrible job policing ourselves. We all know doctors who are “007s” – licensed to kill. We do nothing about it. If I need a surgeon tomorrow, I will find a way to avoid them, but that’s little comfort to most Americans, who can’t simply call up their surgeon friends and get the real scoop. Even if patients won’t look at quality data, doctors should and usually do.

Data on performance changes the culture in which we work. Transparency conveys to patients that performance data is not privileged information that we physicians get to keep to ourselves. And it tells physicians that they are accountable. Over the long run, this has a profound impact on performance. In our study of cardiac surgery in New York, transparency drove many of the worst surgeons out of the system – they moved, stopped practicing, or got better. Not because consumers were using the data, but because when the culture and environment changed, poor performance became harder to justify.

Aren’t bad data worse than no data?

One important critique of ProPublica’s effort is that it represents “bad data” – that its misclassification of surgeons is so severe that it’s worse than having no data at all. Are ProPublica’s data so flawed that they represent “bad data”? I don’t think so. Claims data reliably identify who died or was readmitted. ProPublica used these two metrics – death and readmissions due to certain specific causes – as surrogates for complications. Are these metrics perfect measures of complications? Nope. As Karl Bilimoria and others have thoughtfully pointed out, if surgeon A discharges patients early, her complications are likely to lead to readmissions, whereas surgeon B, who keeps his patients in the hospital longer, will see the complications in-house. Surgeon A will look worse than surgeon B while having the same complication rate. While this may be a bigger problem for some surgeries than others, the bottom line is that for the elective procedures examined by ProPublica, most complications are diagnosed after discharge.
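To see how discharge timing alone can separate two surgeons with identical performance, here is a toy simulation. The complication rate and in-house fractions are invented for illustration – this is not ProPublica’s actual method:

```python
import random

random.seed(0)

TRUE_RATE = 0.15    # identical true complication rate for both surgeons (made up)
N_PATIENTS = 5000

def apparent_rate(in_house_fraction):
    """Complication rate as a readmission-based metric would see it.

    in_house_fraction: share of complications caught before discharge,
    which therefore never show up as readmissions.
    """
    readmitted = 0
    for _ in range(N_PATIENTS):
        has_complication = random.random() < TRUE_RATE
        if has_complication and random.random() >= in_house_fraction:
            readmitted += 1
    return readmitted / N_PATIENTS

# Surgeon A discharges early: nearly every complication becomes a readmission.
# Surgeon B keeps patients longer: half are managed in-house and never counted.
rate_a = apparent_rate(in_house_fraction=0.05)
rate_b = apparent_rate(in_house_fraction=0.50)
print(f"Surgeon A looks like: {rate_a:.1%}")   # roughly 14%
print(f"Surgeon B looks like: {rate_b:.1%}")   # roughly 7-8%
```

Both simulated surgeons harm patients at exactly the same rate; the metric separates them only because of where the complications get treated.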

Similarly, Peter Pronovost pointed out that if I am someone with a high propensity to admit, I am more likely than my colleague to readmit someone with a mild post-operative cellulitis – and while that might be good for my patients, I am likely to be dinged by ProPublica’s metrics for the same complication. But this is a problem for all readmissions measures. Are these issues limitations of the ProPublica approach? Yes. Is there an easy fix that they could apply to address either one of them? Not that I can think of.

But here’s the real question: are these two limitations, or any of the others listed in the RAND report, so problematic as to invalidate the entire effort? No. If you needed a surgeon for your mom’s gallbladder surgery and she lived in Tahiti (where you presumably don’t know anyone) – and Surgeon A had a ProPublica “complication rate” of 20% while Surgeon B had a 2% complication rate – without any other information, would you really say this is worthless? I wouldn’t.

A reality test for me came from that cardiac surgery study I mentioned from New York State. As part of the study, I spoke to about 30 surgeons with varying performance. Not one said that the report card had mislabeled a great surgeon as a bad one. I heard about how the surgeon had been under stress, or that transparency wasn’t fair, or that mortality wasn’t a good metric. I heard about the noise in the data, but no denials of the signal. In today’s debate over ProPublica, I see a similar theme: lots of complaints about methodology, but no evidence that the results aren’t valuable.

But let’s think about the alternative. What if the ProPublica reports are so bad that they have negative value? While I don’t think this is true, what should our response be? It should be a strong impetus for getting the data right. When risk-adjustment fails to account for severity of illness, the right answer is to improve risk-adjustment, not to abandon the entire effort. Bad data should lead us to better data.
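As a deliberately toy illustration of what “improving risk-adjustment” means, here is a minimal sketch of indirect standardization: comparing each surgeon’s observed complications to what would be expected given the risk mix of their patients. The case records and strata are invented and far simpler than any real claims-based model:

```python
from collections import defaultdict

# Hypothetical case records: (surgeon, risk_stratum, had_complication)
cases = [
    ("A", "low", 0), ("A", "low", 0), ("A", "high", 1), ("A", "high", 1),
    ("B", "low", 0), ("B", "low", 1), ("B", "low", 0), ("B", "low", 0),
]

# Step 1: population complication rate within each risk stratum.
totals = defaultdict(lambda: [0, 0])   # stratum -> [complications, cases]
for _, stratum, comp in cases:
    totals[stratum][0] += comp
    totals[stratum][1] += 1
stratum_rate = {s: c / n for s, (c, n) in totals.items()}

# Step 2: each surgeon's observed-to-expected (O/E) ratio, where "expected"
# reflects the risk mix of that surgeon's own patients.
def oe_ratio(surgeon):
    observed = sum(c for s, _, c in cases if s == surgeon)
    expected = sum(stratum_rate[st] for s, st, _ in cases if s == surgeon)
    return observed / expected

# Raw rates punish A (50% vs 25%) for taking high-risk patients;
# after adjustment, A is better than expected (O/E < 1) and B is worse.
print(oe_ratio("A"), oe_ratio("B"))
```

The point of the sketch: a surgeon who takes sicker patients looks terrible on raw rates and fine after adjustment – so the fix for imperfect adjustment is a better model, not no model.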

Misunderstanding the value of p-values and confidence intervals


Another popular criticism of the ProPublica scorecard is that its confidence intervals are wide – a line of reasoning that I believe misunderstands p-values and confidence intervals. Let’s return to your mom, who lives in Tahiti and still needs gallbladder surgery. What if I told you that I was 80% sure that surgeon A was better than average, and 80% sure that surgeon B was worse than average? Would you say that is useless information? Because the critique – that the 95% confidence intervals in the ProPublica reports are wide – requires that we be 95% sure about an assertion before rejecting the null hypothesis. That threshold has a long historical context and is important when the goal is to avoid a type 1 error (don’t label someone as a bad surgeon unless you are really sure he or she is bad). But if you want to avoid a type 2 error (which is what patients want – don’t get a bad surgeon, even if you might miss out on a good one), a p-value of 0.2 and 80% confidence intervals look pretty good. Of course, the critique about confidence intervals comes mostly from physicians, who can get to very high certainty by calling their surgeon friends and finding out who is good. It’s a matter of perspective. For surgeons worried about being mislabeled, 95% confidence intervals seem appropriate. But for the rest of the world, a p-value of 0.05 and 95% confidence intervals are way too conservative.
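The 80%-versus-95% point can be made concrete with a toy calculation using normal-approximation intervals. The surgeon’s numbers and the specialty average are invented – this is not ProPublica’s actual model:

```python
import math

def complication_ci(complications, n, z):
    """Normal-approximation confidence interval for a complication rate."""
    p = complications / n
    half = z * math.sqrt(p * (1 - p) / n)
    return max(0.0, p - half), min(1.0, p + half)

# Hypothetical surgeon: 8 complications in 60 cases; specialty average 5%.
AVG = 0.05
for label, z in [("95% CI (z=1.96)", 1.96), ("80% CI (z=1.28)", 1.28)]:
    lo, hi = complication_ci(8, 60, z)
    verdict = "worse than average" if lo > AVG else "cannot distinguish"
    print(f"{label}: ({lo:.3f}, {hi:.3f}) -> {verdict}")
```

With these made-up numbers, the 95% interval dips below the 5% average (so we “cannot distinguish” the surgeon from average), while the 80% interval sits entirely above it – the same data flag the surgeon once we accept more type 1 risk in exchange for less type 2 risk.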

A final point about the Scorecard – and maybe the most important:  This is fundamentally hard stuff, and ProPublica deserves credit for starting the process. The RAND report outlines a series of potential deficiencies, each of which is worth considering – and to the extent that it’s reasonable, ProPublica should address them in the next iteration.  That said – a key value of the ProPublica effort is that it has launched an important debate about how we assess and report surgical quality. The old way – where all the information was privileged and known only among physicians – is gone. And it is not coming back. So here’s the question for the critics: how do we move forward constructively – in ways that build trust with patients, spur improvements among providers, and don’t hinder access for the sickest patients?  I have no magic formula. But that’s the discussion we need to be having.

Ashish Jha, MD, MPH (@ashishkjha) is the C. Boyden Gray Associate Professor of Health Policy and Management at the Harvard School of Public Health. He blogs at An Ounce of Evidence, where this post originally appeared. He is also the Senior Editor-in-Chief for Healthcare: The Journal of Delivery Science and Innovation.


  1. We have to remember that surgeons–and other providers–are factors of production for many other people: hospitals; health plans; insurers; governments; agencies; ACOs; IPAs, etc.

    By this I mean that they are input factors in production that often yields money for other people. And the money involved is tremendous.

    So it is no wonder that surgeons are examined and rated and vetted and graded quite analogous to those productive activities produced by professional athletes for their sports teams and their owners.

    To the extent that these rating systems have this specific goal, then they are not patient-centered and we have a large opening for non-beneficence, or at best accidental beneficence. To the extent that their purpose is to give patients a better product, and only this, then we are serving faithfully our professional oath.

  2. Terrific piece and conversation that advances the debate and the ball on physician performance metrics and public reporting. Just one quick point: Ashish correctly notes that quality measurement has mostly been about pushing providers to improve and not primarily about giving consumers information on which to base choice of provider or healthcare decisions. ProPublica’s entry into this arena, along with others, is rebalancing this framework. Giving consumers meaningful and actionable info/data is steadily gaining momentum. CMS/HHS and others are now poised, for example, to make the “Compare” sites more consumer friendly…and more about getting consumers to vote with their feet. The 5-star ratings on Hospital Compare and other compare sites are a step in this direction. In several conversations I’ve had with the CMS compare team in recent weeks, it’s clear more is planned, though I dare say it’ll take quite some time for it all to roll out.

  3. Let me second what Ashish has said, add some background and pose a pointed question.

The background: in a 2002 peer-reviewed article, “Pushing the Profession,” that I wrote for the journal then called Quality and Safety in Healthcare (which was excerpted in the BMJ), I reviewed the history of the press and the profession in terms of patient safety. Though you very rarely see this admitted, most of the advances in patient safety came because of public pressure from the news media. For example, the vaunted Harvard anesthesia guidelines were prompted by financial pressure and an exposé by NBC News. So, too, the pressure from ProPublica will have a salutary effect.

    And now here’s the part the critics in the NEJM and elsewhere have left out: there’s no law saying you can’t voluntarily release risk-adjusted clinical data on your own that’s better and more accurate. The best cardiac data, for example, is from the Society of Thoracic Surgeons database, but it’s all for internal purposes except when a group (not individual surgeons) chooses to participate in a limited release program that’s been around the last few years.

    If, Esteemed Doctors and Scientists, your interest is giving the public the best possible data, then let’s see data on individual surgeons at Hopkins, Mass General, Brigham & Women’s, etc. as you set an example for the rest of the nation, bravely imitating what Dr. Ernest Amory Codman asked you to do back in 1913. At the very least, release volume data, by surgeon, and infection data by procedure or by floor.

    Or, instead, continue to complain about those evil press. Like the American College of Surgeons did when, in 1919, it burned its first hospital inspection results in the furnace at the Waldorf-Astoria Hotel in New York rather than have it possibly fall into the hands of reporters.

    Remember, don’t let the perfect be the enemy of the good. Harvard, Hopkins and all the rest of you, let’s have some radical transparency.

• Mike — I love your post. For every hospital or surgeon that complains about problems with methodology, etc. — your response is right on. Please release your own data. I’m sure your own clinically-based data is far superior. Let’s see it.

      • I don’t disagree at all with the idea that providers should release their own performance data, to the extent that they have it. Free flow of accurate and understandable performance information is inherently good. If the ProPublica Surgeon Scorecard can create pressure for this to happen, fantastic.

        But there is no tradeoff between recognizing the serious methodological problems in the Scorecard, improving the Scorecard, and encouraging providers to release their own data. All three can and should be done simultaneously.

        Also, for frequenters of this blog, I think it’s important to clarify a few key things about the “RAND critique” (which I authored with individuals from many institutions, all of whom deserve credit for devoting considerable unpaid time to the effort).

        1. Nowhere in the critique do we suggest that ProPublica – or anybody else for that matter – abandon efforts to generate and publicize reports that truly reflect provider performance. Far from it. If you look up the authors of our critique, you’ll see that all of us have devoted substantial time and effort to furthering the science and practice of performance measurement and transparency in health care.

        2. What the critique does suggest is to make methodological improvements and perform due diligence, based on current best practices in public reporting. Some of our suggested improvements are easy to make. Some are difficult and will require time and effort, but that isn’t an excuse for not doing them. We also explain the reasons why these improvements are necessary: they address specific methodological steps that have not been performed in a scientifically credible manner (e.g., validating the brand-new measure that is being reported, checking the accuracy of the source data) and some suboptimal statistical choices (most notably, suppressing hospital random effect estimates in the reported “adjusted complication rates”—see our critique for the multiple reasons why this is problematic). We recommend calculating the reliability and risk of random misclassification (i.e., measurement error) in the report, and disclosing these to report users. This is a key component of transparency about the limitations of any report. To be clear though: all of the recommendations in the critique are doable, and they have been done for previous performance reports (which we cite in the critique). The only truly insurmountable barrier would be to find that the underlying Medicare claims data are so jumbled up that individual surgeons are wrongly assigned to surgeries at a very high rate. As disheartening as this would be, I think we can all agree that without reasonably accurate attribution of cases to providers, a public report cannot be useful to patients or providers, no matter how rigorous the statistical methods. And there is good reason to validate surgeon-surgery assignments carefully, given troubling prior findings about operating surgeon NPI inconsistency between Part A and Part B claims (see Dowd et al, which we cite in the critique).

        3. I for one do not doubt that ProPublica’s goals in creating the Surgeon Scorecard are good, and I share them. As strange as I find ProPublica’s promotional video and some other aspects of tone in ProPublica’s response to our critique, I chalk this up to “Journalists are from Venus / Researchers are from Mars” – a difference in professional culture. So the problem is not the aim of the effort. The problem is the execution.

        4. This is the toughest thing to communicate clearly, and I recommend reading our critique for more detail: It is entirely possible for a performance report with poor or unknown validity and reliability (which together determine the degree to which reported data are true predictors of the care a patient will receive from a given provider) to cause harm to patients and providers, both in the short and long term. For the reasons detailed in our critique, we come to a pretty strong conclusion on this regarding the Scorecard: potential users should not consider it valid or reliable in its current form. Future versions can and should be better. To be clear though, our conclusion isn’t about P values or confidence intervals (except insofar as the confidence intervals tell us something about measurement reliability). It’s about the hard reality that the validity (i.e., truth) of a performance report is only as strong as its weakest methodological link. We highlight the weakest ones in the critique. We aren’t asking readers to just take our word for it; we make logical arguments. And if anything isn’t clear, my coauthors and I are happy to explain the more esoteric, but very important, methodological points.

        If others read our critique and still want to use the Scorecard to help choose a surgeon, they should by all means do so, hopefully with our caveats in mind. But I would give them this advice: Ask your prospective surgeons for their rates of short and long-term mortality, morbidity (including the most common and severe complications), and operative success. Ask them how they know these rates. How do they track them, if at all? Ask these questions, and use any other information at your disposal, with equal vigor, for surgeons with the lowest and highest “adjusted complication rates” on the ProPublica Surgeon Scorecard.
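The reliability that the critique asks report producers to calculate and disclose is, in its simplest form, a signal-to-noise ratio: the share of observed variation in a surgeon’s rate that reflects true differences between surgeons rather than sampling noise. A toy sketch of that standard formulation, with all numbers invented:

```python
def reliability(between_var, p, n_cases):
    """Fraction of observed variation attributable to true surgeon differences.

    between_var: variance of true complication rates across surgeons
    p: average complication rate
    n_cases: number of cases attributed to this surgeon
    """
    within_var = p * (1 - p) / n_cases   # binomial sampling noise for one surgeon
    return between_var / (between_var + within_var)

# With few attributed cases, most of what a scorecard "sees" is noise.
for n_cases in (20, 100, 500):
    r = reliability(between_var=0.001, p=0.05, n_cases=n_cases)
    print(f"{n_cases:4d} cases -> reliability {r:.2f}")
```

With these made-up inputs, 20 cases yield a reliability around 0.30 (mostly noise) while 500 cases yield around 0.91 (mostly signal) – one reason low-volume surgeons are the hardest to report on fairly.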

4. We can’t know data is bad until it sees the light of day. I applaud ProPublica in their efforts to get this data out and noticed. To your point, they got the ball rolling. Like science, this is a process. No theory is complete and right out of the gate; it is the process of refinement and discovery that matters. It’s a stepping stone, but the direction of the path is what matters, toward the light.

5. My problem with this line of thinking is that it lowers the bar on what is deemed acceptable research, down into the gutter. Just because we have some information, which by all accounts is insufficient for drawing accurate conclusions, doesn’t mean that we should call it “data” and bestow scientific meaning on it. And just because we don’t have anything better, doesn’t mean that we should create a fear mongering campaign, complete with ominous trailer videos, to promote it to an unsuspecting public. Personally, I question the motives of this “effort”…

    • No, this does not “lower the bar on what is deemed acceptable research, down into the gutter.” The data is not “by all accounts … insufficient for drawing accurate conclusions.”

      To say that would be to say that the worst surgeon in the survey has a random chance of actually being the best, that patients and their families could learn nothing — or worse, learn things that are false — from looking at this survey.

      As Jha admits, the data is noisy and there are problems with the methodology … but it is the best we have at the moment, and its publication may and should spur true efforts to find better methods.

This is a problem across healthcare. It is intolerable that there are problems across healthcare – problems that everyone in healthcare or in that specialty usually knows about – that we as patients and consumers cannot discover no matter how diligent we are. The lack of transparency of all kinds across healthcare is probably the biggest bar to making it better and cheaper. There is no excuse for it, and it must end.

      • I agree that patients need better information and the sooner the better. I cannot agree with publishing inaccurate data analysis as a method to pressure the delivery system into counter publishing their own analysis.
        Why can’t we just fund the right research, instead of dumping a bunch of data into the lap of some news outlet and have them figure it out?

• Mostly clicks, and also the very fashionable (and seemingly well funded) bandwagon of cutting the medical profession down to size. I wouldn’t oppose the latter if the control were transferred to patients, but it is most definitely not, and I would rather have one million individual doctors make independent decisions than a handful of too-big-to-fail corporate monopolies calling the shots.

        • 1. Data journalism is not better funded than the multitude of medical societies, certainly not for a few clicks.
          2. Data for better decision-making isn’t transferring control to patients?
          3. A million individual doctors should release their data.
          4. What corporate monopolies?

          To Mike’s point, the problem of “bad data” is easily solved by the even better-funded critics who lament the bad data. They can easily get it and share it….if they so choose.

          But they won’t. Why? Because the bad data isn’t actually the issue, the measuring is.

          • I apologize for the lack of clarity in my previous comment, and for the length of what’s to follow….

Data journalism is an interesting concept, but I am not entirely sure how this trailer for the surgeon “report card” fits in with objective data journalism, or any journalism: https://youtu.be/mdQJMeLnwYw

            Data for better decision making is great if the data is clean and if people are at liberty to make decisions.

            How do people pick surgeons? You don’t wake up one morning and decide to shop for someone to remove your gallbladder. Chances are another physician will refer you to a surgeon, and if you are lucky you may be given a few choices. And if you are really lucky, your doctor will be someone you can trust to make a good recommendation.

            Rich and educated people who read ProPublica may have the luxury of picking any surgeon they want.

            Most people are in managed care, narrow networks or referred by big box physicians to the surgeons sanctioned by the system or by the plan. If it’s one of those tiered or reference pricing plans, and you are poor, you will go to the cheapest one, no matter what the newspaper says. For most people, having information is not the same thing as having power.

It’s nice to know how Yelp or ProPublica rates the surgeon and hospital, but let’s not kid ourselves about patients’ freedom to make decisions. Insurance plans (said monopolies) decide what meds you can take, what tests and procedures you can have and where you can have them done, and the less you trust your doctor, the better the insurer can manage your expectations or lack thereof.

Videos like the one above go a long way towards establishing an image of doctors as incompetent, callous, dangerous, out to kill you, etc. Add to that the “dollars for docs” headlines, and subsequent exercises in data journalism on the same subject, and one may be tempted to ask whether the term journalism is applicable here.

            It may very well be that ProPublica folks are sincerely trying to fight the mythical big bad paternalistic and secretive doctor on behalf of “consumers”, but the unintended consequences are to diminish the power of the only advocate vulnerable patients used to have in this horrific system, and once it’s gone, it’s gone for good.

• This post makes very little sense to me. The data are pretty good. Imperfect, but pretty good. By the way, we’ve been using essentially the same claims-based data, with a very similar approach to risk-adjustment, for hospitals – and docking hospitals’ Medicare payments when they have high readmission rates. That’s a much bigger deal. I have not seen a lot of concern about how terrible the data are or how terrible the models are. The ProPublica stuff is pretty good.

      • If my memory is correct, I believe people, including the author, did complain about readmissions penalties being unfair to hospitals that serve vulnerable populations.

        As to the data, this has been litigated extensively following the ProPublica publication, so I would assume that there is no need for me to repeat all the points and counterpoints here. Let me just say, that I would not be surprised if upon compilation of all surgeries (not just inpatient Medicare FFS) and proper assessment of what is or is not an avoidable (by the surgeon) complication, the “scorecards” would look significantly different.

        I do agree with Michael that it would be nice for surgeons or hospitals to publish clean and well researched and reviewed data. I just don’t think that publishing questionable information in order to rattle the cages and spur them into taking corrective action is the best strategy for achieving “transparency”.

  6. Ashish
Why not dispense with point estimates then, and use a less refined presentation style? Wait for better data and tune up the analysis for the next iteration.

    If it’s the outliers we want (“the killers”), let’s begin there. We can at least moderate the type I and II effects (misunderstanding, etc.) by simplifying how we view the rankings. For now.


    • Thanks Brad. The issue is — what better data do you want to wait for, and when might we exactly get it? I’m fine with just an outlier analysis — but remember, since we can never be perfect, are we OK with labeling a lot of people as bad outliers when they aren’t? That’s what consumers would want — they would prefer to miss out on a good surgeon much more than get a bad surgeon. Our current approach flips that.

      • By better data, I mean better process (RA and RAND recs).

        On outliers, I think we can agree regardless of approach, we won’t achieve perfection. I’m with you there. It’s just how “less perfect” can we be? Would I be willing to sacrifice sensitivity for specificity to cite surgeons whose records we can agree flashed warning signals? For now, yes. However, that’s only a first step. And it’s progress.

Given that the NYS CABG study captured a much smaller group of docs, you had an easier time analyzing. Has your thinking changed, though, on the transparency brought forth on them? Did you feel comfortable at the time revealing the “bad apples?”

        • Brad — to your last question of whether I feel comfortable revealing bad apples, I think I did an inadequate job explaining my thinking, so here it is:

1. Until now, you and I (as doctors) had the privilege of being the only ones who knew the bad apples. That era is over (and I’m not shedding tears). It’s irrelevant whether I feel comfortable. The world demands it.

          2. In the new world, in response to this demand, lots of people will do their own ratings of what is a good doctor and what is a bad doctor. Some of those ratings will be terrific. Others will be awful. We, the experts, won’t get to arbitrate that. Democratization of information.

          3. All we can do is to use the ratings that are coming out, to the extent that they are inadequate, to advocate for better data, better methods, etc. If we don’t, we’ll just have a big mess and it’ll take a long time to sort out what information is signal and what is just noise.

          So my plea to my friends is — I get your concerns. I share them. But the answer to inadequate data and inadequate methods is better data, better methods. Transparency has left the station. Make sure it goes in the right direction.

          • As always Ashish, thanks for being a gent and weighing in. Quick response and a final question for you.

            Just to clarify, by feeling comfortable, I meant targeting the “bad apples” with your data precision, not so much temperamentally in how you feel about the reveal. I’m with you, though, i.e., if we can identify a 007, so be it and let the chips fall where they may.

            I won’t win over any fans for offering this view, but nonetheless, it’s one worth mentioning. The overarching theme of report cards is to protect the public. The common good rules. Playing somewhat of a devil’s advocate here though, what about the wrongly identified doc whose career gets trampled? The silent victim never gets discussed, and I think some of the pushback and outrage from the ProPublica release on the doc side emanates from the same place.

            I don’t expect the public to regret a false positive in exchange for spared lives and suffering, but what about those individuals? Do we owe them anything ex-post? How do we make them whole again?
