Is ProPublica the Paul Revere of Transparency?

Recently, I was speaking with a “less is more” advocate. He used his superior knowledge of statistics – he had an MPH – to debunk randomized controlled trials. We discussed overdiagnosis, overtreatment, and the shakiness of medical sciences.

We spoke about measuring the quality of physicians. I remarked that quality metrics have as much evidence as Garcinia Cambogia – we had just laughed about Dr. Oz. I expected a chuckle. Instead, he became distinctly uncomfortable and, in a solemn tone, lectured me about the Institute of Medicine (IOM) report, “To Err is Human.”

The physician, a bulldog of evidence-based medicine (EBM), had a blind spot. He ripped cardiologists for overusing pacemakers. He believed in the usefulness of the physician quality reporting system. He disdained big pharma for pushing statins. He was a fan of maintenance of certification. He was at once a raging skeptic and a true believer.

My understanding of statistics is modest compared to his. But I am skeptical by nature. I’m skeptical of many things including (not necessarily in this order): statins in 65-year-olds, kumbaya, hellfire, England’s soccer team, quality metrics, screening (much to the chagrin of my radiology colleagues), high priests, middle priests, hard drives spontaneously combusting, and futurists. I’d like to believe this is because I’m a dark knight searching for the truth. The reality is that I’m just a cynical git who was raised on an island where it rains without remorse. My skepticism manages to offend a unique human every day.

There is, however, a schism between the world of drugs and devices and the world of quality and value. Towards the former, scrutiny is celebrated – the mark of a person of science. Towards the latter, scrutiny is frowned upon – the mark of an uncooperative physician who doesn’t get it. How has one corner of healthcare sciences become kryptonite for many rational skeptics of contemporary medicine?

I believe this is a clash of moral purpose. The ongoing controversy about ProPublica’s Surgeon Scorecard (SS) has exposed this tension. This is marvelously articulated by Ashish Jha, a physician and researcher at the Harvard Medical School. Ashish has courageously defended ProPublica’s efforts publicly. Courage is the trait I admire most. Of course, I expect nothing less of a “Jha.”

Here are some points of tension in the debate about the SS.

Bad data is better than no data

That the data used by ProPublica to score surgeons is imperfect can scarcely be contested. That perfection is the enemy of good is a truism worth remembering. The long view is that scoring surgeons will lead to transparency, yielding better data and a positive change in culture.

Is ProPublica then the Paul Revere of Transparency? The honest arbiter must surely concede this possibility, even as he must caution that this may not be so. Time will tell and time will need time to tell. However, to assert this possibility dogmatically is to assert a belief – a belief which is grounded not in evidence – it can’t be – but in hope.

Statistical (in)significance of surgeons

Ashish draws our attention to a meme which misleads many people: p must be less than 0.05. This is based on our historical tolerance of two types of errors – falsely concluding a drug doesn’t work and falsely concluding a drug works. EBM emerged on the heels of excessive therapeutic optimism in medicine. Thus, we chose to err towards therapeutic pessimism: better ten useful drugs be canned than one useless drug be brought to the market.

The SS can mislabel both a good surgeon and a bad surgeon. It has been argued, plausibly, that the significance level applied to drugs does not apply to surgeons. That is, it is better to mislabel a good surgeon as bad than a bad surgeon as good, because a bad surgeon can harm her patients. Forced to choose between saving the reputation of a good surgeon and the life of a patient, I would choose the patient. Wouldn’t you?

Alas, this is not so simple. No surgeon is an island. Let’s ignore the toll inflicted on the mislabeled good surgeon and her family – we are depriving patients, current and future, of her services. In a finite system we face a trade-off. This is not a trade-off in the strictest sense because it has the same unit on both sides. That is, we are harming patients who will never see the mislabeled good surgeon to save patients from a correctly labeled bad surgeon. This may still yield a net positive balance – that is, we may save more patients than we harm. But this is a close algebra, at best.

K2 could be taller than Everest

You’re determining whether mountains in the Himalayas or the Karakoram Range are taller, on average. You measure the height of twenty mountains at random. There is uncertainty whether the mountains sampled are a true representation of the range.

But you know that the unit of measurement – foot – measures tallness. That is, you know that a 30-foot structure is taller than a 15-foot structure. What if the unit of measurement was not always an attribute of tallness? What if a 30-foot structure could be smaller than a 15-foot structure? Another uncertainty is now introduced. You would no longer know that Everest is the tallest mountain in the world, and not K2. Imagine the injustice to Mt. Everest.

This is precisely the problem with the SS. The unit of measurement is frequently not an attribute of what it measures. The surgeon’s scorecard is not to a surgeon’s performance what a foot is to the height of a mountain.

Many misunderstand the error this introduces. It is not that a truly bad surgeon could randomly be a truly good surgeon. It is that a bad score may not mean a bad surgeon and a good score may not mean a good surgeon. In probability speak, may not = 1-may.

What’s your alternative? Science

A common riposte to the criticism of quality metrics is “what’s your alternative?” Since “do not waste time using useless metrics” is not an option, the skeptic, for lack of a less-worse metric, is shamed into silence. The response assumes that measuring quality is such an imperative that something is always better than nothing. This assumption, whether justified or not, hurts the science of measuring quality.

Consider a vaccine for malaria that does not work. Is its futility diminished by the absence of an alternative? Is it not useful or useless for its own sake? What if the developers of the vaccine said: “what’s your alternative to this vaccine which does not work? Enlighten me or shut up.” But the researchers do not say that. Which is why drugs improve, vaccines arise, but quality metrics risk not improving.

Criticism without alternatives can be justifiably dismissed. It would have been idiotic to argue about the optimal size of a lifeboat when the Titanic was sinking. Is the quality movement directing us to lifeboats for a sinking ship or is it designing a better ship? If the latter, which is my understanding, quality metrics need more science and less evangelism.

Science requires that a quality metric must submit itself to the same scrutiny as a drug or a medical device. This means that its developers must assume the burden of proof. Just as the onus is not on me to prove that a drug does not work, but on pharma to prove that it works, the burden does not fall on skeptics to prove the futility of the metrics. The burden is on proponents of the metrics to prove their usefulness, rather than reflexively referring the skeptic to the IOM report on medical errors.

We eschew subjectivity for its uncertainty. For objectivity we need science. Science progresses by embracing uncertainty. But uncertainty is anathema to the certainty we wish to exude with quality metrics. This is the Catch-22 of measuring quality.

The proponents of measuring quality must decide if quality is a mass movement or a science. If a movement, it needs advocacy. It must exude self-evident nobility. It must attract true believers. It must summarily dismiss the skeptics. This is not a forlorn strategy. The movement cannot, though, simultaneously demand the deference of science and enjoy the carapace of an art critic.

What of ProPublica? Will its legacy be the advancement of the science of quality or the emboldening of its movement? Regardless, it has its own ethical question to address which, ironically, is one of transparency.

Paul Revere, from what I understand, led by example. So can ProPublica. It can add a disclaimer to its scorecard. “This is work in progress. The numbers may be misleading, particularly for surgeons whose complications we do not fully track.” What can be more transparent than articulating the limitations of one’s methods?

Surely, transparency begins at home.

About the Author

Saurabh Jha is skeptical by nature not because he hates you. He can be reached on Twitter @RogueRad








24 replies »

  1. I am a doctor with a cash practice and a strong online reputation with patients. I practice outside of the mainstream because if I followed evidence-based guidelines I would not be able to help my patients. I am diagnosing and successfully treating patients whose conditions have persisted despite usual, even excellent medical care. What I am doing is evidence-based. In a sense, I should have nothing to worry about. The irony is, the kind of care that I am providing needs to be mainstreamed, yet the mainstream has no place for me. Getting positive attention from patients is not my problem. I am open to Mr. Millenson’s suggestions as to what metrics I should be tracking to gain the positive attention of mainstream medicine.

  2. “I think a very legitimate question is why licensing (& training) doesn’t catch the outliers. My guess is it may have something to do with the Labor laws.”

    I think it has more to do with doctor culture and the system keeping nurses silent. Docs are reluctant to call out another doc because they feel, “but for the grace of God…” From what I have read, most errors occur with a small % of docs, so how do they continue if colleagues are assumed to speak out?

  3. Thanks for reading. You nailed it with “outcomes that really matter to patients.”

  4. Thanks Michel! A compliment from an Austrian isn’t easily earned. I think a very legitimate question is why licensing (& training) doesn’t catch the outliers. My guess is it may have something to do with the Labor laws.

  5. Thank you Steven. I agree that medicine is often practiced without due diligence to science. Michael’s point about hospitals releasing data is a good one. It deserves serious attention, which is why it’s important to get these metrics correct at a local level.

  6. Steven,

    That there is waste in the system, few will argue. That the waste can be identified at all on the basis of “metrics,” that’s precisely what’s in dispute. There are presuppositions that come with the notion of quality measures, and those presuppositions are willfully or naively glossed over.


  7. “In my experience as an in-house physician it is the surgical nurses who best know the technical excellence of the surgeon and the recovery room nurses who can best assess the overall surgical effort.”

    As a nurse’s husband I’d agree. Nurses interact every day. Why not require the nurses to produce an evaluation score card after each surgery, then get it reviewed by the hospital and surgical team?

    Trying to get the “best” is very subjective and there’d be very long lines getting there. I’d rather get the consistently “competent” who continually have their skills reviewed and techniques improved by fellow competents.

  8. Why should ProPublica be “applauded for their work?” Is it the thought that counts? Or should you judge what it purports to be: i.e. is it a good tool, or not? Is a hammer made of macaroni better than no hammer at all? P.S. Have you watched their promo video yet? They’re not making tentative claims.

  9. Clearly a provocative (and clever) piece. And good dialogue here. I concur with Dr. Jha that there’s lax science and implementation in the world of quality measurement, and a clear need for qualifiers when that data is publicly released. I disagree that we are applying lower standards to quality measurement than to the approval of drugs and devices and the application of EBM to the practice of medicine in general. It seems to me they are on about the same plane. Which is to say that on any given day from 30% to 50% of what happens in medicine and healthcare is poor science, and unnecessary, inappropriate, excessive, or suboptimal care. The science-base and implementation of quality metrics is at about the same level, circa 2015. ProPublica should be applauded for their work, but I agree a stronger contextual disclaimer is warranted on their web site. I agree with Michael Millenson: docs should stop complaining about imperfect measures, track what they do with utter honesty, individually and in the groups and venues in which they practice, and commit to 100% transparency. With so much evidence over 25 years that so much care is inappropriate or ineffective, or even harmful, it’s no longer an option not to demand accountability from providers even as we work steadily and hard to improve the science, the care and the mechanisms of that accountability.

  10. Patients in this country are still able to vote – however imperfectly – with their feet. The Surgeon Scorecard was supposed to be a tool to aid the decision-making process. However, as we both know it turned out to be an unhelpful tool, yet was promoted with a melodramatic video, and, well, zealotry: https://www.youtube.com/watch?t=2&v=mdQJMeLnwYw

  11. Great piece (but I’ve come to expect nothing less from Dr. Jha).

    One aspect of this question that is not often raised is the following: if the need to identify the good doctors is so pressing (and I believe it is), then there must be something inherent in the system that is keeping that obviously important information from surfacing. Funny that the whole healthcare system was launched by a measure (licensing) that promised exactly the opposite outcome, i. e., that the bad apples would be weeded out.

  12. Hmm. I’ve talked to enough scientists to know that political considerations (economic, advocacy, internal) are key drivers and blockers of academic research.

  13. Mr. Millenson, long an erstwhile healthcare journalist, demonstrates Jha’s point about the camp that settles for bad science for lack of an alternative. Transparency is step 1 but step 1 alone is not enough. Mr. Millenson well knows how nuances matter in the ratings of human beings as good or bad. We need a methodology that is able to distinguish good science from advocacy, from movement propaganda.

    As a physician whose weeks are spent mopping up after missed diagnoses, I am thoroughly disillusioned by the limits of “cause and effect, p<0.05" science (aka evidence-based medicine). The complexities of human biology far exceed our methods of describing them. Dr. Jha's post is pure brilliance.

    I look forward to the day when physicians are rated by the outcomes that mean the most to patients, as opposed to cookbook stats that mislabel the best efforts of human beings devoted to patient care.

  14. John, in my experience, advocacy and science are often at loggerheads with one another, perhaps because one requires a much more emotive tone than the other is supposed to allow.

  15. Thanks for reading. There is a middle ground here. Why not encourage all surgeons to post a video demonstrating their technical expertise, even on a phantom? One study showed that raters can reasonably predict outcomes based on watching videos of surgeons operating. We should focus first on the dangerous outlier.

  16. Thanks Margalit. In many countries primary care physicians are tasked with the responsibility of recommending the best surgeon. They have incentives not to be too wrong as they’d end up picking some of the complications. I think many problems here can be solved by strengthening primary care.

  17. Perfect. Let’s add OR nurses to the mix. I’m thinking something like this would be nice:
    Tooth transplants
    Dr. John Hunt
    Physician-patients (including family members) = 68
    Nurse/PA-patients (including family members) = 5
    Physicians = 125
    OR Nurses = 10

    There is only ONE question: Would you use this surgeon if you needed a tooth transplant? And two answers only: a) Yes, and I or a loved one used him in the past b) Yes

    That would be more useful than the public could ever wish for….and you never have to say anything bad about a colleague…

  18. In my experience as an in-house physician it is the surgical nurses who best know the technical excellence of the surgeon and the recovery room nurses who can best assess the overall surgical effort. The surgeon’s doctor assistant can evaluate pretty well also, but has fewer examples than a nurse who is in the OR every day. And don’t forget the patient’s view of his/her experience and outcome. And the stats from the health records….I guess…although I would trust an OR nurse over these. The way docs behave in meetings and conferences tells other docs a lot about their competence, but it is not as valid as the day-to-day views by the nurses. That is because we put on airs in front of other doctors (surprise!).

  19. Love the mountains example, Dr. Jha! As to the main point, I don’t think that a disclaimer at the bottom of an article is nearly enough to justify the publication of ratings, based on incomplete data, formatted to look “scientific” to the untrained observer. I don’t think partial data is better than nothing either, because as you aptly describe, and as history showed again and again, selectively picked bits and pieces of information are more often than not woefully misleading.

    The facts here are that these data do not include all surgeons, do not include all surgeries, do not include all “complications” and may include events that were not real complications. The inclusions and exclusions are not random, but even if they were, the best one could conclude from this is an aggregate rate of so called complications, which lo and behold, seems pretty low to me. But that would make little if any news.

    Leaving ProPublica aside, and referring to statements repeatedly made by Dr. Jha (the other Dr. Jha :-)), implying that physicians know or can easily obtain information about competence of surgeons for their own private needs, I have a simple question: if we really want to help the “public” pick good surgeons, why not share that information? What sort of argument is that when a highly respected doctor tells the public that he “knows” how to pick a surgeon, but for all of you simpletons out there, here are some half-true, semi-useful hints, and that’s the best we can do for you now.

    So as a techie geek, I would suggest that somebody creates a nice (free) website where physicians who “know” who the best surgeons are can share this information with the public they are so wanting to help. You can do it anonymously. You can limit the recommendations to only surgeons you or friends and family have actually used (it can be built so gaming is highly unlikely). Just put a check mark next to the one you “know” is good. No need to even mention the ones you know are bad. Think of it as a public service… Maybe a slightly different type of “transparency”….. the useful kind.

  20. I think we have three questions:

    1). Is it good journalism?

    2). Is it good science?

    3). Is it good political advocacy?

    I would argue yes to 1 and to 3. Maybe not so much for 2.

    We’ve never combined 1,2,3 before.

  21. Nice post, but as I said in response to Ashish: how about surgeons/hospitals releasing meaningful clinical data as opposed to complaining. Let’s just start with, say, all hospitals in the capital of intellectualizing about quality and safety statistics (Boston) releasing volume data by surgeon, infection data by surgical unit and whatever outcomes data they can agree upon. Granular stuff that lets you compare the most distinguished surgeons/hospitals in Boston to…everyone else. How about it?