Recently, I was speaking with a “less is more” advocate. He used his superior knowledge of statistics – he had an MPH – to debunk randomized controlled trials. We discussed overdiagnosis, overtreatment, and the shakiness of medical sciences.
We spoke about measuring the quality of physicians. I remarked that quality metrics have as much evidence as Garcinia Cambogia – we had just laughed about Dr. Oz. I expected a chuckle. Instead, he became distinctly uncomfortable and, in a solemn tone, lectured me about the Institute of Medicine (IOM) report, “To Err is Human.”
The physician, a bulldog of evidence-based medicine (EBM), had a blind spot. He ripped cardiologists for overusing pacemakers. He believed in the usefulness of the physician quality reporting system. He disdained big pharma for pushing statins. He was a fan of maintenance of certification. He was at once a raging skeptic and a true believer.
My understanding of statistics is modest compared to his. But I am skeptical by nature. I’m skeptical of many things including (not necessarily in this order): statins in 65-year-olds, kumbaya, hellfire, England’s soccer team, quality metrics, screening (much to the chagrin of my radiology colleagues), high priests, middle priests, hard drives spontaneously combusting, and futurists. I’d like to believe this is because I’m a dark knight searching for the truth. The reality is that I’m just a cynical git who was raised on an island where it rains without remorse. My skepticism manages to offend a unique human every day.
There is, however, a schism between the world of drugs and devices and the world of quality and value. Towards the former, scrutiny is celebrated – the mark of a person of science. Towards the latter, scrutiny is frowned upon – the mark of an uncooperative physician who doesn’t get it. How has one corner of healthcare sciences become kryptonite for many rational skeptics of contemporary medicine?
I believe this is a clash of moral purpose. The ongoing controversy about Propublica’s Surgeon Scorecard (SS) has exposed this tension. This is marvelously articulated by Ashish Jha, a physician and researcher at the Harvard Medical School. Ashish has courageously defended Propublica’s efforts publicly. Courage is the trait I admire most. Of course, I expect nothing less of a “Jha.”
Here are some points of tension in the debate about the SS.
Bad data is better than no data
That the data used by Propublica to score surgeons is imperfect can scarcely be contested. That perfection is the enemy of good is a truism worth remembering. The long view is that scoring surgeons will lead to transparency, which in turn will yield better data and a positive change in culture.
Is Propublica then the Paul Revere of Transparency? The honest arbiter must surely concede this possibility, even as he must caution that this may not be so. Time will tell and time will need time to tell. However, to assert this possibility dogmatically is to assert a belief – a belief which is grounded not in evidence – it can’t be – but in hope.
Statistical (in)significance of surgeons
Ashish draws our attention to a meme which misleads many people: p must be less than 0.05. This is based on our historical tolerance of two types of errors – falsely concluding a drug doesn’t work and falsely concluding a drug works. EBM emerged on the heels of excessive therapeutic optimism in medicine. Thus, we chose to err towards therapeutic pessimism: better that ten useful drugs be canned than that one useless drug be brought to market.
The SS can mislabel both a good surgeon and a bad surgeon. It has been argued, plausibly, that the significance level applied to drugs does not apply to surgeons. That is, it is better to mislabel a good surgeon as bad than to mislabel a bad surgeon as good, because a bad surgeon can harm her patients. Forced to choose between saving the reputation of a good surgeon and the life of a patient, I would choose the patient. Wouldn’t you?
Alas, this is not so simple. No surgeon is an island. Let’s ignore the toll inflicted on the mislabeled good surgeon and her family – we are depriving patients, current and future, of her services. In a finite system we face a trade-off. This is not a trade-off in the strictest sense because it has the same unit on both sides. That is, we are harming patients who will never see the mislabeled good surgeon to save patients from a correctly labeled bad surgeon. This may still yield a net positive balance – that is, we may save more patients than we harm. But this is a close algebra, at best.
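To see how close the algebra can be, here is a minimal sketch in Python. Every number in it is invented for illustration; none comes from the Surgeon Scorecard or from any study. The function name `net_patients_spared` and its parameters are hypothetical. It assumes that a flagged surgeon loses her patients, that each flagged bad surgeon spares some number of patients from harm, and that each flagged good surgeon costs some number of patients who end up worse off elsewhere.

```python
def net_patients_spared(n_good, n_bad, false_positive_rate, sensitivity,
                        spared_per_bad_flagged, harmed_per_good_flagged):
    """Net patients spared harm by a scorecard (positive means net benefit).

    All inputs are hypothetical assumptions, not measured quantities.
    """
    good_flagged = n_good * false_positive_rate   # good surgeons mislabeled as bad
    bad_flagged = n_bad * sensitivity             # bad surgeons correctly labeled
    saved = bad_flagged * spared_per_bad_flagged
    harmed = good_flagged * harmed_per_good_flagged
    return saved - harmed


# Two made-up scenarios that differ only in how often good surgeons are mislabeled.
for fpr in (0.10, 0.15):
    net = net_patients_spared(n_good=900, n_bad=100,
                              false_positive_rate=fpr, sensitivity=0.5,
                              spared_per_bad_flagged=4,
                              harmed_per_good_flagged=2)
    print(f"false positive rate {fpr:.2f}: net patients spared = {net:.0f}")
```

Under these assumed inputs the balance is slightly positive in the first scenario and negative in the second. A modest change in how often good surgeons are mislabeled is enough to flip the sign, which is the sense in which the algebra is close.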
K2 could be taller than Everest
You’re determining whether mountains in the Himalayas or the Karakoram Range are taller, on average. You measure the height of twenty mountains at random. There is uncertainty whether the mountains sampled are a true representation of the range.
But you know that the unit of measurement – foot – measures tallness. That is, you know that a 30-foot structure is taller than a 15-foot structure. What if the unit of measurement was not always an attribute of tallness? What if a 30-foot structure could be shorter than a 15-foot structure? Another uncertainty is now introduced. You would no longer know that Everest is the tallest mountain in the world, and not K2. Imagine the injustice to Mt. Everest.
This is precisely the problem with the SS. The unit of measurement is frequently not an attribute of what it measures. The surgeon’s scorecard is not to a surgeon’s performance what a foot is to the height of a mountain.
Many misunderstand the error this introduces. It is not that a truly bad surgeon could randomly be a truly good surgeon. It is that a bad score may not mean a bad surgeon and a good score may not mean a good surgeon. In probability speak, may not = 1-may.
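A rough way to put numbers on "a bad score may not mean a bad surgeon" is Bayes' rule. The sketch below is illustrative only: the function name `prob_bad_given_bad_score` is hypothetical, and the prevalence, sensitivity, and specificity are assumed values, not estimates of how the SS actually performs.

```python
def prob_bad_given_bad_score(prevalence, sensitivity, specificity):
    """Probability that a surgeon with a bad score is truly bad (Bayes' rule).

    prevalence:  assumed share of surgeons who are truly bad
    sensitivity: assumed chance a truly bad surgeon gets a bad score
    specificity: assumed chance a truly good surgeon gets a good score
    """
    true_flags = prevalence * sensitivity
    false_flags = (1 - prevalence) * (1 - specificity)
    if true_flags + false_flags == 0:
        raise ValueError("no surgeon would ever receive a bad score")
    return true_flags / (true_flags + false_flags)


# A score that tracks quality moderately well (assumed numbers).
print(prob_bad_given_bad_score(prevalence=0.1, sensitivity=0.6, specificity=0.8))

# A score unrelated to quality: good and bad surgeons are flagged equally often.
print(prob_bad_given_bad_score(prevalence=0.1, sensitivity=0.2, specificity=0.8))
```

With the first set of assumptions, roughly one in four surgeons with a bad score is truly bad. With the second, where the score flags good and bad surgeons at the same rate, a bad score tells you nothing beyond the base rate: the unit of measurement is no longer an attribute of what it measures.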
What’s your alternative? Science
A common riposte to the criticism of quality metrics is “what’s your alternative?” Since “do not waste time using useless metrics” is not an option, the skeptic, for lack of a less-worse metric, is shamed into silence. The response assumes that measuring quality is such an imperative that something is always better than nothing. This assumption, whether justified or not, hurts the science of measuring quality.
Consider a vaccine for malaria that does not work. Is its futility diminished by the absence of an alternative? Is it not useful or useless for its own sake? What if the developers of the vaccine said: “what’s your alternative to this vaccine which does not work? Enlighten me or shut up.” But the researchers do not say that. Which is why drugs improve, vaccines arise, but quality metrics risk not improving.
Criticism without alternatives can be justifiably dismissed. It would have been idiotic to argue about the optimal size of a lifeboat when the Titanic was sinking. Is the quality movement directing us to lifeboats for a sinking ship or is it designing a better ship? If the latter, which is my understanding, quality metrics need more science and less evangelism.
Science requires that a quality metric must submit itself to the same scrutiny as a drug or a medical device. This means that its developers must assume the burden of proof. Just as the onus is not on me to prove that a drug does not work, but on pharma to prove that it works, the burden does not fall on skeptics to prove the futility of the metrics. The burden is on proponents of the metrics to prove their usefulness, rather than reflexively referring the skeptic to the IOM report on medical errors.
We eschew subjectivity for its uncertainty. For objectivity we need science. Science progresses by embracing uncertainty. But uncertainty is anathema to the certainty we wish to exude with quality metrics. This is the Catch-22 of measuring quality.
The proponents of measuring quality must decide if quality is a mass movement or a science. If a movement, it needs advocacy. It must exude self-evident nobility. It must attract true believers. It must summarily dismiss the skeptics. This is not a forlorn strategy. The movement cannot, though, simultaneously demand the deference of science and enjoy the carapace of an art critic.
What of Propublica? Will its legacy be the advancement of the science of quality or the emboldening of its movement? Regardless, it has its own ethical question to address which, ironically, is one of transparency.
Paul Revere, from what I understand, led by example. So can Propublica. It can add a disclaimer to its scorecard: “This is a work in progress. The numbers may be misleading, particularly for surgeons whose complications we do not fully track.” What can be more transparent than articulating the limitations of one’s methods?
Surely, transparency begins at home.
About the Author
Saurabh Jha is skeptical by nature not because he hates you. He can be reached on Twitter @RogueRad