Measuring the Quality of Hospitals and Doctors: When Is Good Good Enough?

In the past, neither hospitals nor practicing physicians were accustomed to being measured and judged. Aside from periodic inspections by the Joint Commission (for which they had years of notice and on which failures were rare), hospitals did not publicly report their quality data, and payment was based on volume, not performance.

Physicians endured an orgy of judgment during their formative years – in high school, college, medical school, and in residency and fellowship. But then it stopped – or at least it once did. I remember, at the tender age of 29, having just passed “the boards,” the feeling of relief in knowing that my professional work would never again be subject to the judgment of others.

In the past few years, all of that has changed, as society has found our healthcare “product” wanting and determined that the best way to spark improvement is to measure us, to report the measures publicly, and to pay differentially based on these measures. The strategy is sound, even if the measures are often not.

Hospitals and doctors, unaccustomed to being rated and ranked like resort hotels and American Idol contestants, are suffering from performance anxiety and feeling an intense desire to be left alone. But we also bristle at the possibility of misclassification: to be branded a “B” or a “C” when you’re really an “A” feels profoundly unjust.

In my role as chair of the American Board of Internal Medicine (ABIM) this year, I am awed by the amount of time and expertise that goes into ensuring that the pass/fail decisions of the Board are valid and defensible (legally, if necessary). They are. But as new kinds of measures spring up, most of them lack the rigor of the certifying boards’ verdicts. For example, Medicare is now penalizing hospitals that have excessive numbers of readmissions. As Harvard’s Karen Joynt and Ashish Jha observed in 2012, there is considerable doubt that the 30-day readmission rate is a valid measure of quality, and clear evidence that its application leads to misclassifications – particularly for penalized hospitals whose sins are that they care for large numbers of poor patients or that they house teaching programs. Quite understandably, these hospitals cry “foul.”

Yet the Medicare fines have contributed to a falling number of readmissions nationally – from 19 percent in 2011 to 17.8 percent in 2012, which represents more than 100,000 patients spared an unpleasant and risky return trip to the hospital. While cause and effect is difficult to prove, it seems likely that hospitals’ responses to the Medicare program (better discharge planning, earlier follow-up appointments, enhanced communication with PCPs, post-discharge phone calls to patients) are playing a role. “Readmissions are not a good quality measure,” Jha observed in a recent blog, “but they may be a very good way to change the notion of accountability within the healthcare delivery system.” Medicare’s Jonathan Blum puts it more bluntly. “I’m personally comfortable with some imprecision to our measures,” he said, as long as the measures are contributing to the ultimate goal of reducing readmissions.
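The scale implied by those two rates can be sanity-checked with back-of-envelope arithmetic. The sketch below uses only the numbers quoted above; the true count of Medicare index discharges is not given in the text, so the final figure is simply what the quoted numbers imply:

```python
# Back-of-envelope check of the readmission figures quoted above.
rate_2011 = 0.190   # national 30-day readmission rate, 2011
rate_2012 = 0.178   # national 30-day readmission rate, 2012
avoided = 100_000   # "more than 100,000 patients" spared a readmission

drop = rate_2011 - rate_2012        # a 1.2-percentage-point decline
implied_discharges = avoided / drop # discharges needed for the math to hold

print(f"rate drop: {drop:.3f}")
print(f"implied index discharges: {implied_discharges:,.0f}")
```

The implied denominator – a bit over eight million discharges – is consistent with the size of the Medicare inpatient population, which is why a seemingly small 1.2-point decline translates into six figures' worth of avoided return trips.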

With Jha and seven other experts, I am an advisor to the Leapfrog Group’s effort to grade hospitals on patient safety. Using the best available publicly reported data, our panel recommended a set of measures and a weighting system that Leapfrog has used to assign patient safety letter grades to U.S. hospitals. The hospitals that have received “F’s” (25 out of the 2,619 hospitals that received ratings) have been up in arms – I’ve received several calls from their representatives, livid about what they believe to be a vast injustice. Yet there is no question that these hospitals are working on improvement with a passion that, in many cases, was previously lacking.
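The mechanics of collapsing many safety measures into one letter grade can be sketched generically. The measure names, weights, and grade cutoffs below are hypothetical illustrations, not Leapfrog’s actual methodology:

```python
# Generic sketch of a weighted composite score mapped to a letter grade.
# All measure names, weights, and cutoffs here are hypothetical.

def composite_score(measures: dict, weights: dict) -> float:
    """Weighted average of standardized measure scores (each on a 0-100 scale)."""
    total_weight = sum(weights[m] for m in measures)
    return sum(measures[m] * weights[m] for m in measures) / total_weight

def letter_grade(score: float) -> str:
    """Map a composite score to a letter grade using illustrative cutoffs."""
    cutoffs = [(90, "A"), (80, "B"), (70, "C"), (60, "D")]
    for cutoff, grade in cutoffs:
        if score >= cutoff:
            return grade
    return "F"

# A hypothetical hospital: strong on process measures, weaker on outcomes,
# with the outcome measure weighted twice as heavily.
scores  = {"hand_hygiene": 95.0, "clabsi_rate": 70.0, "falls": 85.0}
weights = {"hand_hygiene": 1.0,  "clabsi_rate": 2.0,  "falls": 1.0}

s = composite_score(scores, weights)
print(f"composite: {s:.1f} -> grade {letter_grade(s)}")  # composite: 80.0 -> grade B
```

Note how much rides on the weights: doubling the weight of a single weak measure is enough to pull this hypothetical hospital from an “A”-range average down to a “B” – which is precisely why the weighting choices attract so much of the controversy.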

Of course, before getting down to business, everyone’s first two responses to poor grades are to question the validity of the measures and to work on better coding. I know one hospital that received a stellar grade in the Consumer Reports ranking system (one of the several systems now out there), and responded by festooning the hospital lobby and halls with banners. A few months later, when they received a “C” from Leapfrog, their reaction was to inveigh against the methods. This, of course, is natural: we embrace the rankings we like and reject the ones we don’t. But it is largely unproductive.

At a recent conference on transparency, I heard Arnie Milstein, a national leader in assessment and a professor at Stanford, speak about the current state of quality measurement. He described the Los Angeles Health Department’s program that rates restaurants on cleanliness, and mandates that restaurants post large signs with their letter grades (A, B, or C) in their windows. According to Milstein, the measures “would not have passed the National Quality Forum,” the organization that vets healthcare quality measures for scientific rigor. Yet the results were strikingly positive: a 20 percent decrease in patients hospitalized for food poisoning. This raises the central question: “At what point are measures good enough?”

In a 2007 study, Milstein and colleagues asked 1,057 Americans about physician quality measures. Specifically, they asked what level of potential inaccuracy people would tolerate before deciding they would rather not see the results at all. About one in five respondents said that they would want to see a measure even if its rate of misclassification (calling a doctor fair when she is excellent, or vice versa) was as high as 20-50 percent. Another third would not tolerate that degree of uncertainty, but would want access to measures that might be as much as 5-20 percent inaccurate.

Milstein hypothesized that these results might be a manifestation of the public’s famous innumeracy: perhaps these folks didn’t really understand the hazards of relying on such flawed information. So he asked the same question of a group of PhD statisticians at a national meeting. If anything, they were even more tolerant of misclassification risk. “‘P equals less than 0.05’ was nowhere to be seen,” he quipped.

Why were experts and non-experts alike so accepting of misclassification? Milstein came to the conclusion that the measures that they were being offered were better than what they had, which was nothing. Moreover, they probably sensed that public reporting of such measures would not only help them make better choices as consumers, but would also spur the doctors to improve. “Measures can motivate or discriminate,” Yale’s Harlan Krumholz reminded us at the same meeting. And in most cases, they do a bit of both.
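The “better than nothing” intuition can be made concrete with a toy simulation (my illustration, not the study’s method). Suppose a patient must choose between two doctors, one truly excellent and one truly fair, and each published label is independently flipped with some misclassification probability. Following the labels still beats a coin flip at any flip rate below 50 percent:

```python
# Toy model: a patient choosing between two doctors using labels that are
# flipped with some probability. Illustrative only - not from the study.
import random

random.seed(0)

def pick_better(flip_prob: float, trials: int = 200_000) -> float:
    """Fraction of trials in which following the published labels lands the
    patient with the truly better of two doctors."""
    hits = 0
    for _ in range(trials):
        # Doctor 0 is truly excellent; doctor 1 is truly fair.
        label0_good = random.random() >= flip_prob  # correct unless flipped
        label1_good = random.random() < flip_prob   # wrong only if flipped
        if label0_good and not label1_good:
            choice = 0
        elif label1_good and not label0_good:
            choice = 1
        else:
            choice = random.randrange(2)            # labels tie: pick at random
        hits += (choice == 0)
    return hits / trials

for p in (0.05, 0.20, 0.50):
    print(f"flip probability {p:.0%}: better doctor chosen "
          f"{pick_better(p):.1%} of the time")
```

In this toy model the hit rate works out to roughly 1 minus the flip probability – about 80 percent even at a 20 percent misclassification rate, versus the 50 percent a patient gets choosing blindly. The measure only stops helping when it is no better than a coin flip, which is arguably what the survey respondents intuited.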

Does the public’s tolerance for misclassification give measurers – the ABIM, Leapfrog, or Medicare – a free ride on the “Ends-Justify-The-Means” Express? Absolutely not. Measurers need to do their honest best to produce measures with as much scientific integrity as possible, and commit themselves to improving the measures over time. Medicare’s decision to ditch its four-hour door-to-antibiotic time pneumonia measure in the face of evidence of misclassification and unanticipated consequences (antibiotics at triage for everyone with a cough) is a shining example of responding to feedback and new data. In a recent NEJM article, Joynt and Jha recommend a few simple changes, including taking into account patients’ socioeconomic status, that could improve the readmission measure. The trick is to adjust appropriately for such predictors without giving safety net and academic hospitals a pass, since these organizations undoubtedly vary in performance and many have room for improvement.

Now that I have been on both sides of the measurement equation, one thing that has become clear to me is this: Public reporting of quality measures not only improves the work of the measured, it improves the work of the measurer. Ultimately, a healthcare ecosystem in which reasonable measures help guide patient and purchaser choices will lead to improvements in both the quality of care and of the measures themselves. I believe we can look forward to an era of more accurate measures, measures that capture the right things (not just clinical quality but teamwork and communication skills, for example), and measures that are less burdensome to collect and analyze.

If there were a way of getting to this Nirvana without ever unfairly characterizing a physician or hospital as a “C” when she/it is really a “B+”, that would be splendid. Personally, I can’t see how we can manage that. Seen in that light, the question to ask is not, “Are the measures perfect?” (clearly, they’re not) but, “Is the risk of misclassification low enough and the value of public reporting and other policy responses high enough that the measure is good enough to use?” A second, equally important question follows: “Is the measurer committed to listening to the feedback of the public and profession and to responding to the emerging science in an effort to improve the measure over time?”

Measures that do not meet the first criterion should not be used. And organizations that do not meet the second should be ejected from the measurement and judgment business.

Robert Wachter, MD, professor of medicine at UCSF, is widely regarded as a leading figure in the patient safety and quality movements. He edits the federal government’s two leading safety websites, and the second edition of his book, “Understanding Patient Safety,” was recently published by McGraw-Hill. In addition, he coined the term “hospitalist” in an influential 1996 essay in The New England Journal of Medicine and is chair of the American Board of Internal Medicine. His posts appear semi-regularly on THCB and on his own blog, Wachter’s World.