Measuring the Quality of Hospitals and Doctors: When Is Good Good Enough?

In the past, neither hospitals nor practicing physicians were accustomed to being measured and judged. Aside from periodic inspections by the Joint Commission (for which they had years of notice and on which failures were rare), hospitals did not publicly report their quality data, and payment was based on volume, not performance.

Physicians endured an orgy of judgment during their formative years – in high school, college, medical school, and in residency and fellowship. But then it stopped, or at least it used to. I remember the relief I felt, at the tender age of 29 and having just passed “the boards,” knowing that my professional work would never again be subject to the judgment of others.

In the past few years, all of that has changed, as society has found our healthcare “product” wanting and determined that the best way to spark improvement is to measure us, to report the measures publicly, and to pay differentially based on these measures. The strategy is sound, even if the measures are often not.

Hospitals and doctors, unaccustomed to being rated and ranked like resort hotels and American Idol contestants, are suffering from performance anxiety and feeling an intense desire to be left alone. But we also bristle at the possibility of misclassification: to be branded a “B” or a “C” when you’re really an “A” feels profoundly unjust.

In my role as chair of the ABIM this year, I am awed by the amount of time and expertise that goes into ensuring that the pass/fail decisions of the Board are valid and defensible (legally, if necessary). They are. But as new kinds of measures spring up, most of them lack the rigor of the verdicts of the certifying boards. For example, Medicare is now penalizing hospitals that have excessive numbers of readmissions. As Harvard’s Karen Joynt and Ashish Jha observed in 2012, there is considerable doubt that the 30-day readmission rate is a valid measure of quality, and clear evidence that its application leads to misclassifications – particularly for penalized hospitals whose sins are that they care for large numbers of poor patients or that they house teaching programs. Quite understandably, these hospitals cry “foul.”

Yet the Medicare fines have contributed to a falling number of readmissions nationally – from 19 percent in 2011 to 17.8 percent in 2012, which represents more than 100,000 patients spared an unpleasant and risky return trip to the hospital. While causality is difficult to prove, it seems likely that hospitals’ responses to the Medicare program (better discharge planning, earlier follow-up appointments, enhanced communication with PCPs, post-discharge phone calls to patients) are playing a role. “Readmissions are not a good quality measure,” Jha observed in a recent blog, “but they may be a very good way to change the notion of accountability within the healthcare delivery system.” Medicare’s Jonathan Blum puts it more bluntly. “I’m personally comfortable with some imprecision to our measures,” he said, as long as the measures are contributing to the ultimate goal of reducing readmissions.

With Jha and seven other experts, I am an advisor to the Leapfrog Group’s effort to grade hospitals on patient safety. Using the best available publicly reported data, our panel recommended a set of measures and a weighting system that Leapfrog has used to assign patient safety letter grades to U.S. hospitals. The hospitals that have received “F’s” (25 out of the 2619 hospitals that received ratings) have been up in arms – I’ve received several calls from their representatives, livid about what they believe to be a vast injustice. Yet there is no question that these hospitals are working on improvement with a passion that, in many cases, was previously lacking.

Of course, before getting down to business, everyone’s first two responses to poor grades are to question the validity of the measures and to work on better coding. I know one hospital that received a stellar grade in the Consumer Reports ranking system (one of the several systems now out there), and responded by festooning the hospital lobby and halls with banners. A few months later, when they received a “C” from Leapfrog, their reaction was to inveigh against the methods. This, of course, is natural: we embrace the rankings we like and reject the ones we don’t. But it is largely unproductive.

At a recent conference on transparency, I heard Arnie Milstein, a national leader in assessment and a professor at Stanford, speak about the current state of quality measurement. He described the Los Angeles Health Department’s program that rates restaurants on cleanliness, and mandates that restaurants post large signs with their letter grades (A, B, or C) in their windows. According to Milstein, the measures “would not have passed the National Quality Forum,” the agency that vets healthcare quality measures for scientific rigor. Yet the results were strikingly positive: a 20 percent decrease in people hospitalized for food poisoning. This raises the central question: “At what point are measures good enough?”

In a 2007 study, Milstein and colleagues asked 1,057 Americans about physician quality measures. Specifically, they wondered what level of potential inaccuracy people would accept before they would not want to see the results. About one in five respondents said that they would want to see a measure even if its rate of misclassification (calling a doctor fair when she is excellent, or vice versa) was as high as 20-50 percent. Another third would not tolerate that degree of uncertainty, but would want access to measures that might be as much as 5-20 percent inaccurate.

Milstein hypothesized that these results might be a manifestation of the public’s famous innumeracy: perhaps these folks didn’t really understand the hazards of relying on such flawed information. So he asked the same question of a group of PhD statisticians at a national meeting. If anything, they were even more tolerant of misclassification risk. “‘P equals less than 0.05’ was nowhere to be seen,” he quipped.

Why were experts and non-experts alike so accepting of misclassification? Milstein came to the conclusion that the measures that they were being offered were better than what they had, which was nothing. Moreover, they probably sensed that public reporting of such measures would not only help them make better choices as consumers, but would also spur the doctors to improve. “Measures can motivate or discriminate,” Yale’s Harlan Krumholz reminded us at the same meeting. And in most cases, they do a bit of both.

Does the public’s tolerance for misclassification give measurers – the ABIM, Leapfrog, or Medicare – a free ride on the “Ends-Justify-The-Means” Express? Absolutely not. Measurers need to do their honest best to produce measures with as much scientific integrity as possible, and commit themselves to improving the measures over time. Medicare’s decision to ditch their four-hour door-to-antibiotic time pneumonia measure in the face of evidence of misclassification and unanticipated consequences (antibiotics at triage for everyone with a cough) is a shining example of responding to feedback and new data. In a recent NEJM article, Joynt and Jha recommend a few simple changes, including taking into account patients’ socioeconomic status, that could improve the readmission measure. The trick is to adjust appropriately for such predictors without giving safety net and academic hospitals a pass, since these organizations undoubtedly vary in performance and many have room for improvement.

Now that I have been on both sides of the measurement equation, one thing that has become clear to me is this: Public reporting of quality measures not only improves the work of the measured, it improves the work of the measurer. Ultimately, a healthcare ecosystem in which reasonable measures help guide patient and purchaser choices will lead to improvements in both the quality of care and of the measures themselves. I believe we can look forward to an era of more accurate measures, measures that capture the right things (not just clinical quality but teamwork and communication skills, for example), and measures that are less burdensome to collect and analyze.

If there were a way of getting to this Nirvana without ever unfairly characterizing a physician or hospital as a “C” when she/it is really a “B+”, that would be splendid. Personally, I can’t see how we can manage that. Seen in that light, the question to ask is not, “Are the measures perfect?” (clearly, they’re not) but, “Is the risk of misclassification low enough and the value of public reporting and other policy responses high enough that the measure is good enough to use?” A second, equally important question follows: “Is the measurer committed to listening to the feedback of the public and profession and to responding to the emerging science in an effort to improve the measure over time?”

Measures that do not meet the first criterion should not be used. And organizations that do not meet the second should be ejected from the measurement and judgment business.

Robert Wachter, MD, professor of medicine at UCSF, is widely regarded as a leading figure in the patient safety and quality movements. He edits the federal government’s two leading safety websites, and the second edition of his book, “Understanding Patient Safety,” was recently published by McGraw-Hill. In addition, he coined the term “hospitalist” in an influential 1996 essay in The New England Journal of Medicine and is chair of the American Board of Internal Medicine.  His posts appear semi-regularly on THCB and on his own blog, Wachter’s World.


  1. With 35+ years in healthcare, it is time for every program to incorporate metrics from the start. No longer can we expect that funding will always be there because a program “feels” like the right thing to do. We can’t say we have measured healthcare by counting the number of people served. A deeper level of sophistication is a taxpayer’s right, and we should not build a single program unless metrics are integrated throughout its lifespan. Doing so fosters course correction, modification, and much better outcomes. I like what I see with the MED-VAR, LLC program. It seems to blend science, research, metrics and validation on ROI. Are you using it?

  2. When a patient is diagnosed properly and gets the right treatment, that is the most positive outcome from his or her point of view. Patients’ interactions with hospitals and their services say a great deal about the experience they have.

  3. Great post Bob! A thoughtful and balanced analysis of the critical issue of knowing what we are getting for our healthcare dollar from the patient/consumer perspective; and, making sure it is fair and accurate from the physician/hospital side – with the bias emphasizing improvement. Better yet is the synthesis of the issues into two key and actionable questions at the end of the post. Thanks!

  4. Having previously worked with my former CMO colleagues at the 5 UC hospitals, all you have to do is look at the drop in central line sepsis and decubitus ulcers to recognize the importance of focusing on measurable outcomes. Perhaps one of the next big areas to focus on is the 10 to 30 percent of errors related to misdiagnosis, wrong diagnosis, or wrong therapy for patients who have the correct diagnosis. Bob, thanks for all your great work and your willingness to confront these important but difficult issues.

  5. Imprecision in the metrics of something newly subject to measurement and reporting makes me think of the move toward making food safer in the late 19th and early 20th centuries. Building effective measurement systems requires some level of hit-or-miss at the outset if what’s being measured is complex and has plenty of input/output points.

    On the good-news side, the measurement stats you mention – CMS and Leapfrog – aren’t widely used by patients … yet. However, I’m part of a growing army of people spreading the word on their use and utility. That army’s goal is to get to Nirvana, or at least its near-neighborhood. Savvy patients are some of the best educators out here, and we’re learning from *you* constantly.

  6. Appreciate your perspective on public ratings and the public domain. We have been statistically comparing public healthcare quality rankings for 5 years and find patterns relevant to ongoing improvements in methodology for US News, Leapfrog, Truven, etc. Interesting correlations, for instance between academic hospital type and US News ranking, remain pervasive.

  7. Great post; I agree with all your comments that, while the measurements should be improved, they are already serving an important function of wonderfully concentrating the minds of hospital leaders. There will be many stumbles and blind alleys along the way, but the important thing was to make a start – and that has been long, long overdue.

  8. These ratings systems will force HC systems to be more accountable – to do the common sense, patient-focused things they should have been doing all along. This is the real value of the ACA and why it will bring costs down and improve quality.