Can I fool you with the picture above? Apparently, some people think so.
I’m a Twitter newbie, but I’ve already discovered that sometimes you can tweet what you think is a helpful piece of data, then find yourself suddenly caught up in an explosive controversy. When it’s the Brookings Institution and US News and World Report on one side and passionate e-patients on the other, a research tweep is liable to feel like a nerdy accountant who wandered into the OK Corral at high noon with neither Kevlar nor a gun.
This happened to me when Niam Yaraghi of Brookings posted on the US News blog and the Brookings blog that people shouldn’t trust Yelp reviews in health care—the URL for the post actually ends “online-doctor-ratings-are-garbage”—because patients hadn’t been to medical school.
His contention was that patient reviews would be based on factors such as “bedside manner, décor of their offices, and demeanor of staff”, which he perceived as trivial. Patients, Yaraghi asserted, “should instead rely on valid measures of quality and medical expertise which are also available online. Examples of such measures include effectiveness of care at hospitals, experience of physicians in performing a specific medical procedure and nurse-to-patient ratios at nursing homes.”
I follow Charlie Ornstein, past president of the Association of Health Care Journalists, and saw his tweet that Yaraghi’s post seemed “to dismiss that patients are experts in their own conditions.” I read the conversation and blog post. Thinking that data often helps when people disagree, I tweeted a graph from research my collaborator Naomi Bardach and I had done showing that Yelp ratings had a correlation of 0.49 with ratings on the Consumer Assessment of Healthcare Providers and Systems (CAHPS) hospital measure (the state-of-the-art method of measuring patient experience).
Yaraghi doubled down. He tweeted to Charlie and me, “Patients are experts in their own condition? Give me a break.” To the many patient advocates who follow Charlie, those are fighting words, and e-patients would soon be piling on Yaraghi, filling my Twitter notifications box. Corrie Painter, Associate Director of Operations at the Broad Institute of MIT and Harvard and an angiosarcoma survivor, for instance, asked Yaraghi, “Since I formally study my own cancer, do I qualify as an expert?”
The conversation also reached the Twitter account of Jeremy Stoppelman, CEO of Yelp. Stoppelman responded by retweeting our graph. Yaraghi replied that I shouldn’t try to “fool people with a graph”. (Disclosure: I’ve never met Jeremy Stoppelman and have no financial relationship with Yelp.) Yaraghi apparently then read our paper and tweeted asking why we didn’t also say that the correlation coefficient between “Yelp and outcome measures was at most -0.31?”
So who’s right? Do Yelp scores mean anything? Or are the correlations we described so weak as to be meaningless, suggesting that everyone should stick solely to the other measures Yaraghi mentions?
The good news here is you can judge for yourself. Here again is the graph I sent, which shows the relationship between the percent of people giving a hospital high ratings on Yelp (4 or 5 stars out of 5) and the percent of people giving the same hospitals high ratings on the hospital CAHPS measure (a rating of 9 or 10 out of 10). We used the top two categories in each system because the Medicare program, in its public reporting, does this (it’s our impression that there are some consumers who just won’t let themselves check the top box on any scale):
Our full paper is here. Note that the correlation coefficient between Yelp and CAHPS is 0.49. You might want to look at the picture to decide for yourself whether you think there’s a meaningful relationship. Alternatively, here’s what Richard Herrnstein and Charles Murray said in The Bell Curve:
“correlations in the social sciences are seldom higher than 0.50 and often much weaker—because social events are imprecisely measured and are usually affected by [many] variables”
So, the 0.49 we found is near the upper end of what we would expect to see for a strong relationship in the chaotic social world we live in. Furthermore, if you looked at the specific groups of questions in CAHPS, Yelp scores uniformly predicted ratings. As Yelp reviewers gave more stars, performance improved on every single aspect of care:
In addition, we found that the correlations between Yelp measures and outcomes like death and readmission for three different conditions were all statistically significant, but lower (-0.13 to -0.39, with the negative sign meaning that, as Yelp scores went up, bad outcomes happened less). Herrnstein and Murray note, “The correlations between IQ and various job-related measures are generally in the .2 to .6 range.” Thus, our findings that the correlations between Yelp scores and heart failure mortality and readmissions were, respectively, -0.31 and -0.39 are in line with the strength of correlation between intelligence and job performance.
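To make concrete what these coefficients describe, here is a minimal sketch in Python. The ratings are invented for illustration (they are not our dataset): for each hospital it computes the share of “top-box” ratings in each system and then the Pearson correlation between those shares.

```python
# Minimal sketch (illustrative numbers only, not the study data): compute each
# hospital's share of "top-box" ratings in both systems, then Pearson's r.
from scipy.stats import pearsonr

# Hypothetical per-hospital ratings: Yelp stars (1-5) and CAHPS overall ratings (0-10)
yelp = {
    "A": [5, 4, 5, 3, 5, 4],
    "B": [2, 3, 1, 4, 2, 3],
    "C": [4, 5, 5, 5, 3, 4],
    "D": [3, 3, 4, 2, 3, 5],
    "E": [5, 5, 4, 5, 4, 5],
}
cahps = {
    "A": [9, 10, 8, 9, 10, 7],
    "B": [6, 7, 9, 5, 6, 8],
    "C": [10, 9, 9, 8, 10, 9],
    "D": [7, 8, 9, 6, 7, 8],
    "E": [10, 9, 10, 9, 9, 10],
}

def top_box_share(ratings, threshold):
    """Fraction of ratings at or above the top-box threshold (4 on Yelp, 9 on CAHPS)."""
    return sum(r >= threshold for r in ratings) / len(ratings)

hospitals = sorted(yelp)
yelp_top = [top_box_share(yelp[h], 4) for h in hospitals]    # share giving 4-5 stars
cahps_top = [top_box_share(cahps[h], 9) for h in hospitals]  # share giving 9-10

r, p = pearsonr(yelp_top, cahps_top)
print(f"Pearson r = {r:.2f} (p = {p:.3f})")
```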
Other people have found that patients’ experience ratings also correlate with the more technical aspects of care. For example, among patients with depression or anxiety, experience ratings predict whether the patient also receives appropriate counseling or medication.
So I think most people would disagree with Yaraghi about Yelp (or other ways of capturing patients’ opinions) being meaningless, just based on the numbers. Furthermore, many would argue that “bedside manner” and “demeanor of the staff” are important in their own right.
But Yaraghi does also have two important points, even if he hasn’t been very gracious about them. These seem to have been thrown out with the discourteous bathwater by some of the patient advocates. First, not all Yelp reviews are likely to be helpful: sometimes reviewers are angry about things you don’t care about, or a single random event colors the review. So, if a doctor or hospital has only a few ratings, you should be less confident that you have a good sense of how they are doing.
Second, if there are other valid measures of quality available about them, such as the mortality rate for a cardiac surgeon, you probably want to take those into account when choosing a physician. I would rather have a surgeon who is a jerk but has very good outcomes than one who is warm but has bad outcomes (assuming that there really has been good adjustment for differences in the complexity or difficulty of the patients they see).
So why do I conclude that the errors in this conversation are mostly Yaraghi’s? One reason is just practical: you can’t get the valid measures he mentions for the vast majority of conditions that might make you ill. A woman with breast cancer cannot find the mortality rate for different cancer centers. A patient with multiple sclerosis cannot know which neurologist will prevent disability.
In fact, information on a specific doctor (other than her address, phone number, and where she went to school) is virtually impossible to find. The science of measuring outcomes simply isn’t yet to the point that we have many measures that are really accurate at the individual doctor level (because of low sample sizes). Then, the politics of measuring outcomes means that nothing gets published by government or medical organizations until it’s deemed nearly perfect.
(Go ahead, look me up or Google me or check with the California State Medical Board: see if you can find out if my asthma patients are short of breath, or how long my lung cancer patients live. If you find anything, tell me, because I’d love to know how I’m doing.)
When you combine the absence of the measures for which Yaraghi is advocating with the evidence that patients’ insights are associated with more sophisticated measures of quality and the ethical notion that their experience has validity in its own right, there’s only one conclusion. Niam Yaraghi and Brookings got this one mostly wrong.
Adams Dudley, MD is Professor of Medicine and Director of UCSF’s Center of Health Care Value.
Thanks for this interesting perspective and discussion. I’d suggest you re-analyze the data using a Bland-Altman plot to assess agreement. Here is a nice discussion of Bland-Altman (which I particularly love because it is written by an anesthesiologist): http://bja.oxfordjournals.org/content/99/3/309.full
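For readers unfamiliar with the method, here is a minimal Bland-Altman sketch, using invented per-hospital top-box percentages rather than the actual Yelp/CAHPS data: the difference between the two measures is plotted against their mean, along with the mean difference and ±1.96 SD limits of agreement.

```python
# Minimal Bland-Altman sketch with invented per-hospital top-box percentages
# (NOT the study data): plot difference vs. mean, with limits of agreement.
import numpy as np
import matplotlib.pyplot as plt

yelp_pct = np.array([55, 62, 70, 48, 80, 66, 73, 58])   # hypothetical % of 4-5 star Yelp reviews
cahps_pct = np.array([60, 58, 75, 52, 76, 70, 69, 63])  # hypothetical % of 9-10 CAHPS ratings

mean = (yelp_pct + cahps_pct) / 2
diff = yelp_pct - cahps_pct
bias = diff.mean()
loa = 1.96 * diff.std(ddof=1)  # half-width of the limits of agreement

plt.scatter(mean, diff)
plt.axhline(bias, color="gray", linestyle="--", label=f"mean difference = {bias:.1f}")
plt.axhline(bias + loa, color="red", linestyle=":", label="mean ± 1.96 SD")
plt.axhline(bias - loa, color="red", linestyle=":")
plt.xlabel("Mean of Yelp and CAHPS top-box %")
plt.ylabel("Yelp % minus CAHPS %")
plt.legend()
plt.show()
```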
Well said! Agree about the volume question when there are no outcomes measures, too. Furthermore, here we found that choosing a high-volume hospital didn’t have to mean much extra travel: http://jama.jamanetwork.com/article.aspx?articleid=192451
“the politics of measuring outcomes means that nothing gets published by government or medical organizations until it’s deemed nearly perfect” – that’s the wheel under which people are ground to powder even *if* they want to find out if their doctor is Osler reincarnated or Dr. Hodad. There’s no way for *any* of us – doctors, patients, policy wonks, partridges in pear trees – to know enough about the specifics of their outcome metrics to make a really informed choice.
As surgical volume metrics have risen in the insider-QI discussion, those of us who monitor signal traffic for interpretation and dissemination in the wider community (some of us journalists, some of us patient-side policy wonks, some of us both) have IDed that as a good starting point, and are advising people/patients to add “how many [procedure in question] have you personally done?” to their decision tree checklist. How many + how many have good quality of life post-surg would equal nirvana on the patient-side of metrics calculation … but that ain’t even close to a thing yet.
If you, or someone you love, has been a patient for [some big thing], you know how the conversation can quickly become something that happens *over* you, not *with* you. Given the dearth of in-the-clear metrics available to patients, are there any alternatives but Yelp? Or health-specific stuff like Vitals or Healthgrades? Both of which are as bushwa as Yelp, IMO.
Can somebody start taking the impenetrable data-crap pumped out of PQRS et al and start serving it up in ways that are easy for an average human to parse? While experts fiddle, we all burn.
Yes, you can’t make much of a doctor or hospital with just a few ratings. We excluded those with fewer than 5. But if one of those outliers appears in a slew of other ratings, it has less impact. Plus others use the accompanying narrative to decide how they feel (so if a 1 star rating says, “operated on the wrong knee” that will carry more weight than a 1 star rating for “office staff was slow”).
Great post! Surgical volumes have been an issue for years but just won’t go away (e.g., in Spain, population 46 million, there are no hospitals that do 400+ CABGs a year and most are really low). And the idea of videoing and scoring is very novel.
Here’s the problem: the outliers.
People who write a single negative review that is so glaringly unfair that it unduly influences the overall score. How do we statistically evaluate the impact of that?
Does this work the same way for all specialties?
Here’s the other problem: we’re using quantitative tools to evaluate qualitative data.
I’m not sure this works —
I think we’re framing the question the wrong way. That doesn’t mean that Yelp is a bad thing; Yelp is a good thing. Does that mean all reviews are good? No, it does not …
What we need to do is to teach people how to evaluate a written review, and to take what they read with a grain of salt.
/ j
This reminds me of an article on Amplio and surgical quality: https://medium.com/backchannel/should-surgeons-keep-score-8b3f890a7d4c. Sometimes quality is easier for lay people to recognize than we realize.
Using Pearson R’s on ORDINAL rank data? Why not carry them out to 4 decimal places, like some people now do with their GPAs?
Seriously?
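For what it’s worth, the rank-based alternative implied by that objection is a coefficient such as Spearman’s rho; a minimal sketch with illustrative numbers (not the study data) shows the two computed side by side:

```python
# Minimal sketch contrasting Pearson's r with the rank-based Spearman's rho,
# the usual choice for ordinal data. Numbers are invented for illustration.
from scipy.stats import pearsonr, spearmanr

yelp_pct = [55, 62, 70, 48, 80, 66, 73, 58]   # hypothetical top-box % on Yelp
cahps_pct = [60, 58, 75, 52, 76, 70, 69, 63]  # hypothetical top-box % on CAHPS

r_pearson, _ = pearsonr(yelp_pct, cahps_pct)
rho_spearman, _ = spearmanr(yelp_pct, cahps_pct)
print(f"Pearson r = {r_pearson:.2f}, Spearman rho = {rho_spearman:.2f}")
```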