Credibility of Evidence: A Reconsideration of the Logic and Strength of Our Healthcare Decisions

A few days ago, we wrote an editorial for U.S. News & World Report on the scant or dubious evidence used to support some healthcare policies (the editorial is reproduced in full below). In that case, we focused on studies and CMS statements about a select group of Accountable Care Organizations and their cost savings. Our larger point, however, is about the need to reconsider the evidence we use for all healthcare-related decisions and policies. We argue that an understanding of research design and the realities of measurement in complex settings should make us both skeptical and humble. Let’s focus on two consistent distortions.

Evidence-based Medicine (EBM).  Few are opposed to evidence-based medicine. What’s the alternative? Ignorance-based medicine? Hunches? However, the real-world applicability of EBM is frequently overstated. Our ideal research model is the randomized controlled trial, in which studies are conducted with carefully selected samples of patients to observe the effects of a medicine or treatment without interference from other conditions. Unfortunately, this model differs from actual medical practice: hospitals and doctors’ waiting rooms are full of elderly patients suffering from several co-morbidities and taking 12 to 14 medications (some unknown to us). It is often a great leap to apply findings from a study conducted under “ideal conditions” to the fragile patient. So wise physicians balance the “scientific findings” against the several vulnerabilities and other factors of real patients. Clinicians must constantly deal with these messy tradeoffs, and the utility of evidence-based findings is limited by the complex challenges of sick patients, multiple medications, and massive unknowns. This mix of research with the messy reality of medical and hospital practice means that evidence, even if available, is often not fully applicable.

Relative vs. Absolute Drug Efficacy:

Let’s talk a tiny bit about arithmetic. Say we have a medication (call it X) that works satisfactorily for 16 out of a hundred cases, i.e., 16% of the time. Not great, but not atypical of many medications. Say then that another drug company has a medication (call it “Newbe”) that works satisfactorily 19% of the time. Not a dramatic improvement, but a tad more helpful (setting aside how well it works, how much it costs, and whether it has worse side effects). But what does the advertisement for drug “Newbe” say? That “Newbe” is almost 20% better than drug “X.” Honest. And it’s not a total lie. The three-percentage-point difference between 16% and 19%, expressed relative to the original 16%, works out to 18.75% (3 ÷ 16), close enough to 20% to make the claim legit. Now, if “Newbe” were advertised as 3 percentage points better (but a lot more expensive), sales would probably not skyrocket. But at close to 20% better, who could resist?
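The arithmetic above can be made explicit. Here is a minimal sketch; the drug names, rates, and function names are just the hypothetical ones from this example, not real products or real data:

```python
def absolute_improvement(old_rate: float, new_rate: float) -> float:
    """Difference in success rates, in raw percentage points."""
    return new_rate - old_rate

def relative_improvement(old_rate: float, new_rate: float) -> float:
    """The same difference, expressed as a fraction of the old rate --
    the flattering number the advertisement quotes."""
    return (new_rate - old_rate) / old_rate

x_rate = 0.16      # hypothetical drug "X": works in 16% of cases
newbe_rate = 0.19  # hypothetical drug "Newbe": works in 19% of cases

print(f"Absolute improvement: {absolute_improvement(x_rate, newbe_rate):.2%}")
# Absolute improvement: 3.00%
print(f"Relative improvement: {relative_improvement(x_rate, newbe_rate):.2%}")
# Relative improvement: 18.75%
```

Both numbers are “true,” but only the relative figure makes a marginal drug sound dramatic; a careful reader asks for both.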

Policy:  So what does this have to do with healthcare policy? We also want evidence of efficacy for healthcare policies, but it turns out that evaluating these interventions and policies is often harder to do well than studying drugs. Interventions and policies are introduced into messy, pluralistic systems, with imprecise measures of quality and costs, with sick and not-so-sick patients, with differing resources and populations, with a range of payment systems, and so on and so on. Sometimes, randomized controlled trials are impossible. But sometimes they are possible, just difficult to carry out. Nevertheless, we argue they are usually worth the effort. Considering the billions or trillions of dollars involved in some policies (e.g., Medicare changes, insurance rules), the cost is comparatively trivial.

But there’s another question: What if a decent research design is used to measure the effects of a large policy in a select population, but all you get is a tiny “effect”? What do we know? What should policymakers do? Here’s what we wrote in our recent editorial in U.S. News & World Report.

In an ideal world, health policies are based on solid evidence. But what if influential research is flawed? What if policymakers can’t distinguish between weak and trustworthy studies? And what if the resulting policies dramatically affect our health care system’s quality and costs?

Two recent studies in top medical journals, the Journal of the American Medical Association and the New England Journal of Medicine, concluded that “Pioneer Accountable Care Organizations” – which receive monetary incentives in an attempt to improve Medicare’s efficiency and quality – saved money or reduced spending increases. Last week, the federal Centers for Medicare and Medicaid Services cited these findings as a reason to expand the program.

While we also want Accountable Care Organizations to succeed, we question whether the scant evidence reported in the studies justifies such a large public investment of time and money. The studies may have unfairly compared pre-selected, high-performing Pioneer Accountable Care Organizations with less experienced and less selective controls. This is akin to comparing baseball’s all-star teams to a team made up of the players not even considered for that honor.

To make matters worse, the research found only tiny differences (for example, 1.2 percent medical cost savings in one study) that may be dwarfed by error, bias and unmeasured variables, such as implementation costs. Statistics can’t fix research designs with major bias. Moreover, the JAMA study that measured a second year of Accountable Care Organization experience found declining “savings.” In the words of the co-editor-in-chief of The Incidental Economist, “Saving a couple hundred million out of the Medicare program is a decimal point.”
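A back-of-the-envelope check shows why “a couple hundred million” reads as a decimal point. (An aside from us, not part of the quoted editorial; the $600 billion figure is our own round-number assumption for annual Medicare spending in that era, not a number from the studies.)

```python
# Back-of-the-envelope scale check (round-number assumptions, not study data).
medicare_annual_spending = 600e9  # assumed annual Medicare spending, ~$600B
claimed_savings = 200e6           # "a couple hundred million" dollars

share = claimed_savings / medicare_annual_spending
print(f"Claimed savings as a share of Medicare spending: {share:.3%}")
# Claimed savings as a share of Medicare spending: 0.033%
```

A few hundredths of one percent: well inside the range that coding changes, data omissions, or biased comparisons could produce on their own.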

Our larger concern is how to interpret research on our costly, impossibly complex health system for policymakers who lack the scientific skills to evaluate the conclusions. Apparent cost savings often reflect coding irregularities, unreliable data (for example, insurance claims that may reflect what the companies will pay for rather than the patient diagnosis), unacknowledged biased comparisons, data omissions, arbitrary end points to studies or unrealistic time constraints.

Our worry is that as a research community, we are losing our humility and our caution in the face of declining research funding, the need to publish and the need to show so-called useful findings. Perhaps it’s becoming harder to admit that our so-called big data findings are not as powerful as we wish or are, at best, uninterpretable.

Probably, the most common weakness in such research is selection bias – for example, comparing volunteers to non-volunteers; programs chosen by the funding agency to those that weren’t; and, of course, comparing those who are healthy and wealthy to those who are too sick or poor to participate. Dead patients are also well known for not responding to requests for call-backs, responses or data.
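Selection bias of this sort is easy to demonstrate with a toy simulation (our own aside, with invented numbers): the program below has zero true effect, but better-run organizations are more likely to volunteer for it, so a naive volunteer-vs-non-volunteer comparison still shows “savings.” The function name and every parameter here are illustrative assumptions.

```python
import math
import random

random.seed(42)

def simulate_costs(n: int = 20_000) -> tuple[float, float]:
    """Mean costs for volunteers vs. non-volunteers when the program
    has zero true effect and volunteering tracks unobserved quality."""
    volunteers, others = [], []
    for _ in range(n):
        efficiency = random.gauss(0.0, 1.0)  # unobserved baseline quality
        # cost depends on quality and noise -- NOT on program participation
        cost = 100.0 - 2.0 * efficiency + random.gauss(0.0, 5.0)
        # better-run organizations are more likely to volunteer
        p_volunteer = 1.0 / (1.0 + math.exp(-efficiency))
        (volunteers if random.random() < p_volunteer else others).append(cost)
    return (sum(volunteers) / len(volunteers), sum(others) / len(others))

v_mean, o_mean = simulate_costs()
print(f"Volunteers' mean cost:     {v_mean:.1f}")
print(f"Non-volunteers' mean cost: {o_mean:.1f}")
print(f"Apparent 'savings' despite zero true effect: {o_mean - v_mean:.1f}")
```

Randomization exists precisely to break this link between who participates and how they would have fared anyway.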

The sociologist W.I. Thomas offered what is now called the Thomas theorem: “If men define situations as real, they are real in their consequences.” In this example with pioneer accountable care organizations, a finding that may be mostly measurement error becomes a truth to support a questionable policy.

What can we learn from this experience? Comparing haves and have-nots often generates exaggerated successes. Ideally, we can hope for randomized trials – in this example, by randomly assigning eligible, advanced health care organizations to Accountable Care Organization participation or non-participation. (The non-participating comparable controls might receive the program a year or two after the results of the trial are in.) And when randomized trials are not possible, there are better designs than self-selection and administrative fiat.

We’ve had many randomized trials of government programs. When we are spending billions (or even trillions) of dollars, and the policies affect our social system, our economy and peoples’ well-being, we should require rigorous research methods and even more humility.

A portion of this essay appeared in the May 18th edition of U.S. News & World Report.

Ross Koppel teaches research methods in the sociology department at the University of Pennsylvania and is a senior fellow at the Leonard Davis Institute of Health Economics. Stephen Soumerai is professor of population medicine and research methods and director of the Drug Policy Research Group at Harvard Medical School and the Harvard Pilgrim Health Care Institute.




4 replies

  1. I think what you’re saying is that there is really no ‘evidence’ for evidence-based medicine, and that we really live in a world of experience-based medicine, which needs to combine itself with consensus-oriented evidence, AND we need to listen to patients and blend in the realities of preference-based medicine.


  2. It’s truly frightening how political US health ‘research’ has become. Industry and government tend to prefer cooking the books.

    Who’s watching out for patients’ health and safety, for sky-high costs of drugs and medical treatment in the US, and who is driving us toward meaningful, credible research? Ross Koppel and Steve Soumerai are heroes, pressing for the use of scientific methods in research on US healthcare systems.

  3. Interesting.

    Some thoughts of mine back in the mid 90’s while my daughter was dying of cancer:

    “Post-operative therapeutic literature on HCC is a dense, frustrating tangle of mostly contradiction and disappointment. Chemo protocols declared “significant” in one study are found ineffective in another. The patient cohort sample sizes are too small and/or too unrepresentative to generalize to my daughter’s circumstance. Worse, the “operational definition” of a “success” is usually expressed in terms of weeks’ or months’ life extension beyond that of a control group, with little or no discussion of the quality of life of the therapy recipient. Indeed, beware of the word “palliative,” a term normally connoting “relief of symptoms.” In chemo-speak, however, “palliative” often simply means staving off expected demise for a short time with precious little otherwise “relief” in the bargain.”

    “…the fact that we can only cure 10% of known diseases implies nothing regarding the quality of mainstream medical research and practice, unless the alternatives industry can provide hard, “case-mix adjusted,” scientifically valid data showing their methods to effect consistently and significantly better outcomes– which they cannot (a dearth of peer-reviewed studies being a central characteristic of “alternative” practice). Additionally, I asked, can anyone even cite historical curative percentages from 30, 50, or perhaps 100 years ago? Indeed, even such statistics would prove problematic– “shooting at a moving target,” as it were– in that more subtle and clinically unresponsive maladies continue to be discovered and classified while the easier to treat are dealt with more readily. And, classificatory observation is easy compared to the work and resources required to effect cures; we should expect that identification will outpace remedy. Finally, 50 years ago death certificates listing demise from “natural causes” would today likely have identifiable diseases recorded as the cause of death.”

    “…the body of peer-reviewed medical literature does not constitute a clinical cookbook; even “proven” therapies– particularly those employed against cancers– are generally incremental in effect and sometimes maddeningly transitory in nature. The sheer numbers of often fleeting causal variables to be accounted for in bioscience make the applied Newtonian physics that safely lifts and lands the 747 and the space shuttle seem child’s play by comparison. Astute clinical intuition is a necessary component of a medical art that must, after all, act and act quickly– so often in the face of indeterminate, inapplicable, or contradictory research findings…”


    BTW I have a new post up on my blog: “The Robot will see you now — assuming you can pay”

  4. Excellent and timely. Hooray! As you say, you can’t do randomized clinical trials on mixtures of diseases or mixtures of therapies. You have to try to study pure X and compare it to some null. Your experimental arm can’t be filled with folks who have coronary artery disease AND diabetes and a little COPD. Yet this chaotic mix is what the real world serves us. Also we can’t do RDBCTs when community standards tell us: “You must not deny this therapy to this person that you want to put in the experimental arm of your study as it has been standard practice for 50 years and you will get sued.” Thus, we cannot deny steroids to sarcoid patients in trials, although we have never been able to prove that they work.

    Trying to achieve an evidence base in policy matters seems a worthwhile goal too, but politicians are hurried and desultory. Nearly impossible, perhaps, but some way to hold them accountable would seem fitting.