“Comparative Effectiveness Research” and Kindred Delusions


Early last year President Obama signed the American Recovery and Reinvestment Act into law. Tucked into the legislation was $1.1 billion to support comparative effectiveness research (CER). The legislation charged the Institute of Medicine with defining CER. Its Committee on Comparative Effectiveness Research Prioritization rapidly came up with,

    …the generation and synthesis of evidence that compares the benefits and harms of alternative methods to prevent, diagnose, treat and monitor a clinical condition, or to improve the delivery of care. The purpose of CER is to assist consumers, clinicians, purchasers, and policy makers to make informed decisions that will improve health care at both the individual and population levels.

The Committee then elicited over 2500 opinions from 1500 stakeholders and produced a list of the 100 highest-ranked topics for CER (www.iom.edu/cerpriorities). Proposals to undertake CER are pouring forth from investigators across the land. There is no doubt that an enormous amount of data will be generated by 2015. But there is every reason to doubt whether many inferences can be teased out of these data that will actually advantage patients, consumers, or the health of the nation.

I am no Luddite. For me “evidence based medicine” is not a shibboleth; it’s an axiom. Furthermore, having trained as a physical biochemist, I am comfortable with the most rigorous of the quantitative sciences, let alone biostatistics. However, you can’t compare treatments for effectiveness unless you are quite certain that one of the comparators is truly efficacious. There must be a group of patients for whom one treatment has unequivocal and important efficacy. Otherwise, the comparison might discern only differences in relative ineffectiveness.

The academic epidemiologists who spearheaded the CER agenda are aware of the analytic challenges but are convinced these can be overcome. I would argue that CER can never succeed as the primary mechanism to assure the provision of rational health care. It has a role as a secondary mechanism, a surveillance method to fine tune the provision of rational health care, once such is established.

The difference between efficacy and effectiveness

My assertion may seem counter-intuitive. After all, we hear every day about pharmaceuticals that are licensed by the FDA because of a science that supports the assertion of benefit. In epidemiology-speak, the science that the FDA reviews does not speak to the effectiveness of the drug, but to its efficacy. The science of efficacy tests the hypothesis that a particular drug or other intervention works in a particular group of similar patients. CER asks whether an intervention works better than other interventions in practice where the patients and the doctors are heterogeneous. The rationale for the CER movement is the perceived limitations of efficacy research. I argue that the limitations of efficacy research are much more readily overcome than the limitations on CER.

Efficacy research

The gold standard of efficacy research is the randomized controlled trial (RCT). In a RCT, patients with a particular disease are randomly assigned to receive either a study intervention or a comparator (often a placebo). After a pre-determined interval, the previously defined clinical outcome is compared in the active and control limbs of the trial. If there is no difference, one can argue that the intervention offers no demonstrable clinical benefit to patients such as those in the study. If there is a difference, the contrary argument is tenable.

This elegant approach to establishing clinical utility has its roots in antiquity, at least as far back as Avicenna. The modern era commences after World War II and escalates dramatically after 1962 when the Kefauver-Harris Amendment to the laws regulating the US Food and Drug Administration mandated demonstration of efficacy before pharmaceuticals could be licensed. Modern biostatistics has probed every nuance of the RCT paradigm. The result is a highly sophisticated understanding of the limitations of the RCT, an understanding that has fueled the call for CER:

  1. The more homogeneous the study population, the more likely any efficacy will be demonstrated and the more compelling any assertion as to its lacking. However, the homogeneity compromises the ability to assume the result generalizes to different kinds of patients.
  2. Many important clinical outcomes are either infrequent or occur late in the course of disease. It is difficult to maintain and fund RCTs that require years or decades before one can hope to see a difference between the active and control limbs. The compromise is to study “surrogate” outcomes, measures that in theory reflect the disease process, but are not themselves clinically important outcomes. Thus we have thousands of studies of blood pressure, cholesterol, blood sugar, PSA and the like but comparatively few studies that use heart attacks, death from prostate cancer, or other untoward clinical outcomes as the end-point.
  3. How big a difference between the active and control limbs is important? Biostatistics has dictated that we should pay attention to any difference that is unlikely to happen by chance too often. “Too often” traditionally is considered no more than 5% of the time, but that’s a matter of risk-taking philosophy. What are we to make of a difference that is clinically very small, even if it is unlikely to happen by chance more than 5% of the time? Is it possible that the small effect will be important, perhaps less small, when the constraints of homogeneity are removed in practice? In practice, drugs licensed for one disease are even tried for other “off label” indications where effectiveness may emerge.
  4. The corollary limitation relates to the negative trial. If there is no demonstrable difference, does that mean that there is no effect? Or could the effect have been too small to detect because of the duration of the trial or the size or homogeneity of the population studied? Even a very small effect, advantaging only the occasional patient, can translate into many benefited people when tens of thousands are treated.
  5. Devices and surgical procedures are used in practice; rigorous testing as to efficacy is not a statutory requirement. Maybe in the “real world” a treatment that was never studied or studied in a limited fashion turns out to really advantage patients in practice, or advantage some patients – or not.
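The gap between statistical and clinical significance raised in items 3 and 4 can be made concrete with a little arithmetic. The sketch below is illustrative only: the event rates (10.0% versus 9.5%) and the limb sizes are hypothetical numbers chosen for the example, not data from any trial, and the pooled two-proportion z-test is just one common way of asking whether a difference would "happen by chance too often."

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_p_value(events_a, events_b, n_per_limb):
    """Two-sided p-value for a difference in event rates (pooled z-test)."""
    rate_a = events_a / n_per_limb
    rate_b = events_b / n_per_limb
    pooled = (events_a + events_b) / (2 * n_per_limb)
    std_err = sqrt(pooled * (1 - pooled) * 2 / n_per_limb)
    z = (rate_a - rate_b) / std_err
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical trial: 10.0% of controls and 9.5% of treated patients have the
# outcome. The absolute difference (0.5 percentage points) is identical in
# both scenarios; only the number of volunteers per limb changes.
p_small = two_proportion_p_value(100, 95, 1_000)
p_large = two_proportion_p_value(10_000, 9_500, 100_000)

print(f"1,000 per limb:   p = {p_small:.3f}")
print(f"100,000 per limb: p = {p_large:.5f}")
```

Under these assumed numbers the same clinically small difference falls above the 5% threshold in the smaller trial and well below it in the larger one. The p-value reflects how many people were studied as much as how large the benefit is.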

CER to the rescue?

The methodology employed for CER is not the RCT. CER is an exercise in “observational research”. CER examines real world data sets to deduce benefit or lack thereof. This entails the development of large-scale, clinical and administrative networks to provide the observational data. Then biostatistics must come to grips with issues that make defining the heterogeneity of populations recruited into RCTs seem trivial. In the RCT, the volunteers can be examined and questioned individually and in detail and the criteria for admission into the trial defined a priori. Nothing about the validity of diagnosis, clinical course, interventions, coincident diseases, personal characteristics or outcomes can be assumed in observational data sets. There must be efforts at validating all such crucial variables. No matter how compulsively this is done, CER demands judgments about the importance of each of these variables. It is argued that some of these limitations are overcome because CER is not attempting to ask whether a particular intervention works in practice, but whether it works better than another option also in practice. It is even suggested that encouraging or introducing particular interventions or practice styles into some practice communities and not others would facilitate CER. Perhaps.

The object lesson of interventional cardiology

Interventional cardiology for coronary artery disease is the engine of the American “health care” enterprise. Angioplasties, stents of various kinds, and coronary artery bypass grafting (CABG) have attained “entitlement” status. There are thousands of RCTs comparing one with another, generally leading to much ado about very small differences, usually in surrogate measures such as costliness or patency of the stent. But there are very few RCTs comparing the invasive intervention with non-invasive best medical care of the day: 3 for CABG and 4 for angioplasty with or without stenting. In these large and largely elegant RCTs, the likelihood of death or a heart attack if treated invasively is no different from the likelihood if treated medically. Whether anyone might be spared some degree of chest pain by submitting to an invasive treatment is arguable since the results are neither compelling nor consistent. Yet, interventional cardiology remains the engine of the American “health care” enterprise. It carries on despite the RCTs because its advocates launch such arguments as “We do it differently” or “The RCTs were keenly focused on particular populations of patients and we reserve these interventions for others we deem appropriate.” These arguments walk a fine line between hubris and quackery.

So many invasive procedures are done to the coronary arteries of the young and the elderly that interventional cardiology has long lent itself to CER. We know from observational studies that it does not seem to matter much if the heart attack patient has an invasive intervention quickly or it is delayed or not at all. We know from observational studies, and even trials rewarding some but not all hospitals for getting doctors to adhere to the “guidelines” for managing heart disease, that adherence does not make much of a difference. Do the results of this CER mean that we need to further improve the efficiency and quality of the performance of invasive treatments as many would argue? Or can we hope that more exacting CER can parse out some meaningful indication from large data sets, some compelling inference that only particular people with particular conditions are advantaged and therefore are the only candidates for interventional cardiology?

Or are we using the promise of CER to postpone calling a halt to the ineffective and inefficacious engine of American “health care”? The available science is consistent with the argument that interventional cardiology is not contributing to the health of the patient. I would argue that interventional cardiology should be halted until someone can demonstrate substantial efficacy and a meaningful benefit-to-risk ratio in some subset. Then CER can ask whether the benefit demonstrated in the efficacy trial translates to benefit in common practice.

Efficacy research is the horse; CER is the cart

Interventional cardiology for coronary artery disease is but one of many object lessons. There is much in common practice that has never been shown to be efficacious in any subset of patients. Some practices take up residence in the common sense despite having never been studied. Some practices, like interventional cardiology, persist because intellectual and fiscal interests are vested in the entrenchment despite the results of efficacy trials. CER can not inform efficacy, and CER can not inform effectiveness unless there is an example of efficacious therapy against which practices are compared. Otherwise, CER can be comparing degrees of ineffectiveness.

The way forward is to design efficacy trials that are more efficient in providing gold standards for comparison and as efficient in defining false starts that are not allowed into common practice until the approach is superseded by one of demonstrated efficacy. This is not all that difficult to do. Let’s return to the limitations of efficacy trials listed above:

  1. Homogeneity of study populations is not a limitation for the quest for a meaningful standard of efficacy. At least we will know the intervention is good for someone.
  2. Surrogate measures are useful to bolster the hypothesis that something might work. They have a dismal track record for testing the hypothesis that something does work. Clinically important outcomes must be invoked for such a test. If it is not feasible because the clinical outcome is too slow to develop or too infrequent, compromise is not an option. The intervention can not be studied at all, or it can not be studied until an appropriate subpopulation can be identified, or one must bite the bullet and undertake a lengthy RCT.
  3. Surrogate outcomes are not the only way that RCT results can lead to spurious clinical assumptions. “Composite outcomes” are even worse. RCTs in cardiology are notorious for an outcome such as “death from heart disease or heart attack or the need for another procedure.” When these studies are closely read, one learns that any difference detected is almost exclusively in “the need for another procedure” which is a highly subjective and interactive outcome that can speak to preconceptions on the part of the doctor or the patient rather than the efficacy of the intervention.
  4. Modern epidemiology is so wedded to the notion of statistical significance that concern about the statistical significance of “What?” is overwhelmed. “What?” is the clinical significance? Just because the difference observed between the active and control limbs of the RCT wouldn’t have happened by chance too often does not mean that the difference is clinically important even in the occasional patient. I’ll illustrate this by touching the Third Rail that the debate over the clinical utility of mammography has become. Malmö is a city in Sweden where women were invited to volunteer for a RCT; half would be offered routine screening mammography for a decade and the other half encouraged to see their physicians whenever they had concern about the health of their breasts. That’s the difference between screening and diagnostic protocols; in screening one is agreeing to a test simply as a matter of course, in diagnostics one agrees to the testing in response to a clinical complaint. Back to the Malmö RCT. Over 40,000 women between ages 40 and 60 volunteered for the RCT. Invasive cancer was detected in statistically significantly more women who were in the screened group than in the diagnostic group. Impressed? How about if I told you that 7 of 2000 women screened for a year were found to have invasive breast cancer and 5 of 2000 women in the diagnostic group for a year were found to have invasive breast cancer. Was all the screening worth this difference in absolute number of additional cancers detected? I could have told you that screening detected 40% more cancers but you won’t be swayed by the relative increase now that you know the absolute increase was 0.1%, will you? Would you consider the screening valuable if I told you that for every woman whose invasive breast cancer was treated so that she lived long enough to die from something else at a ripe old age, another two were treated unnecessarily since they died from something else before their breast cancer could be their reaper? How about all the false positive mammograms and false positive biopsies? There is a debate about mammography because it is a very marginal test that clearly is not doing as well as the common sense assumes.
  5. How small an effect can we detect in a RCT? Theoretically we can detect a very small effect. Theoretically we can detect an effect even smaller than the Malmö result. In order to do so, you need to randomize a large, homogeneous population whose size is determined by the level of statistical significance you choose and the nature of the health effect you seek. Death is the least equivocal outcome, for example. The quest for the small effect is the mantra of modern epidemiology. However, I consider such “small effectology” a sophism. No human population is homogeneous; we differ one from another in obvious, often measurable ways but also in less obvious, immeasurable ways. When we randomize individuals in any homogeneous population into a treatment group and a control group we assume that all the immeasurable differences randomize 50:50 or, if not, that the randomization errors counterbalance. The smaller the effect we are seeking, the more likely we are to be fooled by randomization errors that account for the difference rather than the treatment. That’s why so many small effects that emerge from RCTs do not reproduce.
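The arithmetic in items 4 and 5 can be checked in a few lines. The sketch below uses only the per-2000 figures quoted above (7 versus 5 cancers detected); the sample-size formula is the textbook normal approximation for comparing two proportions, and the 5% two-sided significance level and 80% power are assumptions made for illustration, not parameters of the Malmö trial.

```python
from math import ceil
from statistics import NormalDist

# Figures quoted in item 4: cancers detected per 2000 women per year.
screened_rate = 7 / 2000
diagnostic_rate = 5 / 2000

absolute_increase = screened_rate - diagnostic_rate
relative_increase = absolute_increase / diagnostic_rate

print(f"absolute increase: {absolute_increase:.1%}")
print(f"relative increase: {relative_increase:.0%}")


def sample_size_per_limb(p1, p2, alpha=0.05, power=0.80):
    """Approximate volunteers per limb needed to detect p1 versus p2.

    Normal-approximation formula for two independent proportions;
    alpha and power are illustrative assumptions.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_power) ** 2 * variance / (p1 - p2) ** 2)


n_needed = sample_size_per_limb(screened_rate, diagnostic_rate)
print(f"volunteers needed per limb: {n_needed:,}")
```

The same two counts yield both the 40% relative figure and the 0.1% absolute figure, and under the stated assumptions a trial would need tens of thousands of volunteers in each limb to detect a difference that small.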

Evidence Based Medicine can be more than a Shibboleth

The philosophical challenge in the design of efficacy trials relates to the notion of “clinically significant.” How high should we set the bar for the absolute difference in outcome between the treated and control groups in the RCT to be considered compelling? One way to get one’s mind around this question is to convert the absolute difference into a more intuitively appealing measure, the Number Needed to Treat (NNT). If the outcome is readily measured and unequivocal, such as death or stroke or heart attack, I would find the intervention valuable if I had to treat 20 patients to spare 1. Few students of efficacy would be persuaded if we had to treat more than 50 to spare 1. Between 20 and 50 delineates the communitarian ethic; smaller effects are ephemeral. For an outcome that is more difficult to measure than death or the like, an outcome that relates to symptoms or quality of life, I would argue for a more stringent bar.
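The Number Needed to Treat is simply the reciprocal of the absolute difference in outcome between the control and treated groups. A minimal sketch follows; the event counts are hypothetical, the function names are invented for this illustration, and the "debatable" label is shorthand introduced here for the band between 20 and 50. The cut-offs themselves are the ones proposed in the paragraph above for unequivocal outcomes such as death, stroke, or heart attack.

```python
from fractions import Fraction
from math import ceil


def number_needed_to_treat(events_control, events_treated, n_per_limb):
    """NNT = 1 / absolute risk reduction, rounded up to a whole patient."""
    risk_reduction = Fraction(events_control - events_treated, n_per_limb)
    if risk_reduction <= 0:
        raise ValueError("no absolute benefit, so the NNT is undefined")
    return ceil(1 / risk_reduction)


def verdict(nnt):
    """Bands from the paragraph above; they apply to hard outcomes only."""
    if nnt <= 20:
        return "valuable"
    if nnt <= 50:
        return "debatable"
    return "unpersuasive"


# Hypothetical trials with 100 patients per limb and 10 events among controls.
for events_treated in (5, 8, 9):
    nnt = number_needed_to_treat(10, events_treated, 100)
    print(f"{events_treated} events if treated: NNT = {nnt} ({verdict(nnt)})")
```

Exact fractions are used rather than floating point so that a risk reduction of 5 in 100 gives an NNT of exactly 20 and is not pushed up to 21 by rounding error.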

If we applied this logic to RCTs, the trials would be far more efficient (in investigator/volunteer time, materiel, and cost) and the results far more reliable. If we applied this logic to RCTs, we would eliminate trials designed only to license agents no better than those already licensed (“me too” trials) and trials designed only for marketing purposes (“seed” trials). If we only licensed clinically efficacious interventions going forward, we could turn to CER to understand their effectiveness in practice. If we applied this logic retrospectively, to the trials that have already accumulated, we would soon realize how much of what is common practice is on the thinnest of evidentiary ice, how much has fallen through and how much supports an enterprise that is known to be inefficacious. It would take great transparency and political will to apply this razor retrospectively. We, the people, deserve no less.

Nortin M. Hadler, MD, MACP, FACR, FACOEM (AB Yale University, MD Harvard Medical School) trained at the Massachusetts General Hospital, the National Institutes of Health in Bethesda, and the Clinical Research Centre in London. He joined the faculty of the University of North Carolina in 1973 and was promoted to Professor of Medicine and Microbiology/Immunology in 1985. He serves as Attending Rheumatologist at the University of North Carolina Hospitals.

For 30 years he has been a student of “the illness of work incapacity”; over 200 papers and 12 books bear witness to this interest. He has lectured widely, garnered multiple awards, and served lengthy Visiting Professorships in England, France, Israel and Japan. He has been elected to membership in the American Society for Clinical Investigation and the National Academy of Social Insurance.  He is a student of the approach taken by many nations to the challenges of applying disability and compensation insurance schemes to such predicaments as back pain and arm pain in the workplace. He has dissected the fashion in which medicine turns disputative and thereby iatrogenic in the process of disability determination, whether for back or arm pain or a more global illness narrative such as is labeled fibromyalgia. He is widely regarded for his critical assessment of the limitations of certainty regarding medical and surgical management of the regional musculoskeletal disorders. Furthermore, he has applied his critical razor to much that is considered contemporary medicine at its finest.

30 Comments on "“Comparative Effectiveness Research” and Kindred Delusions"

Jan 10, 2010

Well done piece, and I hope the non-clinicians with interest in this area can take some meaning from it.
To humbly add to above, the concept of explanatory vs pragmatic RCTs fits into the described paradigm well:

Greg Pawelski
Jan 10, 2010

I was asked by Robert E. Ratner, MD, FACP, Robert Wood Johnson Health Policy Fellow, Study Officer, CER Priorities, Institute of Medicine, to submit a specific priority.
I submitted a proposal to compare the effectiveness of various genetic and cell culture assay technologies for targeted as well as conventional cancer treatments, to show what technologies work for drug selection.
Only 100 were picked out of 2,500 opinions submitted? Guess, better luck next time!

Jan 10, 2010

Dr. Hadler,
Very well said.
“The modern era commences after World War II and escalates dramatically after 1962 when the Kefauver-Harris Amendment to the laws regulating the US Food and Drug Administration mandated demonstration of efficacy before pharmaceuticals could be licensed.”
There is not any proven efficacy of the CPOE and other HIT medical care devices yet they are being promoted and sold to enable CER. Also, these devices have neither been approved by the FDA as being safe nor approved as being efficacious.
Perhaps the first project should be safety, efficacy, and CER study of the CPOE devices themselves?
Thus, it is folly, if not illegal, to deploy these devices at the present time.
For those readers and users who have experienced the adverse events associated with CPOE, word is circulating that such incidents and device defects are to be reported to the FDA at MedWatch.

Jan 10, 2010

Great piece. Really brings home the old saying “There are lies, there are damn lies, and there are statistics.”
It has long been practice to do research that will support reasoning that supports an industry and then much effort is taken to continue that support and purposefully very little to risk bringing an industry down. Who are the researchers afraid of? Or is it the grants are not given out to reevaluate something?

Jan 10, 2010

Nortin Hadler is a major league stud who calls it brutally as he sees it. In fact if Obama had read “The Last Well Person” instead of the Gawande article, he might have decided to close every cardiac cath lab in America.
However, I’m not sure that there’s any disagreement between those promoting CER and those who favor RCT. The difference between relative and absolute effects on very small numbers of people is relatively well known among wonks, but despite the best efforts of Gary Schwitzer, no one in mainstream America understands.
So let’s sing kumbaya and have Nortin Hadler be very involved in setting policy.
And let’s change policy so that the results of these studies actually matter. Right now they’re going to be ignored.

Jan 11, 2010

Choice of the right comparator is vital; otherwise we risk merely supporting drug company competition. We wrote about this in relation to RA drugs a while back.


This is an extremely valuable review but, as much as we all love the Randomized Controlled Trial, they are too expensive and take too long to be relied upon any more than they are in our efforts to expand the medical knowledge base. And due to their exquisitely crafted selection criteria, they end up answering a relatively narrow set of questions that everyday-doctors face.
Pragmatically speaking, we have no choice but to rely on less rigorous trial design methodologies, including even retrospective trials that, as the author points out, are subject to hidden biases etc. This is OK in my mind so long as the results of these trials appear along with some rating system for the quality of the medical evidence that each trial has provided.
ONC has addressed this matter in its guidance for clinical decision support tools as it relates to certifying EHRs. Great job by ONC!
Thank you,
Glenn Laffel, MD, PhD
Sr. Vice President Clinical Affairs
Practice Fusion
Free, Web-based EHR

Greg Pawelski
Jan 11, 2010

The science that the FDA reviews does not speak to the effectiveness of a drug but to its efficacy. I can see where CER asks whether a drug works better than another drug in practice.
In cancer medicine, Donald Berry, Ph.D., professor and chair of the Department of Biostatistics and Applied Mathematics at M.D. Anderson Cancer Center, stated in a January 2006 issue of Nature Reviews Drug Discovery that the statistical method used nearly exclusively to design and monitor clinical trials today (the frequentist method) is so narrowly focused and rigorous in its requirements that it limits innovation and learning.
He advocates adopting the Bayesian methodology, a statistical approach that is more in line with how science works. It is used routinely in physics, geology and other sciences. And he has put the approach to the test at M.D. Anderson, where more than 100 cancer-related phase I and II clinical trials were being planned or carried out using the Bayesian approach.
The main difference between the Bayesian approach and the frequentist approach to clinical trials has to do with how each method deals with uncertainty, an inescapable component of any clinical trial. Unlike frequentist methods, Bayesian methods assign anything unknown a probability using information from previous experiments. The Bayesian methods make use of the results of previous experiments, to do continuous updating as information accrues, whereas the frequentist approaches assume we have no prior results.
Doctors want to be able to use biomarkers to determine who is responding to what medication and look at multiple potential treatment combinations. They want to be able to treat a patient optimally depending on the patient’s disease characteristics. Cancer is a diverse disease and what works to treat one person’s disease may not work for another.
The Bayesian methodology is no stranger to cell function analysis. Cell culture assay testing is a “functional” biomarker. The absolute predictive accuracy of cell culture assay tests varies according to the overall response rate in the patient population, in accordance with Bayesian principles.
The actual performance of the assay in each type of tumor precisely matches predictions made from Bayes’ Theorem. The rationale for cell function analysis is to ask whether a drug works better than another drug in a disease that is heterogeneous.
Real-world studies are not being performed under real-world conditions. No one is publishing real-world studies, except private laboratories performing cell function analysis, which can only do real-world studies, because their studies require fresh, viable specimens, which must be accessioned and tested in real-time under real-world conditions.
Patient outcomes need to be reported in real-time, so patients and physicians can learn immediately if and how patients are benefiting from new diagnostics and therapies.

Vikram C
Jan 11, 2010

Dr. Nortin’s article is very illuminating. Although I note with facetious glee, Dr. Nortin noted the futility of medical interventions in many cases on aggregated levels as stated in his book Last Well Person.
Recently I perused a debate on Maggie Mahar’s blog between Greg P and certain doctors about utility of CT scans. As a neutral both parties appeared to have forceful arguments but the disconnect was about individually useful procedure results that somehow didn’t show up in aggregated patient statistics quoted by Greg.
So will CER be used to improve individual results, or can we expect it to show results at the aggregate level?
After all, the driving factor for CER is still the hope to contain medical costs and not better patient outcomes. If results don’t show up at the aggregate level, it’s not going to do anything about costs.
Just curious and rumbling…


Comparative Effectiveness Research (CER) is supported by everyone, until someone has to give up something. Like the author of this fine post, I support the concept enthusiastically. When incomes of physicians, device companies, hospitals, pharmaceuticals, etc., are threatened by the research, there will be a volcanic reaction that will dwarf the USPSTF Mammogate debacle. Every item of unnecessary medical care is someone else’s income. If the research attacks a stakeholder’s income, then the research will be attacked as flawed or biased. I fear that the process will be stayed by politics and personal agendas. http://bit.ly/1NJqKS


Excellent article. The fallacy of coronary intervention, and the ability of a single specialty to convince itself of efficacy despite logic to the contrary, while not unique, is staggering with regard to its impact on health care spending.
The issue is what to do with the data and who will do it? Good luck standing up to the specialty and huge industry of this “health care engine”. Someone will need to tell the patient that it’s “ok” to take medicine for their ‘blocked’ arteries when their specialist says otherwise.


Good post, but don’t you believe that while the concept rings true, the implementation would be far too expensive?

Jan 11, 2010

Agree with Dr. Kirsch; and to the groups that do not want “to give up something”, one must not forget to add patients/”consumers”.
The attitude/culture of large parts of the population and among journalists is: more is better, and any not obviously harmful intervention is better than no intervention if it has the slightest notion of a possible benefit; and not to offer/cover a medical service is “rationalization”/deprivation. That should be clear to anyone following the public reaction to the breast cancer screening recommendations.
Somehow, we manage to talk about applied medicine in the US as if we lived in a rational world. Actually, we live in a healthcare world of huge economic interests, ignorance, and artificial needs. I am deeply skeptical that this ignorance can be overcome, since applied stats are not easy and often counterintuitive, and because there are economic interests maintaining the prevailing culture.

Margalit Gur-Arie
Jan 11, 2010

If I understand correctly the R in CER stands for Research and as Dr. Kirsch says nobody is opposing the research per se.
The question is what to do with the results of such research. Do we present the results to physicians in the form of guidelines for treatment, or do we present the results to insurers as guidelines for payment?
I think it’s the latter that presents problems…

Vikram C
Jan 11, 2010

How about presenting results to consumers?
Some just might read it, especially those managing a chronic disease.