Last summer President Obama signed the American Recovery and Reinvestment Act into law. Tucked into the legislation was $1.1 billion to support comparative effectiveness research (CER). The legislation charged the Institute of Medicine with defining CER. Its Committee on Comparative Effectiveness Research Prioritization rapidly came up with,
- …the generation and synthesis of evidence that compares the benefits and harms of alternative methods to prevent, diagnose, treat and monitor a clinical condition, or to improve the delivery of care. The purpose of CER is to assist consumers, clinicians, purchasers, and policy makers to make informed decisions that will improve health care at both the individual and population levels.
The Committee then elicited over 2500 opinions from 1500 stakeholders and produced a list of the 100 highest-ranked topics for CER (www.iom.edu/cerpriorities). Proposals to undertake CER are pouring forth from investigators across the land. There is no doubt that an enormous amount of data will be generated by 2015. But there is every reason to doubt whether many inferences can be teased out of these data that will actually advantage patients, consumers, or the health of the nation.
I am no Luddite. For me “evidence based medicine” is not a shibboleth; it’s an axiom. Furthermore, having trained as a physical biochemist, I am comfortable with the most rigorous of the quantitative sciences let alone biostatistics. However, you can’t compare treatments for effectiveness unless you are quite certain that one of the comparators is truly efficacious. There must be a group of patients for whom one treatment has unequivocal and important efficacy. Otherwise, the comparison might discern differences in relative ineffectiveness.
The academic epidemiologists who spearheaded the CER agenda are aware of the analytic challenges but are convinced these can be overcome. I would argue that CER can never succeed as the primary mechanism to assure the provision of rational health care. It has a role as a secondary mechanism, a surveillance method to fine tune the provision of rational health care, once such is established.
The difference between efficacy and effectiveness
My assertion may seem counter-intuitive. After all, we hear every day about pharmaceuticals that are licensed by the FDA because of a science that supports the assertion of benefit. In epidemiology-speak, the science that the FDA reviews does not speak to the effectiveness of the drug, but to its efficacy. The science of efficacy tests the hypothesis that a particular drug or other intervention works in a particular group of similar patients. CER asks whether an intervention works better than other interventions in practice where the patients and the doctors are heterogeneous. The rational for the CER movement is the perceived limitations of efficacy research. I argue that the limitations of efficacy research are much more readily overcome than the limitations on CER.
The gold standard of efficacy research is the randomized controlled trial (RCT). In a RCT, patients with a particular disease are randomly assigned to receive either a study intervention or a comparator (often a placebo). After a pre-determined interval, the previously defined clinical outcome is compared in the active and control limbs of the trial. If there is no difference, one can argue that the intervention offers no demonstrable clinical benefit to patients such as those in the study. If there is a difference, the contrary argument is tenable.
This elegant approach to establishing clinical utility has its roots in antiquity, at least as far back as Avicenna. The modern era commences after World War II and escalates dramatically after 1962 when the Kefauver-Harris Amendment to the laws regulating the US Food and Drug Administration mandated demonstration of efficacy before pharmaceuticals could be licensed. Modern biostatistics has probed every nuance of the RCT paradigm. The result is a highly sophisticated understanding of the limitations of the RCT, an understanding that has fueled the call for CER:
- The more homogeneous the study population, the more likely any efficacy will be demonstrated and the more compelling any assertion as to its lacking. However, the homogeneity compromises the ability to assume the result generalizes to different kinds of patients.
- Many important clinical outcomes are either infrequent or occur late in the course of disease. It is difficult to maintain and fund RCTs that require years or decades before one can hope to see a difference between the active and control limbs. The compromise is to study “surrogate” outcomes, measures that in theory reflect the disease process, but are not themselves clinically important outcomes. Thus we have thousands of studies of blood pressure, cholesterol, blood sugar, PSA and the like but comparatively few studies that use heart attacks, death from prostate cancer, or other untoward clinical outcomes as the end-point.
- How big a difference between the active and control limbs is important? Biostatistics has dictated that we should pay attention to any difference that is unlikely to happen by chance too often. “Too often” traditionally is considered no more than 5% of the time, but that’s a matter risk-taking philosophy. What are we to make of a difference that is clinically very small, even if it is unlikely to happen by chance more than 5% of the time? Is it possible that the small effect will be important, perhaps less small, when the constraints of homogeneity are removed in practice? In practice, drugs licensed for one disease are even tried for other “off label” indications where effectiveness may emerge.
- The corollary limitation relates to the negative trial. If there is no demonstrable difference, does that mean that there is no effect? Or could the effect have been too small to detect because of the duration of the trial or the size or homogeneity of the population studied? Even a very small effect, advantaging only the occasional patient, can translate into many benefited people when tens of thousands are treated.
- Devices and surgical procedures are used practice; rigorous testing as to efficacy is not a statutory requirement. Maybe in the “real world” a treatment that was never studied or studied in a limited fashion turns out to really advantage patients in practice, or advantage some patients – or not.
CER to the rescue?
The methodology employed for CER is not the RCT. CER is an exercise in “observational research”. CER examines real world data sets to deduce benefit or lack thereof. This entails the development of large-scale, clinical and administrative networks to provide the observational data. Then biostatistics must come to grips with issues that make defining the heterogeneity of populations recruited into RCTs seem trivial. In the RCT, the volunteers can be examined and questioned individually and in detail and the criteria for admission into the trial defined a priori. Nothing about the validity of diagnosis, clinical course, interventions, coincident diseases, personal characteristics or outcomes can be assumed in observational data sets. There must be efforts at validating all such crucial variables. No matter how compulsively this is done, CER demands judgments about the importance of each of these variables. It is argued that some of these limitations are overcome because CER is not attempting to ask whether a particular intervention works in practice, but whether it works better than another option also in practice. It is even suggested that encouraging or introducing particular interventions or practice styles into some practice communities and not others would facilitate CER. Perhaps.
The object lesson of interventional cardiology
Interventional cardiology for coronary artery disease is the engine of the American “health care” enterprise. Angioplasties, stents of various kinds, and coronary artery bypass grafting (CABG) have attained “entitlement” status. There are thousands of RCTs comparing one with another, generally leading to much ado about very small differences, usually in surrogate measures such as costliness or patency of the stent. But there are very few RCTs comparing the invasive intervention with non-invasive best medical care of the day: 3 for CABG and 4 for angioplasty with or without stenting. In these large and largely elegant RCTs, the likelihood of death or a heart attack if treated invasively is no different from the likelihood if treated medically. Whether anyone might be spared some degree of chest pain by submitting to an invasive treatment is arguable since the results are neither compelling nor consistent. Yet, interventional cardiology remains the engine of the American “health care” enterprise. It carries on despite the RCTs because its advocates launch such arguments as “We do it differently” or “The RCTs were keenly focused on particular populations of patients and we reserve these interventions for others we deem appropriate.” These arguments walk a fine line between hubris and quackery.
So many invasive procedures are done to the coronary arteries of the young and the elderly that interventional cardiology has long lent itself to CER. We know from observational studies that that it does not seem to matter much if the heart attack patient has an invasive intervention quickly or it is delayed or not at all. We know from observational studies, and even trials rewarding some but not all hospitals for getting doctors to adhere to the “guidelines” for managing heart disease, that adherence does not make much of a difference. Do the results of this CER mean that we need to further improve the efficiency and quality of the performance of invasive treatments as many would argue? Or can we hope that more exacting CER can parse out some meaningful indication from large data sets, some compelling inference that only particular people with particular conditions are advantaged and therefore are the only candidates for interventional cardiology?
Or are we using the promise of CER to postpone calling a halt to the ineffective and inefficacious engine of American “health care”. The available science is consistent with the argument that interventional cardiology is not contributing to the health of the patient. I would argue that interventional cardiology should be halted until someone can demonstrate substantial efficacy and a meaningful benefit-to-risk ratio in some subset. Then CER can ask whether the benefit demonstrated in the efficacy trial translates to benefit in common practice.
Efficacy research is the horse; CER is the cart
Interventional cardiology for coronary artery disease is but one of many object lessons. There is much in common practice that has never been shown to be efficacious in any subset of patients. Some practices take up residence in the common sense despite having never been studied. Some practices, like interventional cardiology, persist because intellectual and fiscal interests are vested in the entrenchment despite the results of efficacy trials. CER can not inform efficacy, and CER can not inform effectiveness unless there is an example of efficacious therapy against which practices are compared. Otherwise, CER can be comparing degrees of ineffectiveness.
The way forward is to design efficacy trials that are more efficient in providing gold standards for comparison and as efficient in defining false starts that are not allowed into common practice until the approach is superseded by one of demonstrated efficacy. This is not all that difficult to do. Let’s return to the limitations of efficacy trials listed above:
- Homogeneity of study populations is not a limitation for the quest for a meaningful standard of efficacy. At least we will know the intervention is good for someone.
- Surrogate measures are useful to bolster the hypothesis that something might work. They have a dismal track record for testing the hypothesis that something does work. Clinically important outcomes must be invoked for such a test. If it is not feasible because the clinical outcome is too slow to develop or too infrequent, compromise is not an option. The intervention can not be studied at all, or it can not be studied until an appropriate subpopulation can be identified, or one must bite the bullet and undertake a lengthy RCT.
- Surrogate outcomes are not the only way that RCT results can lead to spurious clinical assumptions. “Composite outcomes” are even worse. RCTs in cardiology are notorious for an outcome such as “death from heart disease or heart attack or the need for another procedure.” When these studies are closely read, one learns that any difference detected is almost exclusively in “the need for another procedure” which is a highly subjective and interactive outcome that can speak to preconceptions on the part of the doctor or the patient rather than the efficacy of the intervention.
- Modern epidemiology is so wedded to the notion of statistical significance that concern about the statistical significance of “What?” is overwhelmed. “What?” is the clinical significance? Just because the difference observed between the active and control limbs of the RCT wouldn’t have happened by chance too often does not mean that the difference is clinically important even in the occasional patient. I’ll illustrate this by touching the Third Rail that the debate over the clinical utility of mammography has become. Malmö is a city in Sweden where women were invited to volunteer for a RCT; half would be offered routine screening mammography for a decade and the other half encouraged see their physicians whenever they had concern about the health of their breasts. That’s the difference between screening and diagnostic protocols; in screening one is agreeing to a test simply as a matter of course, in diagnostics one agrees to the testing in response to a clinical complaint. Back to the Malmö RCT. Over 40,000 women between age 40 and 60 volunteered for the RCT. Invasive cancer was detected in statistically significantly more women who were in the screened group than in the diagnostic group. Impressed? How about if I told you that 7 of 2000 women screened for a year were found to have invasive breast cancer and 5 of 2000 women in the diagnostic group for a year were found to have invasive breast cancer. Was all the screening worth this difference in absolute number of additional cancers detected? I could have told you that screening detected 40% more cancers but you won’t be swayed by the relative increase now that you know the absolute increase was 0.1%, will you? Would you consider the screening valuable if I told you that for every woman whose invasive breast cancer was treated so that they lived long enough to die from something else at a ripe old age, another two were treated unnecessarily since they died from something else before their breast cancer could be their reaper? How about all the false positive mammograms and false positive biopsies? There is a debate about mammography because it is a very marginal test that clearly is not doing as well as the common sense assumes.
- How small an effect can we detect in a RCT? Theoretically we can detect a very small effect. Theoretically we can detect an effect even smaller than the Malmö result. In order to do so, you need to randomize a large, homogeneous population whose size is determined by the level of statistical significance you choose and the nature of the health effect you seek. Death is the least equivocal outcome, for example. The quest for the small effect is the mantra of modern epidemiology. However, I consider such “small effectology” a sophism. No human population is homogeneous; we differ one from another in obvious, often measurable ways but also in less obvious, immeasurable ways. When we randomize individuals in any homogeneous population into a treatment group and a control group we assume that all the immeasurable differences randomize 50:50 or if not the randomization errors counterbalance. The smaller effect we are seeking, the more likely we are to be fooled by randomization errors that account for the difference rather than the treatment. That’s why so many small effects that emerge from RCTs do not reproduce.
Evidence Based Medicine can be more than a Shibboleth
The philosophical challenge in the design of efficacy trials relates to the notion of “clinically significant.” How high should we set the bar for the absolute difference in outcome between the treated and control groups in the RCT to be considered compelling? One way to get one’s mind around this question is to convert the absolute difference into a more intuitively appealing measure, the Number Needed to Treat (NNT). If the outcome is readily measured and unequivocal, such as death or stroke or heart attack, I would find the intervention valuable if I had to treat 20 patients to spare 1. Few students of efficacy would be persuaded if we had to treat more than 50 to spare 1. Between 20 and 50 delineates the communitarian ethic; smaller effects are ephemeral. For an outcome that is more difficult to measure than death or the like, an outcome that relates to symptoms or quality of life, I would argue for a more stringent bar.
If we applied this logic to RCTs, the trials would be far more efficient (in investigator/volunteer time, materiel, and cost) and the results far more reliable. If we applied this logic to RCTs, we would eliminate trials designed only to license agents no better than those already licensed (“me too” trials) and trials designed only for marketing purposes (“seed” trials). If we only licensed clinically efficacious interventions going forward, we could turn to CER to understand their effectiveness in practice. If we applied this logic retrospectively, to the trials that have already accumulated, we would soon realize how much of what is common practice is on the thinnest of evidentiary ice, how much has fallen through and how much supports an enterprise that is known to be inefficacious. It would take great transparency and political will to apply this razor retrospectively. We, the people, deserve no less.
Nortin M. Hadler, MD, MACP, FACR, FACOEM (AB Yale University, MD Harvard Medical School) trained at the Massachusetts General Hospital, the National Institutes of Health in Bethesda, and the Clinical Research Centre in London. He joined the faculty of the University of North Carolina in 1973 and was promoted to Professor of Medicine and Microbiology/Immunology in 1985. He serves as Attending Rheumatologist at the University of North Carolina Hospitals.
For 30 years he has been a student of “the illness of work incapacity”; over 200 papers and 12 books bear witness to this interest. He has lectured widely, garnered multiple awards, and served lengthy Visiting Professorships in England, France, Israel and Japan. He has been elected to membership in the American Society for Clinical Investigation and the National Academy of Social Insurance. He is a student of the approach taken by many nations to the challenges of applying disability and compensation insurance schemes to such predicaments as back pain and arm pain in the workplace. He has dissected the fashion in which medicine turns disputative and thereby iatrogenic in the process of disability determination, whether for back or arm pain or a more global illness narrative such as is labeled fibromyalgia. He is widely regarded for his critical assessment of the limitations of certainty regarding medical and surgical management of the regional musculoskeletal disorders. Furthermore, he has applied his critical razor to much that is considered contemporary medicine at its finest.