At the Patient-Centered Outcomes Research Institute (PCORI), we believe comparative clinical effectiveness research (CER) is important and that we have a critical role to play in establishing the nation’s CER priorities. I’m pleased to say that many respondents to the latest National Pharmaceutical Council (NPC) survey think so as well.
While results of CER studies that we and others are funding have yet to be completed, and CER’s ultimate ability to transform our healthcare system is still years away, nearly all respondents in this fourth annual survey agree that CER is here to stay and that it will become increasingly important in aiding decision making. Respondents also indicated that CER has not yet assessed the broader array of outcomes that matter to patients.
These are important insights. The survey tracks the attitudes of researchers, policymakers, employers, business groups, insurers, and health plans. Engaging with these stakeholders – along with patients, caregivers, clinicians, and other providers – and ensuring that the work we fund provides evidence they can trust and use, are essential if CER is to realize its potential in guiding health care.
That’s why Congress authorized PCORI’s establishment as an independent, non-profit organization focused on ensuring that the broad healthcare community is meaningfully engaged in our work. We’re governed by a diverse board that represents all stakeholders. And through an open and collaborative approach to research, we’re identifying the questions patients and other clinical decision makers need answered, so they can make better-informed choices that will lead to better outcomes.
We’ve already awarded $464.4 million to support 279 studies that advance patient-centered CER and we expect to commit another $1 billion over the next two years.
The new Patient-Centered Outcomes Research Institute (PCORI) has been asking different stakeholders about the most important issues to address with the hundreds of millions of dollars the quasi-governmental group will shortly be doling out in grants. Not surprisingly, the stakeholders have been more than happy to respond.
PCORI’s most recent day of dialogue, which I attended as a representative of the Society for Participatory Medicine (SPM), was characterized by genteel civility and a big question mark: “Is PCORI serious about transforming health care?” When I asked directly, I didn’t get much of an answer. The reason, I suspect, goes to PCORI’s origins. It is the offspring of a shotgun marriage between goo-goos and pinky-ringers, and no one is quite sure yet what this child will be once it grows up.
Let me pause here a moment to parse the political shorthand. “Goo-goos” are “good government” types, the kind of folks who trumpet the need for transparency in government or better public transit. Goo-goos, seeing the half trillion dollars or so of waste in the U.S. health care system, called for a new national organization to carry out comparative effectiveness research in order to help Americans get the most value for our money.
The goo-goos pointed out that our current regulatory structure is designed to ensure that treatments are safe and effective, not compare them. Nor does the private sector have much incentive to pay for comparative studies that may undermine products currently selling quite nicely, thank you.
Cynics say Washington is the city where good ideas go to die. A promising strategy for holding down health care costs in the Obama administration’s reform bill – providing patients and doctors with authoritative information on what works best in health care – should provide a classic test of that proposition, assuming the law survives the next election.
Experts estimate anywhere from 10 to 30 percent of the health care that Americans receive is wasted. It is either ineffective or does more harm than good. To put that in perspective, waste costs anywhere from $250 billion to $750 billion a year, or as much as three-fourths of the annual federal deficit.
Yet every effort to curb wasteful spending (health care fraud, though pervasive, is estimated at less than a quarter of the total) has come up short. Neither Medicare and Medicaid’s efforts at government price controls nor the insurance industry’s efforts at managing care has succeeded in stopping health care spending from rising at twice the rate of the overall economy. Only the recent deep recession curbed costs, and that was because people lost their insurance when they lost their jobs and stopped going to the doctor. The bill for that postponed maintenance isn’t in yet.
For over a decade, the health policy world has held out comparative effectiveness research – comparing competing approaches to treating disease – as one possible solution to eliminating waste in the health care delivery system. If only doctors and patients knew what worked best, knew what worked less well than advertised, and knew what didn’t work at all, they would, through better-informed choices, gradually eliminate much of the waste in the system.
According to the Pacific Research Institute recently, because of “Comparative Effectiveness Research” (CER) “under conservative assumptions, R&D investment in new and improved pharmaceuticals and devices and equipment would be reduced by about $10 billion per year over the period 2014 through 2025, or about 10-12 percent. This reduction in the advance of medical technology would impose an expected loss of about 5 million life-years annually, with a conservative economic value of $500 billion, an amount substantially greater than the entire U.S. market for pharmaceuticals and devices and equipment.” [Study available here.]
I haven’t read the study. I don’t need to, since it is so obviously true, if we just make certain assumptions, such as:
Every dime spent on R&D for drugs and devices is wisely spent, on advances that will save and improve lives.
Every dime spent on finding out whether those drugs and devices actually work as advertised, and don’t actually kill people, and do it better or cheaper than other drugs and devices, is a dime wasted. CER just slows down legitimate, helpful research.
Experience does not show us any examples of wasteful or unnecessary drugs or devices. Those multiple peer-reviewed research papers showing that we waste hundreds of billions of dollars every year on useless complex back surgeries, the 22% of implanted defibrillators that are unnecessary, tens of millions of unnecessary scans, coronary stents put in people with stable heart disease and no heart pain, the heartburn surgeries that work no better than over-the-counter drugs—those studies are all false, wrong, some kind of mumbo-jumbo that we can safely ignore.
If we just make those few simple assumptions, the study has a valid point. If we don’t accept those assumptions, we have to wonder about the mental state, motivations, and personal finances of someone who would cook up such an obvious bit of flim-flam.
Joe is a healthcare speaker, writer, and consultant, working with clients ranging from the WHO, the Global Business Network, and the U.K. NHS, to the majority of state hospital associations. Joe writes at imaginewhatif.
Comparative Effectiveness Research (CER) is suddenly a hot topic at all the health care conferences. How come? Everybody agrees that we have to decrease per-capita cost and increase quality. Why? Government programs like Medicare and Medicaid foot more than 50% of our nation’s health bill, and if everything stays the same these programs will go belly up (bankrupt) in 8 years. Big problem.
Health and Human Services (HHS) has defined comparative effectiveness research as conducting and synthesizing research comparing the benefits and harms of different interventions and strategies to prevent, diagnose, treat, and monitor health conditions in “real world” settings. In other words, CER is figuring out what treatments, tests, and drugs work and which ones don’t work.
John E. Wennberg spent a whole career at Dartmouth studying American medicine, and he comes to the startling conclusion that 60% of Medicare is spent on supply sensitive care (physician visits, consultations, imaging exams, and hospital and ICU admissions) and 25% on preference sensitive care (PSA tests, mammography, and elective surgery). Although we assume that this care is based on solid scientific evidence, Wennberg states that “medical science is virtually silent on such matters” as how often to see a patient, what test to order, and whether to admit a patient to the hospital or ICU. Some evidence based medicine experts state that only about 20% of what physicians do is based on sound science.
The Government Accountability Office last week appointed two “faster cures” patient advocates and a former insurance company executive now on the AARP board to the three slots reserved for patient and consumer representatives on the Patient-Centered Outcomes Research Institute board, which will oversee comparative effectiveness research under health care reform.
The reform legislation passed last March gave GAO the job of appointing the 17 public members, who also include five representatives of private payers, five physicians, and three industry representatives (one each for drugs, devices and diagnostic manufacturers). A full list can be found here.
The three “patient and consumer” representatives are:
Ellen Sigal, chair, Friends of Cancer Research.
Sigal is an outspoken advocate for more money for cancer research. Her board is comprised largely of fellow executives in the research community, including staff from the American Cancer Society, Research America!, and the American Society of Clinical Oncology, which represents cancer docs. She serves on numerous non-profit boards, including the Reagan-Udall Foundation set up by the Food and Drug Administration to expedite drug development; and has served on numerous Institute of Medicine panels investigating new ways of conducting cancer research that can lead to faster access to new medicines.
Writing in the New England Journal of Medicine (Identifying and Eliminating the Roadblocks to Comparative-Effectiveness Research) three authors share their experience in running a head-to-head trial of Avastin (bevacizumab) versus Lucentis (ranibizumab) for wet age-related macular degeneration (AMD). They describe the barriers they faced and suggest that they will need to be removed for comparative effectiveness research –as envisioned under ARRA– to succeed. They make good points and may well be correct in their policy recommendations.
However, the case of Avastin and Lucentis is unusual. The products are made by the same manufacturer and are essentially identical. Avastin and Lucentis are marketed separately by Genentech mainly to allow the company to capture a return on investment from its R&D. The issue is that a regular dose of Avastin (e.g., for lung cancer) can be divided up into many doses for the eye. Since the products are sold by volume it turns out that Avastin is cheap when used for wet AMD, even though it’s pricey when used for cancer. As I’ve suggested previously, Genentech should be able to charge Lucentis prices for Avastin when it’s used in the eye. So there are quite a lot of people –starting with the manufacturer itself– who didn’t really want this study to go forward. That’s less likely to be the case with other studies.
Last summer President Obama signed the American Recovery and Reinvestment Act into law. Tucked into the legislation was $1.1 billion to support comparative effectiveness research (CER). The legislation charged the Institute of Medicine with defining CER. Its Committee on Comparative Effectiveness Research Prioritization rapidly came up with,
…the generation and synthesis of evidence that compares the benefits and harms of alternative methods to prevent, diagnose, treat and monitor a clinical condition, or to improve the delivery of care. The purpose of CER is to assist consumers, clinicians, purchasers, and policy makers to make informed decisions that will improve health care at both the individual and population levels.
The Committee then elicited over 2500 opinions from 1500 stakeholders and produced a list of the 100 highest-ranked topics for CER (www.iom.edu/cerpriorities). Proposals to undertake CER are pouring forth from investigators across the land. There is no doubt that an enormous amount of data will be generated by 2015. But there is every reason to doubt whether many inferences can be teased out of these data that will actually advantage patients, consumers, or the health of the nation.
I am no Luddite. For me “evidence based medicine” is not a shibboleth; it’s an axiom. Furthermore, having trained as a physical biochemist, I am comfortable with the most rigorous of the quantitative sciences let alone biostatistics. However, you can’t compare treatments for effectiveness unless you are quite certain that one of the comparators is truly efficacious. There must be a group of patients for whom one treatment has unequivocal and important efficacy. Otherwise, the comparison might discern differences in relative ineffectiveness.
The academic epidemiologists who spearheaded the CER agenda are aware of the analytic challenges but are convinced these can be overcome. I would argue that CER can never succeed as the primary mechanism to assure the provision of rational health care. It has a role as a secondary mechanism, a surveillance method to fine tune the provision of rational health care, once such is established.
The difference between efficacy and effectiveness
My assertion may seem counter-intuitive. After all, we hear every day about pharmaceuticals that are licensed by the FDA because of a science that supports the assertion of benefit. In epidemiology-speak, the science that the FDA reviews does not speak to the effectiveness of the drug, but to its efficacy. The science of efficacy tests the hypothesis that a particular drug or other intervention works in a particular group of similar patients. CER asks whether an intervention works better than other interventions in practice where the patients and the doctors are heterogeneous. The rationale for the CER movement is the perceived limitations of efficacy research. I argue that the limitations of efficacy research are much more readily overcome than the limitations on CER.
The gold standard of efficacy research is the randomized controlled trial (RCT). In a RCT, patients with a particular disease are randomly assigned to receive either a study intervention or a comparator (often a placebo). After a pre-determined interval, the previously defined clinical outcome is compared in the active and control limbs of the trial. If there is no difference, one can argue that the intervention offers no demonstrable clinical benefit to patients such as those in the study. If there is a difference, the contrary argument is tenable.
This elegant approach to establishing clinical utility has its roots in antiquity, at least as far back as Avicenna. The modern era commences after World War II and escalates dramatically after 1962 when the Kefauver-Harris Amendment to the laws regulating the US Food and Drug Administration mandated demonstration of efficacy before pharmaceuticals could be licensed. Modern biostatistics has probed every nuance of the RCT paradigm. The result is a highly sophisticated understanding of the limitations of the RCT, an understanding that has fueled the call for CER:
The more homogeneous the study population, the more likely any efficacy will be demonstrated and the more compelling any assertion as to its lacking. However, the homogeneity compromises the ability to assume the result generalizes to different kinds of patients.
Many important clinical outcomes are either infrequent or occur late in the course of disease. It is difficult to maintain and fund RCTs that require years or decades before one can hope to see a difference between the active and control limbs. The compromise is to study “surrogate” outcomes, measures that in theory reflect the disease process, but are not themselves clinically important outcomes. Thus we have thousands of studies of blood pressure, cholesterol, blood sugar, PSA and the like but comparatively few studies that use heart attacks, death from prostate cancer, or other untoward clinical outcomes as the end-point.
How big a difference between the active and control limbs is important? Biostatistics has dictated that we should pay attention to any difference that is unlikely to happen by chance too often. “Too often” traditionally is considered no more than 5% of the time, but that’s a matter of risk-taking philosophy. What are we to make of a difference that is clinically very small, even if it is unlikely to happen by chance more than 5% of the time? Is it possible that the small effect will be important, perhaps less small, when the constraints of homogeneity are removed in practice? In practice, drugs licensed for one disease are even tried for other “off label” indications where effectiveness may emerge.
The corollary limitation relates to the negative trial. If there is no demonstrable difference, does that mean that there is no effect? Or could the effect have been too small to detect because of the duration of the trial or the size or homogeneity of the population studied? Even a very small effect, advantaging only the occasional patient, can translate into many benefited people when tens of thousands are treated.
Devices and surgical procedures are used in practice; rigorous testing as to efficacy is not a statutory requirement. Maybe in the “real world” a treatment that was never studied or studied in a limited fashion turns out to really advantage patients in practice, or advantage some patients – or not.
CER to the rescue?
The methodology employed for CER is not the RCT. CER is an exercise in “observational research”. CER examines real world data sets to deduce benefit or lack thereof. This entails the development of large-scale, clinical and administrative networks to provide the observational data. Then biostatistics must come to grips with issues that make defining the heterogeneity of populations recruited into RCTs seem trivial. In the RCT, the volunteers can be examined and questioned individually and in detail and the criteria for admission into the trial defined a priori. Nothing about the validity of diagnosis, clinical course, interventions, coincident diseases, personal characteristics or outcomes can be assumed in observational data sets. There must be efforts at validating all such crucial variables. No matter how compulsively this is done, CER demands judgments about the importance of each of these variables. It is argued that some of these limitations are overcome because CER is not attempting to ask whether a particular intervention works in practice, but whether it works better than another option also in practice. It is even suggested that encouraging or introducing particular interventions or practice styles into some practice communities and not others would facilitate CER. Perhaps.
The object lesson of interventional cardiology
Interventional cardiology for coronary artery disease is the engine of the American “health care” enterprise. Angioplasties, stents of various kinds, and coronary artery bypass grafting (CABG) have attained “entitlement” status. There are thousands of RCTs comparing one with another, generally leading to much ado about very small differences, usually in surrogate measures such as costliness or patency of the stent. But there are very few RCTs comparing the invasive intervention with non-invasive best medical care of the day: 3 for CABG and 4 for angioplasty with or without stenting. In these large and largely elegant RCTs, the likelihood of death or a heart attack if treated invasively is no different from the likelihood if treated medically. Whether anyone might be spared some degree of chest pain by submitting to an invasive treatment is arguable since the results are neither compelling nor consistent. Yet, interventional cardiology remains the engine of the American “health care” enterprise. It carries on despite the RCTs because its advocates launch such arguments as “We do it differently” or “The RCTs were keenly focused on particular populations of patients and we reserve these interventions for others we deem appropriate.” These arguments walk a fine line between hubris and quackery.
So many invasive procedures are done to the coronary arteries of the young and the elderly that interventional cardiology has long lent itself to CER. We know from observational studies that it does not seem to matter much if the heart attack patient has an invasive intervention quickly or it is delayed or not at all. We know from observational studies, and even trials rewarding some but not all hospitals for getting doctors to adhere to the “guidelines” for managing heart disease, that adherence does not make much of a difference. Do the results of this CER mean that we need to further improve the efficiency and quality of the performance of invasive treatments as many would argue? Or can we hope that more exacting CER can parse out some meaningful indication from large data sets, some compelling inference that only particular people with particular conditions are advantaged and therefore are the only candidates for interventional cardiology?
Or are we using the promise of CER to postpone calling a halt to the ineffective and inefficacious engine of American “health care”? The available science is consistent with the argument that interventional cardiology is not contributing to the health of the patient. I would argue that interventional cardiology should be halted until someone can demonstrate substantial efficacy and a meaningful benefit-to-risk ratio in some subset. Then CER can ask whether the benefit demonstrated in the efficacy trial translates to benefit in common practice.
Efficacy research is the horse; CER is the cart
Interventional cardiology for coronary artery disease is but one of many object lessons. There is much in common practice that has never been shown to be efficacious in any subset of patients. Some practices take up residence in the common sense despite having never been studied. Some practices, like interventional cardiology, persist because intellectual and fiscal interests are vested in the entrenchment despite the results of efficacy trials. CER can not inform efficacy, and CER can not inform effectiveness unless there is an example of efficacious therapy against which practices are compared. Otherwise, CER can be comparing degrees of ineffectiveness.
The way forward is to design efficacy trials that are more efficient in providing gold standards for comparison and as efficient in defining false starts that are not allowed into common practice until the approach is superseded by one of demonstrated efficacy. This is not all that difficult to do. Let’s return to the limitations of efficacy trials listed above:
Homogeneity of study populations is not a limitation for the quest for a meaningful standard of efficacy. At least we will know the intervention is good for someone.
Surrogate measures are useful to bolster the hypothesis that something might work. They have a dismal track record for testing the hypothesis that something does work. Clinically important outcomes must be invoked for such a test. If it is not feasible because the clinical outcome is too slow to develop or too infrequent, compromise is not an option. The intervention can not be studied at all, or it can not be studied until an appropriate subpopulation can be identified, or one must bite the bullet and undertake a lengthy RCT.
Surrogate outcomes are not the only way that RCT results can lead to spurious clinical assumptions. “Composite outcomes” are even worse. RCTs in cardiology are notorious for an outcome such as “death from heart disease or heart attack or the need for another procedure.” When these studies are closely read, one learns that any difference detected is almost exclusively in “the need for another procedure” which is a highly subjective and interactive outcome that can speak to preconceptions on the part of the doctor or the patient rather than the efficacy of the intervention.
Modern epidemiology is so wedded to the notion of statistical significance that concern about the statistical significance of “What?” is overwhelmed. “What?” is the clinical significance? Just because the difference observed between the active and control limbs of the RCT wouldn’t have happened by chance too often does not mean that the difference is clinically important even in the occasional patient. I’ll illustrate this by touching the Third Rail that the debate over the clinical utility of mammography has become. Malmö is a city in Sweden where women were invited to volunteer for a RCT; half would be offered routine screening mammography for a decade and the other half encouraged to see their physicians whenever they had concern about the health of their breasts. That’s the difference between screening and diagnostic protocols; in screening one is agreeing to a test simply as a matter of course, in diagnostics one agrees to the testing in response to a clinical complaint. Back to the Malmö RCT. Over 40,000 women between age 40 and 60 volunteered for the RCT. Invasive cancer was detected in statistically significantly more women who were in the screened group than in the diagnostic group. Impressed? How about if I told you that 7 of 2000 women screened for a year were found to have invasive breast cancer and 5 of 2000 women in the diagnostic group for a year were found to have invasive breast cancer? Was all the screening worth this difference in absolute number of additional cancers detected? I could have told you that screening detected 40% more cancers but you won’t be swayed by the relative increase now that you know the absolute increase was 0.1%, will you?
Would you consider the screening valuable if I told you that for every woman whose invasive breast cancer was treated so that they lived long enough to die from something else at a ripe old age, another two were treated unnecessarily since they died from something else before their breast cancer could be their reaper? How about all the false positive mammograms and false positive biopsies? There is a debate about mammography because it is a very marginal test that clearly is not doing as well as the common sense assumes.
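The arithmetic behind the mammography figures quoted above is easy to check. What follows is only an illustrative sketch in Python: the helper name `compare_detection` is invented for this example, and the only inputs are the per-year counts given in the text (7 of 2000 screened, 5 of 2000 in the diagnostic group).

```python
from fractions import Fraction

def compare_detection(screened_cases, screened_n, control_cases, control_n):
    """Hypothetical helper: absolute and relative difference in detection rate."""
    screened_rate = Fraction(screened_cases, screened_n)
    control_rate = Fraction(control_cases, control_n)
    absolute = screened_rate - control_rate   # difference in proportions
    relative = absolute / control_rate        # difference as a share of the control rate
    return absolute, relative

# Per-year counts quoted in the text for the Malmö trial.
absolute, relative = compare_detection(7, 2000, 5, 2000)
print(f"absolute increase: {float(absolute):.1%}")  # prints "absolute increase: 0.1%"
print(f"relative increase: {float(relative):.0%}")  # prints "relative increase: 40%"
```

The same two extra cancers per 2000 women yield both numbers; which one is reported is a choice of framing, not of data.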
How small an effect can we detect in a RCT? Theoretically we can detect a very small effect. Theoretically we can detect an effect even smaller than the Malmö result. In order to do so, you need to randomize a large, homogeneous population whose size is determined by the level of statistical significance you choose and the nature of the health effect you seek. Death is the least equivocal outcome, for example. The quest for the small effect is the mantra of modern epidemiology. However, I consider such “small effectology” a sophism. No human population is homogeneous; we differ one from another in obvious, often measurable ways but also in less obvious, immeasurable ways. When we randomize individuals in any homogeneous population into a treatment group and a control group we assume that all the immeasurable differences randomize 50:50 or, if not, that the randomization errors counterbalance. The smaller the effect we are seeking, the more likely we are to be fooled by randomization errors that account for the difference rather than the treatment. That’s why so many small effects that emerge from RCTs do not reproduce.
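To give a feel for how trial size grows as the sought-after effect shrinks, here is a rough sketch using the textbook normal-approximation formula for comparing two proportions. None of this comes from the trials discussed here: the function name, the 80% power setting, and the example risks are all assumptions chosen for illustration, and the sketch speaks only to sample size, not to the randomization-error concern raised above.

```python
from math import ceil
from statistics import NormalDist

def patients_per_arm(control_risk, treated_risk, alpha=0.05, power=0.80):
    """Approximate patients per arm to detect a difference between two risks.

    Standard normal-approximation formula for two proportions; alpha is the
    two-sided significance level, power is an assumed 80% by default.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = control_risk * (1 - control_risk) + treated_risk * (1 - treated_risk)
    difference = control_risk - treated_risk
    return ceil((z_alpha + z_beta) ** 2 * variance / difference ** 2)

# Hypothetical risks: a large effect (10% -> 5%) and a small one (1.0% -> 0.9%).
large_effect_n = patients_per_arm(0.10, 0.05)
small_effect_n = patients_per_arm(0.010, 0.009)
print("large effect, patients per arm:", large_effect_n)
print("small effect, patients per arm:", small_effect_n)
```

Under these assumptions the small-effect trial needs a far larger population than the large-effect trial, which is the practical cost of chasing a 0.1% absolute difference.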
Evidence Based Medicine can be more than a Shibboleth
The philosophical challenge in the design of efficacy trials relates to the notion of “clinically significant.” How high should we set the bar for the absolute difference in outcome between the treated and control groups in the RCT to be considered compelling? One way to get one’s mind around this question is to convert the absolute difference into a more intuitively appealing measure, the Number Needed to Treat (NNT). If the outcome is readily measured and unequivocal, such as death or stroke or heart attack, I would find the intervention valuable if I had to treat 20 patients to spare 1. Few students of efficacy would be persuaded if we had to treat more than 50 to spare 1. Between 20 and 50 delineates the communitarian ethic; smaller effects are ephemeral. For an outcome that is more difficult to measure than death or the like, an outcome that relates to symptoms or quality of life, I would argue for a more stringent bar.
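The conversion from absolute difference to NNT is a single division, sketched below in Python. The function names, the example event counts, and the three labels are invented for illustration; the cut-offs of 20 and 50 are the ones proposed in the paragraph above for unequivocal outcomes such as death, stroke, or heart attack.

```python
from fractions import Fraction

def number_needed_to_treat(control_events, treated_events, patients_per_arm):
    """NNT = 1 / absolute risk reduction, for arms of equal size (hypothetical helper)."""
    spared = control_events - treated_events
    if spared <= 0:
        raise ValueError("no absolute benefit in the treated arm; NNT is undefined")
    return Fraction(patients_per_arm, spared)

def judge(nnt):
    """Apply the 20 and 50 thresholds from the text; the labels are this sketch's own."""
    if nnt <= 20:
        return "valuable"
    if nnt <= 50:
        return "debatable"
    return "unpersuasive"

# Hypothetical trials with 1000 patients per arm.
strong = number_needed_to_treat(100, 50, 1000)  # 10% vs 5% event rate
weak = number_needed_to_treat(100, 90, 1000)    # 10% vs 9% event rate
print(strong, judge(strong))  # prints "20 valuable"
print(weak, judge(weak))      # prints "100 unpersuasive"
```

For outcomes that are harder to measure, such as symptoms or quality of life, the text argues for a more stringent bar, which in this sketch would mean lowering both thresholds.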
If we applied this logic to RCTs, the trials would be far more efficient (in investigator/volunteer time, materiel, and cost) and the results far more reliable. If we applied this logic to RCTs, we would eliminate trials designed only to license agents no better than those already licensed (“me too” trials) and trials designed only for marketing purposes (“seed” trials). If we only licensed clinically efficacious interventions going forward, we could turn to CER to understand their effectiveness in practice. If we applied this logic retrospectively, to the trials that have already accumulated, we would soon realize how much of what is common practice is on the thinnest of evidentiary ice, how much has fallen through and how much supports an enterprise that is known to be inefficacious. It would take great transparency and political will to apply this razor retrospectively. We, the people, deserve no less.
Nortin M. Hadler, MD, MACP, FACR, FACOEM (AB Yale University, MD Harvard Medical School) trained at the Massachusetts General Hospital, the National Institutes of Health in Bethesda, and the Clinical Research Centre in London. He joined the faculty of the University of North Carolina in 1973 and was promoted to Professor of Medicine and Microbiology/Immunology in 1985. He serves as Attending Rheumatologist at the University of North Carolina Hospitals.
For 30 years he has been a student of “the illness of work incapacity”; over 200 papers and 12 books bear witness to this interest. He has lectured widely, garnered multiple awards, and served lengthy Visiting Professorships in England, France, Israel and Japan. He has been elected to membership in the American Society for Clinical Investigation and the National Academy of Social Insurance. He is a student of the approach taken by many nations to the challenges of applying disability and compensation insurance schemes to such predicaments as back pain and arm pain in the workplace. He has dissected the fashion in which medicine turns disputative and thereby iatrogenic in the process of disability determination, whether for back or arm pain or a more global illness narrative such as is labeled fibromyalgia. He is widely regarded for his critical assessment of the limitations of certainty regarding medical and surgical management of the regional musculoskeletal disorders. Furthermore, he has applied his critical razor to much that is considered contemporary medicine at its finest.