The Limitations of Healthcare Science

Sidney Le, UCSF

Every once in a while on the wards, one of the attending physicians will approach me and ask me to perform a literature review on a particular clinical question. It might be a question like “What does the evidence say about how long Bactrim should be given for a UTI?” or “Which is more effective in the management of atrial fibrillation, rate control or rhythm control?” A chill usually runs down my spine, like that feeling one gets when a cop siren wails from behind while one is driving. But thankfully, summarizing what we know about a subject is actually a pretty formulaic exercise, involving a PubMed search followed by an evaluation of the various studies with consideration for generalizability, bias, and confounding.

A more interesting question, in my opinion, is to ask why we do not know what we do not know. To delve into this question requires some understanding of how research is conducted, and it has implications for how clinicians make decisions with their patients. Below, I hope to provide some insights into the ways in which clinical research is limited. In doing so, I hope to illustrate why we know less about some topics, and why some questions are perhaps even unknowable.

Negative studies are difficult to publish

A positive study is one that demonstrates a statistically significant result. A negative study is one that shows no statistically significant difference. Any researcher would agree that it is easier to publish a positive study; after all, it is more exciting to read a study that suggests that some new kind of treatment works, as opposed to a study that shows that a treatment did not do anything. I would also add another point, which is that it is analytically easier to construct a compelling positive study (“even with limitations in our data, we were able to show a statistically significant improvement in mortality in the group that received this surgical technique”) vs. a compelling negative study (“there was no statistically significant difference between the two groups, and we are confident that we are interpreting our data well enough and have a large enough sample size to be able to detect a meaningful difference if there were one”).

So when one delves into a particular research question, one must interpret the literature in the context of possible negative studies that may have been performed and not published. Admittedly, this is a bit like asking this Californian to ponder earthquakes — the ground may be shifting beneath my feet, but it gives me less anxiety to ignore the possibility.
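To make the concern concrete, here is a rough sketch of what happens when only the positive studies of a modestly effective treatment get published. Everything in it is an assumption chosen for illustration: the true effect of 0.2 standard deviations, the 20 patients per arm, the 2,000 simulated studies, and the function name `simulate_published_effects`. It is not a model of any real literature.

```python
import random
import statistics


def simulate_published_effects(true_effect=0.2, n_per_arm=20, n_studies=2000, seed=0):
    """Simulate many small two-arm studies of the same treatment.

    Each study's estimated effect (in standard-deviation units) is drawn
    from a normal distribution centred on the true effect. A study counts
    as "positive" when its z-statistic exceeds 1.96, i.e. a significant
    result in the beneficial direction.
    """
    rng = random.Random(seed)
    # Standard error of a difference in means when each arm has SD = 1.
    standard_error = (2 / n_per_arm) ** 0.5
    all_estimates = [rng.gauss(true_effect, standard_error) for _ in range(n_studies)]
    published = [e for e in all_estimates if e / standard_error > 1.96]
    return all_estimates, published


all_estimates, published = simulate_published_effects()
print(f"studies run: {len(all_estimates)}, positive studies: {len(published)}")
print(f"mean effect, all studies:      {statistics.mean(all_estimates):.2f}")
print(f"mean effect, positive studies: {statistics.mean(published):.2f}")
```

Under these assumptions, every study that clears the significance bar must report an effect of at least 1.96 standard errors, so a reader who sees only the positive studies will overestimate how well the treatment works.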

Publication is a slow, deliberate process

Briefly, here are the steps:

  • Submit manuscript to journal
  • Hope it does not get rejected immediately. If it does, submit to another journal.
  • Wait for peer reviews, hope manuscript does not get rejected based on those peer reviews.
  • Make revisions based on concerns of reviewers
  • Journal officially accepts manuscript for publication and eventually publishes it

Each one of these steps can take months. Take, for example, a study that I worked on, ironically titled “Timeliness of Care in US Emergency Departments: An Analysis of Newly Released Metrics From the Centers for Medicare & Medicaid Services.” The data were analyzed and the first draft of the manuscript was written over the course of a week in August 2013. The final manuscript (which was fairly similar to the first draft, in my opinion) was published in November 2014.

There are, of course, many ways that researchers get their research out there, such as posters and presentations at conferences, research meetings, blogs, and Twitter. But the fact of the matter is that a lot of science is known by someone long before it gets into the public domain.

Certain populations are systematically underrepresented in medical research

Every study involves an investigation into a particular population sample, which researchers work very delicately to select. Given that researchers use samples, the interpretation of how a study informs the care of any individual patient must consider the generalizability of that study. But if we look across multiple studies, or even across entire fields of research, and examine which samples are being studied, it is apparent that there are large groups of people who are underrepresented in clinical research. For example, much has been written about how clinical research studies enroll disproportionately few minorities. Our science is largely based on a Caucasian population! Much research into hospital quality measures, as another example, is based on fee-for-service inpatient Medicare claims, which exclude all the outpatient services that hospitals provide, Medicare Advantage patients, and people younger than 65. The quality of care that 27-year-olds like me receive is relatively poorly studied.

Researchers generally choose the samples they study out of convenience, and the discussion section of a paper typically pays at least some lip service to the limitations on generalizing the results of that study to other populations. But because of the systematic underrepresentation of certain populations in research, clinicians are left to make assumptions, e.g. that a medication will be equally effective in poorly-studied population X as in well-studied population Y. These kinds of assumptions about generalizability are strong ones, ones that basic scientists and social scientists would be more hesitant about.

Clinical research favors a handful of simple methods

Student’s t-test, chi-square test for independence, ordinary least squares regression, logistic regression, and Cox proportional hazards regression account for the vast majority of analytic methods in clinical research. And indeed, those were pretty much all of the analytical methods that I was taught in my evidence-based medicine course in medical school. While these methods are probably sufficient for understanding and performing randomized control trials, there are so many other valuable methods in observational data research that one rarely sees. Without advocating for the adoption of the “mathiness” of economics, clinical research could stand to learn about methods seen in other fields. Instrumental variable methods, for example, are part of the fundamentals of econometrics and could deepen our understanding of observational data in medicine.

It is all about the average, when it comes to medical research

A distribution of data might look like this:



Medical research largely concerns itself with where the little triangle below points:


That is the average, a single number that fundamentally underlies the various statistical methods that are common in medical research, but by itself cannot truly describe an entire distribution. Papers will generally also present standard deviations, which are helpful, but only truly sufficient if one assumes a normal distribution. One rarely sees medians or percentiles in medical research, let alone more obscure concepts like skewness or kurtosis. In a sense, our science is based on how averages relate to averages, and ignores much of the complexity of the entire distributions of what we measure.

This has profound clinical implications. Countless times, my patients ask, “Will this treatment work?” And I might be left to say something like, “85% of people see some response” ← a statement about averages, “but everyone is different, some people respond better, some people respond worse, some people not at all” ← a hand-wavy statement about the rest of the distribution.
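A small sketch may help show how much an average can hide. The two lists below are invented symptom-score improvements for ten patients each, and the helper name `summarize` is my own. Both groups have the same mean improvement, yet they tell a patient very different things about what to expect.

```python
import statistics

# Hypothetical improvements in a symptom score (higher is better).
uniform_responders = [5, 5, 5, 5, 5, 5, 5, 5, 5, 5]
mixed_responders = [0, 0, 0, 0, 0, 0, 0, 10, 20, 20]


def summarize(scores):
    """Return the mean alongside a few numbers that describe the spread."""
    return {
        "mean": statistics.mean(scores),
        "median": statistics.median(scores),
        "stdev": statistics.stdev(scores),
        "share_with_no_improvement": sum(s == 0 for s in scores) / len(scores),
    }


for name, scores in [("uniform", uniform_responders), ("mixed", mixed_responders)]:
    print(name, summarize(scores))
```

In the first group everyone improves a little. In the second, most patients do not improve at all and a few improve a great deal. A paper that reports only the mean would describe the two groups identically.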

Clinical research lives in two dimensions

Treatment and outcome. Independent variable and dependent variable. X and Y. Left-sided and right-sided. Does this surgical technique lower recurrence? Does this drug decrease cardiovascular risk? The majority of clinical research is focused on linking one thing with another thing, in pursuit of establishing a causal relationship. Researchers spend less time thinking about how a third thing (or even a fourth thing) might modulate the relationship between the first two things. To what extent does age influence the effectiveness of this drug in lowering risk of cardiovascular events?

Researchers do investigate those “three-dimensional” questions by using methods like stratification or tests for effect modification, but overall these represent a minority of all research effort (perhaps tucked away in a Table 4 or 5 of a paper). Maybe the “big data” or “precision medicine” movements are the solution.
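As an illustration of what a third dimension can reveal, here is a minimal stratification sketch. The event counts are invented and chosen only to keep the arithmetic simple, and the names `Arm` and `risk_ratio` are mine rather than anything standard. The pooled comparison suggests the drug cuts cardiovascular events by about a fifth, but splitting by age shows that nearly all of the benefit sits in the younger group.

```python
from dataclasses import dataclass


@dataclass
class Arm:
    """Event count and size for one arm of a hypothetical trial."""
    events: int
    patients: int

    @property
    def risk(self) -> float:
        return self.events / self.patients


def risk_ratio(treated: Arm, control: Arm) -> float:
    """Risk in the treated arm divided by risk in the control arm."""
    return treated.risk / control.risk


# Invented counts: (treated arm, control arm) for each age stratum.
strata = {
    "under 65": (Arm(events=50, patients=1000), Arm(events=100, patients=1000)),
    "65 and over": (Arm(events=190, patients=1000), Arm(events=200, patients=1000)),
}

# Pool the strata by adding up events and patients in each arm.
pooled_treated = Arm(
    events=sum(t.events for t, _ in strata.values()),
    patients=sum(t.patients for t, _ in strata.values()),
)
pooled_control = Arm(
    events=sum(c.events for _, c in strata.values()),
    patients=sum(c.patients for _, c in strata.values()),
)

print(f"pooled risk ratio: {risk_ratio(pooled_treated, pooled_control):.2f}")
for label, (treated, control) in strata.items():
    print(f"{label} risk ratio: {risk_ratio(treated, control):.2f}")
```

With these made-up numbers the pooled risk ratio is 0.80, while the stratum-specific ratios are 0.50 for patients under 65 and 0.95 for those 65 and over.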

The easily measurable is favored over the hard-to-measure, let alone the immeasurable

If one is going to perform research, it is of course natural to prioritize the low-hanging fruit. This means investigating particular outcomes that are more easily measured than others. Death, for example, is perhaps the simplest outcome that there is to measure in healthcare; in fact, many countries have national registries of when and why every single one of their citizens dies. Probably the next easiest type of outcome to measure is the non-death discrete event, e.g. a hospitalization, an adverse drug event, a cancer recurrence. Measuring quality of life is more difficult: you have to go around asking people to self-report their quality of life. And if one believes, as integrative medicine pioneer Dr. Rachel Remen does, that to heal is to help people pursue what has meaning and value in life…good luck measuring that outcome!

The tyranny of multiple comparisons vs. the requirements for pre-specified analyses

Most research findings are presented alongside a p-value, which describes how likely it is that a result at least as extreme as the one observed would arise from randomness in the data alone, if there were no true effect. The lower the p-value, the stronger the evidence against chance as the explanation, and a p-value <0.05 is the standard, albeit arbitrary, cutoff for statistical significance in clinical research. However, when a researcher performs many different statistical comparisons, the probability that at least one of them will achieve statistical significance at the <0.05 level by chance alone increases, an issue known as the multiple comparisons problem. One solution is to adjust the cutoff for statistical significance: essentially, the more tests a researcher performs, the more stringent the cutoff for significance needs to be.
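The arithmetic behind this is short enough to sketch. The snippet below assumes the comparisons are independent and that none of them reflects a true effect, which is a simplification. The adjustment shown is the Bonferroni correction, one common way of tightening the cutoff, and the function names are my own.

```python
def family_wise_error_rate(n_tests, alpha=0.05):
    """Chance that at least one of n independent null tests has p < alpha."""
    return 1 - (1 - alpha) ** n_tests


def bonferroni_cutoff(n_tests, alpha=0.05):
    """Per-test cutoff that keeps the overall false-positive rate near alpha."""
    return alpha / n_tests


for n_tests in (1, 10, 20):
    print(
        f"{n_tests:>2} tests: "
        f"chance of a false positive = {family_wise_error_rate(n_tests):.2f}, "
        f"adjusted cutoff = {bonferroni_cutoff(n_tests):.4f}"
    )
```

With a single test the chance of a spurious significant result is 5%, as intended. With ten independent tests it is roughly 40%, which is why the per-test cutoff has to shrink as the number of tests grows.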

This is all well and good, but what if a researcher submits a manuscript that contains ten comparisons, but in reality performed one hundred throughout the course of the investigation? The significance cutoff really should be adjusted to account for one hundred comparisons, but was likely only adjusted for ten when it was submitted for publication. It is a problem called data mining (also known as data dredging or p-hacking). Researchers understand that it is poor form to do this, though “data mining” to one person might be “thoughtfully exploring the data” to someone else. Indeed, data mining typically occurs not because a researcher is actively snooping around the data for a significant result, but because a researcher has worked with the data for so long that it might have just happened by accident.
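To put a rough number on the ten-versus-one-hundred scenario, the sketch below again assumes independent comparisons with no true effects, and uses a Bonferroni-style cutoff of 0.05 divided by the number of comparisons as the adjustment. Both are my assumptions for illustration, as is the function name.

```python
def chance_of_spurious_finding(n_tests_run, cutoff):
    """Chance that at least one of n independent null tests falls below cutoff."""
    return 1 - (1 - cutoff) ** n_tests_run


tests_run = 100        # comparisons actually performed
tests_reported = 10    # comparisons that appear in the manuscript

cutoff_as_reported = 0.05 / tests_reported   # adjusted for ten comparisons
cutoff_as_needed = 0.05 / tests_run          # adjusted for all one hundred

print(f"cutoff adjusted for {tests_reported} tests: "
      f"{chance_of_spurious_finding(tests_run, cutoff_as_reported):.2f}")
print(f"cutoff adjusted for {tests_run} tests: "
      f"{chance_of_spurious_finding(tests_run, cutoff_as_needed):.2f}")
```

Under these assumptions, adjusting for only the ten reported comparisons leaves roughly a 39% chance that at least one of the hundred comparisons looks significant by luck, versus about 5% when the adjustment accounts for all one hundred.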

Besides self-policing, there are two mechanisms to protect against data mining. Reviewers may ask the authors to run other analyses to see if they support the results that were presented. There also may be a requirement that before any data are acquired, the authors have to specify exactly which analyses they plan to perform. It should be pointed out that such a requirement makes research less efficient. If pre-specified analyses are important, then every data set can only really be analyzed once, and one is restricted from exploring hypotheses that are generated by the results of the initial analyses.

Research is expensive!

The NIH devotes several billion dollars to clinical research on its own, and clinical research is also supported by various state organizations and philanthropy. While this may sound like a lot of money, it is not! Research is quite expensive, if you factor in the cost of salary, equipment/overhead, staff support, data collection, etc. There are unfortunately more interesting research questions than money to properly investigate all of those questions.

Another source of funding is industry…but accepting funding from industry has its issues. Say you have a pharmaceutical company that has developed a new drug, and they then pay a group of researchers to conduct a study that tests the efficacy of that drug. We can all see the problem in this scenario. It is hence critically important for researchers to disclose any conflicts of interest. For better or worse, the knee-jerk reaction of most academics is to discredit studies when there is a blatant conflict of interest.

Given the resource constraints, researchers try to be cost-effective, perhaps even taking shortcuts. It might mean interviewing subjects every other year instead of every year. Or following the subjects for 5 years, instead of 10. Of the common clinical research study designs, randomized control trials tend to be by far the most expensive type of study, followed by cohort studies, case-control studies, cross-sectional studies, and case reports.

Ethical considerations provide boundaries on what kinds of studies are permissible

From 1932 to 1972, an infamous clinical study was conducted by the U.S. Public Health Service, in which African-American men were left untreated for syphilis to observe the natural progression of the disease. None of the infected men were told they had the disease, and none were treated with penicillin after the antibiotic became a proven treatment. Public outrage and congressional investigation into the Tuskegee Syphilis Study eventually led to the establishment of the Office for Human Research Protections within the Department of Health and Human Services and a series of federal laws and regulations requiring the protection of human subjects.

Scientists are thankfully much better informed about, and more sensitive to, how to conduct research ethically. Ethical considerations rightfully place limitations on which kinds of research are permissible (particularly randomized control trials), but as a result, scientists have to accept that some knowledge is unattainable. You cannot design a study that randomly assigns people to cigarette smoking (a fact touted by the tobacco industry). You can design an ethical randomized control trial that investigates the use of cannabis to reduce nausea and vomiting during chemotherapy. You probably cannot design an ethical randomized control trial that investigates toxicity in recreational use of cannabis.

There is a lot of pressure and competition in academia…and a lot of scientific misconduct

It seems like every month, I read a story in the news about how a researcher was caught fabricating and falsifying data. This is reflected in the increasing number of studies that are retracted, and I cannot help but think that this is related to increasing pressure and competition in academia. The mechanisms that prevent scientific misconduct are feeble. One has to attest to the integrity of the study when it is accepted for publication. Researchers sometimes attempt to reproduce each other’s results, though researchers are generally much more interested in pursuing their own research. And certainly the consequences of being caught fudging are severe, often grounds for dismissal. But despite the consequences, the temptation to fabricate results is real.

Scientific misconduct erodes the public’s faith in the integrity of science. It is hard to digest research if one has to also entertain the possibility that someone made the stuff up! Furthermore, once a study is out there, it never truly disappears, even if it is retracted. Vaccines and autism, anyone?

Sidney Le is a UCSF medical student and health services researcher

2 replies »

  1. Dr. Palmer – Indeed, researchers spend their time considering various individual associations, yet there is undoubtedly a fundamental interconnectedness of the different systems and disease processes in the human body that is much harder to describe empirically.

  2. That was excellent, Sidney. It’s almost impossible to do research on groups that have only the one experimental variable. I have done thousands of autopsies. Everyone has co-morbidities. E.g., it is easy to find small carcinoid tumors in the bowel if you look hard enough. A large number of us have a few tics in the colon. If you carefully examine reduction mammoplasty tissue, you will find occasional tumors. There is a lot of very minimal patchy old myocarditis, pneumonitis and encephalitis. Almost everyone has some small focus of chronic inflammation: a little prostatitis or cervicitis or sinusitis…or gingival or apical inflammation in the teeth. And, of course, the skin. It is a zoo of mostly trivial pathology. You get my point.