OP-ED

Why Bad Research Makes It into Good Medical Journals

Last week, a study in the New England Journal of Medicine called into question the effectiveness of surgical checklists for preventing harm.

Atul Gawande, one of the original researchers who demonstrated the effectiveness of such checklists and the author of a book on the subject, quickly wrote a rebuttal at The Incidental Economist.

He writes, “I wish the Ontario study were better,” and I join him in that assessment, but want to take it a step further.

Gawande first criticizes the study for being underpowered. I had a hard time swallowing this argument given that they looked at over 200,000 cases from 100 hospitals, so I had to do the math. A quick calculation shows that, given the rates of death in their sample, they had only about 40% power [1].
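
For the curious, here is roughly the kind of back-of-the-envelope calculation I mean, sketched in Python. The before-and-after death rates (about 0.71% and 0.65%) and the even split of roughly 215,000 cases are my own illustrative assumptions, not the Ontario authors’ actual figures or their power analysis.

```python
# Back-of-the-envelope power for comparing two proportions (death rates).
# The specific rates and the even 50/50 split below are my illustrative
# assumptions, not the Ontario study's actual design or power analysis.
from scipy.stats import norm

def two_prop_power(p1, p2, n_per_group, alpha=0.05):
    """Approximate power of a two-sided z-test comparing two proportions."""
    diff = abs(p1 - p2)
    p_bar = (p1 + p2) / 2
    se_null = (2 * p_bar * (1 - p_bar) / n_per_group) ** 0.5   # SE under H0
    se_alt = (p1 * (1 - p1) / n_per_group +
              p2 * (1 - p2) / n_per_group) ** 0.5              # SE under H1
    z_crit = norm.ppf(1 - alpha / 2)
    return norm.cdf((diff - z_crit * se_null) / se_alt)

# Assumed: ~215,000 cases split evenly; death rate ~0.71% before vs ~0.65% after.
print(two_prop_power(0.0071, 0.0065, n_per_group=107_500))  # roughly 0.4
```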

Then I became curious about Gawande’s original study. They achieved better than 80% power with just over 7,500 cases. How is this possible?!?

The most important thing I keep in mind when I think about statistical significance (other than the importance of clinical significance [2]) is that it depends not only on the sample size, but also on the baseline prevalence and on the magnitude of the difference you are looking for. In Gawande’s original study, the baseline prevalence of death was 1.5%.

This is substantially higher than the 0.7% in the Ontario study. When your baseline prevalence gets very low (approaching 0%), the same relative reduction translates into a tiny absolute difference, and you have to pump up the sample size to achieve statistical significance.
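
To make the prevalence point concrete, here is the same calculation run on Gawande’s numbers, reusing the two_prop_power sketch above. The 1.5% baseline is quoted above; the roughly 0.8% post-checklist rate and the even split of about 7,500 cases are my assumptions for illustration. The second line applies the same relative reduction to a 0.7% baseline to show how much power evaporates.

```python
# Reusing two_prop_power() from the sketch above.
# The 1.5% baseline is quoted in the post; the ~0.8% post-checklist rate and
# the even split of ~7,500 cases are my assumptions for illustration.
print(two_prop_power(0.015, 0.008, n_per_group=3_750))    # roughly 0.8

# The same ~47% relative reduction, starting from a 0.7% baseline instead:
print(two_prop_power(0.007, 0.0037, n_per_group=3_750))   # roughly 0.5
```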

So Gawande’s study achieved adequate power because its baseline rate was higher and the difference it found was bigger. The Ontario study would have needed a little over twice as many cases to achieve 80% power.
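
For completeness, here is the inverse calculation, sketched with the same assumed rates as above; how many extra cases you conclude they needed is quite sensitive to the post-implementation rate you plug in.

```python
# Approximate per-group sample size for a two-sided two-proportion z-test.
# Same illustrative rates as above; this is not the study's own calculation.
from scipy.stats import norm

def n_per_group_for_power(p1, p2, power=0.80, alpha=0.05):
    z_a = norm.ppf(1 - alpha / 2)
    z_b = norm.ppf(power)
    p_bar = (p1 + p2) / 2
    numerator = (z_a * (2 * p_bar * (1 - p_bar)) ** 0.5 +
                 z_b * (p1 * (1 - p1) + p2 * (1 - p2)) ** 0.5) ** 2
    return numerator / (p1 - p2) ** 2

# With assumed death rates of ~0.71% vs ~0.65%, this lands near 300,000 per group.
print(n_per_group_for_power(0.0071, 0.0065))
```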

This raises an important question: why didn’t the Ontario study look at more cases?

The number of cases in a traditional study is dictated by the practical limits of data collection: studies are generally constrained by the manpower investigators can afford to hire and by how long they can realistically run.

However, studies that use existing databases are usually not subject to these constraints. While creating queries to extract data is often tricky, once you have set up your extraction methodology, the query simply dumps the data into your study database.

You can extend or contract the time period for data collection simply by changing the parameters of your query, and modern computing power places few limits on the size of these study databases or the statistical methodologies we can employ. Simply put, the Ontario study (which relied on ‘administrative health data,’ read: ‘existing data’) could easily have doubled the number of cases it examined.
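
To illustrate what I mean, here is a deliberately hypothetical sketch: the database file, table, and column names are all made up and have nothing to do with the Ontario team’s actual extraction code. The point is simply that the study window is a query parameter.

```python
# Entirely hypothetical: a made-up `surgical_cases` table queried over a
# parameterized date window. Widening the study period is a change to two
# parameters, not a new round of data collection.
import sqlite3
from datetime import date

def extract_cases(conn, start, end):
    """Pull surgical cases whose procedure date falls within [start, end]."""
    query = """
        SELECT case_id, hospital_id, procedure_date, died_in_hospital
        FROM surgical_cases
        WHERE procedure_date BETWEEN ? AND ?
    """
    return conn.execute(query, (start.isoformat(), end.isoformat())).fetchall()

conn = sqlite3.connect("admin_health_data.db")  # hypothetical database file
three_month_window = extract_cases(conn, date(2010, 1, 1), date(2010, 3, 31))
twelve_month_window = extract_cases(conn, date(2009, 7, 1), date(2010, 6, 30))
```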

Exactly how did they define their study group? As Gawande points out in his critique, the Ontario study relied on this bizarre 3-month window before and after checklist implementation at individual hospitals. Why 3 months? Why not 6 or 12 or 18? They even write in their methods:

We conducted sensitivity analyses using different periods for comparison. [3]

They never give the results of these sensitivity analyses or provide a sound justification for the choice of a 3-month period. Three months not only keeps their power low, but it also fails to account for secular trends. Maybe something like influenza was particularly bad in the post-checklist period, leading to more deaths despite effective checklist use.

Maybe a new surgical technique or tool was introduced, like the da Vinci robot, or many new, inexperienced surgeons were hired, increasing mortality. In discussing their limitations, the authors address this:

Since surgical outcomes tend to improve over time, it is highly unlikely that confounding due to time-dependent factors prevented us from identifying a significant improvement after implementation of a surgical checklist.

I will leave it to you to decide if you think this is an adequate explanation. I’m not buying it.

Gawande concludes that this study reflects a failure in how checklists were implemented, rather than a failure of the checklists themselves. I’m inclined to agree.

Ultimately, I don’t wonder why this study was published; bad studies are published all the time (hence the work of John Ioannidis). I wonder why this study was published in the New England Journal of Medicine. NEJM is supposed to be the gold standard for academic medical research.

If they print it, you should be confident in the results and conclusions. Their editors and peer reviewers are supposed to be the best in the world. The Ontario study seems to be far below the standards I expect for NEJM.

I think their decision to accept the paper hinged on the fact that this was a large study that showed a negative finding on a subject that has been particularly hot over the past few years [4]. Nobody seemed to care that this was not a particularly well-conducted study; this is the sadness that plagues the medical research community. Be a critical reader.

Josh Herigon, MPH (@JoshHerigon) is a 4th year medical student at the University of Kansas who writes about the intersection of medicine, technology, and social media at mediio, where this post originally appeared.


  1. Remember, we conventionally aim for a power of 80% (or better). 
  2. Clinical significance refers to the importance of a finding in terms of its impact on something clinically meaningful. To use data from the Ontario study as an example, they show a statistically significant drop in the length of hospital stays from 5.11 days to 5.07 days. Despite this finding’s statistical significance, who cares?! You’re still in the hospital for roughly 5 days either way. 
  3. I am taking ‘sensitivity analysis’ to mean in this case that they actually looked at various time periods—maybe 6 or 12 or 18 months—to see how their results changed. Usually when people do this, they give some indication of the results of their sensitivity analyses and why they decided to stick with the original plan. 
  4. Yes, checklists are hot. I mean, Atul Gawande wrote a best-selling book about them. Granted, he’s such a great writer that he could spend 300 pages expounding upon why the sky is blue and it would sell.

Comments

Saurabh Jha:

I suspect the researchers did their original sample size calculations based on an estimated size of effect. They extrapolated the mortality reduction in the 2009 NEJM paper on checklists to their own study population. As pointed out the baseline prevalence of surgical death affects sample size. However, the baseline prevalence of surgical complications also affects the incremental gain of checklists. When the study did not find a benefit, which might have surprised the researchers, they might have not published, in which case they would have been accused of publication bias. I don’t believe this NEJM study should be dismissed so…

Steve:

Maybe they were interested in the policy of mandated checklists, not the checklists themselves. That seems like an important thing to study since regulators can’t make people use checklists, they can only mandate them. As Atul emphasizes, these are not the same thing, and both are interesting.

Josh Herigon:

Absolutely, Steve. I think the authors of the Ontario paper should have made that more clear in their construction of the article. However, I do not think it is an excuse for poor methodology. I think making a point about the difficulty of implementation and the problematic nature of mandates would have been further supported by looking at data over a longer period of time.

Other Tuscaloosa:

Idiocracy + oligarchy = no authoritative sources for anything.