Data Parasites?

In just four years, it seems, data science has devolved from the “sexiest job of the 21st century” to a community of “research parasites.”

The latest assessment is courtesy of an editorial in the New England Journal of Medicine (NEJM), written by editor-in-chief Jeff Drazen, along with Dan Longo.

Essentially, Longo and Drazen argue that while the Platonic ideal of rich data sharing is lovely, reality is not so pretty.

First, Longo and Drazen allege, researchers who weren’t involved in gathering the original data often lack essential appreciation for how it was gathered, and thus may misinterpret it, as they “may not understand the choices made in defining the parameters.”

Second–and this is really the heart of the issue–Longo and Drazen worry that a new class of research person will emerge—people who had nothing to do with the design and execution of the study but use another group’s data for their own ends, possibly stealing from the research productivity planned by the data gatherers, or even use the data to try to disprove what the original investigators had posited. There is concern among some front-line researchers that the system will be taken over by what some researchers have characterized as “research parasites.”

Instead, Longo and Drazen urge would-be data scientists to work collaboratively with the original investigators, and to share co-authorship, and cite as an exemplar a paper in the current issue of the NEJM where just this model was successfully pursued.

My Twitter feed exploded in response to this editorial:

Michael Eisen, geneticist at UC-Berkeley: “One of the most shockingly anti-science things ever written.”

Sek Kathiresan, cardiologist/geneticist at MGH/Broad: “Shocking to see disparaging term ‘research parasite’ to describe use of data often created w/ public funds.”

Michael Hoffman, computational genomicist at the Princess Margaret Cancer Centre, Toronto: “Fear that others may use data ‘to try to disprove what the original investigators had posited’ is a dangerous misunderstanding of science.”

In contrast, I was delighted to see this editorial.

Not because I agreed with it–my heart is truly with the data scientists–but because I was grateful that someone had the courage to articulate a perspective I’ve come to believe is shared by the vast majority of academic researchers, but publicly voiced by no one–until now.

The result: a classic case of stated preference vs. revealed preference, where every academic researcher dutifully claims to be interested in sharing their data widely and freely, but somehow tends not to actually do so.

Why? I’m sure the reasons vary, but somewhere near the top of the list is that most researchers perceive very little upside in generously and richly sharing their raw data. At a minimum, it’s regarded as a thankless hassle (one of the reasons negative results are often not published, and one of the reasons submissions to public resources like ClinVar usually come in much slower than many might anticipate), and beyond that, someone may look at the data differently and come to a different conclusion.

True, that might be how science is supposed to work–there is that–but it’s incredibly difficult to motivate individuals to act in a fashion they view–not unreasonably–as against their self-interest.

This may not be how the world should work–but it is how it often seems to work, and without some acknowledgement of this perspective, efforts to spur data sharing among academic investigators will simply consume everyone’s time and arrive at high-minded conclusions not likely to be actioned.

The perfect example of this: our endless discussions about sharing data from electronic health records (EHRs). Everyone publicly agrees data should be shared easily, but somehow, progress is almost unimaginably slow–almost as if the key stakeholders really aren’t that interested in seeing it happen (see my comments about data silos here, and interoperability here).

Imagine how much better off we might be if a hospital executive wrote a similarly honest editorial about reluctance to share EHR data for competitive reasons. True, the author would be pilloried and then fired, but at least the candid perspective might meaningfully inform the national conversation.

I would like to live in a world where research data is shared in the expansive fashion envisioned by Atul Butte, rather than as described by Drazen and Longo. I would like to live in a world where clinical data is shared in a generous, intuitive and friction-free manner, rather than needing to be pried forcibly out of an EHR.

But until and unless we come to grips with the gritty, human, competitive reasons why data aren’t shared, we’re headed to a future of righteous proclamations–and little data movement.

David Shaywitz is based in Mountain View, California. He is Chief Medical Officer at DNAnexus, a Mountain View-based company, and holds an adjunct appointment, Visiting Scientist, in the Department of Biomedical Informatics at Harvard Medical School.

2 replies »

  1. If a hundred providers are all competing on what data they collect and how they present it, maybe there is more innovation and creativity and clinical usefulness (value) in their cumulative output than there would be in the output of a hundred matched providers who shared all their data.

    To measure this we have to agree upon how to measure the value of the data to the entire system. How do we measure this?

  2. Interesting post. Perhaps if Taylor Swift’s® attorneys have a little spare time left over from trademarking every phrase in the English language and suing those who use them on t-shirts, they can turn their guns on clinical data. 😉