A Case for Open Data

A couple of weeks ago, President Obama launched a new open data policy for the federal government. Declaring that “…information is a valuable asset that is multiplied when it is shared,” the Administration’s new policy empowers federal agencies to promote an environment in which shareable data are maximally and responsibly accessible. The policy supports broad access to government data in order to promote entrepreneurship, innovation, and scientific discovery.

If the White House needed an example of the power of data sharing, it could point to the Psychiatric Genomics Consortium (PGC). The PGC began in 2007 and now boasts 123,000 samples from people with a diagnosis of schizophrenia, bipolar disorder, ADHD, or autism, along with 80,000 controls, collected by more than 300 scientists from 80 institutions in 20 countries. This consortium is the largest collaboration in the history of psychiatry.

More important than the size of this mega-consortium is its success. There are perhaps three million common variants in the human genome. Amidst so much variation, it takes a large sample to find a statistically significant genetic signal associated with disease. Showing a kind of “selfish altruism,” scientists began to realize that by pooling data, combining computing efforts, and sharing ideas, they could detect the signals that had been obscured by a lack of statistical power. In 2011, with 9,000 cases, the PGC was able to identify 5 genetic variants associated with schizophrenia. In 2012, with 14,000 cases, they discovered 22 significant genetic variants. Today, with over 30,000 cases, over 100 genetic variants are significant. None of these alone is likely to be a genetic cause of schizophrenia, but together they define the architecture of risk and collectively could be useful for identifying the biological pathways that contribute to the illness.

We are seeing a similar culture change in neuroimaging. The Human Connectome Project is scanning 1,200 healthy volunteers with state-of-the-art technology to define variation in the brain’s wiring. The imaging data, cognitive data, and de-identified demographic data on each volunteer are available, along with a workbench of web-based analytical tools, so that qualified researchers can obtain access and interrogate one of the largest imaging data sets anywhere. How exciting to think that a curious scientist with a good question can now explore a treasure trove of human brain imaging data, and possibly uncover an important aspect of brain organization, without ever doing a scan.

However, not all scientists are comfortable sharing data. Some point out that data collected under different conditions or with different assessment tools should not be combined. Some have expressed concern that data will be “misinterpreted” if analyzed without the input of the researchers who collected them. And others worry about the competitive disadvantage of sharing data before publication. In an academic culture that rewards the first to report a finding and in which publication is critical for promotion, sharing might seem unfair to early career scientists and unacceptable to more established investigators. Finally, privacy concerns may be a complex, though not insurmountable, barrier to sharing data, both for scientists and for research participants. We must not minimize these concerns. But as an agency that is ultimately focused on improving the health of patients, NIMH must find a way to balance the concerns of the academic community with our public health mission.

If “information is a valuable asset that is multiplied when it is shared,” then the question for publicly funded research is not whether, but how, to share. Currently, NIH policy expects a data sharing plan for all proposals over $500,000 per year in direct costs. However, some research communities have developed their own “subcultures” in which sharing is expected, and executed, for all grants, not just those over the $500,000 threshold. For example, all researchers conducting NIH-funded genome-wide association studies (GWAS) submit their data to the NIH Database of Genotypes and Phenotypes (dbGaP), as expected by the NIH GWAS Data Sharing Policy. In other areas, such as autism research, NIH expects all funded clinical studies to deposit data in the NIH National Database for Autism Research (NDAR).

These two trans-NIH data sharing efforts are a great start. But as a community, can we do better in other areas, such as clinical trials, by defining our standards for data sharing? For example, should we develop common data elements and create repositories for shared data in other research fields? What is the right balance between providing qualified researchers with access to data at the earliest opportunity and respecting the needs of those who collected the data? How can we incentivize sharing and data mining when many investigators do not have the funding to analyze their own data sufficiently? Should some data not be shared? NIH has been developing resources to facilitate this conversation, such as key elements to consider when preparing a data sharing plan.

The culture of science is changing. Just look at the Broad Institute’s global alliance on sharing of genomic and clinical data: over 70 health care, research, and disease advocacy organizations have taken the first steps toward enabling the secure sharing of data. Public demands for access, transparency, and accountability are increasing. New scientific journals are supporting more access to data. Scientists are demanding, and finding value in, sharing and cooperation. I suspect the PGC, the Human Connectome Project, and NDAR are the front wave of what will be a sea change in the conduct and reporting of science. Having a new open data policy provides even more reason to create the tools and rules needed to support these changing times.

Thomas Insel, MD is the director of the National Institute of Mental Health (NIMH), where he blogs regularly on key initiatives in the mental health space.