Pharma’s (Big) Data Problem

C.P. Snow, author of “The Two Cultures”

Despite (some might say, because of) a raft of new biological methods, pharma R&D has struggled with its EROOM problem, the fact that the cost of successfully developing a new drug, including the cost of failures, has been relentlessly increasing, rather than decreasing, over time (EROOM is Moore spelled backwards, as in Moore’s Law, describing the rapid pace of technology improvement over time).

Given the impact of technology in so many other areas, the question many are now asking is whether technology could do its thing in pharma, and make drug development faster, cheaper, and better.

Many major pharmas believe the answer has to be yes, and have invested in some version of a by-now familiar data initiative aimed at aggregating and organizing internal data, supplementing this with available public data, and overlaying this with a set of analytical tools that will help the many data scientists these pharmas are urgently hiring to extract insights and accelerate research.

A bevy of established companies and consultancies, including Deloitte, Accenture, BCG, and McKinsey, are championing some version of this vision, along with a number of younger companies, including Silicon Valley powerhouse Palantir, and startups like Datavant.

Two significant challenges associated with this vision are:

(1) How do you achieve it?

(2) Will it actually work?

These are each enormous unknowns with which the entire industry is currently wrestling.

Organizing Pharma Data

“Every biomedical organization has their data spread across multiple data stores, and the real work to be done is curating it into a form that can be cross-analyzed,” Anthony Philippakis, Chief Data Officer of the Broad Institute, explained to me. “The challenge that life science organizations face is not so much analyzing their data, but rather organizing it.”

He adds, “As we think about making precision medicine a reality, I think it is much more likely that we will fail because of the challenges of data sharing and data curation, rather than the challenges of scalability or analytics.”

Philippakis has championed a concept he and his colleagues call a “data biosphere,” an ecosystem that “contains modular and interoperable components that can be assembled into diverse data environments.” Philippakis argues, “It is crazy to think that any one group can create a data platform that will satisfy the needs of all groups across all geographies. We need modular and interoperable services that result in an ecosystem of activity.” (See also this 2012 Atlantic piece.)

Leveraging Organized Data

As many biopharma companies invest significant treasure and time in collecting and organizing their data — an inordinately heavy lift — the question is: will it be worth it? Will having a huge amount of organized data radically change how pharma companies discover and develop drugs?

That’s the idea, certainly, if not the expectation – but we should take such predictions with a grain of salt. As Economist reporter Natasha Loder reminded us recently on Twitter, analysts in 1997 predicted that new techniques, “things like ‘bioinformatics’ and ‘rational drug design,’” were “likely to have a huge effect on the drug industry,” radically accelerating development time and doubling the success rate of late-phase trials. More than twenty years later, we’ve seen that these techniques are certainly powerful – yet unfortunately, they’ve not (yet?) cracked the nut of drug development.

Even so, there seems to be a pervasive sense that once the required internal and external data are cobbled together, R&D-altering insights will follow. How?

This is the pharma data version of the famous South Park underpants gnome story, where the business plan is roughly:

Step 1: Collect lots of data

Step 2:

Step 3: Insight and efficiency!

At least, this seems to be how the investment is viewed from the R&D trenches (as I recently discussed), where drug developers are vaguely aware of institutional data efforts, work that for the most part hasn’t yet really impacted how most drug development teams go about their jobs.

From the C-suite, though, Step 2 is “data science,” and the plan of many pharmas, to paraphrase Matt Damon in The Martian, is to collect a ton of information and then data science the heck out of it.


Aren’t Pharmas Already All About Data And Science?

What exactly is this much-celebrated “data science,” and how is it different from what pharmas are already doing? After all, quantitatively analyzing data is already a central part of pharma R&D, and has been for quite some time.

One of the best answers I’ve heard is from UCSF’s Atul Butte, who told me,

“In general, most computational/informatics folks are viewed as service providers, and they are really not shown all the data available within a company. To be blunt, they [i.e. pharmas] should be empowering computational folks to come up with ideas and hypotheses using their own internal data (and outside public data), and running experiments to test those ideas.”

Harvard professor Zak Kohane agrees. “Biomedical informaticians and clinical investigators often view each other as intellectual peasants providing rote/mechanical services.”

The problem Butte and Kohane are pointing out is that in most pharmas, drug development is driven by bench scientists (in preclinical development) and by clinical investigators (once a product enters human studies). A lot of analyses are performed, but in relatively stereotypical or predefined ways, trying to answer specific questions posed by bench or clinical scientists. Ideally, those doing this analysis work in close partnership with those leading drug development, but I suspect very few of these statisticians and analysts believe they’re driving the bus.

What Butte and Kohane are arguing for is at some level a radical change; it’s the suggestion that if you let savvy data scientists loose on a reasonable amount of data, and let them figure out what to ask, they will come up with insights by asking questions others in the organization might not have thought of. (There are obviously parallels here to the “research parasite” debate of 2016, another tussle between clinical investigators and data scientists.)

Complicating matters is the deep disconnect between those expert in traditional domains of drug discovery, and those expert in data science. As Kohane observes, “What is best is one brain (multidisciplinary team second best) capable of a nimble back and forth between questions, hypothesis testing and analyses,” echoing a very similar assertion Calico Chief Computing Officer Daphne Koller recently made on our Tech Tonics podcast (episode here).


The fascinating question is how all this gets resolved. At many biopharmas today, drug development teams toil pretty much as they always have, while data groups collect and assemble both data and talent, in relative isolation. It is a tale of two cultures worthy of C.P. Snow.


It’s not clear to me, or to anyone, how this story ends, but pragmatically, a key step has got to be meaningful contributions from data scientists – novel insights from select, aggregated datasets that are too important for a pharma R&D organization to ignore. Inevitably, these insights will come from the integration of some datasets, but not all possible datasets, and it’s interesting to contemplate which datasets will be most informative. Which do you (or should you) start with to try to generate a quick, meaningful win?

There are two categories of positive outcomes I can foresee. First, a small company will figure out how to effectively leverage a group of datasets, and use this insight to successfully generate either a molecule or an approach (to clinical trial recruiting, say) that pharmas would be keen to access. Second, a large company might figure out how to solve the culture problem, and find a way to help traditional drug developers and eager data scientists work together effectively, and in particular learn the sorts of questions and approaches relevant to each domain. This seems incredibly difficult to pull off in pharma companies, which tend to be extremely territorial by nature – but imagine the great outcome achievable both for the company that figures this out and for the patients who would benefit from the novel insights such a collaborative team might generate.

David Shaywitz is a Senior Partner with Takeda Ventures and a Visiting Scientist at Harvard Medical School.


2 replies

  1. The reference to Baron Snow (1905-80), who trained as a physicist, is odd. He is known for a speech given at the Senate House in Cambridge, England on May 7, 1959. Here is a brief quotation from that lecture: “I believe the world is increasingly in danger of becoming split into groups which cannot communicate with each other, which no longer think of each other as members of the same species.” Taken alone, this quotation might be viewed as referring to the geopolitical divisions within the world-wide community. This view, in fact, would not represent its original context. He went on to describe his view that a communication gap was evolving between the sciences and the humanities.
    It might be possible to view this communication gap and its associated cognitive dissonance as the ‘root cause’ of ‘root causes’ for our nation’s poorly focused attempts at healthcare reform. There is no reason to believe that any of the current strategies will solve its cost and quality problems, community by community. Meanwhile, our nation’s health spending increases at the rate of 5.0% compounded annually, inflation and economic growth corrected. And our nation’s maternal mortality incidence continues to worsen annually as it has for more than 25 years.

  2. Neither is the answer. Why, you say? The answer may be found in the following definition of HEALTH.
    A person’s continuing expression of variably stable survival that is:
    ENDOWED BY the person’s individually unique CLUSTERS of Human Capabilities and their transformation during maternal gestation to become sufficient for stable survival after birth as a dependent person with an INNATE TEMPERAMENT and BASELINE HOMEOSTASIS;
    NURTURED BY the Caring Relationships originating initially
    ….before birth from among the dependent person’s Family with a commitment to fulfill the person’s caring-learning-creative CLUSTER of Human Capabilities for becoming an independent person AND
    ….after birth from among the person’s Extended Family and the Neighborhood Network of the person’s Family with a commitment to offer continuing support for the person’s Family, especially during the person’s early childhood;
    MATURED BY the repetitive occurrence of Disruptive Events, Entropy and their interacting effects of the evolving Resilience of the person’s Innate Temperament and Baseline Homeostasis, as ameliorated concurrently through the Caring Relationships originating from within the person’s Extended Family; AND
    SUSTAINED BY the person’s Family Traditions and by the COMMON GOOD of the person’s community until the person’s CLUSTERS of Human Capabilities become insufficient for stable survival from the cumulative effects of Disruptive Events and Entropy occurring during the person’s life-time.
    The essential ingredients for a person’s variably unstable HEALTH are unlikely to be defined at one point in time by one or more giant algorithms. The keys to this lie in its Caring Relationships. I close with this definition of a CARING RELATIONSHIP as in marital, parental, sibling, medical, pastoral, teacher, or spiritual.
    A variably asymmetric interaction between two persons
    .occurring for a brief period of time and mutually experienced
    .during the initial interaction, and any continuing repetition,
    .with sufficient congruence among its attributes
    .for perceiving a shared Beneficent intent
    .to enhance each other’s Autonomy by communicating with
    …………………………………………………………..HONESTY and
    To round out this overall concept requires a whole handful of supplementary definitions, chiefly Family, Social Capital and COMMON GOOD. Another Time. If you are really curious: see