The Rise of Big Data

Health care is in the process of getting itself computerized. Fashionably late to the party, health care is making a big entrance into the information age, because health care is well positioned to become a big player in the ongoing Big Data game. In case you haven’t noticed, computerized health care, which used to be the realm of obscure and mostly small companies, is now attracting interest from household names such as IBM, Google, AT&T, Verizon and Microsoft, just to name a few. The amount and quality of Big Data that health care can bring to the table is tremendous, and it complements the business activities of many large technology players. We all know about paper charts currently being transformed via electronic medical records to computerized data, but what exactly is Big Data? Is it lots and lots of data? Yes, but that’s not all it is.

Americans live for approximately 78 years. They see a doctor about 4 times per year and spend on average 0.6 days each year in a hospital.

To keep a lifetime record of blood pressure readings for all Americans, including metadata (date/time of reading, who recorded the measure and where, etc.), takes approximately 6 TB (terabytes) of storage space, or about 12 laptops with standard 500 GB hard drives. Not too big. What if we start using mobile wearable devices to quantify ourselves, as some folks already do, and we record blood pressure, say, every hour? We will require 1460 TB of storage, or almost 3000 laptops, or the equivalent of 6 times the digitized contents of the Library of Congress, and this is for blood pressure monitoring only.
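The arithmetic behind estimates like these is easy to rerun with your own inputs. Below is a minimal Python sketch. The population figure and the bytes-per-reading value are assumptions chosen for illustration, since the inputs behind the numbers above are not stated; only the 78-year lifespan and the 4 visits per year come from the text. The function name `lifetime_storage_tb` is hypothetical.

```python
# Back-of-the-envelope storage estimate for lifetime blood pressure records.
# US_POPULATION and the bytes-per-reading values are assumptions for
# illustration; the 78-year lifespan and 4 visits per year are from the text.

US_POPULATION = 310_000_000   # assumption: rough U.S. population
LIFESPAN_YEARS = 78           # from the text
BYTES_PER_TB = 10**12         # decimal terabyte


def lifetime_storage_tb(readings_per_year: float, bytes_per_reading: int) -> float:
    """Storage in TB for one lifetime of readings for the whole population."""
    readings_per_person = readings_per_year * LIFESPAN_YEARS
    total_bytes = US_POPULATION * readings_per_person * bytes_per_reading
    return total_bytes / BYTES_PER_TB


if __name__ == "__main__":
    # Assumed record size: reading plus metadata, in bytes (hypothetical value).
    record_bytes = 62
    office = lifetime_storage_tb(readings_per_year=4, bytes_per_reading=record_bytes)
    hourly = lifetime_storage_tb(readings_per_year=24 * 365, bytes_per_reading=record_bytes)
    print(f"office visits only: {office:,.1f} TB")
    print(f"hourly wearable:    {hourly:,.1f} TB")
```

The total grows linearly with both the reading frequency and the record size, so the assumed size of each stored reading matters as much as how often it is taken.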

Adding in the remaining 99.9% of the medical record, including large imaging files, hospital monitoring devices, pharmacy data, insurer data, telehealth sessions and other personal health sensors, and keeping in mind that all these data are meant to be exchanged freely over the Internet, we are approaching a data tsunami of biblical proportions. And we are not done just yet. Once health care’s Big Data is released into the mainstream Internet, it will initiate secondary and tertiary waves of new data created by consumers addressing their newly found health care data on social media venues, specialty forums, blogs and commercial sites offering services for health data. Big Data is the fluid combination of the ever-increasing real-time data streams created by everything from government to businesses to Facebook, Twitter, geo-locators, mobile devices and connected sensors everywhere. Big Data is as much about size as it is about cross-pollination of data from disparate sources.

A fascinating June 2011 McKinsey report predicts that Big Data is the “next frontier for innovation, competition, and productivity” and that Big Data will become equal to labor and capital in its importance to production. For U.S. health care, the report is predicting $300 billion per year in savings due to utilization of Big Data to drive the execution of strategies proposed by health care experts. In the area of clinical operations, the report lists projected savings from Comparative Effectiveness Research (CER) when tied to insurance coverage, Clinical Decision Support (CDS) savings derived from delegating work to lower-paid resources and from reductions in adverse events, transparency for consumers in the form of quality reports for physicians and hospitals, home monitoring devices including pills that report back when they are ingested, and profiling patients for managed care interventions. Administrative savings are projected from automated systems to detect and reduce fraud and from shifting to outcomes-based reimbursement for providers and, interestingly, for drug manufacturers through collective bargaining by insurers. Most savings listed under research and development opportunities from Big Data seem to accrue to pharmaceutical and device manufacturers. There is nothing to suggest that Big Data will somehow reduce unit prices of products or services.

To be honest, I don’t quite understand where the $300 billion in savings come from, as there are no actual itemized numbers to support this prediction. In addition to stated reliance on individual studies and expert interviews, there are many structural assumptions regarding massive provider consolidation, proliferation of Accountable Care Organizations, technology adoption rates of 90% across the industry and data sharing amongst all stakeholders, at which point Big Data will come in and do its thing. The costs for generating, storing and analyzing Big Data, which include emerging data storage technologies and analytical expertise, are factored in, with the costs of national deployment of EHRs alone “estimated at around $20 billion a year, after initial deployment (estimated at up to $200 billion)”.

Most people, including doctors, will probably agree that pertinent data, big or small, can be transformed into pertinent information, and pertinent information is vital to good decision making. But is Big Data pertinent? Are all those petabytes of minute details about everything and everybody really useful, or are we just mixing a little wheat with a lot of chaff? There are various opinions on this, but the prevailing wisdom seems to be that the more data you have, the more likely you are to be able to extract something useful out of it. By observing patterns and correlations in this ocean of information you may discover answers to questions you wouldn’t have known to ask in the first place. There is much power in Big Data, but there is also danger. As big as Big Data may be, it does not guarantee that it is complete or accurate, which may lead to equally incomplete and inaccurate observations. Big Data is not available to all and is not created by all in equal amounts, which may lead to undue power for Big Data holders and misrepresentation of interests for those who do not generate enough Big Data. Collection and analysis of Big Data has obvious implications for privacy and human rights. But the biggest danger of all, in my opinion, is the forthcoming relaxations in the rigors of accepted scientific methods, and none seems bigger than the temptation to infer causality from correlation.

We’ve been there before. When humanity dwelt in caves and villages, correlation was enough to establish causality. We’ve come a long way since, but the global village we are creating today seems tempted to go back to observation as the main way of gaining understanding. Just like the historic villagers, we are now convinced that we can see everything there is to be seen; therefore the answers to all our questions must be found in the Big Data mirror we placed in front of us. All we have to do is stare at it long enough and the patterns will emerge. The sheer size and variety of Big Data will make it much easier to reject the null hypothesis and see patterns where none exist. On the other hand, if we keep staring at our digital selves in the eye for long enough, perhaps we will achieve the most coveted observation of all: a glimpse through the windows to our digitized soul.
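The point about seeing patterns where none exist can be illustrated with a small simulation. The sketch below (Python, standard library only) generates series of pure random noise, which are unrelated by construction, and counts how many pairs still show a correlation above a chosen cutoff. The function names, the cutoff and the sample sizes are all illustrative assumptions, not anything taken from the McKinsey report.

```python
import math
import random
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def count_spurious_pairs(n_variables, n_samples, threshold, seed=0):
    """Count pairs of independent noise series with |r| above the threshold."""
    rng = random.Random(seed)
    # Every series is independent Gaussian noise: no real relationship exists.
    series = [
        [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
        for _ in range(n_variables)
    ]
    return sum(
        1
        for a, b in combinations(series, 2)
        if abs(pearson(a, b)) > threshold
    )


if __name__ == "__main__":
    # Hypothetical settings: 30 observations per variable, cutoff |r| > 0.4.
    for n_variables in (10, 50, 100):
        hits = count_spurious_pairs(n_variables, n_samples=30, threshold=0.4)
        pairs = n_variables * (n_variables - 1) // 2
        print(f"{n_variables} variables, {pairs} pairs: {hits} chance correlations")
```

With a fixed cutoff, the number of pairs to test grows roughly with the square of the number of variables, so the count of chance "findings" tends to grow with it even though no series has anything to do with any other.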

Margalit Gur-Arie was COO at GenesysMD (Purkinje), an HIT company focusing on web based EHR/PMS and billing services for physicians. Prior to GenesysMD, Margalit was Director of Product Management at Essence/Purkinje and HIT Consultant for SSM Healthcare, a large non-profit hospital organization. She shares her thoughts about HIT topics and issues at her blog, On Healthcare Technology.

6 replies

  1. Mike,

    The only way you can effectively understand medical data is to substantially use non-medical data at the same time. Looking at blood pressure data over time may show you that your blood pressure is only elevated when your mother-in-law is in town. That’s good to know – but you can’t find that out from medical data alone.

  2. It’s pretty clear you guys studied 20th century medicine but 16th century mathematics.

    Causality versus correlation, avoiding overfitting (translation: seeing patterns when there were none), etc., were problems that were well understood in the 19th century, and completely resolved in the 20th – notwithstanding the subsequent abuse they suffered at the hands of big tobacco.

    Maximum likelihood estimators and related ideas got their start in the 1920s. The seminal information-theoretic insights that underpin modern data mining arrived in the 50s. Then, in the 60s-70s, theoretically sound frameworks for general theorem inference were developed (MML in particular). This implies that if you need to ‘dispense with scientific rigor’, you’re simply incompetent.

    Since then, we got computers, startups, and a whole universe of people who are all going to be _very interested_ in doing this stuff right – if only because there’s a ton of money for the winners and they’ll all be at each other’s throats. Not to mention a sudden public interest in transparency and data sharing (which is going to help considerably).

    Finally: storage costs for all the medical data from my lifetime? I could stick it on a thumb drive now, for about $10 – and these (future) ‘data torrents’ will fit just as comfortably on (future) thumb drives. I’ll save more than that on my next copay, if it lets me choose a generic drug.

    Storage for the medical data for everyone forever? Google (and Bing) already store the entire web – and they do it cheaply enough to support themselves by _advertising_. Netflix already uses more bandwidth than will _ever_ be streamed as medical data (how much time do you spend watching movies compared to lying in a CAT scanner?).

    Medical ‘big data’ isn’t that big.

  3. Hi David and Gilles,
    Great to hear from you both.
    Yes, the BP example was a bit contrived for the sole purpose of illustrating the magnitude of storage needed for just 6 characters of lifetime data.

    Not sure why THCB did not hyperlink the citations in this post, but Danah Boyd’s article was what I cited for the dangers of big data.
    The temptation to think that all this data can provide all the insights we always wanted is very great. I know it is for me. But I agree Gilles, the scientific methods must be preserved, and as you probably saw in the Boyd article, some folks seem ready to dispense with rigor.

    There was an anecdotal story in another citation I had, from an Aspen Institute roundtable, about an observed correlation of fully booked restaurants in 7 cities, three days prior to Bear Stearns’ demise. What does it mean? Should we start tracking restaurant occupancy to predict Wall Street trends? This is scary stuff….

  4. Margalit,

    Any time I see an article about the significance of Big Data I fear this is going to be yet another blind hymn to the Big Data Gods. But yours is clearly not that. Thank you!

    Those, like me, who have significant issues with the blind acceptance of Big Data as an automatic generator of great knowledge, can now use danah boyd’s “Six Provocations for Big Data” as a good basis for interesting conversations with those who want to minimize all medical encounters into finite structured data sets.

    Counterbalancing what could be construed as the negative impact of Big Data, you can also see some amazingly good results. A nice example was mentioned this past June by Ben Goldacre in his blog about bad science: “There’s something magical about watching patterns emerge from data”.

    As long as results based on Big Data are questioned with the same level of doubt as any other scientifically validated dataset, we should be able to obtain significant insights. It remains that the technological expertise needed to validate these data sets may be problematic in this initial phase.

  5. Margalit,
    Yes, data is getting “big” and there is a legitimate concern about avoiding the overwhelming TMI syndrome and preserving usability. But the example of a lifetime of BP readings, even though it generates big numbers to illustrate your point, is strange. Who would want that? Is anyone proposing something like that?

    But you make a great conclusion about the danger of inferring causality from correlation. Though I question your specific example, this is a point that needs to be thought through before people plunge forward to assume that “more is better.” They also ought to think about “sometimes less is more!”