In their best-selling 2013 book Big Data: A Revolution That Will Transform How We Live, Work and Think, authors Viktor Mayer-Schönberger and Kenneth Cukier selected Google Flu Trends (GFT) as the lede of chapter one.
They explained how Google’s algorithm mined five years of web logs, containing hundreds of billions of searches, and created a predictive model utilizing 45 search terms that “proved to be a more useful and timely indicator [of flu] than government statistics with their natural reporting lags.”
Unfortunately, no. The first sign of trouble emerged in 2009, shortly after GFT launched, when it completely missed the swine flu pandemic. Last year, Nature reported that Flu Trends overestimated by 50% the peak Christmas season flu of 2012. Last week came the most damning evaluation yet.
In Science, a team of Harvard-affiliated researchers published their findings that GFT has over-estimated the prevalence of flu for 100 out of the last 108 weeks; it’s been wrong since August 2011.
The Science article further points out that a simplistic forecasting model—a model as basic as one that predicts the temperature by looking at recent-past temperatures—would have forecasted flu better than GFT.
In short, you wouldn’t have needed big data at all to do better than Google Flu Trends. Ouch.
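To make that comparison concrete, here is a minimal sketch of a lagged baseline of the kind described above: each week's forecast is simply the value observed a couple of weeks earlier. The numbers are made up for illustration (they are not CDC or GFT data), the function names are hypothetical, and this is not the exact model evaluated in the Science article.

```python
def lagged_forecast(series, lag=2):
    """Forecast each week as the value observed `lag` weeks earlier."""
    return [series[i - lag] for i in range(lag, len(series))]


def mean_absolute_error(actual, predicted):
    """Average absolute gap between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Made-up weekly flu-activity values, for illustration only.
observed = [1.0, 1.2, 1.5, 2.0, 2.6, 3.1, 3.0, 2.4, 1.8, 1.3]

lag = 2
actual = observed[lag:]                     # weeks we can score
baseline = lagged_forecast(observed, lag)   # "recent past" forecast

# Hypothetical stand-in for a model that runs 50% too high every week.
overestimate = [1.5 * value for value in actual]

baseline_mae = mean_absolute_error(actual, baseline)
overestimate_mae = mean_absolute_error(actual, overestimate)

print(f"lagged baseline MAE: {baseline_mae:.3f}")
print(f"overestimating model MAE: {overestimate_mae:.3f}")
```

On this toy series the lagged baseline has the smaller error, which is the shape of the argument: a forecast that needs nothing but last month's official numbers can beat a far more elaborate model that is systematically biased.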
In fact, GFT’s poor track record is hardly a secret to big data and GFT followers like me, and it points to a little bit of a big problem in the big data business that many of us have been discussing: Data validity is being consistently overstated.
As the Harvard researchers warn: “The core challenge is that most big data that have received popular attention are not the output of instruments designed to produce valid and reliable data amenable for scientific analysis.”
The amount of data still tends to dominate discussion of big data’s value. But more data in itself does not lead to better analysis, as amply demonstrated with Flu Trends. Large datasets don’t guarantee valid datasets. Assuming that they do is a mistake, yet that assumption is used all the time to justify the use of and results from big data projects.
As Washington remains deadlocked on the implementation of the Affordable Care Act, the US government’s shutdown has resulted in the furlough of nearly 70% of the Centers for Disease Control’s (CDC’s) workforce. CDC Director Tom Frieden recently shared his thoughts in a tweet. We agree wholeheartedly. Although it’s all too easy to take the CDC staff for granted, they are the frontline sentinels (and the gold standard) for monitoring disease outbreaks. Their ramp-down could have serious public health consequences.
We are particularly concerned about the apparent temporary discontinuation of the CDC’s flu surveillance program, which normally provides weekly reports on flu activity. Although flu season typically begins in late fall, outbreaks have occurred earlier in previous years. In 2009, flu cases started accumulating in late summer/early fall. And given the potential for unique variants, such as the swine or avian flu, every season is unpredictable, which makes regular CDC flu reports essential. We therefore hope to see the CDC restored to full capacity as soon as possible.
In the meantime, we would like to help by sharing data we have on communicable diseases, starting with the flu.
Because the athenahealth database is built on a single-instance, cloud-based architecture, we have the ability to report data in real time. As we have described in earlier posts, the physicians we serve are dispersed around the country with good statistical representation across practice types and sizes.
To get a read on influenza vaccination rates so far this season, we looked at more than two million patients who visited a primary care provider between August 1 and September 28, 2013 (Figure 1). We did not include data on vaccinations provided at retail clinics, schools or workplaces.
This year’s rates are trending in parallel with rates over the last four years, and slightly below those of the 2012-2013 season. However, immunizations accelerate when the CDC, and consequently the media, announce disease outbreaks and mount public awareness campaigns.