Several email lists I am on were abuzz last week about the publication of a paper that was described in a press release from Indiana University to demonstrate that “machine learning — the same computer science discipline that helped create voice recognition systems, self-driving cars and credit card fraud detection systems — can drastically improve both the cost and quality of health care in the United States.” The press release referred to a study published by an Indiana faculty member in the journal Artificial Intelligence in Medicine.
While I am a proponent of computer applications that aim to improve the quality and cost of healthcare, I also believe we must be careful about the claims being made for them, especially those derived from results from scientific research.
After reading and analyzing the paper, I am skeptical of the claims made not only by the press release but also by the authors themselves. My concern is less with their research methods, although I have some serious qualms about them that I describe below, than with the press release issued by their university's public relations office. Furthermore, as always seems to happen when technology is hyped, the press release was picked up and echoed across the Internet, followed by the inevitable conflation of its findings. Sure enough, one high-profile blogger wrote, “physicians who used an AI framework to make patient care decisions had patient outcomes that were 50 percent better than physicians who did not use AI.” It is clear from the paper that physicians did not actually use such a framework, which was only applied retrospectively to clinical data.
What exactly did the study show? Basically, the researchers obtained a small data set for one clinical condition in one institution’s electronic health record and applied some complex data mining techniques to show that lower cost and better outcomes could be achieved by following the options suggested by the machine learning algorithm instead of what the clinicians actually did. The claim, therefore, is that if the data mining were followed by the clinicians instead of their own decision-making, then better and cheaper care would ensue.
As is done in many scientific papers about technology, the paper goes into exquisite detail about the data mining algorithms and the experiments comparing them. But the paper unfortunately provides very little description of the clinical data itself. There is a reference to another paper from a conference that appears to describe the data set, but it is still not clear how the data was applied to evaluate the algorithms.
I have a number of methodological problems with the paper. First is the paucity of clinical details about the data. The authors refer to a metric called the “outcomes rating scale” of the “client-directed outcome informed (CDOI) assessment.” No details are provided as to exactly what this scale measures or how differences in measurement correlate with improved clinical outcome. Furthermore, the variables of the details of care for the patient that the data mining algorithm supposedly outperforms are not described either. Therefore anyone hoping to understand the clinical value that this approach is claimed to have improved is not able to do so.
A second problem is that there is no discussion about the cost data or what cost perspective (e.g., system, clinician, societal, etc.) is taken. This is a common problem that plagues many studies in healthcare that attempt to measure costs. Given the relatively modest amounts of money spent on care that are reported in their results, amounting only to a few hundred dollars per patient, it is unlikely that the data includes the full costs of treatment for each patient, or covers an appropriate time period. If my interpretation of the low value of the cost data is correct (which is difficult to discern from reading the paper, again due to lack of details), the data do not include the cost of clinician time, facilities, or longer-term costs beyond the time frame of the data set. If that is indeed the case, then it would be particularly problematic for a machine learning system, since such systems make inferences limited only to the data that is provided to the model. Therefore if poor data is provided to the model, its “conclusions” are suspect. (This raises a side issue as to whether there is truly “artificial intelligence” here, since the only intelligence applied by the system is the models developed by their human creators.)
A third concern is that this is a modeling study. As every evaluation methodologist knows, modeling studies are limited in their ability to assign cause and effect. There is certainly a role in informatics science for modeling studies, although we saw recently that such studies have their limits, especially when revisited over the long run. In this study, there may have been reasons for the clinicians following the more expensive path or confounding reasons why such patients had worse outcomes, but they cannot be captured by the approach used in this study.
This is related to the final and most serious problem of the work, which is that the modeling evaluation is a very weak form of evidence to demonstrate the value of an intervention. If the authors truly wanted to show the benefits of the system and approach they developed, they should have performed a randomized controlled trial that compared their intervention with an appropriate control group. This would have led to the type of study that the blogger mentioned above erroneously described this one to be. Such a study design would assess some of the more vexing problems we face in informatics, such as whether the advice coming from a computer will change clinician behavior, or whether, when such systems are introduced into the “real world,” the “advice” provided will prospectively lead to better outcomes.
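The confounding concern and the case for randomization can be illustrated with a toy simulation. The sketch below is purely hypothetical: the function names, the `severity` variable, and all numbers are my own assumptions and have nothing to do with the data in the paper. It builds a world in which the expensive care path has no effect at all on outcomes, but sicker patients are more likely to receive it.

```python
import random


def simulate(n=20000, seed=0, randomized=False):
    """Generate hypothetical (expensive_path, outcome) pairs.

    Assumption for illustration only: the outcome depends solely on an
    unrecorded severity score, so the true effect of the care path is zero.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        severity = rng.random()  # hidden confounder, uniform on [0, 1)
        if randomized:
            # Trial setting: a coin flip decides the care path.
            expensive = rng.random() < 0.5
        else:
            # Retrospective setting: sicker patients tend to get the
            # expensive path.
            expensive = rng.random() < severity
        # Healthier patients do better regardless of the path chosen.
        outcome = 1.0 - severity + rng.gauss(0, 0.1)
        rows.append((expensive, outcome))
    return rows


def outcome_gap(rows):
    """Mean outcome on the cheap path minus mean outcome on the expensive path."""
    cheap = [o for e, o in rows if not e]
    costly = [o for e, o in rows if e]
    return sum(cheap) / len(cheap) - sum(costly) / len(costly)


if __name__ == "__main__":
    print("retrospective gap:", outcome_gap(simulate(randomized=False)))
    print("randomized gap:   ", outcome_gap(simulate(randomized=True)))
```

In the retrospective setting the cheap path appears to produce clearly better outcomes, even though the path has no effect by construction; with random assignment the gap is close to zero. This says nothing about what actually happened in the study, only that retrospective data by itself cannot rule out this kind of explanation.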
I do believe that the kind of work addressed by this paper is important, especially as we move into the area of personalized medicine. As eloquently described by Stead and colleagues, healthcare will soon be reaching the point where the number of data points required for clinical decisions will exceed the bounds of human cognition. (It probably already has.) Therefore clinicians will require aids to their cognition provided by information systems, perhaps one like that described in the study.
But such aids require, like everything else in medicine, robust evaluative research to demonstrate their value. The methods used in this paper may indeed be the methods to provide this value, but the implementation and evaluation described miss the mark. That miss is further exacerbated by the hype and conflation that ensued after the paper was published.
What can we learn from this paper and its ensuing hype? First, bold claims require bold evidence to back them up. In the case of showing value for an approach in healthcare – be it test, treatment, or informatics application – we must use evaluation methods that provide the best evidence for the claim. That is not always a randomized controlled trial, but in this situation, it would be, and the modeling techniques used are really just preliminary data that (might) justify an actual clinical trial. Second, when we perform technology evaluation, we need to describe, and ideally release, all of the clinical data so that others can analyze and even replicate the results. Finally, while we all want to disseminate the results of our research to the widest possible audience, we need to be realistic in explaining what we accomplished and what its larger implications are.
 Bennett, C. and K. Hauser (2013). Artificial intelligence framework for simulating clinical decision-making: a Markov decision process approach. Artificial Intelligence in Medicine. Epub ahead of print.
 Bennett, C., T. Doub, A. Bragg, J. Luellen, C. VanRegenmorter, J. Lockman and R. Reiserer (2011). Data mining session-based patient reported outcomes (PROs) in a mental health setting: toward data-driven clinical decision support and personalized treatment. 2011 First IEEE International Conference on Healthcare Informatics, Imaging and Systems Biology (HISB 2011), San Jose, CA. 229-236.
 Drummond, M. and M. Sculpher (2005). Common methodological flaws in economic evaluations. Medical Care. 43(7 Suppl): 5-14.
 Stead, W., J. Searle, H. Fessler, J. Smith and E. Shortliffe (2011). Biomedical informatics: changing what physicians need to know and how they learn. Academic Medicine. 86: 429-434.
William Hersh, MD is Professor and Chair of the Department of Medical Informatics & Clinical Epidemiology at Oregon Health & Science University in Portland, OR. He is a well-known leader and innovator in biomedical and health informatics. In the last couple of years, he has played a leadership role in the ONC Workforce Development Program. He was also the originator of the 10×10 (“ten by ten”) course in partnership with AMIA. Dr Hersh maintains the Informatics Professor blog.