One of the computer applications that have received the most attention in healthcare is Watson, the IBM system that achieved fame by beating humans at the television game show Jeopardy!. Sometimes it seems there is such hype around Watson that people do not realize what the system actually does. Watson is a type of computer application known as a “question-answering system.” It works similarly to a search engine, but instead of retrieving “documents” (e.g., articles, Web pages, or images), it outputs “answers” (or at least short snippets of text that are likely to contain answers to questions posed to it).
As one who has done research in information retrieval (IR, also sometimes called “search”) for over two decades, I am interested in how Watson works and how well it performs on the tasks for which it is used. As someone also interested in IR applied to health and biomedicine, I am even more curious about its healthcare applications. Since winning at Jeopardy!, Watson has “graduated medical school” and “started its medical career.” The latter reference touts Watson as an alternative to the “meaningful use” program providing incentives for electronic health record (EHR) adoption, but I see Watson as a very different application, and one potentially benefitting from the growing quantity of clinical data, especially the standards-based data we will hopefully see in Stage 2 of the program. (I am also skeptical of some of the proposed uses of Watson, such as its “crunching” through EHR data to “learn” medicine. Those advocating that Watson perform this task need to understand the limits of observational studies in medicine.)
One concern I have had about Watson is that the publicity around it has been mostly news articles and press releases. As an evidence-based informatician, I would like to see more scientific analysis, i.e., what does Watson do to improve healthcare and how successful is it at doing so? I was therefore pleased to come across a journal article evaluating Watson [1]. In this first evaluation in the medical domain, Watson was trained using several resources from internal medicine, such as ACP Medicine, PIER, Merck Manual, and MKSAP. Watson was applied, and further trained with 5000 questions, in Doctor’s Dilemma, a competition somewhat like Jeopardy! that is run by the American College of Physicians and in which medical trainees participate each year. A sample question from the paper is “Familial adenomatous polyposis is caused by mutations of this gene,” with the answer being “APC Gene.” (Googling the text of the question gives the correct answer at the top of its ranking for this and the two other sample questions provided in the paper.)
Watson was evaluated on an additional 188 unseen questions [1]. The primary outcome measure was recall at 10, i.e., the proportion of questions for which a correct answer appeared among the top ten answers returned, and performance varied from 0.49 for the baseline system to 0.77 for the fully adapted and trained system. In other words, for 77% of these 188 questions, the fully trained Watson listed a correct answer somewhere in its top ten.
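To make the metric concrete, here is a minimal sketch in Python of how recall at 10 could be computed, taking it to mean the fraction of questions whose correct answer appears among the top ten candidates. The function name `recall_at_k`, the case-insensitive exact-match comparison, and the toy data are illustrative assumptions; the paper's actual answer-matching rules are not described here.

```python
from typing import Sequence


def recall_at_k(
    ranked_answers: Sequence[Sequence[str]],
    gold_answers: Sequence[str],
    k: int = 10,
) -> float:
    """Fraction of questions whose gold answer is among the top k candidates.

    Assumption: a candidate counts as correct only if it matches the gold
    answer exactly after lowercasing and trimming whitespace.
    """
    if len(ranked_answers) != len(gold_answers):
        raise ValueError("Need one ranked list per gold answer")
    if not gold_answers:
        return 0.0

    hits = 0
    for candidates, gold in zip(ranked_answers, gold_answers):
        top_k = {c.strip().lower() for c in candidates[:k]}
        if gold.strip().lower() in top_k:
            hits += 1
    return hits / len(gold_answers)


# Toy data (hypothetical, not from the study): three questions,
# each with a short ranked list of candidate answers.
ranked = [
    ["APC gene", "candidate B", "candidate C"],   # gold answer ranked first
    ["candidate D", "gold two", "candidate E"],   # gold answer ranked second
    ["candidate F", "candidate G"],               # gold answer missing
]
gold = ["APC Gene", "gold two", "gold three"]

print(f"recall@10 = {recall_at_k(ranked, gold, k=10):.2f}")
```

Under this reading, a score of 0.77 on 188 questions would correspond to roughly 145 questions with a correct answer somewhere in the top ten. It says nothing about how often the correct answer was ranked first.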
We can debate whether or not this is good performance for a computer system, or for a computer system being touted to provide knowledge to expert users. But a more disappointing aspect of the study is its limitations, which I would have raised had I been asked to peer-review the paper.
The first question I had was, how does Watson’s performance compare with other systems, including IR systems such as Google or PubMed? As noted above, for the three example questions provided in the paper, Google gave the answer in the snippet of text from the top-ranking Web page each time. It would be interesting to know how other online systems would compare with Watson’s performance on the questions used in this study.
Another problem with the paper is that none of the 188 questions were provided, not even as an appendix. In all of the evaluation studies I have performed (e.g., [2-4]), I have always provided some or all of the questions used in the study so the reader could better assess the results.
A final concern was that Watson was not evaluated in the context of a real user. While systems usually need to be evaluated from the “system perspective” before being assessed with users, it would have been informative to see whether Watson provided novel information or altered decision-making in real-world clinical scenarios.
Nonetheless, I am encouraged that a study like this was done, and I hope that more comprehensive studies will be undertaken in the near future. I do maintain enthusiasm for systems like Watson and am confident they will find a role in medicine. But we need to be careful about hype, and we must employ robust evaluation methods to test our claims as well as to determine how these systems are best used.
1. Ferrucci, D, Levas, A, et al. (2012). Watson: Beyond Jeopardy! Artificial Intelligence. Epub ahead of print.
2. Hersh, WR, Pentecost, J, et al. (1996). A task-oriented approach to information retrieval evaluation. Journal of the American Society for Information Science. 47: 50-56.
3. Hersh, W, Turpin, A, et al. (2001). Challenging conventional assumptions of automated information retrieval with real users: Boolean searching and batch retrieval evaluations. Information Processing and Management. 37: 383-402.
4. Hersh, WR, Crabtree, MK, et al. (2002). Factors associated with success for searching MEDLINE and applying evidence to answer clinical questions. Journal of the American Medical Informatics Association. 9: 283-293.
William Hersh, MD is Professor and Chair of the Department of Medical Informatics & Clinical Epidemiology at Oregon Health & Science University in Portland, OR. He is a well-known leader and innovator in biomedical and health informatics. In the last couple of years, he has played a leadership role in the ONC Workforce Development Program. He was also the originator of the 10×10 (“ten by ten”) course in partnership with AMIA. Dr. Hersh maintains the Informatics Professor blog, where this piece originally appeared.