One of the computer applications that has received the most attention in healthcare is Watson, the IBM system that achieved fame by beating humans at the television game show Jeopardy!. Sometimes it seems there is so much hype around Watson that people do not realize what the system actually does. Watson is a type of computer application known as a “question-answering system.” It works similarly to a search engine, but instead of retrieving “documents” (e.g., articles, Web pages, or images), it outputs “answers,” or at least short snippets of text that are likely to contain answers to the questions posed to it.
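For those curious about the mechanics, the contrast can be sketched in a few lines of code. To be clear, this is a toy illustration of the general question-answering architecture, not IBM's implementation; the passage store, the word-overlap scoring, and the snippet extraction are all simplifying assumptions of mine.

```python
from dataclasses import dataclass

@dataclass
class Passage:
    doc_id: str
    text: str

def search(passages, query, k=10):
    """A search engine returns whole documents/passages ranked by relevance
    (here, crude word overlap with the query)."""
    terms = set(query.lower().split())
    scored = [(len(terms & set(p.text.lower().split())), p) for p in passages]
    scored.sort(key=lambda pair: -pair[0])
    return [p for score, p in scored[:k] if score > 0]

def answer(passages, question, k=10):
    """A question-answering system instead returns short candidate answers.
    Real systems extract and score candidate entities; this sketch just
    clips a snippet from each top-ranked passage."""
    return [" ".join(p.text.split()[:8]) for p in search(passages, question, k)]

# Hypothetical usage: the same corpus, two different kinds of output.
corpus = [Passage("merck-123",
                  "Familial adenomatous polyposis is caused by mutations "
                  "of the APC gene on chromosome 5.")]
print(search(corpus, "familial adenomatous polyposis gene"))  # ranked passages
print(answer(corpus, "familial adenomatous polyposis gene"))  # short snippets
```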
As one who has done research in information retrieval (IR, also sometimes called “search”) for over two decades, I am interested in how Watson works and how well it performs on the tasks for which it is used. As someone also interested in IR applied to health and biomedicine, I am even more curious about its healthcare applications. Since winning at Jeopardy!, Watson has “graduated medical school” and “started its medical career.” The latter reference touts Watson as an alternative to the “meaningful use” program providing incentives for electronic health record (EHR) adoption, but I see Watson as a very different application, and one potentially benefiting from the growing quantity of clinical data, especially the standards-based data we will hopefully see in Stage 2 of the program. (I am also skeptical of some of the proposed uses of Watson, such as its “crunching” through EHR data to “learn” medicine. Those advocating that Watson perform this task need to understand the limits of observational studies in medicine.)
One concern I have had about Watson is that the publicity around it has consisted mostly of news articles and press releases. As an evidence-based informatician, I would like to see more scientific analysis, i.e., what does Watson do to improve healthcare and how successful is it at doing so? I was therefore pleased to come across a journal article evaluating Watson [1]. In this first evaluation in the medical domain, Watson was trained using several resources from internal medicine, such as ACP Medicine, PIER, the Merck Manual, and MKSAP. Watson was applied, and further trained with 5000 questions, in Doctor’s Dilemma, a competition somewhat like Jeopardy! that is run by the American College of Physicians and in which medical trainees participate each year. A sample question from the paper is, “Familial adenomatous polyposis is caused by mutations of this gene,” with the answer being “APC gene.” (Googling the text of this question, as well as the two other sample questions provided in the paper, returns the correct answer at the top of the ranking.)
Watson was evaluated on an additional 188 unseen questions [1]. The primary outcome measure was recall at ten results shown, i.e., the proportion of questions for which the correct answer appeared among the top ten answers. Performance varied from 0.49 for the baseline system to 0.77 for the fully adapted and trained system. In other words, for 77% of these 188 questions, the correct answer appeared somewhere in Watson’s top ten answers.
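For concreteness, here is how that metric is computed. This is my own minimal sketch of recall-at-k with made-up data structures, not code from the paper:

```python
def recall_at_k(ranked_answers, gold_answers, k=10):
    """Fraction of questions whose correct answer appears in the top k results.

    ranked_answers: dict mapping each question to its ranked list of answers
    gold_answers:   dict mapping each question to its single correct answer
    """
    hits = sum(1 for q, ranked in ranked_answers.items()
               if gold_answers[q] in ranked[:k])
    return hits / len(ranked_answers)

# A recall-at-10 of 0.77 over 188 questions means the correct answer was in
# the top ten for roughly 145 (0.77 * 188) of them.
```

Note that this is a per-question measure: a question counts as a “hit” even if nine of the ten answers shown for it are wrong, which is why recall at ten results is not the same as saying that 77% of all the answers Watson displayed were correct.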