Chapter Nine: In Which Dr. Watson Discovers Med School Is Slightly Tougher Than He Had Been Led to Believe

One of the computer applications that has received the most attention in healthcare is Watson, the IBM system that achieved fame by beating humans at the television game show, Jeopardy!. Sometimes it seems there is such hype around Watson that people do not realize what the system actually does. Watson is a type of computer application known as a “question-answering system.” It works similarly to a search engine, but instead of retrieving “documents” (e.g., articles, Web pages, images, etc.), it outputs “answers” (or at least short snippets of text that are likely to contain answers to questions posed to it).

As one who has done research in information retrieval (IR, also sometimes called “search”) for over two decades, I am interested in how Watson works and how well it performs on the tasks for which it is used. As someone also interested in IR applied to health and biomedicine, I am even more curious about its healthcare applications. Since winning at Jeopardy!, Watson has “graduated medical school” and “started its medical career.” The latter reference touts Watson as an alternative to the “meaningful use” program providing incentives for electronic health record (EHR) adoption, but I see Watson as a very different application, and one potentially benefiting from the growing quantity of clinical data, especially the standards-based data we will hopefully see in Stage 2 of the program. (I am also skeptical of some of these proposed uses of Watson, such as its “crunching” through EHR data to “learn” medicine. Those advocating that Watson perform this task need to understand the limits of observational studies in medicine.)

One concern I have had about Watson is that the publicity around it has consisted mostly of news articles and press releases. As an evidence-based informatician, I would like to see more scientific analysis, i.e., what does Watson do to improve healthcare and how successful is it at doing so? I was therefore pleased to come across a journal article evaluating Watson [1]. In this first evaluation in the medical domain, Watson was trained using several resources from internal medicine, such as ACP Medicine, PIER, the Merck Manual, and MKSAP. Watson was applied, and further trained with 5000 questions, in Doctor’s Dilemma, a competition somewhat like Jeopardy! that is run by the American College of Physicians and in which medical trainees participate each year. A sample question from the paper is, Familial adenomatous polyposis is caused by mutations of this gene, with the answer being, APC Gene. (Googling the text of the question gives the correct answer at the top of its ranking for this and the two other sample questions provided in the paper.)

Watson was evaluated on an additional 188 unseen questions [1]. The primary outcome measure was recall (the proportion of questions answered correctly) within the top 10 results shown, and performance varied from 0.49 for the baseline system to 0.77 for the fully adapted and trained system. In other words, for 77% of these 188 questions, the correct answer appeared somewhere in Watson’s top ten candidate answers.
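To make the outcome measure concrete, recall at 10 is simply the fraction of questions whose correct answer appears among a system’s top ten ranked candidates. A minimal sketch (hypothetical code and toy data, not the actual questions or code from the study):

```python
def recall_at_k(results, k=10):
    """Fraction of questions whose correct answer appears in the top-k
    ranked candidate answers.

    `results` maps each question to (ranked_candidates, correct_answer).
    """
    hits = sum(
        1 for ranked, correct in results.values() if correct in ranked[:k]
    )
    return hits / len(results)

# Toy data, purely illustrative -- not the 188 questions from the paper.
toy = {
    "FAP gene question": (["APC gene", "TP53", "BRCA1"], "APC gene"),  # hit
    "another question": (["warfarin", "heparin"], "aspirin"),          # miss
}
print(recall_at_k(toy))  # 0.5: correct answer in the top ten for 1 of 2
```

On the study’s scale, a score of 0.77 means the fully trained system placed the right answer in its top ten for roughly 145 of the 188 questions.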

We can debate whether this is good performance for a computer system, especially one touted as providing knowledge to expert users. But a more disappointing aspect of the study is its limitations, which I would have raised had I been asked to peer-review the paper.

The first question I had was, how does Watson’s performance compare with other systems, including IR systems such as Google or PubMed? As noted above, for the three example questions provided in the paper, Google gave the answer in the snippet of text from the top-ranking Web page each time. It would be interesting to know how other online systems would compare with Watson’s performance on the questions used in this study.

Another problem with the paper is that none of the 188 questions were provided, not even as an appendix. In all of the evaluation studies I have performed (e.g., [2-4]), I have always provided some or all of the questions used in the study so the reader could better assess the results.

A final concern was that Watson was not evaluated in the context of a real user. While systems usually need to be evaluated from the “system perspective” before being assessed with users, it would have been informative to see whether Watson provided novel information or altered decision-making in real-world clinical scenarios.

Nonetheless, I am encouraged that a study like this was done, and I hope that more comprehensive studies will be undertaken in the near future. I do maintain enthusiasm for systems like Watson and am confident they will find a role in medicine. But we need to be careful about hype and we must employ robust evaluation methods to test our claims as well as determine how they are best used.


1. Ferrucci, D, Levas, A, et al. (2012). Watson: Beyond Jeopardy! Artificial Intelligence. Epub ahead of print.
2. Hersh, WR, Pentecost, J, et al. (1996). A task-oriented approach to information retrieval evaluation. Journal of the American Society for Information Science. 47: 50-56.
3. Hersh, W, Turpin, A, et al. (2001). Challenging conventional assumptions of automated information retrieval with real users: Boolean searching and batch retrieval evaluations. Information Processing and Management. 37: 383-402.
4. Hersh, WR, Crabtree, MK, et al. (2002). Factors associated with success for searching MEDLINE and applying evidence to answer clinical questions. Journal of the American Medical Informatics Association. 9: 283-293.

William Hersh, MD is Professor and Chair of the Department of Medical Informatics & Clinical Epidemiology at Oregon Health & Science University in Portland, OR. He is a well-known leader and innovator in biomedical and health informatics. In the last couple of years, he has played a leadership role in the ONC Workforce Development Program. He was also the originator of the 10×10 (“ten by ten”) course in partnership with AMIA. Dr. Hersh maintains the Informatics Professor blog, where this piece originally appeared.

14 replies

  1. Google is a decent tool for medical-related queries, but it can also produce GIGO (garbage in, garbage out) results. This has been a topic of public discussion and media attention.



    Google can also be subject to manipulation of its search results, though the company continually tries to make that harder to do.


    I use Google a lot in my job, but recognize its limitations.

    I concur that there is a lot of hype around Watson. Whether it can live up to it is still unknown. However, since those using it did pay for the privilege, we should be able to gauge soon enough whether they see sufficient value to justify continuing the investment. If M.D. Anderson and Mayo don’t get their money’s worth in clinical improvement, they can always pull the plug.

    As I understand it, Watson also is built around a gainshare contracting model, so if a deployment is a bust, then IBM takes a hit.

    While the clinical side gets most of the attention, Watson is also being deployed by one major insurer to streamline prior authorizations. That may be where its ability to crunch through massive amounts of data proves particularly useful. One would hope it can automate and simplify the prior authorization process (which still has many labor-intensive manual elements).

    One item I think worth noting is that Watson is simply the most well-known system of its kind; it isn’t the only game in town or the final word. Others will come along to challenge its lead, and what we see as this form of technology assistance matures is what interests me the most.

    Oh, and a significant limitation on Watson is that it “reads” only in English at this point (or did the last time I checked a few months ago). That opens the door to a lot of potential competitors.

  2. Actually, Google’s PageRank algorithm is pretty effective at pushing high-quality output to the top of the list.

    But you are missing my larger point, which is not that Watson cannot do the things you say it can do, but rather that there has not yet been any objective, scientific assessment of these capabilities. I will be its strongest believer when this is shown conclusively, but until then, it all seems like mostly hype to me.

  3. A difference between Watson and Google is that Watson assigns reliability weightings to the information it uses. Google can give the same weight to junk science or quackery as it does to actual science, and may come up with the worst possible result based upon the query made to it.

    Watson also is able to work with unstructured data, which is the state of about 80% of medical data. This gives it access to the treasure trove of narrative information that comprises much of medicine.

    The idea of a head to head contest is kind of meaningless, as what Watson does is to augment a physician’s work. It offers possible answers with probability weights and ways to check them. It then refines the answers it offers in the future based upon the accuracy of the prior ones.

    What it can do well is vacuum up enormous amounts of information on a topic and plow through it to look for the best answers to the questions presented to it. If given a set of symptoms, it can churn through truly massive amounts of data, come up with some possibilities, weight them, and offer ways to confirm them (order this test or check this symptom).

    In the area of treatment plans, it can look across all of the available data it has, see what did or didn’t work when facing similar cases, and summarize options based upon probability of success.

    The clinician can see what should be a reliable summary of probable options regarding a diagnosis or treatment plan. The clinician can use it as an additional resource and then feed back to Watson which answer turned out to be correct.

    It looks to do on a very large scale what clinicians already do now on a smaller scale.

    One potential use of Watson and systems like it is as a massive collector of potentially useful data from reliable sources that can be used to improve clinical accuracy. The reliability of Watson and its peers could be improved with input of public health data, deidentified claims and encounters data from insurers, etc. There are clues to better health care decision-making in large numbers of disparate data sources. A way to aggregate them for use by clinicians and others is the potential benefit.

    I would hope insights into better diagnosis and treatment made possible by Watson are shared across health care.
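The reliability-weighted answering described in the comment above can be sketched loosely as scoring each candidate answer by its weighted supporting evidence and ranking the candidates. This is a hypothetical illustration only; the names, weights, and scoring function are invented and are not IBM’s actual model:

```python
def rank_candidates(candidates):
    """Order candidate answers by a combined evidence score: each piece of
    supporting evidence contributes (source reliability weight x support
    strength), and candidates are sorted highest combined score first."""
    def combined_score(evidence):
        return sum(weight * support for weight, support in evidence)
    scored = [(answer, combined_score(ev)) for answer, ev in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Invented example: two diagnoses, each with (reliability, support) pairs.
toy = {
    "diagnosis A": [(0.9, 0.8), (0.5, 0.6)],  # well supported, reliable sources
    "diagnosis B": [(0.2, 0.9)],              # strong claim from a weak source
}
print(rank_candidates(toy)[0][0])  # diagnosis A ranks first (1.02 vs 0.18)
```

The point of the weighting is exactly what the commenter describes: a strongly worded claim from an unreliable source should not outrank modest support from reliable ones.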

  4. Here we go again with misunderstanding of Watson. We can argue the merits of “meaningful use” vs. other approaches to electronic documentation of patients. But that is a topic for a different thread. The point here is that you, like many others, are conflating what Watson can and cannot do. Watson is a question-answering system, and not a system that puts data into the right box or template. This is the problem that motivated this blog post, which is that people are way overstating what Watson is designed to do.

  5. Really the patient should be entering the data into the computer. They have all kinds of time, since each ED encounter today is 3-10 times as long as 10 years ago.

    Watson could put it in the right box or template or whatever.

    EHR turns docs into federal work flow minions. Instead of seeing patients at normal speed we get to work at government speed.

    “Good enough for government work” should scare any reasonable person who ever expects to be a patient.

    EHR is in fact dangerous. It diverts physician time from timely care of more patients to the tedium of being a transcriptionist. Also the records are meaningless with all sorts of filler that is questionable at best.

  6. I am a fan of natural language processing too, but I do not see it completely replacing structured data for a long, long time. There is value to structured data, and as long as NLP is imperfect, we will still need that value.

  7. Then maybe we shouldn’t be talking about Watson going to medical school, but should be talking about Watson going to “Search Engine School”!

    I will tell you that I find using an online search engine useful when I want to look up more information about a particular disease. And I could see how a good medical search engine could help form a differential diagnosis based upon a complex of signs and symptoms.

  8. Thank you for the kind reference, Dr. Hersh. The point I was trying to make is that the Watson software, other than being an IR system, is also incorporating extensive natural language processing tools in its adaptive search algorithms, and as such, it will eventually obviate the need for severely restricting the ways physicians (and others) interact with medical computing devices.
    In my opinion, this is a good thing, and I suspect that in a few short years, we will be looking back with much disbelief at the shortsighted and very expensive obsession with “structured” or “computable” data requirements imposed by Meaningful Use and its myriad of checkboxes.
    I for one am not expecting Watson to solve mysterious cases, as much as I expect it to parse through mountains of content of all types, including medical records, and sort out the pieces that are pertinent and relevant at a particular moment in time, for a particular user, trying to accomplish a particular task. A digital cognitive assistant of sorts, and nothing more.

  9. I don’t think that a comparison of Watson versus the ER doc is the appropriate study. To me, a better study would be the ER doc with versus without Watson. I don’t think anyone is advocating that computers replace physicians, but instead we need to determine how to best augment the work of the healthcare professional.

  10. Whatsen Williams, there is plenty of evidence to support the use of IT in healthcare [e.g., the systematic review of studies prior to 2011, Buntin, MB, Burke, MF, et al. (2011). The benefits of health information technology: a review of the recent literature shows predominantly positive results. Health Affairs. 30: 464-471, as well as newer studies, such as Kern, LM, Barrón, Y, et al. (2012). Electronic health records and ambulatory quality of care. Journal of General Internal Medicine: Epub ahead of print.] But you are correct that there are well-documented instances of harm as well, and we still have a lot more to learn about the best practices for use of health IT.

    However, that is somewhat tangential to the discussion of the evaluation of Watson. We still need appropriate evaluations of systems like Watson as well.

  11. I would love to see Watson go “head to head” with a real doctor and see how he does. (And I know this would be a hard experiment to do in real life)

    Here is the scenario:

    100 patients through the ER – 50 assigned to Watson, 50 to an ER doc

    Watson and his “scribe” get to ask any questions they want, order any labs they want and use the physical exam of the ER doc. Scribe is not allowed to help Watson, other than to translate Watson’s questions and input answers from patient.

    ER doc does what ER doc normally does.

    The 50 patients are compared on a variety of criteria including:

    1) Accuracy of diagnosis
    2) Timeliness of diagnosis
    3) Cost of tests used to arrive at diagnosis
    4) Patient satisfaction

    I predict that a human ER doc will outperform Watson.

  12. @bubba PROVE IT. That argument is vapid. EHR, CPOE, and CDS are medical devices. They have zero approval by the FDA, and there is no evidence to support their safety and efficacy. No one seems to want to break the HIT evangelism movement.

  13. Maybe so Whatsen, maybe somebody should do that study. BUT ask yourself – How many people have died because their paper charts were not up to date? Because their doctor’s handwriting was illegible? Because GASP somebody lost a piece of paper? Because one piece of paper – the important one with the biopsy result – didn’t catch up with the other pieces of paper? Because a doofus spilled their Mr. Pibb on the patient’s family history?

  14. If you are that interested in evidence-based medicine, what is the evidence that EHR devices, with their CPOE and CDS attachments, improve outcomes or reduce costs?

    What is the incidence of death in the week after EHR crashes that cause care to be delayed for hours, or of adverse events and near misses caused by EHR devices?