
Chapter Nine: In Which Dr. Watson Discovers Med School Is Slightly Tougher Than He Had Been Led to Believe

One of the computer applications that has received the most attention in healthcare is Watson, the IBM system that achieved fame by beating humans at the television game show, Jeopardy!. Sometimes it seems there is such hype around Watson that people do not realize what the system actually does. Watson is a type of computer application known as a “question-answering system.” It works similarly to a search engine, but instead of retrieving “documents” (e.g., articles, Web pages, images, etc.), it outputs “answers” (or at least short snippets of text that are likely to contain answers to questions posed to it).

As one who has done research in information retrieval (IR, also sometimes called “search”) for over two decades, I am interested in how Watson works and how well it performs on the tasks for which it is used. As someone also interested in IR applied to health and biomedicine, I am even more curious about its healthcare applications. Since winning at Jeopardy!, Watson has “graduated medical school” and “started its medical career”. The latter reference touts Watson as an alternative to the “meaningful use” program providing incentives for electronic health record (EHR) adoption, but I see Watson as a very different application, and one potentially benefiting from the growing quantity of clinical data, especially the standards-based data we will hopefully see in Stage 2 of the program. (I am also skeptical of some of these proposed uses of Watson, such as its “crunching” through EHR data to “learn” medicine. Those advocating that Watson perform this task need to understand the limits of observational studies in medicine.)

One concern I have had about Watson is that the publicity around it has consisted mostly of news articles and press releases. As an evidence-based informatician, I would like to see more scientific analysis, i.e., what does Watson do to improve healthcare and how successful is it at doing so? I was therefore pleased to come across a journal article evaluating Watson [1]. In this first evaluation in the medical domain, Watson was trained using several resources from internal medicine, such as ACP Medicine, PIER, the Merck Manual, and MKSAP. Watson was applied, and further trained with 5,000 questions, in Doctor’s Dilemma, a competition somewhat like Jeopardy! that is run by the American College of Physicians and in which medical trainees participate each year. A sample question from the paper is, Familial adenomatous polyposis is caused by mutations of this gene, with the answer being, APC gene. (Googling the text of the question gives the correct answer at the top of the ranking for this and the two other sample questions provided in the paper.)

Watson was evaluated on an additional 188 unseen questions [1]. The primary outcome measure was recall at 10 results shown, i.e., the proportion of questions for which the correct answer appeared somewhere in the top ten candidates. Performance varied from 0.49 for the baseline system to 0.77 for the fully adapted and trained system. In other words, for 77% of these 188 questions, the correct answer appeared among the top ten answers Watson provided.
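To make the metric concrete, here is a minimal sketch of recall at 10 as described above: a question counts as a “hit” if the correct answer appears anywhere in the system’s top ten candidates. The questions and candidate lists below are hypothetical, invented for illustration only.

```python
# Recall@k: fraction of questions whose correct answer appears
# anywhere in the top k ranked candidates returned by the system.

def recall_at_k(results, k=10):
    """results: list of (ranked_candidates, correct_answer) pairs."""
    hits = sum(1 for candidates, answer in results
               if answer in candidates[:k])
    return hits / len(results)

# Hypothetical example: 3 of 4 questions answered within the top ten.
sample = [
    (["APC gene", "TP53", "BRCA1"], "APC gene"),       # hit at rank 1
    (["metformin", "insulin"], "insulin"),             # hit at rank 2
    (["aspirin"] + ["other"] * 11, "warfarin"),        # miss entirely
    (["other"] * 9 + ["sarcoidosis"], "sarcoidosis"),  # hit at rank 10
]
print(recall_at_k(sample))  # 0.75
```

Note that recall at 10 says nothing about where in the top ten the answer appears; a system that always ranked the correct answer tenth would score the same as one that always ranked it first.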

We can debate whether this is good performance for a computer system, especially one being touted as a source of knowledge for expert users. But a more disappointing aspect of the study is a set of limitations I would have raised had I been asked to peer-review the paper.

The first question I had was, how does Watson’s performance compare with other systems, including IR systems such as Google or PubMed? As noted above, for the three example questions provided in the paper, Google gave the answer in the snippet of text from the top-ranking Web page each time. It would be interesting to know how other online systems would compare with Watson’s performance on the questions used in this study.

Another problem with the paper is that none of the 188 questions were provided, not even as an appendix. In all of the evaluation studies I have performed (e.g., [2-4]), I have always provided some or all of the questions used in the study so the reader could better assess the results.

A final concern was that Watson was not evaluated in the context of a real user. While systems usually need to be evaluated from the “system perspective” before being assessed with users, it would have been informative to see whether Watson provided novel information or altered decision-making in real-world clinical scenarios.

Nonetheless, I am encouraged that a study like this was done, and I hope that more comprehensive studies will be undertaken in the near future. I do maintain enthusiasm for systems like Watson and am confident they will find a role in medicine. But we need to be careful about hype and we must employ robust evaluation methods to test our claims as well as determine how they are best used.

References

1. Ferrucci, D, Levas, A, et al. (2012). Watson: Beyond Jeopardy! Artificial Intelligence. Epub ahead of print.
2. Hersh, WR, Pentecost, J, et al. (1996). A task-oriented approach to information retrieval evaluation. Journal of the American Society for Information Science. 47: 50-56.
3. Hersh, W, Turpin, A, et al. (2001). Challenging conventional assumptions of automated information retrieval with real users:  Boolean searching and batch retrieval evaluations. Information Processing and Management. 37: 383-402.
4. Hersh, WR, Crabtree, MK, et al. (2002). Factors associated with success for searching MEDLINE and applying evidence to answer clinical questions. Journal of the American Medical Informatics Association. 9: 283-293.

William Hersh, MD is Professor and Chair of the Department of Medical Informatics & Clinical Epidemiology at Oregon Health & Science University in Portland, OR. He is a well-known leader and innovator in biomedical and health informatics. In the last couple of years, he has played a leadership role in the ONC Workforce Development Program. He was also the originator of the 10×10 (“ten by ten”) course in partnership with AMIA. Dr. Hersh maintains the Informatics Professor blog, where this piece originally appeared.


Comments
James

A difference between Watson and Google is that Watson assigns reliability weightings to the information it uses. Google can give the same weight to junk science or quackery as it does to actual science and may end up with the worst possible scenario based upon the query made to it. Watson also is able to work with unstructured data, which is the state of about 80% of medical data. This gives it access to the treasure trove of narrative information that comprises much of medicine. The idea of a head-to-head contest is kind of meaningless, as what Watson does…

William Hersh, MD

Actually, Google’s PageRank algorithm is pretty effective at pushing high-quality output to the top of the list.

But you are missing my larger point, which is not that Watson cannot do any of the things you say it can do, but rather that there has not yet been any objective, scientific assessment of these capabilities. I will be the strongest believer when it is shown conclusively, but until then, it all seems like mostly hype to me.

James

Google is a decent tool for medical-related queries, but it can also produce garbage-in, garbage-out results. This has been a topic of public discussion and media attention: http://www.usatoday.com/story/news/nation/2013/01/14/googling-medical-advice/1833947/ http://www.kevinmd.com/blog/2013/02/dr-google-tips-patients-diagnose-online.html Google can also be subject to manipulation of its search findings, though they are continually trying to make that harder to do: http://en.wikipedia.org/wiki/Google_bomb I use Google a lot in my job, but recognize its limitations. I can concur there is a lot of hype on Watson. Whether it can live up to it is still unknown. However, since those using it did pay for the privilege, one should be able to gauge soon…

MD as HELL

Really the patient should be entering the data into the computer. They have all kinds of time, since each ED encounter today is 3-10 times as long as 10 years ago. Watson could put it in the right box or template or whatever. EHR turns docs into federal workflow minions. Instead of seeing patients at normal speed we get to work at government speed. “Good enough for government work” should scare any reasonable person who ever expects to be a patient. EHR is in fact dangerous. It diverts physician time from timely care of more patients to the tedium of…

William Hersh, MD

Here we go again with misunderstanding of Watson. We can argue the merits of “meaningful use” vs. other approaches to electronic documentation of patients. But that is a topic for a different thread. The point here is that you, like many others, are confusing what Watson can and cannot do. Watson is a question-answering system, not a system that puts data into the right box or template. This is the problem that motivated this blog post, which is that people are way overstating what Watson is designed to do.

Margalit Gur-Arie

Thank you for the kind reference, Dr. Hersh. The point I was trying to make is that the Watson software, in addition to being an IR system, also incorporates extensive natural language processing tools in its adaptive search algorithms, and as such, it will eventually obviate the need for severely restricting the ways physicians (and others) interact with medical computing devices. In my opinion, this is a good thing, and I suspect that in a few short years, we will be looking back with much disbelief at the shortsighted and very expensive obsession with “structured” or “computable” data requirements imposed…

William Hersh, MD

I am a fan of natural language processing too, but I do not see it completely replacing structured data for a long, long time. There is value to structured data, and as long as NLP is imperfect, we will still need that value.

legacyflyer

I would love to see Watson go “head to head” with a real doctor and see how he does. (And I know this would be a hard experiment to do in real life.) Here is the scenario: 100 patients come through the ER – 50 assigned to Watson, 50 to an ER doc. Watson and his “scribe” get to ask any questions they want, order any labs they want, and use the physical exam of the ER doc. The scribe is not allowed to help Watson, other than to translate Watson’s questions and input answers from the patient. The ER doc does what ER…

William Hersh, MD

I don’t think that a comparison of Watson versus the ER doc is the appropriate study. To me, a better study would be the ER doc with versus without Watson. I don’t think anyone is advocating that computers replace physicians, but instead we need to determine how to best augment the work of the healthcare professional.

legacyflyer

Then maybe we shouldn’t be talking about Watson going to medical school, but should be talking about Watson going to “Search Engine School”!

I will tell you that I find using an online search engine useful when I want to look up more information about a particular disease. And I could see how a good medical search engine could help form a differential diagnosis based upon a complex of signs and symptoms.

Whatsen Williams

@bubba PROVE IT. That argument is vapid. EHR, CPOE, and CDS are medical devices. They have zero approval by the FDA, and there is not any evidence to support safety and efficacy. No one seems to want to break the HIT evangelism movement.

Bubba For President

Maybe so Whatsen, maybe somebody should do that study. BUT ask yourself – How many people have died because their paper charts were not up to date? Because their doctor’s handwriting was illegible? Because GASP somebody lost a piece of paper? Because one piece of paper – the important one with the biopsy result – didn’t catch up with the other pieces of paper? Because a doofus spilled their Mr. Pibb on the patient’s family history?

Whatsen Williams

If you are that interested in evidence-based medicine, what is the evidence that EHR devices, with their CPOE and CDS attachments, improve outcomes or reduce costs?

What is the incidence of death in the week after EHR crashes that cause care to be delayed for hours, or of adverse events and near misses caused by EHR devices?

William Hersh, MD

Whatsen Williams, there is plenty of evidence to support the use of IT in healthcare [e.g., the systematic review of studies prior to 2011, Buntin, MB, Burke, MF, et al. (2011). The benefits of health information technology: a review of the recent literature shows predominantly positive results. Health Affairs. 30: 464-471, as well as newer studies, such as Kern, LM, Barrón, Y, et al. (2012). Electronic health records and ambulatory quality of care. Journal of General Internal Medicine: Epub ahead of print.] But you are correct that there are well-documented instances of harm as well, and we still have a…