“What is… Wegener’s Granulomatosis?”

A terrific article in The New York Times Magazine this summer described the decade-long effort on the part of IBM artificial intelligence researchers to build a computer that can beat humans in the game of “Jeopardy!” Since I’m not a computer scientist, their pursuit struck me at first as, well, trivial. But as I read the story, I came to understand that the advance may herald the birth of truly usable artificial intelligence for clinical decision-making.

And that is a big deal.

I’ve lamented, including in an article in this month’s Health Affairs, the curious omission of diagnostic errors from the patient safety radar screen. Part of the problem is that diagnostic errors are awfully hard to fix. The best we’ve been able to do is improve information flow to try to prevent handoff errors, and teach ourselves to perform meta-cognition: that is, we can think about our own thinking, so that we are aware of common pitfalls and catch them before we pull our diagnostic trigger.

These solutions are fine, but they go only so far. In the age of Google, you’d think we’d be on the cusp of developing a computer that is a better diagnostician than the average doctor. Unfortunately, computer scientists have thought we were close to this same breakthrough for the past 40 years, and both they and practicing clinicians have always come away disappointed. Before getting to the Jeopardy-playing computer, I’ll start by recounting the generally sad history of artificial intelligence (AI) in medicine, some of it drawn from our chapter on diagnostic errors in Internal Bleeding:

In 1957, AI pioneer Herbert Simon, assuming that chess mastery was a simple matter of computational muscle, predicted that a chess-playing computer would defeat a human grandmaster within a decade. Although machines might not “think” like humans, they could arrive at the same results by making billions of calculations in a few seconds.

Not quite. It was not until forty years later, 1997, that a supercomputer – IBM’s “Deep Blue,” a 1.4-ton behemoth capable of pondering 200 million chess moves each second – was able to defeat the Russian grandmaster Garry Kasparov. While this glorious victory did not translate into business success (it turns out that the skills needed to master the game of chess don’t easily translate into a marketable project for business decision-making), it was nonetheless a remarkable achievement.

How did the computer finally achieve its victory? It turned out that Deep Blue didn’t win just by “brute-forcing” a mind-numbing sequence of possible moves and countermoves, most of which would have been nonsensical. Instead, it was taught to analyze implications and possibilities, not just individual moves, more closely mirroring the way Kasparov and other masters actually played the game.

But if constructing a computer program to beat a chess grandmaster was challenging, developing a useful medical AI program was damn near impossible. After all, there are only 85 billion possible chess openings (and that’s just for the first four moves!), while the human body’s response to illness is virtually limitless, as are the illnesses themselves.

Undaunted, in the 1980s medical informaticians dove headlong into the quest for a “killer app” medical AI program. Going by names like DxPlain and Iliad, virtually all suffered from an inability to “roll with the punches” – to handle unexpected or extraneous data – like an expert. While they could create lists of possible diagnoses that included a few surprising and plausible choices, all of them also spewed out lots of unusable garbage. Moreover, the programs were clunky and expensive, and, because all clinical data were on paper charts, it took redundant work to enter the necessary information into the computer program to generate the output. By the early 1990s, the field of medical AI was moribund, the enthusiasm sapped.

There was likely another reason the programs never caught on: experts tend to be skeptical of computers that purport to be smarter than they are. Consider this tragic example from another industry. Moments before a planeload of Russian schoolchildren collided with a DHL cargo jet over Switzerland in 2002, the Russian pilot received conflicting orders from two sources: one human, the other a machine. The human was a befuddled Swiss air traffic controller whose backup collision alarm system was on the fritz and whose colleague was on a break. The machine was the computerized collision-avoidance system (CAS) aboard the doomed plane. When the human controller noticed an apparent collision course between the school kids and the cargo flight, he ordered the Russian airliner to “Dive!” The Russian’s on-board CAS, on the other hand, detecting an obstacle hurtling toward it, instructed the pilot (in that distinctive but less-than-confidence-inspiring computer voice), to “Pull up!” With only seconds to react, the pilot chose to obey the human voice, and the results were catastrophic – and heartbreaking. “Pilots tend to listen to the air traffic controller because they trust a human being and know that a person wants to keep them safe,” said an airport safety consultant soon after the crash.

Despite all of these obstacles and black eyes, I believe that medical AI is finally poised for a comeback. And that’s where IBM’s Jeopardy-playing computer fits in.

IBM’s goal this time is not to beat humans at chess, a tour de force but one without obvious business applicability, but rather to master the task of rapid, accurate question answering, a skill of great relevance to businesses ranging from law firms to help desks. When someone at IBM suggested using the game of Jeopardy as a high-profile way to demonstrate the computer’s new talents, many were skeptical. Chess, after all, is logical and mathematical, whereas language is much more nuanced and complex… particularly the language of Jeopardy, with its puns, allusions, and wordplay. The engineer leading the IBM team, David Ferrucci, remembers being told “No, it’s too hard, forget it…” when he originally broached the idea.

The Times magazine piece illustrates the fundamental obstacle – Ferrucci calls it the “intended meaning” of language problem – and it took a new paradigm to allow Watson (the computer is named for IBM’s founder, Thomas J. Watson) to overcome it. Consider a typical Jeopardy question, “The name of this hat is elementary, my dear contestant.” The wordplay is obvious to most humans: “elementary, my dear Watson” immediately evokes thoughts of Sherlock Holmes, and every Holmes buff knows that the detective wore a deerstalker hat.

But for a computer to figure this out, it has to first recognize the subtle allusion and translate it into a more linear question: “What sort of hat did Sherlock Holmes wear?” Early AI programs, even if they could overcome the wordplay issue (none could), often stalled out on the more straightforward trivia question. While a programmer could build a database including hundreds of Sherlock Holmes-related factoids, it was too labor intensive to try to do so around all possible topics (just consider also having to build one on Jerry Seinfeld’s girlfriends, cities in the Czech Republic, and – more to our point – causes of hemoptysis). Ferrucci calls this the “boiling the ocean” problem, and, until recently, it was a deal breaker for most AI programs trying to confront huge swaths of information.

The breakthrough came when increasingly powerful computers began to process statistical correlations, learning that phrases like “Sherlock Holmes,” “opium,” “deerstalker hat,” and, yes, “elementary, my dear Watson” often keep each other’s company in the literature, and that these linked phrases specifically don’t include words like “Houston” or “sand trap.” Increased computing power and speed, combined with the explosion of online sources of information (including rhyming dictionaries and thesauruses), allowed new programs to mine these correlations for answers to all kinds of questions.
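For readers who like to see an idea in code, here is a minimal sketch of what "keeping each other's company" could mean in practice: counting how often pairs of terms turn up in the same document. The toy corpus, the vocabulary, and the function name `cooccurrence_counts` are all invented for illustration; this is not a description of how Watson actually works.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(documents, vocabulary):
    """Count how often each pair of vocabulary terms appears in the same document."""
    counts = Counter()
    for text in documents:
        lowered = text.lower()
        # Sort so each pair is always stored in the same (alphabetical) order.
        present = sorted(term for term in vocabulary if term in lowered)
        for pair in combinations(present, 2):
            counts[pair] += 1
    return counts


# Toy corpus, invented for illustration only.
documents = [
    "Sherlock Holmes put on his deerstalker and left Baker Street.",
    "Holmes is often pictured in a deerstalker, saying 'Elementary, my dear Watson.'",
    "The golfer hit out of the sand trap in Houston.",
]
vocabulary = {"holmes", "deerstalker", "houston", "sand trap"}

counts = cooccurrence_counts(documents, vocabulary)
print(counts[("deerstalker", "holmes")])   # pair seen together in two documents
print(counts[("deerstalker", "houston")])  # pair never seen together
```

Even on three sentences, "holmes" and "deerstalker" pair up while "deerstalker" and "houston" never do. A real system would need far more text and a better measure than raw counts, but the intuition is the same.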

Watson is loaded with tens of millions of such documents in its prodigious memory (the computer is not connected to the Internet), and its blistering computing speed allows it to simultaneously run more than one hundred different algorithms to try to answer a question. The results of these algorithms are back-tested for plausibility (using a method similar to bootstrapping, for you statistical types). When Watson is playing Jeopardy, a plausibility threshold is set and, if one of the answers crosses that threshold, Watson rings in. Of course, the computer never forgets to phrase the answer in the form of a question.
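The "ring in only above a threshold" rule can be sketched in a few lines. Everything here is an assumption made for illustration: the scores are invented, the function names are hypothetical, and simple averaging stands in for whatever method Watson really uses to combine its algorithms' outputs.

```python
def combine_scores(scorer_outputs):
    """Average each candidate's score across all scorers.

    A scorer that did not propose a candidate counts as a score of 0 for it.
    Averaging is an illustrative assumption, not Watson's actual method.
    """
    totals = {}
    for output in scorer_outputs:
        for candidate, score in output.items():
            totals.setdefault(candidate, []).append(score)
    return {c: sum(s) / len(scorer_outputs) for c, s in totals.items()}


def choose_answer(candidate_scores, threshold):
    """Return the top candidate if its score meets the threshold, else None."""
    if not candidate_scores:
        return None
    best, score = max(candidate_scores.items(), key=lambda item: item[1])
    return best if score >= threshold else None


# Invented plausibility scores from three hypothetical algorithms.
scorer_outputs = [
    {"deerstalker": 0.9, "top hat": 0.2},
    {"deerstalker": 0.8},
    {"deerstalker": 0.7, "top hat": 0.1},
]
combined = combine_scores(scorer_outputs)

print(choose_answer(combined, threshold=0.75))  # confident enough to ring in
print(choose_answer(combined, threshold=0.95))  # not confident enough: stays silent
```

Raising the threshold trades fewer wrong answers for more missed chances to buzz, which is exactly the kind of tuning a game-playing (or diagnosis-suggesting) machine would need.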

Watson isn’t perfect. In a preliminary Jeopardy matchup last winter, the computer sometimes buzzed in too late, or misunderstood a category heading, or even gave a few absurd answers. Despite these shortcomings, Watson still managed to win two-thirds of its games against fairly good human contestants. A highly advertised matchup with a “Jeopardy champion” – the Times speculated it might be all-time champ Ken Jennings – is anticipated some time this fall (here’s IBM’s promotional video; it’s pretty cool).

IBM plans to sell customized versions of Watson to businesses within a few years, including in healthcare. “I want to create a medical version of this,” says John Kelly, head of IBM’s research labs. “A Watson, MD, if you will.” By constantly enriching it with a steady stream of research papers and textbooks, Kelly hopes to overcome a fundamental problem for physicians: “the new procedures, the new medicines, the new capabilities are being generated faster than physicians can absorb on the front lines….” Although a medical version of Watson will need to run on a million dollar IBM server and the program itself might cost a few million more, the cost will probably come down over time.

Watson may be so “smart” because its algorithms mimic how the human brain functions – instantaneously sorting through thousands of possibilities, testing them against known patterns, ultimately settling on the most plausible matches. We physicians are trained to do these things, and then to go even further: to perform iterative hypothesis testing, developing a list of potential diagnoses that might fit a given set of facts (signs, symptoms, initial studies) and then a testing strategy designed to render some of the possibilities more likely and others less so. This is tricky stuff, particularly since each diagnostic test – whether another piece of history (“does the pain go to your back?”), a physical finding (is there a murmur?), a serum ANCA or a CT angiogram – has false negatives and positives, and needs to be interpreted in the light of prior probabilities, in keeping with Bayes’ theorem. Ultimately, the expert clinician settles on a final answer, when the probability of one of the diagnoses crosses a magical threshold at which he or she determines – in a shorthand that masks its magnificent complexity – that we’ve “ruled in” a diagnosis.
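To make the Bayesian step concrete, here is a small sketch of how a prior probability is updated by a test result, given the test's sensitivity and specificity. The function name is my own, and the numbers are invented for the example; they are not the real test characteristics of ANCA or any other study.

```python
def post_test_probability(prior, sensitivity, specificity, positive):
    """Apply Bayes' theorem to update the probability of disease after one test.

    prior:       pre-test probability of disease (between 0 and 1)
    sensitivity: P(test positive | disease)
    specificity: P(test negative | no disease)
    positive:    True if the test came back positive, False if negative
    """
    if positive:
        p_result_if_disease = sensitivity          # true positive rate
        p_result_if_healthy = 1 - specificity      # false positive rate
    else:
        p_result_if_disease = 1 - sensitivity      # false negative rate
        p_result_if_healthy = specificity          # true negative rate
    numerator = prior * p_result_if_disease
    denominator = numerator + (1 - prior) * p_result_if_healthy
    return numerator / denominator


# Invented numbers: a 10% pre-test probability and a test that is
# 90% sensitive and 95% specific.
after_positive = post_test_probability(0.10, 0.90, 0.95, positive=True)
after_negative = post_test_probability(0.10, 0.90, 0.95, positive=False)

print(round(after_positive, 3))  # about 0.667
print(round(after_negative, 3))  # about 0.012
```

With these made-up figures, a positive result moves a 10% suspicion to roughly two in three, well short of certainty, while a negative result drops it to about 1%. The output of one test becomes the prior for the next, which is the iterative process described above.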

With Watson-like programs, we may finally be on the cusp of having computer systems that will at least do the first step very well: taking an initial fact set and using it to answer a clinical question or create a differential diagnosis list. (There are early medical versions of this model; the best known is called Isabel, and some of its early results are relatively promising. But none have anywhere near Watson’s computerized firepower.) The other steps might prove to be easier – a “Watson MD” could surely “know” the test characteristics of the most common medical studies, and easily apply the Bayesian algorithm to these results.

Finally, the next generation of medical AI computers will ultimately “learn” from their experience. Once every patient’s data is stored in the computer and the final, correct answer is also captured by the system, the AI program need not rely only on textbook chapters and articles as its source of data. Instead, it could learn that patients like the one you are seeing ultimately turned out to have Wegener’s granulomatosis, even though they were frequently mistakenly diagnosed initially as having atypical pneumonia or sinusitis. And it could adjust its algorithm accordingly. This, of course, is analogous to Amazon.com’s magical feat of informing us that “customers like you bought X book.” Except it would be “patients like yours had Y disease.”
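A toy version of "patients like yours had Y disease" might look like the sketch below: tally the final diagnoses of past patients who shared at least a couple of findings with the current one. The case data, the overlap rule, and the function name are all hypothetical; a real system would need a far more careful notion of similarity.

```python
from collections import Counter


def diagnoses_for_similar_patients(past_cases, findings, min_overlap=2):
    """Tally final diagnoses among past cases sharing at least min_overlap findings."""
    tally = Counter()
    for case_findings, final_diagnosis in past_cases:
        if len(findings & case_findings) >= min_overlap:
            tally[final_diagnosis] += 1
    return tally.most_common()


# Invented past cases: (set of findings, final confirmed diagnosis).
past_cases = [
    ({"sinusitis", "hemoptysis", "hematuria"}, "Wegener's granulomatosis"),
    ({"sinusitis", "hemoptysis"}, "Wegener's granulomatosis"),
    ({"cough", "fever"}, "pneumonia"),
    ({"hemoptysis", "fever", "cough"}, "pneumonia"),
]
current_findings = {"sinusitis", "hemoptysis", "fever"}

ranked = diagnoses_for_similar_patients(past_cases, current_findings)
print(ranked)
```

The key ingredient is not the counting, which is trivial, but the captured "final, correct answer" for each past patient, without which there is nothing to learn from.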

I’m not alone in thinking about Watson’s potential gifts as a diagnostician. I’ve corresponded with Stephen Baker, a technology reporter (The Numerati) who is writing a book about Watson, “Final Jeopardy,” that will be published next spring. He writes:

The exponential growth of information represents an enormous challenge for doctors. There are terabytes of data about diseases and symptoms, treatments and outcomes. This data is leading to an explosion of research papers. In 2008, there were 50,000 papers published on neuroscience alone, more than twice as many as in 2006. It’s impossible for one person, or even a team of people to keep on top of these learnings. Conceivably, a question-answering machine, like IBM’s Watson, could read those thousands of papers, find trends and correlations, and answer questions about them. A tool like this, matching symptoms of patients with findings in the literature and records, could help doctors come up with diagnoses, and point to dangers and downfalls of their own suggestions. This machine, a bionic Dr. House, would by no means be infallible. Some of its suggestions would be silly, and it would be up to humans to vet its suggestions. But it could be a useful tool.

It will not be easy to translate Watson’s gifts into medical reality. But I am convinced that the same kind of thinking and technology that spawned Watson will ultimately help us make better diagnoses. Will we – particularly “cognitive specialists” like me – be put out to pasture? I think we’ll be OK for a while. For as remarkable as Watson is, and “Watson MD” might prove to be, there is no evidence, yet, that Watson is capable of judgment. Or empathy.

Robert Wachter, MD, is widely regarded as a leading figure in the modern patient safety movement. Together with Dr. Lee Goldman, he coined the term “hospitalist” in an influential 1996 essay in The New England Journal of Medicine. His most recent book, Understanding Patient Safety (McGraw-Hill, 2008), examines the factors that have contributed to what is often described as “an epidemic” facing American hospitals. His posts appear semi-regularly on THCB and on his own blog, Wachter’s World.