Separating the Art of Medicine From Artificial Intelligence

Artificial intelligence requires data. Ideally that data should be clean, trustworthy and, above all, accurate. Unfortunately, medical data is none of these things. In fact, medical data is sometimes so far removed from clean that it’s positively dirty.

Consider the simple chest X-ray, the good old-fashioned posterior-anterior radiograph of the thorax. It is one of the longest-standing techniques in the medical diagnostic armoury, performed across the world in the billions. So many, in fact, that radiologists struggle to keep up with the sheer volume, and sometimes forget to read the odd 23,000 of them. Oops.

Surely, such a popular, tried and tested medical test should provide great data for training AI? There’s clearly more than enough data to have a decent attempt, and the technique is so well standardised and robust that surely it’s just crying out for automation?

A random anonymised chest X-ray taken from the NIH dataset. Take a look, and make a note of what you think you can see… there’s a test later.

Unfortunately, there is one small and inconvenient problem — humans.

Human radiologists are so bad at interpreting chest X-rays, and at agreeing on what findings they can see, that the ‘report’ that comes with the digital image is often either entirely wrong, partially wrong, or missing information. It’s not the humans’ fault… they are trying their best! When your job is to process thousands of black and white pixels into a few words of natural language text in approximately 30 seconds, it’s understandable that information gets lost and errors are made. Writing a radiology report is an extreme form of data compression — you are converting around 2 megabytes of image data into a few bytes of text, in effect performing lossy compression at an enormous compression ratio. It’s like trying to stream a movie through a 16K modem by getting someone to tap out what’s happening in Morse code. Not to mention the subjectivity of it all.
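To put rough numbers on that compression claim, here is a back-of-the-envelope sketch; the report text below is a made-up example, not a real read of any film:

```python
# A typical digital chest X-ray is around 2 MB of pixel data.
image_bytes = 2 * 1024 * 1024

# A hypothetical (invented) radiology report for the same film.
report = "Left apical nodule. Hyperexpanded lungs. Indistinct left heart border."
report_bytes = len(report.encode("utf-8"))

# The "compression ratio" of turning the image into its report.
ratio = image_bytes / report_bytes
print(f"{ratio:,.0f}:1")  # roughly a 30,000:1 lossy compression ratio
```

At roughly 30,000:1, this is orders of magnitude more aggressive than JPEG’s typical 10:1 to 20:1 — and, unlike JPEG, what gets discarded depends entirely on the reader.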

Don’t believe me that radiologists are bad?

Let’s look at the literature…

Swingler et al showed that radiologists’ overall sensitivity was 67% and specificity 59% for finding lymphadenopathy on X-rays of children clinically suspected of having tuberculosis. (That means they detected the lymphadenopathy in only about two-thirds of the children who actually had it, and correctly ruled it out in just over half of those who didn’t.)

Taghizadieh et al showed that radiologists’ sensitivity was 67% and specificity 78% for finding a pleural effusion (fluid around the lung — solid white on an X-ray, you’d think quite hard to miss…).
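For readers less familiar with these two metrics, here is a minimal sketch, using hypothetical counts chosen to reproduce the pleural effusion figures above:

```python
def sensitivity(tp, fn):
    """Of the patients who truly have the finding, what fraction did the reader catch?"""
    return tp / (tp + fn)

def specificity(tn, fp):
    """Of the patients who truly don't have it, what fraction did the reader correctly clear?"""
    return tn / (tn + fp)

# Hypothetical counts: 100 films with an effusion and 100 without,
# chosen to match the figures quoted above.
print(sensitivity(tp=67, fn=33))  # 0.67 — a third of effusions missed outright
print(specificity(tn=78, fp=22))  # 0.78 — 22% of normal films flagged as abnormal
```

Note that both numbers assume we know the truth about each film — which, as argued below, is exactly what chest X-ray reports cannot provide.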

Quekel et al found that lung cancer was missed in one-fifth of cases, even though in retrospect the lesions were entirely visible! In nearly half of these cases, the cancer had been missed at least twice on subsequent X-rays.

Thankfully, research does show that medical training makes one slightly better than the average student or lay person…

Satia et al showed that 35% of non-radiologist junior doctors were unable to differentiate between heart failure and pneumonia, 18% were unable to identify a normal chest X-ray (CXR), 17% were unable to spot a 3 cm right apical mass, and 55% were unable to recognise the features of chronic emphysema. Senior clinicians performed better in all categories.

At first, this might seem quite alarming! You’d probably expect modern medicine to be a bit better than getting things sort of right up to 2/3rds of the time at best. Well, actually it’s worse than that…

Not only are radiologists really quite bad at writing accurate reports on chest X-rays, they also write entirely different reports to each other given the same chest X-rays. The inter-observer agreement is so low, it’s laughable — one study showed a kappa value of 0.2 (0 is awful, 1 is perfect). Another study just gave up and concluded that “in patients with pneumonia, the interpretation of the chest X-ray, especially the smallest of details, depends solely on the reader.” Subjectivity is as subjectivity does, I suppose.
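To make that kappa figure concrete, here is a minimal sketch of Cohen’s kappa for two hypothetical readers; the labels are invented purely for illustration:

```python
from collections import Counter

def cohens_kappa(reader_a, reader_b):
    """Chance-corrected agreement between two readers (0 is chance-level, 1 is perfect)."""
    n = len(reader_a)
    # Raw proportion of films the two readers labelled the same.
    observed = sum(a == b for a, b in zip(reader_a, reader_b)) / n
    # Agreement expected by chance, given each reader's own label frequencies.
    freq_a, freq_b = Counter(reader_a), Counter(reader_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / n ** 2
    return (observed - expected) / (1 - expected)

# Two hypothetical readers labelling the same ten films N(ormal)/A(bnormal):
a = ["N", "N", "N", "N", "N", "A", "A", "A", "A", "A"]
b = ["N", "N", "N", "A", "A", "A", "A", "A", "N", "N"]
print(round(cohens_kappa(a, b), 2))  # 0.2 — they agree on 6/10, barely above the 5/10 expected by chance
```

The point of kappa is exactly this correction: two readers who each call half their films abnormal will agree half the time by pure luck, so raw percentage agreement flatters them.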

A few days ago, I took to Twitter to conduct a simple (totally unscientific) experiment to illustrate this.

I asked radiologists to look at a chest X-ray (taken from the anonymised NIH dataset) and tweet their report in response. I gave a brief fabricated history that wasn’t specific for any particular disease (54 years old, non-smoker, two weeks shortness of breath, outpatient) so as not to bias them towards any findings.

So how did everyone do? Here are a few sample replies:


Individually, people performed as expected. They made some correct and some probably incorrect observations, and some suggested further imaging with CT. But did people agree? No two suggestions were identical. Some were close, but no two reports mentioned exactly the same findings or came to exactly the same conclusion. Reported findings ranged from infection, to adenopathy, to hypertension, to emphysema, to cancer, to tuberculosis.

However, an overall trend did emerge. If you trawl through all the replies, certain findings were picked out more often than others, including a left apical nodule, hyper-expanded lungs and an indistinct left heart border. I don’t know what the correct ‘read’ is for this chest X-ray (I even had a go myself, and wrote a different report to everyone else), but I would tend to agree with these three main findings. The labels from the NIH dataset were ‘nodule’ and ‘pneumonia’, as mined from the original report. Sadly, there is no follow-up CT or further clinical information, so we shall never know the truth.

(Incidentally, the thread rather took a different turn, with medics from other professions joining in, offering rather humorous opinions of their own. I recommend you have a read if you want a laugh! And, yes, the radiologists did better as a group. Phew!)

What I find fascinating is how the reports might have changed had I simply altered some of the surrounding metadata. If I had, for instance, given a history of smoking 40 cigarettes a day, would the reports have been far more concerned with emphysema and lung cancer than with the possible pneumonia? What if I had said the patient was 24, not 54? What if I had said they had alpha-1 antitrypsin deficiency? What if this chest X-ray had come from sub-Saharan Africa? Would tuberculosis then be the most commonly reported finding?

The interpretation of the image is subject to all sorts of external factors, including patient demographics, history and geography. The problem is even worse for more complex imaging modalities such as MRI or operator dependent modalities like ultrasound, where observer error is even higher.

Why does all this matter? So what if a chest X-ray report isn’t very accurate? The image is still there, so no data is really lost, is it?

The problem quickly becomes apparent when you start using the written report to train an AI to interpret the image. The machine learning team at Stanford have done exactly this, using 108,948 labelled chest X-rays freely available from the NIH. They proudly announced their results as outperforming a radiologist at finding pneumonia. Now, I’m all for cutting-edge research, and I think it’s great that datasets like this are released to the public for exactly this reason… BUT we have to be extremely careful about how we interpret the results of any algorithm built on this data, because, as I have shown, the data is dirty. (I’m not the only one saying so — please read Dr Luke Oakden-Rayner’s blog examining the dataset in detail.)

How is it possible to train an AI to be better than a human, if the data you give it is of the same low quality as produced by humans? I don’t think it is…

It boils down to a simple fact — chest X-ray reports were never intended to be used for the development of AI. They were only ever supposed to be an opinion, an interpretation, a creative educated guess. Reading a chest X-ray is closer to an art than a science. A chest X-ray is neither the first diagnostic test nor the last; it is just one part of a suite of diagnostic steps on the way to a clinical end-point. The chest X-ray itself is not a substitute for a ground truth. In fact, its only real purpose is to act as a form of ‘triage’ — with the universal clinical question being “is there something here that I need to worry about?”. That’s where the value in a chest X-ray lies — answering “should I worry?”, rather than “what is the diagnosis?”. Perhaps the researchers at Stanford have been trying to answer the wrong question…

If we are to develop an AI that can actually ‘read’ chest X-rays, then future research should be concentrated on three things:

  1. The surrounding metadata and a ground truth, rather than reliance on a human-derived report that wasn’t produced with data-mining in mind. An ideal dataset would include all of the patient’s details: epidemiology, history, blood tests, follow-up CT results, biopsy results, genetics and more. Sadly, this level of verified, anonymised data doesn’t exist, at least not in the format required for machine reading. Infrastructure should therefore be built to collate and verify this metadata, at a bare minimum, and preferably at scale.
  2. Meticulous labelling of the dataset. And I do mean absolutely painstakingly thorough annotation of images by domain experts trained specifically to do so, for the purpose of providing machine-learning-ready data. Expert consensus opinion, alongside accurate metadata, will be demonstrably better than relying on random single-reader reports. Thankfully, this is what some of the more reputable AI companies are doing. Yes, it’s expensive and time-consuming, but it’s a necessity if the end-goal is to be attained. This is what I have termed the data-refinement process, specifically the level B to level A stage. Skip this, and you’ll never beat human performance.
  3. Standardising radiological language. Many of the replies to my simple Twitter experiment used differing language to describe roughly similar things. For instance, ‘consolidation’ is largely interchangeable with ‘pneumonia’. Or is it? How do we define these terms, and when should one be used instead of the other? There is huge uncertainty in human language, and this extends to radiological language. (Radiologists are renowned in medical circles for their skill at practising uncertainty, known as ‘the hedge’.) Until this uncertainty is removed, and terminology is agreed upon for every possible use case, it is hard to see how we can progress towards a digital nirvana. Efforts are underway to introduce a standardised lexicon (RadLex), but uptake by practising radiologists has been slow and rather patchy. I don’t know what the answer is to this, but I know the problem is language!
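To illustrate why standardised terminology matters for data mining, here is a toy normalisation step; the lexicon entries below are entirely hypothetical stand-ins, not actual RadLex terms:

```python
# Hypothetical mapping from free-text findings to canonical terms.
# Real controlled vocabularies (e.g. RadLex) are far larger and more nuanced.
LEXICON = {
    "consolidation": "pneumonia-like opacity",
    "pneumonia": "pneumonia-like opacity",
    "air space opacity": "pneumonia-like opacity",
    "hyperexpanded lungs": "hyperinflation",
    "hyperinflation": "hyperinflation",
}

def normalise(findings):
    """Map each reported finding to a canonical term; collect unknowns for human review."""
    canonical, unknown = [], []
    for finding in findings:
        term = finding.strip().lower()
        (canonical if term in LEXICON else unknown).append(LEXICON.get(term, term))
    return canonical, unknown

canonical, unknown = normalise(["Consolidation", "hyperexpanded lungs", "left apical nodule"])
print(canonical)  # ['pneumonia-like opacity', 'hyperinflation']
print(unknown)    # ['left apical nodule'] — no agreed canonical term, so it falls out of the labels
```

The failure mode this sketch exposes is exactly what label mining from free-text reports suffers: any finding phrased outside the agreed vocabulary silently drops out of, or is misfiled in, the training labels.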

Until we have done all of this, the only really useful role of AI in chest radiography is, at best, to provide triage support — tell us what is normal and what is not, and highlight where it could possibly be abnormal. Just don’t try to claim that AI can definitively tell us what the abnormality is, because it can’t do so any more accurately than we can. The data is dirty because we made it that way.

For now, let’s leave the fuzzy thinking and creative interpretation up to us humans, separate the ‘art’ of medicine from ‘artificial intelligence’, and start focusing on producing oodles of clean data.

If you are as excited as I am about the future of AI in medical imaging, and want to discuss these ideas, please do get in touch. I’m on Twitter @drhughharvey

If you enjoyed this article, it would really help if you hit recommend and shared it.

Dr Harvey is a board certified radiologist and clinical academic, trained in the NHS and Europe’s leading cancer research institute, the ICR, where he was twice awarded Science Writer of the Year. He has worked at Babylon Health, heading up the regulatory affairs team, gaining world-first CE marking for an AI-supported triage service, and is now a consultant radiologist, Royal College of Radiologists informatics committee member, and advisor to AI start-up companies, including Kheiron Medical.


5 replies »

  1. I think it is possible that AI will be useful, despite its reliance on exactly the same defective data-set as is now used by humans.

    The reason is that the data can be manipulated in so many more ways with computing machinery than with our cerebral wet-ware.

    Say one has a ground glass infiltrate on X-ray. With human efforts, and in a typical scientific paper, this can be compared with signs of HIV and PCP, hypersensitivity pneumonitis, BNP and CHF, an early lepidic-spread adenoca, a biopsy showing alveolar damage, an elevated temperature, and say 10 other guesses/entities off the pages of a typical pulmonary medicine textbook.

    Using computing approaches, this finding can be compared with literally thousands of other bits of information in the patient’s record, and a correlation coefficient found for each. E.g. is there some relationship between ground glass infiltrates and the presence of a left shift in the white count? Or a K of 6.2 meq/l? Or a past history of psoriasis? Or irritable bowel syndrome? On and on.

    You get the point.

  2. There is no such thing as a “test”. “Tests” have many values. There are CXRs that are easy to read, and there are some that are not. A PSA of 30 means something different than a PSA of 10. Test result values are many and make sense only in terms of their magnitude and hypothesis testing by knowing the patient. I manipulate radiology readings all the time to alter their sensitivity/specificity for my diagnostic search. AI may not help us if it is better at finding the useless abnormalities. My point: read your own tests, know the absolute differences from normal, know the percent chance of the normal/abnormal findings in each potential diagnosis, and revise accordingly.

  3. Are lab tests (imaging included) supposed to exist to make a diagnosis or are they there to 1) validate or improve a presumptive diagnosis and 2) to act as a baseline?

  4. While I strongly agree with point #1, and #3 is a good thing to do regardless of whether we are building an AI or not, what reason is there to think #2 is necessary?

    Meticulous human description of the images seems beside the point to me. Let the machine do semi-supervised learning and figure out the best patterns to diagnose various conditions. It doesn’t matter if the machine does it differently than we do; in fact, maybe we can reverse-engineer the algorithm to improve human prediction. And of course, emphatically, this all only works if the machine is trained by comparing images to verified diagnoses determined not just through reading the image (point #1). Otherwise, you are training it to predict what radiologists say about images, not what they actually show.