I got asked the other day to comment for Wired on the role of AI in Covid-19 detection, in particular for use with CT scanning. Since I didn’t know exactly what resources they had on the ground in China, I could only make some generic vaguely negative statements. I thought it would be worthwhile to expand on those ideas here, so I am writing two blog posts on the topic, on CT scanning for Covid-19, and on using AI on those CT scans.
As background, the pro-AI argument goes like this:
CT screening detects 97% of Covid-19, viral PCR only detects 70%!
A radiologist takes 5-10 minutes to read a CT chest scan. AI can do it in a second or two.
If you use CT for screening, there will be so many studies that radiologists will be overwhelmed.
In this first post, I will explain why CT, with or without AI, is not worthwhile for Covid-19 screening and diagnosis, and why that 97% sensitivity report is unfounded and unbelievable.
Next post, I will address the use of AI for this task specifically.
One big theme in AI research has been the idea of interpretability. How should AI systems explain their decisions to engender trust in their human users? Can we trust a decision if we don’t understand the factors that informed it?
I’ll have a lot more to say on the latter question some other time, which is philosophical rather than technical in nature, but today I wanted to share some of our research into the first question. Can our models explain their decisions in a way that can convince humans to trust them?
I am a radiologist, which makes me something of an expert in the field of human image analysis. We are often asked to explain our assessment of an image, to our colleagues or other doctors or patients. In general, there are two things we express.
What part of the image we are looking at.
What specific features we are seeing in the image.
This is partially what a radiology report is. We describe a feature, give a location, and then synthesise a conclusion. For example:
There is an irregular mass with microcalcification in the upper outer quadrant of the breast. Findings are consistent with malignancy.
You don’t need to understand the words I used here, but the point is that the features (irregular mass, microcalcification) are consistent with the diagnosis (breast cancer, malignancy). A doctor reading this report already sees internal consistency, and that reassures them that the report isn’t wrong. An common example of a wrong report could be:
Super-resolution* promises to be one of the most impactful medical imaging AI technologies, but only if it is safe.
Last week we saw the FDA approve the first MRI super-resolution product, from the same company that received approval for a similar PET product last year. This news seems as good a reason as any to talk about the safety concerns myself and many other people have with these systems.
Disclaimer: the majority of this piece is about medical super-resolution in general, and not about the SubtleMR system itself. That specific system is addressed directly near the end.
Super-resolution is, quite literally, the “zoom and enhance” CSI meme in the gif at the top of this piece. You give the computer a low quality image and it turns it into a high resolution one. Pretty cool stuff, especially because it actually kind of works.
In medical imaging though, it’s better than cool. You ever wonder why an MRI costs so much and can have long wait times? Well, it is because you can only do one scan every 20-30 minutes (with some scans taking an hour or more). The capital and running costs are only spread across one to two dozen patients per day.
So what if you could get an MRI of the same quality in 5 minutes? Maybe two to five times more scans (the “getting patient ready for the scan” time becomes the bottleneck), meaning less cost and more throughput.
Medical AI testing is unsafe, and that isn’t likely to change anytime soon.
No regulator is seriously considering implementing “pharmaceutical style” clinical trials for AI prior to marketing approval, and evidence strongly suggests that pre-clinical testing of medical AI systems is not enough to ensure that they are safe to use. As discussed in a previous post, factors ranging from the laboratory effect to automation bias can contribute to substantial disconnects between pre-clinical performance of AI systems and downstream medical outcomes. As a result, we urgently need mechanisms to detect and mitigate the dangers that under-tested medical AI systems may pose in the clinic.
In a recent preprint co-authored with Jared Dunnmon from Chris Ré’s group at Stanford, we offer a new explanation for the discrepancy between pre-clinical testing and downstream outcomes: hidden stratification. Before explaining what this means, we want to set the scene by saying that this effect appears to be pervasive, underappreciated, and could lead to serious patient harm even in AI systems that have been approved by regulators.
But there is an upside here as well. Looking at the failures of pre-clinical testing through the lens of hidden stratification may offer us a way to make regulation more effective, without overturning the entire system and without dramatically increasing the compliance burden on developers.
A huge new CT brain dataset was released the other day, with the goal of training models to detect intracranial haemorrhage. So far, it looks pretty good, although I haven’t dug into it in detail yet (and the devil is often in the detail).
Of course, this lead to cynicism from the usual suspects as well.
And the conversation continued from there, with thoughts ranging from “but since there is a hold out test set, how can you overfit?” to “the proposed solutions are never intended to be applied directly” (the latter from a previous competition winner).
As the discussion progressed, I realised that while we “all know” that competition results are more than a bit dubious in a clinical sense, I’ve never really seen a compelling explanation for why this is so.
Hopefully that is what this post is, an explanation for why competitions are not really about building useful AI systems.
I’ve been talking in recent posts about how our typical methods of testing AI systems are inadequate and potentially unsafe. In particular, I’ve complainedthat all of the headline-grabbing papers so far only do controlled experiments, so we don’t how the AI systems will perform on real patients.
Today I am going to highlight a piece of work that has not received much attention, but actually went “all the way” and tested an AI system in clinical practice, assessing clinical outcomes. They did an actual clinical trial!
Big news … so why haven’t you heard about it?
The Great Wall of the West
Tragically, this paper has been mostly ignored. 89 tweets*, which when you compare it to many other papers with hundreds or thousands of tweets and news articles is pretty sad. There is an obvious reason why though; the article I will be talking about today comes from China (there are a few US co-authors too, not sure what the relative contributions were, but the study was performed in China).
China is interesting. They appear to be rapidly becoming the world leader in applied AI, including in medicine, but we rarely hear anything about what is happening there in the media. When I go to conferences and talk to people working in China, they always tell me about numerous companies applying mature AI products to patients, but in the media we mostly see headline grabbing news stories about Western research projects that are still years away from clinical practice.
This shouldn’t be unexpected. Western journalists have very little access to China**, and Chinese medical AI companies have no need to solicit Western media coverage. They already have access to a large market, expertise, data, funding, and strong support both from medical governance and from the government more broadly. They don’t need us. But for us in the West, this means that our view of medical AI is narrow, like a frog looking at the sky from the bottom of a well^.