I got asked the other day to comment for Wired on the role of AI in Covid-19 detection, in particular for use with CT scanning. Since I didn’t know exactly what resources they had on the ground in China, I could only make some generic, vaguely negative statements. I thought it would be worthwhile to expand on those ideas here, so I am writing two blog posts on the topic: one on CT scanning for Covid-19, and one on using AI on those CT scans.
As background, the pro-AI argument goes like this:
CT screening detects 97% of Covid-19, viral PCR only detects 70%!
A radiologist takes 5-10 minutes to read a CT chest scan. AI can do it in a second or two.
If you use CT for screening, there will be so many studies that radiologists will be overwhelmed.
In this first post, I will explain why CT, with or without AI, is not worthwhile for Covid-19 screening and diagnosis, and why that 97% sensitivity report is unfounded and unbelievable.
Next post, I will address the use of AI for this task specifically.
By VASANTH VENUGOPAL MD and VIDUR MAHAJAN MBBS, MBA
What can Artificial Intelligence (AI) do?
AI can, simply put, do two things – one, it can do what humans can do. These are tasks
like looking at CCTV cameras, detecting faces of people, or in this case, reading
CT scans and identifying ‘findings’ of pneumonia that radiologists can otherwise
also find – just that this happens automatically and fast. Two, AI can do
things that humans can’t do – like telling you the exact time it would take you
to go from point A to point B (i.e. Google Maps), or like in this case,
diagnosing COVID-19 pneumonia on a CT scan.
on CT scans?
Pneumonia, an infection of the lungs, is a killer disease. According to WHO statistics from
2015, Community Acquired Pneumonia (CAP) is the deadliest communicable disease
and third leading cause of mortality worldwide leading to 3.2 million deaths
be classified in many ways, including the type of infectious agent (etiology),
source of infection and pattern of lung involvement. From an etiological classification
perspective, the most common causative agents of pneumonia are bacteria
(typical like Pneumococcus, H. influenzae and atypical like Legionella,
Mycoplasma), viruses (Influenza, Respiratory Syncytial Virus, Parainfluenza, and
adenoviruses) and fungi (Histoplasma & Pneumocystis Carinii).
This is part two of a three-part series. Catch up on Part One here.
Preetham Srinivas, the head of the
chest radiograph project in Qure.ai, summoned Bhargava Reddy, Manoj Tadepalli, and
Tarun Raj to the meeting room.
“Get ready for an all-nighter, boys,”
Qure’s scientists began investigating
the algorithm’s mysteriously high performance on chest radiographs from a new
hospital. To recap, the algorithm had an area under the receiver operating
characteristic curve (AUC) of 1 – that’s 100% on multiple-choice question
“Someone leaked the paper to AI,”
“It’s an engineering college joke,”
explained Bhargava. “It means that you saw the questions before the exam. It
happens sometimes in India when rich people buy the exam papers.”
Just because you know the questions
doesn’t mean you know the answers. And AI wasn’t rich enough to buy the AUC.
The four lads were school friends from
Andhra Pradesh. They had all studied computer science at the Indian Institute
of Technology (IIT), a freaky improbability given that only a hundred out of a
million aspiring youths are selected for this most coveted discipline in India’s
most coveted institute. They had revised for exams together, pulling
all-nighters – in working together, they worked harder and made work more fun.
One big theme in AI research has been the idea of interpretability. How should AI systems explain their decisions to engender trust in their human users? Can we trust a decision if we don’t understand the factors that informed it?
I’ll have a lot more to say on the latter question some other time, which is philosophical rather than technical in nature, but today I wanted to share some of our research into the first question. Can our models explain their decisions in a way that can convince humans to trust them?
I am a radiologist, which makes me something of an expert in the field of human image analysis. We are often asked to explain our assessment of an image, to our colleagues or other doctors or patients. In general, there are two things we express.
What part of the image we are looking at.
What specific features we are seeing in the image.
This is partially what a radiology report is. We describe a feature, give a location, and then synthesise a conclusion. For example:
There is an irregular mass with microcalcification in the upper outer quadrant of the breast. Findings are consistent with malignancy.
You don’t need to understand the words I used here, but the point is that the features (irregular mass, microcalcification) are consistent with the diagnosis (breast cancer, malignancy). A doctor reading this report already sees internal consistency, and that reassures them that the report isn’t wrong. A common example of a wrong report could be:
AI in medical imaging entered the consciousness of radiologists just a few years ago, notably peaking in 2016 when Geoffrey Hinton declared radiologists’ time was up, swiftly followed by the first AI startups booking exhibiting booths at RSNA. Three years on, the sheer number and scale of AI-focussed offerings has gathered significant pace, so much so that this year a decision was made by the RSNA organising committee to move the ever-growing AI showcase to a new space located in the lower level of the North Hall. In some ways it made sense to offer a larger, dedicated show hall to this expanding field, and in others, not so much. With so many startups, wiggle room for booths was always going to be an issue, however integration of AI into the workflow was supposed to be a key theme this year, made distinctly futile by this purposeful and needless segregation.
By moving the location, the show hall for AI startups was made more difficult to find, with many vendors remarking that their natural booth footfall was not as substantial as last year when AI was upstairs next to the big-boy OEM players. One witty critic quipped that the only way to find it was to ‘follow the smell of burning VC money, down to the basement’. Indeed, at a conference where the distance walked over the week can easily hit 30 miles or more, adding an extra few minutes’ walk may well have put off some of the less fleet-of-foot. Several startup CEOs told us that the clientele arriving at their booths were the dedicated few, firming up existing deals, rather than new potential customers seeking a glimpse of a utopian future. At a time when startups are desperate for traction, this could have a disastrous knock-on effect on this as-yet nascent industry.
It wasn’t just the added distance that caused concern, however. By placing the entire startup ecosystem in an underground bunker there was an overwhelming feeling that the RSNA conference had somehow buried the AI startups alive in an open grave. There were certainly a couple of tombstones on the show floor — wide open gaps where larger booths should have been, scaled back by companies double-checking their diminishing VC-funded runway. Zombie copycat booths from South Korea and China had also appeared, and to top it off, the very first booth you came across was none other than Deep Radiology, a company so ineptly marketed and indescribably mysterious, that entering the show hall felt like you’d entered some sort of twilight zone for AI, rather than the sparky, buzzing and upbeat showcase it was last year. It should now be clear to everyone who attended that Gartner’s hype curve has well and truly been swung, and we are swiftly heading into deep disillusionment.
No one knows who gave Rahul Roy
tuberculosis. Roy’s charmed life as a successful trader involved traveling in his
Mercedes C class between his apartment on the plush Nepean Sea Road in South
Mumbai and offices in Bombay Stock Exchange. He cared little for Mumbai’s weather.
He seldom rolled down his car windows – his ambient atmosphere, optimized for
his comfort, rarely changed.
Historically TB, or
“consumption” as it was known, was a Bohemian malady; the chronic suffering produced
a rhapsody which produced fine art. TB was fashionable in Victorian Britain, in
part, because consumption, like aristocracy, was thought to be hereditary. Even
after Robert Koch discovered that the cause of TB was a rod-shaped bacterium –
Mycobacterium Tuberculosis (MTB), TB had a special status denied to its immoral
peer, Syphilis, and unaesthetic cousin, leprosy.
TB became egalitarian in the early twentieth
century but retained an aristocratic noblesse oblige. George Orwell may have
contracted TB when he voluntarily lived with miners in crowded squalor to
understand poverty. Unlike Orwell, Roy had no pretensions of solidarity with
poor people. For Roy, there was nothing heroic about getting TB. He was
embarrassed not because of TB’s infectivity; TB sanitariums are a thing of the
past. TB signaled social class decline. He believed rickshawallahs, not
traders, got TB.
Super-resolution* promises to be one of the most impactful medical imaging AI technologies, but only if it is safe.
Last week we saw the FDA approve the first MRI super-resolution product, from the same company that received approval for a similar PET product last year. This news seems as good a reason as any to talk about the safety concerns that I and many other people have with these systems.
Disclaimer: the majority of this piece is about medical super-resolution in general, and not about the SubtleMR system itself. That specific system is addressed directly near the end.
Super-resolution is, quite literally, the “zoom and enhance” CSI meme in the gif at the top of this piece. You give the computer a low quality image and it turns it into a high resolution one. Pretty cool stuff, especially because it actually kind of works.
In medical imaging though, it’s better than cool. You ever wonder why an MRI costs so much and can have long wait times? Well, it is because you can only do one scan every 20-30 minutes (with some scans taking an hour or more). The capital and running costs are only spread across one to two dozen patients per day.
So what if you could get an MRI of the same quality in 5 minutes? Maybe two to five times more scans (the “getting patient ready for the scan” time becomes the bottleneck), meaning less cost and more throughput.
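To make the throughput argument concrete, here is a rough back-of-the-envelope sketch. The 20-30 minute slot and the 5 minute scan come from the text above; the 8 hour operating day and the 5 minute patient preparation time are my own illustrative assumptions, not measured figures.

```python
def scans_per_day(scan_minutes: int, prep_minutes: int,
                  operating_minutes: int = 480) -> int:
    """Whole scans that fit in one operating day on a single scanner.

    operating_minutes defaults to an assumed 8-hour day, and prep_minutes
    is an assumed "getting the patient ready" overhead per scan.
    """
    return operating_minutes // (scan_minutes + prep_minutes)


PREP = 5  # assumed preparation time per patient, in minutes

# Conventional scanning: slots of 30 and 20 minutes including prep.
slow = scans_per_day(scan_minutes=25, prep_minutes=PREP)
fast = scans_per_day(scan_minutes=15, prep_minutes=PREP)

# Hypothetical 5-minute scan; prep time now makes up half of each slot.
accelerated = scans_per_day(scan_minutes=5, prep_minutes=PREP)

print(slow, fast, accelerated)
print(accelerated / slow, accelerated / fast)
```

Under these assumptions the conventional figures land in the "one to two dozen patients per day" range, and the gain from a 5 minute scan is capped by preparation time rather than by the scan itself.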
Medical AI testing is unsafe, and that isn’t likely to change anytime soon.
No regulator is seriously considering implementing “pharmaceutical style” clinical trials for AI prior to marketing approval, and evidence strongly suggests that pre-clinical testing of medical AI systems is not enough to ensure that they are safe to use. As discussed in a previous post, factors ranging from the laboratory effect to automation bias can contribute to substantial disconnects between pre-clinical performance of AI systems and downstream medical outcomes. As a result, we urgently need mechanisms to detect and mitigate the dangers that under-tested medical AI systems may pose in the clinic.
In a recent preprint co-authored with Jared Dunnmon from Chris Ré’s group at Stanford, we offer a new explanation for the discrepancy between pre-clinical testing and downstream outcomes: hidden stratification. Before explaining what this means, we want to set the scene by saying that this effect appears to be pervasive, underappreciated, and could lead to serious patient harm even in AI systems that have been approved by regulators.
But there is an upside here as well. Looking at the failures of pre-clinical testing through the lens of hidden stratification may offer us a way to make regulation more effective, without overturning the entire system and without dramatically increasing the compliance burden on developers.
Despite an area under the ROC curve of 1, Cassandra’s
prophecies were never believed. She neither hedged nor relied on retrospective
data – her predictions, such as the Trojan war, were prospectively validated. In
medicine, a new type of Cassandra has emerged –
one who speaks in probabilistic tongue, forked unevenly between the
probability of being right and the possibility of being wrong. One who, by conceding
that she may be categorically wrong, is technically never wrong. We call these
new Minervas “predictions.” The Owl of Minerva flies above its denominator.
Deep learning (DL) promises to transform the prediction
industry from a stepping stone for academic promotion and tenure to something
vaguely useful for clinicians at the patient’s bedside. Economists studying AI believe that AI is revolutionary,
revolutionary like the steam engine and the internet, because it better predicts.
Recently published in Nature, a sophisticated DL algorithm was able to predict acute kidney injury (AKI), continuously, in hospitalized patients by extracting data from their electronic health records (EHRs). The algorithm interrogated nearly a million EHRs of patients in Veterans Affairs hospitals. As intriguing as their methodology is, it’s less interesting than their results. For every correct prediction of AKI, there were two false positives. The false alarms would have made Cassandra blush, but they’re not bad for prognostic medicine. The DL-generated ROC curve stands head and shoulders above the diagonal representing randomness.
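As a quick worked example of what "two false positives for every correct prediction" means at the bedside, the ratio can be turned into a positive predictive value. This is only arithmetic on the 1:2 ratio quoted above, not a reproduction of the paper's analysis, and the function name is my own.

```python
def positive_predictive_value(true_positives: float,
                              false_positives: float) -> float:
    """Fraction of positive alerts that turn out to be correct."""
    return true_positives / (true_positives + false_positives)


# One correct AKI alert for every two false alarms, as quoted in the text.
ppv = positive_predictive_value(true_positives=1, false_positives=2)
print(f"{ppv:.1%} of alerts are correct")
```

In other words, roughly one alert in three would be right, and a clinician responding to every alert would be chasing a false alarm about two times out of three.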
The researchers used a technique called “ablation analysis.”
I have no idea how that works but it sounds clever. Let me make a humble
prophecy of my own – if unleashed at the bedside the AKI-specific, DL-augmented
Cassandra could wreak havoc on a scale one struggles to comprehend.
Leaving aside that the accuracy of algorithms trained
retrospectively falls in the real world – as doctors know, there’s a difference
between book knowledge and practical knowledge – the major problem is the
effect availability of information has on decision making. Prediction is
fundamentally information. Information changes us.
By ROBERT C. MILLER, JR. and MARIELLE S. GROSS, MD, MBE
This piece is part of the series “The Health Data Goldilocks Dilemma: Sharing? Privacy? Both?” which explores whether it’s possible to advance interoperability while maintaining privacy. Check out other pieces in the series here.
The problem with porridge
Today, we regularly hear stories of research teams using artificial intelligence to detect and diagnose diseases earlier with more accuracy and speed than a human would have ever dreamed of. Increasingly, we are called to contribute to these efforts by sharing our data with the teams crafting these algorithms, sometimes by healthcare organizations relying on altruistic motivations. A crop of startups has even appeared to let you monetize your data to that end. But given the sensitivity of your health data, you might be skeptical of this, doubly so when you take into account tech’s privacy track record. We have begun to recognize the flaws in our current privacy-protecting paradigm, which relies on thin notions of “notice and consent” that inappropriately place the responsibility for data stewardship on individuals who remain extremely limited in their ability to exercise meaningful control over their own data.
Emblematic of a broader trend, the “Health Data Goldilocks Dilemma” series calls attention to the tension and necessary tradeoffs between privacy and the goals of our modern healthcare technology systems. Not sharing our data at all would be “too cold,” but sharing freely would be “too hot.” We have been looking for policies “just right” to strike the balance between protecting individuals’ rights and interests while making it easier to learn from data to advance the rights and interests of society at large.
What if there was a way for you to allow others to learn from your data without compromising your privacy?
To date, a major strategy for striking this balance has involved the practice of sharing and learning from deidentified data—by virtue of the belief that individuals’ only risks from sharing their data are a direct consequence of that data’s ability to identify them. However, artificial intelligence is rendering genuine deidentification obsolete, and we are increasingly recognizing a problematic lack of accountability to individuals whose deidentified data is being used for learning across various academic and commercial settings. In its present form, deidentification is little more than a sleight of hand to make us feel more comfortable about the unrestricted use of our data without truly protecting our interests. More of a wolf in sheep’s clothing, deidentification is not solving the Goldilocks dilemma.
Tech to the rescue!
Fortunately, there are a handful of exciting new technologies that may let us escape the Goldilocks Dilemma entirely by enabling us to gain the benefits of our collective data without giving up our privacy. This sounds too good to be true, so let me explain the three most revolutionary ones: zero knowledge proofs, federated learning, and blockchain technology.