Mr. Smith’s pneumonia was clinically shy. He didn’t have a fever. His white blood cells hadn’t increased. The only sign of an infection, other than his cough, was that his lung wasn’t as dark as it should be on the radiograph. The radiologist, taught to see, noticed that the normally crisp border between the heart and the lung was blurred like ink smudged on blotting paper. Something that had colonized the lungs was stopping the x-rays.
One hundred and twenty-five years ago, Wilhelm Conrad Roentgen, a German physicist and the Rector of the University of Würzburg, made an accidental discovery by seeing something he wasn’t watching for. Roentgen was studying cathode rays – invisible forces created by electricity. Using a Crookes tube, a pear-shaped vacuum glass tube with a pair of electrodes, Roentgen would fire cathode rays from one end with an electric jolt. At the other end, the rays would leave the tube through a small hole and generate colorful light on striking fluorescent material placed near the tube.
By then, photography and fluorescence had captured the literary and scientific imagination. In Arthur Conan Doyle’s The Hound of the Baskervilles, the seemingly fire-breathing hound’s jaws had been drenched in phosphorus by its owner. Electricity and magnetism were the new forces. Physicists were experimenting in the backwaters of the electromagnetic spectrum without knowing where they were.
On November 8th, 1895, when Roentgen went to his laboratory after supper for routine experiments, something else caught his eye. He closed the curtains; he wanted his pupils maximally dilated to spot tiny flickers of light. When he turned on the voltage to the Crookes tube, he noticed that a paper soaked in barium platinocyanide on a bench nine feet away flickered. Cathode rays traveled only a few centimeters, and he had covered the tube with heavy cardboard to block light. Why then did the paper glow?
Anyone who has read my blog or tweets before has probably seen that I have issues with some of the common methods used to analyse the performance of medical machine learning models. In particular, the most commonly reported metrics (sensitivity, specificity, F1, accuracy and so on) all systematically underestimate human performance in head-to-head comparisons against AI models.
This makes AI look better than it is, and may be partially responsible for the “implementation gap” that everyone is so concerned about.
Disclaimer: not peer reviewed, content subject to change
A (con)vexing problem
When we compare machine learning models to humans, we have a bit of a problem. Which humans?
In medical tasks, we typically take the doctor who currently does the task (for example, a radiologist identifying cancer on a CT scan) as a proxy for the standard of clinical practice. But doctors aren’t a monolithic group who all give the same answers. Inter-reader variability typically ranges from 15% to 50%, depending on the task. Thus, we usually take as many doctors as we can find and try to summarise their performance (this is called a multi-reader multi-case study, or MRMC for short).
Since the metrics we care most about in medicine are sensitivity and specificity, many papers have reported the averages of these values. In fact, a recent systematic review showed that over 70% of medical AI studies that compared humans to AI models reported these values. This makes a lot of sense. We want to know how the average doctor performs at the task, so the average performance on these metrics should be great, right?
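Here is a toy sketch of why not (the reader operating points and the ROC curve shape are invented purely for illustration): each reader sits at a different point on the same underlying ROC curve, and because that curve is concave, the naive average of their (sensitivity, 1−specificity) pairs always lands below the curve itself.

```python
import numpy as np

# Hypothetical operating points for five readers on the same task.
# Each trades sensitivity against specificity differently, but all
# sit on the same assumed concave ROC curve: sens = fpr ** 0.3.
fpr = np.array([0.05, 0.10, 0.20, 0.35, 0.50])   # 1 - specificity
sens = fpr ** 0.3                                 # points on the curve

avg_fpr, avg_sens = fpr.mean(), sens.mean()

# Sensitivity of the curve itself at the averaged false positive rate:
curve_sens_at_avg = avg_fpr ** 0.3

print(f"average reader point:  sens={avg_sens:.3f} at fpr={avg_fpr:.3f}")
print(f"ROC curve at that fpr: sens={curve_sens_at_avg:.3f}")

# Because the curve is concave, the averaged point always falls *below*
# the curve: naive averaging makes the readers look worse than they are.
assert avg_sens < curve_sens_at_avg
```

The effect grows with inter-reader spread: the further apart the operating points, the further the averaged point sinks below the curve.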
In the last post I wrote about the recent decision by CMS to reimburse a Viz.AI stroke detection model through Medicare/Medicaid. I briefly explained how this funding model will work, but it is so darn complicated that it deserves a much deeper look.
To get more info, I went to the primary source. Dr Chris Mansi, the co-founder and CEO of Viz.ai, was kind enough to talk to me about the CMS decision. He was also remarkably open and transparent about the process and the implications as they see them, which has helped me clear up a whole bunch of stuff in my mind. High fives all around!
So let’s dig in. This decision might form the basis of AI reimbursement in the future. It is a huge deal, and there are implications.
The first thing to understand is that Viz.ai charges a subscription to use their model. The cost is not what was included as “an example” in the CMS documents (25k/yr per hospital); I have seen some discussion on Twitter suggesting it is more than this per annum. But the actual cost is pretty irrelevant to this discussion.
For the purpose of this piece, I’ll pretend that the cost is the 25k/yr in the CMS document, just for simplicity. It is order-of-magnitude right, and that is what matters.
A subscription is not the only way that AI can be sold (I have seen other companies who charge per use as well) but it is a fairly common approach. Importantly though, it is unusual for a medical technology. Here is what CMS had to say:
Occasionally, you get handed a question you know little about, but it’s clear you need to know more. Like most of us these days, I was chatting with my colleagues about the novel coronavirus. It goes by several names – SARS-CoV-2, 2019-nCoV or COVID-19 – but I’ll just call it COVID. Declared a pandemic on March 11, 2020 by the World Health Organization (WHO), COVID is diagnosed by a laboratory test – PCR. The early PCR test used in Wuhan apparently had low sensitivity (30-60%), took days to run, and was in short supply. As CT scanning was relatively available, it became an important diagnostic tool for suspected COVID cases in Wuhan.
The prospect of scanning thousands of contagious patients was daunting, with many radiologists arguing back and forth about its appropriateness. As the pandemic has evolved, we now have better and faster PCR tests and most radiologists do not believe that CT scanning has a role for diagnosis of COVID, but rather should be reserved for its complications. Part of the reason is the concern of transmission of COVID to other patients or healthcare workers via the radiology department.
But then someone asked: “After you have scanned a patient for COVID, how long will the room be down?” And nobody really could answer – I certainly couldn’t. A recent white paper put forth by radiology leaders suggested anywhere from 30 minutes to three hours. A general review of infection control information for the radiologist and radiologic technologist can be found in Radiographics.
So, let’s go down the rabbit hole of infection control in the radiology department. While I’m a radiologist and will speak about radiology-specific concerns, the fundamental rationale is applicable to other ancillary treatment rooms in the hospital or outpatient arena, provided the appropriate specifics about THAT environment are obtained from references held by the CDC.
I got asked the other day to comment for Wired on the role of AI in Covid-19 detection, in particular for use with CT scanning. Since I didn’t know exactly what resources they had on the ground in China, I could only make some generic, vaguely negative statements. I thought it would be worthwhile to expand on those ideas here, so I am writing two blog posts on the topic: one on CT scanning for Covid-19, and one on using AI on those CT scans.
As background, the pro-AI argument goes like this:
CT screening detects 97% of Covid-19, viral PCR only detects 70%!
A radiologist takes 5-10 minutes to read a CT chest scan. AI can do it in a second or two.
If you use CT for screening, there will be so many studies that radiologists will be overwhelmed.
In this first post, I will explain why CT, with or without AI, is not worthwhile for Covid-19 screening and diagnosis, and why that 97% sensitivity report is unfounded and unbelievable.
Next post, I will address the use of AI for this task specifically.
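The screening problem comes down to a few lines of Bayes’ rule arithmetic. In the sketch below, the 97% sensitivity is the claim quoted above; the specificity and prevalence figures are my own illustrative assumptions, not measurements from any study:

```python
def ppv(sens: float, spec: float, prevalence: float) -> float:
    """Positive predictive value via Bayes' rule."""
    true_pos = sens * prevalence
    false_pos = (1 - spec) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

# Claimed CT sensitivity (97%), an *assumed* 75% specificity (the CT
# changes of Covid-19 overlap with other pneumonias), and an *assumed*
# 1% prevalence in the screened population:
print(f"PPV = {ppv(0.97, 0.75, 0.01):.1%}")  # only ~3.8% of positives are true
```

Even granting the 97% sensitivity, at low prevalence the false positives swamp the true positives, which is the core argument against CT as a screening test.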
By VASANTH VENUGOPAL MD and VIDUR MAHAJAN MBBS, MBA
What can Artificial Intelligence (AI) do?
AI can, simply put, do two things – one, it can do what humans can do. These are tasks like looking at CCTV cameras, detecting faces of people, or, in this case, reading CT scans and identifying ‘findings’ of pneumonia that radiologists can otherwise also find – just that this happens automatically and fast. Two, AI can do things that humans can’t do – like telling you the exact time it would take you to go from point A to point B (i.e. Google Maps), or, as in this case, diagnosing COVID-19 pneumonia on a CT scan.
Why look for pneumonia on CT scans?
Pneumonia, an infection of the lungs, is a killer disease. According to WHO statistics from 2015, Community Acquired Pneumonia (CAP) is the deadliest communicable disease and the third leading cause of mortality worldwide, leading to 3.2 million deaths. Pneumonia can be classified in many ways, including by the type of infectious agent (etiology), the source of infection, and the pattern of lung involvement. From an etiological classification perspective, the most common causative agents of pneumonia are bacteria (typical, like Pneumococcus and H. influenzae, and atypical, like Legionella and Mycoplasma), viruses (influenza, respiratory syncytial virus, parainfluenza, and adenoviruses) and fungi (Histoplasma and Pneumocystis carinii).
AI in radiology is not new. In fact, the field is swarming with apps and tools seeking a place in the radiologist’s toolkit, aiming to get more value out of medical imaging and improve patient care. So, how does a radiology team pick which tools to invest in? Enter Blackford Analysis, a health tech startup that has, simply put, designed an “app store” for radiology departments that liberates access to life-saving tech for radiologists. CEO Ben Panter explains how the platform not only gives radiologists access to a curated group of best-in-class AI radiology tools, but does so en masse, circumventing the need for one-off approvals from hospital administrators and procurement teams.
Filmed at Bayer G4A Signing Day in Berlin, Germany, October 2019.
One big theme in AI research has been the idea of interpretability. How should AI systems explain their decisions to engender trust in their human users? Can we trust a decision if we don’t understand the factors that informed it?
I’ll have a lot more to say on the latter question some other time, which is philosophical rather than technical in nature, but today I wanted to share some of our research into the first question. Can our models explain their decisions in a way that can convince humans to trust them?
I am a radiologist, which makes me something of an expert in the field of human image analysis. We are often asked to explain our assessment of an image, to our colleagues or other doctors or patients. In general, there are two things we express.
What part of the image we are looking at.
What specific features we are seeing in the image.
This is partially what a radiology report is. We describe a feature, give a location, and then synthesise a conclusion. For example:
There is an irregular mass with microcalcification in the upper outer quadrant of the breast. Findings are consistent with malignancy.
You don’t need to understand the words I used here, but the point is that the features (irregular mass, microcalcification) are consistent with the diagnosis (breast cancer, malignancy). A doctor reading this report sees this internal consistency, and that reassures them that the report isn’t wrong. A common example of a wrong report could be:
AI in medical imaging entered the consciousness of radiologists just a few years ago, notably peaking in 2016 when Geoffrey Hinton declared that radiologists’ time was up, swiftly followed by the first AI startups booking exhibition booths at RSNA. Three years on, the sheer number and scale of AI-focussed offerings has gathered significant pace – so much so that this year the RSNA organising committee decided to move the ever-growing AI showcase to a new space located in the lower level of the North Hall. In some ways it made sense to offer a larger, dedicated show hall to this expanding field; in others, not so much. With so many startups, wiggle room for booths was always going to be an issue. However, integration of AI into the workflow was supposed to be a key theme this year, made distinctly futile by this purposeful and needless segregation.
By moving the location, the show hall for AI startups was made more difficult to find, with many vendors verbalising how their natural booth footfall was not as substantial as last year when AI was upstairs next to the big-boy OEM players. One witty critic quipped that the only way to find it was to ‘follow the smell of burning VC money, down to the basement’. Indeed, at a conference where the average step count for the week can easily hit 30 miles or over, adding in an extra few minutes walk may well have put some of the less fleet-of-foot off. Several startup CEOs told us that the clientele arriving at their booths were the dedicated few, firming up existing deals, rather than new potential customers seeking a glimpse of a utopian future. At a time when startups are desperate for traction, this could have a disastrous knock-on effect on this as-yet nascent industry.
It wasn’t just the added distance that caused concern, however. By placing the entire startup ecosystem in an underground bunker there was an overwhelming feeling that the RSNA conference had somehow buried the AI startups alive in an open grave. There were certainly a couple of tombstones on the show floor — wide open gaps where larger booths should have been, scaled back by companies double-checking their diminishing VC-funded runway. Zombie copycat booths from South Korea and China had also appeared, and to top it off, the very first booth you came across was none other than Deep Radiology, a company so ineptly marketed and indescribably mysterious, that entering the show hall felt like you’d entered some sort of twilight zone for AI, rather than the sparky, buzzing and upbeat showcase it was last year. It should now be clear to everyone who attended that Gartner’s hype curve has well and truly been swung, and we are swiftly heading into deep disillusionment.
Super-resolution* promises to be one of the most impactful medical imaging AI technologies, but only if it is safe.
Last week we saw the FDA approve the first MRI super-resolution product, from the same company that received approval for a similar PET product last year. This news seems as good a reason as any to talk about the safety concerns that I, and many other people, have with these systems.
Disclaimer: the majority of this piece is about medical super-resolution in general, and not about the SubtleMR system itself. That specific system is addressed directly near the end.
Super-resolution is, quite literally, the “zoom and enhance” CSI meme in the gif at the top of this piece. You give the computer a low quality image and it turns it into a high resolution one. Pretty cool stuff, especially because it actually kind of works.
In medical imaging though, it’s better than cool. Ever wondered why an MRI costs so much and can have such long wait times? Well, it is because you can only do one scan every 20-30 minutes (with some scans taking an hour or more). The capital and running costs are spread across only one to two dozen patients per day.
So what if you could get an MRI of the same quality in 5 minutes? Maybe two to five times more scans (the “getting the patient ready for the scan” time becomes the bottleneck), meaning lower cost per scan and more throughput.
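The throughput claim is easy to sanity-check with back-of-envelope arithmetic (the 12-hour operating day and 10-minute room turnaround below are my own assumptions, purely for illustration):

```python
# Back-of-envelope MRI throughput arithmetic. Assumed figures:
# a 12-hour scanning day and 10 minutes of non-scan turnaround
# (positioning, coil setup, room changeover) per patient.
HOURS_PER_DAY = 12
TURNAROUND_MIN = 10

def patients_per_day(scan_min: float) -> float:
    """Patients scanned per day for a given scan duration in minutes."""
    return HOURS_PER_DAY * 60 / (scan_min + TURNAROUND_MIN)

current = patients_per_day(25)   # today's ~20-30 minute scans
fast = patients_per_day(5)       # hypothetical super-resolved 5-minute scan
print(f"{current:.0f} -> {fast:.0f} patients/day ({fast / current:.1f}x)")
# prints "21 -> 48 patients/day (2.3x)"
```

With these assumptions you land at roughly a 2.3x increase, consistent with the “two to five times” range above; the fixed turnaround time is exactly why the gain is less than the 5x the raw scan-time reduction would suggest.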