By BRYAN CARMODY, MD
One of the most fun things about the United States Medical Licensing Examination (USMLE) pass/fail debate is that it’s accessible to everyone. Some controversies in medicine are discussed only by the initiated few – but if we’re talking USMLE, everyone can participate.
Simultaneously, one of the most frustrating things about the USMLE pass/fail debate is that everyone’s an expert. See, everyone in medicine has experience with the exam, and on the basis of that, we all think that we know everything there is to know about it.
Unfortunately, there’s a lot of misinformation out there – especially when we’re talking about Step 1 score interpretation. In fact, some of the loudest voices in this debate are the most likely to repeat misconceptions and outright untruths.
Hey, I’m not pointing fingers. Six months ago, I thought I knew all that I needed to know about the USMLE, too – just because I’d taken the exams in the past.
But I’ve learned a lot about the USMLE since then, and in the interest of helping you interpret Step 1 scores in an evidence-based manner, I’d like to share some of that with you here.
If you think I’m just going to freely give up this information, you’re sorely mistaken. Just as I’ve done in the past, I’m going to make you work for it, one USMLE-style multiple choice question at a time.
A 25 year old medical student takes USMLE Step 1. She scores a 240, and fears that this score will be insufficient to match at her preferred residency program. Because examinees who pass the test are not allowed to retake the examination, she constructs a time machine; travels back in time; and retakes Step 1 without any additional study or preparation.
Which of the following represents the 95% confidence interval for the examinee’s repeat score, assuming the repeat test has different questions but covers similar content?
The correct answer is D, 228-252.
No estimate is perfectly precise. But that’s what the USMLE (or any other test) gives us: a point estimate of the test-taker’s true knowledge.
So how precise is that estimate? That is, if we let an examinee take the test over and over, how closely would the scores cluster?
To answer that question, we need to know the standard error of measurement (SEM) for the test.
The SEM is a function of both the standard deviation and reliability of the test, and represents how much an individual examinee’s observed score might vary if he or she took the test repeatedly using different questions covering similar material.
So what’s the SEM for Step 1? According to the USMLE’s Score Interpretation Guidelines, the SEM for the USMLE is 6 points.
Around 68% of scores will fall within +/- 1 SEM, and around 95% of scores will fall within +/- 2 SEM. Thus, if we accept the student’s original Step 1 score as our best estimate of her true knowledge, then we’d expect a repeat score to fall between 234 and 246 around two-thirds of the time. And 95% of the time, her score would fall between 228 and 252.
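The interval arithmetic is simple enough to sketch in a few lines. The 240 score comes from the vignette, and the 6-point SEM from the USMLE’s Score Interpretation Guidelines:

```python
# Interval math from the vignette: observed score of 240, SEM of 6
# points (per the USMLE Score Interpretation Guidelines).
observed_score = 240
sem = 6

# ~68% of repeat scores fall within 1 SEM of the observed score;
# ~95% fall within 2 SEM.
ci_68 = (observed_score - sem, observed_score + sem)
ci_95 = (observed_score - 2 * sem, observed_score + 2 * sem)

print(ci_68)  # (234, 246)
print(ci_95)  # (228, 252)
```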
Think about that range for a moment.
The +/- 1 SEM range is 12 points; the +/- 2 SEM range is 24 points. Even if you believe that Step 1 tests meaningful information that is necessary for successful participation in a selective residency program, how many people are getting screened out of those programs by random chance alone?
(To their credit, the NBME began reporting a confidence interval to examinees with the 2019 update to the USMLE score report.)
Learning Objective: Step 1 scores are not perfectly precise measures of knowledge – and that imprecision should be considered when interpreting their values.
A 46 year old program director seeks to recruit only residents of the highest caliber for a selective residency training program. To accomplish this, he reviews the USMLE Step 1 scores of three pairs of applicants, shown below.
- 230 vs. 235
- 232 vs. 242
- 234 vs. 249
For how many of these candidate pairs can the program director conclude that there is a statistical difference in knowledge between the applicants?
A) Pairs 1, 2, and 3
B) Pairs 2 and 3
C) Pair 3 only
D) None of the above
The correct answer is D, none of the above.
As we learned in Question 1, Step 1 scores are not perfectly precise. In a mathematical sense, an individual’s Step 1 score on a given day represents just one sampling from the distribution centered around their true mean score (if the test were taken repeatedly).
So how far apart do two individual samples have to be for us to confidently conclude that they came from distributions with different means? In other words, how far apart do two candidates’ Step 1 scores have to be for us to know that there is really a significant difference between the knowledge of each?
We can answer this by using the standard error of difference (SED). When the two samples are at least 2 SED apart, then we can be confident that there is a statistical difference between those samples.
So what’s the SED for Step 1? Again, according to the USMLE’s statisticians, it’s 8 points.
That means that, for us to have 95% confidence that two candidates really have a difference in knowledge, their Step 1 scores must be 16 or more points apart.
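A minimal sketch of this comparison, assuming the usual relationship SED = sqrt(2) × SEM for two scores with equal measurement error (which is how a 6-point SEM yields an SED of roughly 8):

```python
from math import sqrt

# The SED for comparing two scores combines the measurement error of
# both: sqrt(2) * SEM = ~8.49 points, which the USMLE rounds to 8.
sem = 6
sed = sqrt(2) * sem

def statistically_different(score_a, score_b, sed=8):
    """Two scores differ at ~95% confidence only when they are at
    least 2 SED (16 points) apart."""
    return abs(score_a - score_b) >= 2 * sed

# The three applicant pairs from the vignette
pairs = [(230, 235), (232, 242), (234, 249)]
print([statistically_different(a, b) for a, b in pairs])
# [False, False, False] -- no pair is 16 or more points apart
```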
Now, is that how you hear people talking about Step 1 scores in real life? I don’t think so. I frequently hear people discussing how a 5-10 point difference in scores is a major difference that totally determines success or failure within a program or specialty.
And you know what? Mathematics aside, they’re not wrong. Because when programs use rigid cutoffs for screening, only the point estimate matters – not the confidence interval. If your dream program has a cutoff score of 235, and you show up with a 220 or a 225, your score might not be statistically different – but your dream is over.
Learning Objective: To confidently conclude that two students’ Step 1 scores really reflect a difference in knowledge, they must be at least 16 points apart.
A physician took USMLE Step 1 in 1994, and passed with a score of 225. Now he serves as program director for a selective residency program, where he routinely screens out applicants with scores lower than 230. When asked about his own Step 1 score, he explains that today’s USMLE scores are “inflated” relative to those of 25 years ago, and that if he took the test today, his score would be much higher.
Assuming that neither the test’s content nor the physician’s knowledge had changed since 1994, which of the following is the most likely score the physician would attain if he took Step 1 in 2019?
The correct answer is B, 225.
I hear this kind of claim all the time on Twitter. So once and for all, let’s separate fact from fiction.
FACT: Step 1 scores for U.S. medical students are rising.
See the graphic below.
FICTION: The rise in scores reflects a change in the test or the way it’s scored.
See, the USMLE has never undergone a “recentering” like the old SAT did. Students score higher on Step 1 today than they did 25 years ago because students today answer more questions correctly than those 25 years ago.
Why? Because Step 1 scores matter more now than they used to. Accordingly, students spend more time in dedicated test prep (using more efficient studying resources) than they did back in the day. The net result? The bell curve of Step 1 scores shifts a little farther to the right each year.
Just how far the distribution has already shifted is impressive.
When the USMLE began in the early 1990s, a score of 200 was a perfectly respectable score. Matter of fact, it put you exactly at the mean for U.S. medical students.
Know what a score of 200 gets you today?
A score in the 9th percentile, and screened out of almost any residency program that uses cut scores. (And nearly two-thirds of all programs do.)
So the program director in the vignette above did pretty well for himself by scoring a 225 twenty-five years ago. A score that high (1.25 standard deviations above the mean) would have placed him around the 90th percentile for U.S. students. To hit the same percentile today, he’d need to drop a 255.
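Python’s `statistics.NormalDist` can reproduce this percentile conversion. The 1994 mean of 200 and SD of 20 follow from the text (225 sits 1.25 SD above the mean); the 2019 mean of 230 is my assumption, chosen only to illustrate how the conversion works:

```python
from statistics import NormalDist

# Assumed parameters: 1994 mean of 200 and SD of 20 follow from the
# text (225 = 1.25 SD above the mean); the 2019 mean of 230 is a
# guess used purely to illustrate the percentile math.
dist_1994 = NormalDist(mu=200, sigma=20)
dist_2019 = NormalDist(mu=230, sigma=20)

percentile = dist_1994.cdf(225)           # ~0.89, i.e. ~90th percentile
equivalent = dist_2019.inv_cdf(percentile)
print(round(equivalent))                  # 255
```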
Now, can you make the argument that the type of student who scored in the 90th percentile in the past would score in the 90th percentile today? Sure. He might – but not without devoting a lot more time to test prep.
As I’ve discussed in the past, this is one of my biggest concerns with Step 1 Mania. Students are trapped in an arms race with no logical end, competing to distinguish themselves on the metric we’ve told them matters. They spend more and more time learning basic science that’s less and less clinically relevant, all at the expense (if not outright exclusion) of material that might actually benefit them in their future careers.
(If you’re not concerned about the rising temperature in the Step 1 frog pot, just sit tight for a few years. The mean Step 1 score is rising at around 0.9 points per year. Just come on back in a while once things get hot enough for you.)
Learning Objective: Step 1 scores are rising – not because of a change in test scoring, but because of honest-to-God higher performance.
Two medical students take USMLE Step 1. One scores a 220 and is screened out of his preferred residency program. The other scores a 250 and is invited for an interview.
Which of the following represents the most likely absolute difference in correctly-answered test items for this pair of examinees?
The correct answer is B, 30.
How many questions do you have to answer correctly to pass USMLE Step 1? What percentage do you have to get right to score a 250, or a 270? We don’t know.
See, the NBME does not disclose how it arrives at a three digit score. And I don’t have any inside information on this subject. But we can use logic and common sense to shed some light on the general processes and data involved and arrive at a pretty good guess.
First, we need to briefly review how the minimum passing score for the USMLE is set, using a modified Angoff procedure.
The Angoff procedure involves presenting items on the test to subject matter experts (SMEs). The SMEs review each question item and predict what percentage of minimally competent examinees would answer the question correctly.
Here’s an example of what Angoff data look like (the slide is from a recent lecture).
As you can see, Judge A suspected that 59% of minimally competent candidates – the bare minimum we could tolerate being gainfully engaged in the practice of medicine – would answer Item 1 correctly. Judge B thought 52% of the same group would get it right, and so on.
Now, here’s the thing about the version of the Angoff procedure used to set the USMLE’s passing standard. Judges don’t just blurt out a guess off the top of their head and call it a day. They get to review data regarding real-life examinee performance, and are permitted to use that to adjust their initial probabilities.
Here’s an example of the performance data that USMLE subject matter experts receive. This graphic shows that test-takers who were in the bottom 10% of overall USMLE scores answered a particular item correctly 63% of the time.
(As a sidenote, when judges are shown data on actual examinee performance, their predictions shift toward the data they’ve been shown. In theory, that’s a good thing. But multiple studies – including one done by the NBME – show that judges change their original probabilities even when they’re given totally fictitious data on examinee performance.)
For the moment, let’s accept the modified Angoff procedure as being valid. Because if we do, it gives us the number we need to set the minimum passing score. All we have to do is calculate the mean of all the probabilities assigned for that group of items by the subject matter experts.
In the slide above, the mean probability that a minimally competent examinee would correctly answer these 10 items was 0.653 (red box). In other words, if you took this 10 question test, you’d need to score better than 65% (i.e., 7 items correct) to pass.
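The arithmetic is easy to sketch. These per-item probabilities are hypothetical stand-ins, chosen only so that (like the slide described above) they average to 0.653:

```python
import math

# Hypothetical per-item probabilities, chosen so that -- like the
# slide described in the text -- they average to 0.653.
item_probabilities = [0.59, 0.52, 0.70, 0.61, 0.75,
                      0.68, 0.55, 0.72, 0.66, 0.75]

# The passing standard is the mean of the judges' estimates
passing_standard = sum(item_probabilities) / len(item_probabilities)

# Clearing a 65.3% standard on a 10-item test means answering
# at least 7 items correctly
items_needed = math.ceil(passing_standard * len(item_probabilities))
print(round(passing_standard, 3), items_needed)  # 0.653 7
```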
And if we wanted to assign scores to examinees who performed better than the passing standard, we could. But, we’ll only have 3 questions with which to do it, since we used 7 of the 10 questions to define the minimally competent candidate.
So how many questions do we have to assign scores to examinees who pass USMLE Step 1?
Well, Step 1 includes 7 sections with up to 40 questions in each. So there are at most 280 questions on the exam.
However, around 10% of these are “experimental” items. These questions do not count toward the examinee’s score – they’re on the test to generate performance data (like Figure 1 above) to present in the future to subject matter experts. Once these items have been “Angoffed”, they will become scored items on future Step 1 tests, and a new wave of experimental items will be introduced.
If we take away the 10% of items that are experimental, then we have at most 252 questions to score.
How many of these questions must be answered correctly to pass? Here, we have to use common sense to make a ballpark estimate.
After all, a candidate with no medical knowledge who just guessed answers at random might get 25% of the questions correct. Intuitively, it seems like the lower bound of knowledge to be licensed as a physician has to be north of 50% of items, right?
At the same time, we know that the USMLE doesn’t include very many creampuff questions that everyone gets right. Those questions provide no discriminatory value. Actually, I’d wager that most Step 1 questions have performance data that looks very similar to Figure 1 above (which was taken from an NBME paper).
A question like the one shown – which 82% of examinees answered correctly – has a nice spread of performance across the deciles of exam performance, ranging from 63% among low performers to 95% of high performers. That’s a question with useful discrimination for an exam like the USMLE.
Still, anyone who’s taken Step 1 knows that some questions will be much harder, and that fewer than 82% of examinees will answer correctly. If we conservatively assume that there are only a few of these “hard questions” on the exam, then we might estimate that the average Step 1 taker is probably getting around 75% of questions right. (It’s hard to make a convincing argument that the average examinee could possibly be scoring much higher. And in fact, one of the few studies that mention this issue actually reports that the mean item difficulty was 76%.)
The minimum passing standard has to be lower than the average performance – so let’s ballpark that to be around 65%. (Bear in mind, this is just an estimate – and I think, a reasonably conservative one. But you can run the calculations with lower or higher percentages if you want. The final numbers I show below won’t be that much different than yours unless you use numbers that are implausible.)
Everyone still with me? Great.
Now, if a minimally competent examinee has to answer 65% of questions right to pass, then we have only 35% of the ~252 scorable questions available to assign scores among all of the examinees with more than minimal competence.

In other words, we’re left with somewhere around 85 questions to help us assign scores in the passing range.
The current minimum passing score for Step 1 is 194. And while the maximum score is 300 in theory, the real world distribution goes up to around 275.
Think about that. We have ~85 questions to determine scores across a range of about 81 points. That’s approximately one point per question.
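The whole back-of-the-envelope estimate condenses to a few lines. The 65% passing fraction and ~275 realistic maximum are the article’s ballparks, not published figures:

```python
# Back-of-the-envelope arithmetic. The 65% passing fraction and the
# ~275 realistic maximum score are ballpark estimates, not published
# NBME figures.
total_items = 7 * 40                       # 280 questions maximum
scored_items = round(total_items * 0.90)   # ~10% experimental -> 252
passing_fraction = 0.65

# Questions left over to spread scores across the passing range
score_questions = scored_items * (1 - passing_fraction)

score_range = 275 - 194                    # realistic max minus passing score
points_per_question = score_range / score_questions
print(round(score_questions))              # 88 questions
print(round(points_per_question, 2))       # 0.92 -- about one point each
```

The exact count depends on rounding the inputs; anything in the mid-to-high 80s is consistent with the estimates in the text.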
Folks, this is what drives #Step1Mania.
Note, however, that the majority of Step 1 scores for U.S./Canadian students fall across a thirty point range from 220 to 250.
That means that, despite the power we give to USMLE Step 1 in residency selection, the absolute performance for most applicants is similar. In terms of raw number of questions answered, most U.S. medical students differ by fewer than 30 correctly-answered multiple choice questions. That’s around 10% of a seven hour, 280 question test administered on a single day.
And what important topics might those 30 questions test? Well, I’ve discussed that in the past.
Learning Objective: In terms of raw performance, most U.S. medical students likely differ by 30 or fewer correctly-answered questions on USMLE Step 1 (~10% of a 280 question test).
A U.S. medical student takes USMLE Step 1. Her score is 191. Because the passing score is 194, she cannot seek licensure.
Which of the following reflects the probability that this examinee will pass the test if she takes it again?
The correct answer is C, 64%.
In 2016, 96% of first-time test takers from U.S. allopathic medical schools passed Step 1. For those who repeated the test, the pass rate was 64%. What that means is that >98% of U.S. allopathic medical students ultimately pass the exam.
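The >98% figure follows directly from the two pass rates, even counting only a single retake:

```python
# Ultimate pass rate implied by the two rates in the text,
# counting only a single retake (some examinees retake more
# than once, so the true figure is slightly higher).
first_time_pass = 0.96
repeat_pass = 0.64

after_one_retake = first_time_pass + (1 - first_time_pass) * repeat_pass
print(round(after_one_retake, 3))   # 0.986
```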
I bring this up to highlight again how the Step 1 score is an estimate of knowledge at a specific point in time. And yet, we often treat Step 1 scores as if they are an immutable personality characteristic – a medical IQ, stamped on our foreheads for posterity.
But medical knowledge changes over time. I took Step 1 in 2005. If I took the test today, I would absolutely score lower than I did back then. I might even fail the test altogether.
But here’s the thing: which version of me would you want caring for your child? The 2005 version or the 2019 version?
The more I’ve thought about it, the stranger it seems that we even use this test for licensure (let alone residency selection). After all, if our goal is to evaluate competency for medical practice, shouldn’t a doctor in practice be able to pass the exam? I mean, if we gave a test of basketball competency to an NBA veteran, wouldn’t he do better than a player just starting his career? If we gave a test of musical competency to a concert pianist with a decade of professional experience, shouldn’t she score higher than a novice?
If we accept that the facts tested on Step 1 are essential for the safe and effective practice of medicine, is there really a practical difference between an examinee who doesn’t know these facts initially and one who knew them once but forgets them over time? If the exam truly tests competency, aren’t both of these examinees equally incompetent?
We have made the Step 1 score into the biggest false god in medical education.
By itself, Step 1 is neither good nor bad. It’s just a multiple choice test of medically-oriented basic science facts. It measures something – and if we appropriately interpret the measurement in context with the test’s content and limitations, it may provide some useful information, just like any other test might.
It’s our idolatry of the test that is harmful. We pretend that the test measures things that it doesn’t – because it makes life easier to do so. After all, it’s hard to thin a giant pile of residency applications with nuance and confidence intervals. An applicant with a 235 may be no better (or even, no different) than an applicant with a 230 – but by God, a 235 is higher.
It’s well beyond time to critically appraise this kind of idol worship. Whether you support a pass/fail Step 1 or not, let’s at least commit to sensible use of psychometric instruments.
Learning Objective: A Step 1 score is a measurement of knowledge at a specific point in time. But knowledge changes over time.
So how’d you do?
I realize that some readers may support a pass/fail Step 1, while others may want to maintain a scored test. So to be sure everyone receives results of this test in their preferred format, I made a score report for both groups.
Just like the real test, each question above is worth 1 point. And while some of you may say it’s non-evidence based, this is my test, and I say that one point differences in performance allow me to make broad and sweeping categorizations about you.
1 POINT – UNMATCHED
But thanks for playing. Good luck in the SOAP!
2 POINTS – ELIGIBLE FOR LICENSURE
Nice job. You’ve got what it takes to be licensed. (Or at least, you did on a particular day.)
3 POINTS – INTERVIEW OFFER!
Sure, the content of these questions may have essentially nothing to do with your chosen discipline, but your solid performance got your foot in the door. Good work.
4 POINTS – HUSAIN SATTAR, M.D.
You’re not just a high scorer – you’re a hero and a legend.
5 POINTS – NBME EXECUTIVE
Wow! You’re a USMLE expert. You should celebrate your outstanding performance with some $45 tequila shots while dancing at eye-level with the city skyline.
FAIL

You regard USMLE Step 1 scores with a kind of magical thinking. They are not simply a one-time point estimate of basic science knowledge, or a tool that can somewhat usefully be applied to thin a pile of residency applications. Nay, they are a robust and reproducible glimpse into the very being of a physician, a perfectly predictive vocational aptitude test that is beyond reproach or criticism.
PASS

You realize that, whatever Step 1 measures, it is rather imprecise in measuring that thing. You further appreciate that, when Step 1 scores are used for whatever purpose, there are certain practical and theoretical limitations on their utility. You understand – in real terms – what a Step 1 score really means.
(I only hope that the pass rate for this exam is as high as the real Step 1 pass rate.)
Dr. Carmody is a pediatric nephrologist and medical educator at Eastern Virginia Medical School. This post originally appeared on The Sheriff of Sodium here.