By KIP SULLIVAN JD
Review of The Tyranny of Metrics by Jerry Z. Muller, Princeton University Press, 2018
In the introduction to The Tyranny of Metrics, Jerry Muller urges readers to type “metrics” into Google’s Ngram Viewer, a program that charts the frequency of words and phrases in books published over the last five centuries. He tells us we will find that the use of “metrics” soared after approximately 1985. I followed his instructions and confirmed his conclusion (see graph below). We see the same pattern for two other buzzwords that activate Muller’s BS antennae – “benchmarks” and “performance indicators.” [1]
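For readers who would rather script Muller’s exercise than retype terms into the web interface, the Ngram Viewer page is backed by a JSON endpoint – an assumption worth flagging: the endpoint is undocumented, unsupported by Google, and may change. A minimal Python sketch that builds such a query:

```python
from urllib.parse import urlencode

# Unofficial JSON endpoint behind the Ngram Viewer web page
# (assumption: undocumented and subject to change without notice).
NGRAM_BASE = "https://books.google.com/ngrams/json"

def ngram_query_url(terms, year_start=1800, year_end=2019, smoothing=3):
    """Build a query URL for a list of phrases, mirroring the web UI's options."""
    params = {
        "content": ",".join(terms),   # phrases are comma-separated in one field
        "year_start": year_start,
        "year_end": year_end,
        "smoothing": smoothing,       # moving-average window, as in the web UI
    }
    return f"{NGRAM_BASE}?{urlencode(params)}"

# The three buzzwords Muller flags:
url = ngram_query_url(["metrics", "benchmarks", "performance indicators"])
# Fetching this URL (e.g. with urllib.request) returns JSON time series;
# if the endpoint behaves as the web page does, the curves should show
# the post-1985 surge Muller describes.
```

This only constructs the request; actually retrieving and plotting the series is left to the reader, since the endpoint’s stability cannot be relied upon.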
Muller’s purpose in asking us to perform this little exercise is to set the stage for his sweeping review of the history of “metric fixation,” which he defines as an irresistible “aspiration to replace judgment based on personal experience with standardized measurement.” (p. 6) His book takes a long view – he takes us back to the turn of the last century – and a wide view – he examines the destructive impact of the measurement craze on the medical profession, schools and colleges, police departments, the armed forces, banks, businesses, charities, and foreign aid offices.
Foreign aid? Yes, even that profession. According to a long-time expert in that field, employees of government foreign aid agencies have “become infected with a very bad case of Obsessive Measurement Disorder, an intellectual dysfunction rooted in the notion that counting everything in government programs will produce better policy choices and improved management.” (p. 155)
Muller, a professor of history at the Catholic University of America in Washington, DC, makes it clear at the outset that measurement itself is not the problem. Measurement is helpful in developing hypotheses for further investigation, and it is essential in improving anything that is complex or requires discipline. The object of Muller’s criticism is the rampant use of crude measures of efficiency (cost and quality) to dish out rewards and punishment – bonuses and financial penalties, promotion or demotion, or greater or lesser market share. Measurement can be crude because it fails to adjust scores for factors outside the subject’s control, and because it measures only actions that are relatively easy to measure and ignores valuable but less visible behaviors (such as creative thinking and mentoring). The use of inaccurate measurement is not just a waste of money; it invites undesirable behavior in both the measurers and the “measurees.” The measurers receive misleading information and therefore make less effective decisions (for example, “body count” totals tell them the war in Vietnam is going well), and the subjects of measurement game the measurements (teachers “teach to the test” and surgeons refuse to perform surgery on sicker patients who would have benefited from surgery).
What puzzles Muller, and what motivated him to write this book, is why faith in the inappropriate use of measurement persists in the face of overwhelming evidence that it doesn’t work and has toxic consequences to boot. This mulish persistence in promoting measurement that doesn’t work and often causes harm (including driving good teachers and doctors out of their professions) justifies Muller’s harsh characterization of measurement mavens with phrases like “obsession,” “fixation,” and “cult.” “[A]lthough there is a large body of scholarship in the fields of psychology and economics that call into question the premises and effectiveness of pay for measured performance, that literature seems to have done little to halt the spread of metric fixation,” he writes. “That is why I wrote this book.” (p. 13)
A short history of Obsessive Measurement Disorder in medicine
I read Muller’s book because I share his astonishment at the persistence of the measurement craze in the face of so much evidence that it is not working. Over the three decades that I have studied health policy, I have become increasingly baffled by people who promote various iterations of managed care in the face of evidence that they don’t work. In search of an explanation, I have, as Muller has, read books and news stories about the misuse of measurement in other fields, particularly education and banking. I have been especially baffled by the managed care movement’s enthusiasm for measuring the cost and quality of all actors in the health care system, an enthusiasm that emerged in the late 1980s when it was obvious that the propagation of HMOs, the movement’s founding project, was failing to control inflation. [2]
By the 1990s the enthusiasm for documents that handed out grades to insurance companies and providers on “consumer satisfaction,” mortality rates, etc. had become an obsession. Proponents of “report cards,” as these documents were called, hoped that “consumers” would read them and reward the good actors with their business and punish the bad actors by leaving them. That, of course, did not happen.
Frustrated by consumer disinterest in report cards, managed care proponents, such as the Medicare Payment Advisory Commission (MedPAC) and the Institute of Medicine (IOM), declared in the early 2000s that it was time to punish doctors and hospitals directly by rewarding them if they got good grades on crude measurements and punishing them if they didn’t. The term they used to describe this direct method of punishment was “pay for performance,” a phrase borrowed from the business world. By about 2004, that phrase had become so common in the health policy literature it was shortened to “P4P.”
The complete absence of evidence that P4P would improve the quality of medical care didn’t matter to MedPAC and other P4P advocates. [3] As evidence has piled up over the last decade indicating P4P doesn’t reduce costs and has mixed effects on quality, P4P proponents, true to form, have ignored it. [4]
Taylorism: Ground zero of the epidemic
It is impossible to identify a single Typhoid Mary responsible for the metrics-fixation epidemic, but it is fair to say a very important Typhoid Mary was Frederick Winslow Taylor. Muller identifies the rise of “Taylorism” in manufacturing in the early 1900s as a primary cause of the epidemic. Taylor, an American engineer, studied every action of workers in pig iron factories, estimated the average time of each action, then proposed to pay slower workers less and faster workers more. According to Taylor, determining who was slow and who was fast and paying accordingly required “an elaborate system for monitoring and controlling the workplace,” as Muller puts it. (p. 32) Taylor called his measurement-and-control system “scientific management.”
“Scientific management” assumed that managers with clipboards could distill the wisdom of their work force into a set of rules (later called “best practices,” another buzzword catapulted to stardom in the 1990s) and enforce those rules with pay-for-performance. The outcome of “scientific management,” according to Taylor, was that “all of the planning which under the old system was done by the workmen, must of necessity under the new system be done by management in accordance with the law of science.” (Muller, pp. 32-33) Here we see the beginning of the double standard now prevalent in health policy: People who flog faith-based P4P schemes hold themselves out as the bearers of “scientific” values (“evidence-based medicine,” to use the lingo invented in the early 1990s), while doctors who criticize metrics madness are said to be stuck in a “paternalistic culture.” [5]
The obvious corollary to “scientific management” was that leaders of corporations didn’t need any hands-on experience or training in the production of whatever it was their corporation produced. If you had a degree from a business school that taught “scientific management,” it shouldn’t matter to Sunbeam, for example, that “Chainsaw” Al Dunlap had no knowledge of how appliances are made. As long as he knew “management,” he was qualified to be Sunbeam’s CEO. Decades after Taylorism arose, this same logic would justify allowing executives of insurance companies, Fortune 500 companies, and government insurance programs who never went to medical school to measure and micromanage doctors.
By the 1950s, this notion that standardized data in the hands of managers trumped experience had become deeply embedded in American business culture. By the 1960s, reports Muller, it had spread to the US military (Robert McNamara’s background in accounting got him a job running a car company, and from there he jumped to the Pentagon, where he and his “whiz kids” told the generals to count enemy corpses). By the 1980s it had infected other government agencies and much of the non-profit world, and by the late 1990s it had infected the services sector, including medicine.
Measuring the doctor and patient from afar
“Nowhere are metrics in greater vogue than in the field of medicine,” writes Muller. (p. 103) The following statement by report-card and P4P guru Michael Porter, which Muller took from an article Porter co-authored for the Harvard Business Review, is a good illustration of how P4P proponents think and talk.
Rapid improvement in any field requires measuring results – a familiar principle in management…. Indeed, rigorous measurement of value (outcomes and costs) is perhaps the single most important step in improving health care. Wherever we see systematic measurement of results in health care … we see those results improve. [p. 107]
From this excerpt plus other sections of the Harvard Business Review article, we learn that Porter is absolutely convinced it’s possible to measure “outcomes and costs” accurately, and then divide outcomes by costs to derive “value.”
Note first the voice-of-God tone. God doesn’t have to document anything, and neither does Porter; there are no footnotes in this lengthy essay. Note next the grand assumption that improvement is only possible if “results” are measured. How do we know this? We just do. It’s a “principle in management,” says Porter (no doubt going all the way back to Frederick Taylor). Third, note the misrepresentation of the evidence. It simply isn’t true that “wherever” managers conduct “systematic measurement of results” on doctors and hospitals, costs go down and/or quality goes up.
Muller compares the groupthink represented by Porter with research on both report cards and P4P schemes. The small body of research on report cards finds they have no impact on “consumer” behavior or patient outcomes. The large body of research on P4P indicates it may be raising costs when the costs providers incur to improve “performance” are taken into account, and it has at best a mixed effect on measured quality.
Muller suggests that the net effect of P4P on the health of all patients – those whose care is measured and those whose care is not – is negative. Sicker patients are the ones most at risk in a system where P4P is rampant. Because the measures of cost and quality upon which P4P schemes are based are so inaccurate (scores cannot be adjusted with anything resembling accuracy to reflect factors outside provider control), P4P induces a variety of “gaming” strategies, the worst of which are avoiding sicker patients and shifting resources away from patients whose care is not measured toward those whose care is measured (“treating to the test”).
To illustrate how P4P damages sicker patients, Muller devotes two pages to the damage done by Medicare’s Hospital Readmissions Reduction Program (HRRP). This program, which began in 2012, punishes hospitals that have an above-average rate of 30-day readmissions (admissions that occur within 30 days of a discharge from a hospital) for patients with a half-dozen diagnoses. Muller reports that the HRRP has clearly had two negative effects. First, it has incentivized hospitals to keep sick patients away for at least 30 days after discharge, and if that’s not possible, to let them in but to put them on “observation” status, which means they are not counted as “readmitted.” [6] Second, it has led to the punishment of hospitals that treat sicker and poorer patients.
When Muller publishes a second edition of this book, he’ll no doubt add a page describing research done since his book was published showing that the HRRP appears to be killing patients with congestive heart failure (CHF). CHF is one of the three diagnoses that have been measured by the HRRP since it began (readmissions for heart attack and pneumonia were the other two).
Reversing the epidemic
Muller ends his book with a series of recommendations. He suggests, for example, that measures be developed from the bottom up and that financial rewards and penalties should be kept low if they are to be used at all. He does not attempt to offer political solutions. For this I do not criticize him. His book, which must have required years of research, is a valuable contribution to the largely one-sided debate about P4P in medicine, a debate which has only recently become more audible.
Here are my two cents on the politics of this issue. Groups representing doctors and nurses must take the lead in rolling back measurement mania. Doctors and nurses have great credibility with the public, and they have to cope every day with the consequences of measurement mania. They should focus on rolling back the P4P schemes now inflicted on the fee-for-service Medicare program because Medicare is so influential (“reforms” inflicted by Congress on Medicare are typically mimicked by the insurance industry). Groups working to reduce the cost of health care or improve quality of care for patients should also join the fight. They too have an interest in undermining the tyranny of metrics.
Of course, it would be nice if those who make a living promoting the inappropriate use of measurement would practice what they preach and examine their own behavior to see how it could be improved. Here’s a question that people in that business might pose to themselves now and then: Would you like your work to be subjected to measurement of its cost and quality by third parties, and would you like those third parties to alter your income based on the grades they decide to give you?
Footnotes:
[1] Just to test Ngram, I entered other terms. “Automobile,” for example, rises from zero mentions just before 1900 to a peak around 1938-1942, then declines rapidly, so that the rate by 2000 (the last year on the graph) equals the rate of 1910. “Database,” on the other hand, stays at zero mentions until about 1970, then skyrockets in the late 1970s.
[2] Accurate measurement of the cost and quality of insurance companies and providers was an essential element of “managed competition,” a proposal introduced in 1989 by Alain Enthoven and enthusiastically promoted by Paul Ellwood (the “father of the HMO”), insurance industry executives, Bill and Hillary Clinton, and the editors of the New York Times, to name just a few of Enthoven’s most influential disciples.
[3] A 2006 edition of Medical Care Research and Review devoted entirely to the emerging P4P fad stated, “P4P programs are being implemented in a near-scientific vacuum.”
[4] We are seeing rare exceptions to the P4P groupthink only in the last two or three years. In January 2018, MedPAC formally voted to reverse its decision to recommend P4P at the individual physician level. Donald Berwick, a leading proponent of measurement, announced in 2016 that it was time to reduce the reporting burden on doctors by 50 to 75 percent and to eliminate P4P at the individual level.
[5] The IOM, for example, has peddled measurement and control of providers for decades on the basis of no evidence, yet it maintains a “roundtable” of P4P disciples the IOM deems to be “science-driven.”
[6] “Observation stays” were designed for Medicare beneficiaries who were not clearly in need of inpatient care but who were not clearly ready to go home either. Such patients are typically placed on the same wards as admitted patients but are not treated as admitted.
Kip Sullivan is a member of the Health Care for All MN advisory board, and of MN Physicians for a National Health Program.
I find few compositions that capture so well what is happening in health care, particularly to what remains of primary care, where most Americans are most behind. Stagnant-to-declining revenue, losses of the few remaining lines of revenue, and rising inflationary delivery costs that go uncovered are bad enough, but new and relatively higher micromanagement costs are the nail in the coffin for these practices.
Witness the 2,621 counties that have always ranked lowest in health care workforce: they receive only 20% of primary care spending ($38 billion in 2008), which poorly supports the 25% of the primary care workforce serving this 40% of the population with 45% of the population’s complexity. And this population has the worst social determinants, health literacy, social supports, workforce levels, access levels, employers, health plans, and outcomes. If the health care designers had set out to make the design worse, it would be hard to do.
My hope is that some day the health care designers will be held to the same accountability standards as physicians or human-subjects researchers. But they experiment constantly on the most vulnerable populations without beneficent intent (the focus is usually cost cutting), without adequate testing before implementation, and without informed consent (since they have little idea what will happen). Since it took 50 years from Nuremberg to develop human-subjects research protections, we can count on decades of continued punishment by design.
By the way, these 2,621 counties most behind are on pace to become 50% of the US population. The workforce, spending, and access improvements are all about the top, higher-concentration counties, where health care spending increases the most.
If there were any real focus on access for those most behind, or on health equity, or an understanding of the non-clinical factors that drive outcomes, this micromanagement train would have been derailed years ago. Instead it is somehow managing to integrate the social determinants concept – which is the antithesis of micromanagement via practices and providers.
I cannot count all the times that I have shared this post in the last few weeks with many more to come.
The current issue of Annals of Family Medicine has an article on a future agenda for primary care research. The concepts developed there are included, but the innovation focus is entirely enmeshed in micromanagement. We have come a long way from the counterculture developed and established by the founding fathers of family medicine. We fail to see that the new directions all act to disrupt the most important relationships that health care must be about.
Interesting post. We frequently see the use of inaccurate or misleading data, combined with a sloppy attempt at understanding its meaning, lead to bad decisions, no doubt, but data itself, when properly analyzed, can be a powerful tool for better management. I’d be interested to know what you think about “World Class: A Story of Adversity, Transformation, and Success at NYU Langone Health” by William A. Haseltine.
Great post. Thank you.
“Groups representing doctors and nurses must take the lead in rolling back measurement mania.”
Sadly, those running the groups representing doctors, particularly in primary care, are the most fanatic and deaf to evidence when it comes to quality measures.
I’m not talking about the public representatives, but the behind-the-scenes permanent executive staff. They couldn’t deal with the stress and uncertainties of clinical medicine, and now, with their limited minds, are trying to turn medicine into a box-checking, black and white, no gray areas, assembly line function.
Bad people doing bad things.
Thank you for this. This is not the first circumstance where turning a fundamental human endeavor into math has led to disastrous consequences. Math is useful, but it is not reality. Reality is reality. Math is only useful as a model when it accurately reflects reality. The metric crowd does not get that. They are baffled when improvements in their numbers do not lead to tangible outcomes, or worse, cause harm. The food industry’s use of the calorie is a good example. Essentially, food was turned into math, and the foolishness of “calories in, calories out” followed. Our nutrition thought leaders then used this highly flawed assumption to lead our country into the obesity and chronic disease epidemics that we still endure to this day, causing the premature death of millions of people and costing our health care system billions of dollars.
P4P falls into the same trap and is working its way toward causing equal amounts of harm. Trying to take intangibles like “patient satisfaction” and turn them into a measurable item that determines payment has been a similar nightmare. It has been a significant driver of the opioid crisis. There is an accumulating body of evidence that these surveys are riddled with bias…frequently of the racist or misogynistic variety. These tools are then used to determine physicians’ salaries. That makes them, quite bluntly, illegal. So P4P doesn’t just punish physicians for treating poorer and sicker patients; it punishes them for the color of their skin, their accent, their gender, or whatever biases the patient population can concoct.
So turning patient care into math does help many stakeholders by creating a fake world that can be manipulated for their ultimate purposes. This is why the HRRP can be “successful” despite killing people. It’s why data obsession has not led to improved outcomes in most circumstances. Hospitals complain about things like MIPS/MACRA on the front end…but behind the scenes, it’s easier to make yourself look good in the fake world of metrics than in the real world of outcomes and actual patient experience.
Tyrannical is a good way to summarize the deployment of these metric schemes. There was little or no input from the medical community (still is relative silence, for the reasons I listed above). There is no scientific validation or reproducibility. You wouldn’t think you could turn to a hard science like math to bypass science altogether, but there ya go.
AI eminence Judea Pearl (“The Book of Why”) admonishes “you are smarter than your data” in decrying faddish mindless devotion to being “data-driven.”
Recently, the concept of “clusters” has appeared more frequently in the research literature; the genre is known as CLUSTER ANALYSIS. A Google Scholar search indicates increasing use of that term within its citation search tool, and the pattern of the increase approximates the pattern of Kip’s figure. All of this could be recognized as a modern-day version of Parkinson’s Law: work expands to use the resources available. Meanwhile, annual health spending continues to increase unfettered and longevity at birth continues to decrease. As we haphazardly redefine our nation’s role within the worldwide community, we had best acknowledge our nation’s loss of social cohesion and the community-based social capital that drives it.
For a community based concept for healthcare reform, see
https://nationalhealthusa.net/communityhealthforum/