By KIP SULLIVAN JD
Review of The Tyranny of Metrics by Jerry Z. Muller, Princeton University Press, 2018
In the introduction to The Tyranny of Metrics, Jerry Muller urges readers to type “metrics” into Google’s Ngram, a program that searches through books and other material published over the last five centuries. He tells us we will find that the use of “metrics” soared after approximately 1985. I followed his instructions and confirmed his conclusion (see graph below). We see the same pattern for two other buzzwords that activate Muller’s BS antennae – “benchmarks,” and “performance indicators.” 
Muller’s purpose in asking us to perform this little exercise is to set the stage for his sweeping review of the history of “metric fixation,” which he defines as an irresistible “aspiration to replace judgment based on personal experience with standardized measurement.” (p. 6) His book takes a long view – he takes us back to the turn of the last century – and a wide view – he examines the destructive impact of the measurement craze on the medical profession, schools and colleges, police departments, the armed forces, banks, businesses, charities, and foreign aid offices.
Foreign aid? Yes, even that profession. According to a long-time expert in that field, employees of government foreign aid agencies have “become infected with a very bad case of Obsessive Measurement Disorder, an intellectual dysfunction rooted in the notion that counting everything in government programs will produce better policy choices and improved management.” (p. 155)
Muller, a professor of history at the Catholic University of America in Washington, DC, makes it clear at the outset that measurement itself is not the problem. Measurement is helpful in developing hypotheses for further investigation, and it is essential in improving anything that is complex or requires discipline. The object of Muller’s criticism is the rampant use of crude measures of efficiency (cost and quality) to dish out rewards and punishment – bonuses and financial penalties, promotion or demotion, or greater or lesser market share. Measurement can be crude because it fails to adjust scores for factors outside the subject’s control, and because it measures only actions that are relatively easy to measure and ignores valuable but less visible behaviors (such as creative thinking and mentoring). The use of inaccurate measurement is not just a waste of money; it invites undesirable behavior in both the measurers and the “measurees.” The measurers receive misleading information and therefore make less effective decisions (for example, “body count” totals tell them the war in Vietnam is going well), and the subjects of measurement game the measurements (teachers “teach to the test” and surgeons refuse to operate on sicker patients who would have benefited from surgery).
What puzzles Muller, and what motivated him to write this book, is why faith in the inappropriate use of measurement persists in the face of overwhelming evidence that it doesn’t work and has toxic consequences to boot. This mulish persistence in promoting measurement that doesn’t work and often causes harm (including driving good teachers and doctors out of their professions) justifies Muller’s harsh characterization of measurement mavens with phrases like “obsession,” “fixation,” and “cult.” “[A]lthough there is a large body of scholarship in the fields of psychology and economics that call into question the premises and effectiveness of pay for measured performance, that literature seems to have done little to halt the spread of metric fixation,” he writes. “That is why I wrote this book.” (p. 13)
A short history of Obsessive Measurement Disorder in medicine
I read Muller’s book because I share his astonishment at the persistence of the measurement craze in the face of so much evidence that it is not working. Over the three decades that I have studied health policy, I have become increasingly baffled by people who promote various iterations of managed care in the face of evidence that they don’t work. In search of an explanation, I have, as Muller has, read books and news stories about the misuse of measurement in other fields, particularly education and banking. I have been especially baffled by the managed care movement’s enthusiasm for measuring the cost and quality of all actors in the health care system, an enthusiasm that emerged in the late 1980s when it was obvious that the propagation of HMOs, the movement’s founding project, was failing to control inflation. 
By the 1990s the enthusiasm for documents that handed out grades to insurance companies and providers on “consumer satisfaction,” mortality rates, etc. had become an obsession. Proponents of “report cards,” as these documents were called, hoped that “consumers” would read them and reward the good actors with their business and punish the bad actors by leaving them. That, of course, did not happen.
Frustrated by consumers’ indifference to report cards, managed care proponents, such as the Medicare Payment Advisory Commission (MedPAC) and the Institute of Medicine (IOM), declared in the early 2000s that it was time to act on doctors and hospitals directly, rewarding them if they got good grades on crude measurements and punishing them if they didn’t. The term they used to describe this direct method of reward and punishment was “pay for performance,” a phrase borrowed from the business world. By about 2004, that phrase had become so common in the health policy literature it was shortened to “P4P.”
The complete absence of evidence that P4P would improve the quality of medical care didn’t matter to MedPAC and other P4P advocates. As evidence has piled up over the last decade indicating P4P doesn’t reduce costs and has mixed effects on quality, P4P proponents, true to form, have ignored it.
Taylorism: Ground zero of the epidemic
It is impossible to identify a single Typhoid Mary responsible for the metrics-fixation epidemic, but it is fair to say a very important Typhoid Mary was Frederick Winslow Taylor. Muller identifies the rise of “Taylorism” in manufacturing in the early 1900s as a primary cause of the epidemic. Taylor, an American engineer, studied every action of workers in pig iron factories, estimated the average time of each action, then proposed to pay slower workers less and faster workers more. According to Taylor, determining who was slow and who was fast and paying accordingly required “an elaborate system for monitoring and controlling the workplace,” as Muller puts it. (p. 32) Taylor called his measurement-and-control system “scientific management.”
“Scientific management” assumed that managers with clipboards could distill the wisdom of their work force into a set of rules (later called “best practices,” another buzzword catapulted to stardom in the 1990s) and enforce those rules with pay-for-performance. The outcome of “scientific management,” according to Taylor, was that “all of the planning which under the old system was done by the workmen, must of necessity under the new system be done by management in accordance with the law of science.” (Muller, pp. 32-33) Here we see the beginning of the double standard now prevalent in health policy: People who flog faith-based P4P schemes hold themselves out as the bearers of “scientific” values (“evidence-based medicine,” to use the lingo invented in the early 1990s), while doctors who criticize metrics madness are said to be stuck in a “paternalistic culture.” 
The obvious corollary to “scientific management” was that leaders of corporations didn’t need any hands-on experience or training in the production of whatever it was their corporation produced. If you had a degree from a business school that taught “scientific management,” it shouldn’t matter to Sunbeam, for example, that “Chainsaw” Al Dunlap had no knowledge of how appliances are made. As long as he knew “management,” he was qualified to be Sunbeam’s CEO. Decades after Taylorism arose, this same logic would justify allowing executives of insurance companies, Fortune 500 companies, and government insurance programs who never went to medical school to measure and micromanage doctors.
By the 1950s, this notion that standardized data in the hands of managers trumped experience had become deeply embedded in American business culture. By the 1960s, reports Muller, it had spread to the US military (Robert McNamara’s background in accounting got him a job running a car company, and from there he jumped to the Pentagon, where he and his “whiz kids” told the generals to count enemy corpses). By the 1980s it had infected other government agencies and much of the non-profit world, and by the late 1990s it had infected the services sector, including medicine.
Measuring the doctor and patient from afar
“Nowhere are metrics in greater vogue than in the field of medicine,” writes Muller. (p. 103) The following statement by report-card and P4P guru Michael Porter, which Muller took from an article Porter co-authored for the Harvard Business Review, is a good illustration of how P4P proponents think and talk.
Rapid improvement in any field requires measuring results – a familiar principle in management…. Indeed, rigorous measurement of value (outcomes and costs) is perhaps the single most important step in improving health care. Wherever we see systematic measurement of results in health care … we see those results improve. [p. 107]
From this excerpt plus other sections of the Harvard Business Review article, we learn that Porter is absolutely convinced it’s possible to measure “outcomes and costs” accurately, and then divide outcomes by costs to derive “value.”
Note first the voice-of-God tone. God doesn’t have to document anything, and neither does Porter; there are no footnotes in this lengthy essay. Note next the grand assumption that improvement is only possible if “results” are measured. How do we know this? We just do. It’s a “principle of management,” says Porter (no doubt going all the way back to Frederick Taylor). Third, note the misrepresentation of the evidence. It simply isn’t true that “wherever” managers conduct “systematic measurement” of “performance” by doctors and hospitals, costs go down and/or quality goes up.
Muller compares the groupthink represented by Porter with research on both report cards and P4P schemes. The small body of research on report cards finds they have no impact on “consumer” behavior or patient outcomes. The large body of research on P4P indicates it may be raising costs once the costs providers incur to improve their “performance” are taken into account, and that it has at best a mixed effect on measured quality.
Muller suggests that the net effect of P4P on the health of all patients, that is, those whose care is measured and those whose care is not, is negative. Sicker patients are the ones most at risk in a system where P4P is rampant. Because the measures of cost and quality on which P4P schemes are based are so inaccurate (scores cannot be adjusted with anything resembling accuracy to reflect factors outside provider control), P4P induces a variety of “gaming” strategies, the worst of which are avoiding sicker patients and shifting resources away from patients whose care is not measured toward those whose care is measured (“treating to the test”).
To illustrate how P4P damages sicker patients, Muller devotes two pages to the damage done by Medicare’s Hospital Readmissions Reduction Program (HRRP). This program, which began in 2012, punishes hospitals that have an above-average rate of 30-day readmissions (admissions that occur within 30 days of a discharge from a hospital) for patients with a half-dozen diagnoses. Muller reports that the HRRP has clearly had two negative effects. First, it has incentivized hospitals to keep sick patients away for at least 30 days after discharge, and if that’s not possible, to let them in but to put them on “observation” status, which means they are not counted as “readmitted.”  Second, it has led to the punishment of hospitals that treat sicker and poorer patients.
When Muller publishes a second edition of this book, he’ll no doubt add a page describing research done since his book was published showing that the HRRP appears to be killing patients with congestive heart failure (CHF). CHF was one of the three diagnoses that have been measured by the HRRP since it began (readmissions for heart attack and pneumonia were the other two).
Reversing the epidemic
Muller ends his book with a series of recommendations. He suggests, for example, that measures be developed from the bottom up and that financial rewards and penalties should be kept low if they are to be used at all. He does not attempt to offer political solutions. For this I do not criticize him. His book, which must have required years of research, is a valuable contribution to the largely one-sided debate about P4P in medicine, a debate which has only recently become more audible.
Here are my two cents on the politics of this issue. Groups representing doctors and nurses must take the lead in rolling back measurement mania. Doctors and nurses have great credibility with the public, and they have to cope every day with the consequences of measurement mania. They should focus on rolling back the P4P schemes now inflicted on the fee-for-service Medicare program because Medicare is so influential (“reforms” inflicted by Congress on Medicare are typically mimicked by the insurance industry). Groups working to reduce the cost of health care or improve quality of care for patients should also join the fight. They too have an interest in undermining the tyranny of metrics.
Of course, it would be nice if those who make a living promoting the inappropriate use of measurement would practice what they preach and examine their own behavior to see how it could be improved. Here’s a question that people in that business might pose to themselves now and then: Would you like your work to be subjected to measurement of its cost and quality by third parties, and would you like those third parties to alter your income based on the grades they decide to give you?
Notes

1. Just to test Ngram, I entered other terms. “Automobile,” for example, rises from zero mentions just before 1900 to a peak around 1938-1942, then declines rapidly, so that by 2000 (the last year on the graph) its rate equals the rate of 1910. “Database,” on the other hand, stays at zero mentions until about 1970, then skyrockets in the late 1970s.

2. Accurate measurement of the cost and quality of insurance companies and providers was an essential element of “managed competition,” a proposal introduced in 1989 by Alain Enthoven and enthusiastically promoted by Paul Ellwood (the “father of the HMO”), insurance industry executives, Bill and Hillary Clinton, and the editors of the New York Times, to name just a few of Enthoven’s most influential disciples.

3. A 2006 edition of Medical Care Research and Review devoted entirely to the emerging P4P fad stated, “P4P programs are being implemented in a near-scientific vacuum.”

4. We have seen rare exceptions to the P4P groupthink only in the last two or three years. In January 2018, MedPAC formally voted to reverse its decision to recommend P4P at the individual physician level. Donald Berwick, a leading proponent of measurement, announced in 2016 that it was time to reduce the reporting burden on doctors by 50 to 75 percent and to eliminate P4P at the individual level.

5. The IOM, for example, has peddled measurement and control of providers for decades on the basis of no evidence, yet it maintains a “roundtable” of P4P disciples the IOM deems to be “science-driven.”

6. “Observation stays” were designed for Medicare beneficiaries who were not clearly in need of inpatient care but who were not clearly ready to go home either. Such patients are typically placed on the same wards as admitted patients but are not formally admitted.
Kip Sullivan is a member of the Health Care for All MN advisory board, and of MN Physicians for a National Health Program.