Steven Goldberg is probably best known for the controversial “Billions of Drops in Millions of Buckets: Why Philanthropy Doesn’t Advance Social Progress.” In this post he looks at the ways in which success and failure are measured in his field. Healthcare audiences will note many familiar themes. What should we measure? How should we measure it? How much weight should we give the results? And perhaps most importantly: what other questions should we be asking? — John Irvine
Conventional wisdom holds that randomized controlled trials (RCTs) are the “gold standard” of evaluation. In fact, RCTs only make sense under very strict conditions that can rarely be met in the real world. Most of the time, RCTs produce inconclusive results and simply aren’t worth the time and money. As the social sector assumes greater responsibility for improving the lives of many more people, it should focus less on pseudo-scientific “proof” that programs work and focus more on making good programs better.
Now that the Social Innovation Fund (SIF) appears to have survived the “transparency” commotion, the eleven chosen intermediary grantmakers have less than six months to select their portfolios of nonprofit grantees.
As a commendable exercise in “evidence-based” grantmaking, SIF requires the intermediaries to incorporate evaluation into every step of their awards, from the initial competitive solicitations all the way through final payments and renewals. Applicants will be required to explain how their success should be measured and demonstrate their capacity to do so, and awards will be contingent upon the establishment of meaningful performance metrics, the timely collection and reporting of reliable data, and the faithful implementation of sound evaluation protocols.
The evidence-based rubric represents a significant advancement for the social sector, which has historically relied on anecdotal indicators of success, that is, entirely subjective and largely unreliable assessments of whether programs work. Now that philanthropy and nonprofits are pursuing more ambitious goals for improving educational achievement, extending economic opportunity and alleviating poverty at scale, there is growing acceptance of the need for more objective measures of effectiveness, reinforced by the fact that scarce tax dollars are going to nongovernmental organizations promising to deliver more effective solutions. As a general matter, this heightened appreciation for serious evaluation is an encouraging development.
There is a countervailing risk, however, that having ignored systematic evaluation for too long, the social sector will now overcompensate in the other direction, requiring evaluations conducted under “laboratory conditions” that generally cannot be met in the field and that are biased toward concluding that funded programs have no apparent impact. Before we set ourselves up for failure, this is a good time to pause and think carefully about how high we set the evaluation bar.
The problem arises from the seemingly innocuous assumption that judging whether an expenditure of public funds (or, in the case of SIF, a combination of public and charitable funds) was “worth it” requires a determination of whether the funding produced a measurable amount of social benefit. That is, can the grantee show that the observed results were attributable to the funded program? If so, we are told, the expenditure succeeded; if not, it failed. Otherwise, the thinking goes, we’d fall back to our old unaccountable ways in which programs that seemed at first to work later failed to produce enduring or significant benefits, particularly with increased funding and growth.
When auspicious programs fail to produce long-term or transformative benefits, the underlying assumption is that the observed short-term results were either illusory or the product of factors other than the funded intervention, or both. There is another explanation, however: the impacts were real albeit modest, but the program was not properly nurtured to reproduce and extend those benefits.
These alternative explanations comprise two quite different views of how we can bring about social progress. In one view, the burden of proof is placed squarely upon those who claim that the programs “worked”: if their effectiveness cannot be quantified, and the cause-and-effect relationship between the program and the apparent results cannot be proven affirmatively, it would be unscientific and therefore irresponsible to continue or expand their operation. That is to say, social programs are presumed ineffective unless they can be proven effective; the failure to demonstrate effectiveness is taken as proof of ineffectiveness. Let’s call this the “purist” perspective.
The fundamental problem with this approach is that it assumes we have good tools to determine whether programs are effective and how much impact they produced. If so, it would indeed make sense to place the burden of proof on program proponents to show their efforts were successful. After all, they’re the ones asking for funding and they control how the funds are spent and programs are implemented.
But if the assumption is false — that is, if we can’t reliably verify and measure the effectiveness of social programs — then imposing the burden in that way merely stacks the deck against social innovation and leaves program advocates without any effective means to make their case. In that case, we’d just be denying ourselves valuable innovations for no good reason.
The second and more “pragmatic” approach to evaluation starts from the premise that the causes and cures of social problems are simply too elusive and complex to measure or explain precisely. Pragmatists believe that, at best, policy makers can make informed judgments about which interventions are probably more effective and which are probably less effective, and social policy should try to identify the more effective policies and continuously improve them.
Decades of academic research have shown there is most definitely a place for the purist approach. It has usefully prevented the acceptance of ineffective and dangerous medical treatments, unsound scientific theories and spurious practices in business and industry. There are even some examples, such as the mighty Nurse Family Partnership, where RCTs have demonstrated the powerful impacts of social innovation. But it would be foolish to give that approach more credit than it deserves or to apply it where the circumstances do not warrant.
The justification for the purist approach is methodological rigor. By setting a high bar for proving that social programs work, we reduce the risk of concluding that an ineffective program actually succeeded, what evaluation professionals call a “false positive.” If we mistakenly invest additional resources in a false positive, not only are we going to be disappointed down the road, but the longer we do so, the more disappointed we will be. In addition, the chorus of “we-told-you-so’s” by program opponents will become correspondingly louder, and investing more resources in ineffective programs means that fewer resources are available for alternative approaches that might have been more effective. So adopting an evaluation methodology designed to minimize false positives has a lot going for it.
On the other hand, false negatives aren’t much fun, either. A false negative means that a successful or at least promising program was allowed to die on the vine. It represents an opportunity lost at a time of desperate and worsening social and economic need.
From the perspective of methodological rigor, false positives and false negatives are exact mirror images of each other. Neither is inherently better or worse. In both cases, an erroneous decision is made and resources are misallocated. For false positives, money is wasted on programs that don’t work; for false negatives, money isn’t spent on programs that do (or might) work. Avoiding false positives reflects “do no harm” thinking, which, carried to extremes, can lead to “do no good.” By stacking the deck against false positives, we reduce the risk of being too gullible by increasing the risk of being too skeptical.
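The trade-off can be put in rough numbers. The sketch below is an illustration, not something drawn from a real evaluation: it assumes a simple one-sided comparison of two equal-sized groups under a normal approximation, and the effect size (0.2 standard deviations) and group size (100 people) are hypothetical. For a program that really does work, it estimates how the chance of a false negative changes as the tolerated chance of a false positive is pushed down.

```python
from math import sqrt
from statistics import NormalDist


def false_negative_rate(alpha, effect_size, n_per_group):
    """Chance of missing a real effect in a one-sided two-group z-test.

    alpha: tolerated false-positive rate.
    effect_size: true difference in means divided by the standard
        deviation (a hypothetical input, not a measured value).
    n_per_group: people in each of the treatment and control groups.
    """
    std_normal = NormalDist()
    # Bar the test statistic must clear to be called "significant"
    critical_z = std_normal.inv_cdf(1 - alpha)
    # Where the test statistic is centered if the effect is real
    expected_z = effect_size * sqrt(n_per_group / 2)
    power = 1 - std_normal.cdf(critical_z - expected_z)
    return 1 - power


for alpha in (0.10, 0.05, 0.01):
    beta = false_negative_rate(alpha, effect_size=0.2, n_per_group=100)
    print(f"false-positive limit {alpha:.2f} -> false-negative rate {beta:.2f}")
```

Under these assumed numbers, each tightening of the false-positive limit raises the share of genuinely effective programs that the study would fail to detect.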
The problem, of course, is that it’s hard to tell the difference between false positives and false negatives, just as it’s hard to tell whether programs work or not. What evaluation science tries to do is to provide techniques that calibrate the risks of erroneous decisions in ways that make sense. Unfortunately, nostrums like “RCTs are the gold standard of evaluation” are often misused in ways that don’t make sense at all.
A control group, of course, is a means of isolating the effects of a certain “treatment.” If two groups of people are identical in every meaningful way except that one gets the treatment and one doesn’t (or gets a placebo), it’s fair to conclude that any difference in the results is attributable to the treatment. But in the real world, practical obstacles intrude.
First, even under ideal laboratory conditions, perfect randomness can be difficult to achieve. In the field, where researchers are dealing with specific communities of people with virtually unlimited and sometimes indeterminate or hidden characteristics, creating truly random experiments is maddeningly difficult, time-consuming and expensive. There’s also a political dimension: for example, it’s not so easy to explain to poor families why their children were randomly assigned (consigned would be more accurate) to a school that everyone knows is lousy so some clipboard-wielding social scientist can decide whether that hot new charter school that everyone’s talking about can help lucky kids from some other disadvantaged families avoid a life (sentence) of educational inequity.
Second, with an RCT, you can’t adjust the experiment along the way to make improvements. Suppose the charter school decides, halfway through the five-year evaluation, that it wants to adopt a terrific new curriculum for teaching fourth-grade math. Now you don’t have one five-year experiment, you’ve got two two-and-a-half-year experiments with half as many kids in each, which might or might not still be random relative to the control group, and you might not have enough data in any treatment group to produce meaningful results.
The third limitation of RCTs, the misuse of statistical measurements, is both the most nefarious and the least understood. It begins with the peculiarities of the word “significance,” which means entirely different things in English and in statistics. In English, significance refers to importance; in statistics, significance relates to validity, but its claims of iron-clad validity are often doubtful.
Statistics is a mathematical science that allows general conclusions to be drawn from specific cases. If I want to find out if charter schools can improve the educational performance of fourth graders, as a practical matter I can’t conduct an experiment with every fourth grader in the country (i.e., the “universe” of fourth graders). Instead, I have to select a manageable number of students, called a “sample,” in a way that enables me to draw reasonable conclusions about how similar charter schools might help other fourth graders who weren’t part of the sample. Statistical science establishes procedures that enable such generalizations to be made from small experiments, and statistically rigorous studies follow what are called true “experimental designs,” of which RCTs are one example. (For purposes of simplicity, I’m pretending that all charter schools are the same, which of course they’re not.)
Here’s the basic problem inherent in using true experimental designs in evaluating social programs. The only way to perfectly measure the impact of charter schools on fourth grade students would be to conduct an experiment with the entire universe of all fourth-graders, in which a perfectly random half went to charter schools and the other perfectly random half went to traditional public schools. While such a study would result in 100% certainty about the charter-school treatment effect, we can’t conduct such an experiment for many reasons, not the least of which is that there aren’t enough charter schools to serve that many students.
So we’re going to have to compare a manageable sample of charter students to a manageable sample of non-charter students. Even assuming that perfectly random assignments were made to the two groups so that there were no meaningful differences between them, neither sample would perfectly embody all of the characteristics of the entire universe from which it was drawn. We could choose another sample and divide them randomly between the treatment and control groups, and they would be different in some indeterminable ways from the first sample.
If we follow sound statistical practice, the differences among samples shouldn’t be large enough to invalidate the results. Not surprisingly, the primary factor in these undetectable variations among samples is the size of the sample: the larger the sample, the more it should be like the universe; the smaller the sample, the greater the chance that it will be quite different from the universe.
The genius of statistics is that it enables valid conclusions to be drawn about universes from pretty small samples, which is good because experiments with large samples are expensive and difficult to manage. And statisticians can estimate the amount of variation among different sizes of samples. This enables experimental designs to draw conclusions about treatments within identifiable probability ranges.
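To make the link between sample size and sample-to-sample variation concrete, here is a small simulation. Everything in it is invented for illustration: the “universe” is 20,000 made-up test scores, and the function name is my own. It draws many random samples of each size, measures how much their averages bounce around, and compares that with the textbook estimate (the universe’s standard deviation divided by the square root of the sample size).

```python
import random
from math import sqrt
from statistics import fmean, pstdev


def spread_of_sample_means(population, sample_size, n_samples, rng):
    """Standard deviation of the means of many random samples of one size."""
    means = [fmean(rng.sample(population, sample_size)) for _ in range(n_samples)]
    return pstdev(means)


rng = random.Random(0)  # fixed seed so the run is repeatable
# Hypothetical universe of scores: average near 70, standard deviation near 15
universe = [rng.gauss(70, 15) for _ in range(20_000)]
universe_sd = pstdev(universe)

for size in (25, 100, 400):
    observed = spread_of_sample_means(universe, size, 500, rng)
    predicted = universe_sd / sqrt(size)
    print(f"n={size:>3}: observed spread {observed:.2f}, predicted {predicted:.2f}")
```

Quadrupling the sample should roughly halve the spread, which is why precision gets expensive quickly.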
But when you’re dealing with probabilities, everything’s indefinite. Among all of the possible outcomes, some are more probable than others, but it’s hard to say exactly how likely any particular outcome is in any particular case. So how do you make an “evidence-based” decision when imprecision is unavoidable? When can you say, “this result is likely enough for us to say this works,” while “this result is just too speculative for us to accept”?
There’s nothing inherently wrong with estimating probabilities, as long as you acknowledge that’s what you’re doing. Which brings us back to the word “significance.”
In the statistical lexicon, an observed difference between two groups is considered “statistically significant” if the probability that the difference is due to purely random factors rather than to the treatment falls below some accepted threshold of evidence. In other words, since we can’t conduct universal experiments, there’s always some chance that a particular experiment will lead us to conclude that the treatment worked, when the difference between the treatment and the control group was actually due to some unpredictable and undetectable fluctuation in the sample we happened to pick. But that’s actually the beauty of statistics: we can use small samples to make informed judgments about how a treatment is likely to affect the universe, even though there’s always going to be some amount of uncertainty that can be reduced but not eliminated entirely.
Experimental designers make a living by conducting RCTs that have acceptably small risks of random error. But how small is small enough? Accepted practice says that sometimes it’s as small as a 1% chance of random variation (meaning there’s a 99% probability that the observed difference is due to the treatment), sometimes as small as 5% (a 95% probability that the treatment caused the difference), and sometimes a 10% chance is deemed acceptable (a 90% probability). Anything more than a 10% chance of random variation is almost always considered “statistically insignificant.”
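One way to see what “chance of random variation” means in practice is a permutation test, sketched below. This is my illustration, not a method prescribed by any of the sources quoted here, and the scores are invented. The idea: if the treatment made no difference, the group labels are arbitrary, so we shuffle the labels many times and count how often shuffling alone produces a gap at least as large as the one observed. That share is then compared with the 1%, 5% and 10% conventions.

```python
import random
from statistics import fmean


def permutation_p_value(treatment, control, n_shuffles, rng):
    """Share of random relabelings whose gap in means (in absolute value)
    is at least as large as the observed gap."""
    observed = abs(fmean(treatment) - fmean(control))
    pooled = list(treatment) + list(control)
    n_treat = len(treatment)
    at_least_as_large = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)  # reassign people to groups at random
        gap = abs(fmean(pooled[:n_treat]) - fmean(pooled[n_treat:]))
        if gap >= observed:
            at_least_as_large += 1
    return at_least_as_large / n_shuffles


# Hypothetical test scores, invented for illustration only
charter = [78, 85, 69, 91, 74, 88, 80, 77, 83, 72]
traditional = [71, 79, 65, 84, 70, 76, 73, 68, 81, 66]

p = permutation_p_value(charter, traditional, 10_000, random.Random(0))
for threshold in (0.01, 0.05, 0.10):
    verdict = "significant" if p < threshold else "not significant"
    print(f"p = {p:.3f}: {verdict} at the {threshold:.0%} level")
```

The same data and the same estimated probability can earn different verdicts depending only on which conventional threshold is applied.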
Why? Because starting in 1925, Sir Ronald A. Fisher, the renowned English statistician and evolutionary biologist, declared that
“it is convenient to take this point as a limit in judging whether a deviation is to be considered significant or not…. A scientific fact should be regarded as experimentally established only if a properly designed experiment rarely fails to give this level of significance. [Researchers should] ignore entirely all results which fail to reach this level.”
Translation: the 99% or 95% or 90% probability thresholds for statistical significance are rules of thumb, nothing more. They’re convenient conventions that statisticians have agreed to use to distinguish between reliable and unreliable experimental results. Instead of 10%, the cut-off for random variation could be 11%, or 9%, but it’s not. So an 89% probability that a program worked is considered “insignificant,” while a 90% probability is considered significant. As Stephen T. Ziliak and Deirdre N. McCloskey note in their snarky but scholarly book, The Cult of Statistical Significance, Fisher’s reasonable but arbitrary line “appeals to scientists uncomfortable with any sort of … indefinite approximation…. To avoid debate they seek certitude such as statistical significance.”
RCTs don’t run into trouble when they produce “significant” results, but real-world problems arise from the fact that most people don’t understand the significance of “insignificance.” Non-statisticians think that an RCT showing that a program result is “statistically insignificant” means that the study “proved” that the treatment “doesn’t work.” That’s completely wrong. All it means is that, due to unexplainable variation in the data, the study couldn’t determine whether or not the program worked based on the particular sample chosen with a probability of 90% or higher. Maybe it could make that determination at an 89% probability, or a 75% probability, but the statistics arbiters tell us that’s not good enough.
Ziliak and McCloskey offer an illustration of when focusing on “significance” can be silly. Suppose we conduct a study of body weight based on a person’s height and the amount of exercise they get, and suppose further that the data show that height is statistically significant but exercise is not. “A doctor would not say to a patient, ‘The problem is not that you’re fat — it’s that you’re too short for your weight.’” That is, just because the exercise data from the particular sample chosen wasn’t precise enough to be considered statistically significant does not mean that exercise isn’t a factor in weight. It doesn’t mean that fat people are too short. It means that this study doesn’t enable reliable conclusions to be drawn about the relationship between weight and exercise. As Carl Sagan famously observed, “The absence of evidence is not evidence of absence.”
Statistical insignificance is a finding about the precision of a sample, not the impact of a program. Insignificance comes from too much unaccounted variation in the data, often resulting from a sample size that’s too small, poor experimental design, or other inexplicable factors that come into play when we try to conduct laboratory experiments in the real world. RCTs often produce “insignificant” results, which means nothing more than that the experiment, no matter how expensive, laborious and time-consuming, was inconclusive. And inconclusiveness runs both ways: the study didn’t prove that the program worked and it didn’t prove that the program did not work.
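A back-of-the-envelope calculation shows how the verdict can turn on sample size alone. The sketch below assumes a hypothetical program that raises scores by 3 points on a test with a standard deviation of 15, and uses a simple normal approximation with equal group sizes; none of these figures comes from a real study. It holds the observed gain fixed and changes only the number of students.

```python
from math import sqrt
from statistics import NormalDist


def two_group_summary(mean_diff, sd, n_per_group, confidence=0.95):
    """Two-sided p-value and confidence interval for a difference in means.

    Uses a z approximation and assumes equal group sizes and a common
    standard deviation (simplifying assumptions for illustration).
    """
    std_normal = NormalDist()
    std_error = sd * sqrt(2 / n_per_group)
    z = mean_diff / std_error
    p_value = 2 * (1 - std_normal.cdf(abs(z)))
    margin = std_normal.inv_cdf(0.5 + confidence / 2) * std_error
    return p_value, (mean_diff - margin, mean_diff + margin)


# Same hypothetical 3-point gain, observed in studies of different sizes
for n in (50, 100, 400):
    p_value, (low, high) = two_group_summary(3.0, 15.0, n)
    print(f"n={n:>3} per group: p={p_value:.3f}, 95% interval {low:+.1f} to {high:+.1f}")
```

In the smaller studies the interval stretches from below zero to well above the observed gain, so the data are compatible both with no effect and with a large one. That is what inconclusive looks like; the program is identical in every row.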
Treating insignificance as proof that treatments don’t work has real consequences. Ziliak and McCloskey, whose book is subtitled, How the Standard Error Costs Us Jobs, Justice, and Lives, report on a 1980s study that found that Illinois saved $4.29 for every dollar it spent providing a training subsidy for unemployment insurance recipients, but the savings estimate was only significant at the 88% confidence level, just shy of the 90% cut-off. The program was deemed a failure, even though the defect related only to the fuzziness of the data sample. Another study hypothesized that stiffer penalties for dangerous driving in the United Kingdom could have saved 100,000 lives over ten years; the estimated results were significant at 95% probability but not at 99%. In such an experiment, the choice of significance levels would be decisive.
The purist and pragmatist schools of thought about evaluation techniques reflect different perspectives about the purpose of evaluation. As we’ve seen, the purists focus on precision: they want the most accurate probability estimates possible, even if that leads them to reject results that look strong but don’t cross the magical but arbitrary threshold of “statistical significance.” As The Acumen Fund’s Brian Trelstad wrote in an insightful 2008 paper, “Simple Measures for Social Enterprise,” “Metrics and evaluation are to development programs as autopsies are to health care: too late to help, intrusive, and often inconclusive.”
I agree that purists are like doctors who let patients die because autopsies provide the most accurate cause of illness. Pragmatists care more about figuring out a cure than nailing down the cause. As the standard bearer of the pragmatists, Mark Kramer of FSG Social Impact Advisors says, “the real value of evaluation is its usefulness as a management tool to refine strategy and improve implementation over time.”
“Within the field of Social Entrepreneurship … the primary goal is to catalyze change rapidly on as massive a scale as possible. The measures that matter most are practical indicators that can be tracked and acted on in real time to spread ideas or build strong organizations that can reach more people more cost-effectively.”
Like Kramer, Acumen’s Trelstad favors “a performance management process that would ‘take the pulse’ of our work: frequent, simple measures that would allow us to refine our thinking, change our course, and diagnose problems before they become too significant,” using that word in its non-statistical sense, of course. They both agree that evaluation should focus “on the pragmatic question of how to help more people sooner” (Kramer) for “the primary purpose of supporting and scaling each enterprise.” (Trelstad)
Another limitation is that statistical methods only test one hypothesis at a time: does the variation in the data indicate with a high enough probability that the treatment caused the result or not? They don’t provide guidance among multiple treatment options, which is exactly what we need when considering alternative policy choices. Binary choices — yes/no, true/false — don’t help much. We need to know how well a program worked, whether it worked better than other potential approaches, and how effective programs can be improved. Acumen, for example, wants to know how programs they fund “compare more or less favorably to the ‘best alternative charitable option’ available to our donors,” that is, “how else the donor could have invested their money.” Trelstad frames this well:
“The search for absolute impact or performance measures is elusive and in my mind irrelevant. Performance is always relative to what you had been doing before (past), to what your competition did over the same time period (peers), and to what you should have done (projections).”
It’s important for conscientious social entrepreneurs, intermediaries and funders to maintain a sense of perspective as they try to shift the paradigm in the admirable direction of greater accountability for performance. Fortunately, there are many sound evaluation models that are much more practicable than RCTs and that yield results more than reliable enough to support reasonable policy decisions. For example, Kramer offers a 12-cell “evaluation matrix” which captures three different kinds of measures (monitoring, process and impact) for each of four levels (grantee, donors, program area, and foundation). From these 12 combinations, he identifies six kinds of evaluation that reflect different objectives: formative, summative, donor engagement, cluster evaluation, overall foundation performance, and administrative processes.
“Over 50 public comments were received on the use of evidence of effectiveness and impact in the SIF. Many of the comments encouraged the Corporation to be more inclusive about the types of evaluation that would produce strong evidence of impact. The Corporation will capture these insights in its Frequently Asked Questions (FAQ), a companion document to the NOFA. The FAQ will clarify that the Corporation expects subgrantees to demonstrate some level of impact in order to receive a grant, but does not expect that most initial subgrantees will have the strongest level of evidence. The SIF is designed to build the evidence base of programs over time using rigorous evaluation tools that are appropriate for the intervention. The Corporation is committed to ongoing discussion about evidence moving forward through learning communities and other forums.” (Emphasis added.)
Now, CNCS’s reference to “the strongest level of evidence” makes me a bit queasy as it seems to echo misplaced notions of gold standards. RCTs are often impossible to conduct in the field or are available only at prohibitive cost. As Brian Trelstad noted with some understatement, “it is impractical to spend $250,000 researching the impact of a $500,000 investment …” At the risk of beating the dead horse yet again, I’ll just comment that it’s hard to see how evidence can be “the strongest” if you can’t actually get it. In my experience, hypothetical evidence isn’t all that strong. Rather, “the strongest” evidence is the most rigorous evidence that you can actually get at justifiable effort and cost. Putting that semantic quibble aside, CNCS seems to understand that it should not let the unattainable perfect become the enemy of the readily attainable good, particularly when it talks about evaluations that are “appropriate for the intervention.”
CNCS’s pragmatic approach makes particular sense given SIF’s focus on more mature nonprofits selected by growth-oriented intermediaries. As Kramer has observed, “Over the organizational life cycle, however, expectations for management performance, cost effectiveness, and scale of impact increase rapidly, requiring very different evaluation criteria at different stages of maturity.”
In the case of a promising but untested new innovation, it makes sense to ask “Does this work?” and “Is it better than existing approaches?” But once an innovation has accumulated some evidence of impact, as will be true for all SIF grantees, the more important question becomes, “How can we make this more widely available?”
Trust me, we’re going to hear that SIF was a waste of time and money to the extent that it didn’t use RCTs. Ziliak and McCloskey observe that the RCT’s “arbitrary, mechanical illogic … [is] currently sanctioned by science and its bureaucracies of reproduction …” and “the sociological pressure to assent to the ritual is great.” But when it comes to making important choices about which social policies to fund and expand when families’ lives and welfare are at stake, insisting on an arbitrary 90% or higher standard of “statistical significance” is a luxury we don’t have. If there’s an 89% or 80% or 75% chance that a given program accounts for, say, the improved grades that one group of students received, we should think carefully about keeping and improving that program. I agree with Ziliak and McCloskey that it would be irresponsible to abandon such a program based on insignificance alone:
“Accepting or rejecting a test of significance without considering the potential losses from the available courses of action is buying a pig in a poke. It is not ethically or economically defensible.”
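The arithmetic behind that warning is simple enough to sketch. The figures below are hypothetical (a program that costs one unit and returns four units of benefit if it truly works), and treating a confidence level as the probability that the program works is the same informal shorthand used throughout this post. The point is only that the expected payoff changes smoothly with the probability, with nothing special happening at 90%.

```python
def expected_net_benefit(prob_works, benefit_if_works, cost):
    """Expected payoff of funding a program that might or might not work."""
    return prob_works * benefit_if_works - cost


# Hypothetical program: costs 1.0 unit, returns 4.0 units if it truly works
for prob in (0.50, 0.75, 0.89, 0.90):
    keep = expected_net_benefit(prob, benefit_if_works=4.0, cost=1.0)
    drop = 0.0  # assumption: abandoning the program costs and gains nothing
    choice = "keep" if keep > drop else "drop"
    print(f"P(works)={prob:.2f}: expected net benefit {keep:+.2f} -> {choice}")
```

With these assumed stakes, the break-even probability is 25%, far below any conventional significance threshold. Different costs and benefits would move that break-even point, which is exactly why the potential losses have to be part of the decision.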
No less a figure than W. Edwards Deming put it plainly: “Statistical ‘significance’ by itself is not a rational basis for action.” And Gara LaMarche, chief of The Atlantic Philanthropies, wrote in The Financial Times, “both funders and the organisations they support need more humility about cause and effect.” Trelstad reminds us that “the expectations for what one can measure and what one can prove diverge from the reality of practice.”
Of course, if there’s only a 50% chance, that is, if it’s just as likely that the higher grades were caused by random differences among samples, then, sure, that’s not very encouraging. But at some point, it’s foolish to believe there’s some bright line of probability that can rescue us from having to make difficult judgments about what works and what doesn’t. RCTs aren’t a silver bullet, a gold standard or some kind of “on-off switch for establishing scientific credibility.” In exceptional cases, they’re worth doing; in most cases, they’re not. Fortunately, there are many other good ways to evaluate nonprofit organizations and programs that don’t involve complete guesswork or wishful thinking. Those are generally the best techniques available and we should embrace them enthusiastically to help us make timely choices among encouraging alternatives, which is just what we need in the pursuit of “scaling what works.”
Steven Goldberg is an author and consultant based in Needham, Massachusetts. He is the author