If you are a practitioner or official from a governmental or non-governmental organization (an "implementer"), this guide is intended to help you take advantage of opportunities to collaborate with external researchers ("evaluators") to evaluate your organization's policies or programs. Researchers working on evaluations can also benefit from this guide by understanding how to take implementers' needs into account.
As an implementer, what should you expect or request when collaborating with an external evaluator? How can that evaluation be designed to help your organization and the broader field to learn and innovate? What conversations should you have with an evaluator upfront so that the evaluation runs smoothly? How can you work with the evaluator to decide on and communicate actionable and convincing analyses about the impact of a program? How will you and the evaluator share what you learned to improve the practice of your organization and others? How can you ensure that what you learn helps the whole community of practice so that overall social welfare improves? This short guide offers questions to guide conversations with prospective evaluators, points to some practices that have worked well elsewhere, and explains why they might be important or useful in future evaluations.1
This may seem obvious–the evaluation is to see whether the program worked. Yet most policy evaluation efforts involve multiple goals. The program may be in a pilot stage, and you want to gather preliminary evidence before expanding it. The program may have worked in one context (country, community), and you would like to know how generalizable it is. You may want to know who the program works for. Does it work differently for men or women? For marginalized communities or for wealthy and powerful groups? You may want to know how one part of your intervention affects other aspects of your existing interventions (e.g., cash vs. cash + training) or what combination of activities is most cost-effective. You may want to know whether you should try to scale the program. Or you may want to know not so much whether the program works, but why it works.
Additionally, many programs have multiple components. The program may not only be delivering a service (e.g., distributing information on how to access health services), but it may also be trying to build the capacity of a local agency (e.g., helping a public health agency to identify households for additional forms of support). As a result, there are potentially both individual-level (i.e., micro) questions–the effect of school lunches on children–and more macro questions–how are local governments better able to identify vulnerable children?–that need answers. While both may be asked in one evaluation, the evaluators need to understand how the implementer prioritizes the two questions. This prioritization will affect the evaluation design (see the next Conversation).
As implementers, it's important to agree among yourselves on what you want to learn and to be clear about those goals with the evaluation team. The evaluation team or person, particularly if they are academics or publish their work more widely, will also have goals about what they want to learn. Without clarity about what you and your organization need to learn, the evaluator may not design the most appropriate evaluation method. This takes us to the next question.
Since evaluation efforts may serve multiple purposes for different audiences, no single research design is always best. We suggest that evaluation and implementation teams not take for granted any given approach: some evaluation teams specialize in human-centered design research, which is very similar to ethnographic research in the social sciences; other teams specialize in randomized controlled trials (RCTs). The fact that any given team has expertise in one particular mode of research should not outweigh the need to connect purposes to evaluation design. We have seen such conversations yield realizations that, in fact, an "evaluation" ought to be a multi-step process, with learning about the context and the intervention occurring along the way – more akin to adding evidence to hone a learning agenda than to reporting on the estimated causal effect of a given intervention in a given place and time.
Openness about the evaluation approach can also prepare teams to adapt as new information arises. For example, at the end of an RCT one of the authors conducted in Nigeria, we learned that the program diffused from direct participants to some non-participants. However, we did not know why, and could immediately think of competing explanations after the fact: Was it a result of non-participants witnessing cooperation among direct participants? Did the people who directly participated talk about their experience with non-participants? Did norms in these communities change? Pairing this work with interviews and observations earlier in the process would have helped us understand why we found the effect, and then we could have better tested these new ideas with follow-up studies.
RCTs can be used to test insights derived from more unstructured observation. For example, Los Angeles-based campaign specialist David Fleischer noticed from his own experience and that of others that certain interactions during door-to-door canvassing appeared to be particularly compelling for changing people's minds about sensitive topics. Conversations between him and a pair of academics led to a series of RCTs to verify and unpack the mechanisms behind this phenomenon (called "deep canvassing").
This conversation between the evaluator and the implementer about approach, method, and possibly iteration can also help identify whether the intervention can be conducted in a way that better answers the questions of interest. For example, is there a way to phase in or layer parts of the program so you could test mechanisms? How might you recruit individuals into the program so that it is more representative of the populations you care about, or so that the results will be generalizable? Our general theme here (and elsewhere in this guide) is that the most useful evaluations often occur when both the evaluation and the program design are created collaboratively (see Conversation 6 below).
Often there are multiple stakeholders, not only within the organization whose work is subject to the evaluation but also externally (not to mention the evaluator themselves, and the audiences they may want to influence by publicizing the results). Within an organization, some people work across contexts and want to understand how a program's results may vary across those contexts. Some people implement in a specific context and want to know how to do it better in that place. Then there may be executives focused on external influence and raising the profile of the organization, versus program managers trying to use the information to adapt their programming. For example, when one of the authors was working on an evaluation related to host and refugee relationships, the HQ team cared most about the pooled results across Lebanon and Jordan. Such results allowed them to speak to the larger issue of host-refugee relations. In contrast, the field teams cared more about country-specific data so they could adapt their programming. The evaluation team worked with both the field and HQ teams to prioritize which results to produce first. In this case, the different needs within an organization did not change the evaluation design. But one could easily imagine a situation where an evaluation designed to be sensitive to the effects of a program in one place would be inappropriate in another context. If a multi-context analysis is of primary importance, then one will be glad to have had this conversation early in the design process rather than discovering later that one cannot easily combine datasets.
Both implementers and evaluators who are committed to benefiting the larger social good will also want to influence external stakeholders, particularly policymakers. Discussing who these people are and what questions and information they will find most persuasive needs to happen not just after the data is analyzed, but at the design phase. Otherwise, the implementing organization may be disappointed that the results are not able to speak to certain debates and as a result may be less likely to invest in evidence generation in the future. We discuss more about publicizing results below.
An organization decides to evaluate the performance of a new (or old) policy because its members desire to learn how to improve. We state this point first and foremost because resistance to evaluation often arises from different stakeholders who think of the word "evaluate" and connect it with "grade" or "rank" or other attempts to measure that often distribute rewards or punishments. An organization that primarily evaluates as a form of ranking or grading will quickly run into resistance and all of the problems associated with replacing internal motivations with external carrots and sticks — carrots and sticks are blunt instruments that rarely lead to the best performance from anyone.
If, however, an organization uses evaluations to learn, then the structure of an evaluation should take the desire to learn into account. Some questions that might well orient any such evaluation include:
Even when the purpose is to learn, not grade, some within an organization may still fear the results, believing that their careers are tied to achieving strong results. If they are not bought into the learning agenda, they may resist cooperating with the evaluators or refuse to use the results of the evaluation, limiting its utility. Ensuring that the questions the team closest to the ground wants answered are incorporated into the evaluation helps ensure both that the evaluation succeeds and that people will use its results to learn and improve future programs.
In our experience, when a new policy design arises from a creative collaboration between evaluators (who are often academics) and implementers, everyone wins. Often, for example, a better evaluation design can be incorporated into the implementation if discussed early, in a way that creates fewer burdens on the implementer and/or those benefiting from the new policy (e.g., aligning data collection or phasing in intervention sites). Another benefit of such co-creation is that different people will bring different perspectives and evidence bases to the table for the design of both the program and the evaluation. For example, the U.S. government's Office of Evaluation Sciences (OES) has institutionalized this process. They write:
Our collaborators, who are civil servants with years of experience working to deliver programs across the government, are experts on how their programs work and often have the best ideas for how to improve them. OES team members support their efforts by bringing diverse academic and applied expertise to more deeply understand program bottlenecks and offer recommendations drawn from peer-reviewed evidence in the social and behavioral sciences.
The Immigration Policy Lab at Stanford uses this co-creation process as a basis for its collaborations with implementers. It works with organizations not only to research policies and programs but to collaborate with the implementer from the start. Through this process, evaluators bring the latest evidence related to a specific policy issue (e.g., refugee resettlement) and co-design the intervention with the implementer, combining the best evidence with on-the-ground experience. Together, they then evaluate the intervention, adding to the knowledge on immigration policies.
Other evaluations involve an evaluator brought on after the fact — perhaps the funders and/or the implementer desire a fresh perspective on the data and design, or the original implementation was rushed. Either way, "how and on which parts should we collaborate" is still a crucial conversation to have.
Collaboration requires communication. This means that the parties to an evaluation must agree on speedy feedback, in mutually agreed-upon forms, and on processes for providing it. For example, some complex evaluations might create a quick web dashboard so that sample sizes and implementation progress can be easily seen by the whole team. Other, simpler projects may agree to use existing tools for project management and communication. Not everyone checks email all the time. And many people silence their phones as they try to focus. Having this conversation about communication allows new partners to avoid misinterpreting email or Slack silence.
Calendars and deadlines are also an important part of this conversation. During certain periods, evaluators may be less available (e.g., when grades or a grant application is due); the implementer may need at least preliminary results for an important donor meeting. Knowing these time windows helps both sides understand availability, anticipate slow responses, and plan accordingly.
The communication discussion is also an opportunity to envision the final products: the one-pager; the three anticipated challenges and three hoped-for successes for the external-facing report; the number of drafts of the different final products the team expects to go through; and who is expected to take the lead on each. This might also be a time when evaluators and implementers share draft reports or report templates that they have liked in the past.
A well-done public evaluation report is a gift to humanity.
Consider who will benefit from a public report of positive, negative, or null results:
Publishing results of evaluations also helps your organization in two important and related ways. 1) It enhances your organization's influence with donors and other policymakers, as you can inform key debates with evidence. 2) It enhances your organization's reputation as trustworthy and willing to provide a public good to improve the field. When one of the authors of this guide was working inside an implementing organization, donors often complimented the organization's commitment to sharing results that contradicted conventional wisdom.
Moreover, publishing all results helps quell other criticisms that can be aimed at organizations that are contributing to knowledge generation. For example, some criticize evidence-based policy by calling it "policy-based evidence." The criticism suggests that evaluations serve only to add a veneer of respectability to the pre-conceived notions of organizational leaders. This idea — that careful research and analyses are merely rhetoric — can diminish trust in individual organizations and in government and science as institutions. Instead, if an organization says, "We publish all of our evaluations and invite others to scrutinize our results and join us in learning how to better serve the public," then it is hard for detractors to claim that the organization is hiding the truth and hard for others to pressure the organization to hide the truth in turn.
The timelines that implementers and evaluators, especially those who come from academia, have for publishing results are likely different. Your organization may want to use or share the results as soon as possible, so that you can improve your programming or influence policy. However, cleaning and analyzing data takes time. And a review of the results before publishing can improve the quality of the presentation and analysis. Some academic publications prefer that results not be shared publicly in other formats before publication. Given these benefits and constraints, it's important to talk with the evaluator about when, where, and how to share the results.
In conducting an evaluation, numerous things may not go according to plan. There are many operational issues, and since an evaluation can be a multi-month or multi-year process, complications always arise. There may be staffing changes, and new staff members' commitment to the evaluation may vary, or they may be interested in different questions. The context may change, and the teams will have to adapt. The Covid pandemic is a good example: programming and how data were collected had to change practically overnight. Good communication within the team (see above) is essential for navigating these unexpected issues, which will arise in some form, and for ensuring the integrity of the program and evaluation.
In addition to these operational issues, we may find unexpected results from the evaluation. The evaluation may find that the program or policy had little effect. While null results can be disappointing, they can in and of themselves provide important learning. If the results show null effects, the organization can re-evaluate and generate more ideas for how to implement the program. Null results can encourage more learning than one might think when they are combined with prior beliefs. Also, null results can arise for reasons that have more to do with sample size or other artifacts of the evaluation (e.g., how the outcomes were operationalized) than with the impact of the intervention. A null result arising from an intervention that, in theory, really should have had a big effect can be particularly fruitful for science — it sends scientists back to the drawing board and forces re-evaluation of well-established theories. Perhaps those theories had only been tested in the context of university laboratories rather than in the real world. Or perhaps the real world in which the theories had been assessed in the past has now changed. See for example the handout on How to Use Unexpected and Null Results by the OES and our 10 Things Your Null Result Might Mean methods guide.
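To see how sample size alone can produce a null finding even when a program truly works, consider a back-of-the-envelope power calculation. The sketch below is a simplification (a normal approximation to a two-sample test), and the effect sizes and sample sizes are purely hypothetical, not drawn from any evaluation discussed in this guide:

```python
import math

def normal_cdf(x):
    # Standard normal cumulative distribution function via the error function.
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def power_two_sample(effect_size, n_per_arm):
    """Approximate power to detect a standardized effect (Cohen's d)
    in a two-arm comparison, two-sided test at the 5% level."""
    z_crit = 1.96  # critical value for a two-sided 5% test
    noncentrality = effect_size * math.sqrt(n_per_arm / 2)
    return 1 - normal_cdf(z_crit - noncentrality)

# A real but small effect (d = 0.2) with only 100 people per arm
# is detected less than a third of the time.
small_study = power_two_sample(0.2, 100)

# The same effect with 800 people per arm is detected almost always.
large_study = power_two_sample(0.2, 800)
```

Under these hypothetical numbers, the smaller study would return a null result most of the time even though the program works, which is exactly why a null should prompt a conversation about design artifacts before a conclusion about impact.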
While nulls or other unintended impacts provide important learning for an organization, they can create uncomfortable conversations with the evaluator and within the organization. As a result, thinking through these possibilities ahead of time can be very useful so that people are not surprised. Several tools can help the implementer and evaluator think through these possibilities together.
One approach, arising from project management in business, is the pre-mortem, in which the implementer and evaluator begin by imagining that the evaluation has returned a null or negative result. What might have led to this result? The teams list the possibilities they can imagine and thereby become prepared for those and similar problems.
Another related approach from public health uses the name 'Dark Logic' to evoke the idea that one uses imagination and logic together to create negative scenarios, which in turn may (1) help an organization and evaluator avoid worst-case scenarios and (2) prepare an organization to react to such scenarios should they occur.
A third approach that is becoming the norm for experimental research in political science, economics, and social psychology is to use a pre-analysis plan. People can sometimes disregard an evaluation if the results do not confirm their beliefs. A pre-analysis plan helps prevent criticism of the results of the evaluation based on methods or analysis choices. See 10 Things to Know about Pre-Analysis Plans for more on why and how. See also Preregistration as a Tool for Strengthening Federal Evaluation from OES.
A pre-analysis plan can also help stakeholders within an organization, perhaps stakeholders with conflicting prior beliefs, think through what analyses they would find convincing before the study has been fielded and the data have been analyzed. For example, an evaluator could generate hypothetical tables and figures and present them at stakeholder meetings, asking: "What would your reaction be to a figure like this?" This process helps stakeholders begin to assess the evaluation questions in more detail, ensure that measures are operationalized appropriately, and consider the level of implementation needed for the desired results. It also helps stakeholders become aware of the possibility that we may learn something unexpected.
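As a sketch of how such hypothetical tables might be produced, the snippet below prints mock effect estimates under positive, null, and negative scenarios. The outcome names and every number are invented purely for discussion, so stakeholders can react to the format before any real data exist:

```python
import random

random.seed(0)  # reproducible mock numbers

# Hypothetical outcome measures; all figures below are invented for discussion.
OUTCOMES = ["Attended training", "Used health service", "Reported trust"]

def mock_results(direction):
    """Return mock (outcome, estimate, ci_low, ci_high) rows for a
    'positive', 'null', or 'negative' scenario."""
    center = {"positive": 0.10, "null": 0.0, "negative": -0.10}[direction]
    rows = []
    for outcome in OUTCOMES:
        est = round(center + random.uniform(-0.02, 0.02), 3)
        rows.append((outcome, est, round(est - 0.05, 3), round(est + 0.05, 3)))
    return rows

for scenario in ("positive", "null", "negative"):
    print(f"Scenario: {scenario}")
    for outcome, est, lo, hi in mock_results(scenario):
        print(f"  {outcome:<22} {est:+.3f}  [95% CI {lo:+.3f}, {hi:+.3f}]")
```

Walking a stakeholder meeting through all three scenarios side by side makes the "what if it's null?" conversation concrete before anyone has a stake in the actual numbers.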
Much of this document is inspired by the Project Process of the OES as well as discussions hosted by the Causal Inference for Social Impact project at CASBS and the Evidence in Governance and Politics network. See the MIT Gov Lab Guide to Difficult Conversations for more guidance about academic-practitioner collaborations as well as the Research4Impact findings about cross-sector collaborations. We anticipate that this document will be open source and revised over time based on your comments and suggestions. Thanks much to Carrie Cihak, Matt Lisiecki, Ruth Ann Moss, Betsy Rajala, Cyrus Samii, Rebecca Thornton, and folks at the organizations listed above for helpful comments.↩︎
We do recognize that some donors may not allow this (though many are becoming proponents of it), and the competitive nature of fundraising may make sharing data seem risky, especially as a first mover.↩︎