Main

Scientific evidence has an important role in policy decisions1. This has been increasingly true of evidence from the behavioural and social sciences2, particularly for public health policy throughout the global COVID-19 pandemic4,5. One broad challenge in this process is that there is no universally endorsed approach to determine which scientific insights should inform policy6. Recommendations may be made on an ad hoc basis, may be based on relationships between certain researchers and policymakers, and may fail to factor in an appropriate level of uncertainty7,8,9. This is further complicated by the sheer volume and heterogeneity of evidence, making appropriate identification and synthesis a substantial challenge10.

One major example of science impacting policy comes from April 2020, when 42 academics from 8 countries and multiple academic disciplines published a review containing a series of hypotheses about factors that were likely to shape collective behaviour during a pandemic3. Topics included threat and risk perception, social norms, science communication, the importance of individual and collective interests, leadership, and stress and coping. The paper also included a list of broad behavioural insights (which we refer to as ‘claims’) deemed most relevant to the pandemic. The article received unprecedented attention: in only 2 years, it was cited over 3,000 times, and by December 2022 had an Altmetric score in the highest 0.0001% of all articles ever published. This was, in part, because governments around the world formulated pandemic policy strategies explicitly11 on the basis of the behavioural concepts highlighted in the paper12,13,14,15,16,17,18. However, it was also because social and behavioural scientists viewed the pandemic as a critical focus for their attention19 and research20,21.

Naturally, with such levels of visibility, concerns were raised about various claims made by Van Bavel et al.3. These included concerns that academics were making recommendations despite not all having previous experience in public policy or public health22,23. In the context of the so-called replication crisis in psychology, it has also been suggested that the article focused too much on evidence from WEIRD (Western, educated, industrialized, rich, democratic) populations, took insufficient account of the heterogeneity of effects, overstated the validity of existing evidence and was opportunistic24,25,26,27.

This paper responds to those concerns by retrospectively evaluating the quality of the claims made by Van Bavel et al.3. More broadly, it presents a valuable opportunity to mobilize many independent experts using a structured approach to assess the appropriateness of scientific evidence being considered for application to policy. This is important not only to inform theorizing about behaviour during a pandemic but also to establish the relevance of these claims for future emergencies and, more generally, across public policy. It is an equally important complement to addressing replicability in science, scrutinizing evidence produced during a crisis while also revisiting previous claims made by experts.

Concerns related to the readiness or robustness of evidence for application to policy are neither new nor unique to COVID-19. Terms such as ‘evidence-based policy’ have long been used to apply research to major decisions in government, institutions, schools and businesses28. The lack of consensus concerning what counts as sufficient evidence for policy decisions29 is a particular problem in an emergency when policymakers must take urgent actions with sometimes limited evidence. As such, they may seek the lowest-risk or most-effective approach rather than a perfectly informed approach (which is often unavailable or unclear)30, or they may risk delays that create even greater harm31,32. Consequently, the COVID-19 pandemic presented an opportunity to assess evidence in a way that encouraged more evaluations of academic policy recommendations in the future.

With such a substantial amount of evidence now available, our goal was to provide a descriptive synthesis of the available evidence relevant to 19 claims from the Van Bavel et al. article (see the first column of Table 1 in this paper for the full statements). In essence, we evaluated the extent to which those statements provided valid policy guidance, privileging empirical evidence of consistent real-world impact. The present work is not a comprehensive evaluation of all behavioural science related to the pandemic, which would be beyond the scope of a single article, but only of the key statements made in the 2020 paper.

Table 1 Nineteen claims about social and behavioural science during the COVID-19 pandemic

We assessed 19 behavioural policy recommendations by evaluating available articles according to the level of evidence they include, ranging from pure opinion or theory to large-scale, replicated field studies, as well as the size and direction of the effects reported (see ‘Procedure’ in the Methods). Our evaluations focused primarily on the scale and scope of empirical findings directly related to the claims, although the compiled data (available at https://psyarxiv.com/58udn) also highlight methods, geographical settings and specific behaviours. We then synthesized the evidence within each claim to formulate a summary evaluation. For this exercise, both original authors and an independent team of evaluators selected and assessed evidence relevant to these claims (see ‘Evaluation teams’ in the Methods), all of whom were blinded to the identities and assessments of other reviewers. This allowed us to leverage the expertise of the original authors while also adding a diverse group of scholars who were not involved in the original paper to provide an independent, objective evaluation of the evidence.
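
To make the rating dimensions described above concrete, one reviewer's assessment of a single article can be represented as a small record such as the following (a minimal sketch; the field names and values are illustrative and are not the actual data format used by the evaluation teams):

```python
# Illustrative record of one reviewer's assessment of one article (hypothetical
# field names; the scale mirrors the levels of evidence described in the Methods).
article_rating = {
    "claim": 4,                  # which of the 19 claims the article was assessed against
    "relevant": True,            # was the article directly relevant to the claim?
    "level_of_evidence": 3,      # 1 = opinion/theory only ... 5 = wide-scale field evidence
    "direction": "positive",     # positive, null or negative relative to the claim
    "effect_size": "medium",     # null, small, medium or large
    "note": "field study in a real-world setting with a large sample",
}
```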

Our primary motivations are to (1) transparently evaluate ex post the evidence for a set of highly influential claims regarding behaviour during a pandemic, (2) implement a pragmatic, expert-driven method for evaluating and synthesizing evidence that is suitable for informing public policy (both related to COVID-19 and in future applications), and (3) make those assessments public in a way that promotes transparency and builds trust with the public33. The first and second aims are broadly relevant across all scientific research. The third aim is specifically relevant to assessing policy recommendations, which is why we decided to provide a descriptive summary rather than focus on methods or causal inference (which are more highly valued within science). The first aim is also especially critical given substantial concerns about public trust in science in general34 and raised directly in the context of COVID-19 (refs. 35,36,37,38,39).

Evaluating predictions of academic experts is an important exercise to protect against questionable research practices, mistakes and overconfidence38,40,41, which were a common concern specifically in behavioural science during COVID-19. Those concerns fed into cautions raised about systematic reviews of available evidence42,43. Furthermore, by mobilizing a large group of fully independent reviewers, we ensure that no single paper (whether highly powered or merely highly visible) or person can have unchecked influence on the evaluation of policy-relevant evidence.

Evidence for 19 pandemic behaviour claims

As outlined in the Methods (‘Procedure and evidence used for evaluations’), all 747 articles were reviewed by at least two reviewers from each team (four in total); 518 articles received at least one rating (see Extended Data Fig. 1). One hundred and eighty-six articles were unanimously rated as not directly relevant or informative to the claim, and 43 were found to be duplicated work, typically papers whose titles had changed between preprint and publication. Of the 19 claims, 18 had at least some empirical evidence to assess (see full descriptions in Tables 2–6), with only one claim lacking any empirical research. Of those 18 claims, 13 were assessed as having been studied empirically, although only in surveys or limited laboratory settings (see Extended Data Table 1 for a breakdown of articles).

Table 2 Evidence assessments for four claims on sense of identity
Table 3 Evidence assessments for two claims on trust and leadership
Table 4 Evidence assessments for four claims on messaging and language
Table 5 Evidence assessments for claim 7 (five sub-claims) on messaging and language
Table 6 Evidence assessments for four claims on culture, politics and misinformation

Thirty-four studies reported samples in terms of numbers of countries, studies, secondary datasets or other indirect observations. Sample sizes of the 463 original data studies included were large. The mean sample size of the 418 papers specifically involving human participants was 16,848 (median of 1,699), with individual studies ranging from 52 to 1,429,453 participants. Only three of the studies reviewed had samples of fewer than 100 participants, whereas 279 had 1,000 or more. These estimates of the mean and median are somewhat conservative because they include only studies specifically involving human participants; they exclude three studies (with samples ranging from 3.7 million to 654 million) that used social media posts or accounts. Many studies also provided only vague indicators (for example, “more than ten thousand”) or aggregated groupings (for example, numbers of states, provinces or countries). One hundred and forty-two countries were included in one or more studies (see ‘Evidence used for evaluations’ in the Methods). The links in the Data Availability section give access to the raw and interactive datasets for further exploration.
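
For readers who wish to reproduce these descriptive figures from the compiled spreadsheet, a minimal sketch of the calculation is shown below; the column names and toy values are hypothetical, and the actual data are available through the links in the Data Availability section.

```python
# Minimal sketch (hypothetical data): summarize sample sizes across studies that
# clearly reported a human-participant sample, excluding social media studies and
# studies with only vague indicators (for example, "more than ten thousand").
from statistics import mean, median

studies = [
    {"id": "s1", "n": 1_699, "unit": "human participants"},
    {"id": "s2", "n": 52, "unit": "human participants"},
    {"id": "s3", "n": 3_700_000, "unit": "social media accounts"},
    {"id": "s4", "n": None, "unit": "human participants"},  # sample not clearly reported
]

ns = [s["n"] for s in studies if s["unit"] == "human participants" and s["n"] is not None]
print(f"k = {len(ns)}, mean = {mean(ns):.0f}, median = {median(ns):.0f}, "
      f"range = {min(ns)}-{max(ns)}")
```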

As depicted in Fig. 1, the directions of effect or correlation suggested by most claims were generally supported. Of the 18 claims that had at least some empirical evidence available for evaluation, 16 (89%) were generally supported in the direction of the original statement. Of those 16 claims, ten were considered to show small effects, five medium effects and one a large effect. We did not find any meaningful effects in support of the remaining two claims, which stated that messages that emphasize benefits to the recipient (claim 7i) or focus on protecting others (claim 7ii) tend to be persuasive. This may be because several studies showed that there were moderators of which emphasis (self versus others) was more effective44.

Fig. 1: Reviewer-assessed effect size for each claim and the qualitative rating.

The y axis shows the reviewer-assessed effect size for each claim; the x axis shows the qualitative rating (from theory only to widely tested). Each set of claims is represented by a different icon. Most claims were confirmed as having small-to-medium effect sizes, including those tested and replicated in real-world contexts. The strongest finding is indicated by the globe near the top right, which shows that the claim about culture (see Table 1) was widely tested in multiple studies and that the results were consistent with the original Van Bavel et al. paper at roughly a medium effect size. A legend mapping each icon to the 19 claims is presented below the graph.

Six claims (2, 4, 9, 12, 13 and 14) backed by empirical evidence demonstrated medium or large effects. Claim 4 (that norms of prosocial behaviour are more effective when coupled with the expectation of social approval and modelled by in-group members who are central in social networks) was mostly tested on observational data, but the effect sizes found in these data were notably strong.

Importantly, no effects were in the opposite direction from the original predictions. This means that no recommendations from the Van Bavel et al. paper led to a consistent backfire effect. Those 19 statements proposed behavioural domains that were likely to be of interest during the COVID-19 pandemic. Some claims were general about potentially relevant behaviours, whereas others were more prescriptive about potentially more effective intervention approaches.

Overall, our review indicates that the Van Bavel et al. article generally identified highly relevant topics of study in the pandemic and, to an extent, the direction of associated findings. In particular, it identified (1) relevant behaviours during the pandemic (both positive and negative), (2) likely barriers to mitigating the spread of the disease, and (3) major social challenges that would be faced by policymakers. The following text summarizes these in general groups, citing articles viewed by the assessing teams as being most consequential for their final assessments. The ratings for each are specified in Tables 2–6.

All behaviours studied, whether specific to a claim or not, are listed in Table 7. We include this table as a future reference for considering which behaviours to expect or target.

Table 7 Behaviours directly studied in the initial literature review

Sense of identity

Four claims made in 2020 focused on how social identities would be highly relevant during the pandemic, particularly how they aligned with either community benefits or social norms. These expectations generally appeared to be accurate (see Table 2), with scores of studies concluding that connectedness with communities or alignment with moral values predicted behaviours and efforts to control the spread of illness45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60. However, one challenge that is typically present for research on subjective and latent constructs such as identity, prosociality and connectedness is that most research was conducted through surveys. Few studies attempted to isolate the causal effect of identity on pandemic behaviours, and no experimental studies manipulated identity or the sense of collective purpose in a real-world setting. In some cases, well-powered studies directly assessing the claims were limited to asking about intentions to receive vaccines60. Although such findings are very valuable, there is clearly additional benefit in validating them in consequential settings, or even in carrying out retrospective studies to determine whether behaviours or infections were measurably associated with connectedness in the places where studies were conducted.

Trust and leadership

There was a large amount of evidence from peer-reviewed research on the role of leadership during the pandemic. Two claims from 2020 specifically outlined expectations for how trusted sources and leadership may be relevant to promoting public health guidelines. There was a substantial amount of research (see Table 3) supporting both of these expectations, although the best-quality evidence from consequential settings was replicated only in relation to the claim that the most effective messaging comes from trusted sources61,62,63,64,65,66. Consistent with original expectations, evidence has supported the value of highlighting the cooperation of other people to promote health behaviours, although evidence was limited to surveys and correlational studies.

Messaging and language

Perhaps the most widely studied topic during the pandemic was public health messaging. This was clearly anticipated by Van Bavel et al., as 9 of the 19 claims explicitly discussed the role of messaging and language in developing effective public health interventions. Not surprisingly, this also produced the most heterogeneous set of evidence ratings (see Tables 4 and 5). Observational studies in natural settings concluded that messages that directly emphasized benefits to individuals or protecting others had no measurable effect on behaviours67. There was some evidence for the effect of benefit-based approaches on behavioural intentions, although some studies suggested that self-benefit and other-benefit messages were differentially effective for different types of people. Although benefit-based messaging has been found to be effective in general68, there may be limits to its effect on behaviour in the context of novel health threats, where the benefits of preventative behaviours are not well established and recommendations are evolving.

Messaging related to partisan concerns was also widely studied, although not often in consequential settings. Among the studies with the most policy-relevant evidence, messages emphasizing consensus and general agreement about public health behaviours were more effective in promoting these behaviours than messages considered to be polarizing or partisan in nature (in survey studies)69. A small number of related studies in the context of marginalized communities have found that direct engagement and direct messaging were more effective70.

Claim 10, on the use of the term physical distancing being preferable to social distancing, yielded no evidence. Although eight articles were identified and required some time to review, none included direct evidence of any effect of making this change, and we have therefore excluded it from Table 4.

Social cohesion and misinformation

Claims specifically related to polarization and flawed sources of information were widely validated, with some caveats71. Across more than 200 published articles, polarizing and disingenuous messaging was consistently associated (see Table 6) with negative outcomes in terms of the effectiveness of public health interventions72. However, direct causal evidence was relatively scarce73. Studies with the highest levels of evidence have validated these patterns in consequential settings, often with medium effect sizes, indicating that greater division in messaging and a lack of social cohesion were associated with lower effectiveness of public health messaging48,73,74,75,76,77,78,79,80,81,82,83,84,85. Encouragingly, inoculating against manipulation techniques86 and prompting users to consider accuracy before sharing news87 have shown some positive effects. Again, both lines of study would benefit from replications in consequential settings, with some additional validating work on this emerging after our review started88.

Major themes not explicitly assessed

Several themes emerged during the search phase of this project that were discussed by Van Bavel et al. but not necessarily formalized in terms of a specific claim. These included the clear relevance of threat and risk perception, the role of inequality89 and racism90, skepticism towards science91, incentivizing behaviours beyond simply describing benefits (for example, by providing financial rewards for vaccination)92,93,94 and the absence of clear leadership95,96,97,98,99,100.

Threat perception

Although ignoring threats and risks was a concern raised in the 2020 article, the statement that we assessed as a claim (claim 11) did not yield a substantial amount of evidence to review. However, this was not because there was an absence of evidence relating to the general issue of threat perception. In fact, substantial research indicated that threat perception, and wilful decisions to ignore risks to self and others, were major factors during the pandemic101,102,103. However, we chose not to create an additional generic claim to assess evidence related to this, not only to maintain consistency in the method but also because much of the research on this topic has been heavily associated with the polarization, messaging and misinformation themes. Still, there is clear and compelling evidence that deliberate decisions to ignore health information had negative impacts during the pandemic102,103.

Nudging

Nudging was a widely attempted method for behavioural intervention during the pandemic. The number of attempted nudges has increased hugely since 2019, largely because of the highly behavioural nature of pandemic policies. Although not explicitly framed as a claim in the 2020 article, nudging was highlighted as a practice likely to have a substantial bearing on pandemic-related behaviour.

Overall, interventions presented as nudges had mixed effectiveness during the pandemic. Encouraging evidence found that simplifying choice architecture and making options salient (for example, through personalized text messages), as well as making it easy to become vaccinated, led to reductions in vaccine hesitancy104,105. The same has been found for improving the availability of locations at which to receive a vaccine101. Accuracy prompts have also shown some promise as nudges aimed at limiting the sharing of misinformation87,106, although replications have found generally small effect sizes for these nudges87,107. However, attempts to use lotteries to increase vaccination rates had no overall effect92, as was the case for the messaging studies reviewed under claim 7, which also had little impact67.

Because of the sheer number of trials, no single summary assessment would appropriately cover the highs and lows, or the complexity, of nudging during the pandemic. Several systematic reviews108,109 have explored the overall effectiveness of nudges, and narrower systematic reviews have considered the effectiveness of nudges that target specific behaviours, such as vaccination105. In light of this mixed picture, we strongly encourage focused systematic reviews of all nudging carried out in the context of COVID-19 and urge nuance in determining which nudges work and which do not (treating them as equivalent does not seem to be supported by the data).

Stress and coping

Unfortunately, the fear that isolation and lack of social connectedness would lead to a pandemic of mental illness was largely borne out110,111,112. Although much of daily life around the world adapted to major changes, the risk of prolonged and severe effects on mental health was widely identified, with large increases in depression, anxiety, stress and other common mental disorders reported globally113. In some cases, these effects were moderated (or at least attenuated) by being isolated along with close others114, whereas other studies have found dramatic increases in intimate partner violence and violence against women115,116. Some positive mental health outcomes had direct links to collective mindset and perspective117, although these could not offset all negative effects, consistent with the 2020 article. Those patterns indicated a need to take more multidimensional approaches to well-being and mental health, finding opportunities not only to treat or prevent illness but also to promote positive outcomes118.

Major pandemic behavioural themes

Although matters such as polarization and vaccine hesitancy were discussed in the 2020 article and turned out to be clearly relevant, other themes not specified originally have been widely studied. For example, political divisions were not the only reason individuals refused or delayed vaccination119,120, as there has also been evidence of general wilful refusal to follow public health guidelines (whether masking, social distancing, isolating when sick, avoiding unnecessary travel, vaccine hesitancy, and so on) in some individuals121. In this regard, explicit, manifest behaviours based on demographics and individual differences122 should also be reviewed, as they have not been covered here. Those patterns are not inconsistent with perceptions, beliefs and social division, but we have not explicitly assessed that evidence. Even though some evidence pointed to the benefits of communicating good and effective policies directly to the public123, more needs to be done to explore how to achieve this when there is active, deliberate intent to criticize and disrupt those policies, without inadvertently giving greater visibility to those disruptive forces.

Other major themes not covered in Van Bavel et al. include more specific predictions about what outcomes may be associated with behaviours or policy interventions. For example, although there was some mention of isolation affecting mental health, volumes of research looked at how school closures124 and curfews125 might influence children by limiting opportunities for interaction, play and development, weighed against their likely effect on mitigating the spread of illness. Similarly, beyond social media, addressing isolation might have involved better ways to engage communities in volunteering126 or other civic contributions for those who desired a more active role during periods of extended isolation.

Another theme not discussed was how traditional forms of mass media might have undermined the potentially helpful role of descriptive norms by giving disproportionate attention to anti-vaccination, conspiracy and other beliefs that did not reflect expert or even majority opinion in the general public53,127. Many countries, particularly those covered in the original article and where evidence was available for this paper, had vaccination rates above 70% (and sometimes above 90%). In these settings, messaging that focuses on the problem of vaccine refusal could mean giving a minority behaviour the same amount of attention as facts and evidence about widespread uptake and the benefits of vaccination65. In this regard, the efforts of academics and public health officials may be thwarted if media policies around ‘equal coverage’ are implemented in ways that amplify false norms and harmful, fringe ideas, given how easy it is to manipulate or control narratives that are not rooted in evidence.

General discussion

We approached this research with an appreciation that throughout the pandemic, especially early on, decisions had to be made on the basis of imperfect evidence (see ‘Building policy from imperfect evidence’ in the Supplementary information for further discussion). Our assessment focused on the quality, generalizability and policy relevance of available data rather than the average effect size of research related to the claims, using an expert-driven assessment of claims rather than formal statistical meta-analysis.

Our two teams of 72 reviewers in total assessed 747 scientific articles covering 19 highly influential claims made in 2020 about human behaviour in the pandemic. Of the 747 articles, 463 included original empirical evidence and 418 had human participants (mean sample size of 16,848). Two independent teams evaluated the studies available for each claim (see ‘Author contributions’ for specific lists). Both teams found evidence in support of 16 of the 19 claims (84%), with no evidence available for one claim. For two claims, the teams found only null effects. Overall, our review found that the Van Bavel et al. article generally anticipated meaningful topics that became relevant for research during the pandemic and, in the majority of cases, the direction of the associated research findings. In particular, it identified relevant behaviours during the pandemic (both positive and negative), likely barriers to mitigating the spread of the disease and major social challenges that would be faced by policymakers.

Aside from cultural effects on policy effectiveness, the most strongly supported intervention claim was that combating misinformation and polarization would be vital to promoting effective public health guidelines. Effective messaging, particularly in the form of engaging trusted leaders and emphasizing positive social norms, was also heavily supported in the literature. Broadly speaking, survey data strongly supported how critical it is for policy to take account of collective behaviour, shared values and effects on marginalized populations in order to be effective at minimizing harms during a pandemic.

We strongly endorse a full systematic review of all behaviours studied during the pandemic, which would map not only which behaviours were observed (or not) and across how many studies but also where and through what methods. However, we also share the concern that many studies carried out during the pandemic, particularly in the early stages, had low power42,43. Such studies have produced problematic meta-analyses, which was one reason why we chose an alternative approach.

Applying lessons for science and policy

To make use of additional evidence that stemmed from this very broad reading of the available literature, but which was not specific to any single claim, we have consolidated critical recommendations for science and policy in Table 8. The purpose is to help researchers and policymakers respond to future pandemics, disasters or other exogenous shocks. These range from largely scientific practices, such as more study of global populations and greater specificity in formulating testable questions, to primarily policy topics related to communications and managing public uncertainty. Some of these recommendations may appear obvious or generic, but they were not universally adopted in the studies that we reviewed.

Table 8 Recommendations for researchers and policymakers

In light of the challenges discussed, we offer two additional recommendations. These are specifically relevant to future public health emergencies but also apply broadly to advancing the readiness of scientific evidence for application to policy.

Follow surveys with field research

Behavioural scientists, including those studying the pandemic, often use online data collection tools that offer substantial control over the treatment and, more generally, the research environment. We do not discount the privilege or benefits of having access to these tools, or their contribution to foundational evidence that could be deployed rapidly. Online survey experiments can also offer insight into causality that, for example, descriptive or correlational field studies lack. Moreover, it is particularly encouraging that so much of that evidence converged, which further clarified the added value of having access to those resources.

Despite those benefits, we recommend seeking opportunities to progress faster from concept testing to real-world testing and implementation in future crises. In the earliest days of the pandemic, there was no clear way to do this, and large numbers of studies were derailed, postponed, abandoned or forced to be modified substantially. This also presented challenges for validating evidence generated early in the pandemic (both at that time and during this review). In addition, new research designed to speak to the crisis had to be conducted using the tools available.

It is commendable that so much valuable and applicable social science emerged despite the chaotic circumstances, loss of resources and uncertainty in the early days and months of the pandemic. Nevertheless, of the 518 studies given a rating as part of the present exercise, more than 400 were empirical studies conducted in laboratories or online settings (a proportion that may also have been skewed by both the nature of the claims and the disciplines of the individuals identifying relevant studies). Unfortunately, this also meant that many claims were only ever tested in surveys. Rather than a criticism, we state this as a strong encouragement to seek, promote and fund partnerships that can function in consequential settings, even (and perhaps especially) in the face of a public emergency.

Forge alliances

There are myriad challenges to linking scientific research to real-world practice, which were amplified during the pandemic. Although academics and institutions found ways to overcome these, practical constraints were evident throughout, whether based on resource and personnel limitations, or threats to health and safety. Other limitations may have included simply not knowing the appropriate communication channels to link researchers and policymakers.

In the future, we encourage academics who have not previously worked in such consequential environments to proactively engage with organizations delivering public services to find out where and what input they would value128. This will help to develop partnerships with local government offices129, hospitals130, banks131, schools132, local military units providing emergency personnel133 or other potential end-users of scientific evidence. Surveys are not a replacement for studying real behaviours in the field, such as donating blood, getting vaccinated, assigning volunteers, facilitating remote work, or keeping people safe while shopping or voting. Furthermore, researchers should recognize the potential for policy impacts at many levels: do not focus overly on highly visible policymakers or government employees who are difficult to access and may lack the subject-specific expertise to recognize all relevant research134.

For policymakers, managers, teachers and other institutional leaders, we also strongly recommend opening lines of communication with academics who research your professional area. Professionals seeking to apply the insights generated can experience frustration at the reluctance of researchers to offer practical advice for evidence-based policies. Academics are typically not trained in ways to bridge the science–practice–policy gaps, and future opportunities for impact are more likely to be realized if institutions also take the initiative to engage experts before emergencies. This is also a message for funding bodies: invest in initiatives that support cross-domain collaboration and translational research activity.

Both parties may feel reluctant and uncomfortable about such a collaborative effort, making them hesitant to consider working together as a viable option. Adopting a broader set of tools and research contexts would help both sides to see how they can collaborate more productively by expanding and applying their knowledge of important psychological phenomena and behavioural mechanisms in practice.

Limitations

Tables 2–6 include multiple aspects of the evidence review, including the ratings, direction, effect sizes and a summary note. These were each included because no single rating or value can fully reflect the many dimensions of each behavioural domain. We do not intend or claim to offer a perfect ‘score’ of evidence or research during the pandemic. Instead, our goal was to provide a general assessment of evidence to guide future research and policy applications. Although some policymakers were asked to participate in this study, our focus was largely on synthesizing academic expertise to inform policy. We therefore note that, at the highest level, evaluations may prioritize scientific perspectives over the insights most relevant to decision-makers and practitioners. A detailed discussion of the limitations of the approach, findings and interpretations of the work is provided in the Supplementary Methods.

Conclusion

Despite the absence of a consensus approach to selecting and synthesizing evidence, scientific research has a valuable role in public policy. The present study evaluated claims and recommendations made by researchers at the start of a global crisis. The synthesis of evidence suggests that researchers can be a viable source of policy advice in the context of a crisis, and our recommendations highlight how this can be further improved across scientific disciplines.

Our synthesis and evaluation also speak to the value of revisiting the claims (predictive, indicative or otherwise) that scientists make about events that are of substantial relevance to policy once there is sufficient evidence to assess their validity12,24,135,136. This has the capacity not only to contribute to the sort of transparency that builds trust in science and public health but also to directly inform the development of relevant knowledge and tools for the next pandemic or other crisis.

Our final recommendation is vital and is directed to all scientists, especially behavioural and social scientists, as well as to policy institutions: do not wait until the next crisis to form partnerships. It is always a good time to build relationships with organizations, clinics, schools, governments, media or any institution with which there may be mutual benefits towards building effective policies. This enables us to develop a robust and relevant evidence base and to ensure that we marshal our collective energies and resources so that we are able to use science to best effect in protecting, promoting and prioritizing the well-being of populations.

Methods

Our approach assessed evidence related to the central statements or hypotheses (claims) in the original Van Bavel et al.3 article. For the purposes of evaluation, we treated these claims as testable hypotheses, then rated the level, direction and magnitude of findings relevant to each claim. The evaluation was conducted by 72 evaluators, including authors of the original study as well as an independent group of behavioural scientists and policymakers. Their assessments focused on whether evidence from the first 2 years of the pandemic supported, refuted or left unclear the validity of the claims (see the online database available at https://psyarxiv.com/58udn for details).

Claims evaluated

We evaluated the ten claims highlighted in Table 8 of the original article, as well as five additional claims made in the main text. Those additional claims related to behaviours, themes or policies that ended up being especially relevant during the pandemic, such as vaccination choices and the influence of political polarization, but which did not clearly overlap with one of the ten primary claims. All other claim-like statements in the text were either already covered in the original ten or were not precise enough to assess. One of the original ten claims actually comprised five distinct claims, creating a total of 19 claims.

Some claims were more general and better treated as recommendations than as testable hypotheses. For example, claim 6 (see Table 1) suggests first that targeted messaging and then that partnerships with community organizations are valuable. The first part of that statement is better suited to being treated as a hypothesis, whereas the second part states only that community partnerships may be worth exploring, without implying a specific impact. In this case, we disregarded the second part and focused on evidence that could inform the first part.

Evidence used for evaluations

We identified articles and reports used for the assessment through extensive systematic and manual searches by all evaluators, with the primary criterion being that they were publicly available before 1 June 2022. Searches included using the pre-formatted systematic review code produced by PubMed (NCBI) for research specifically on COVID-19 (available at https://pubmed.ncbi.nlm.nih.gov/help/#covid19-article-filters), as well as checking preprint servers (OSF, PsyArXiv and SSRN), multiple repository search engines (Google Scholar, PsycInfo and EconLit), crowdsourcing with forms to share articles (on social media and through targeted email lists) and snowballing from relevant articles (including articles that cited the original paper). There was no restriction on location or language (see later for a discussion of the ways in which diversity of authorship enabled broader searching). This approach yielded approximately 3,000 articles initially. In a triage phase, team members checked articles for relevance, removing duplicates and articles that did not meet the criteria (such as articles not relevant to the pandemic, published outside the inclusion window or simply not relating to any of the 19 claims). Most removals were of articles that were clearly not relevant but had been submitted as being of general relevance to behavioural science during the pandemic. Those articles were typically easy to identify; any ambiguous articles were retained so that the full set of reviewers could determine relevance later. After that process, 747 articles were used in the initial assessment.
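
A simplified sketch of the triage logic is shown below. The field names, the date handling and the use of a title-similarity threshold to catch preprints later published under slightly different titles are our own illustrative assumptions rather than the exact procedure used by the team.

```python
# Illustrative triage sketch (assumed fields and thresholds): keep articles that were
# publicly available before the cutoff, drop clearly irrelevant items and collapse
# near-duplicate titles (for example, a preprint and its later published version).
from datetime import date
from difflib import SequenceMatcher

CUTOFF = date(2022, 6, 1)  # articles had to be publicly available before 1 June 2022

def same_article(title_a: str, title_b: str, threshold: float = 0.9) -> bool:
    """Treat two titles as one article if they are nearly identical."""
    return SequenceMatcher(None, title_a.lower(), title_b.lower()).ratio() >= threshold

def triage(candidates: list[dict]) -> list[dict]:
    kept: list[dict] = []
    for art in candidates:
        if art["available"] >= CUTOFF or not art["possibly_relevant"]:
            continue  # outside the inclusion window or clearly not relevant
        if any(same_article(art["title"], k["title"]) for k in kept):
            continue  # duplicate of an article already kept
        kept.append(art)
    return kept
```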

Across the 747 papers, 142 countries were represented (meaning that at least one article existed in which a country was studied directly). Total volumes by country ranged from 291 papers involving the USA to a single paper each for 24 countries. Other countries with large numbers of papers included the UK (109), Germany (78), Italy (60), China (43) and Brazil (35). Full lists of the total geographical coverage of studies are included in the Supplementary information.

To ensure that we did not miss any major studies, all reviewers were asked to search for potential additional articles after their assessments were submitted to the lead author (K.R.). Those articles had to meet the same publication deadline and were only included if they substantively influenced the overall assessment. One such article was identified69, and one set of interrelated studies was updated to include the original papers, letters to the editor and responses to those letters45,67,104. The latter three articles were added to the final list for posterity but did not impact assessments. We therefore refer only to 747 articles unless stating otherwise.

Our aim was to identify the highest level of evidence available for (or null, or against) each claim, with higher ratings going to evidence from field research (that is, studies conducted outside the laboratory or survey), particularly highly powered evidence from multiple field studies and settings. This approach addressed two dimensions of evidence quality akin to ecological validity. We use the word ‘support’ here when discussing evidence suggesting that the original claim is valid. However, major findings could also simply inform understanding, such as a highly powered but null finding. In this context, ‘support’ indicates findings that align positively with the intended meaning of the original claim.

We chose not to conduct a systematic review for several reasons. First, we were primarily interested in compiling and reporting evidence related to the claims. Conducting a systematic review would have required refining each of the claims to be more specific than originally written, which risks excluding a large amount of potentially relevant evidence137. There were also concerns that early-stage COVID-19 research was not suitable for meta-analyses because of rushed designs, small samples (that is, underpowered studies), weak correlations or a lack of appropriate methods, such as randomized controlled trials42,43. Therefore, we took a structured but more pragmatic and inclusive approach to selecting studies that addressed the claims. This allowed us to cover a broader range of evidence for both the insights and the limitations of the work conducted during the pandemic.

Similarly, we asked reviewers in both teams to decide what evidence should or should not be included as part of the process. In a policy context, this means that there may be disagreement over what evidence does or does not inform a particular issue. Those disagreements cannot be resolved through selection criteria and data extraction alone (that is, disagreements would still occur in setting the selection criteria). We also wanted to minimize the possibility that a single confounding criterion (such as permitting only messaging delivered on social media while excluding messaging in emails or letters) might bias expert assessments in the same direction. Because of this approach, we do not provide a PRISMA diagram, as each reviewer had different articles that they felt were or were not valid indications of evidence. Further details and limitations are provided in the Supplementary information.

Finally, we included preprints in the study, allowing reviewers to judge the quality and robustness of this material alongside material that was published after peer review. This decision was made because, for better or for worse138, preprints were extremely visible during the pandemic and often treated (at least by the public) as equivalent to published articles139. In addition, by including preprints, we reduced some concerns about publication bias by ensuring that articles that may not have been published because of null findings could be considered.

Evaluation teams

Two teams conducted the evidence review: 33 authors from the original paper3 were in one team and 39 independent reviewers made up the second team. The use of two independent teams was meant to minimize bias and increase the diversity of perspectives.

The lead author of this article was chosen on the basis of experience in reviewing and reporting behavioural science in public policy contexts140,141, particularly in public health142,143,144,145, for having coordinated large-team research143,144,145 and for having led the development of an evidence standard146 for evaluating research for policy. To ensure full independence, the lead author (who was not involved in the original paper) contributed to assessments only by compiling all reviewer ratings and then reviewing the articles identified as being ‘best evidence’. If any differences existed, the lead author followed up to reconcile them. In most instances, this involved discussion of what constituted ‘real world’, and the lead author proposed a standard for discussion. There were no substantial disagreements with the final ratings presented. This approach was intended to promote trust and integrity through transparency in the assessment of previous work35 by involving multiple reviewers and many papers in the evaluation.

Reviewers represented institutions from more than 30 countries. There was some representation from every continent, although reviewers were predominantly based in Europe and North America. Searches were not limited to English-language articles and extended in particular to white papers and other institutional reports on interventions during the pandemic that might inform the review. The claims that reviewers were assigned to assess were kept entirely confidential to reduce bias or influence, as were the names of all experts participating in the review. Because all reviewers were trained in the methods and coding system, names were visible within small training groups. However, only the lead author knew who had been assigned to assess particular claims.

Procedure

All reviewers followed a standardized approach to assessing each of the claims, which is included in the Supplementary information along with a brief tutorial. In each instance, reviewers received a set of articles assigned by the lead author. As the volume of papers varied substantially across claims (fewer than ten were found for claims 8 and 10; more than 100 were identified for claim 7), reviewers had different numbers of articles to review. For the claims with larger volumes, some procedural adjustments were made in which reviewers assessed only a subset of articles (a plan was established to address any issues created in the event that someone then missed highly relevant material, but this occurred only once and was easily resolved). Each reviewer was required to read the articles in the list, noting four primary aspects:

  1. Was the article relevant to the claim?

  2. What was the level of evidence?

    • No evidence, only opinions, perspectives, general theory or anecdotes

    • Some empirical evidence but in limited settings (laboratories, surveys and online)

    • A field study in a real-world setting

    • Replicated evidence in field studies or other natural settings

    • Wide-scale evidence from multiple field studies, policy evaluations or other natural settings

  3. If there was any empirical evidence, was it in support (positive), null or against (negative) the claim?

  4. If there was any empirical evidence, what was the general effect size (null, small, medium or large)?

After reviewing all articles, reviewers were asked to produce a summary of evidence for the claim, covering the same four themes. To focus the summary claim evaluation on the highest levels of evidence available, reviewers were asked to rate specifically on the basis of the highest-quality studies reviewed. In other words, rather than averaging all available evidence (as in a meta-analysis), reviewers focused their summary assessments on the highest levels of evidence. For this work, ‘highest’ was defined by the 1–5 scale, supported by considerations of statistical power, the appropriateness of the method and the scope (breadth of settings). This procedure is more relevant in a policy context, in which it is preferable to evaluate the highest-quality evidence available rather than to estimate average effects from all available evidence (all individual article ratings are available). Causality was not directly factored into the rating, given that it is not typically weighted more heavily in policy, and because doing so might have sidelined rare but valuable field studies during the pandemic while taking stock only of well-controlled online experiments.
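
As a rough illustration of this summary rule (our own simplification, using hypothetical ratings and field names rather than the team's actual scoring code), a claim-level assessment can be derived by restricting attention to the articles at the highest observed level on the 1–5 scale instead of averaging across all ratings:

```python
# Minimal sketch: summarize one claim from per-article ratings by focusing on the
# highest level of evidence available rather than averaging (hypothetical data).
from collections import Counter

ratings = [
    {"level": 2, "direction": "positive", "effect": "small"},
    {"level": 4, "direction": "positive", "effect": "medium"},
    {"level": 4, "direction": "positive", "effect": "small"},
]

top_level = max(r["level"] for r in ratings)
best = [r for r in ratings if r["level"] == top_level]  # only the highest-rated evidence

claim_summary = {
    "highest_level": top_level,  # 1 = opinion/theory only ... 5 = wide-scale field evidence
    "direction": Counter(r["direction"] for r in best).most_common(1)[0][0],
    "effect": Counter(r["effect"] for r in best).most_common(1)[0][0],
}
print(claim_summary)  # {'highest_level': 4, 'direction': 'positive', 'effect': 'medium'}
```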

All reviewers were actively encouraged to assign their own ratings for articles and claims, and there was no attempt to force unanimity across raters. Despite this, average inter-rater agreement (the percentage of times within each article that reviewer scores agreed) was high (77.5%). This value is generally good, and especially so considering that some claims had so many relevant papers that it was not possible for the same reviewers to assess them all. In addition, several papers were rated by some assigned reviewers but not others when the reviewers disagreed about the relevance of the paper to the claim (those judging a paper to be irrelevant were not asked to evaluate it).
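
The agreement statistic reported here is simple percent agreement rather than a chance-corrected index such as kappa. A minimal sketch of one way to compute it (with hypothetical ratings, interpreting agreement as matching scores across reviewer pairs within an article) is:

```python
# Percent agreement sketch (hypothetical data): for each article, the share of
# reviewer pairs whose scores matched, averaged across all rated articles.
from itertools import combinations

scores_by_article = {
    "article_1": [3, 3, 2, 3],  # level-of-evidence ratings from that article's reviewers
    "article_2": [4, 4],
}

def pairwise_agreement(scores: list[int]) -> float:
    pairs = list(combinations(scores, 2))
    return sum(a == b for a, b in pairs) / len(pairs)

overall = sum(pairwise_agreement(s) for s in scores_by_article.values()) / len(scores_by_article)
print(f"average inter-rater agreement = {overall:.1%}")  # 75.0% for these toy ratings
```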

We emphasized effect sizes and the direction of associations rather than statistical significance to address major concerns regarding validity and replicability. Relying only on P values would have increased the likelihood that publication bias and misrepresentation could have overstated the evidence in the papers that we reviewed. Following well-known concerns147,148, our guidelines did not involve P values; only effect sizes were discussed, given that they are more predictive of replicability149. This approach was largely supported by the consistently large sample sizes of the studies included (see Tables 2–6).

Furthermore, including preprints in the assessment reduced the potential role of publication bias. Similarly, we recorded the direction of effects (positive or negative) both to clarify whether a finding generally agreed with the claim and to account for possible backfire effects and the harmful side effects that these might entail (for example, if a study showed that a messaging intervention had the expected effect but also created more severe harms, the assessment could reflect this).

Evaluations were submitted to the lead author; no reviewer was allowed to see other reviews. The lead author provided anonymized evaluations to a central coordinating team that checked them for mistakes or inconsistencies (for example, ratings that did not align with the article noted as the highest evidence or with the summary statement). All material from these processes has been compiled and posted in an interactive format for public use (https://tabsoft.co/3xZwIbD). Note that all values related to sample sizes are provided as a general indication within and across claims, reflecting a simplified summary of the studies reviewed (where samples were clearly reported). As discussed in ‘Evidence for 19 pandemic behaviour claims’, there can be debate about what should be considered a sample size, as well as challenges in extracting values. Because these were not the focus of the research, we provide them for general context and have posted the compiled spreadsheet for those who may wish to revisit or recalculate them.

A detailed discussion of all procedures and their limitations is provided in the Supplementary information.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.