Replication studies revolve around the issue of sameness and difference. This is most obvious when it comes to the design of a replication. The NWO call for its replication studies funding program specified that it only covered “new data collection with the same research protocol as the original study” (what is usually called direct replication).1 As we noted in our interviews and observations with these replicators, it can be difficult to determine what should count as ‘the same research protocol’. Often the details of the original protocol are sketchy in the published article, and the original researchers are not always able to fill in the details when the replicators ask how exactly the experiment was set up. Sometimes the original researchers insist on elements of the original protocol that seem theoretically irrelevant to the replicators. Sometimes better methods or instruments are available than were at the disposal of the original researchers, and discussion arises about whether using them would still count as ‘the same research protocol’. Only in a few cases did we observe a deliberate difference. In one case, the replicators chose to test a different study population. Their reason was to test whether the effect holds in another country, which in turn matters for national policy making.
But sameness and difference also play a role in another way in replication studies, namely when it comes to the results. When should we count a result as ‘the same’ as the original? What criterion should we apply? When the Open Science Collaboration reported the results of its famous (or infamous, depending on whom you ask) replications of 100 original psychological experiments, the authors noted that “there is no single standard for evaluating replication success” (Open Science Collaboration, 2015, p. 943). Depending on which statistical measure was chosen, for example effect sizes or p-values, the replication rate differed, although it was always around 40%.
Sometimes the original study had several outcome measures, and the question arises which outcome measure to use as the criterion for replication success. This is the case in one of our studies, with the outcome still pending. Another study we follow attempts to replicate an original that reported three outcome measures. What should one conclude if the effect replicates on one measure, but not on the other two? Which measure is the most important? The replication study is still under way, but the replicators are already worried about this.
In statistics, variation between the outcome measures of one study or between the results of similar studies is called ‘heterogeneity’. Multi-lab replications of a single original study present some interesting challenges, because heterogeneity can be difficult to interpret. In the famous (or infamous) multi-lab replication of the pen-in-mouth facial feedback effect, none or just one of the 17 participating labs found a significant effect, depending on the specific statistical test (Wagenmakers et al., 2016).2 Such homogeneity seems clear enough evidence against the original result.3 But one of the multi-lab studies that we followed has results that are much less clear. Here, most of the participating labs reported a successful replication, but some did not, at least according to the pre-specified criteria, which focused only on the final phase of the experiment. However, the patterns of the participants’ behaviour during the experiment were very similar to the original study, even in those labs where the replication ‘failed’ because the result was not statistically significant. Because of this similarity of behaviour patterns across the studies, some of the team members considered the preregistered criteria too strict and argued that the replication was actually successful in all the labs. The discussion about the heterogeneity of the results eventually became a back and forth about the precise wording of the results in the final paper. The team settled on something like ‘the original findings were replicated in x number of labs, in y number of labs the results were weaker but still followed the general pattern’. When we asked them about the possible cause of the heterogeneity in their data, the researchers said that they could only speculate. As one of them explained: ‘the project is basically concluded when the results of the replication are published.’
In contrast to the psychology studies, for most of the medical replication studies that we follow, heterogeneity between contexts is an explicit point of attention in the design of the study. Medical researchers typically expect, or at least do not rule out, that the effect of an intervention varies between populations. Depending on what exactly they investigate, they therefore often think it is important to replicate the experiment or observations in different locations. This can either lead to different policies or clinical guidelines for different populations, or it is seen as important for assessing how robustly the results apply across populations. For the majority of the psychological replication studies, by contrast, the focus is primarily on the main effect. Although several researchers collected extra data (questionnaires, extra tasks) to allow some exploratory analyses of possible sources of heterogeneity, the main question was whether the original results could be reproduced at all, even in the multilab studies. So far, none of the replication teams has been very clear on what to do with these extra data, or whether they will use them at all.
For proponents of direct replication in psychology, the first question to be asked of empirical claims is whether the effect is real. As Olsson-Collentine, Wicherts, & Van Assen (2020, 922) write, “First, belief in the existence of an effect is established. Second, the effect’s generalizability is examined by exploring its boundary conditions.” Only after it has been ascertained that the effect exists does it become useful to explore the precise conditions under which it appears: the moderators or boundary conditions. Olsson-Collentine et al. analysed a sample of preregistered, multilab direct replications to see if there was heterogeneity between the labs. They found only limited evidence of heterogeneity. For most of the effects, the setting of the experiment and the sample of participants made little difference to the effect size. They conclude that their finding “is an argument against so called ‘hidden moderators,’ or unexpected contextual sensitivity”, often put forward as a possible explanation for why an original result cannot be reproduced in a direct replication (Olsson-Collentine et al., 2020, 936). If there is no heterogeneity in the results of a multilab study, so the argument goes, apparently the context does not matter.
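To make the notion of between-lab heterogeneity a little more concrete, here is a minimal sketch of two statistics that are widely used for it in meta-analysis: Cochran's Q and the I² index. It is an illustration under our own assumptions, not a reconstruction of the analysis by Olsson-Collentine et al. The function name and the lab effect sizes and variances are hypothetical and invented for the example.

```python
def heterogeneity(effects, variances):
    """Return (pooled effect, Cochran's Q, I-squared in percent).

    effects   : one effect size estimate per lab
    variances : the sampling variance of each estimate
    """
    # Inverse-variance weights: more precise labs count more.
    weights = [1.0 / v for v in variances]
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)

    # Q sums the weighted squared deviations of each lab from the pooled effect.
    q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, effects))

    # I-squared: share of the total variation that exceeds what sampling
    # error alone would produce (truncated at zero).
    df = len(effects) - 1
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
    return pooled, q, i2


# Hypothetical effect sizes from five labs (made-up numbers).
lab_effects = [0.21, 0.18, 0.25, 0.02, 0.30]
lab_variances = [0.010, 0.012, 0.009, 0.011, 0.010]

pooled, q, i2 = heterogeneity(lab_effects, lab_variances)
print(f"pooled effect = {pooled:.3f}, Q = {q:.2f}, I2 = {i2:.1f}%")
```

A low I² only says that the labs in this particular sample did not differ more than chance would predict. It cannot say anything about contexts or populations that were never sampled.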
Yet others have argued that such multilab direct replications do not necessarily contribute much to the discussion about the context-dependence of psychological phenomena. Although the labs are in different locations, and usually in different countries, multilab replication projects typically employ samples of participants disproportionately drawn from the WEIRD (Western, educated, industrialized, rich, democratic) cultures that dominate all psychological research, not just replications (Henrich, Heine, & Norenzayan, 2010). To properly test for heterogeneity of an effect, you have to look for it specifically, preferably guided by a theory about possible sources of heterogeneity, and choose your locations and samples accordingly (Schimmelpfennig et al., 2023).
Bryan, Tipton, and Yeager (2021) have argued that behavioural science in general suffers from a ‘heterogeneity-naive, main-effect-focussed approach’. There is too much attention to the main effect of an experimental manipulation, and not enough to the variation across different contexts and populations. This lack of sensitivity to context stands in the way of the application of behavioural science to solve real-world problems. Main effects have theoretical importance and replication studies are necessary to detect false positives, but laboratory manipulations can only become useful real-world interventions if it is known precisely in which contexts and for which people they work. Reducing false positives with direct replications is a starting point, not an end point, Bryan, Tipton, and Yeager emphasize.
Like other psychologists referred to in this blog post, Bryan, Tipton, and Yeager focus on heterogeneity between studies rather than within studies. The latter is a form of heterogeneity that is little talked about in psychology, including in the replication studies in our sample. The variation in the individual scores on the dependent variable is commonly, but not always, reported as the standard deviation, but it is seldom seen as an interesting result in its own right. As social psychologist Michael Billig (2013) noted, the individual is largely absent in (social) psychology, which instead focuses on comparing the group means of the control and experimental group. This is not surprising, given that statistics has often been used to make exactly this comparison since its origins. Nevertheless, this way of doing statistics could today also be seen as a straitjacket, because it may distract from exploring phenomena in all their depth. For example, the question why some participants reacted differently (or not at all) to the experimental manipulation is rarely posed in psychological studies. Variation between individuals is, usually tacitly, attributed to chance variation or measurement error. If individual data points are very different from the rest, they are often removed as outliers, with two standard deviations from the mean as a common criterion. Thus, individual variation is treated either as statistical noise or as an anomaly. Only one of the replication studies that we follow explicitly focuses on individual effects. We are looking forward to its results.
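The two-standard-deviation rule mentioned above is simple enough to sketch in a few lines. The example below is only an illustration: the function name and the scores are hypothetical, and real studies differ in the threshold they use and in whether the rule is applied per condition or to the whole sample.

```python
from statistics import mean, stdev


def split_outliers(scores, k=2.0):
    """Split scores into those within k standard deviations of the mean
    and those further away (the 'outliers' that would be removed)."""
    m = mean(scores)
    s = stdev(scores)  # sample standard deviation
    kept = [x for x in scores if abs(x - m) <= k * s]
    removed = [x for x in scores if abs(x - m) > k * s]
    return kept, removed


# Hypothetical scores of ten participants; one reacts very differently.
scores = [10, 11, 9, 10, 12, 10, 11, 9, 10, 30]

kept, removed = split_outliers(scores)
print("removed as outliers:", removed)
print(f"mean before: {mean(scores):.2f}, mean after: {mean(kept):.2f}")
```

In this made-up sample the one participant who responded very differently drops out of the analysis, and with that the question of why this person reacted so strongly is never asked.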
1 The quote is from NWO’s third funding call. The program also covered “repeated analysis of the data from the original study, with the same research question” (what is often called reproduction), but none of the studies in our sample are of this kind.
2 Either a one-sided default Bayes factor hypothesis test or a replication Bayes factor hypothesis test.
3 The original researcher, Fritz Strack, was quick to point to potentially meaningful differences in effect sizes among the 17 replications (Strack, 2016), which he thought might indicate that the effect depends on particular circumstances.
Bryan, C. J., Tipton, E., & Yeager, D. S. (2021). Behavioural science is unlikely to change the world without a heterogeneity revolution. Nature Human Behaviour, 5(8), 980–989. https://doi.org/10.1038/s41562-021-01143-3
Henrich, J., Heine, S. J., & Norenzayan, A. (2010). The Weirdest People in the World? Behavioral and Brain Sciences, 33(2–3), 61–83. https://doi.org/10.1017/S0140525X0999152X
Olsson-Collentine, A., Wicherts, J. M., & van Assen, M. A. L. M. (2020). Heterogeneity in direct replications in psychology and its association with effect size. Psychological Bulletin, 146(10), 922–940. https://doi.org/10.1037/bul0000294
Schimmelpfennig, R., Spicer, R., White, C., Gervais, W. M., Norenzayan, A., Heine, S., Henrich, J., & Muthukrishna, M. (2023). A Problem in Theory and More: Measuring the Moderating Role of Culture in Many Labs 2. PsyArXiv. https://doi.org/10.31234/osf.io/hmnrx
Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. Science, 349(6251), aac4716. https://doi.org/10.1126/science.aac4716
Strack, F. (2016). Reflection on the Smiling Registered Replication Report. Perspectives on Psychological Science, 11(6), 929–930. https://doi.org/10.1177/1745691616674460
Wagenmakers, E.-J., Beek, T., Dijkhoff, L., Gronau, Q. F., Acosta, A., Adams, R. B., Albohn, D. N., Allard, E. S., Benning, S. D., Blouin-Hudon, E.-M., Bulnes, L. C., Caldwell, T. L., Calin-Jageman, R. J., Capaldi, C. A., Carfagno, N. S., Chasten, K. T., Cleeremans, A., Connell, L., DeCicco, J. M., … Zwaan, R. A. (2016). Registered Replication Report: Strack, Martin, & Stepper (1988). Perspectives on Psychological Science, 11(6), 917–928. https://doi.org/10.1177/1745691616674458