Why are replications important in experimental studies?

Significant or not? When studies don't stand up to a second look

A preschooler sits alone in a room with a marshmallow on the table in front of him. The child weighs his options: he could eat the sweet treat right away. Or he could wait for the experimenter to return with a second marshmallow in hand, which he only receives if he has not already eaten the first one by then.

The marshmallow test dates back to the 1960s and 70s. It is meant to measure a child's willpower and capacity for self-control, but not only that: a long-term study from the 1990s claimed that these traits were also the key to success in later life. Children who resist the temptation of a quick reward, for example, would go on to achieve better educational qualifications.

When US researchers repeated the study in a slightly modified form last year, they made a surprising discovery: the predictive power of the marshmallow test was considerably weaker than in the original study. And once the children's educational background was taken into account, the link disappeared almost entirely. The original results could not be reproduced, even though reproduction, that is, repeating experiments to confirm their results, is an indispensable part of scientific work.

"About half of what is published as knowledge in psychology is apparently incorrect"

Arndt Reuning: The marshmallow test is one of the most famous experiments in psychology, which is why this example is particularly striking. Large-scale replication studies suggest, however, that it could be just the tip of the iceberg. How repeatable are scientific findings, really? That is the question we want to address today.

My name is Arndt Reuning, and I welcome you warmly. My guest is psychologist Dr. Felix Schönbrodt, Privatdozent at the Faculty of Psychology and Educational Sciences at Ludwig Maximilians University in Munich. Good day!

Felix Schönbrodt: Good day!

Reuning: Mr. Schönbrodt, you took part in the most recent large-scale replication project in psychology, the "Many Labs 2" study, which focused on social psychology. The project took 28 original studies and tried to repeat them. What was the result: Was it confirmed that many of the studies produce different results when they are repeated?

Schönbrodt: Yes, that's right. A single number is often cited as the result: that notorious fifty percent. Exactly half of these 28 studies could be replicated and the other half could not. Now it has to be said that this number should not be overinterpreted, because these 28 studies were not drawn at random. It could be, so to speak, that these 28 looked a bit fishy from the start. That means the fifty percent cannot necessarily be generalized to psychology as a whole. That said, we know from other replication projects that did sample studies at random: they also end up somewhere between forty and fifty percent. In other words, this confirms the picture that around half of what is published and sold as knowledge in psychology is apparently incorrect.

Reuning: So, just to make the dimensions clear again: Many experts even believe that they have identified a replication crisis in science. Can you confirm that from your point of view? Is it justified to speak of a crisis?

Schönbrodt: I think it depends on your personal judgment of when the word "crisis" is warranted. I have to say: for me personally, it was absolutely a crisis.

"I thought: What are we doing all this for?"

Reuning: Yes, why?

Schönbrodt: The crisis began to hit psychology around 2011. I had just finished my PhD. And the realization that about half of what we publish appears to be wrong clashed completely with my self-image as a scientist. So I really did go through a personal crisis in which I thought: this is not how I want to work as a scientist. I asked myself: what are we doing all this for?

Reuning: Was that what motivated you to take part in the "Many Labs 2" study?

Schönbrodt: Correct. At some point I thought to myself: Either you try to change the system and make it better, or you get out of this science system. And I decided to fight for better science, so to speak. Taking part in the "Many Labs 2" study was just one part of that.

Many authors helped replicate their own studies

Reuning: Let's look again at your approach. How exactly were the original studies replicated, or reproduced, in this project?

Schönbrodt: It was a highly standardized procedure. Wherever possible, the original authors of the studies were contacted, provided they were still reachable, for example still alive, in order to make the replication as faithful to the original as possible. For a great many of the studies, the original authors really did give their okay, saying: Yes, if you run the study like this, the same result should come out again. All of the studies were also preregistered. That means that before any data were collected, it was fixed in advance how each of the 28 studies would be conducted and how it would be analyzed. In other words, there were no remaining degrees of freedom; everything was predetermined. Another difference from the original studies was that the sample sizes were much larger: on average, each study had sixty-four times as many participants as the original. That means: if the effect actually exists, then it should also show up in the replication study.
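To make the point about sample size concrete, here is a minimal sketch, not part of the broadcast, of a standard power calculation in Python. The effect size of 0.3 and the group sizes are made-up illustration values, and the use of statsmodels is an assumption of this example; it simply shows that a small sample easily misses a real effect, while a replication with many times as many participants will almost certainly detect it.

```python
# Hypothetical illustration: statistical power of a two-sample t-test
# for an assumed effect size, at several sample sizes per group.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
effect_size = 0.3  # assumed true effect (Cohen's d), an illustration value

for n_per_group in (20, 80, 320):
    power = analysis.solve_power(effect_size=effect_size,
                                 nobs1=n_per_group,
                                 alpha=0.05)
    print(f"n = {n_per_group:3d} per group -> power = {power:.2f}")
```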

Reuning: We'll turn to the causes of this replication crisis in a moment. But first let's take a look at the Netherlands. In 2016, the funding agency NWO, roughly comparable to the DFG in Germany, provided three million euros for replication studies. My colleague Anneke Meyer was shown one such investigation.

Replication study in the Netherlands
Effect or no effect, that is the question
What has been scientifically proven is true. Unfortunately, that is not always the case. Studies that exactly repeat older studies are meant to clarify how reliable research results are. That sounds easier than it is.

Reuning: That was Anneke Meyer on a study from the Netherlands. Dr. Schönbrodt: If a study shows different results on repetition than the original work, does that mean the data or the analysis in the original work were wrong? After all, the replication itself could also be the one at fault.

Schönbrodt: Exactly right. At first you simply have two contradictory study results. From that alone you cannot conclude which one is closer to the truth, so to speak. There are various quality criteria one could apply in deciding: do I now put more trust in the original study or in the replication study? Depending on the individual case, the pendulum can swing one way or the other. In many cases nowadays I would tend to give the replication more weight. For one thing, because many replication studies use larger samples than the original studies. A second reason is that the replication study is preregistered, which means it has significantly fewer degrees of freedom in the analysis.

What does significant mean?

Reuning: Yes, in order to assess the informative value of a study, let's say in social psychology, there is one key parameter: significance. Colloquially, it means that something is clearly recognizable. So if I say I am a foot taller than my wife, that would be considered a significant difference. What about the concept of significance in statistics: does it also mean a clear difference?

Schönbrodt: Yes, that is a very difficult question, because significance is clearly defined mathematically but very hard to translate into everyday language. In fact, we know from research that many researchers do not know exactly what this concept of significance means. Significance tells me whether a difference I find, for example between two groups, say an experimental group receiving a drug versus a control group, could have come about purely by chance, or whether there is more to it, so to speak. So we are trying to separate a signal from the noise. And the significance value gives us an indication of whether we are dealing with mere noise or with a signal.

Reuning: Or, to stick with body height: you would measure a large number of people, include them in an analysis, and then determine, for example, the height difference between men and women. And the significance would indicate how reliable that mean difference is.

Schönbrodt: Precisely. Whenever I collect data, there will always be random fluctuations in it. One group will always be a bit taller than the other. The question is always whether there is a difference in the population, that is, if I were to measure all men and all women, and whether that difference is also reflected in this sample.

Taming chance with statistics

Reuning: In numerical terms, significance is expressed in the so-called p-value. And if the p-value is less than 0.05, then a study is apparently ready for publication. This value says: the probability is less than five percent that the result is due to chance. Is that correct?

Schönbrodt: No, unfortunately that is not true, although it is the typical interpretation that is given quite often. Technically, it means: if there were no difference between the two groups, then we would have a five percent probability of obtaining this data or even more extreme data. I know that doesn't sound very intuitive. And that's because it simply isn't intuitive.
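For readers who want to see that definition in action, here is a minimal Python sketch, not from the interview, that simulates the textbook situation in which there is truly no difference between two groups. The group size, the number of simulated studies, and the use of NumPy and SciPy are assumptions of this example; the point is simply that about five percent of such "null" studies still come out significant.

```python
# Hypothetical simulation: two groups drawn from the SAME distribution,
# i.e. there is no true effect. How often is the t-test still "significant"?
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_studies, n_per_group = 10_000, 30
false_positives = 0

for _ in range(n_studies):
    a = rng.normal(0, 1, n_per_group)  # control group
    b = rng.normal(0, 1, n_per_group)  # "treatment" group, same distribution
    if stats.ttest_ind(a, b).pvalue < 0.05:
        false_positives += 1

print(f"significant despite no effect: {false_positives / n_studies:.1%}")
# prints a value close to 5%, the error level the threshold is meant to control
```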

Reuning: But that also means that a certain percentage of publications are actually just chance findings?

Schönbrodt: Exactly. If we used significance testing the way it is written in the textbook, we would typically control our error rate at five percent. That would mean we are wrong in only five percent of all scientific statements.

Reuning: One twentieth of all studies should therefore be viewed critically?

Negative results are undesirable

Schönbrodt: Correct. The problem now is that scientific journals print almost exclusively significant studies, because only what is new, what has worked, is considered interesting. This means there is a filter: only studies with a significant result ever become public. And among those, the probability that a study is wrong is much higher than five percent.

Reuning: Yes, why is that?

Schönbrodt: Let's imagine we conduct a thousand studies, and in all thousand studies there is actually no effect: it's all just noise. Because of our five percent error level, fifty of the thousand studies will come out significant by chance alone. They then look as though there is an effect, even though it is really just noise. And only those fifty of the thousand get printed in the journals. If an unsuspecting reader comes along and looks at such a journal, he sees fifty exciting studies in which something comes out. Of course he thinks: well, there must be something to it. But these are in fact just chance findings that made it through the publication filter into the journal.
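The thousand-studies argument can be written down in a few lines as well. The following sketch, again not from the interview and using the same assumed simulation setup as above, imagines a journal that prints only significant results: every study it publishes is then a false positive, even though each individual study respected the five percent threshold.

```python
# Hypothetical simulation of the publication filter: 1,000 studies with no
# real effect; only the significant ones get "printed".
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_studies, n_per_group = 1_000, 30
published = 0

for _ in range(n_studies):
    a = rng.normal(0, 1, n_per_group)
    b = rng.normal(0, 1, n_per_group)
    if stats.ttest_ind(a, b).pvalue < 0.05:
        published += 1  # the journal only prints significant findings

print(f"published: {published} of {n_studies} studies")
print("every one of these published findings is a false positive")
```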

There is an incredible number of degrees of freedom

Reuning: As we have seen from the replication studies, the rate of publications that cannot be replicated is significantly higher than five percent. So some other factor must also be at play. Or do researchers also have a certain degree of freedom when analyzing their data?

Schönbrodt: There are an incredible number of degrees of freedom, almost too many. What we have learned in recent years is how easy it is to arrive at the desired result through what I'll call a certain massaging of the data. The difficulty is that this is not outright fraud that would be noticed immediately; rather, some of these techniques are accepted practice in a given field. As a student, for example, I was taught: Felix, when you run a study, don't use just one outcome measure. If you are studying well-being, don't measure only well-being; also measure positive affect, negative affect, and perhaps psychosomatic symptoms as well. Because if your first variable doesn't work out, if nothing comes of it, take a look at the second. And if nothing comes out there either, you still have a chance with the third. And so on. That was actually sold to me as good scientific practice, and the lecturer was convinced it was good practice to do science this way. Today we know, or rather we have known it for a long time, but today there is awareness, that precisely this is one of the surest ways to fool oneself into false positive results.
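What this multiple-outcomes strategy does to the error rate is easy to simulate. The sketch below is not from the interview; it assumes three statistically independent outcome measures (real outcomes would be correlated) and the same illustrative group sizes as above, and it reports how often at least one of them comes out significant even though no true effect exists anywhere.

```python
# Hypothetical simulation of the multiple-outcomes trap: test three outcomes,
# stop at the first significant one. With no true effect, the chance of a
# "hit" rises well above the nominal 5%.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n_studies, n_per_group, n_outcomes = 10_000, 30, 3
hacked_hits = 0

for _ in range(n_studies):
    for _ in range(n_outcomes):  # outcomes simulated as independent for simplicity
        a = rng.normal(0, 1, n_per_group)
        b = rng.normal(0, 1, n_per_group)
        if stats.ttest_ind(a, b).pvalue < 0.05:
            hacked_hits += 1  # report the first outcome that "works"
            break

print(f"studies with at least one significant outcome: {hacked_hits / n_studies:.1%}")
# roughly 1 - 0.95**3, about 14%, almost triple the advertised error rate
```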

Reuning: This is what is known as p-hacking, i.e. recalculating the p-value again and again until something significant comes out?

Schönbrodt: Exactly right.

Reuning: First of all, many thanks to Felix Schönbrodt from LMU Munich. He is a psychologist, and so far we have mainly talked about studies in psychology. But verifying experimental results through reproduction is also important in other disciplines, because the replication crisis has also hit the life sciences, for example. Anneke Meyer looked over the shoulders of researchers in Hesse.

Quality assurance in research
Repetition in science as a safety check
Two out of three study results in psychology are not reliable. Things don't look any better in other disciplines: research results often cannot be confirmed when measurements are repeated. Yet scientific work does not function without reproduction.

Reuning: That was a report from Anneke Meyer. Felix Schönbrodt: which academic disciplines are hit particularly hard by the replication crisis?

Schönbrodt: I think those most affected are those who look most closely. Many disciplines do not have a replication crisis simply because they do not replicate, so they don't know whether they have one. Psychology is apparently hit quite badly. But you have to add: I think we are also the discipline that has looked at itself most closely and gone through this self-reflection.

Humans - a problem for research

Reuning: How about this: does replication mean something different in psychology than, say, in the life sciences?

Schönbrodt: I think one difference is that the subject of our research, namely human beings, is simply very complex, with a great many factors at play. When I look at molecules, or at inanimate matter in physics, I think there is less variability, so it should be easier to replicate things there than in psychology.

Reuning: In a psychological study, how the test group is composed should also play a role, for example its cultural or regional background. How strongly do results depend on the context in which the data are collected?

Schönbrodt: Investigating this context dependency was actually the exciting part of "Many Labs 2". Maybe briefly: why is this so important to us psychologists? Whenever a replication failed in the past, one of the first responses from many original authors was: "Well, that didn't work because ..." And there were very, very many reasons: because the study was carried out in a different country, because it was carried out ten years later, because the material was translated, and so on. So a context dependency was postulated, in some cases, one had the impression, in order to rescue the original findings. There was accordingly a very lively debate in psychology about whether this context dependency exists. And the exciting finding of Many Labs 2 was that across the roughly sixty labs that all ran the same studies, there was basically little variability in the effects. The effects that actually existed showed up in every laboratory, in every country, in every culture, with hardly any variance. The effects that didn't exist were absent everywhere. So it is apparently not the case that psychological effects are so unstable that they are sometimes there and sometimes not. And that is reassuring for us as a discipline. Because if psychological effects were so fragile that they collapsed at the slightest change in context, we as a discipline would have entirely different worries.

Transparency as a solution

Reuning: Are there approaches, suitable for all disciplines, that help avoid non-replicable results?

Schönbrodt: Yes, a very strong movement that exists across various disciplines is the call for open science, that is, transparent science. We believe that transparency and openness, both with regard to the research process and, for example, to the raw data and results, are very important building blocks for improving the reproducibility and credibility of research.

Reuning: What would that look like in concrete terms? In the first report we heard about the Registered Report, i.e. specifying in advance what you want to analyze and which data you want to collect. Is that one tool in this open-science toolbox?

Schönbrodt: Exactly right. That is actually one of the most important tools, because it removes all those degrees of freedom with which you can creatively analyze the data until something comes out. Another tool in the toolbox would be open data. This means that the raw data are published along with the study. So far this has been very unusual in psychology, but also in many other subjects. If I disclose my data, others can first of all look at it, check what I have done, and perhaps run alternative analyses on it. And it also increases efficiency, because I can build on what others have already done: I can simply take their research results and continue from there instead of reinventing the wheel from scratch.

"Anyone who wants to make a career needs positive results"

Reuning: What about the publication culture? We touched on it briefly earlier: the incentive for researchers is to publish as much as possible, to publish only new things and, if possible, only positive results.

Schönbrodt: Right. In my view this incentive structure is perhaps one of the fundamental problems behind the replication crisis, because, as you said, anyone who wants to make a career in science needs a lot of positive, almost perfect results. To make matters worse, almost all university positions below the professorship in Germany are temporary, often one-, two- or three-year contracts. So it is not just a matter of whether I advance my scientific career a little more or a little less. It is, more existentially: can I do my dream job at all? Or will I be let go after three years because I haven't produced the publications I needed in order to continue? And of course that creates enormous incentives to deliver results that look as good as possible.

Reuning: Against this background, do you believe that science can succeed in overcoming this replication crisis at all?

Cautiously optimistic

Schönbrodt: I am optimistic now. If you had asked me four or five years ago, I might have been more ambivalent. But this movement toward open, credible science has grown very strong, though it varies by subject, more in some, less in others. Efforts are also being made to change the incentives. For example, in our department in Munich we have added a paragraph to professorship announcements stating that our department values replicable, credible research and supports it. We thereby explicitly signal that we are not only interested in sheer quantity, but in the quality and credibility of research. And candidates who apply to us are asked to include a short paragraph with their application on the extent to which they are already implementing these goals or plan to implement them in the future. And that actually made a difference for some professorship appointments.

Reuning: Does that mean that you are optimistic that this principle of open science will prevail?

Schönbrodt: Cautiously optimistic, yes.

Reuning: Thank you very much. That was Felix Schönbrodt from Ludwig Maximilians University in Munich, where he is the academic managing director of the LMU Open Science Center. And that brings us to the end of the show. "Significant or not? When studies fail a second look": that was our topic today. And with that, Arndt Reuning says goodbye from the studio.