
Do women academics need to work 2.4 times harder to succeed?

From the LSE Impact Blog


Nearly thirty years ago a paper was published claiming women academics need to work 2.4 times harder than their male counterparts to attain the same status. Re-examining this paper, Ulf Sandström questions its validity and suggests what higher education policymakers can learn from this overreliance on a single study.

This article is shared from the LSE Impact Blog. The article gives the views and opinions of the authors and does not reflect the views and opinions of the Impact of Social Sciences blog, nor of the London School of Economics and Political Science or Dementia Researcher. Shared under a Creative Commons Attribution 3.0 Unported (CC BY 3.0) licence, the original publication can be found at https://blogs.lse.ac.uk/impactofsocialsciences/2025/08/06/natures-decision-to-publish-positive-peer-review-reports-only-gives-half-the-picture/


In 1997, a short paper in Nature by Christine Wennerås and Agnes Wold produced a number that has echoed through academic life ever since: 2.4. The authors argued that women applying for postdoctoral fellowships from the Swedish Medical Research Council needed to be “2.4 times more productive” than men to receive equivalent competence scores.

The number spread rapidly because it did three things at once. It compressed complex peer-review procedures into a single ratio. It captured a lived experience many women recognised. And it appeared in one of the world’s most prestigious journals. Over time, “2.4×” became shorthand for systemic unfairness in science evaluation.

But what happens when a single influential estimate becomes part of the policy canon?

From finding to policy anchor

Higher education is full of “policy anchors”: emblematic studies that become reference points for reform. The Wennerås & Wold paper is a clear example. It appears in training materials, institutional reports, and policy documents across Europe and North America. It is cited not only as a contribution to academic debate but as quantitative evidence in debates about fairness, merit, and equality.


This influence can be constructive. It helped push funders and universities to examine peer-review procedures more critically. But it can also make the underlying evidence oddly fragile: once a finding becomes symbolic, it may be repeated more often than it is examined. In this way a single number can come to stand for a much wider set of arguments and findings.

What we did: reproduce, then reanalyse

In a recent paper we conducted the first full reproduction and reanalysis of Wennerås & Wold (1997) using archived records from the Swedish Medical Research Council.

Reproduction is a basic scientific practice: can the calculations be reconstructed? We found that the original computations were reproducible. But the more important question is validity: does the model support the interpretation that has been attached to it?

Why structure matters: heterogeneity in evaluation systems

The Swedish evaluations involved different committees and disciplinary areas. Those committees did not operate in a uniform publishing environment.

This matters because the signals that peer review relies on are field-dependent. Publication volume, journal prestige, authorship patterns, and national versus international outlets vary widely across disciplines. In the 1990s, for example, preclinical biomedical fields tended to publish frequently in international journals. Some clinical fields had different rhythms and outlets. Behavioural and social medicine fields often published in national journals and report series.

Committees are aware of those differences and typically evaluate candidates within field norms. Problems arise when an analysis aggregates across committees without modelling those differences: it effectively treats “productivity” and “competence scores” as uniform measures and implies that ISI-indexed journals are the only legitimate channel for communicating research.

What changes when heterogeneity is modelled?

When we explicitly model committee-level and disciplinary heterogeneity, the famous “2.4×” gender effect largely disappears. The size and significance of the coefficient depend strongly on whether structural variation across committees and fields is accounted for.

This does not imply that gender bias does not exist in evaluation systems. It does imply that the headline estimate—treated as canonical for decades—is not stable under models that better reflect the structure of the evaluation context.
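
To make this concrete, below is a minimal sketch in Python (pandas and statsmodels) using synthetic data, not the actual Research Council records. The committee names, shares of women, score scale, and coefficients are all invented for illustration. By construction there is no within-committee gender bias; women are simply over-represented in a committee that scores lower overall, so a pooled regression reports a spurious gender penalty that vanishes once committee fixed effects are added.

```python
# Minimal sketch with SYNTHETIC data (not the actual Swedish Medical
# Research Council records): a pooled regression can show a "gender
# effect" that vanishes once committee heterogeneity is modelled.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)

# Three hypothetical committees that differ in two ways at once:
# how generously they score applicants, and how many women apply.
committees = {
    "preclinical": {"score_base": 2.8, "share_women": 0.2},
    "clinical":    {"score_base": 2.4, "share_women": 0.4},
    "social_med":  {"score_base": 2.0, "share_women": 0.6},
}

rows = []
for name, c in committees.items():
    for _ in range(200):
        female = rng.random() < c["share_women"]
        productivity = rng.gamma(shape=4.0, scale=2.0)  # e.g. impact points
        # Within each committee, the score depends on productivity only:
        # by construction there is NO within-committee gender bias.
        score = c["score_base"] + 0.05 * productivity + rng.normal(0, 0.3)
        rows.append({"committee": name, "female": int(female),
                     "productivity": productivity, "score": score})
df = pd.DataFrame(rows)

# Pooled model (heterogeneity ignored): the coefficient on `female`
# comes out clearly negative, because women are over-represented in
# the committee that scores lower overall.
pooled = smf.ols("score ~ productivity + female", data=df).fit()
print("pooled female coefficient:      ", round(pooled.params["female"], 3))

# Fixed-effects model: committee dummies absorb the structural
# differences, and the gender coefficient collapses towards zero.
fe = smf.ols("score ~ productivity + female + C(committee)", data=df).fit()
print("fixed-effects female coefficient:", round(fe.params["female"], 3))
```

On this toy data the pooled model returns a clearly negative coefficient on female while the fixed-effects model returns one near zero. This illustrates the specification-dependence described above; it is not a reproduction of the original analysis.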

Why this matters for universities and funders

There are three policy-relevant lessons here.

1) Distinguish disparity from bias.

A difference in outcomes may reflect bias, but it may also reflect structural and disciplinary composition, career-stage differences, network effects, or evaluation design. Treating disparity as synonymous with discrimination can produce blunt reforms that fail to target the actual mechanisms producing inequality.

2) Avoid building reform on a single-study canon.

Landmark studies often deserve their landmark status. But policy should not depend on a single number repeated across decades. The best reforms are built on converging evidence from multiple settings, transparent methods, and ongoing evaluation.

3) Treat reproducibility as an institutional responsibility.

Reproducibility is not only about scientific virtue; it is about trust in governance. When university leaders cite research to justify reforms, they inherit part of the responsibility for whether that evidence is robust. That doesn’t mean leaders must become statisticians. It means institutions should support cultures and infrastructures where influential claims can be revisited.

Why this is not a step back from equality

It is important to be clear: revisiting one iconic estimate is not an argument against gender equality work. Evidence-based equality policy becomes stronger when it rests on rigorous, transparent claims that survive scrutiny.


In fact, the most durable reforms usually emerge when we move beyond binary debate (“biased or fair”) and instead ask: what features of evaluation systems produce unequal outcomes, and how can those features be improved?

That question invites better research and better policy.

The Wennerås & Wold paper mattered because it made a claim that many people needed to hear. But the scientific community’s credibility depends on its willingness to revisit even its most influential findings as new data and methods become available.

When a number becomes a narrative, verification matters more—not less. If higher education wants reform that lasts, it needs not only good intentions, but reliable evidence—and the confidence to revisit the evidence when it becomes policy canon.

📑 This post draws on the author’s paper, Reliability and validity of a high-profile peer review study: Probing Wennerås and Wold’s data in Nature, published open access in Quantitative Science Studies.



About the author

Ulf Sandström is a Docent (Associate Professor) in Science and Technology Studies and an affiliated researcher at KTH Royal Institute of Technology (Stockholm, Sweden). His research focuses on research evaluation, peer review, and science policy, with particular interest in metrics and institutional structures that shape academic careers. He has published widely on bibliometrics, funding systems, and reproducibility in science studies, including works in Journal of Informetrics, PNAS, PLoS ONE, Scientometrics and other journals.
