
Indefinite data retention is neither financially nor practically possible, but there are ways to give your data maximal long-term value.
Within the next decade, a pair of giant radio telescopes in South Africa and Australia will be able to generate about 700 petabytes of data each year, the equivalent of about 149 million DVDs, a stack nearly 180 kilometres high.
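The DVD comparison can be checked with quick back-of-the-envelope arithmetic. The disc capacity and thickness below are standard single-layer DVD figures assumed for illustration, not values given in the article:

```python
# Back-of-the-envelope check of the DVD comparison above.
# Assumptions: single-layer DVDs holding 4.7 GB each, 1.2 mm thick;
# 1 petabyte = 10**15 bytes.

DVD_CAPACITY_BYTES = 4.7e9      # single-layer DVD capacity
DVD_THICKNESS_M = 1.2e-3        # standard disc thickness
ANNUAL_DATA_BYTES = 700e15      # 700 petabytes per year

dvds = ANNUAL_DATA_BYTES / DVD_CAPACITY_BYTES
stack_km = dvds * DVD_THICKNESS_M / 1000

print(f"{dvds / 1e6:.0f} million DVDs")   # ~149 million
print(f"stack ~{stack_km:.0f} km high")   # ~179 km
```

Both results line up with the figures quoted in the article: roughly 149 million discs, in a stack nearly 180 kilometres tall.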
The telescopes are part of the Square Kilometre Array Observatory (SKAO), which will include more than 100,000 Christmas-tree-like wire antennas in Australia and some 200 dishes in South Africa when it is completed in 2029. These telescopes will pick up radio signals from celestial objects, and their developers hope that they will shed light on some of astronomy’s long-standing questions, such as what dark matter is and how galaxies form.
But 700 petabytes is only about 1% of the data that the array could generate. Shari Breen, head of science operations at the SKAO in Jodrell Bank, UK, estimates that it could produce some 60 exabytes — 60,000 petabytes — each year if researchers used all of its systems continuously and retained all of the data.
“The amount of money that it would take to hold our rawest forms of data is insane — I don’t even know where we would fit that many computers,” says Breen. “So, we have to make some compromises.”
Disciplines such as astronomy and the Earth and biological sciences have long grappled with unwieldy data sets. As the volume, processing speed and variety of data continue to grow, storage capacity is struggling to keep pace. At the same time, the boom in machine-learning and artificial-intelligence technologies is creating an incentive to hoard information. But unconstrained data retention is not financially viable and uses a great deal of energy.
“This is a problem that libraries have been dealing with for as long as libraries have existed,” says Kristin Briney, a librarian at the California Institute of Technology (Caltech) in Pasadena. “We cannot physically collect all the books that we want to collect, and in 50 years, the book may not be useful any more.”
Data sets, she says, are the same. “There has to be some curation that determines what is worth keeping and what is worth throwing away.”
Field-specific rules
There is no one-size-fits-all rulebook for data curation, and best practice often depends on the discipline and on the scale of a project.
The SKAO, for instance, will store the products that it makes according to what the scientists ask for in advance, says Breen. The products can range from raw data to highly processed images. So if an astronomer requests an image based on interferometry data, then the underlying data set will be discarded once the picture’s quality has been deemed sufficient, she says.
Breen, who is a principal investigator on a large astronomical survey, says that in the past, she would request raw data. “Now, I’m like, ‘No, please don’t!’,” she says. “The reality of these next-generation telescopes is that then you’ll spend all your time bogged down by enormous data sets rather than delivering the awesome science that was the whole point.” Instead, she typically asks for an interactive 3D array of pixels known as an image cube, which is easier to wrangle, she says.
Meteorologists, by contrast, still prefer to work with the raw data. The World Meteorological Organization (WMO) receives data from thousands of satellites, marine platforms, aerial surveys and ground-based stations around the world, which record parameters such as atmospheric pressure, wind speed, air temperature and humidity, often hourly.
“We have a principle in meteorology, which is that we have to archive all the original data in order to enable us to always produce any product we have ever produced out of the original data,” says WMO scientific officer Peer Hechler in Geneva, Switzerland. The meteorology community uses original data to create projections and models, but “it doesn’t make sense economically to store all these derivative data sets”, he says.
Similarly, the Wellcome Sanger Institute, a genomics research organization in Hinxton, UK, keeps most of the raw data it generates, says sequencing informatics team leader David Jackson. Its DNA database already contains some 90 petabytes of data. As a result, Jackson says, the organization needs clear data-retention policies, and soon. “You get to the point where the data becomes more of a liability than an asset,” he says.
What needs to be kept
Whatever the discipline, the first step in managing massive data sets is working out what needs to be kept and what can be thrown away. Although practices vary, librarians and data specialists say that there are some overarching principles.
Some data sets must be kept because they are irreplaceable or required by law. Others might have been used in a publication or for a government decision, and need to be stored so that future readers can see the evidence on which a decision was based.
Many funders, including the US National Institutes of Health, require that data remain available to other researchers. To do so, researchers can use shared repositories such as the generalist Zenodo and Dryad databases, or more specialist systems, including the Open Data Commons for Spinal Cord Injury. The Registry of Research Data Repositories provides an index of nearly 3,500 such resources.
The US National Science Foundation requires grant recipients to submit a data-management plan, including information about the size and storage plans for data sets, as well as how much of the grant will be allocated to this. It offers guidance that is tailored to different disciplines. For instance, the guidelines for biological sciences contain information about how to handle sensitive data relating to human participants, whereas those for mathematics have provisions for making code and software open source and contain suggestions about data formats.
The UK Natural Environment Research Council has developed a checklist that covers the data’s legal status, potential reuse and historical and scientific value, says Sam Pepler, curation manager at the Centre for Environmental Data Analysis in Leicester, UK. The list could be useful for other fields of research, too, Pepler says, but he cautions that it is subjective and that disciplines often have their own requirements.
One thing that is not subjective, however, is the importance of the metadata that describe the data set. Helen Glaves, a senior data scientist at the British Geological Survey in Nottingham, UK, says that metadata are “absolutely fundamental”. If data sets have poor metadata, she explains, their value for reuse might be limited.
Jackson says that the Sanger Institute has purged almost all of the data for which it doesn’t have sufficient metadata. For instance, he says, it discards data for human research if they don’t include information about the data-processing methods, the research participants and any associated legal and material-transfer agreements.
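A rule of this kind, under which data without sufficient metadata are purged, amounts to a simple completeness gate. A minimal sketch follows; the required field names are illustrative assumptions, not the Sanger Institute's actual schema:

```python
# Minimal sketch of a metadata-completeness gate for retention decisions.
# The required fields below are hypothetical examples, not a real schema.

REQUIRED_FIELDS = {
    "processing_method",     # how the raw data were processed
    "participant_consent",   # consent status for human data
    "transfer_agreement",    # associated legal/material-transfer terms
}

def retain(metadata: dict) -> bool:
    """Keep a data set only if every required field is present and non-empty."""
    return all(metadata.get(field) for field in REQUIRED_FIELDS)

# A record missing its transfer agreement would be flagged for purging:
record = {"processing_method": "bwa-mem v0.7", "participant_consent": "granted"}
print(retain(record))  # False
```

In practice a repository would validate against a formal metadata standard rather than a hand-written set of keys, but the decision logic is the same: incomplete metadata means the data set cannot be reliably reused, so it is not worth the cost of keeping.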
Glaves, who specializes in marine geoscience, remembers collecting seismology data in a harbour in Hong Kong about 20 years ago. One of the seismic profiles did not fit in with the others — something they weren’t able to explain at the time. Years later, she and her colleagues discovered that some of the equipment was faulty, and they could trace the issue to that data set, because the system’s serial number had been included in the metadata.
It is important to keep the metadata even if the data set is subsequently deleted, says Hugh Shanahan, a bioinformatician at Royal Holloway University of London who specializes in open science. He points to the FAIR Guiding Principles for scientific data management and stewardship (M. Wilkinson et al. Sci. Data 3, 160018; 2016). “At least people can say, ‘There was a data set that existed that had [these characteristics]’,” Shanahan says.
Glaves says that the environmental-sciences community has certified repositories for different data types. Geological measurements, for instance, are stored at the UK National Geoscience Data Centre, and oceanography and fisheries data are dealt with by the Norwegian Marine Data Centre. These resources “are run by domain experts, who understand what the data are being collected for”, she says. “They understand the potential for reuse. They understand what needs to be captured to maximize the reuse of those data, but it also means they are best placed to make informed decisions about what we need to keep.”
Coping with a data deluge
Even if data can be discarded, what does need to be stored can quickly outstrip archival capacity.
The SKAO, for example, will not be able to keep all of its astronomy products accessible continuously. Data products that are seldom used can be put on magnetic tape, Breen says. They’re slightly harder to get to and there will be a delay in accessing the information, but they will still be available, she says.
Other repositories are making similar decisions. Last year, a repository at Caltech called CaltechDATA began hosting and sharing big data, says Briney. “Our policy is that we will host it for five years, and then put it into frozen storage”, such as on tape. The data will then be kept for another five years. “I think we’re going to start seeing a lot of policies like that pop up.”
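A policy like the one Briney describes, with active hosting for five years followed by frozen storage for another five, is essentially an age-based tiering rule. The sketch below takes its thresholds from the article; the tier names and the function itself are purely illustrative:

```python
from datetime import date

# Illustrative sketch of an age-based storage-tiering rule, modelled on
# the ten-year policy described above. The thresholds come from the
# article; the tier names and structure are assumptions.

def storage_tier(deposited: date, today: date) -> str:
    """Return the storage tier for a data set based on its age."""
    age_years = (today - deposited).days / 365.25
    if age_years < 5:
        return "hosted"    # online, immediately accessible
    if age_years < 10:
        return "frozen"    # e.g. magnetic tape; retrieval is delayed
    return "review"        # candidate for deletion or re-appraisal

print(storage_tier(date(2024, 1, 1), date(2026, 1, 1)))   # hosted
print(storage_tier(date(2018, 1, 1), date(2026, 1, 1)))   # frozen
```

The appeal of such rules is that they are cheap to automate: the repository does not have to appraise each data set individually, only record when it was deposited.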
Another option is to increase data sharing so that researchers do not repeat the same observations or store several copies of a data set unnecessarily. This is an area in which domain specialists can be invaluable, says Glaves, because they can point scientists to the relevant resources. “If you have a data set that is an improvement, a higher resolution or better quality than a previous data set”, domain specialists can guide researchers on whether the older versions can be discarded, she says.
Data collection and storage, says Glaves, are quintessentially twenty-first-century problems because of the vast quantities of data that can now be created. Some researchers hope that they will be saved from having to weed out old data sets by technological leaps such as quantum-computing technologies and data centres that do not require as much energy to run. But those involved in data curation warn against keeping data for the sake of it.
It is not just about having a lot of data, says Briney. “It’s about being able to get your hands on it, understand it and then reuse it.”
Nature 651, 1121-1122 (2026)
Find the original article on the Nature Careers website at doi: https://doi.org/10.1038/d41586-026-00880-7
