Ah, data. Data, data, data. The lifeblood of scientific research, the foundation of knowledge, and the key to innovation, problem-solving, and modelling the future. It’s what helps us strive for objectivity, minimise bias, and make informed decisions. And yet, data is also a relentless force – constantly generated, exponentially expanding, and showing no signs of slowing down. It is overwhelming, to say the least!
But knowing that there is a difference between data that exists and data that’s actually relevant to your scientific question doesn’t make navigating what’s available any easier. I remember when I started my PhD, it took me ages to figure out how to even begin identifying the right data to help answer my question. If you’re in that boat – wondering how to find, evaluate, and use data effectively – you’re in the right place!
So, where exactly should you start? With so much data out there, working out what is actually useful, and how to get hold of it, can feel like an impossible task. But breaking it down into clear steps makes it far more manageable.
What do you actually need? Or alternatively, what is your research question, and what do you require to answer it? What sorts of data are mandatory, what are nice to have, and what are unnecessary? For example, in my PhD and my first post-doc, I used MRI near-exclusively. Specifically, I looked at both healthy brain ageing and dementias using data that had been previously acquired, which meant working within the framework of existing study designs. That was hugely beneficial for me, ultimately, but the nuances of exact protocols can be obscured (often not intentionally!) if you’re a junior researcher on a massive learning curve.
Many a time, I thought a dataset would be perfect, only for it to be missing a scan I critically needed at that time, or for the scan to be available but of a quality unusable for my needs.
Where are you going to find it? Time for a lot of conversations with colleagues and possibly many subsequent searches. Sometimes this is the end of the road for your searching: for example, if you’re into large-scale, UK-focused research that is rich in longitudinal data, you probably want to turn to one of the most comprehensive biobanks in the world, UK Biobank. If, on the other hand, you want to look at a more niche cohort, you might have to spend significantly more time looking for data, but I cannot stress enough how worthwhile it is to be thorough at this stage. You could check government and research institutions as a starting point. The question you really want to be able to answer at the end of this searching is: can we use an established dataset, or do we need to (or is it better to) collect our own data?
What exactly do you want and where are you going to put it? Firstly, make sure that wherever you are putting it has plenty of space to store it. Sounds obvious, right? You also want to check the permissions and licensing and ensure that you are adhering to things like GDPR, consent, and anonymisation requirements. Also consider how you will deal with updates and whether you need to implement a version control strategy. And, it sounds silly, but make sure you download the right type of data for your needs. MRI is, again, a good example of this, with different formats often used in clinical contexts (DICOM [.dcm]) and research (NIfTI [.nii, .nii.gz]). You may sometimes also see older formats like .hdr or .img, or specific file formats needed for specific analysis software.
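If it helps to make the storage and file-type checks above concrete, here is a minimal Python sketch. It is only an illustration: the function names, the extension table, and the "keep twice the download size free" safety factor are my own assumptions, not a standard, and the format is guessed from the file name alone without opening the file.

```python
import shutil
from pathlib import Path

# Hypothetical lookup table built from the extensions mentioned above.
FORMATS = {
    ".dcm": "DICOM",
    ".nii": "NIfTI",
    ".nii.gz": "NIfTI (gzip-compressed)",
    ".hdr": "older header/image pair",
    ".img": "older header/image pair",
}


def has_room_for(download_bytes, target_dir=".", safety_factor=2.0):
    """Return True if target_dir has at least safety_factor x the download size free.

    The safety factor is an assumed rule of thumb, to leave room for
    unpacking and for derived files created during processing.
    """
    free_bytes = shutil.disk_usage(target_dir).free
    return free_bytes >= download_bytes * safety_factor


def guess_format(filename):
    """Guess an imaging format from the file name only (contents are not inspected)."""
    name = Path(filename).name.lower()
    # Try the longest extensions first so ".nii.gz" is matched as a whole.
    for extension in sorted(FORMATS, key=len, reverse=True):
        if name.endswith(extension):
            return FORMATS[extension]
    return "unknown"
```

For example, `guess_format("sub-01_T1w.nii.gz")` would report a compressed NIfTI file. Treat "unknown" as a prompt to look closer rather than as a verdict, since DICOM files in particular are sometimes stored with no extension at all.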
Does the data need to be cleaned or preprocessed? Sometimes, when you download data, it’s not ready to be used straight away. I know, I know, that would be too simple! But this is where field standards start to come in, and speaking with colleagues or a quick Google search can start to clarify how to deal with things like missing data, standardisation, and duplicates.
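As a rough illustration of what a first cleaning pass might look like, the Python sketch below drops exact duplicate rows and lists which values are missing. Everything in it is made up for the example: the table, the column names (participant_id, age, brain_volume), and the set of strings treated as "missing". What you actually do about missing values should follow the standards of your own field.

```python
import csv
import io

# Made-up example table; the column names are hypothetical.
RAW = """participant_id,age,brain_volume
p01,64,1180
p02,,1225
p01,64,1180
p03,71,NA
"""

# Assumed placeholders for missing values; real datasets document their own.
MISSING = {"", "NA", "NaN", "n/a"}


def clean_rows(text):
    """Drop exact duplicate rows and report missing fields in the remaining rows."""
    seen = set()
    unique_rows = []
    missing = []  # (participant_id, column) pairs
    for row in csv.DictReader(io.StringIO(text)):
        key = tuple(row.items())
        if key in seen:
            continue  # identical to a row we have already kept
        seen.add(key)
        unique_rows.append(row)
        for column, value in row.items():
            if value.strip() in MISSING:
                missing.append((row["participant_id"], column))
    return unique_rows, missing


rows, missing = clean_rows(RAW)
print(len(rows), "unique rows; missing values:", missing)
```

Note that this only reports missing values rather than filling them in or dropping the participant. That choice is deliberate: it is a decision to make with colleagues, not one for a script to make silently.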
Do you understand the data you have? I know it’s tempting to dive straight in, particularly when you may feel like you’ve had to jump through a lot of hoops to get to this point. But instead, get some summary statistics and early visualisations out, and start truly getting to know your data. Do you have outliers that can justifiably be removed? Do you notice anything that doesn’t make sense?
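To give a flavour of that first look, here is a small Python sketch that produces summary statistics and flags possible outliers. The outlier rule is an assumption on my part: it uses the common convention of flagging anything more than 1.5 times the interquartile range beyond the quartiles, and the values are invented. Your field may well prefer a different rule.

```python
import statistics


def summarise(values):
    """Return basic summary statistics for a list of numbers."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),
        "q1": q1,
        "median": median,
        "q3": q3,
    }


def flag_outliers(values, k=1.5):
    """Flag values more than k x IQR below Q1 or above Q3 (an assumed convention)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Invented brain volumes; the last one looks suspicious.
volumes = [1150, 1180, 1200, 1210, 1225, 1240, 1260, 2400]
print(summarise(volumes))
print("Worth a closer look:", flag_outliers(volumes))
```

The function flags values rather than deleting them. Whether a flagged value can justifiably be removed is a judgement you still have to make, ideally after checking where it came from.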
Do you have a specific analysis plan? Now you’ve got the data, you probably have some idea what you want to do with it. This is very field-specific, but choosing the right analytical method makes all the difference and is worth spending some time thinking about. Once you finalise this, you’re good to go!
How are you going to disseminate your findings? What’s the best way to visualise your results, and how do you do justice to your data whilst also remaining transparent and clear?
Phew, that was quite a bit to take in! But hopefully, by now, you have a clearer sense of how to navigate the world of data. From acquisition and storage to selecting the right formats and ensuring privacy, it’s all about breaking it down into manageable steps.
Data might feel overwhelming at first, but with the right approach, it becomes much more manageable. So, go ahead – start exploring, and remember, you’ve got this!

Jodi Watt
Author
Dr Jodi Watt is a Postdoctoral Researcher at University of Glasgow. Jodi’s academic interests are in both healthy ageing and neurodegenerative diseases of older age, and they are currently working on drug repurposing for dementia. Previously they worked on understanding structural, metabolic and physiological brain changes with age, as measured using magnetic resonance imaging. As a queer and neurodiverse person, Jodi is also incredibly interested in improving diversity and inclusion practices both within and outside of the academic context.