I did a viva recently for a student who had used one of the dreaded ‘omics’ techniques. I was fortunate enough to be sharing the examining duties with a colleague who had similar views on this to me. She said she disliked it when people ‘just shoved a load of data in her face for no reason’ which I thought was a spectacular description of what a lot of people seem to be doing these days. So today we’re going to talk about the amount of data people are generating, whether we can harness it for good using things like artificial intelligence, and how to approach these big data generating techniques properly without turning them into one epic fishing expedition. This is going to be opinionated (would you expect anything less?) and all you snazzy sequencing people are probably going to hate me.
I’ll start by saying a lot of my fear and hatred is founded in ignorance. I am not a bioinformatician, nor am I even vaguely familiar with how R works. I make all my graphs in Prism and the maximum number of variables I ever try to have is three because I’m also not great at statistics.
Let’s start by talking about the ‘omics obsession and we’ll move onto big health data in a bit.
I will say that, when appropriately applied, I can see the benefit of some of these techniques. A good friend of mine had a paper out where they showed that you can distinguish between subtypes of multiple sclerosis, something which is notoriously challenging to do, using metabolomics. They followed this up by showing you could detect different types of cancer by looking at the blood metabolome. Amazing. Clinically relevant, easily translatable and with an obvious use. But the key phrase I used at the beginning of this paragraph is ‘when appropriately applied’.
The metabolome, unlike the genome, is going to be extremely variable. It measures metabolites, which are going to be affected by whether you ate, what time of day it is, what time of the month it is, how long you slept for, whether you exercised recently, and so on. If something consistently comes out in a disease above all this potential noise, then you know it’s real, important and a useful biomarker. But you can see how, when someone presents me with data on metabolic changes in a mouse model at a single time point, I might be somewhat sceptical about the purpose of these findings. They pluck out things like glucose or acetoacetate as ‘significant’ and then I pipe up ‘yes… but what does that mean?’ and they have no answer. It doesn’t matter, they claim, it’s significant.
With genomics, or exomics or transcriptomics, I can also see some benefits. The Cancer Genome Atlas Data Portal is a project that began in 2006 as a repository for genetic data associated with over thirty types of cancer. They have petabytes of data. I don’t even know what a petabyte is but I’m assuming it’s enormous (a million gigabytes, as it turns out, so yes). And all this data has enabled them to establish that there are clusters of driver genes and passenger genes which consistently carry mutations in various cancers. And by understanding more about the pathways these genes act in, we may be able to approach tumour treatment in a more specific way.
But using the transcriptome rather than the genome is a bit like using the metabolome. You need to take into account the fact that there are biological factors which affect which genes are being transcribed. Time of day, the local microenvironment the cell is sitting in, the tissue the cell is sitting in, pathology or otherwise in the whole organism. Again, I could go on. So presenting me with single-cell transcriptomics in some cells you isolated from the brain doesn’t tell me anything beyond the fact that you took a snapshot of that cell at that time. If I took a picture outside my front door at the same time every day for a week I would end up with seven different pictures, even if we had a totally unbroken week of very un-British sunshine.
And there is the crux of the issue I have with many of these techniques. A colleague of mine summed it up really well. He said:
‘We measured absolutely everything and some things changed’
And because of our inherent bias towards positive data we like the fact that some things changed, and we do not pause to ask ourselves why they might have changed or what it might mean. The same colleague went on to say ‘We discovered three new types of cell and even though they are only very subtly different to those other cells that Bloggs described, and even though his methods were different, we propose that everyone should adopt this new name. Also, this study will now stand as an atlas of all brain cell types, for all time’. Brilliant.
So what do we do about this?
Well, I’m torn on this one. I think first we all need to think about our questions properly when we’re submitting research proposals. Is understanding what the subtle genetic changes in a specific cell type in a specific brain region at one time point in one cohort of animals after a particular brain injury going to tell you anything about the disease that will be useful? Or would it be a better use of your time to look at the genes other people have found that might be changed in similar diseases in humans, knock them down or out or target their proteins and actually see what physically happens to the progress of the disease?
But I also think we need to think about how we approach big data science. Some of this absolutely needs to be done. The cancer atlas is a great example of that. But the techniques we’re using need to be more robustly planned and executed in order to generate understandable and usable data.
Yamada and colleagues have written an extremely dry but nevertheless excellent piece in the Journal of Human Genetics on the interpretation of omics data. They highlight the major issue of intra-experimental variability: run next generation sequencing on the same sample twice and the reads from the first run will almost certainly differ from the reads from the second, even though it’s the same material. Similarly, metabolomics is known to be seriously affected by sample handling, so plasma prepped in one hospital may vary wildly from that prepped in a second.
Integrating publicly available data sets generated using these techniques comes with its own set of challenges, beyond simply dealing with intra-experimental variability. Huang and colleagues, in a Frontiers in Genetics article, point out that even with the significant number of platforms now available to scientists, which integrate different techniques and combine them with survival data and so forth, ‘biological knowledge guided integrative methods will continue to be desirable, with consideration of the interactive relationship among different omics layers’. Or, in less fancy language, you still need someone who understands the disease and the biology to tell you what all the numbers and the maps mean.
And here is where we run into a problem. There is little incentive to put big data sets out in the public domain as independent things. There is no money in just running these experiments for fun and there is no kudos for making the data available to everyone. So developing more effective tools to integrate the data and to make it biologically meaningful using machine learning and AI approaches is really challenging. Some authors, like Perez-Riverol et al in their 2019 Nature Communications article, are more optimistic. They ‘envision that as more and more data is made publicly available, more standardisation will be implemented to cross-link resources, manuscripts, datasets and the final biological molecules, making the proposed framework more robust’.
For big health data this is slightly easier. Blood pressure and heart rate, the Mini-Mental State Exam – these have their inherent variabilities and white-coat issues but are broadly more standardised. Collecting them globally and depositing them somewhere – whilst being a little bit Big Brother – I think can only be a good thing long-term. The amount of useful data coming from large health sets like the Whitehall cohort, for example, is stunning. And when we manage to reduce the intra-experimental variability and encourage long-term omics monitoring of useful cohorts of people, we might be able to begin to assign similarly meaningful biological information to the kind of data currently being generated. But in the meantime, please stop shoving your data in my face for no reason.
Dr Yvonne Couch is an Alzheimer’s Research UK Fellow at the University of Oxford. Yvonne studies extracellular vesicles and their role in changing the function of the vasculature after stroke, aiming to discover why the prevalence of dementia after stroke is three times higher than average. It is her passion for problem solving and love of science that drives her in advancing our knowledge of disease. Yvonne has joined the team of staff bloggers at Dementia Researcher, and will be writing about her work and life as she takes a new road into independent research.