April 24, 2015

The Exposome: Connecting Environmental Factors to Human Disease

Some factors in the exposome

The exposome incorporates factors such as the environment we inhabit, the food we eat, and the drugs we take.

Although genomics has dramatically improved our understanding of the molecular origins of certain human genetic diseases, our health is also influenced by exposures to our surrounding environment. Molecules found in food, air and water pollution, and prescription drugs, for example, interact with genetic, molecular, and physiologic features within our bodies in highly personalized ways. The nature of these relationships is important in determining who is immune to such exposures and who becomes sick because of them.

In the past, methods for studying this interface have been limited because of the complexity of the problem. After all, how could we possibly cross-reference a lifetime’s worth of exposures with individual genetic profiles in any kind of meaningful way? Recently, however, an explosion in the generation of quantitative data related to the environment, health, and genetics, along with new computational methods based in machine learning and bioinformatics, has made this landscape ripe for exploration.

At this year’s South by Southwest Interactive Festival in Austin, Texas, Department of Systems Biology Assistant Professor Nicholas Tatonetti and his collaborator Chirag Patel (Harvard Medical School) discussed the remarkable new opportunities that “big data” approaches offer for investigating this landscape. Driving Tatonetti and Patel’s approach is a concept called the exposome. First proposed by Christopher Wild (University of Leeds) in 2005, the exposome represents all of the environmental exposures a person has experienced during his or her life that could play a role in the onset of chronic diseases. Tatonetti and Patel’s presentation highlighted how investigation of the exposome has become tractable, as well as the important roles that individuals can play in supporting this effort.

In the following interview, Dr. Tatonetti discusses some of the approaches his team is using to explore the exposome, and how the project has evolved out of his previous research.

“Big data” in biomedical research has gotten a lot of attention in recent years. As someone who works in this field, what do you see as the key differences between what you are doing and earlier scientific methods?

In some sense, computational approaches like the ones my lab uses are actually just a high-dimensional approach to the way science has always worked. Years ago Darwin sailed to the Galapagos Islands, and he drew a beautiful figure of a tree in his notebook that represents the relationship he observed between the beaks of finches and their geographic distribution on the islands. He was able to contain his observations and interpretation of the data on a single piece of paper. Today, every genome sequencing run generates a terabyte of data, and medical records contain petabytes’ worth of data. We can’t possibly hold all of the variables in these kinds of data matrices in our heads anymore, and yet we know that there must be valuable insights hidden in there somewhere.

What we’re trying to do is to bring the technology of digesting observations and producing good scientific hypotheses up to speed with our ability to generate and collect these data. Instead of just looking at a little bit of data and coming up with one hypothesis, we process terabytes and petabytes of data, generate thousands of high-confidence hypotheses, and then evaluate and validate them just like we would any other scientific hypothesis. If a hypothesis turns out to be true we follow it up with other kinds of computational and experimental studies. If we find evidence against it we throw it out and go to the next one.

How does your own research fit into this framework?

Although my early research received some attention for identifying adverse drug events and drug-drug interactions, the thread that holds my work together is my interest in coming up with new ways of analyzing observational data sets; that is, large collections of data that are gathered opportunistically. For example, when Google records search queries and results, it accumulates large numbers of observations, providing unique opportunities to objectively identify trends within the data.

The problem is that data sets generated in this way present challenges for analysis. For example, imagine that you saw a large uptick in searches for sexually transmitted diseases on a particular day. Does this mean that all of the people doing those searches contracted an STD? This is what Google Flu Trends would assume. It’s actually more likely that there was a big news story on the topic that day that brought STDs to people’s attention. A simple thought experiment like this suggests that making inferences from observational data can be incredibly tricky, because events happen in parallel with the data you are collecting, and you can’t measure what you don’t capture. My work focuses on finding better ways to analyze such observational data sets without jumping to the wrong conclusions.

If factors that are important to the data analysis aren’t actually contained in the data, how do you compensate for them?

In the example I just gave there is just one variable: the number of searches for STDs per day. There’s no way to account for hidden information in a case like that. But if you can collect hundreds of different variables or, in the case of electronic health records, tens of thousands of different variables, you have lots of dimensions to explore.

The general strategy we’ve been using begins with the fact that a large data set offers many more dimensions to explore than you can actually use in your analysis. If you are interested in knowing whether a drug is correlated with an outcome or adverse drug reaction, you’re essentially using two variables out of a 10,000-variable data set. Our hypothesis is that those other 9,998 variables can tell you something about the underlying structure of the data.

To untangle the data we build a covariance matrix of all possible variables, and then look at variables that uniquely identify a patient population. For example, we might segregate the data set into whether a patient was exposed to a specific drug or not. We can then use this classification to identify good controls for the population. Then, we look for another subset in the data that is similarly structured and compare it to the control. In doing so, we assume that the biases inherent in the data — for example, missing information or confounding variables — are going to align. So we don’t correct for the bias, but look for another equally biased sample. As long as the biases align, the difference between the two populations should be the effect that we want to measure.
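The matching idea in this answer can be illustrated with a toy example. What follows is a minimal sketch under stated assumptions, not the lab's actual method: it swaps in plain nearest-neighbor matching on a single covariate in place of the covariance-based procedure described above, and the cohort, the `severity` covariate, and the assumed drug effect of 1.5 are all invented for illustration.

```python
from statistics import mean

def make_patients():
    """Build a toy cohort in which sicker patients are more likely to get the drug."""
    patients = []
    for severity in range(5):
        # Bias built into the data: exposure becomes more common as severity rises.
        groups = ((1, 1 + severity), (0, 5 - severity))
        for exposed, count in groups:
            for _ in range(count):
                patients.append({
                    "exposed": exposed,
                    "covariates": (float(severity),),
                    # Assumed ground truth: the drug adds 1.5 to the outcome score.
                    "outcome": 2.0 * severity + 1.5 * exposed,
                })
    return patients

def distance(a, b):
    """Squared Euclidean distance between two covariate vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def naive_estimate(patients):
    """Difference in mean outcome, ignoring how the two groups differ."""
    exposed = [p["outcome"] for p in patients if p["exposed"]]
    unexposed = [p["outcome"] for p in patients if not p["exposed"]]
    return mean(exposed) - mean(unexposed)

def matched_estimate(patients):
    """Compare each exposed patient with the most similar unexposed patient."""
    controls = [p for p in patients if not p["exposed"]]
    diffs = []
    for p in patients:
        if p["exposed"]:
            match = min(controls, key=lambda c: distance(p["covariates"], c["covariates"]))
            diffs.append(p["outcome"] - match["outcome"])
    return mean(diffs)

patients = make_patients()
print(f"naive difference:   {naive_estimate(patients):.2f}")
print(f"matched difference: {matched_estimate(patients):.2f}")
```

In this toy cohort the naive comparison overstates the effect (about 4.17) because the exposed group is sicker to begin with, while the matched comparison recovers the assumed effect of 1.5. A real medical-records data set would have thousands of covariates, and exact matches would rarely exist.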

What does all of this have to do with the exposome?

Just like the genome is the complete collection of all of our genes, or the transcriptome is all of the RNAs that have been transcribed in the cell, the exposome is the complete collection of everything a person has been exposed to during his or her life. It could include things like pollutants in air or water, chemicals from eating fish or meats, effects of living in different environments around the world, or prescription drugs, just to name a few. It’s another kind of very large, high-dimensional data set.

My collaborator, Chirag Patel, who graduated from Stanford the same year I did, has been looking at factors in the environment that affect health. He famously conducted a study he called an EWAS — an environment-wide association study. This was the first time that every possible environmental factor collected by the National Health and Nutrition Examination Survey was correlated with all possible diseases. He found some very interesting results, focusing primarily on toxins, which are often structured like small molecule drugs. Meanwhile, I was working on characterizing interactions between small molecule drugs and the body. Fundamentally we’re actually working on the same problem, and so this collaboration seemed like a perfect fit.
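The shape of an environment-wide scan can be sketched briefly. This is an illustrative sketch, not the analysis used in the published study: it assumes a simple Pearson correlation per factor, an approximate p-value from the Fisher z-transform, and a Bonferroni threshold for multiple testing. The factor names and values are synthetic placeholders, not survey data.

```python
from math import atanh, sqrt
from statistics import NormalDist, mean

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def approx_p_value(r, n):
    """Two-sided p-value from the Fisher z-transform (a normal approximation)."""
    r = max(min(r, 0.999999), -0.999999)  # keep atanh finite
    z = atanh(r) * sqrt(n - 3)
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))

def ewas_scan(factors, outcome, alpha=0.05):
    """Test every factor against one outcome using a Bonferroni-corrected threshold."""
    threshold = alpha / len(factors)  # stricter cutoff because many factors are tested
    results = {}
    for name, values in factors.items():
        r = pearson(values, outcome)
        p = approx_p_value(r, len(outcome))
        results[name] = {"r": r, "p": p, "significant": p < threshold}
    return results

# Synthetic data for 40 hypothetical participants.
n = 40
outcome = [float(i % 8) for i in range(n)]  # stand-in for a disease marker
factors = {
    # Tracks the outcome closely, with a small deterministic perturbation.
    "factor_a": [outcome[i] + 0.5 * (i % 3 - 1) for i in range(n)],
    # Constructed to be unrelated to the outcome.
    "factor_b": [float(i // 8) for i in range(n)],
    "factor_c": [float((3 * (i // 8)) % 5) for i in range(n)],
}

for name, res in ewas_scan(factors, outcome).items():
    print(name, f"r={res['r']:.2f}", "significant" if res["significant"] else "not significant")
```

The point of the correction step is that testing many factors at once inflates the number of chance findings, so the bar for any single factor is raised. A full scan would repeat this loop over every outcome as well as every factor.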

The exposome could theoretically include pretty much anything, and so it presents a problem similar to the STD example I mentioned earlier. Our approach is to focus on exposures that we can measure and quantify. For example, even when sufficient data aren’t available for the concentration of specific toxins in the environment, quantitative measurements characterizing overall air quality often exist. As long as we can quantify something, it goes into the exposome, creating an observational data set we can then interrogate using computational methods.

With this approach, is it possible to connect exposures to genetic traits within individuals that might make them more or less susceptible to a particular risk factor?

It’s still very early, but we expect that there will be certain responses to environmental factors that are shared across all humans, and others that are unique to specific individuals. Not a lot is known about how the body interacts with the environment. Clearly, it does a lot to keep itself healthy in the presence of all types of pollutants and toxins. At the same time, though, there are some people who are missing a component that’s necessary to maintain health. This missing component might only become apparent when a person is exposed to a particular toxin, chemical, or drug and gets sick. At the same time, a person with a normal genotype might be perfectly fine in that polluted environment or when treated with that same drug.

In the past, epidemiological studies have identified exposures, like smoking, that lead to changes in health, like lung cancer or diabetes. What's new in the exposome project is that we are developing data science tools that consider all environmental factors at once. This should help us to holistically understand how interactions among multiple factors influence individual responses to drugs and the environment.

What kinds of data do you use to construct your data sets?

In thinking about exposures like pollution in the environment, we can use databases developed by the Environmental Protection Agency that quantify levels of known toxins across the whole country, sometimes even at the county or city level. We can then integrate these data with health records from different locations and map their differential exposures down to the zip code. With these pieces in place, we can look at what diseases correlate with those exposures across geographical areas, highlighting potential interactions with environmental factors.
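The integration step described here amounts to joining two tables on zip code and then comparing exposure with disease prevalence. The sketch below assumes hypothetical inputs: the zip code labels, pollutant levels, and patient records are placeholders, not EPA or hospital data, and the function names are invented for illustration.

```python
from collections import defaultdict
from math import sqrt

# Hypothetical pollutant level per zip code (arbitrary units), standing in for an EPA table.
pollutant_by_zip = {"zip_A": 5.0, "zip_B": 8.0, "zip_C": 10.0, "zip_D": 12.0, "zip_E": 15.0}

# Hypothetical de-identified records: (zip code, 1 if the patient has the diagnosis, else 0).
cases_per_zip = {"zip_A": 2, "zip_B": 3, "zip_C": 4, "zip_D": 6, "zip_E": 7, "zip_F": 5}
records = []
for zip_code, cases in cases_per_zip.items():
    records += [(zip_code, 1)] * cases + [(zip_code, 0)] * (20 - cases)

def prevalence_by_zip(records):
    """Fraction of patients with the diagnosis in each zip code."""
    totals, cases = defaultdict(int), defaultdict(int)
    for zip_code, diagnosed in records:
        totals[zip_code] += 1
        cases[zip_code] += diagnosed
    return {z: cases[z] / totals[z] for z in totals}

def join_on_zip(exposure, prevalence):
    """Inner join: keep only zip codes that appear in both tables."""
    return {z: (exposure[z], prevalence[z]) for z in sorted(exposure) if z in prevalence}

def pearson(pairs):
    """Pearson correlation of a list of (x, y) pairs."""
    xs, ys = zip(*pairs)
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

joined = join_on_zip(pollutant_by_zip, prevalence_by_zip(records))
r = pearson(list(joined.values()))
print(f"{len(joined)} zip codes joined, correlation r = {r:.2f}")
```

Here `zip_F` drops out of the join because it has health records but no exposure measurement. A correlation at the area level like this one only highlights a candidate relationship for follow-up and says nothing by itself about individual patients or causation.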

In my lab we are trying to collect enough data about the drugs, the exposures, and the diseases, so that we can form hypotheses about which conditions might be connected with genetic causes. We recently began collaborating with David Goldstein using Columbia’s medical records system, where we are identifying cohorts with unique diseases or drug responses that we suspect may be the result of an underlying genetic factor. We may then sequence these patients and run proteomic analyses to identify previously unknown genetic variants involved in human disease.

How can the public at large help in contributing to our understanding of the exposome?

Electronic health records are obviously one important data source, but new personal health monitoring technologies like the Apple Watch or the Fitbit could also potentially have a role to play in this. One could imagine an app, perhaps enabled by Apple’s HealthKit, that allows patients to participate in research trials, explore their long-term health statistics, and compare themselves with their peers.

The goal would be to enable a critical mass of users who are actively collecting data about themselves, creating another kind of observational data set. This would allow us to spontaneously spawn research studies in response to new hypotheses. Ideally, we would like to improve on the current system where a finding goes to publication, then someone gets an idea for a way to implement it, then public health agencies pick it up. Currently, it just takes too long for a discovery to have an impact, and we would like to change that.

As we start on the exposome project we are bound to make a lot of mistakes along the way. But I think that is what makes a research project interesting. Hopefully in Austin we were able to begin getting people interested in participating, because the more data we can gather, the bigger the effect we’ll be able to have on improving health.

— Interview by Chris Williams

Columbia University Department of Systems Biology

Columbia University Medical Center
