
News

 

Andrea Califano
Dr. Andrea Califano

The migration away from “one-size-fits-all” medicine, particularly in cancer detection and treatment, holds great promise for patients and for the field of precision medicine. Demand and jobs are increasing for researchers, clinicians, and professionals who are at home collecting, analyzing, and using more and newer forms of data, according to a recent feature in Science magazine that spotlights Dr. Andrea Califano, founding chair of the Department of Systems Biology.

In the field of oncology, innovation in precision, or targeted, medicine continues at a rapid pace as clinicians seek better treatments for specific kinds of cancer, rather than taking a blanket approach via the traditional trifecta of radiation, chemotherapy, and surgery. To do so, they must test patients, note mutations, and identify biomarkers to determine which treatments could work best with the fewest side effects.

Scientific breakthroughs in these areas and more have led to a greater understanding of genes and their functions and have created new opportunities for precision medicine, and for those with the technical, research, and clinical skills to work in this ever-expanding field. Job applicants who can perform big data analysis and multidisciplinary research will be in particular demand, and new roles will also emerge in less obvious areas such as business, translational medicine, and genetic counseling.

New and powerful tools have aided the precision medicine movement. The Human Genome Project, the first complete mapping of human genes, published its preliminary results in 2001. The project’s numerous benefits include knowing the location of the approximately 20,500 genes identified in the body and gaining a clearer understanding of how genes are organized and operate.

Topological data analysis of cancer samples

Shown here, topological data analysis of cancer samples. (Image credit: The Rabadan Lab)

The new Program for Mathematical Genomics (PMG) aims to address a growing, and much-needed, area of research. Launched in the fall of 2017 by Raul Rabadan, a theoretical physicist in the Department of Systems Biology, the program will serve as a research hub at Columbia University where computer scientists, mathematicians, evolutionary biologists, and physicists can come together to develop new quantitative techniques for tackling fundamental biomedical problems.

"Genomic approaches are changing our understanding of many biological processes, including many diseases, such as cancer," said Dr. Rabadan, professor of systems biology and of biomedical informatics. "To uncover the complexity behind genomic data, we need quantitative approaches, including data science techniques, mathematical modeling, statistical techniques, among many others, that can extract meaningful information in a systematic way from large-scale biological systems." 

The new program is built around collaborative research opportunities to explore and develop mathematical techniques for biomedical research, leading to a deeper understanding of areas such as disease evolution, drug resistance, and innovative therapies. Inaugural members include faculty from several disciplines: statistics, computer science, engineering, and pathology, to name a few. The program will also provide education and outreach to support and promote members' work, including joint discussion groups, the development of cross-campus courses, and scientific meetings.
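The topological data analysis shown in the image above is one example of the kind of quantitative technique the program has in mind. As a rough sketch of the idea, the snippet below computes persistent homology for a synthetic point cloud using the open-source GUDHI library; the data, parameter choices, and choice of library are illustrative assumptions and do not describe the Rabadan Lab's actual tools.

```python
# Illustrative sketch only: persistent homology of a synthetic "sample x feature"
# point cloud, in the spirit of the topological data analysis pictured above.
# GUDHI and all parameter choices here are assumptions, not the lab's pipeline.
import numpy as np
import gudhi

rng = np.random.default_rng(0)
# Pretend each row is a tumor sample embedded in a low-dimensional expression space.
points = rng.normal(size=(120, 3))

# Build a Vietoris-Rips complex and compute its persistence diagram.
rips = gudhi.RipsComplex(points=points, max_edge_length=2.0)
simplex_tree = rips.create_simplex_tree(max_dimension=2)
diagram = simplex_tree.persistence()

# Long-lived features (large death minus birth) hint at robust topological
# structure, e.g. loops that may reflect cyclic patterns in the data.
for dim, (birth, death) in diagram:
    if death - birth > 0.5:
        print(f"dimension {dim}: persists from {birth:.2f} to {death:.2f}")
```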

In honor of its launch, PMG will co-host a two-day symposium February 7 to 8 on cancer genomics and mathematical data analysis. Guest speakers from Columbia University, Memorial Sloan Kettering and Cornell University will present a comprehensive overview of quantitative methods for the study of cancer through genomic approaches. 

Integrating data sources

Clinical and molecular data are currently stored in many different databases using different semantics and different formats. A new project called DeepLink aims to develop a framework that would make it possible to compare and analyze data across platforms not originally intended to intersect. (Image courtesy of Nicholas Tatonetti.)

Medical doctors and basic biological scientists tend to speak about human health in different languages. Whereas doctors in the clinic focus on phenomena such as symptoms, drug effects, and treatment outcomes, basic scientists often concentrate on activity at the molecular and cellular levels such as genetic alterations, gene expression changes, or protein profiles. Although these various layers are all related physiologically, there is no standard terminology or framework for storing and organizing the different kinds of data that describe them, making it difficult for scientists to systematically integrate and analyze data across different biological scales. Being able to do so, many investigators now believe, could provide a more efficient and comprehensive way to understand and fight disease.

A new project recently launched by Nicholas Tatonetti (assistant professor in the Columbia University Departments of Systems Biology and Biomedical Informatics), along with co-principal investigators Chunhua Weng (Department of Biomedical Informatics) and Michel Dumontier (Stanford University), aims to bridge this divide. With the support of a $1.1 million grant from the National Center for Advancing Translational Sciences (NCATS), the scientists have begun to develop a tool they call DeepLink, a data translator that will integrate health-related findings at multiple scales.

As Dr. Tatonetti explains, “We want to close what we call the interoperability gap, a fundamental difference in the language and semantics used to describe the models and knowledge between the clinical and molecular domains. Our goal is to develop a scalable electronic architecture for integrating the enormous multiscale knowledge that is now available.”
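As a rough illustration of the kind of translation such a tool is meant to automate, the sketch below maps a clinical record and a molecular record onto a shared drug vocabulary so that the two sources can be joined. The tiny tables, identifiers, and pandas-based approach are assumptions made for this example and do not reflect DeepLink's actual architecture.

```python
# Toy illustration of closing an "interoperability gap": map records from a
# clinical source and a molecular source onto shared identifiers so they can
# be joined. All vocabularies and IDs below are made up for this sketch.
import pandas as pd

# Clinical events, described with brand names and free-text outcomes.
clinical = pd.DataFrame({
    "patient_id": [1, 2, 3],
    "drug_name": ["Paxil", "Pravachol", "Paxil"],
    "outcome": ["hyperglycemia", "none", "none"],
})

# Molecular annotations, described with generic compound names and gene targets.
molecular = pd.DataFrame({
    "compound": ["paroxetine", "pravastatin"],
    "target_gene": ["SLC6A4", "HMGCR"],
})

# A shared vocabulary mapping both naming schemes to one identifier.
drug_vocabulary = {
    "Paxil": "DRUG:0001", "paroxetine": "DRUG:0001",
    "Pravachol": "DRUG:0002", "pravastatin": "DRUG:0002",
}

clinical["drug_id"] = clinical["drug_name"].map(drug_vocabulary)
molecular["drug_id"] = molecular["compound"].map(drug_vocabulary)

# Once both sources speak the same language, clinical outcomes can be linked
# to molecular targets in a single table.
linked = clinical.merge(molecular, on="drug_id")
print(linked[["patient_id", "drug_name", "target_gene", "outcome"]])
```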

Cluster computer

Students participating in a new course gain experience using the Department of Systems Biology's computing cluster, a Top500 supercomputer dedicated to biological research.

As more and more biological research moves to a “big data” model, the ability to use high-performance computing platforms for analysis is rapidly becoming an essential skill set. To prepare students to work with these new tools more successfully, the Columbia University Department of Systems Biology recently partnered with the Mailman School of Public Health to launch a new graduate-level course focused on providing a strong grounding in the fundamental concepts behind the technology.

Monthly disease risk

Columbia scientists used electronic records of 1.7 million New York City patients to map the statistical relationship between birth month and disease incidence. Image courtesy of Nicholas Tatonetti.

Columbia University Medical Center reports on a new study in the Journal of the American Medical Informatics Association led by Nicholas Tatonetti, also an assistant professor in the Department of Systems Biology.

Columbia University scientists have developed a computational method to investigate the relationship between birth month and disease risk. The researchers used this algorithm to examine New York City medical databases and found 55 diseases that correlated with the season of birth. Overall, the study indicated that people born in May had the lowest disease risk, and those born in October the highest. The study was published this week in the Journal of the American Medical Informatics Association.
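One simple way to picture such an association scan, though not a description of the authors' published method, is to compare, for each disease, the birth-month distribution of affected patients against the birth-month distribution of the overall patient population, correcting for the number of diseases tested. The column names below are assumptions made for illustration.

```python
# Hedged sketch of a birth-month association scan: for each condition, compare
# the birth-month counts of affected patients against the expected counts
# implied by all patients' birth months, with a Bonferroni correction.
# This is an illustration only, not the published study's pipeline.
import pandas as pd
from scipy.stats import chisquare

def birth_month_scan(patients: pd.DataFrame, alpha: float = 0.05) -> pd.DataFrame:
    """patients: one row per (patient_id, birth_month, condition) diagnosis."""
    # Background distribution of birth months across all patients.
    background = (patients.drop_duplicates("patient_id")["birth_month"]
                  .value_counts().reindex(range(1, 13), fill_value=0))
    background_frac = background / background.sum()

    results = []
    conditions = patients["condition"].unique()
    for condition in conditions:
        cases = patients.loc[patients["condition"] == condition]
        observed = (cases.drop_duplicates("patient_id")["birth_month"]
                    .value_counts().reindex(range(1, 13), fill_value=0))
        expected = background_frac * observed.sum()
        stat, p = chisquare(observed, f_exp=expected)
        results.append({"condition": condition, "p_value": p})

    out = pd.DataFrame(results)
    # Bonferroni: one test per condition examined.
    out["significant"] = out["p_value"] < alpha / len(conditions)
    return out.sort_values("p_value")
```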

“This data could help scientists uncover new disease risk factors,” said study senior author Nicholas Tatonetti, PhD, an assistant professor of biomedical informatics at Columbia University Medical Center (CUMC) and Columbia’s Data Science Institute. The researchers plan to replicate their study with data from several other locations in the U.S. and abroad to see how results vary with the change of seasons and environmental factors in those places. By identifying what’s causing disease disparities by birth month, the researchers hope to figure out how they might close the gap.

Some factors in the exposome

The exposome incorporates factors such as the environment we inhabit, the food we eat, and the drugs we take.

Although genomics has dramatically improved our understanding of the molecular origins of certain human genetic diseases, our health is also influenced by exposures to our surrounding environment. Molecules found in food, air and water pollution, and prescription drugs, for example, interact with genetic, molecular, and physiologic features within our bodies in highly personalized ways. The nature of these relationships is important in determining who is immune to such exposures and who becomes sick because of them.

In the past, methods for studying this interface have been limited by the complexity of the problem. After all, how could we possibly cross-reference a lifetime’s worth of exposures with individual genetic profiles in any kind of meaningful way? Recently, however, an explosion in the generation of quantitative data related to the environment, health, and genetics, along with new computational methods based in machine learning and bioinformatics, has made this landscape ripe for exploration.
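One deliberately simplified way to frame such an exploration is as an exposure-wide scan: test each recorded exposure for association with a health outcome and correct for the many tests performed. The variable names and the statistical test below are assumptions chosen for clarity, not a description of the researchers' actual methods.

```python
# Simplified exposure-wide association scan: for each exposure variable, test
# whether its levels differ between individuals with and without an outcome,
# then apply a Benjamini-Hochberg correction. Illustrative assumptions only.
import numpy as np
import pandas as pd
from scipy.stats import mannwhitneyu

def exposure_scan(exposures: pd.DataFrame, outcome: pd.Series) -> pd.DataFrame:
    """exposures: rows = individuals, columns = measured exposures (numeric).
    outcome: boolean Series aligned with exposures' index."""
    records = []
    for name in exposures.columns:
        cases = exposures.loc[outcome, name].dropna()
        controls = exposures.loc[~outcome, name].dropna()
        stat, p = mannwhitneyu(cases, controls, alternative="two-sided")
        records.append({"exposure": name, "p_value": p})

    results = pd.DataFrame(records).sort_values("p_value").reset_index(drop=True)
    m = len(results)
    # Benjamini-Hochberg false discovery rate at 5%.
    ranks = np.arange(1, m + 1)
    passed = results["p_value"].to_numpy() <= 0.05 * ranks / m
    cutoff = ranks[passed].max() if passed.any() else 0
    results["significant"] = ranks <= cutoff
    return results
```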

At this year’s South by Southwest Interactive Festival in Austin, Texas, Department of Systems Biology Assistant Professor Nicholas Tatonetti and his collaborator Chirag Patel (Harvard Medical School) discussed the remarkable new opportunities that “big data” approaches offer for investigating this landscape. Driving Tatonetti and Patel’s approach is a concept called the exposome. First proposed by Christopher Wild (University of Leeds) in 2005, the exposome represents all of the environmental exposures a person has experienced during his or her life that could play a role in the onset of chronic diseases. Tatonetti and Patel’s presentation highlighted how investigation of the exposome has become tractable, as well as the important roles that individuals can play in supporting this effort.

In the following interview, Dr. Tatonetti discusses some of the approaches his team is using to explore the exposome, and how the project has evolved out of his previous research.

Chris Wiggins

In a “Most Creative People” feature, Fast Company magazine recently interviewed associate professor Chris Wiggins, a faculty member of the Department of Systems Biology and Center for Computational Biology and Bioinformatics, about his new appointment at one of the world’s most respected outlets for digital journalism. In this role, he will lead the development of a machine learning team that will help the New York Times to better understand how its audience is using and navigating its content.

In the interview, Dr. Wiggins explains why machine learning is becoming increasingly important in the age of big data and discusses the shared challenges that the natural sciences and the media now face.

Searches for hyperglycemia-related terms

Percentage of users in each of the three user groups searching for hyperglycemia-related terms, computed per week over 12 months of search log data. Background refers to the fraction of all searchers who search for hyperglycemia-related symptoms or terminology independent of the presence of the drugs in the users’ search histories.

Although the US Food and Drug Administration (FDA) and other agencies collect and analyze reports of adverse drug effects, alerts for single drugs and drug-drug interactions are often delayed due to the time it takes to accumulate evidence. Columbia University Department of Systems Biology faculty member Nicholas Tatonetti, in collaboration with investigators at Stanford University and Microsoft Research, hypothesized that Internet users can provide early clues about adverse drug events as they seek information on the web concerning symptoms they are experiencing. A new paper explains their results.

As a test, Tatonetti and colleagues asked whether it would be possible to detect evidence of an interaction between the antidepressant paroxetine and the anti-cholesterol drug pravastatin by analyzing web search logs from 2010. During his postdoctoral work at Stanford, Tatonetti and colleagues had used a data mining algorithm to analyze FDA adverse event reporting records and retroactively found this combination to be associated with hyperglycemia (high blood sugar) in some patients. In the new project, the researchers analyzed the search logs of millions of Internet users from a period before that association was identified to see how often they entered search terms related to hyperglycemia and to one or both of the medications under investigation. (Participants in the study opted in by voluntarily installing a web browser extension that tracked their activity anonymously.)
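The figure caption above describes the core quantity: the share of users in each drug-exposure group whose searches include hyperglycemia-related terms. Below is a minimal sketch of how such proportions could be computed from a search-log table; the column names, term list, and grouping logic are assumptions for illustration, not the study's actual data schema.

```python
# Minimal sketch: estimate, for each drug-exposure group, the fraction of users
# whose search logs contain hyperglycemia-related terms. Column names and the
# term list are assumptions for illustration; queries are assumed lowercased.
import pandas as pd

HYPERGLYCEMIA_TERMS = {"hyperglycemia", "high blood sugar", "frequent urination"}

def group_rates(searches: pd.DataFrame) -> pd.Series:
    """searches: one row per query, with columns 'user_id' and 'query'."""
    per_user = searches.groupby("user_id")["query"].apply(set)

    def group_for(queries: set) -> str:
        has_parox = any("paroxetine" in q or "paxil" in q for q in queries)
        has_prava = any("pravastatin" in q or "pravachol" in q for q in queries)
        if has_parox and has_prava:
            return "both drugs"
        if has_parox:
            return "paroxetine only"
        if has_prava:
            return "pravastatin only"
        return "background"

    def mentions_symptom(queries: set) -> bool:
        return any(any(term in q for term in HYPERGLYCEMIA_TERMS) for q in queries)

    users = pd.DataFrame({
        "group": per_user.apply(group_for),
        "symptom": per_user.apply(mentions_symptom),
    })
    # Fraction of users in each group whose queries mention a symptom term.
    return users.groupby("group")["symptom"].mean()
```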