Data Science

News

At first, Xuebing Wu, PhD, was on track to pursue a research career in computer engineering. After taking a course taught by Dr. Yanda Li, a pioneer of bioinformatics, Dr. Wu’s interest quickly shifted, and he soon got hooked on genomics research and computational biology.

Xuebing Wu, PhD

“Around that time—2003 to 2004—the human genome project had just been completed, and there had been lots of enthusiasm about using computational approaches to decipher the human genome,” he said. “I was excited to dive into this field that seemed wide open for research possibilities.”

Dr. Wu joined Columbia University’s Department of Systems Biology in the fall of 2018, with a joint appointment in the Department of Medicine’s Cardiology Division. He is also a member of the Herbert Irving Comprehensive Cancer Center at NewYork-Presbyterian/Columbia and the Columbia Data Science Institute, and his lab straddles basic science and computational biology. Dr. Wu and collaborators often consider how their work can contribute to novel therapeutics.

At the center of his interests is understanding the fundamental principles of gene regulation in human cells through integrative genomics approaches. His previous work has uncovered important roles of RNA sequence and structure signals in controlling the expression and evolution of the mammalian genome. His lab currently studies RNA-centric gene regulation, focusing on mRNA structures and mRNA translation. Dr. Wu and his team are increasingly turning their attention to the development of genomic technologies, such as the CRISPR/Cas system and a high-throughput analysis technology called massively parallel reporter assays (MPRA), as well as novel computational tools and deep learning models to study gene regulation at a global scale.

New Book Coauthored by Raul Rabadan, PhD
Dr. Raul Rabadan coauthors new book that introduces techniques of topological data analysis, a rapidly growing subfield of mathematics. (Cambridge University Press)

The deluge of data in the diverse field of biology brings with it the challenge of extracting meaningful information from large biological data sets. A new book, Topological Data Analysis for Genomics and Evolution, introduces central ideas and techniques of topological data analysis and aims to explain in detail a number of specific applications to biology.

“High-throughput genomics has profoundly transformed the field of modern biology and has made it possible for scientists to make rapid scientific advances,” says the book’s co-author Dr. Raul Rabadan, professor of systems biology and founding director of Columbia University’s Program for Mathematical Genomics. “The explosion of data has hit biology, and as a result, we need new, more innovative analytical and computational tools to make sense of it all.”

Co-authored with Andrew J. Blumberg, PhD, professor of mathematics at the University of Texas at Austin, the new book discusses techniques of topological data analysis, a rapidly developing subfield of mathematics that provides a methodology for analyzing the shape of data sets. The book offers several examples of these techniques and their use in multiple areas of biology, including the evolution of viruses, bacteria, and humans; the genomics of cancer; and single-cell characterization of developmental processes.

Nicholas Tatonetti, PhD

Nicholas Tatonetti, PhD, solves problems. He has always enjoyed it, and as the informatics community has discovered, he is both creative and proficient in his methods.

Dr. Tatonetti, who was recently awarded tenure and promoted to the rank of Associate Professor in the Columbia Department of Biomedical Informatics (DBMI) and Department of Systems Biology, focuses on the use of advanced data science methods, including artificial intelligence and machine learning, to investigate medication safety. Using emerging resources, such as electronic health records (EHR) and genomics databases, his lab is working to identify for whom a given drug will be safe and effective and for whom it will not.

His path to Columbia wasn’t a traditional one, but that fits his work. Since joining in 2012, Dr. Tatonetti has used non-traditional thinking to benefit both health and healthcare.

Using both data mining of medical records and prospective lab experiments, Dr. Tatonetti created a methodology for finding and validating adverse drug reactions and drug-drug interactions. During a two-year collaboration with Pulitzer Prize-winning journalist Sam Roe of the Chicago Tribune, Dr. Tatonetti discovered that the drugs ceftriaxone and lansoprazole, when taken together, induce an arrhythmia in the heart.

The data mining identified adverse effects, while the lab experiments established causality. Dr. Tatonetti wasn’t specifically looking for a negative reaction of those particular drugs; he had no reason to suspect them.

“We are able to find things that nobody expects to happen because the world of hypotheses we consider is basically everything,” he said. “We consider every possible combination, a type of analysis that would be impossible without a huge data set and significant computational power.”
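To give a sense of what "every possible combination" implies computationally, here is a minimal sketch in Python. It is not the Tatonetti lab's method: the record layout, the drug names, and the helper names `count_pairs` and `pair_event_rates` are all hypothetical, and a real analysis would also need statistical controls for confounding that are omitted here.

```python
from itertools import combinations
from math import comb


def count_pairs(n_drugs: int) -> int:
    """Number of unordered drug pairs among n_drugs distinct drugs."""
    return comb(n_drugs, 2)


def pair_event_rates(records):
    """Tally every drug pair that co-occurs in a record.

    records: iterable of (drugs_taken, had_event) tuples (hypothetical layout).
    Returns {pair: (patients_exposed_to_both, fraction_with_event)}.
    """
    exposed = {}
    events = {}
    for drugs, had_event in records:
        # sorted(set(...)) gives each pair one canonical, duplicate-free key
        for pair in combinations(sorted(set(drugs)), 2):
            exposed[pair] = exposed.get(pair, 0) + 1
            events[pair] = events.get(pair, 0) + int(had_event)
    return {pair: (n, events[pair] / n) for pair, n in exposed.items()}


# Made-up toy records, for illustration only
toy_records = [
    (["drug_a", "drug_b"], True),
    (["drug_a", "drug_b"], True),
    (["drug_a", "drug_c"], False),
    (["drug_b", "drug_c"], False),
    (["drug_a", "drug_b", "drug_c"], False),
]
rates = pair_event_rates(toy_records)
```

The number of pairs grows quadratically with the number of drugs (1,000 drugs already yield 499,500 pairs), which is why this kind of exhaustive screen depends on large data sets and substantial computing power.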

Phyllis Thangaraj, MD/PhD student (Tatonetti lab)

Aspiring physician-scientists from Columbia's Vagelos College of Physicians and Surgeons presented their research posters at the 14th annual MD-PhD Student Research Symposium on April 25. Their research delved into a range of topics, including Alzheimer’s disease, stroke, and stem cells. The event included a guest lecture by an alumna about her own career path as a physician-scientist and culminated in a poster session judged by MD-PhD alumni who currently work at the University. Phyllis Thangaraj, an MD/PhD student in the Department of Systems Biology’s Nicholas Tatonetti lab, was named one of five poster winners at the event.

She presented work on applying machine learning methods to phenotype acute ischemic stroke patients from electronic health records. In cohort research studies, it is essential to identify a large number of subjects accurately and efficiently, but this often requires time-consuming manual review of patient charts.

“We applied machine learning methods to data within a patient’s electronic health records to develop a high-throughput way to define research cohorts,” explains Thangaraj. “Our test case is in acute ischemic stroke. We extracted clues within a person’s medical record that required minimal data processing to classify those who have had a stroke. In a separate cohort, the UK Biobank, we were able to use our model to identify patients with self-reported stroke but no mention in their medical data with 65-fold better precision than random selection of patients.” Although stroke was the test case in this particular work, she explained that their workflow could be applied to identify patients for cohorts of other diseases, particularly when the dataset has missing data. 
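The "fold better precision than random selection" comparison can be made concrete with a short Python sketch. It rests on one fact: if patients are picked at random, the expected precision equals the prevalence of the condition in the cohort. The function names and all numbers below are hypothetical and chosen only to illustrate the arithmetic; they are not figures from the study.

```python
def precision(true_positives: int, false_positives: int) -> float:
    """Fraction of flagged patients who truly have the condition."""
    flagged = true_positives + false_positives
    if flagged == 0:
        raise ValueError("no patients were flagged")
    return true_positives / flagged


def fold_over_random(model_precision: float, prevalence: float) -> float:
    """Model precision divided by the precision expected from random picks.

    Random selection has expected precision equal to the prevalence.
    """
    if not 0 < prevalence <= 1:
        raise ValueError("prevalence must be in (0, 1]")
    return model_precision / prevalence


# Hypothetical example: 1% of the cohort has the condition, and
# 65 of the 100 patients the model flags are true cases.
model_precision = precision(true_positives=65, false_positives=35)
lift = fold_over_random(model_precision, prevalence=0.01)
```

With these made-up inputs the lift comes out to roughly 65, which shows how a model can be many times more precise than chance even when its absolute precision is well below 100 percent.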


Each subgraph in this image is a family reconstructed from EHR data: each node represents an individual, and the colors represent different health conditions. (Figure: Nicholas Tatonetti, PhD, Columbia University Vagelos College of Physicians and Surgeons)

Acne is highly heritable, passed down through families via genes, but anxiety appears more strongly linked to environmental causes, according to a new study that analyzed data from millions of electronic health records to estimate the heritability of hundreds of different traits and conditions. 

As reported by the Columbia Newsroom, the findings, published in Cell by researchers at Columbia University Irving Medical Center and NewYork-Presbyterian, could streamline efforts to understand and mitigate disease risk, especially for diseases with no known disease-associated genes.

“Knowledge of a condition’s heritability, how much the condition’s variability can be attributed to genes, is essential for understanding the biological causes of the disease and for precision medicine,” says study co-leader Nicholas Tatonetti, PhD, the Herbert Irving Assistant Professor of Biomedical Informatics at Columbia University Vagelos College of Physicians and Surgeons and an assistant professor of systems biology. “It is clinically useful for estimating disease risk, customizing treatment, and tailoring patient care.”
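For readers who want the definition in symbols, the standard textbook form of broad-sense heritability is shown below. It assumes genetic and environmental contributions are independent and do not interact, and it is given only as background. The estimation procedure used in the study itself is not described in this summary.

```latex
H^2 = \frac{\sigma_G^2}{\sigma_P^2},
\qquad
\sigma_P^2 = \sigma_G^2 + \sigma_E^2
```

Here \sigma_P^2 is the total (phenotypic) variance of the trait in a population, \sigma_G^2 is the part attributable to genetic differences, and \sigma_E^2 is the part attributable to environment. A value near 1 means most of the variation tracks genetic differences, and a value near 0 means most of it tracks environment.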

From Code to Cure

Columbia Magazine

Published as the Spring 2018 cover story, Columbia Magazine

As reported by David J. Craig, senior editor at Columbia Magazine, we are living in the age of big data, and with every link we click, every message we send, and every movement we make, we generate torrents of information. In the past two years, the world has produced more than 90 percent of all the digital data that has ever been created. New technologies churn out an estimated 2.5 quintillion bytes per day.

Today, researchers at Columbia University Irving Medical Center (CUIMC) are using the power of data to identify previously unrecognized drug side effects; they are predicting outbreaks of infectious diseases by monitoring Google search queries and social-media activity; and they are developing novel cancer treatments by using predictive analytics to model the internal dynamics of diseased cells. These ambitious projects, many of which involve large interdisciplinary teams of computer scientists, engineers, statisticians, and physicians, represent the future of academic research.

Craig covers Dr. Nicholas Tatonetti's work on prescription drug safety and his innovative use of digital health and clinical records, as well as Dr. Andrea Califano's unconventional computational approaches to advancing cancer research.

To read the full article, visit the online issue of Columbia Magazine.

Nicholas P. Tatonetti, PhD, has recently been named director of clinical informatics at the Institute for Genomic Medicine (IGM) at Columbia University Medical Center. In this new role, he is charged with planning, organizing, directing and evaluating all clinical informatics efforts across the Institute. In particular, he will focus on the integration of electronic health record data for use in genetics and genomics studies.

Dr. Tatonetti, who is the Herbert Irving Assistant Professor of Biomedical Informatics with an interdisciplinary appointment in the Department of Systems Biology, specializes in advancing the application of data science in biology and health science. Researchers in his lab integrate their medical observations with systems and chemical biology models not only to explain drug effects but also to further the understanding of basic biology and human disease. They also focus on integrating high-throughput data capture technologies, such as next-generation genome and transcriptome sequencing, metabolomics, and proteomics, with the electronic medical record to study the complex interplay between genetics, environment, and disease.

At the Institute for Genomic Medicine, researchers are focused on innovative approaches to genomic medicine. Their multi-tiered approach pairs large-scale genomic sequencing and analysis with functional biology to advance the diagnosis, characterization, and treatment of genetic diseases. IGM is playing a critical role in Columbia’s overall Precision Medicine Initiative, a major University-wide effort to provide medical diagnosis, prevention, and treatment based on an individual’s variation in genes, environment, and lifestyle.

Dr. Tatonetti, who joined Columbia in 2012, is also affiliated with the Center for Computational Biology and Bioinformatics, the Department of Medicine, the Department of Biomedical Informatics, and the Center for Cancer Systems Therapeutics.