Columbia Scientists Receive Grant to Integrate Clinical and Molecular Data

Integrating data sources

Clinical and molecular data are currently stored in many different databases using different semantics and different formats. A new project called DeepLink aims to develop a framework that would make it possible to compare and analyze data across platforms not originally intended to intersect. (Image courtesy of Nicholas Tatonetti.)

Medical doctors and basic biological scientists tend to speak about human health in different languages. Whereas doctors in the clinic focus on phenomena such as symptoms, drug effects, and treatment outcomes, basic scientists often concentrate on activity at the molecular and cellular levels such as genetic alterations, gene expression changes, or protein profiles. Although these various layers are all related physiologically, there is no standard terminology or framework for storing and organizing the different kinds of data that describe them, making it difficult for scientists to systematically integrate and analyze data across different biological scales. Being able to do so, many investigators now believe, could provide a more efficient and comprehensive way to understand and fight disease.

A new project recently launched by Nicholas Tatonetti (Assistant Professor in the Columbia University Departments of Systems Biology and Biomedical Informatics), along with co-principal investigators Chunhua Weng (Department of Biomedical Informatics) and Michel Dumontier (Stanford University), aims to bridge this divide. With the support of a $1.1 million grant from the National Center for Advancing Translational Sciences (NCATS), the scientists have begun to develop a tool they call DeepLink, a data translator that will integrate health-related findings at multiple scales.

As Dr. Tatonetti explains, “We want to close what we call the interoperability gap, a fundamental difference in the language and semantics used to describe the models and knowledge between the clinical and molecular domains. Our goal is to develop a scalable electronic architecture for integrating the enormous multiscale knowledge that is now available.”

At the same time that high-throughput experimental technologies are producing growing volumes of genetic and molecular research data, the clinic has been transformed by the use of electronic patient health records and other repositories of disease classifications, clinical trial outcomes, and adverse event reports. Currently these databases are housed in various locations and use different electronic systems. The National Institutes of Health’s NCATS Biomedical Data Translator program was launched to support the development of a comprehensive system for accessing and integrating these data, with the goal of accelerating the rate at which the information they contain will be translated into improvements in human health.

DeepLink is one early component of this effort. Using new technologies such as semantic knowledge graphs, the scientists intend to harmonize data found in the most commonly used resources for health records and molecular data, and then use a framework called Linked Data to make connections between data that were not originally unified. Tatonetti and his colleagues expect that the grant will enable them to build a scalable prototype that works through the numerous challenges involved in dealing with multiscale data in this way, and that could potentially intersect any categories of interest across clinical and molecular data repositories.
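To make the idea of harmonizing and then linking data more concrete, here is a minimal Python sketch. It is an illustration only and not DeepLink's implementation: the identifiers (`clin:drugA`, `mol:compound42`, and so on), the predicates, and the helper functions are all hypothetical. It assumes that each source can be expressed as subject-predicate-object statements and that a cross-reference table says which identifiers name the same entity.

```python
from collections import defaultdict

# Hypothetical statements (subject, predicate, object) from two separate sources.
clinical = [
    ("clin:drugA", "treats", "clin:disease1"),
    ("clin:drugA", "has_adverse_event", "clin:event9"),
]
molecular = [
    ("mol:compound42", "inhibits", "mol:proteinX"),
    ("mol:proteinX", "participates_in", "mol:pathwayP"),
]
# Hypothetical cross-references: (canonical identifier, alias for the same entity).
same_as = [("clin:drugA", "mol:compound42")]


def harmonize(triples, same_as):
    """Rewrite every alias to its canonical identifier."""
    canonical = {alias: canon for canon, alias in same_as}
    return [(canonical.get(s, s), p, canonical.get(o, o)) for s, p, o in triples]


def build_graph(triples):
    """Index statements by subject so edges can be followed."""
    graph = defaultdict(list)
    for s, p, o in triples:
        graph[s].append((p, o))
    return graph


def reachable(graph, start):
    """Return every node that can be reached from start by following edges."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for _, obj in graph.get(node, []):
            if obj not in seen:
                seen.add(obj)
                stack.append(obj)
    return seen


# Without harmonization the two sources stay disconnected.
unlinked = reachable(build_graph(clinical + molecular), "clin:drugA")
# After harmonization the drug is connected to its molecular pathway.
linked = reachable(build_graph(harmonize(clinical + molecular, same_as)), "clin:drugA")
```

In this toy example, the clinical drug record only reaches the molecular pathway once the two identifiers for the same compound have been mapped to one. A production system would more likely rely on standard vocabularies and a dedicated graph store than on an in-memory dictionary.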

The scientists propose that such a resource should be able to address queries by both clinicians and basic researchers. Clinicians might ask, for example: What is the best treatment for a disease given a patient’s genetic, metabolic, or proteomic profile? Or what functional variants in a particular cell type are associated with different disease outcomes? Or what metabolic differences in a certain cell type are associated with different subtypes of a particular disease? At the same time, DeepLink could enable basic scientists to answer questions like: What are all of the clinical effects of a change in function in a particular protein? Or which biological pathways are affected by a pathogenic genetic variant in a particular disease? Or what patient data are available to evaluate a molecularly derived clinical hypothesis? Such insights could provide guidance for both research and treatment.

Nicholas Tatonetti & Chunhua Weng
Columbia University Medical Center's Nicholas Tatonetti (Department of Systems Biology and Department of Biomedical Informatics) and Chunhua Weng (Department of Biomedical Informatics) are co-principal investigators in the DeepLink project.

The scientists also intend to build DeepLink in such a way that it can grow as available data sources change. Currently, bioinformatics research is often carried out by analyzing a specific snapshot of an evolving collection of data. This raises the question of how reliable and reproducible the findings are, since results could differ when the same analysis is applied to different data sources, or even to the same databases as data are added or removed over time. An important part of DeepLink’s development, then, will be to include methods for data provenance tracking, data updating, synchronization, and quality assurance.
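One simple way to picture provenance tracking is to attach a fingerprint of the exact data snapshot to every result, so that a later reader can tell whether the underlying data have changed. The Python sketch below is a hypothetical illustration, not a description of DeepLink's methods: the record fields, the `snapshot_id` and `run_analysis` helpers, and the choice of SHA-256 are all assumptions.

```python
import hashlib
import json


def snapshot_id(records):
    """Return a SHA-256 digest that identifies this exact set of records.

    Records are serialized with sorted keys and sorted order, so the same
    data always yields the same digest regardless of how it was loaded.
    """
    ordered = sorted(records, key=lambda r: json.dumps(r, sort_keys=True))
    payload = json.dumps(ordered, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def run_analysis(records):
    """Hypothetical analysis: count records per gene and tag the data used."""
    counts = {}
    for record in records:
        counts[record["gene"]] = counts.get(record["gene"], 0) + 1
    return {
        "result": counts,
        "provenance": {
            "snapshot": snapshot_id(records),
            "n_records": len(records),
        },
    }


# Hypothetical records standing in for one snapshot of an evolving database.
snapshot_v1 = [
    {"gene": "GENE_A", "patient": 1},
    {"gene": "GENE_B", "patient": 2},
]
# The same database after a new record has been added.
snapshot_v2 = snapshot_v1 + [{"gene": "GENE_A", "patient": 3}]

report_v1 = run_analysis(snapshot_v1)
report_v2 = run_analysis(snapshot_v2)
```

Comparing the `snapshot` values of two reports shows whether they were computed from identical data, which is the kind of check that makes a differing result easier to explain.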

“In recent years,” Tatonetti says, “biomedical research has become an incredibly data-rich science. The open question is how to make sense of this mind-boggling number of observations and translating what we know between the lab and the clinic. We still have a lot of work to do, but we hope that once it is done DeepLink will give researchers everywhere a way to turn data into real understanding.”

— Chris Williams