Algorithms

News

PrePPI inputs
PrePPI predicts the likelihood that two proteins A and B are capable of interacting based on their similarities to other proteins that are known to interact. This requires integrating structural data (green) with other kinds of information (blue), such as evidence of protein co-activity in other species and involvement in similar cellular functions. PrePPI now offers a searchable database of unprecedented scope, constituting a virtual interactome of all proteins in human cells. (Image courtesy of eLife.)

The molecular machinery within every living cell includes enormous numbers of components functioning at many different levels. Features like genome sequence, gene expression, proteomic profiles, and chromatin state are all critical in this complex system, but studying a single level is often not enough to explain why cells behave the way they do. For this reason, systems biology strives to integrate different types of data, developing holistic models that more comprehensively describe networks of interactions that give rise to biological traits. 

Although the concept of an interaction network can seem abstract, at its foundation each interaction is a physical event that takes place when two proteins encounter one another, bind, and cause a change that affects a cell’s activity. For this to happen, however, the two proteins need to have compatible shapes and physical properties. Being able to predict the entire universe of possible pairwise protein-protein interactions could therefore be immensely valuable to systems biology, as it could both offer a framework for interpreting the feasibility of interactions proposed by other methods and potentially reveal unique features of networks that other approaches might miss.

In a 2012 paper in Nature, scientists in the laboratory of Barry Honig first presented a landmark algorithm and database they call PrePPI (Predicting Protein-Protein Interactions). At the time, PrePPI used a novel computational strategy that deployed concepts from structural biology to predict approximately 300,000 protein-protein interactions, a dramatic increase in the number of available interactions when compared with experimentally generated resources.

Since then, the Honig Lab has been working hard to improve PrePPI’s scope and usefulness. In a paper recently published in eLife they now report on some impressive developments. With enhancements to their algorithm and the incorporation of several new types of data into its analysis, the PrePPI database now contains more than 1.35 million predictions of protein-protein interactions, covering about 85% of the entire human proteome. This makes it the largest resource of its kind. In parallel with these improvements, the investigators have also begun to apply PrePPI in new ways, using the information it contains to provide new kinds of insights into the organization and function of protein interaction networks.
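One simple way to picture how several independent lines of evidence can be merged into a single interaction likelihood is a naive Bayes combination of likelihood ratios. The sketch below is illustrative only: the evidence names, the numeric ratios, and the prior odds are invented assumptions, and it should not be read as PrePPI's actual code or parameters.

```python
from math import prod


def combined_probability(likelihood_ratios, prior_odds):
    """Combine independent evidence likelihood ratios into a posterior probability.

    Assumes the evidence sources are conditionally independent, so their
    likelihood ratios can simply be multiplied together.
    """
    posterior_odds = prior_odds * prod(likelihood_ratios.values())
    return posterior_odds / (1.0 + posterior_odds)


# Hypothetical evidence for one protein pair (A, B); every number is made up.
evidence = {
    "structural_similarity": 40.0,  # structural evidence
    "co_expression": 3.0,           # non-structural evidence
    "shared_function": 5.0,         # non-structural evidence
}
PRIOR_ODDS = 1 / 1000  # assumed prior odds that a random pair interacts

p = combined_probability(evidence, PRIOR_ODDS)
print(f"posterior probability of interaction: {p:.3f}")
```

In this toy case no single source is decisive, but together they move a very unlikely prior to a substantial posterior probability, which is the intuition behind integrating structural and non-structural evidence.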

Cell Types in Autism

Using a new computational pipeline called DAMAGES, Chaolin Zhang and Yufeng Shen showed that brain cell types on the left of the plot are more prone to have rare autism risk mutations than cell types on the right. Narrowing the focus to these types of cells also helped to identify a molecular signature of the disorder that involves haploinsufficiency. Figure: Human Mutation.

Autism, a spectrum of neurodevelopmental disorders typically identified during early childhood, is widely thought to be the result of genetic alterations that change how the growing brain is wired. Nevertheless, despite a substantial effort in the field of autism genetics, the specific alterations that place one child at greater risk than another remain elusive. Although the list of alterations associated with autism is growing, it has been difficult to conclusively distinguish those that truly increase disease risk from those that are merely coincident with it. One troubling reason for this is that research so far seems to indicate that specific genetic abnormalities associated with autism risk are extremely rare, with many being found only in single patients. This has made it hard to reproduce findings conclusively.

In a paper recently published in the journal Human Mutation, Department of Systems Biology faculty members Chaolin Zhang and Yufeng Shen describe a method and some new findings that could help to more precisely identify rare autism-driving alterations. A new analytical pipeline they call DAMAGES (Disease Associated Mutation Analysis using Gene Expression Signatures) uses a unique approach to identifying autism risk genes, looking at differences in gene expression among different cell types in the brain in order to focus more specifically on mechanisms that are likely to be relevant for autism. Using this approach, they identified a pronounced molecular signature shared by genes that confer disease risk through haploinsufficiency, a condition in which losing one functional copy of a gene causes a drop in the expression of its protein large enough to impair normal function.

Yufeng Shen
Yufeng Shen's lab is interested in developing better computational methods for identifying rare genetic variants that increase disease risk.

On the surface, birth defects and cancer might not seem to have much in common. For some time, however, scientists have observed increased cancer risk among patients with certain developmental syndromes. One well-known example is seen in children with Noonan syndrome, who have an eightfold increased risk of developing leukemia. Recently, researchers studying the genetics of autism also observed mutations in PTEN, an important tumor suppressor gene. Although such findings have been largely isolated and anecdotal, they raise the tantalizing question of whether cancer and developmental disorders might be fundamentally linked.

According to a paper recently published in the journal Human Mutation, many of these similarities might not be just coincidental, but the result of shared genetic mutations. The study, led by Yufeng Shen, an Assistant Professor in the Columbia University Departments of Systems Biology and Biomedical Informatics, together with Wendy Chung, Kennedy Family Associate Professor of Pediatrics at Columbia University Medical Center, found that cancer-driving genes also make up more than a third of the risk genes for developmental disorders. Moreover, many of these genes appear to function through similar modes of action. The scientists suggest that this could make tumors “natural laboratories” for pinpointing and predicting the damaging effects of rare genetic alterations that cause developmental disorders.

“In comparison with cancer, there are relatively few patients with developmental disorders,” Shen explains. “For geneticists, this makes it hard to identify the risk genes solely based on statistical evidence of mutations from these patients. This study indicates that we should be able to use what we learn from cancer genetics, where much more data are available, to help in the interpretation of genetic data in developmental disorders.”

Factors affecting protein activity
Following gene transcription and translation, a protein can undergo a variety of modifications that affect its activity. By analyzing downstream gene expression patterns in single tumors, VIPER can account for these changes to identify proteins that are critical to cancer cell survival.

In a paper just published in Nature Genetics, the laboratory of Andrea Califano introduces what it describes as the first method capable of analyzing a single tumor biopsy to systematically identify proteins that drive cancerous activity in individual patients. Based on knowledge gained by modeling networks of molecular interactions in the cell, their computational algorithm, called VIPER (Virtual Inference of Protein activity by Enriched Regulon analysis), offers a unique new strategy for understanding how cancer cells survive and for identifying personalized cancer therapeutics.
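The core idea, inferring a protein's activity from the expression of the genes it regulates rather than from its own expression, can be illustrated with a deliberately simplified sketch. This is not the VIPER algorithm, which relies on enrichment analysis of regulons; here a plain signed average of target z-scores stands in for it, and the gene names, regulon, and expression values are all hypothetical.

```python
from statistics import mean


def activity_score(regulon, expression_z):
    """Signed mean of target z-scores for one regulator.

    regulon maps each target gene to +1 (activated by the regulator)
    or -1 (repressed by it). Targets missing from the profile are skipped.
    """
    signed = [sign * expression_z[gene]
              for gene, sign in regulon.items()
              if gene in expression_z]
    if not signed:
        raise ValueError("no regulon targets found in expression profile")
    return mean(signed)


# Hypothetical regulon for one regulator and one tumor's expression z-scores.
regulon_tf1 = {"GENE_A": +1, "GENE_B": +1, "GENE_C": -1}
tumor_z = {"GENE_A": 2.0, "GENE_B": 1.0, "GENE_C": -1.5, "GENE_D": 0.2}

score = activity_score(regulon_tf1, tumor_z)
print(f"inferred activity of TF1: {score:.2f}")
```

Activated targets that are up and repressed targets that are down both push the score higher, so a regulator can look highly active in a single sample even when its own transcript level is unremarkable.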

Developed by Mariano Alvarez as a research scientist in the Califano laboratory, VIPER has become one of the cornerstones of Columbia University’s precision medicine initiative. Its effectiveness in cancer diagnosis and treatment planning is currently being tested in a series of N-of-1 clinical trials, which analyze the unique molecular characteristics of individual patients’ tumors to identify drugs and drug combinations that will be most effective for them. If successful, it could soon become an important component of cancer care at Columbia University Medical Center.

According to Dr. Califano, “VIPER makes it possible to find actionable proteins in 100% of cancer patients, independent of their genetic mutations. It also enables us to track tumors as they progress or relapse to determine the most appropriate therapeutic approach at different points in the evolution of disease. So far, this method is looking extremely promising, and we are excited about its potential benefits in finding novel therapeutic strategies to treat cancer patients.”

cQTLs modify TF binding

Cofactors work with transcription factors (TFs) to enable efficient transcription of a TF's target gene. The Bussemaker Lab showed that genetic alterations in the cofactor gene (cQTLs) change the nature of this interaction, affecting the connectivity between the TF and its target gene. This, combined with other factors called aQTLs that affect the availability of the TF in the nucleus, can lead to downstream changes in gene expression.

When different people receive the same drug, they often respond to it in different ways — what is highly effective in one patient can often have no benefit or even cause dangerous side effects in another. From the perspective of systems biology, this is because variants in a person’s genetic code lead to differences in the networks of genes, RNA, transcription factors (TFs), and other proteins that implement the drug’s effects inside the cell. These multilayered networks are much too complex to observe directly, and so systems biologists have been developing computational methods to infer how subtle differences in the genome sequence produce these effects. Ultimately, the hope is that this knowledge could improve scientists’ ability to identify drugs that would be most effective in specific patients, an approach called precision medicine.

In a paper published in the Proceedings of the National Academy of Sciences, a team of Columbia University researchers led by Harmen Bussemaker proposes a novel approach for discovering some critical components of this molecular machinery. Using statistical methods to analyze biological data in a new way, the researchers identified genetic alterations they call connectivity quantitative trait loci (cQTLs), a class of variants in transcription cofactors that affect the connections between specific TFs and their gene targets.
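To make the notion of "connectivity" concrete, one can think of it as the slope relating a TF's activity to its target's expression, and of a cQTL as a cofactor variant that changes that slope. The toy sketch below fits the slope separately for each cofactor genotype; the data, the genotype labels, and the use of simple least squares are assumptions for illustration, not the statistical method used in the paper.

```python
def slope(x, y):
    """Ordinary least-squares slope of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx


def connectivity_by_genotype(samples):
    """Group samples by cofactor genotype and fit one TF-to-target slope per group."""
    groups = {}
    for tf_activity, target_expr, genotype in samples:
        xs, ys = groups.setdefault(genotype, ([], []))
        xs.append(tf_activity)
        ys.append(target_expr)
    return {g: slope(xs, ys) for g, (xs, ys) in groups.items()}


# Hypothetical samples: (TF activity, target expression, cofactor genotype).
samples = [
    (0.0, 0.0, "ref"), (1.0, 2.0, "ref"), (2.0, 4.0, "ref"),
    (0.0, 0.0, "alt"), (1.0, 0.5, "alt"), (2.0, 1.0, "alt"),
]

connectivity = connectivity_by_genotype(samples)
print(connectivity)
```

In this invented data set the same change in TF activity produces a much smaller change in target expression when the cofactor carries the "alt" allele, which is the kind of difference a cQTL analysis looks for.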

Nicholas Tatonetti
Nicholas Tatonetti is an assistant professor in the Department of Biomedical Informatics and Department of Systems Biology.

A team of Columbia University Medical Center (CUMC) scientists led by Nicholas Tatonetti has identified several drug combinations that may lead to a potentially fatal type of heart arrhythmia known as torsades de pointes (TdP). The key to the discovery was a new bioinformatics pipeline called DIPULSE (Drug Interaction Prediction Using Latent Signals and EHRs), which builds on previous methods Tatonetti developed for identifying drug-drug interactions (DDIs) in observational data sets. The results are reported in a new paper in the journal Drug Safety and are covered in a detailed multimedia feature published by the Chicago Tribune.

The algorithm mined data contained in the US FDA Adverse Event Reporting System (FAERS) to identify latent signals of DDIs that cause QT interval prolongation, a disturbance in the electrical cycle that coordinates the heartbeat. It then validated these predictions by looking for their signatures in electrocardiogram results contained in a large collection of electronic health records at Columbia. Interestingly, the drugs the investigators identified do not cause the condition on their own, but only when taken in specific combinations.
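The first step, looking for an adverse event that shows up when two drugs are reported together but not when either is reported alone, can be sketched with a basic reporting-rate comparison. This is a stand-in technique, not DIPULSE's latent-signal method, and the drug names, event labels, and reports below are fabricated toy data rather than FAERS records.

```python
def event_rate(reports, event, include, exclude=frozenset()):
    """Fraction of reports that list `event`, among reports that mention
    every drug in `include` and none of the drugs in `exclude`."""
    matching = [events for drugs, events in reports
                if include <= drugs and not (exclude & drugs)]
    if not matching:
        return 0.0
    return sum(event in events for events in matching) / len(matching)


# Toy adverse-event reports: (drugs listed, events listed). All invented.
reports = [
    ({"drugX", "drugY"}, {"qt_prolongation"}),
    ({"drugX", "drugY"}, {"qt_prolongation"}),
    ({"drugX", "drugY"}, {"nausea"}),
    ({"drugX"}, {"nausea"}),
    ({"drugX"}, {"qt_prolongation"}),
    ({"drugX"}, {"headache"}),
    ({"drugX"}, {"rash"}),
    ({"drugY"}, {"headache"}),
    ({"drugY"}, {"nausea"}),
]

EVENT = "qt_prolongation"
pair_rate = event_rate(reports, EVENT, {"drugX", "drugY"})
x_alone = event_rate(reports, EVENT, {"drugX"}, exclude={"drugY"})
y_alone = event_rate(reports, EVENT, {"drugY"}, exclude={"drugX"})

# Flag the pair when the combination's rate exceeds both single-drug rates.
is_candidate = pair_rate > max(x_alone, y_alone)
print(pair_rate, x_alone, y_alone, is_candidate)
```

A flagged pair is only a hypothesis at this stage, which is why the investigators went on to check for the corresponding signature in electrocardiogram data from patient records.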

Previously, no reliable methods existed for identifying these kinds of combinations. Although the findings are preliminary, the retrospective confirmation of many of DIPULSE’s predictions in actual patient data suggests that the method is effective, and the investigators plan to test these predictions experimentally in the near future.

DeMAND graphical abstract
By analyzing drug-induced changes in disease-specific patterns of gene expression, a new algorithm called DeMAND identifies the genes involved in implementing a drug's effects. The method could help predict undesirable off-target interactions, suggest ways of regulating a drug's activity, and identify novel therapeutic uses for FDA-approved drugs, three critical challenges in drug development.

Researchers in the Columbia University Department of Systems Biology have developed an efficient and accurate method for determining a drug’s mechanism of action — the cellular machinery through which it produces its pharmacological effect. Considering that most drugs, including widely used ones, act in ways that are not completely understood at the molecular level, this accomplishment addresses a key challenge to drug development. The new approach also holds great potential for improving drugs’ effectiveness, identifying better combination therapies, and avoiding dangerous drug-induced side effects.

According to Andrea Califano, the Clyde and Helen Wu Professor of Chemical Systems Biology and co-senior author on the study, “This new methodology makes it possible for the first time to generate a genome-wide footprint of the proteins that are responsible for implementing or modulating the activity of a drug. The accuracy of the method has been the most surprising result, with up to 80% of the identified proteins confirmed by experimental assays.”

Expanding the landscape of breast cancer drivers

In comparison with a previous study (Stephens et al., 2012, shown in gray), a new computational approach that focuses on somatic copy number mutations increased the number of known driver mutations in breast tumors to a median of five for each tumor. The findings could raise the likelihood of finding actionable targets in individual patients with breast cancer.

For many years, researchers have known that somatic copy number alterations (SCNAs), which are insertions, deletions, duplications, and transpositions of sections of DNA that are not inherited but occur after birth, play important roles in causing many types of cancer. Indeed, most recurrent drivers of epithelial tumors are copy number alterations, with some found in up to 40% of patients with specific tumor types. However, because SCNAs occur when entire sections of chromosomes become damaged, biologists have had difficulty developing effective methods for distinguishing genes within SCNAs that actually drive cancer from those genes that might lie near a driver but do not themselves cause disease.

Helios nearly doubled the number of high-confidence predictions of breast cancer drivers.

In a new paper published in Cell, researchers in the laboratories of Dana Pe’er (Columbia University Departments of Systems Biology and Biological Sciences) and Jose Silva (Icahn School of Medicine at Mount Sinai) report on a new computational algorithm that promises to dramatically improve researchers’ ability to identify cancer-driving genes within potentially large SCNAs. The algorithm, called Helios, was used to analyze a combination of genomic data and information generated by functional RNAi screens, enabling the researchers to predict several dozen new SCNA drivers of breast cancer. In follow-up in vitro experimental studies, they tested 12 of these predictions, 10 of which were validated in the laboratory. Their findings nearly double the number of breast cancer drivers, providing many new opportunities for personalized treatments for breast cancer. Their methodology is general and could also be used to locate disease-causing SCNAs in other cancer types.

Leading this effort was Felix Sanchez-Garcia, a recent PhD graduate from the Pe’er Lab and a first author on the paper. The story of how this breakthrough came about illuminates how the interdisciplinary research and education that take place at the Department of Systems Biology can address important challenges facing biological and biomedical research.

DIGGIT identifies mutations upstream of master regulators.

A new algorithm called DIGGIT identifies mutations that lie upstream of crucial bottlenecks within regulatory networks. These bottlenecks, called master regulators, integrate these mutations and become essential functional drivers of diseases such as cancer.

Although genome-wide association studies have made it possible to identify mutations that are linked to diseases such as cancer, determining which mutations actually drive disease and the mechanics of how they do so has been an ongoing challenge. In a paper just published in Cell, researchers in the lab of Andrea Califano describe a new computational approach that may help address this problem.

geWorkbench screenshot

A new version of geWorkbench lets researchers access a range of powerful, integrated bioinformatics tools using a standard web browser. Here, an ARACNe-generated gene regulatory network is displayed using the Cytoscape Web plugin.

Since the center’s creation in 2005, investigators in Columbia University’s Center for the Multiscale Analysis of Genomic and Cellular Networks (MAGNet) have developed a large number of computational tools for studying biological systems from the perspectives of structural biology and systems biology. To consolidate and disseminate these tools to the wider research community, MAGNet developed geWorkbench (genomics Workbench), a free, open-source bioinformatics application that gathers all of the Center’s software and databases into one integrated software platform. These include applications for the analysis of cellular regulatory networks, protein structure, DNA and protein sequences, gene expression, and other kinds of biological data.

Initially, geWorkbench was made available as a software package that users could install and run on their local computers. Now, in a major upgrade, MAGNet has released a web-based version that makes these tools accessible through a browser interface.

Attractor Metagenes - DREAM7

Team Attractor Metagenes receives its award at the DREAM7 Conference. Gustavo Stolovitzky (IBM Research), Adam Margolin (Sage Bionetworks), Dimitris Anastassiou, Tai-Hsien Ou Yang, Wei-Yi Cheng, Stephen Friend (Sage Bionetworks), Erhan Bilal (IBM Research).

The team of Professor Dimitris Anastassiou and graduate students Wei-Yi Cheng and Tai-Hsien Ou Yang has been recognized as the best performer in the Sage Bionetworks – DREAM Breast Cancer Prognosis Challenge. This challenge, one of four organized as part of the seventh Dialogue for Reverse Engineering Assessments and Methods (DREAM7), was designed to assess the ability of participants’ computational models to predict breast cancer survival using patient clinical information and molecular profiling data. As a reward for this accomplishment, the journal Science Translational Medicine has just published a paper from the Anastassiou lab describing their model. It is also the journal’s cover theme for this issue, which includes a second article describing the Challenge.

The Columbia University researchers based their DREAM entry on previous work to identify what they call “attractor metagenes,” sets of strongly co-expressed genes that they have found to be present with very little variation in many cancer types. Moreover, these metagenes appear to be associated with specific attributes of cancer including chromosomal instability, epithelial-mesenchymal transition, and a lymphocyte-specific immune response. As Wei-Yi Cheng comments in Sage Synapse, “We like to think of these three main attractor metagenes as representing three key ‘bioinformatic hallmarks of cancer,’ reflecting the ability of cancer cells to divide uncontrollably and invade surrounding tissues, and the ability of the organism to recruit a particular type of immune response to fight the disease.”

Genes forming cluster I in the context of cellular signaling pathways

Proteins encoded by cluster genes are shown in yellow, and those corresponding to other relevant genes that were present in the input data but not selected by the NETBAG+ algorithm are shown in cyan.

In a new paper published in the journal Nature Neuroscience, Columbia University researchers report that many of the genes that are mutated in schizophrenia are organized into two main networks. Surprisingly, the study also found that a genetic network that leads to schizophrenia is very similar to a network that has been linked to autism. 

Using a computational approach called NETBAG+, Dennis Vitkup and colleagues performed network-based analyses of rare de novo mutations to map the gene networks that lead to schizophrenia. When they compared one schizophrenia network to an autism network described in a study Vitkup published last year, they discovered that different copy number variants in the same genes can lead to either schizophrenia or autism. The overlapping genes are important for axon guidance, synapse function, and cell migration, all processes within the brain that have been shown to play a role in the development of these two diseases. These gene networks are particularly active during prenatal development, suggesting that the foundations for schizophrenia and autism are laid very early in life.

GLOBUS algorithm

An overview of the GLOBUS algorithm.

A Columbia University team led by professor Dennis Vitkup and PhD student German Plata of the Center for Computational Biology and Bioinformatics has developed a novel genome-wide framework for making probabilistic annotations of metabolic networks. Their approach, called Global Biochemical Reconstruction Using Sampling (GLOBUS), combines information about sequence homology with context-specific information including phylogeny, gene clustering, and mRNA co-expression to predict the probability of biochemical interactions between specific genes. By integrating these different categories of information using a principled probabilistic framework, this approach overcomes limitations of considering only one functional category or one gene at a time, providing a global and accurate prediction of metabolic networks.

In a paper published in Nature Chemical Biology, the scientists write, "Currently, most publicly available biochemical databases do not provide quantitative probabilities or confidence measures for existing annotations. This makes it hard for the users of these valuable resources to distinguish between confident assignments and mere guesses... The GLOBUS approach, which is based on statistical sampling of possible biochemical assignments, provides a principled framework for such global probabilistic annotations. The method assigns annotation probabilities to each gene and suggests likely alternative functions."
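The idea of estimating annotation probabilities by sampling possible assignments can be shown with a toy Gibbs sampler. Everything in this sketch is an assumption made for illustration: the two genes, the reaction names, the homology scores, the single "same pathway" context rule, and the bonus factor. It is a minimal sketch of sampling-based probabilistic annotation in general, not the GLOBUS implementation.

```python
import random
from collections import Counter

# Hypothetical homology scores: on sequence alone, gene1 is ambiguous.
homology = {
    "gene1": {"rxnA": 0.5, "rxnB": 0.5},
    "gene2": {"rxnC": 0.9, "rxnD": 0.1},
}
# Hypothetical context: rxnA and rxnC belong to the same pathway, so assigning
# them to the two neighbouring genes together is rewarded.
same_pathway = {frozenset(("rxnA", "rxnC"))}
CONTEXT_BONUS = 4.0  # assumed multiplicative weight for consistent context


def weight(gene, func, other_func):
    """Unnormalized weight of assigning `func` to `gene` given the other gene's function."""
    w = homology[gene][func]
    if frozenset((func, other_func)) in same_pathway:
        w *= CONTEXT_BONUS
    return w


def gibbs_marginals(n_sweeps=20000, seed=0):
    """Estimate per-gene annotation probabilities by Gibbs sampling."""
    rng = random.Random(seed)
    genes = list(homology)
    state = {g: next(iter(homology[g])) for g in genes}
    counts = {g: Counter() for g in genes}
    for _ in range(n_sweeps):
        for g in genes:
            other = genes[1] if g == genes[0] else genes[0]
            funcs = list(homology[g])
            weights = [weight(g, f, state[other]) for f in funcs]
            state[g] = rng.choices(funcs, weights=weights)[0]
        for g in genes:
            counts[g][state[g]] += 1
    return {g: {f: counts[g][f] / n_sweeps for f in homology[g]} for g in genes}


marginals = gibbs_marginals()
print(marginals)
```

Here gene1 cannot be resolved from homology alone, but sampling joint assignments lets the context shared with gene2 shift its probability toward rxnA while still reporting rxnB as a plausible alternative, mirroring the quoted goal of assigning probabilities and suggesting likely alternative functions.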

Gene clusters found using NETBAG analysis of de novo CNV regions observed in autistic individuals.

A) The highest scoring cluster obtained using the search procedure with up to one gene per CNV region. B) The cluster obtained using the search with up to two genes per region.

Identifying the complex molecular networks that underlie common human phenotypes is a major challenge of modern genetics. A new network-based method developed in the lab of Dennis Vitkup was used to identify a large biological network of genes affected by rare de novo copy number variations (CNVs) in autism. The genes forming the network are primarily related to synapse development, axon targeting, and neuron motility. The identified network is strongly related to genes previously implicated in autism and intellectual disability phenotypes.

Systematic characterization of cancer genomes has revealed a staggering number of diverse alterations that differ among individuals, and their functional importance and physiological impact remain poorly defined. To identify which genetic alterations are functional, the lab of Dr. Dana Pe’er has developed a novel Bayesian probabilistic algorithm, CONEXIC, which integrates copy number and gene expression data to identify tumor-specific “driver” aberrations, as well as the cellular processes they affect.

In work published in the journal Cell, the new method was applied to data from melanoma patients, identifying a list of 64 putative “drivers” and the core processes affected by them. This list includes many known driver genes (e.g., MITF), which CONEXIC correctly identified and paired with their known targets. It also includes novel “driver” candidates such as Rab27a and TBC1D16, both involved in protein trafficking. Silencing these genes with shRNA in short-term tumor-derived cultures showed that they are tumor dependencies and validated their computationally predicted role in melanoma (including target identification), suggesting that protein trafficking may play an important role in this malignancy.