Computational Biology


Distribution of marker expression across development

A new algorithm called Wanderlust uses single-cell measurements to detect how marker expression changes across development.

In a new paper published in the journal Cell, a team of researchers led by Dana Pe’er at Columbia University and Garry Nolan at Stanford University describes a powerful new method for mapping cellular development at the single cell level. By combining emerging technologies for studying single cells with a new, advanced computational algorithm, they have designed a novel approach for mapping development and created the most comprehensive map ever made of human B cell development. Their approach will greatly improve researchers’ ability to investigate development in cells of all types, make it possible to identify rare aberrations in development that lead to disease, and ultimately help to guide the next generation of research in regenerative medicine.

Pointing out why being able to generate these maps is an important advance, Dr. Pe’er, an associate professor in the Columbia University Department of Systems Biology and Department of Biological Sciences, explains, “There are so many diseases that result from malfunctions in the molecular programs that control the development of our cell repertoire and so many rare, yet important, regulatory cell types that we have yet to discover. We can only truly understand what goes wrong in these diseases if we have a complete map of the progression in normal development. Such maps will also act as a compass for regenerative medicine, because it’s very difficult to grow something if you don’t know how it develops in nature. For the first time, our method makes it possible to build a high-resolution map, at the single cell level, that can guide these kinds of research.”

Chris Wiggins

In a “Most Creative People” feature, Fast Company magazine recently interviewed associate professor Chris Wiggins, a faculty member of the Department of Systems Biology and Center for Computational Biology and Bioinformatics, about his new appointment at one of the world’s most respected outlets for digital journalism. In this role, he will lead the development of a machine learning team that will help the New York Times to better understand how its audience is using and navigating its content.

In the interview, Dr. Wiggins explains why machine learning is becoming increasingly important in the age of big data and discusses the shared challenges that the natural sciences and the media now face.

Tuuli Lappalainen

Tuuli Lappalainen has joined Columbia University as an assistant professor in the Department of Systems Biology. Dr. Lappalainen is a specialist in the analysis of RNA sequencing data, with research interests including functional variation in the human genome, population genetic background of variation in the human genome, and interpretation of genome function.

Dr. Lappalainen joins the Department of Systems Biology in co-appointment with the New York Genome Center (NYGC), where she will also serve as a Junior Investigator and Core Member. Based in lower Manhattan, NYGC is a consortium made up primarily of New York-area institutions that is designed to translate promising genomics-based research into new strategies for treating, preventing, and managing disease. This co-appointment with Columbia University — an institutional founding member of the NYGC — will enhance collaboration between the two institutions. (Read an interview with Dr. Lappalainen at the New York Genome Center website.)

Dr. Lappalainen earned her PhD in genetics at the University of Helsinki, Finland, and held appointments as a postdoctoral researcher at the University of Geneva Medical School, Switzerland, and at the Stanford University School of Medicine. She is the chair of the analysis group for the Genetic European Variation in Health and Disease (Geuvadis) Consortium’s RNA sequencing project, a member of the analysis group for the National Institutes of Health’s Genotype-Tissue Expression (GTEx) project, and a member of the analysis and functional interpretation groups for the 1000 Genomes Project.

Models of Evolution

In Charles Darwin's seminal treatise On the Origin of Species there is only one image, which visualizes evolution as following a branching pattern in which species diverge into lineages over time like the limbs on a tree. With the increasing availability of genomic data, scientists have attempted to understand evolution at the molecular level by using a similar phylogenetic paradigm, but as Department of Systems Biology Assistant Professor Raul Rabadan, MD/PhD student Joseph Chan, and Stanford University mathematician Gunnar Carlsson point out in a new paper published in the Proceedings of the National Academy of Sciences, it has a number of shortcomings when applied in this way. By developing a new mathematical approach based on a method called persistent homology, the researchers produced several insights into viral evolution that could not be found using other means.

Researchers in the Columbia University Department of Systems Biology and Herbert Irving Comprehensive Cancer Center have determined that measuring the expression levels of three genes associated with aging can be used to predict the aggressiveness of seemingly low-risk prostate cancer. Use of this three-gene biomarker, in conjunction with existing cancer-staging tests, could help physicians better determine which men with early prostate cancer can be safely followed with “active surveillance” and spared the risks of prostate removal or other invasive treatment. The findings were published today in the online edition of Science Translational Medicine.

More than 200,000 new cases of prostate cancer are diagnosed each year in the U.S. “Most of these cancers are slow growing and will remain so, and thus they do not require treatment,” said study leader Cory Abate-Shen, Michael and Stella Chernow Professor of Urological Oncology at Columbia University Medical Center (CUMC). “The problem is that, with existing tests, we cannot identify the small percentage of slow-growing tumors that will eventually become aggressive and spread beyond the prostate. The three-gene biomarker could take much of the guesswork out of the diagnostic process and ensure that patients are neither overtreated nor undertreated.”

Rabadan, Nature Genetics

An analysis of all gene mutations in nearly 140 brain tumors has uncovered most of the genes responsible for driving glioblastoma. The analysis found 18 new driver genes (labeled red) never before implicated in glioblastoma, and correctly identified the 15 previously known driver genes (labeled blue). The graphs show mutated genes that are commonly found in varying numbers in glioblastoma (left), that frequently contain insertions (middle), and that frequently contain deletions (right). Genes represented by blue dots in the graphs were statistically most likely to be driver genes.

A team of Columbia University Medical Center researchers has identified 18 new genes responsible for driving glioblastoma multiforme, the most common—and most aggressive—form of brain cancer in adults. The study was published August 5, 2013, in the journal Nature Genetics.

The Columbia team used a combination of high-throughput DNA sequencing and a new method of statistical analysis developed by co-author Raul Rabadan, an assistant professor in the Department of Systems Biology, to generate a short list of candidate gene mutations that were highly likely to drive cancer, as opposed to mutations that have no effect.

Considering these results along with a previous study this group conducted, Rabadan and collaborators Antonio Iavarone and Anna Lasorella point out that approximately 15% of glioblastomas could now be targeted with drugs that have already been approved by the FDA. As Lasorella remarks in an article for the CUMC Newsroom, “There is no reason why these patients couldn’t receive these drugs now in clinical trials.”

Searches for hyperglycemia-related terms

Percentage of users in each of the three user groups searching for hyperglycemia-related terms, computed per week over 12 months of search log data. Background refers to the fraction of all searchers who search for hyperglycemia-related symptoms or terminology independent of the presence of the drugs in the users’ search histories.

Although the US Food and Drug Administration and other agencies collect and analyze reports on adverse drug effects, alerts for single drugs and drug-drug interactions are often delayed due to the time it takes to accumulate evidence. Columbia University Department of Systems Biology faculty member Nicholas Tatonetti, in collaboration with investigators at Stanford University and Microsoft Research, hypothesized that Internet users can provide early clues of adverse drug events as they seek information on the web concerning symptoms they are experiencing. A new paper explains their results.

As a test, Tatonetti and colleagues asked whether it would be possible to detect evidence of an interaction between the antidepressant paroxetine and the anti-cholesterol drug pravastatin by analyzing web search logs from 2010. As a postdoc at Stanford, Tatonetti and colleagues had used a data mining algorithm to analyze FDA adverse event reporting records and retroactively found this combination to be associated with hyperglycemia (high blood sugar) in some patients. In this new project, the researchers analyzed the search logs of millions of Internet users from a period before the above association was identified to see how often they entered search terms related to hyperglycemia and to one or both medications under investigation. (Participants in this study opted in by voluntarily installing a web browser extension that tracked their activity anonymously.)
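The group comparison described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the study's pipeline: the term list, the substring matching, the group labels, and the function name `symptom_rates` are all hypothetical, and the real analysis worked on far larger logs.

```python
# Hypothetical symptom vocabulary; the study's actual term list is not given here.
HYPERGLYCEMIA_TERMS = {"high blood sugar", "hyperglycemia", "frequent urination"}
DRUGS = ("paroxetine", "pravastatin")


def symptom_rates(search_histories):
    """Fraction of users in each group who searched a hyperglycemia-related term.

    search_histories maps a user id to a list of query strings.
    Assumed groups: users who searched for one drug only, for both drugs,
    and a 'background' group containing every user.
    """
    groups = ("paroxetine only", "pravastatin only", "both", "background")
    counts = {g: [0, 0] for g in groups}  # [users, users with a symptom query]
    for queries in search_histories.values():
        text = [q.lower() for q in queries]
        drugs_seen = {d for d in DRUGS if any(d in q for q in text)}
        hit = any(t in q for q in text for t in HYPERGLYCEMIA_TERMS)
        member_of = ["background"]
        if len(drugs_seen) == 2:
            member_of.append("both")
        elif len(drugs_seen) == 1:
            member_of.append(next(iter(drugs_seen)) + " only")
        for g in member_of:
            counts[g][0] += 1
            counts[g][1] += int(hit)
    return {g: (h / n if n else 0.0) for g, (n, h) in counts.items()}
```

A markedly higher rate in the "both" group than in either single-drug group or the background would be the kind of signal the researchers were looking for; a real analysis would also need to test whether the difference is statistically significant.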


viSNE reveals the progression of cancer in a sample of cells taken from a patient with acute myeloid leukemia. Cells are colored according to intensity of expression of the indicated cell markers, enabling the comparison of expression patterns before and after relapse. For example, Flt3 is expressed primarily in the diagnosis sample, while CD34 emerges in the relapse sample.

Researchers in the Columbia Initiative in Systems Biology have developed a computational method that enables scientists to visualize and interpret high-dimensional data produced by single-cell measurement technologies such as mass cytometry. The method, called viSNE (visual interactive Stochastic Neighbor Embedding), has just been published in the online edition of Nature Biotechnology. It has particular relevance to cancer research and therapeutics. As Columbia University Medical Center reports:

Researchers now understand that cancer within an individual can harbor subpopulations of cells with different molecular characteristics. Groups of cells may behave differently from one another, including in how they respond to treatment. The ability to study single cells, as well as to identify and characterize subpopulations of cancerous cells within an individual, could lead to more precise methods of diagnosis and treatment.

“Our method not only will allow scientists to explore the heterogeneity of cancer cells and to characterize drug-resistant cancer cells, but also will allow physicians to track tumor progression, identify drug-resistant cancer cells, and detect minute quantities of cancer cells that increase the risk of relapse,” said co-senior author Dana Pe’er, associate professor of biological sciences and systems biology at Columbia.

Barry Honig

When Columbia University founded the Center for Multiscale Analysis of Genomic and Cellular Networks (MAGNet) in 2005, one of its goals was to integrate the methods of structural biology with those of systems biology. Considering protein structure within the context of computational models of cellular networks, researchers hoped, would not only improve the predictive value of their models by giving another layer of evidence, but also lead to new types of predictions that could not be made using other methods.

In a new paper published in the journal Nature, Barry Honig, Andrea Califano, and other members of the Columbia Initiative in Systems Biology, including first authors Qiangfeng Cliff Zhang and Donald Petrey, report that this goal has now been realized. For the first time, the researchers have shown that information about protein structure can be used to make predictions about protein-protein interactions on a genome-wide scale. Their approach capitalizes on innovative techniques in computational structural biology that the Honig lab has developed over the last 15 years, culminating in the development of a new algorithm called Predicting Protein-Protein Interactions (PrePPI). In this interview, Honig describes the evolution of this new approach, and what it could mean for the future of systems biology.


Tumor-induced mRNA expression changes for individual biochemical reactions in central metabolism. 

A large study analyzing gene expression data from 22 cancer types has identified a broad spectrum of metabolic expression changes associated with cancer. The analysis, led by Dennis Vitkup and first author Jie Hu, a postdoctoral research scientist in the Vitkup lab, together with a multi-institutional group of collaborators, also identified hundreds of potential drug targets that could cut off a tumor’s fuel supply or interfere with its ability to synthesize essential elements necessary for tumor growth. The study has just been published in the online edition of Nature Biotechnology.

As Columbia University Medical Center reports:

The results should ramp up research into drugs that interfere with cancer metabolism, a field that dominated cancer research in the early 20th century and has recently undergone a renaissance.

Attractor Metagenes - DREAM7

Team Attractor Metagenes receives its award at the DREAM7 Conference. Gustavo Stolovitzky (IBM Research), Adam Margolin (Sage Bionetworks), Dimitris Anastassiou, Tai-Hsien Ou Yang, Wei-Yi Cheng, Stephen Friend (Sage Bionetworks), Erhan Bilal (IBM Research).

The team of Professor Dimitris Anastassiou and graduate students Wei-Yi Cheng and Tai-Hsien Ou Yang has been recognized as the best performer in the Sage Bionetworks – DREAM Breast Cancer Prognosis Challenge. This challenge, one of four organized as part of the seventh Dialogue for Reverse Engineering Assessments and Methods (DREAM7), was designed to assess the ability of participants’ computational models to predict breast cancer survival using patient clinical information and molecular profiling data. As a reward for this accomplishment, the journal Science Translational Medicine has just published a paper from the Anastassiou lab describing their model. It is also the journal’s cover theme for this issue, which includes a second article describing the Challenge.

The Columbia University researchers based their DREAM entry on previous work to identify what they call “attractor metagenes,” sets of strongly co-expressed genes that they have found to be present with very little variation in many cancer types. Moreover, these metagenes appear to be associated with specific attributes of cancer including chromosomal instability, epithelial-mesenchymal transition, and a lymphocyte-specific immune response. As Wei-Yi Cheng comments in Sage Synapse, “We like to think of these three main attractor metagenes as representing three key ‘bioinformatic hallmarks of cancer,’ reflecting the ability of cancer cells to divide uncontrollably and invade surrounding tissues, and the ability of the organism to recruit a particular type of immune response to fight the disease.”

Genes forming cluster I in the context of cellular signaling pathways. Proteins encoded by cluster genes are shown in yellow, and those corresponding to other relevant genes that were present in the input data but not selected by the NETBAG+ algorithm are shown in cyan. 

In a new paper published in the journal Nature Neuroscience, Columbia University researchers report that many of the genes that are mutated in schizophrenia are organized into two main networks. Surprisingly, the study also found that a genetic network that leads to schizophrenia is very similar to a network that has been linked to autism. 

Using a computational approach called NETBAG+, Dennis Vitkup and colleagues performed network-based analyses of rare de novo mutations to map the gene networks that lead to schizophrenia. When they compared one schizophrenia network to an autism network described in a study Vitkup published last year, they discovered that different copy number variants in the same genes can lead to either schizophrenia or autism. The overlapping genes are important for axon guidance, synapse function, and cell migration, processes within the brain that have been shown to play a role in the development of these two diseases. These gene networks are particularly active during prenatal development, suggesting that the foundations for schizophrenia and autism are laid very early in life.

Itsik Pe'er

Itsik Pe'er, an Associate Professor in the Department of Computer Science and member of the Columbia Initiative in Systems Biology, is using mathematics and computer analytics to identify the genetic makeup of the founding Ashkenazi Jews. By analyzing the full DNA sequences of hundreds of their descendants in the New York City area and comparing them to reference sets of non-Ashkenazi DNA, his goal is to identify Ashkenazi-specific genetic mutations associated with diseases such as Tay-Sachs, Crohn's, and Parkinson's disease. As a new article in Columbia News explains:

By examining similarities in DNA segments shared by large numbers of related individuals, his lab developed statistical models that allow him to make generalizations about entire populations. The mix of genes that every child inherits from each parent travels in long sequences of code that remain together and are remarkably consistent from one generation to the next.

"The size of the gene chunks gets smaller with each generation, but they diminish at a consistent and predictable rate. As a result, Pe’er can use his models to determine distant relationships shared by two individuals by measuring the length of their common DNA segments."

Read the complete article here.
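The idea that shared DNA segments shrink at a predictable rate can be illustrated with a common textbook approximation, which is an assumption here and not necessarily the model used in Pe'er's lab: a segment inherited by two people from a common ancestor who lived g generations ago has passed through roughly 2g meioses, and its expected length is about 100 / (2g) centimorgans. The function names below are hypothetical.

```python
def expected_segment_cm(generations):
    """Expected shared-segment length (cM) for a common ancestor g generations back.

    Uses the approximation mean length = 100 / (2 * g); this is a simplification
    that ignores detection thresholds and population history.
    """
    if generations <= 0:
        raise ValueError("generations must be positive")
    return 100.0 / (2 * generations)


def estimate_generations(mean_segment_cm):
    """Invert the approximation: rough generations to the common ancestor."""
    if mean_segment_cm <= 0:
        raise ValueError("mean_segment_cm must be positive")
    return 50.0 / mean_segment_cm
```

Under this approximation, a common ancestor five generations back implies shared segments averaging about 10 cM, and, read in reverse, an average shared segment of 10 cM suggests a common ancestor roughly five generations ago.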

GLOBUS algorithm

An overview of the GLOBUS algorithm.

A Columbia University team led by professor Dennis Vitkup and PhD student German Plata of the Center for Computational Biology and Bioinformatics has developed a novel genome-wide framework for making probabilistic annotations of metabolic networks. Their approach, called Global Biochemical Reconstruction Using Sampling (GLOBUS), combines information about sequence homology with context-specific information including phylogeny, gene clustering, and mRNA co-expression to predict the probability of biochemical interactions between specific genes. By integrating these different categories of information using a principled probabilistic framework, this approach overcomes limitations of considering only one functional category or one gene at a time, providing a global and accurate prediction of metabolic networks.

In a paper published in Nature Chemical Biology, the scientists write, "Currently, most publicly available biochemical databases do not provide quantitative probabilities or confidence measures for existing annotations. This makes it hard for the users of these valuable resources to distinguish between confident assignments and mere guesses... The GLOBUS approach, which is based on statistical sampling of possible biochemical assignments, provides a principled framework for such global probabilistic annotations. The method assigns annotation probabilities to each gene and suggests likely alternative functions."

Transforming activity of FGFR-TACC fusion proteins

Representative microphotographs of hematoxylin and eosin staining of advanced FGFR3-TACC3-shp53–generated tumors show histological features of high-grade glioma.

A new paper published by Columbia University Medical Center researchers in the journal Science has determined that some cases of glioblastoma, the most aggressive form of primary brain cancer, result from the fusion of the genes FGFR and TACC. Raul Rabadan, a co-senior author on the study, led efforts to identify these genes by using quantitative methods to analyze the glioblastoma genome from nine patients, and then compare these results with more than 300 genomes from the Cancer Genome Atlas project.

The collaboration with cancer genomics expert Antonio Iavarone and co-senior author Anna Lasorella found that the protein produced by the FGFR-TACC fusion disrupts the mitotic spindle (the cellular structure that guides mitosis) and causes aneuploidy, an uneven distribution of chromosomes that causes tumorigenesis. The researchers also found that drugs that target this aberration can dramatically slow the growth of tumors in mice, suggesting a potential therapeutic target.

An extensive microRNA-mediated network of RNA-RNA interactions

Genome-wide inference of sponge modulators identified a miR-program mediated post-transcriptional regulatory (mPR) network including ~248,000 interactions.

For decades, scientists have thought that the primary role of messenger RNA (mRNA) is to shuttle information from the DNA to the ribosomes, the sites of protein synthesis. However, new studies now suggest that the mRNA of one gene can control, and be controlled by, the mRNA of other genes via a large pool of microRNA molecules, with dozens to hundreds of genes working together in complex self-regulating sub-networks.

In work published in the journal Cell, Andrea Califano, José Silva, and colleagues analyzed gene expression data in glioblastoma in combination with matched microRNA profiles to uncover a posttranscriptional regulation layer of surprising magnitude, comprising more than 248,000 microRNA (miR)-mediated interactions. These include ∼7,000 genes whose transcripts act as miR “sponges.” When two genes share a set of microRNA regulators, changes in the expression of one gene affect the other. If, for instance, one of those genes is highly expressed, the increase in its mRNA molecules will “sponge up” more of the available microRNAs. As a result, fewer microRNA molecules will be available to bind and repress the other gene’s mRNAs, leading to a corresponding increase in expression.

Although such an effect had been previously elucidated, the range and relevance of this kind of interaction had not been characterized.
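The sponge effect described above can be made concrete with a toy titration model. This is a deliberately simplified sketch, not the method used in the paper: it assumes a fixed pool of microRNA molecules, one microRNA per transcript, and equal binding affinity for the transcripts of both genes. The function name and parameters are hypothetical.

```python
def unrepressed_transcripts(a, b, mirna_pool):
    """Number of gene B transcripts left unbound by microRNAs in a toy model.

    a, b        -- transcript counts of genes A and B, which share microRNAs
    mirna_pool  -- total microRNA molecules available to both genes

    Assumption: microRNAs are distributed over A and B transcripts in
    proportion to their abundance, one microRNA per transcript.
    """
    total = a + b
    if total == 0:
        return 0.0
    bound_total = min(mirna_pool, total)  # cannot bind more transcripts than exist
    bound_b = bound_total * b / total     # B's proportional share of the pool
    return b - bound_b


# Raising A's expression draws microRNAs away from B, so more B escapes repression.
baseline = unrepressed_transcripts(a=100, b=100, mirna_pool=100)
sponged = unrepressed_transcripts(a=300, b=100, mirna_pool=100)
```

In this example the number of unrepressed B transcripts rises from 50 to 75 when A's expression triples, even though nothing about gene B itself has changed.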

Gene clusters found using NETBAG analysis of de novo CNV regions observed in autistic individuals. A) The highest scoring cluster obtained using the search procedure with up to one gene per each CNV region. B) The cluster obtained using the search with up to two genes per region.

Identification of complex molecular networks underlying common human phenotypes is a major challenge of modern genetics. A new network-based method developed at the lab of Dennis Vitkup was used to identify a large biological network of genes affected by rare de novo copy number variations (CNVs) in autism. The genes forming the network are primarily related to synapse development, axon targeting, and neuron motility. The identified network is strongly related to genes previously implicated in autism and intellectual disability phenotypes.

These findings are consistent with the hypothesis that significantly stronger functional perturbations are required to trigger the autistic phenotype in females compared to males. Overall, the analysis of de novo variants supports the hypothesis that perturbed synaptogenesis is at the heart of autism.

Systematic characterization of cancer genomes has revealed a staggering number of diverse alterations that differ among individuals, so that their functional importance and physiological impact remain poorly defined. To identify which genetic alterations are functional, the lab of Dr. Dana Pe’er has developed a novel Bayesian probabilistic algorithm, CONEXIC, that integrates copy number and gene expression data to identify tumor-specific “driver” aberrations, as well as the cellular processes they affect.

In work published in the journal Cell, the new method was applied to data from melanoma patients, identifying a list of 64 putative ‘drivers’ and the core processes affected by them. This list includes many known driver genes (e.g., MITF), which CONEXIC correctly identified and paired with their known targets. This list also includes novel ‘driver’ candidates including Rab27a and TBC1D16, both involved in protein trafficking. Silencing of these genes with shRNA in short-term tumor-derived cultures determined that they are tumor dependencies and validated their computationally predicted role in melanoma (including target identification), suggesting that protein trafficking may play an important role in this malignancy.

Flu cases in early 2009

Because flu viruses mutate nearly once every reproduction cycle, no two people are made sick by precisely the same virus, as illustrated by this chart documenting swine flu cases among humans in early 2009.

The recent outbreak and sudden spread of a novel H1N1 influenza virus has caused worldwide concern and has tested our ability to respond to major public health challenges. Significant scientific resources have been marshaled to discover the best possible responses against this novel swine-origin influenza virus. A group led by Raul Rabadan at the Center for Computational Biology and Bioinformatics and the Department of Biomedical Informatics at Columbia University has been studying the evolution of influenza viruses and the origins of flu pandemics by analyzing large data sets that contain genomic information.