MAGNet: Basic Computational Research
- Editor's note: The MAGNet center formally closed in July 2016, following the mandatory conclusion of its grant after more than 10 years of activity. The pages in this section constitute an archive of its work.
Efforts in this area provide critical expertise in the advancement of theoretical knowledge-based methods that are then applied to the solution of specific biomedical problems. Research is carried out by investigators at Columbia's School of Engineering and Applied Science (Dimitris Anastassiou, Gail Kaiser, Itsik Pe'er, Kenneth Ross, David Waltz, Chris Wiggins), the Graduate School of Arts and Sciences (Dana Pe'er), the Columbia University Medical Center (Andrea Califano, Aris Floratos, Carol Friedman), the Memorial Sloan Kettering Cancer Center (Christina Leslie), and the University of Chicago (Yves Lussier). Work is distributed across five research projects:
Machine learning
(Lead Investigators: Christina Leslie, Chris Wiggins)
The focus of this project is the algorithmic development and software implementation of MEDUSA, GeneClass, NetClass, InfoMod, and NetBoost. MEDUSA uses boosting to learn cis-regulatory motifs and build a gene regulatory program from promoter sequence and gene expression data. It uses the expression levels of known transcriptional regulators and signal transducers to help predict the differential expression of target genes as mediated by sequence motifs. GeneClass also uses boosting to learn a transcriptional regulatory program, but relies on predefined features associated with the promoter sequences, such as known motifs or ChIP-chip occupancies. NetClass provides an approach to hypothesis testing for existing networks (making clear which evolutionary mechanisms best describe a particular network, and suggesting more detailed analyses such as phylogenetic comparisons) and provides priors (in the Bayesian sense) for constraining network inference algorithms on future data. InfoMod is an information-theoretic approach to quantifying network modularity and revealing functional modules in large networks. NetBoost uses boosting to determine which among a set of plausible generative processes may have given rise (over evolutionary time) to the observed topology of an interaction network.
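Boosting, the technique shared by MEDUSA, GeneClass, and NetBoost, can be illustrated with a minimal sketch. The code below is a generic AdaBoost with one-feature decision stumps, not the MEDUSA or GeneClass implementation. The toy data, the feature meanings (motif presence, regulator state), and all function names are assumptions made for illustration only.

```python
import math

def stump_output(x, feature, polarity):
    # A stump votes `polarity` when the binary feature is present, else the opposite.
    return polarity if x[feature] else -polarity

def train_adaboost(X, y, rounds=5):
    """Generic AdaBoost. X: binary feature vectors; y: labels in {-1, +1}."""
    n = len(X)
    weights = [1.0 / n] * n
    ensemble = []  # list of (alpha, feature, polarity)
    for _ in range(rounds):
        # Pick the stump with the lowest weighted training error.
        best = None
        for feature in range(len(X[0])):
            for polarity in (1, -1):
                err = sum(w for w, x, label in zip(weights, X, y)
                          if stump_output(x, feature, polarity) != label)
                if best is None or err < best[0]:
                    best = (err, feature, polarity)
        err, feature, polarity = best
        if err >= 0.5:
            break  # no stump does better than chance
        err = min(max(err, 1e-10), 1 - 1e-10)  # avoid division by zero
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, feature, polarity))
        # Increase the weight of misclassified examples, then renormalize.
        weights = [w * math.exp(-alpha * label * stump_output(x, feature, polarity))
                   for w, x, label in zip(weights, X, y)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def predict(ensemble, x):
    score = sum(alpha * stump_output(x, feature, polarity)
                for alpha, feature, polarity in ensemble)
    return 1 if score >= 0 else -1

# Hypothetical toy data: columns are [motif_A present, motif_B present, regulator high];
# label +1 means the target gene is up-regulated, -1 down-regulated.
X = [[1, 0, 1], [1, 1, 0], [0, 1, 1], [0, 0, 0], [1, 1, 1], [0, 1, 0]]
y = [1, 1, -1, -1, 1, -1]

model = train_adaboost(X, y)
predictions = [predict(model, x) for x in X]
```

In this toy set the first feature separates the labels perfectly, so the first stump chosen tests that feature. Real regulatory data would need many more rounds and richer weak learners than single-feature stumps.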
Models for natural language processing and ontology
(Lead Investigators: Yves Lussier, Carol Friedman)
The PhenoGO database uses Natural Language Processing technology (BioMedLEE) to code gene-phenotype relationships. PhenoGO draws concepts from a number of different genotypic and phenotypic databases (UMLS, OMIM, etc.) with different schemata. These databases are unified through PGschema, a novel representational schema that enables translation of phenotypic, genetic and associated cellular context information, found in textual narratives, to a well-defined data structure comprising phenotypic and genetic concepts from established ontologies along with modifiers and relationships. NLP-derived databases can be useful in addressing significant biological questions, such as the identification of protein interaction networks that are common to disease pairs. Significant correlations between pairs of diseases and networks were identified and validated in this project, demonstrating the potential for discovery.
Meta-ontologies for bioinformatics component interoperability
(Lead Investigators: Aris Floratos, Andrea Califano)
The goal of this project is to develop a formal Biomedical Informatics Structured ONtology (BISON) for the representation of bioinformatics data structures and data structure transformations (algorithms, applications, tools). BISON is used to define universal interfaces and contracts for the interoperability of complex biomedical software components. It complements and extends existing biomedical informatics ontologies and vocabularies for the representation of data in knowledge bases and databases. geWorkbench utilizes BISON as its data interoperability backbone, using BISON-anchored interfaces to model data exchanged between the various geWorkbench application plugin modules.
Novel tools for integration of biological databases and analyses
(Lead Investigators: Gail Kaiser, Kenneth Ross)
This project investigates support for knowledge sharing across communities of scientists. genSpace, a component of geWorkbench, automates incremental assembly of a community knowledge base for knowledge sharing. genSpace monitors specific tool and workflow use within geWorkbench, through event logging, aggregation, and data mining facilities. Workflows are defined as patterns of analytical component invocations occurring within a configurable time window. Such knowledge (i.e., how researchers use tools and workflows) is presented to researchers as recommendations, leveraging the social networking and collaborative filtering techniques familiar to Internet users.
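The idea of deriving workflows from logged tool invocations within a time window can be sketched as follows. This is not genSpace's code: the event format, the tool names, and the rule that a gap longer than the window starts a new workflow are all assumptions chosen for illustration.

```python
from collections import Counter, defaultdict

def extract_workflows(events, window_seconds=600):
    """Group each user's tool invocations into workflows.

    events: iterable of (user, timestamp_seconds, tool) tuples (hypothetical format).
    Assumption: a gap longer than `window_seconds` between consecutive
    invocations by the same user starts a new workflow.
    """
    by_user = defaultdict(list)
    for user, timestamp, tool in events:
        by_user[user].append((timestamp, tool))

    workflows = []
    for invocations in by_user.values():
        invocations.sort()  # chronological order per user
        current, last_timestamp = [], None
        for timestamp, tool in invocations:
            if last_timestamp is not None and timestamp - last_timestamp > window_seconds:
                workflows.append(tuple(current))
                current = []
            current.append(tool)
            last_timestamp = timestamp
        if current:
            workflows.append(tuple(current))
    return workflows

# Hypothetical event log; tool names are placeholders, not geWorkbench components.
events = [
    ("alice", 0, "normalize"), ("alice", 120, "t_test"), ("alice", 300, "cluster"),
    ("alice", 5000, "normalize"), ("alice", 5100, "t_test"),
    ("bob", 50, "normalize"), ("bob", 200, "t_test"),
]

workflows = extract_workflows(events, window_seconds=600)
counts = Counter(workflows)          # how often each tool sequence occurs
most_common_workflow = counts.most_common(1)[0]
```

Counting identical sequences across users gives a simple popularity signal that a recommender could surface. A collaborative filter would go further and weight sequences by the similarity between users.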
Algorithmic approaches for the integration of genetic variation and expression data in the study of regulation
(Lead Investigators: Dana Pe'er, Itsik Pe'er)
The overall objective is to develop a suite of methods that integrate genotype, sequence and functional genomics data to understand the connections between genotype and phenotype. This includes detecting genetic polymorphisms that are causative of phenotype and understanding how genetic variation alters the regulatory network, cell decisions, and more complex phenotypes. A particular focus is applying this approach to human disease, providing insight into how genetic variation (mutation) can lead to dysregulation and disease (e.g., cancer). The approach exploits the modularity of biological systems by searching for regulatory programs that are predictive of entire modules, allowing discovery of complex combinatorial regulation programs that are undetectable when considering each gene in isolation.