20th Annual International Conference on
Intelligent Systems for Molecular Biology


Proceedings Track Presentations
As of May 1, 2012 (schedule subject to change)

New for 2012!

All Highlights and Proceedings Track presentations are presented by scientific area as part of the combined Paper Presentation schedule.
A full schedule of Paper Presentations can be found here.


Presenting Authors are shown in bold:

GenomeRing: alignment visualization based on SuperGenome coordinates

Author(s):
Alexander Herbig, University of Tübingen, Germany
Günter Jäger, University of Tübingen
Florian Battke, University of Tübingen
Kay Nieselt, University of Tübingen

Session Chair: Robert Murphy
Abstract:

Motivation: The number of completely sequenced genomes is continuously rising, allowing for comparative analyses of genomic variation. Such analyses are often based on whole-genome alignments to elucidate structural differences arising from insertions, deletions or rearrangement events. Computational tools that can visualize genome alignments in a meaningful manner are needed to help researchers gain new insights into the underlying data. Such visualizations are typically realized either in a linear fashion, as in genome browsers, or using a circular approach in which relationships between genomic regions are indicated by arcs. Both methods allow for the integration of additional information such as experimental data or annotations. However, providing a visualization that still allows for a quick and comprehensive interpretation of all important genomic variations together with various supplemental data, which may be highly heterogeneous, remains a challenge. Results: Here we present two complementary approaches to tackle this problem. First, we propose the SuperGenome concept for the computation of a common coordinate system for all genomes in a multiple alignment. This coordinate system allows for the consistent placement of genome annotations in the presence of insertions, deletions, and rearrangements. Second, we present the GenomeRing visualization which, based on the SuperGenome, creates an interactive visualization of the multiple genome alignment in a circular layout. We demonstrate our methods by applying them to an alignment of Campylobacter jejuni strains for the discovery of genomic islands as well as to an alignment of Helicobacter pylori, which we visualize in combination with gene expression data.
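The core of a common coordinate system as described above can be illustrated with a small sketch (not the authors' implementation; the function name and toy alignment are invented for illustration): each non-gap character of a gapped multiple alignment is mapped to its alignment column, so positions in different genomes that share a column share a common coordinate.

```python
# Illustrative sketch of a SuperGenome-style common coordinate system:
# build, for each genome in a gapped multiple alignment, a map from
# genome position to alignment column ('-' marks a gap).

def build_coordinate_maps(alignment):
    """alignment: list of equal-length gapped sequences.
    Returns one dict per genome mapping genome position -> alignment column."""
    maps = []
    for row in alignment:
        pos_to_col = {}
        pos = 0
        for col, ch in enumerate(row):
            if ch != '-':
                pos_to_col[pos] = col
                pos += 1
        maps.append(pos_to_col)
    return maps

alignment = ["ACG--TTA",
             "AC-GGTT-"]
maps = build_coordinate_maps(alignment)
# Position 3 of genome 1 and position 4 of genome 2 both map to
# alignment column 5, so annotations on either can be placed consistently.
```

Real alignments also contain rearrangements, which the SuperGenome handles and this linear sketch does not.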

Joint Stage Recognition and Anatomical Annotation of Drosophila Gene Expression Patterns

Author(s):
Xiao Cai, University of Texas at Arlington, United States
Hua Wang, University of Texas at Arlington
Heng Huang, University of Texas at Arlington
Chris Ding, University of Texas at Arlington

Session Chair: Robert Murphy
Abstract:

Motivation: Staining the mRNA of a gene via in situ hybridization (ISH) during the development of a Drosophila melanogaster embryo delivers detailed spatio-temporal patterns of gene expression. Many related biological problems, such as the detection of co-expressed genes, co-regulated genes, and transcription factor binding motifs, rely heavily on the analysis of these image patterns. To provide text-based pattern searching that facilitates related biological studies, the images in the Berkeley Drosophila Genome Project (BDGP) study are manually annotated with a developmental stage term and anatomical ontology terms by domain experts. Due to the rapidly increasing number of such images and the inevitable annotation bias introduced by human curators, it is necessary to develop an automatic method to recognize the developmental stage and annotate anatomical terms. Results: In this paper, we propose a novel computational model for joint stage classification and anatomical term annotation of Drosophila gene expression patterns. We introduce a new Tri-Relational Graph (TG) model that comprises the data graph, the anatomical term graph, and the developmental stage term graph, and connects them by three additional graphs induced from stage or annotation label assignments. On the TG model, we present a Preferential Random Walk (PRW) method that jointly recognizes the developmental stage and annotates anatomical terms by exploiting the interrelations between the two tasks. Experimental results on two refined BDGP data sets demonstrate that our joint learning method achieves superior prediction results on both tasks compared with state-of-the-art methods.
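The generic machinery behind preferential-random-walk style propagation can be sketched as a random walk with restart (a minimal illustration only; the paper's PRW on the Tri-Relational Graph adds task-specific preference terms not shown here, and the function name and toy graph are invented):

```python
import numpy as np

# Random walk with restart: repeatedly mix one transition step with a
# jump back to the seed distribution until the node scores converge.

def random_walk_with_restart(A, seed, restart=0.3, tol=1e-10):
    """A: symmetric adjacency matrix; seed: restart distribution over nodes."""
    W = A / A.sum(axis=0, keepdims=True)      # column-stochastic transitions
    p = seed.copy()
    while True:
        p_next = (1 - restart) * W @ p + restart * seed
        if np.abs(p_next - p).sum() < tol:
            return p_next
        p = p_next

# Toy path graph 0 - 1 - 2, seeded at node 0.
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
seed = np.array([1., 0., 0.])
p = random_walk_with_restart(A, seed)
# Node 1 (adjacent to the seed) scores higher than node 2.
```

The converged scores rank nodes by proximity to the seed, which is how walk-based methods transfer label evidence across a graph.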

Improved synapse detection for mGRASP-assisted brain connectivity mapping

Author(s):
Linqing Feng, Korea Institute of Science and Technology, Republic of Korea
Ting Zhao, Zhejiang University
Jinhyun Kim, Korea Institute of Science and Technology

Session Chair: Robert Murphy
Abstract:

Motivation: A new technique, mammalian GFP reconstitution across synaptic partners (mGRASP), enables mapping mammalian synaptic connectivity with light microscopy. To characterize the locations and distribution of synapses in complex neuronal networks visualized by mGRASP, it is essential to detect mGRASP fluorescence signals with high accuracy.

Results: We developed a fully-automatic method for detecting mGRASP-labeled synapse puncta. By modeling each punctum as a Gaussian distribution, our method enables accurate detection even when puncta of varying size and shape partially overlap. The method consists of three stages: blob detection by global thresholding; blob separation by watershed; and punctum modeling by a Variational Bayesian Gaussian Mixture Model. Extensive testing shows that the three-stage method improved detection accuracy markedly, and especially reduces under-segmentation. The method provides a goodness-of-fit score for each detected punctum, allowing efficient error detection. We applied this advantage to also develop an efficient interactive method for correcting errors.

Availability: The software is available at http://jinny.kist.re.kr
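The three-stage flow described above (threshold, separate, model) can be sketched on a toy image (an illustrative sketch only, not the authors' method: connected-component labelling stands in for watershed, and a simple intensity-weighted Gaussian centre stands in for the variational Bayesian mixture, which is what actually resolves overlapping puncta):

```python
import numpy as np
from scipy import ndimage

def detect_puncta(img, threshold):
    mask = img > threshold                      # stage 1: global threshold
    labels, n = ndimage.label(mask)             # stage 2: blob separation
    centres = []
    for i in range(1, n + 1):
        ys, xs = np.nonzero(labels == i)
        w = img[ys, xs]                         # stage 3: Gaussian centre as
        centres.append(np.array([np.average(ys, weights=w),   # intensity-
                                 np.average(xs, weights=w)])) # weighted mean
    return centres

# Synthetic image with two bright, well-separated spots.
img = np.zeros((20, 20))
img[4, 5] = img[15, 12] = 10.0
img = ndimage.gaussian_filter(img, sigma=1.0)
centres = detect_puncta(img, threshold=0.1)
```

On this toy input the two recovered centres land on the two spots; overlapping puncta would require the mixture-model stage the paper describes.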


Nonparametric Bayesian Inference for Perturbed and Orthologous Gene Regulatory Networks

Author(s):
Christopher A. Penfold, University of Warwick, United Kingdom
Vicky Buchanan-Wollaston, University of Warwick
Katherine J. Denby, University of Warwick
David L. Wild, University of Warwick

Session Chair: Paul Horton
Abstract:

The generation of time series transcriptomic datasets collected under multiple experimental conditions has proven to be a powerful approach for disentangling complex biological processes, allowing for the reverse engineering of gene regulatory networks (GRNs). Most methods for reverse engineering GRNs from multiple datasets assume that each of the time series was generated from networks with identical topology. Here we outline a hierarchical, nonparametric Bayesian approach for reverse engineering GRNs using multiple time series that can be applied in a number of novel situations, including: (i) where different, but overlapping, sets of transcription factors are expected to bind in the different experimental conditions, i.e., where switching events could potentially arise under the different treatments; and (ii) for inference in evolutionarily related species in which orthologous GRNs exist. The hierarchical inference outperforms related (but non-hierarchical) approaches when the networks used to generate the data were identical, and performs comparably even when the networks used to generate the data were independent. The method was subsequently used alongside yeast one-hybrid and microarray time series data to infer potential transcriptional switches in the Arabidopsis thaliana response to abiotic stress. The results confirm previous biological studies and allow for additional insights into gene regulation under various abiotic stresses.

Protein Subcellular Location Pattern Classification in Cellular Images Using Latent Discriminative Models

Author(s):
Jieyue Li, Carnegie Mellon University, United States
Liang Xiong, Carnegie Mellon University
Robert Murphy, Carnegie Mellon University
Jeff Schneider, Carnegie Mellon University

Session Chair: Robert Murphy
Abstract:

In the human proteome, the subcellular location pattern is crucial for understanding the function of a protein. This pattern is essentially characterized by the co-localization of the protein with components of the cell. In this paper, we address the protein pattern classification problem based on confocal immunofluorescence cellular images from the Human Protein Atlas (HPA) project. In our HPA data set, each cell has staining images of one protein and three reference components, while many other components remain invisible. Given one such cell, the task is to classify the pattern type of the stained protein. We first randomly select local image regions within the cells, and then extract various carefully designed features from these regions. Compared to traditional cell-based methods, this region-based approach enables us to explicitly study the relationship between proteins and different cell components, as well as the interactions between these components. To achieve these two goals, we propose two discriminative models that extend logistic regression with structured latent variables. The first model allows the same protein pattern class to be expressed differently according to the underlying components in different regions. The second model further captures the spatial dependencies between the components within the same cell so that we can better infer these components. To learn these models, we propose a fast approximate algorithm for inference, and then use gradient-based methods to maximize the data likelihood. In the experiments, we show that the proposed models improve classification accuracy on both synthetic data and real cellular images. The best overall accuracy we report in this paper for classifying 942 proteins into 13 classes of patterns is about 84.6%, which to our knowledge is the best so far. In addition, we can give these results biological interpretations.

NOrMAL: Accurate Nucleosome Positioning using a Modified Gaussian Mixture Model

Author(s):
Anton Polishko, UC Riverside, United States
Nadia Ponts, UC Riverside
Karine Le Roch, UC Riverside
Stefano Lonardi, UC Riverside

Session Chair: Paul Horton
Abstract:

Motivation: Nucleosomes are the basic elements of DNA chromatin structure. They control the packaging of DNA and play a critical role in gene regulation by allowing physical access to transcription factors. The advent of second-generation sequencing has enabled landmark genome-wide studies of nucleosome position for several model organisms. Current methods to determine nucleosome positioning first compute an occupancy coverage profile by mapping nucleosome-enriched sequenced reads to a reference genome; then, nucleosomes are placed according to the peaks of the coverage profile. These methods are quite accurate at placing isolated nucleosomes, but they do not properly handle "overlapping" nucleosomes. Also, they can only provide the positions of nucleosomes and their occupancy level, although it would be very beneficial to supply molecular biologists with additional information about nucleosomes, such as the probability of placement, the size of the DNA fragments enriched for nucleosomes, and/or whether nucleosomes are well-positioned or "fuzzy" in the sequenced cell sample. Results: We address these issues with a novel method based on a parametric probabilistic model. An expectation maximization (EM) algorithm is used to infer the parameters of the mixture of distributions. We compare the performance of our method on two real datasets against Template Filtering, which is considered the current state of the art. Experimental results show that our method detects a significantly higher number of nucleosomes than Template Filtering.
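The EM-for-mixtures machinery at the heart of such a model can be sketched in one dimension (a minimal sketch, not the paper's model, which adds fragment-size and "fuzziness" parameters; the function name and synthetic read centres are invented):

```python
import numpy as np

# EM for a two-component 1-D Gaussian mixture: two overlapping
# nucleosomes generate read centres; EM recovers centre positions (mu),
# spreads (sigma, a proxy for "fuzziness"), and occupancy weights (pi).

def em_gmm(x, mu, sigma, iters=50):
    pi = np.array([0.5, 0.5])
    for _ in range(iters):
        # E-step: responsibility of each component for each read centre
        d = np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2) / sigma
        r = pi * d
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, and spreads
        pi = r.mean(axis=0)
        mu = (r * x[:, None]).sum(axis=0) / r.sum(axis=0)
        sigma = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / r.sum(axis=0))
    return pi, mu, sigma

rng = np.random.default_rng(0)
# Read centres from two overlapping nucleosomes near positions 100 and 160.
x = np.concatenate([rng.normal(100, 20, 300), rng.normal(160, 20, 300)])
pi, mu, sigma = em_gmm(x, mu=np.array([80.0, 180.0]), sigma=np.array([30.0, 30.0]))
```

Unlike peak-calling on a coverage profile, the fitted components separate the two overlapping nucleosomes and report a spread for each.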

How networks change with time

Author(s):
Yongjin Park, Johns Hopkins University, United States
Joel Bader, Johns Hopkins University

Session Chair: Lenore Cowen
Abstract:

Motivation: Biological networks change in response to genetic and environmental cues. Changes are reflected in the abundances of biomolecules, the composition of protein complexes, and other descriptors of the biological state. Methods to infer the dynamic state of a cell would have great value for understanding how cells change over time to accomplish biological goals. Results: A new method predicts the dynamic state of protein complexes in a cell, with protein expression inferred from transcription profile time courses and protein complexes inferred by joint analysis of protein co-expression and protein-protein interaction maps. Two algorithmic advances are presented: a new method, DHAC (Dynamical Hierarchical Agglomerative Clustering), for clustering time-evolving networks; and a companion method, MATCH-EM, for matching corresponding clusters across time-points. With link prediction as an objective assessment metric, DHAC provides a substantial advance over existing clustering methods. An application to the yeast metabolic cycle demonstrates how waves of gene expression correspond to individual protein complexes. Our results suggest regulatory mechanisms for assembling the mitochondrial ribosome and illustrate dynamic changes in the components of the nuclear pore. Availability: All source code and data will be available through a BSD open source license as supplementary material and at www.baderzone.org.

Leveraging Input and Output Structures For Joint Mapping of Epistatic and Marginal eQTLs

Author(s):
Seunghak Lee, Carnegie Mellon University, United States
Eric Xing, Carnegie Mellon University

Session Chair: Eran Halperin
Abstract:

Motivation: Since many complex disease and expression phenotypes are the outcome of intricate perturbations of the molecular networks underlying gene regulation, resulting from interdependent genome variations, association mapping of causal QTLs or eQTLs must consider both additive and epistatic effects of multiple candidate genotypes. This problem poses a significant challenge to contemporary genome-wide association (GWA) mapping technologies because of its computational complexity. Fortunately, a plethora of recent developments in the biological network community, especially the availability of genetic interaction networks, make it possible to construct informative priors on complex interactions between genotypes, which can substantially reduce the complexity and increase the statistical power of GWA inference. Results: In this paper, we consider the problem of learning a multi-task regression model while taking advantage of prior information on the structures of both the inputs (genetic variations) and the outputs (expression levels). We propose a novel regularization scheme over multi-task regression, called the structured joint input/output lasso, based on an L1/L2 norm, which allows shared sparsity patterns for related inputs and outputs to be optimally estimated. Such patterns capture multiple related SNPs that jointly influence multiple related expression traits. In addition, we generalize this new multi-task regression to structurally regularized polynomial regression to detect epistatic interactions with manageable complexity by exploiting prior knowledge on candidate epistatic SNPs from biological experiments. We demonstrate our method on simulated and yeast eQTL datasets.
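The effect of an L1/L2 (group lasso) penalty can be seen in its proximal operator, the building block of structured-sparsity solvers (an illustrative sketch only, not the paper's optimizer; the function name, grouping, and coefficients are invented): each group of coefficients is shrunk jointly, so related SNPs are kept or dropped together.

```python
import numpy as np

# Proximal operator of the L1/L2 (group lasso) penalty: shrink each
# group's coefficient vector toward zero by its Euclidean norm; groups
# whose norm falls below the regularization strength are zeroed entirely.

def group_soft_threshold(beta, groups, lam):
    out = beta.copy()
    for g in groups:
        norm = np.linalg.norm(beta[g])
        out[g] = 0.0 if norm <= lam else (1 - lam / norm) * beta[g]
    return out

beta = np.array([3.0, 4.0, 0.1, -0.1])
groups = [[0, 1], [2, 3]]        # e.g. two groups of related SNPs
shrunk = group_soft_threshold(beta, groups, lam=1.0)
# The strong group (norm 5) survives, shrunk to norm 4;
# the weak group (norm ~0.14) is removed as a unit.
```

Plain lasso would threshold each coefficient independently; the grouped shrinkage is what lets structured schemes share sparsity patterns across related inputs and outputs.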

Lineage based identification of cellular states and expression programs

Author(s):
Tatsunori Hashimoto, Massachusetts Institute of Technology, United States
Tommi Jaakkola, Massachusetts Institute of Technology
Richard Sherwood, Brigham and Women's Hospital
Esteban Mazzoni, Columbia University Medical Center
Hynek Wichterle, Columbia University Medical Center
David Gifford, Massachusetts Institute of Technology

Session Chair: Janet Kelso
Abstract:

We present a method, LineageProgram, that uses the developmental lineage relationship of observed gene expression measurements to improve the learning of developmentally relevant cellular states and expression programs. We find that incorporating lineage information allows us to significantly improve both the predictive power and interpretability of expression programs derived from expression measurements of in vitro differentiation experiments. The lineage tree of a differentiation experiment is a tree graph whose nodes describe all of the unique expression states in the input expression measurements, and whose edges describe the experimental perturbations applied to cells. Our method, LineageProgram, is based on a log-linear model with parameters that reflect changes along the lineage tree. Regularization with L1-based methods controls the parameters in three distinct ways: the number of genes that change between two cellular states, the number of unique cellular states, and the number of underlying factors responsible for changes in cell state. The model is estimated with proximal operators to quickly discover a small number of key cell states and gene sets. Comparisons with existing factorization techniques, such as singular value decomposition and nonnegative matrix factorization, show that our method provides higher predictive power in held-out tests while inducing sparse and biologically relevant gene sets.

A single-source k shortest paths algorithm to infer regulatory pathways in a gene network

Author(s):
Yu-Keng Shih, The Ohio State University, United States
Srinivasan Parthasarathy, The Ohio State University

Session Chair: Lenore Cowen
Abstract:

Motivation: Inferring the underlying signaling pathways within a gene interaction network is a fundamental problem in systems biology that helps in understanding the complex interactions and the transmission and flow of information within a system of interest. Given a weighted gene network and a gene in this network, the goal of an inference algorithm is to identify the potential signaling pathways passing through this gene. Results: In a departure from previous approaches that largely rely on the random walk model, we propose a novel single-source k shortest paths based algorithm to address this inference problem. An important element of our approach is to explicitly account for and enhance the diversity of the paths discovered by our algorithm. The intuition here is that diversity in paths can help enrich different functions and thereby better position one to understand the underlying system of interest. Results on the yeast gene network demonstrate the utility of the proposed approach over extant state-of-the-art inference algorithms. Beyond utility, our algorithm achieves a significant speedup over these baselines.
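The underlying k-shortest-paths primitive can be sketched as a best-first enumeration over simple paths (a plain illustration, not the paper's algorithm, which additionally promotes diversity among the returned paths; the function name and toy graph are invented):

```python
import heapq

# Best-first enumeration of the k shortest simple (loop-free) paths
# between two nodes of a positively weighted directed graph: pop the
# cheapest partial path, extend it to unvisited neighbours, and record
# it when it reaches the target.

def k_shortest_paths(graph, source, target, k):
    """graph: {node: {neighbour: weight}}; returns up to k (cost, path) pairs."""
    heap = [(0, [source])]
    found = []
    while heap and len(found) < k:
        cost, path = heapq.heappop(heap)
        node = path[-1]
        if node == target:
            found.append((cost, path))
            continue
        for nbr, w in graph.get(node, {}).items():
            if nbr not in path:              # keep paths simple
                heapq.heappush(heap, (cost + w, path + [nbr]))
    return found

graph = {'a': {'b': 1, 'c': 2},
         'b': {'d': 1},
         'c': {'d': 1},
         'd': {}}
paths = k_shortest_paths(graph, 'a', 'd', k=2)
# Shortest: a-b-d (cost 2); next: a-c-d (cost 3).
```

Note that the second path here happens to be node-disjoint from the first; enforcing that kind of diversity explicitly, rather than by luck, is the point of the paper's approach.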

Incorporating Prior Information into Association Studies

Author(s):
Gregory Darnell, UCLA, United States
Dat Duong, University of California Berkeley
Buhm Han, University of California
Eleazar Eskin, UCLA

Session Chair: Eran Halperin
Abstract:

Recent technological developments in measuring genetic variation have ushered in an era of genome-wide association studies, which have discovered many genes involved in human disease. Current methods to perform association studies collect genetic information and compare the frequency of variants in individuals with and without the disease. Standard approaches do not take into account any information on whether or not a given variant is likely to have an effect on the disease. We propose a novel method for computing an association statistic which takes prior information into account. Our method improves power and resolution by 43.5% and 45%, respectively, over traditional methods for performing association studies when applied to simulations using the HapMap data. The advantages of our method are that it is as simple to apply to association studies as standard methods, its results are interpretable since the method reports p-values, and it is optimal in its use of prior information with regard to statistical power.
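One simple, standard way to fold prior information into association testing, shown purely for flavour (this is the weighted Bonferroni rule, not the paper's statistic; the function name, p-values, and weights are invented), is to reallocate the significance budget toward variants with strong priors:

```python
import numpy as np

# Weighted Bonferroni: variant i is significant if
# p_i <= alpha * w_i / m, where the prior weights w average to 1 and m is
# the number of tests. Up-weighted variants get a looser threshold;
# down-weighted ones a stricter threshold.

def weighted_bonferroni(pvals, weights, alpha=0.05):
    w = np.asarray(weights, dtype=float)
    w = w / w.mean()                      # normalize weights to mean 1
    per_test_alpha = alpha * w / len(pvals)
    return np.asarray(pvals) <= per_test_alpha

pvals = [0.03, 0.002, 0.2]
weights = [2.9, 0.05, 0.05]               # strong prior on the first variant
hits = weighted_bonferroni(pvals, weights)
# The first variant is significant only because of its prior weight;
# unweighted Bonferroni (threshold 0.05/3) would have rejected neither
# it nor called the second variant differently in the other direction.
```

The gain in power for prioritized variants, paid for by reduced power elsewhere, is the basic trade-off any prior-aware association statistic manages.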

Matching experiments across species using expression values and textual information

Author(s):
Aaron Wise, Carnegie Mellon University, United States
Zoltan Oltvai, University of Pittsburgh
Ziv Bar-Joseph, Carnegie Mellon University

Session Chair: Janet Kelso
Abstract:

Motivation: With the vast increase in the number of gene expression datasets deposited in public databases, novel techniques are required to analyze and mine this wealth of data. Similar to the way BLAST enables cross-species comparison of sequence data, tools that enable cross-species expression comparison will allow us to better utilize these datasets: Cross-species expression comparison enables us to address questions in evolution and development, and further allows the identification of disease related genes and pathways that play similar roles in humans and model organisms. Unlike sequence, which is static, expression data changes over time and under different conditions. Thus, a prerequisite for performing cross-species analysis is the ability to match experiments across species. Results: To enable better cross-species comparisons, we developed methods for automatically identifying pairs of similar expression datasets across species. Our method uses a co-training algorithm to combine a model of expression similarity with a model of the text which accompanies the expression experiments. The co-training method outperforms previous methods based on expression similarity alone. Using expert analysis, we show that the new matches identified by our method indeed capture biological similarities across species. We then use the matched expression pairs between human and mouse to recover known and novel cycling genes as well as to identify genes with possible involvement in diabetes. By providing the ability to identify novel candidate genes in model organisms, our method opens the door to new models for studying diseases.

Toward 3D structure prediction of large RNA molecules: An integer programming framework to insert local 3D motifs in RNA secondary structure

Author(s):
Vladimir Reinharz, McGill University, Canada
Francois Major, Institute for Research in Immunology and Cancer
Jerome Waldispuhl, McGill University

Session Chair: Cenk Sahinalp
Abstract:

Motivation: The prediction of RNA three-dimensional structures from sequence alone is a milestone toward RNA function analysis and prediction. In recent years, many methods have addressed this challenge, ranging from cycle decomposition and fragment assembly to molecular dynamics simulations. However, their predictions remain fragile and limited to small RNAs. To expand the range and accuracy of these techniques, we need to develop algorithms that can exploit all the structural information available. In particular, the energetic contribution of secondary structure interactions is now well documented, but the quantification of non-canonical interactions – those shaping the tertiary structure – is poorly understood. Nonetheless, even if a complete RNA tertiary structure energy model is currently unavailable, we now have catalogues of local 3D structural motifs including non-canonical base pairings. A practical objective is thus to develop techniques enabling us to use this knowledge to build robust RNA tertiary structure predictors. Results: In this work, we introduce RNA-MoIP, a program that benefits from the progress made over the last 30 years in the field of RNA secondary structure prediction and expands these methods to incorporate the local motif information now available in databases. Using an integer programming framework, our method refines predicted secondary structures (i.e., removes incorrect canonical base pairs) to accommodate the insertion of RNA 3D motifs (i.e., hairpins, internal loops and k-way junctions). Then, we use the predictions as templates to generate complete 3D structures with the MC-Sym program. We benchmarked RNA-MoIP on a set of 9 RNAs with sizes varying from 53 to 128 nucleotides. We show that our approach (i) improves the accuracy of canonical base pair predictions, (ii) identifies the best secondary structures in a pool of sub-optimal structures, and (iii) predicts accurate 3D structures of large RNA molecules.
RNA-MoIP is publicly available at: http://csb.cs.mcgill.ca/RNAMoIP
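The flavour of the integer-programming step can be conveyed with a toy version (invented for illustration, not the RNA-MoIP formulation, which also removes conflicting base pairs and uses a real IP solver): choose a subset of candidate motif insertions maximizing total score subject to non-overlap constraints.

```python
from itertools import combinations

# Toy 0/1 selection problem: each candidate motif occupies a sequence
# interval and carries a score; pick the non-overlapping subset with the
# highest total score. Solved by brute force over subsets for clarity.

def best_motif_set(motifs):
    """motifs: list of (start, end, score) candidates."""
    best, best_score = [], 0.0
    for r in range(1, len(motifs) + 1):
        for subset in combinations(motifs, r):
            spans = sorted((s, e) for s, e, _ in subset)
            ok = all(spans[i][1] <= spans[i + 1][0]
                     for i in range(len(spans) - 1))
            score = sum(sc for _, _, sc in subset)
            if ok and score > best_score:
                best, best_score = list(subset), score
    return best, best_score

motifs = [(0, 10, 2.0), (5, 15, 3.0), (12, 20, 1.5)]
chosen, score = best_motif_set(motifs)
# (5, 15) conflicts with both others, so the optimum combines
# (0, 10) and (12, 20) for a total score of 3.5.
```

An IP solver reaches the same optimum without enumerating subsets, which is what makes the approach practical at RNA scale.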

Identification of Sequence-Structure RNA Binding Motifs for SELEX Derived Aptamers

Author(s):
Jan Hoinka, NCBI, NIH, United States
Elena Zotenko, Garvan Institute for Medical Research
Adam Friedman, UNC Chapel Hill
Zuben E. Sauna, US Food and Drug Administration
Teresa Przytycka, NIH

Session Chair: Cenk Sahinalp
Abstract:

Motivation: Systematic Evolution of Ligands by EXponential Enrichment (SELEX) represents a state-of-the-art technology for isolating single-stranded (ribo)nucleic acid fragments, named aptamers, that bind to a molecule (or molecules) of interest via specific structural regions induced by their sequence-dependent fold. This powerful method has applications in designing protein inhibitors, molecular detection systems, therapeutic drugs, and antibody replacements, among others. However, full understanding, and consequently optimal utilization, of the process has lagged behind its wide application due to the lack of dedicated computational approaches. At the same time, the combination of SELEX with novel sequencing technologies is beginning to provide the data that will allow the examination of a variety of properties of the selection process. Results: To close this gap we developed Aptamotif, a computational method for the identification of sequence-structure motifs in SELEX-derived aptamers. To increase the chances of identifying functional motifs, Aptamotif uses an ensemble-based approach. Our new algorithmic solutions are accompanied by rigorous statistical analysis. We validated the method using two published aptamer datasets containing experimentally determined motifs of increasing complexity. We were able to recreate the authors' findings to a high degree, thus demonstrating the capability of our approach to identify binding motifs in SELEX data. Additionally, using our new experimental dataset, we illustrate the application of Aptamotif to elucidate several properties of the selection process.

GraphClust: alignment-free structural clustering of local RNA secondary structures

Author(s):
Fabrizio Costa, Albert-Ludwigs-University Freiburg, Germany
Dominic Rose, Albert-Ludwigs-University Freiburg
Rolf Backofen, Albert-Ludwigs-University Freiburg

Session Chair: Cenk Sahinalp
Abstract:

Motivation: Clustering according to sequence-structure similarity has now become a generally accepted scheme for ncRNA annotation. Its application to complete genomic sequences as well as whole transcriptomes is therefore desirable but hindered by extremely high computational costs. Results: We present a novel linear-time, alignment-free method for comparing and clustering RNAs according to sequence and structure. The approach scales to datasets of hundreds of thousands of sequences. The quality of the retrieved clusters has been benchmarked against known ncRNA datasets and is comparable to state-of-the-art sequence-structure methods, while achieving speed-ups of several orders of magnitude. A selection of applications aiming at the detection of novel structural non-coding RNAs is presented. For example, we predicted local structural elements specific to lincRNAs that likely link the transcripts involved to vital processes of the human nervous system. In total, we predicted 349 local structural RNA elements.

Detection of Allele-Specific Methylations through a Generalized Heterogeneous Epigenome Model

Author(s):
Qian Peng, UCSD, United States
Joseph Ecker, The Salk Institute for Biological Studies

Session Chair: Cenk Sahinalp
Abstract:

Motivation: High-throughput sequencing has made it possible to sequence the DNA methylation of a whole genome at single-base resolution. A sample, however, may contain a number of distinct methylation patterns. For instance, cells of different types and in different developmental stages may have different methylation patterns. Alleles may be differentially methylated, which may partially explain why large portions of epigenomes from single cell types are only partially methylated, and may have major effects on transcriptional output. Approaches relying on DNA sequence polymorphism to identify individual patterns from a mixture of heterogeneous epigenomes are insufficient, as methylcytosines occur at a much higher density than SNPs. Results: We have developed a mixture model-based approach for resolving distinct epigenomes from a heterogeneous sample. In particular, the model is applied to the detection of allele-specific methylation (ASM). The methods are tested on a synthetic methylome and applied to an Arabidopsis single root cell methylome.

A Conditional Neural Fields model for protein threading

Author(s):
Jianzhu Ma, Toyota Technological Institute at Chicago, United States
Jian Peng, Toyota Technological Institute at Chicago
Sheng Wang, Toyota Technological Institute at Chicago
Jinbo Xu, Toyota Technological Institute at Chicago

Session Chair: Bonnie Berger
Abstract:

Motivation: Alignment errors are still the main bottleneck of current template-based protein modeling (TM) methods including protein threading and homology modeling, especially when the sequence identity between two proteins under consideration is low (<30%). Results: We present a novel protein threading method for much more accurate sequence-template alignment by employing a probabilistic graphical model, Conditional Neural Fields (CNF), which aligns one protein sequence to its remote template using a nonlinear scoring function. This scoring function can account for correlation among a variety of protein sequence and structure features, make use of information in the neighborhood of the two residues to be aligned, and thus is much more sensitive than the widely used linear or profile-based scoring functions. To train this CNF threading model, we employ a novel quality-sensitive method that directly maximizes the expected quality of a set of training alignments, instead of the standard maximum-likelihood method. Experimental results show that our CNF method generates significantly better alignments than the best profile-based and threading methods on several public (but small) benchmarks and very large in-house datasets. Our method outperforms others regardless of protein class and length, and works particularly well for proteins with sparse sequence profiles due to its effective utilization of structure information. The methodology presented here can also be adapted to protein sequence alignment.

Novel domain combinations in proteins encoded by chimeric transcripts
Cancelled
Author(s):
Milana Frenkel-Morgenstern, Spanish National Cancer Research Centre (CNIO), Spain
Alfonso Valencia, Spanish National Cancer Research Centre (CNIO)

Abstract:

Chimeric RNA transcripts are generated by different mechanisms, including pre-mRNA trans-splicing, chromosomal translocation and/or gene fusion, and it was recently shown that at least some chimeric transcripts may be translated into functional chimeric proteins. To gain a better understanding of the design principles behind the production of chimeric proteins, we have analyzed 7,424 chimeric RNAs from humans. We focused on the specific domains present in these proteins, comparing their permutations with those of known human proteins. We found that chimeras contain complete protein domains more often than in random datasets and, specifically, that eight different types of domains are over-represented among all chimeras, as well as in those chimeras confirmed by RNA-seq experiments. Moreover, we discovered that some chimeras potentially encode proteins with novel and unique combinations of such domains. Given the prevalence of complete protein domains observed in chimeras, we predict that putative chimeras that lack activation domains may actively compete with their parental proteins, thereby exerting a dominant negative effect. In more general terms, the generation of chimeric transcripts produces a combinatorial increase in the number of protein products available, which may disturb the function of parental genes and influence their protein-protein interaction networks.
TOP

Xenome - A Tool for Classifying Reads from Xenograft Samples

Author(s):
Thomas Conway, NICTA, Australia
Jeremy Wazny, NICTA
Andrew Bromage, NICTA
Martin Tymms, Monash Institute for Medical Research
Dhanya Sooraj, Monash Institute for Medical Research
Elizabeth Williams, Monash Institute for Medical Research
Bryan Beresford-Smith, NICTA

Session Chair: Burkhard Rost
Abstract Show/Hide

Motivation: Shotgun sequence read data derived from xenograft material contains a mixture of reads arising from the host and reads arising from the graft. Classifying the read mixture to separate the two allows for more precise analysis to be performed. Results: We present a technique, with an associated tool Xenome, which performs fast, accurate and specific classification of xenograft derived sequence read data. We have evaluated it on RNA-Seq data from human, mouse and human-in-mouse xenograft data sets. Availability: Xenome is available for non-commercial use from http://www.nicta.com.au/bioinformatics
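Conceptually, this kind of classifier assigns each read to host, graft, both, neither, or ambiguous depending on which reference its k-mers are specific to. A toy sketch of that decision rule, with tiny hypothetical references and k=4 (Xenome itself uses engineered data structures for genome-scale k-mer sets):

```python
def kmers(seq, k=4):
    """All k-mers of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def classify_read(read, host_ref, graft_ref, k=4):
    """Classify a read by which reference its k-mers are specific to."""
    host_k, graft_k = kmers(host_ref, k), kmers(graft_ref, k)
    h = len(kmers(read, k) & (host_k - graft_k))   # host-specific hits
    g = len(kmers(read, k) & (graft_k - host_k))   # graft-specific hits
    if h and g:
        return "both"
    if h:
        return "host"
    if g:
        return "graft"
    if kmers(read, k) & (host_k | graft_k):
        return "ambiguous"   # only k-mers shared by both references
    return "neither"
```

Separating reads this way means downstream variant calling on the graft is not contaminated by host-derived reads.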
TOP

Fast alignment of fragmentation trees

Author(s):
Franziska Hufsky, Friedrich-Schiller-University Jena, Germany
Kai Dührkop, Friedrich-Schiller-University Jena
Florian Rasche, Friedrich-Schiller-University Jena
Markus Chimani, Friedrich-Schiller-University Jena
Sebastian Böcker, Friedrich-Schiller-University Jena

Session Chair: Reinhard Schneider
Abstract Show/Hide

Mass spectrometry allows sensitive, automated and high-throughput analysis of small molecules such as metabolites. One major bottleneck in metabolomics is the identification of "unknown" small molecules not in any database. Recently, fragmentation tree alignments have been introduced for the automated comparison of the fragmentation patterns of small molecules. Fragmentation pattern similarities are strongly correlated with the chemical similarity of molecules, and allow us to cluster compounds based solely on their fragmentation patterns. Aligning fragmentation trees is computationally hard. Nevertheless, we present three exact algorithms for the problem: a dynamic programming (DP) algorithm, a sparse variant of the DP, and an Integer Linear Program (ILP). Evaluation of our methods on three different datasets showed that thousands of alignments can be computed in a matter of minutes using DP, even for "challenging" instances. Running times of the sparse DP were an order of magnitude better than for the classical DP. The ILP was clearly outperformed by both DP approaches. We also found that for both DP algorithms, computing the 1% slowest alignments required as much time as computing the 99% fastest.
TOP

Dissect: Detection and Characterization of Novel Structural Alterations in Transcribed Sequences

Author(s):
Deniz Yorukoglu, Massachusetts Institute of Technology, United States
Faraz Hach, Simon Fraser University
Lucas Swanson, Simon Fraser University
Colin C. Collins, Vancouver Prostate Centre
Inanc Birol, Genome Sciences Centre
S. Cenk Sahinalp, Simon Fraser University

Session Chair: Hagit Shatkay
Abstract Show/Hide

Motivation: Computational identification of genomic structural variants via high throughput sequencing is an important problem for which a number of highly sophisticated solutions have been developed recently. With the advent of high-throughput transcriptome sequencing (RNA-Seq), the problem of identifying structural alterations in the transcriptome is now attracting significant attention. In this paper, we introduce two novel algorithmic formulations for identifying transcriptomic structural variants through aligning transcripts to the reference genome under the consideration of such variation. The first formulation is based on a nucleotide-level alignment model; a second, potentially faster formulation is based on chaining fragments shared between each transcript and the reference genome. Based on these formulations, we introduce a novel transcriptome-to-genome alignment tool, Dissect, which can identify and characterize transcriptomic events such as duplications, inversions, rearrangements and fusions. Dissect is suitable for whole transcriptome structural variation discovery problems involving sufficiently long reads or accurately assembled contigs.
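The chaining idea behind the second formulation can be illustrated with the textbook O(n²) dynamic program that picks a maximum-coverage colinear chain of shared anchors. This is a simplified sketch only; Dissect's model additionally allows the inversion, duplication and rearrangement moves described above:

```python
def chain_fragments(frags):
    """Each anchor is (genome_start, transcript_start, length); return the
    maximum total anchor length over colinear, non-overlapping chains."""
    frags = sorted(frags)
    best = [length for _, _, length in frags]    # best chain ending at i
    for i, (gi, ti, li) in enumerate(frags):
        for j, (gj, tj, lj) in enumerate(frags[:i]):
            if gj + lj <= gi and tj + lj <= ti:  # anchor j can precede i
                best[i] = max(best[i], best[j] + li)
    return max(best, default=0)
```

For example, anchors (0,0,3) and (5,5,4) are colinear and chain to a score of 7, while an anchor that moves backwards on the transcript cannot join the chain.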
TOP

MoRFpred, a computational tool for sequence-based prediction and characterization of disorder-to-order transitioning binding sites in proteins

Author(s):
Fatemeh Miri Disfani, University of Alberta, Canada
Wei-Lun Hsu, Indiana University
Marcin J. Mizianty, University of Alberta
Christopher J. Oldfield, Indiana University
Bin Xue, University of South Florida
A. Keith Dunker, Indiana University
Vladimir N. Uversky, University of South Florida
Lukasz Kurgan, University of Alberta

Session Chair: Nir Ben-Tal
Abstract Show/Hide

Motivation: Molecular Recognition Feature (MoRF) regions are disordered binding sites that become structured upon binding. MoRFs are implicated in important biological processes, including signaling and regulation. However, only a limited number of experimentally validated MoRFs is known, which motivates development of computational methods that predict MoRFs from protein chains. Results: We introduce a new MoRF predictor, MoRFpred, which identifies all MoRF types (alpha, beta, coil, and complex). We develop a comprehensive dataset of annotated MoRFs and use it to build and empirically compare our method. MoRFpred is based on a novel design in which annotations generated by sequence alignment are fused with predictions generated by a Support Vector Machine (SVM), which uses a custom designed set of sequence-derived features. The features provide information about evolutionary profiles, selected physicochemical properties of amino acids, and predicted disorder, solvent accessibility, and B-factors. Empirical evaluation shows that MoRFpred statistically significantly outperforms existing predictors, alpha-MoRF-Pred and ANCHOR, by 0.07 in AUC and 10% in success rate. We show that our predicted (new) MoRF regions have non-random sequence similarity with native MoRFs. We use this observation, along with the fact that predictions with higher probability are more accurate, to identify putative MoRF regions. We present case studies to analyze these putative MoRFs. We also identify a few sequence-derived hallmarks of MoRFs. They are characterized by dips in the disorder predictions and higher hydrophobicity and stability when compared to adjacent (in the chain) residues. Availability: http://biomine.ece.ualberta.ca/MoRFpred/ Supplementary information: http://biomine.ece.ualberta.ca/MoRFpred/Supplement.pdf Contact: lkurgan@ece.ualberta.ca
TOP

Extending ontologies by finding siblings using set expansion techniques

Author(s):
Götz Fabian, Technische Universität Dresden, Germany
Thomas Wächter, Technische Universität Dresden
Michael Schroeder, Technische Universität Dresden

Session Chair: Michal Linial
Abstract Show/Hide

Motivation: Ontologies are an everyday tool in biomedicine to capture and represent knowledge. However, many ontologies lack a high degree of coverage in their domain and need to improve their overall quality and maturity. Automatically extending sets of existing terms will enable ontology engineers to systematically improve text-based ontologies level by level. Results: We developed an approach to extend ontologies by discovering new terms which are in a sibling relationship to existing terms of an ontology. For this purpose, we combined two approaches which retrieve new terms from the web. The first approach extracts siblings by exploiting the structure of HTML documents, whereas the second approach uses text mining techniques to extract siblings from unstructured text. Our evaluation against MeSH shows that our method for sibling discovery is able to suggest first-class ontology terms and can be used as an initial step towards assessing the completeness of ontologies. The evaluation yields a recall of 80% at a precision of 61%, where the two independent approaches complement each other. For MeSH in particular, we show that it can be considered complete in its medical focus area. We integrated the work into DOG4DAG, an ontology generation plugin for the editors OBO-Edit and Protégé, making it the first plugin that supports sibling discovery on-the-fly. Availability: Sibling discovery for ontologies is available as part of DOG4DAG (www.biotec.tu-dresden.de/research/schroeder/dog4dag) for both Protégé 4.1 and OBO-Edit 2.1.
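The structural cue used by the first approach, that items of the same HTML list tend to be siblings, can be sketched with a toy extractor. The regex-based parsing and the sample markup below are simplifications for illustration; production extraction should use a real HTML parser:

```python
import re

def html_sibling_groups(html):
    """Return the item texts of each <ul> list with more than one item,
    as candidate sibling sets."""
    groups = []
    for ul in re.findall(r"<ul>(.*?)</ul>", html, re.S):
        items = [i.strip() for i in re.findall(r"<li>(.*?)</li>", ul, re.S)]
        if len(items) > 1:          # a single item gives no sibling signal
            groups.append(items)
    return groups
```

A list such as `<ul><li>asthma</li><li>bronchitis</li></ul>` yields the candidate sibling pair asthma/bronchitis, which could then be ranked against the existing ontology terms.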
TOP

Recognition Models to Predict DNA-binding Specificities of Homeodomain Proteins

Author(s):
Ryan Christensen, Washington University School of Medicine, United States
Metewo Selase Enuameh, University of Massachusetts Medical School
Marcus B. Noyes, University of Massachusetts Medical School
Michael H. Brodsky, University of Massachusetts Medical School
Scot A. Wolfe, University of Massachusetts Medical School
Gary D. Stormo, Washington University School of Medicine

Session Chair: Nir Ben-Tal
Abstract Show/Hide

Recognition models for protein-DNA interactions, which allow the prediction of specificity for a DNA-binding domain based only on its sequence or the alteration of specificity through rational design, have long been a goal of computational biology. There has been some progress in constructing useful models, especially for C2H2 zinc finger proteins, but it remains a challenging problem with ample room for improvement. For most families of transcription factors the best available methods utilize k-nearest neighbor algorithms to make specificity predictions based on the average of the specificities of the k most similar proteins with defined specificities. Homeodomain proteins are the second most abundant family of transcription factors, after zinc fingers, in most metazoan genomes, and as a consequence an effective recognition model for this family would facilitate predictive models of many transcriptional regulatory networks within these genomes. Using extensive experimental data, we have tested several machine learning approaches and find that both support vector machines and random forests can produce recognition models for homeodomain proteins that are significant improvements over k-nearest neighbor based methods. Cross-validation analyses show that the resulting models are capable of predicting specificities with high accuracy. We have produced a web-based prediction tool, PreMoTF (Predicted Motifs for Transcription Factors) (http://stormo.wustl.edu/PreMoTF), for predicting PFMs from protein sequence using a random forest based model.
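The k-nearest-neighbor baseline described above is simple to state: predict a binding specificity by averaging the position weight matrices of the k proteins most similar to the query. The sequences and one-column PWMs below are hypothetical toy data, and PreMoTF replaces this averaging with a random forest; the sketch only shows the baseline being improved upon:

```python
def seq_identity(a, b):
    """Fraction of identical positions between equal-length sequences."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def knn_pwm(query, examples, k=2):
    """Average the PWMs (lists of {base: prob} columns) of the k proteins
    most similar to the query sequence."""
    ranked = sorted(examples, key=lambda e: seq_identity(query, e[0]),
                    reverse=True)
    top = [pwm for _, pwm in ranked[:k]]
    return [{b: sum(p[c][b] for p in top) / k for b in "ACGT"}
            for c in range(len(top[0]))]
```

The weakness of this baseline is visible even in the sketch: the prediction can only interpolate between known specificities, whereas a learned recognition model can generalize from residue-level features.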
TOP

DELISHUS: An Efficient and Exact Algorithm for Genome-Wide Detection of Deletion Polymorphism in Autism

Author(s):
Derek Aguiar, Brown University, United States
Bjarni Halldorsson, Reykjavik University
Eric Morrow, Brown University
Sorin Istrail, Brown University

Session Chair: Terry Gaasterland
Abstract Show/Hide

The understanding of the genetic determinants of complex disease is undergoing a paradigm shift. Genetic heterogeneity of rare mutations with deleterious effects is more commonly being viewed as a major component of disease. Autism is an excellent example where research is active in identifying matches between the phenotypic and genomic heterogeneities. A substantial portion of autism appears to be correlated with copy number variation which is not directly probed by single nucleotide polymorphism (SNP) array technologies. Identifying the genetic heterogeneity of small deletions remains a major unresolved computational problem due, in part, to the inability of algorithms to detect them. In this paper we present an algorithmic framework, which we term DELISHUS, that implements three highly efficient algorithms for inferring genomic deletions of all sizes and frequencies in SNP array data. We implement a polynomial-time backtracking algorithm -- that finishes on a 1 billion entry genome-wide association study (GWAS) SNP matrix in a few minutes -- to compute all potential deletions in a dataset. Given a set of called deletions, we also give a polynomial time algorithm for detecting regions that contain multiple recurrent deletions. Finally, we give an algorithm for detecting de novo deletions. Because our algorithms consider all individuals in the sample at once, they achieve significantly lower false positive rates and higher power when compared to previously published single individual algorithms. Our method may be used to identify the deletion spectrum for GWAS where deletion polymorphism was previously not analyzed. DELISHUS is available at http://www.brown.edu/Research/Istrail_Lab/
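As a deliberately simplified illustration of the raw signal involved (not the DELISHUS algorithms themselves, which gain power by considering inheritance patterns across all individuals at once), a deletion can surface in SNP array data as a run of null calls in a single individual:

```python
def candidate_deletion_runs(genotypes, min_len=3):
    """Return (individual, start, end) for every maximal run of null
    calls ('0') of length >= min_len; end is exclusive. `genotypes` is
    one call string per individual."""
    runs = []
    for ind, row in enumerate(genotypes):
        start = None
        for pos, call in enumerate(row + "1"):    # sentinel closes a final run
            if call == "0" and start is None:
                start = pos
            elif call != "0" and start is not None:
                if pos - start >= min_len:
                    runs.append((ind, start, pos))
                start = None
    return runs
```

Scanning individuals independently like this is exactly what inflates false positive rates; considering the whole sample jointly, as DELISHUS does, lets recurrent and de novo deletions be distinguished from noise.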
TOP

Ranking of Multidimensional Drug Profiling Data by Fractional Adjusted Bi-Partitional Scores

Author(s):
Dorit Hochbaum, University of California at Berkeley, United States
Chun-Nan Hsu, University of Southern California, Marina del Rey
Yan T. Yang, University of California at Berkeley

Session Chair: Michal Linial
Abstract Show/Hide

Motivation: The recent development of high-throughput drug profiling (high content screening or HCS) provides a large amount of quantitative multidimensional data. Despite its potential, it poses several challenges for academia and industry analysts alike. This is especially true for ranking the effectiveness of several drugs from many thousands of images directly. This paper introduces, for the first time, a new framework for automatically ordering the performance of drugs, called fractional adjusted bi-partitional score (FABS). This general strategy takes advantage of graph-based formulations and solutions and avoids many shortfalls of traditionally used methods in practice. We experimented with the FABS framework by implementing it with a specific algorithm, a variant of normalized cut – normalized cut prime (FABS-NC′), producing a ranking of drugs. This algorithm is known to run in polynomial time and therefore can scale well in high-throughput applications.

Results: We compare the performance of FABS-NC′ to other methods that could be used for drug ranking. We devise two variants of the FABS algorithm: FABS-SVM, which utilizes a support vector machine (SVM) as a black box, and FABS-Spectral, which utilizes the eigenvector technique (spectral) as a black box. We also compare the performance of FABS-NC′ to three other methods that have been previously considered: center ranking (Center), PCA ranking (PCA), and the graph transition energy method (GTEM). The conclusion is encouraging: FABS-NC′ consistently outperforms all five alternatives. FABS-SVM has the second best performance among these six methods, but is far behind FABS-NC′: in some cases FABS-NC′ produces over 50% more correctly predicted ranking experiment trials than FABS-SVM.

Availability: The system and data for the evaluation reported here will be made available upon request to the authors after this manuscript is accepted for publication.

TOP

TMBMODEL: Toward 3D modeling of transmembrane beta barrel proteins based on z-coordinate and topology prediction

Author(s):
Sikander Hayat, Stockholm University, Sweden
Arne Elofsson, Stockholm University

Session Chair: Nir Ben-Tal
Abstract Show/Hide

Motivation: Two types of transmembrane proteins exist: alpha-helical membrane proteins and transmembrane beta-barrels. The latter type exists in the outer membrane of gram-negative bacteria and in chloroplasts and mitochondria, where they play a major role in the translocation machinery. Here, we aim to build three-dimensional models for transmembrane beta-barrels based on a large set of predicted topologies used to generate alternative three-dimensional models. Thereafter, the predicted Z-coordinate, i.e. the distance of a residue from the membrane center, is used to identify the best model.

Results: We present TMBMODEL, a method for generating three-dimensional models based on predicted topologies. TMBMODEL employs theoretical principles derived from known structures to construct a model of the barrel for a given transmembrane beta-barrel sequence. Firstly, different topologies are obtained by running the BOCTOPUS topology predictor, and then three-dimensional models are constructed for different shear numbers. The best model is then selected based on a novel Z-coordinate predictor. Based on a leave-one-out cross-validation, the Z-coordinate predictor predicts 74% of residues within 2 Å on a non-redundant dataset of 36 transmembrane beta-barrels. The average error and the fraction of correctly identified membrane residues are 1.61 Å and 71%, respectively. TMBMODEL chose the correct topology for 75% of the proteins in the data set, which is a slight improvement over BOCTOPUS. More importantly, TMBMODEL provides a C-alpha template for more detailed structural analysis. The average RMSD for this template is 7.24 Å.

Availability: TMBMODEL is freely available as a web-server at: http://tmbmodel.cbr.su.se/. The data sets used for training and evaluations are also available from this site.

Contact: arne@bioinfo.se

Abbreviations: TMB, transmembrane beta-barrel protein; HMM, Hidden
Markov Model; SVM, support vector machine.

TOP

SEQuel: Improving the Accuracy of Genome Assemblies

Author(s):
Roy Ronen, University of California, San Diego, United States
Christina Boucher, University of California, San Diego
Hamidreza Chitsaz, Wayne State University
Pavel Pevzner, University of California, San Diego

Session Chair: Terry Gaasterland
Abstract Show/Hide

Motivation: Assemblies of next generation sequencing data, while accurate, still contain a substantial number of errors that need to be corrected after the assembly process. We develop SEQuel, a tool that corrects errors (i.e., insertions, deletions, and substitution errors) in the assembled contigs. Fundamental to the algorithm behind SEQuel is the positional de Bruijn graph, a graph structure that models k-mers within reads while incorporating the approximate positions of reads into the model. Results: SEQuel reduced the number of small insertions and deletions in the assemblies of standard multi-cell E. coli data by almost half, and corrected between 30% and 94% of the substitution errors. Further, we show SEQuel is imperative to improving single-cell assembly, which is inherently more challenging due to higher error rates and non-uniform coverage; over half of the small indels and substitution errors in the single-cell assemblies were corrected. We apply SEQuel to the recently-assembled Deltaproteobacterium SAR324 genome, which is the first bacterial genome with a comprehensive single-cell genome assembly, and make over 800 changes (insertions, deletions and substitutions) to refine this assembly. Availability: SEQuel can be used as a post-processing step in combination with any NGS assembler and is freely available at http://bix.ucsd.edu/SEQuel/.
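The positional de Bruijn graph can be sketched minimally: nodes are (k-mer, approximate position) pairs instead of bare k-mers, so identical k-mers from distant loci stay separate while overlapping reads still merge. This is a toy construction to show the idea, not SEQuel's implementation:

```python
from collections import defaultdict

def positional_debruijn(reads, k):
    """Edge multiplicities between consecutive (k-mer, position) nodes;
    `reads` is a list of (sequence, approximate_start) pairs."""
    edges = defaultdict(int)
    for seq, start in reads:
        for i in range(len(seq) - k):
            u = (seq[i:i + k], start + i)          # node = k-mer + position
            v = (seq[i + 1:i + k + 1], start + i + 1)
            edges[(u, v)] += 1
    return edges
```

For instance, reads ("ACGT", 0) and ("CGTA", 1) share the node ("CGT", 1), so their paths merge exactly where the reads overlap, while a "CGT" mapped elsewhere would become a distinct node.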
TOP

DACTAL: divide-and-conquer trees (almost) without alignments

Author(s):
Serita Nelesen, Calvin College, United States
Kevin Liu, Rice University
Li-San Wang, University of Pennsylvania
C. Randal Linder, University of Texas at Austin
Tandy Warnow, University of Texas at Austin

Session Chair: Michal Linial
Abstract Show/Hide

We present DACTAL, a method for phylogeny estimation that produces trees from unaligned sequence datasets without ever needing to estimate an alignment on the entire dataset. DACTAL combines iteration with a novel divide-and-conquer approach, so that each iteration begins with a tree produced in the prior iteration, decomposes the taxon set into overlapping subsets, estimates trees on each subset, and then combines the smaller trees into a tree on the full taxon set using a new supertree method. We prove that DACTAL is guaranteed to produce the true tree under certain conditions. We compare DACTAL to SATé and maximum likelihood trees on estimated alignments using simulated and real datasets with 1,000 to 27,643 taxa. Our studies show that DACTAL dramatically outperforms two-phase methods with respect to tree accuracy. The comparison to SATé shows that both have the same tree accuracy, but that DACTAL achieves this accuracy in a fraction of the time. Furthermore, DACTAL can analyze larger datasets than SATé, including a dataset with almost 28,000 sequences.

DACTAL source code is available at www.cs.utexas.edu/users/phylo/software/dactal

TOP

Minimum Message Length Inference of Secondary Structure from Protein Coordinate Data.

Author(s):
Arun Konagurthu, Monash University, Australia
Arthur Lesk, Pennsylvania State University
Lloyd Allison, Monash University

Session Chair: Nir Ben-Tal
Abstract Show/Hide

Motivation: Secondary structure underpins the folding pattern and architecture of most proteins. Accurate assignment of the secondary structure elements is therefore an important problem. Although many approximate solutions of the secondary structure assignment problem exist, the statement of the problem has resisted a consistent and mathematically rigorous definition. A variety of comparative studies have highlighted major disagreements in the way the available methods define and assign secondary structure to coordinate data. Results: We report a new method to infer secondary structure based on the Bayesian method of Minimum Message Length (MML) inference. It treats assignments of secondary structure as hypotheses that explain the given coordinate data. The method seeks to maximise the joint probability of a hypothesis and the data. There is a natural null hypothesis and any assignment that cannot better it is unacceptable. We developed a program SST based on this approach and compared it to popular programs such as DSSP and STRIDE amongst others. Our evaluation suggests that SST gives reliable assignments even on low resolution structures.
TOP

Weighted Pooling - Practical and Cost Effective Techniques for Pooled High Throughput Sequencing

Author(s):
David Golan, Tel Aviv University, Israel
Saharon Rosset, Tel Aviv University
Yaniv Erlich, Whitehead Institute

Session Chair: Terry Gaasterland
Abstract Show/Hide

Motivation: Despite the rapid decline in sequencing costs, sequencing large cohorts of individuals is still prohibitively expensive. Recently, several sophisticated pooling designs were suggested that can identify carriers of rare alleles in large cohorts with a significantly smaller number of pools, thus dramatically reducing the cost of such large scale sequencing projects (Erlich et al. 2009). These approaches use combinatorial pooling designs where each individual is either present or absent from a pool. One can then infer the number of carriers in a pool, and by combining information across pools, reconstruct the identity of the carriers. Results: We show that one can gain further efficiency and cost reduction by using "weighted" designs, in which different individuals donate different amounts of DNA to the pools. Intuitively, in this situation the number of mutant reads in a pool does not only indicate the number of carriers, but also their identity. We describe and study a powerful example of such weighted designs, using non-overlapping pools. We demonstrate that this approach is not only easier to implement and analyze but is also competitive in terms of accuracy with combinatorial designs when identifying rare variants, and is superior when sequencing common variants. We then discuss how weighting can be incorporated into existing combinatorial designs to increase their accuracy and demonstrate the resulting improvement using simulations. Finally, we argue that weighted designs have enough power to facilitate detection of common alleles, so they can be used as a cornerstone of whole-exome sequencing projects.
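The identity-carrying effect of weighting can be pushed to an extreme in a noise-free toy: if individual i donates 2**i units of DNA to a single pool, the total mutant signal encodes the carrier set in binary. Real weighted designs must tolerate sequencing and sampling noise, so this is an illustration of the intuition only, not a practical design:

```python
def pool_signal(carriers):
    """Mutant DNA units observed when carrier i donates 2**i units."""
    return sum(2 ** i for i in carriers)

def decode_carriers(mutant_units, n_individuals):
    """Recover the carrier set from the (noise-free) mutant signal by
    reading off the bits of the total."""
    return [i for i in range(n_individuals) if (mutant_units >> i) & 1]
```

In this extreme, one pool distinguishes every possible carrier subset; practical weighted designs trade some of that identifiability for robustness to noise.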
TOP

Statistical model-based testing to evaluate the recurrence of genomic aberrations

Author(s):
Atsushi Niida, University of Tokyo, Japan
Seiya Imoto, University of Tokyo
Teppei Shimamura, University of Tokyo
Satoru Miyano, University of Tokyo

Session Chair: Serafim Batzoglou
Abstract Show/Hide

Motivation: In cancer genomes, chromosomal regions harboring cancer genes are often subjected to genomic aberrations like copy number alteration and loss of heterozygosity (LOH). Given this, finding recurrent genomic aberrations is considered an apt approach for screening cancer genes. Although several permutation-based tests have been proposed for this purpose, none of them are designed to find recurrent aberrations from genomic data sets without paired normal sample controls. Their application to unpaired genomic data may lead to false discoveries, because they retrieve pseudo-aberrations that exist in normal genomes as polymorphisms. Results: We develop a new parametric method named the parametric aberration recurrence test (PART) to test for the recurrence of genomic aberrations. The introduction of Poisson-binomial statistics allows us to compute small p-values more efficiently and precisely than the previously proposed permutation-based approach. Moreover, we extended PART to cover unpaired data (PART-up) so that there is a statistical basis for analyzing unpaired genomic data. PART-up utilizes information from unpaired normal sample controls to remove pseudo-aberrations in unpaired genomic data. Using PART-up, we successfully predict recurrent genomic aberrations in cancer cell line samples whose paired normal sample controls are unavailable. This paper thus proposes a powerful statistical framework for the identification of driver aberrations, which would be applicable to the ever-increasing amounts of cancer genomic data seen in the era of next generation sequencing.
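The Poisson-binomial distribution (the number of successes among independent trials with unequal success probabilities) admits an exact O(n²) dynamic-programming evaluation, which is what makes a parametric recurrence test tractable where permutation would be slow. A generic sketch of that computation, not the authors' implementation:

```python
def poisson_binomial_pmf(probs):
    """pmf[k] = P(exactly k successes) for independent trials with the
    given success probabilities, built up one trial at a time."""
    pmf = [1.0]
    for p in probs:
        pmf = [(pmf[k] * (1 - p) if k < len(pmf) else 0.0) +
               (pmf[k - 1] * p if k > 0 else 0.0)
               for k in range(len(pmf) + 1)]
    return pmf

def upper_tail_p(probs, k_obs):
    """P(at least k_obs successes): an exact recurrence p-value."""
    return sum(poisson_binomial_pmf(probs)[k_obs:])
```

Here each trial would correspond to one sample carrying an aberration at a locus, with its own background probability; the upper tail gives the probability of seeing the observed recurrence by chance.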
TOP

Data-Driven Integration Of Epidemiological And Toxicological Data To Select Candidate Interacting Genes And Environmental Factors In Association With Disease

Author(s):
Chirag Patel, Stanford University, United States
Rong Chen, Stanford University
Atul Butte, Stanford University

Session Chair: Serafim Batzoglou
Abstract Show/Hide

Complex diseases, such as Type 2 Diabetes Mellitus (T2D), result from the interplay of both environmental and genetic factors. However, most studies investigate either the genetics or the environment in the context of disease, and few study their possible interaction. One key challenge in documenting interactions between genes and environment is choosing which of each to test jointly. Here, we attempt to address this challenge through a data-driven integration of epidemiological and toxicological studies. Specifically, we derive lists of candidate interacting genetic and environmental factors by integrating findings from genome-wide and environment-wide association studies (GWAS and EWAS). Next, we search for evidence of toxicological relationships between these genetic and environmental factors that may have an etiological role in the disease. We illustrate our method by selecting candidate interacting factors for Type 2 Diabetes.
TOP

Identifying Disease Sensitive and Quantitative Trait Relevant Biomarkers from Heterogeneous Imaging Genetics Data via Sparse Multi-Modal Multi-Task Learning

Author(s):
Hua Wang, University of Texas at Arlington, United States
Feiping Nie, University of Texas at Arlington
Heng Huang, University of Texas at Arlington
Shannon Leigh Risacher, Indiana University
Andrew Saykin, Indiana University School of Medicine
Li Shen, Indiana University School of Medicine

Session Chair: Terry Gaasterland
Abstract Show/Hide

Motivation: Recent advances in brain imaging and high-throughput genotyping techniques enable new approaches to study the influence of genetic and anatomical variations on brain functions and disorders. Traditional association studies typically perform independent and pairwise analysis among neuroimaging measures, cognitive scores, and disease status, and ignore the important underlying interacting relationships between these units. Results: To overcome this limitation, in this paper, we propose a new sparse multi-modal multi-task learning method to reveal complex relationships from gene to brain to symptom. Our main contributions are three-fold: 1) utilizing a joint classification and regression learning model to identify disease-sensitive and cognition-relevant biomarkers; 2) introducing combined structured sparsity regularizations into multi-modal multi-task learning to integrate heterogeneous imaging genetics data and identify multi-modal biomarkers; 3) deriving a new efficient optimization algorithm to solve our non-smooth objective function and providing rigorous theoretical analysis of the global optimum convergence. Using the imaging genetics data from the Alzheimer's Disease Neuroimaging Initiative database, the effectiveness of the proposed method is demonstrated by clearly improved performance on predicting both cognitive scores and disease status. The identified multi-modal biomarkers could predict not only disease status but also cognitive function to help elucidate the biological pathway from gene to brain structure and function, and to cognition and disease.
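One standard ingredient of such structured-sparsity formulations is the l2,1 norm, whose row-wise proximal operator zeroes out entire feature rows at once; this is how a biomarker gets selected or discarded jointly across all tasks. A generic sketch of that operator only; the paper's combined regularizer and optimization algorithm are more involved:

```python
import math

def l21_prox(rows, lam):
    """Proximal operator of lam * ||W||_{2,1}: group soft-thresholding
    that shrinks each row of W toward zero by lam in Euclidean norm,
    setting rows with norm <= lam exactly to zero."""
    out = []
    for row in rows:
        norm = math.sqrt(sum(x * x for x in row))
        scale = max(0.0, 1.0 - lam / norm) if norm > 0 else 0.0
        out.append([scale * x for x in row])
    return out
```

A row with small weights across all tasks is eliminated entirely, which is the behavior that yields shared, multi-modal biomarker selection.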
TOP

Efficient Algorithms for the Reconciliation Problem with Gene Duplication, Horizontal Transfer, and Loss

Author(s):
Mukul S. Bansal, Massachusetts Institute of Technology, United States
Eric J. Alm, Massachusetts Institute of Technology
Manolis Kellis, Massachusetts Institute of Technology

Session Chair: Jaques Reifman
Abstract Show/Hide

Motivation: Gene family evolution is driven by evolutionary events like speciation, gene duplication, horizontal gene transfer, and gene loss, and inferring these events in the evolutionary history of a given gene family is a fundamental problem in comparative and evolutionary genomics with numerous important applications. Solving this problem requires the use of a reconciliation framework, where the input consists of a gene family phylogeny and the corresponding species phylogeny, and the goal is to reconcile the two by postulating speciation, gene duplication, horizontal gene transfer, and gene loss events. This reconciliation problem is referred to as Duplication-Transfer-Loss (DTL) reconciliation and has been extensively studied in the literature. Yet, even the fastest existing algorithms for DTL-reconciliation are too slow for reconciling large gene families and for use in more sophisticated applications such as gene tree or species tree reconstruction. Results: We present two new algorithms for the DTL-reconciliation problem that are dramatically faster than existing algorithms, both asymptotically and in practice. We also extend the standard DTL-reconciliation model by considering distance-dependent transfer costs, that allow for more accurate reconciliation, and give an efficient algorithm for DTL-reconciliation under this extended model. We implemented our new algorithms and demonstrate up to 100,000-fold speed-up over existing methods, using both simulated and biological datasets. This dramatic improvement makes it possible to use DTL-reconciliation for performing rigorous evolutionary analyses of large gene families, and enables its use in advanced reconciliation-based gene and species tree reconstruction methods.
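In the duplication-loss special case (no transfers), reconciliation reduces to the classic LCA mapping, the same bottom-up traversal structure that fast DTL algorithms build on. A minimal sketch over hypothetical toy trees, with the species tree given as parent and depth maps:

```python
def lca(a, b, parent, depth):
    """Lowest common ancestor in a tree given parent and depth maps."""
    while a != b:
        if depth[a] < depth[b]:
            b = parent[b]
        else:
            a = parent[a]
    return a

def count_duplications(gene_tree, species_of, parent, depth):
    """Map each gene-tree node (nested 2-tuples; leaves are gene names)
    to the LCA of its children's mappings; a node is a duplication when
    it maps to the same species node as one of its children."""
    dups = 0
    def mapping(node):
        nonlocal dups
        if isinstance(node, str):
            return species_of[node]
        left, right = mapping(node[0]), mapping(node[1])
        m = lca(left, right, parent, depth)
        if m in (left, right):
            dups += 1
        return m
    mapping(gene_tree)
    return dups
```

With two species X and Y under root R, a gene tree containing two (X, Y) cherries maps both cherries and the root to R, so the root is inferred as a single duplication event.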
TOP