Evolution and Comparative Genomics

COSI Track Presentations

Schedule subject to change
Tuesday, July 23rd
10:20 AM-10:40 AM
Proceedings Presentation: Inference of clonal selection in cancer populations using single-cell sequencing data
  • Pavel Skums, Georgia State University, United States
  • Viachaslau Tsyvina, Georgia State University, United States
  • Alex Zelikovsky, Georgia State University, United States

Intra-tumor heterogeneity is one of the major factors influencing cancer progression and treatment outcome. However, the evolutionary dynamics of cancer clone populations remain poorly understood. Quantifying clonal selection and inferring the fitness landscapes of tumors are key steps toward understanding the evolutionary mechanisms driving cancer. These problems can be addressed using single-cell sequencing, which provides unprecedented insight into intra-tumor heterogeneity and makes it possible to study and quantify the selective advantages of individual clones. Here we present SCIFIL, a computational tool for inferring fitness landscapes of heterogeneous cancer clone populations from single-cell sequencing data. SCIFIL estimates maximum likelihood fitnesses of clone variants and measures their selective advantages and order of appearance by fitting an evolutionary model to the tumor phylogeny. We demonstrate the accuracy of our approach and show how it can be applied to experimental tumor data to study clonal selection and infer evolutionary history. SCIFIL can be used to provide new insight into the evolutionary dynamics of cancer. Its source code is available at https://github.com/compbel/SCIFIL

10:40 AM-10:50 AM
Spatial structure governs the mode of tumour evolution
  • Robert Noble, ETH Zurich, Switzerland
  • Dominik Burri, University of Basel, Switzerland
  • Jakob Kalther, University Hospital RWTH Aachen, Germany
  • Niko Beerenwinkel, ETH Zurich, Switzerland

Characterizing the mode – the way, manner, or pattern – of evolution in tumours is important for clinical forecasting and optimizing cancer treatment. DNA sequencing studies have inferred various modes, including branching, punctuated and neutral evolution, but it is unclear why a particular pattern predominates in any given tumour. Here we propose that differences in tumour architecture alone can explain the variety of observed patterns. We examine this hypothesis using spatially explicit population genetic models and demonstrate that, within biologically relevant parameter ranges, human tumours are expected to exhibit four distinct onco-evolutionary modes (oncoevotypes): rapid clonal expansion (predicted in leukaemia); progressive diversification (in colorectal adenomas and early-stage colorectal carcinomas); branching evolution (in invasive glandular tumours); and effectively almost neutral evolution (in certain non-glandular and poorly differentiated solid tumours). We thus provide a simple, mechanistic explanation for a wide range of empirical observations. Oncoevotypes are governed by modes of cell dispersal and cell-cell interactions, which we show are essential factors in accurately characterizing, forecasting and controlling tumour evolution.

10:50 AM-11:00 AM
Growth patterns and driver effects from individual samples provide insights in tumor evolution
  • Leonidas Salichos, Yale University, United States
  • William Meyerson, Yale University, United States
  • Jonathan Warrell, Yale University, United States
  • Mark Gerstein, Yale University, United States

Evolving tumors accumulate thousands of mutations. The explosion of data and whole-genome sequencing have led to many methods for detecting cancer drivers, which, however, underperform when recurrence is low. Our approach harnesses the variant allele frequency (VAF) of mutations in the population of tumor cells in an ultra-deep sequenced single biopsy. We have developed a method that quantifies tumor growth and driver effects for individual samples based solely on the VAF spectrum. Drivers introduce a perturbation into this spectrum, and our method measures that perturbation. To validate our method, we used simulation models to successfully approximate the timing and size of a driver’s effect. Then, we tested our method on 993 linear tumors from the PCAWG Consortium and found that the identified periods of positive growth are associated with known drivers. Finally, we applied our method to an ultra-deep sequenced AML tumor and identified known cancer genes and additional driver candidates. In general, our results shed light on the dynamics of tumor progression, indicating that multicellular processes are significantly affected. Moreover, different mutation types appear to have adverse effects on tumor growth. Our method presents opportunities for personalized diagnosis via modeling of tumor progression using deep sequenced whole-genome data from an individual.
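
As a small illustration of the quantity this abstract builds on, the sketch below computes variant allele frequencies from read counts and bins them into a spectrum. The function names, the number of bins, and the toy read counts are assumptions made for illustration only; this is not the authors' method for measuring driver perturbations.

```python
from collections import Counter


def vaf(alt_reads, total_reads):
    """Variant allele frequency: fraction of reads supporting the variant."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return alt_reads / total_reads


def vaf_spectrum(read_counts, n_bins=10):
    """Histogram of VAFs over n_bins equal-width bins covering [0, 1]."""
    counts = Counter()
    for alt, total in read_counts:
        f = vaf(alt, total)
        # A VAF of exactly 1.0 is placed in the last bin.
        counts[min(int(f * n_bins), n_bins - 1)] += 1
    return [counts[b] for b in range(n_bins)]


# Hypothetical (alt_reads, total_reads) pairs for five mutations.
toy_counts = [(50, 100), (25, 100), (5, 100), (100, 100), (12, 100)]
spectrum = vaf_spectrum(toy_counts)
```

A real analysis would also need to account for tumor purity, copy number, and sequencing error, all of which this sketch ignores.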

11:00 AM-11:20 AM
Proceedings Presentation: Large-Scale Mammalian Genome Rearrangements Coincide with Chromatin Interactions
  • Krister Swenson, CNRS, Université de Montpellier, France
  • Mathieu Blanchette, McGill University, Canada

Motivation: Genome rearrangements drastically change gene order along great stretches of a chromosome. There has been initial evidence that these apparently non-local events in the 1D sense may have breakpoints that are close in the 3D sense. We harness the power of the Double Cut and Join model of genome rearrangement, along with Hi-C chromosome capture data, to test this hypothesis between human and mouse.
Results: We devise novel statistical tests which show that, indeed, rearrangement scenarios that transform the human into the mouse gene order are enriched for pairs of breakpoints that have frequent chromosome interactions. This is observed for both intra-chromosomal and inter-chromosomal breakpoint pairs. For intra-chromosomal rearrangements, the enrichment exists for close (<20Mbs) and far (100Mbs) pairs. Further, the pattern exists across multiple cell lines, from multiple laboratories, in different states of the cell cycle. We show that similarities in the contact frequencies between these many experiments contribute to the enrichment. We conclude that either 1) rearrangements usually involve breakpoints that are spatially close, or 2) there is selection against rearrangements that act on spatially distant breakpoints.

11:20 AM-11:30 AM
Distinguishing successive ancient polyploidy levels based on genome-internal syntenic alignments
  • Yue Zhang, University of Ottawa, Canada
  • Chunfang Zheng, University of Ottawa, Canada
  • David Sankoff, University of Ottawa, Canada

A basic tool for studying the polyploidization history of a genome, especially in plants, is the distribution of duplicate gene similarities in syntenically aligned regions of a genome. Often there are two or more peaks, each representing a different polyploidization event. These distributions may be generated by means of a discrete-time, non-homogeneous branching process, followed by a standard sequence divergence model. While the similarity data allow for inference of fractionation rates and other parameters, they usually cannot pin down the ploidy level of each event. For a sequence of two events of unknown ploidy, either tetraploid or hexaploid, we base our analysis on high-similarity triples of genes (triangles). We calculate the probability of the four triangle types with origins in one or the other event, and impose a mutational model so that the distribution resembles the original data. Using a maximum likelihood (ML) transition point in the similarities between the two events as a discriminator for the hypothesized origin of each similarity, we calculate the predicted number of triangles of each hypothesized type for each model combining hexaploidization and/or tetraploidization. This yields a profile of triangle types for each model, which can then be used to assess real genomic data.

11:30 AM-11:40 AM
Accounting for calibration uncertainty: Bayesian molecular dating as a “doubly intractable” problem
  • Stephane Guindon, CNRS, France

This study introduces a new Bayesian technique for molecular dating that explicitly accommodates uncertainty in the phylogenetic position of calibrated nodes derived from the analysis of fossil data. The proposed approach thus defines an adequate framework for incorporating expert knowledge and/or prior information about the way fossils were collected in the inference of node ages. Although it belongs to the class of “node-dating” approaches, this method shares interesting properties with “tip-dating” techniques. Yet, it alleviates some of the computational and modeling difficulties that hamper tip-dating approaches. The influence of fossil data on the probabilistic distribution of trees is the crux of the matter considered here. More specifically, among all the phylogenies that a tree model (e.g., the birth–death process) generates, only a fraction “agree” with the fossil data. Bayesian inference under the new model requires taking this fraction into account. However, evaluating this quantity is difficult in practice. A generic solution to this issue is presented here. The proposed approach relies on a recent statistical technique, the so-called exchange algorithm, dedicated to drawing samples from “doubly intractable” distributions.

11:40 AM-12:00 PM
Proceedings Presentation: Summarizing the Solution Space in Tumor Phylogeny Inference by Multiple Consensus Trees
  • Nuraini Aguse, University of Illinois at Urbana-Champaign, United States
  • Yuanyuan Qi, University of Illinois at Urbana-Champaign, United States
  • Mohammed El-Kebir, University of Illinois at Urbana-Champaign, United States

Cancer phylogenies are key to studying tumorigenesis and have clinical implications. Due to the heterogeneous nature of cancer and limitations in current sequencing technology, current cancer phylogeny inference methods identify a large solution space of plausible phylogenies. To facilitate further downstream analyses, methods that accurately summarize such a set T of cancer phylogenies are imperative. However, current summary methods are limited to a single consensus tree or graph and may miss important topological features that are present in different subsets of candidate trees.
We introduce the MULTIPLE CONSENSUS TREE (MCT) problem to simultaneously cluster T and infer a consensus tree for each cluster. We show that MCT is NP-hard, and present an exact algorithm based on mixed integer linear programming (MILP). In addition, we introduce a heuristic algorithm that efficiently identifies high-quality consensus trees, recovering all optimal solutions identified by the MILP on simulated data in a fraction of the time. We demonstrate the applicability of our methods on both simulated and real data, showing that our approach selects the number of clusters depending on the complexity of the solution space T.

12:00 PM-12:10 PM
Pairtree: fast cancer phylogeny reconstruction using multiple samples
  • Jeff Wintersinger, University of Toronto, Dept. of Computer Science, Canada
  • Stephanie Dobson, Princess Margaret Cancer Centre, University Health Network, Toronto, Ontario, Canada
  • John Dick, Princess Margaret Cancer Centre, University Health Network, Toronto, Ontario, Canada
  • Quaid Morris, University of Toronto, Canada

Tumours are not homogeneous masses, but are instead composed of multiple genetically distinct subpopulations of cells. These genetic differences can affect treatment response. Using genomic sequencing data taken from mixtures of these subpopulations, we can infer which mutations each subpopulation possesses, and the evolutionary relationships between subpopulations.

Here we present Pairtree, a novel algorithm for profiling cancerous subpopulations in a patient's tumour. Pairtree can exploit multiple tissue samples taken from a patient, either from different spatial points in the tumour or at different temporal points through treatment. We can, for instance, characterize which evolutionary lineage gave rise to a metastasis or disease relapse, revealing how each subpopulation responded to treatment. This in turn can inform treatment targeted at this lineage.

Each additional tissue sample from a patient improves Pairtree's ability to resolve subclonal populations, and to characterize the evolutionary relationships between these populations. However, each additional sample also increases the complexity of the computational problem. Pairtree explicitly models relationships between mutations, allowing its accuracy and resolution to improve with each additional sample. Alternative algorithms, by contrast, cannot deal with this computational complexity, and so exhibit progressively worse accuracy and resolution as the data become richer.

12:10 PM-12:20 PM
Mapping global and local coevolution across 600 species to identify novel homologous recombination repair genes
  • Dana Sherill-Rofe, Hebrew University, Israel
  • Dolev Rahat, Hebrew University, Israel
  • Aviad Zick, Hadassah Medical Center, Israel
  • Yuval Tabach, The Hebrew University of Jerusalem, Israel

Mutations in homologous recombination repair (HRR) genes can result in an increased mutation rate and genomic rearrangements, and are associated with numerous genetic disorders and cancer. Despite intensive research, the HRR pathway is not yet fully mapped. Phylogenetic profiling analysis, which detects functional linkage between genes using coevolution, is a powerful approach to identify factors in many pathways. Nevertheless, phylogenetic profiling has limited predictive power when analyzing pathways with complex evolutionary dynamics such as the HRR pathway. To map novel HRR genes systematically, we developed an algorithm that detects local coevolution across hundreds of genomes and points to the evolutionary scale (e.g., mammals, vertebrates, animals, plants) at which coevolution occurred. Using this algorithm, we identified dozens of unrecognized genes that coevolved with the HRR pathway, either globally across all eukaryotes or locally in different clades. We validated eight genes in functional biological assays as having a role in DNA repair. These genes might lead to a better understanding of missing heritability in HRR-associated cancers (e.g., hereditary breast and ovarian cancer). Our platform presents an innovative approach to predict gene function, identify novel factors related to different diseases and pathways, and characterize gene evolution.
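
To make the distinction between global and local coevolution concrete, the sketch below compares two presence/absence profiles across all species and within a single clade. The choice of a plain Pearson correlation on binary profiles, the function names, and the toy data are all assumptions for illustration; the abstract does not specify the authors' scoring scheme.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences.

    Returns None when either sequence has zero variance.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy / sqrt(sxx * syy)


def profile_correlation(profile_a, profile_b, clade_of, clade=None):
    """Correlate two profiles over all species, or only those in `clade`."""
    keep = [i for i, c in enumerate(clade_of) if clade is None or c == clade]
    return pearson([profile_a[i] for i in keep], [profile_b[i] for i in keep])


# Hypothetical data: 1 = ortholog present in that species, 0 = absent.
clade_of = ["mammal"] * 4 + ["plant"] * 4
gene_a = [1, 0, 1, 0, 1, 1, 0, 0]
gene_b = [1, 0, 1, 0, 0, 0, 1, 1]

global_r = profile_correlation(gene_a, gene_b, clade_of)
mammal_r = profile_correlation(gene_a, gene_b, clade_of, clade="mammal")
```

In this toy case the two genes are perfectly correlated within mammals but uncorrelated across all species, which is the kind of signal a purely global comparison would miss.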

12:20 PM-12:30 PM
Evolution of the Metazoan Protein Domain Toolkit
  • Maureen Stolzer, Carnegie Mellon University, United States
  • Yuting Xiao, Carnegie Mellon University, United States
  • Daniel Durand, Carnegie Mellon University, United States

Domains, sequence fragments that encode structural or functional protein modules, are the basic building blocks of proteins. Thus, the set of all domains encoded in a genome is the protein function toolkit of the species. Domain family gain, expansion, and loss drive the evolution of this toolkit. New protein functions can arise via gain of new domains or novel combinations of existing domains, while specialization and streamlining are effected by domain loss.

Here, we investigate how changes in domain content are linked to genome and organismal evolution in metazoa, using a phylogenetic birth-death-gain model. Our results show that the relative importance of gain, expansion, and loss varies across lineages, according to a small number of evolutionary strategies. Our results also reveal characteristic evolutionary patterns among domain families. We observe that sets of domain families are evolving in concert, sharing a similar history of events, representation in ancestral genomes, and/or inferred event rates. In many cases, they also share a functional role, linking protein family evolution to innovations in the immune and nervous systems. In summary, the use of a powerful probabilistic birth-death-gain model reveals organizing principles of protein evolution in metazoan genomes.

12:30 PM-12:40 PM
Spliced alignment for the reconstruction of gene and transcript evolution
  • Aida Ouangraoua, Université de Sherbrooke, Canada
  • Safa Jammali, Université de Sherbrooke, Canada
  • Esaie Kuitche Kamela, Université de Sherbrooke, Canada

Alternative splicing is a powerful mechanism that allows genes in eukaryotic organisms to produce multiple transcript variants. However, current comparative genomics and phylogenetic reconstruction methods use a single reference transcript per gene to reconstruct gene family evolution and infer gene orthology relationships. Moreover, most of these methods rely only on sequence similarity/divergence, while neglecting the splicing structure of transcripts, which is also informative.

To address these shortcomings, we have developed a series of algorithms for computing multiple spliced alignments, inferring splicing orthology relationships, and constructing transcript and gene trees. The new methods account for multiple alternative transcripts and for both sequence and splicing structure similarity between transcripts. We have also developed a method for the visualization and annotation of the splice variants of a set of homologous genes, based on multiple spliced alignment.

2:00 PM-2:20 PM
Proceedings Presentation: TreeMerge: A new method for improving the scalability of species tree estimation methods
  • Erin Molloy, University of Illinois at Urbana-Champaign, United States
  • Tandy Warnow, University of Illinois at Urbana-Champaign, United States

Motivation: At RECOMB-CG 2018, we presented NJMerge and showed that it could be used within a divide-and-conquer framework to scale computationally intensive methods for species tree estimation to larger datasets. However, NJMerge has two significant limitations: it can fail to return a tree and, when used within the proposed divide-and-conquer framework, has O(n^5) running time for datasets with n species.

Results: Here we present a new method called "TreeMerge" that improves on NJMerge in two ways: it is guaranteed to return a tree, and it has a dramatically faster running time within the same divide-and-conquer framework, only O(n^2) time. We use a simulation study to evaluate TreeMerge in the context of multi-locus species tree estimation with two leading methods, ASTRAL-III and RAxML. We find that the divide-and-conquer framework using TreeMerge has a minor impact on species tree accuracy, dramatically reduces running time, and enables both ASTRAL-III and RAxML to complete on datasets that they would otherwise fail on when given 64 GB of memory and a maximum running time of 48 hours. Thus, TreeMerge is a step towards a larger vision of enabling researchers with limited computational resources to perform large-scale species tree estimation, which we call Phylogenomics for All.

Availability: TreeMerge is publicly available on GitHub (http://github.com/ekmolloy/treemerge).

2:20 PM-2:30 PM
Accurate and Efficient Cell Lineage Tree Inference from Noisy Single Cell Data: the Maximum Likelihood Perfect Phylogeny Approach
  • Yufeng Wu, Computer Science and Engineering Department, University of Connecticut, United States

Cells in an organism share a common evolutionary history, called a cell lineage tree. A cell lineage tree can be inferred from single-cell genotypes at genomic variation sites. There is significant noise in single-cell genotypes called from sequence data, so cell lineage tree inference from noisy single-cell data is a challenging computational problem. Most existing methods for cell lineage tree inference assume uniform uncertainty in genotypes. A key missing aspect is that real single-cell data usually have non-uniform uncertainty in individual genotypes.

In this paper, we propose a new method called ScisTree, which infers a cell lineage tree and calls genotypes from noisy single-cell genotype data. Unlike most existing approaches, ScisTree works with uncertain genotypes in the form of individualized genotype probabilities (which can be computed by existing single-cell genotype callers). This allows better use of the information about uncertain genotypes from single-cell sequence data. ScisTree assumes the infinite sites model, which leads to the well-known perfect phylogeny formulation. Given uncertain genotypes with individualized probabilities, ScisTree infers a cell lineage tree and calls the genotypes that allow a perfect phylogeny and maximize the likelihood of the genotypes. ScisTree can also impute the so-called doublets from noisy data.
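
The perfect phylogeny constraint mentioned above can be checked with the classical three-gamete test: under the infinite sites model with ancestral state 0, the sets of cells carrying any two mutations must be either disjoint or nested. The sketch below implements only this feasibility check; the function name and toy matrices are illustrative assumptions, and it does not perform the maximum likelihood search that ScisTree is described as doing.

```python
from itertools import combinations


def admits_perfect_phylogeny(genotypes):
    """Three-gamete test on a binary cell-by-site matrix (0 = ancestral).

    Returns True when every pair of sites has carrier sets that are
    disjoint or nested, which is the condition for a rooted perfect phylogeny.
    """
    if not genotypes:
        return True
    n_sites = len(genotypes[0])
    # For each site, collect the cells that carry the mutation.
    carriers = [
        {cell for cell, row in enumerate(genotypes) if row[site] == 1}
        for site in range(n_sites)
    ]
    for a, b in combinations(carriers, 2):
        if a & b and not (a <= b or b <= a):
            return False
    return True


# Hypothetical called genotypes: rows are cells, columns are sites.
clean = [
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 1],
]
noisy = [
    [1, 1, 0],
    [1, 0, 0],
    [0, 1, 1],  # site 1 now conflicts with site 0
]
```

Here `clean` passes the test while `noisy` fails it, which is the kind of conflict a method working with genotype probabilities would have to resolve by changing some calls.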

2:30 PM-2:40 PM
AmoCoala: Towards a more realistic model for cophylogeny reconstruction via an approximate Bayesian computation
  • Blerina Sinaimeri, INRIA, France
  • Laura Urbini, INRIA, France
  • Catherine Matias, CNRS, France
  • Marie-France Sagot, Inria, Université Claude Bernard Lyon 1, France

The most widely used model in studies of the coevolution of hosts and symbionts is currently phylogenetic tree reconciliation. A crucial issue in this model is that, from a biological point of view, reasonable cost values for an event-based reconciliation are not easily chosen. Different methods have been developed to infer the set of costs to be used for a given pair of host and symbiont trees. However, a major limitation of these methods is their inability to model the “invasion” of different host species by the same symbiont species, which is often observed in reality.

Here we propose a method, called AmoCoala, that, for a given pair of host and symbiont trees, estimates the frequency of the cophylogeny events in the presence of invasion events, based on an approximate Bayesian computation (ABC) approach that may be more efficient than a classical likelihood method. On the one hand, the algorithm we propose provides more confidence in the set of costs to be used for a given pair of host and parasite trees; on the other hand, it makes it possible to estimate the frequency of the events for large datasets. We evaluated our method on synthetic and real datasets.

2:40 PM-3:00 PM
Proceedings Presentation: Statistical Compression of Protein Sequences and Inference of Marginal Probability Landscapes over Competing Alignments using Finite State Models and Dirichlet Priors
  • Dinithi Sumanaweera, Monash University, Australia
  • Lloyd Allison, Monash University, Australia
  • Arun Konagurthu, Monash University, Australia

The information criterion of Minimum Message Length (MML) provides a powerful statistical framework for inductive reasoning from observed data. We apply MML to the problem of protein sequence comparison using finite state models with Dirichlet distributions. The resulting framework allows us to supersede the ad hoc cost functions commonly used in the field by systematically addressing the arbitrariness of alignment parameters and the disconnect between substitution scores and gap costs. Furthermore, our framework enables the generation of marginal probability landscapes over all possible alignment hypotheses, with the potential to help users simultaneously rationalise and assess competing alignment relationships between protein sequences, beyond simply reporting a single (best) alignment. We demonstrate the performance of our program on benchmarks containing distantly related protein sequences.

3:00 PM-3:10 PM
Choosing amino-acid replacement models
  • Lars Arvestad, Stockholm University, Sweden

Selecting the most suitable sequence evolution model, using tools like ProtTest and IQ-TREE, is today a common step in phylogenetic tree inference. It has become established practice to use maximum likelihood as the model selection principle, which is computationally demanding: every plausible model (and sub-model) is tested, leading to an unfortunate combinatorial effect. Being able to quickly select an appropriate model, or at least reduce the set of models to test with maximum likelihood, would simplify experimentation.

We propose a fast method for choosing models, based on the eigen decomposition of amino acid replacement rate matrices. The method works well on simulated data.

3:10 PM-3:20 PM
Amino acid exchangeability parameters in models of protein evolution are strongly structure dependent and non-stationary across the tree of life
  • Akanksha Pandey, University of Florida, United States
  • Edward Braun, University of Florida, United States

Parameters describing the exchangeability of amino acids are central to models of protein evolution. However, the ways that amino acid exchangeabilities vary across the tree of life remain unclear. To examine this variation, we estimated exchangeability parameters using large sets of proteins from specific clades. We found that, based on their exchangeability parameters, protein models could be divided into two major groups: 1) a group comprising vertebrate, HIV, and influenza virus proteins; and 2) a cluster that includes plants and microbial eukaryotes along with most published models of protein sequence evolution. Based on prior studies, we expected exchangeabilities to vary within proteins; we confirmed that this was true by estimating exchangeabilities for subsets of the proteins that were defined using their structure. However, the differences among models that reflect the taxonomic group used to estimate the model remained after subdividing proteins using their structure. Principal component analysis revealed that most of the variation among the exchangeabilities could be attributed to an axis that separated models based on structure (~50% of the variance) and a second axis for taxonomic group (~25% of the variance).

3:20 PM-3:40 PM
Proceedings Presentation: Estimating the predictability of cancer evolution
  • Sayed-Rzgar Hosseini, Cancer Research UK, Cambridge Institute, United Kingdom
  • Ramon Diaz-Uriarte, Dept. Biochemistry, Universidad Autonoma de Madrid, Instituto de Investigaciones Biomedicas “Alberto Sols” (UAM-CSIC), Spain
  • Florian Markowetz, University of Cambridge, United Kingdom
  • Niko Beerenwinkel, ETH Zurich, Switzerland

Motivation: How predictable is the evolution of cancer? This fundamental question is of immense relevance for the diagnosis, prognosis, and treatment of cancer. Evolutionary biologists have approached the question of predictability based on the underlying fitness landscape. However, empirical fitness landscapes of tumor cells are impossible to determine in vivo. Thus, in order to quantify the predictability of cancer evolution, alternative approaches are required that circumvent the need for fitness landscapes.
Results: We developed a computational method based on Conjunctive Bayesian Networks (CBNs) to quantify the predictability of cancer evolution directly from mutational data, without the need for measuring or estimating fitness. Using simulated data derived from more than 200 different fitness landscapes, we show that our CBN-based notion of evolutionary predictability strongly correlates with the classical notion of predictability based on fitness landscapes under the Strong Selection Weak Mutation assumption. The statistical framework enables robust and scalable quantification of evolutionary predictability. We applied our approach to driver mutation data from the TCGA and the MSK-IMPACT clinical cohorts to systematically compare the predictability of 15 different cancer types. We found that cancer evolution is remarkably predictable as only a small fraction of evolutionary trajectories are feasible during cancer progression.
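
The closing claim, that only a small fraction of evolutionary trajectories are feasible, can be illustrated with a toy calculation. A Conjunctive Bayesian Network encodes "mutation a must occur before mutation b" constraints, so the feasible trajectories are the orderings that respect every constraint. The brute-force sketch below counts them for a handful of mutations; the function names and the example constraints are assumptions, and this fraction is not the predictability measure defined by the authors.

```python
from itertools import permutations
from math import factorial


def feasible_orderings(n_mutations, precedes):
    """Yield mutation orders in which a comes before b for every (a, b)."""
    for order in permutations(range(n_mutations)):
        position = {mutation: i for i, mutation in enumerate(order)}
        if all(position[a] < position[b] for a, b in precedes):
            yield order


def feasible_fraction(n_mutations, precedes):
    """Fraction of all n! orderings that satisfy the precedence constraints."""
    count = sum(1 for _ in feasible_orderings(n_mutations, precedes))
    return count / factorial(n_mutations)


# Hypothetical constraints on four mutations:
# 0 precedes 1 and 2, and both 1 and 2 precede 3.
constraints = [(0, 1), (0, 2), (1, 3), (2, 3)]
orders = list(feasible_orderings(4, constraints))
fraction = feasible_fraction(4, constraints)
```

Only 2 of the 24 possible orderings survive these constraints. Enumerating permutations is practical only for very small numbers of mutations, so this is a teaching device and not a scalable approach.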

3:40 PM-3:50 PM
Graph-based network analysis of transcriptional regulation pattern divergence in duplicated yeast gene pairs
  • Juris Viksna, Institute of Mathematics and Computer Science, University of Latvia, Latvia
  • Darta Rituma, Institute of Mathematics and Computer Science, University of Latvia, Latvia
  • Martins Opmanis, Institute of Mathematics and Computer Science, University of Latvia, Latvia
  • Lelde Lace, Institute of Mathematics and Computer Science, University of Latvia, Latvia
  • Paulis Kikusts, Institute of Mathematics and Computer Science, University of Latvia, Latvia
  • Karlis Cerans, Institute of Mathematics and Computer Science, University of Latvia, Latvia
  • Edgars Celms, Institute of Mathematics and Computer Science, University of Latvia, Latvia
  • Peteris Rucevskis, Institute of Mathematics and Computer Science, University of Latvia, Latvia
  • Gatis Melkus, Institute of Mathematics and Computer Science, University of Latvia, Latvia
  • Karlis Freivalds, Institute of Mathematics and Computer Science, University of Latvia, Latvia

The genome of Saccharomyces cerevisiae is among the most extensively studied eukaryotic genomes. A defining event in the evolutionary history of S. cerevisiae was a whole genome duplication (WGD) event approximately 100-200 million years ago, giving rise to a special class of paralogous genes known as ohnologues.

Here we investigate the possible implications of this difference in origin between yeast ohnologues and tandem-duplicated paralogues through the lens of network motif analysis. To achieve this, we generated a transcriptional regulatory network (TRN) from publicly available data and performed an exhaustive graph-based network motif analysis. The prevalence of both complete and partial bi-fan motifs within the context of feed-forward loops and other motifs proved to be an effective means of estimating functional divergence associated with ohnologue and paralogue pairs.

We found good agreement between our network divergence measures and sequence similarity, and additionally detected some notable differences in the apparent network divergence patterns of ohnologue and paralogue pairs. Our findings demonstrate that genetic divergence between paired ohnologues as well as paralogues is accompanied by a corresponding divergence in TRN motifs, and that the study of bi-fan motifs is a useful network-based approach for investigating post-WGD ohnologue evolution.
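
For readers unfamiliar with the bi-fan motif referred to above, it consists of two regulators that both regulate the same two targets. The sketch below counts such motifs in a directed edge list. The function name and the toy network are assumptions for illustration, and the counting convention (four distinct nodes, extra edges allowed) may differ from the one used in the study.

```python
from itertools import combinations
from math import comb


def count_bifans(edges):
    """Count bi-fan motifs in a directed regulator -> target edge list.

    For each pair of regulators, every pair of shared targets forms one
    bi-fan. The two regulators themselves are excluded from the targets.
    """
    targets = {}
    for regulator, target in edges:
        targets.setdefault(regulator, set()).add(target)

    total = 0
    for a, b in combinations(sorted(targets), 2):
        shared = (targets[a] & targets[b]) - {a, b}
        total += comb(len(shared), 2)
    return total


# Hypothetical regulatory network with made-up gene names.
toy_edges = [
    ("TF1", "g1"), ("TF1", "g2"), ("TF1", "g3"),
    ("TF2", "g1"), ("TF2", "g2"), ("TF2", "g3"),
    ("TF3", "g1"),
]
n_bifans = count_bifans(toy_edges)
```

TF1 and TF2 share three targets, giving three bi-fans, while TF3 shares only one target with each of them and contributes none.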

3:50 PM-4:00 PM
Enumerating Galled Networks
  • Andreas Dwi Maryanto Gunawan, National University of Singapore, Singapore
  • Jeyaram Rathin, PSG College of Technology, India
  • Louxin Zhang, National University of Singapore, Singapore

Galled trees are widely studied as a recombination model in population genetics. This network model is generalized into galled networks by relaxing a structural condition that galled trees satisfy. We study the connection between duplication trees and galled networks. Using this connection, we develop a computer program to enumerate all possible galled networks over a set of taxa.

4:40 PM-5:00 PM
Proceedings Presentation: A Divide-and-Conquer Method for Scalable Phylogenetic Network Inference from Multi-locus Data
  • Jiafan Zhu, Rice University, United States
  • Xinhao Liu, Rice University, United States
  • Huw Ogilvie, Rice University, United States
  • Luay Nakhleh, Rice University, United States

Reticulate evolutionary histories, such as those arising in the presence of hybridization, are best modeled as phylogenetic networks. Recently developed methods allow for statistical inference of phylogenetic networks while also accounting for other processes, such as incomplete lineage sorting (ILS). However, these methods can only handle a small number of loci from a handful of genomes.

In this paper, we introduce a novel two-step method for scalable inference of phylogenetic networks from the sequence alignments of multiple, unlinked loci. The method infers networks on subproblems and then merges them into a network on the full set of taxa. To reduce the number of trinets to infer, we formulate a Hitting Set version of the problem of finding a small number of subsets, and implement a simple heuristic to solve it. We studied the performance of the method, in terms of both running time and accuracy, on simulated as well as on biological data sets. The two-step method accurately infers phylogenetic networks at a scale that is infeasible with existing methods. The results are a significant and promising step towards accurate, large-scale phylogenetic network inference.

We implemented the algorithms in the publicly available software package PhyloNet (https://bioinfocs.rice.edu/PhyloNet).
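
For readers unfamiliar with the Hitting Set problem mentioned above, the following is a minimal sketch of a generic greedy heuristic: repeatedly pick the element that occurs in the most sets not yet hit. It is a textbook-style illustration only. The function name `greedy_hitting_set` and the toy input are assumptions, and it is not the formulation or code used in PhyloNet.

```python
from collections import Counter


def greedy_hitting_set(sets):
    """Return a small list of elements that intersects every set in `sets`."""
    remaining = [frozenset(s) for s in sets]
    if any(not s for s in remaining):
        raise ValueError("an empty set cannot be hit")

    chosen = []
    while remaining:
        # Count how many unhit sets each element belongs to.
        counts = Counter(x for s in remaining for x in s)
        # Pick the most frequent element; ties go to the smallest element.
        best = max(sorted(counts), key=lambda x: counts[x])
        chosen.append(best)
        # Drop every set that the chosen element now hits.
        remaining = [s for s in remaining if best not in s]
    return chosen


toy_sets = [{"a", "b"}, {"b", "c"}, {"c", "d"}]
hitting = greedy_hitting_set(toy_sets)
```

The greedy choice does not guarantee a minimum hitting set, but it is simple and usually yields a small one.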

5:00 PM-5:10 PM
The Ortholog Conjecture Revisited: the Value of Orthologs and Paralogs in Function Prediction
  • Moses Stamboulian, Indiana University Bloomington, United States
  • Rafael Guerrero, Indiana University Bloomington, United States
  • Matthew Hahn, Indiana University Bloomington, United States
  • Predrag Radivojac, Northeastern University, United States

Presentation Overview:

Computational prediction of gene function is a key step in making full use of newly sequenced genomes. Function is generally predicted by transferring annotations from homologous genes for which experimental evidence exists. The “ortholog conjecture” proposes that orthologous genes should be preferred when making such predictions, as they evolve functions more slowly than paralogous genes. Previous research has provided little support for the ortholog conjecture, though the incomplete nature of the data used cast doubt on the conclusions. Here we use experimental annotations from over 40,000 proteins (drawn from over 80,000 publications) to revisit the ortholog conjecture in two pairs of species: human and mouse, and Saccharomyces cerevisiae and Schizosaccharomyces pombe. By distinguishing between the evolution of function and the prediction of function, we find strong evidence against the ortholog conjecture in the context of function prediction. Furthermore, we quantify the amount of data that is ignored when paralogs are discarded, alongside the resulting loss in prediction accuracy in both species pairs. Our results support the view that the types of homologs used are largely irrelevant to the task of function prediction. We should instead aim to maximize the amount of data we use for this task, regardless of homology.

5:10 PM-5:20 PM
SonicParanoid: fast, accurate and easy orthology inference
  • Salvatore Cosentino, The University of Tokyo, Japan
  • Wataru Iwasaki, The University of Tokyo, Japan

Presentation Overview:

Orthology inference constitutes a common base of many genome-based studies, as a pre-requisite for annotating new genomes, finding target genes for biotechnological applications and revealing the evolutionary history of life. Although its importance keeps rising with the ever-growing number of sequenced genomes, existing tools are computationally demanding and difficult to employ.
Here, we present SonicParanoid, which is orders of magnitude faster than, but comparably accurate to, the well-established tools with a balanced precision-recall trade-off. Furthermore, SonicParanoid substantially relieves the difficulties of orthology inference for those who need to construct and maintain their own genomic datasets.
SonicParanoid is available with a GNU GPLv3 license on the Python Package Index and BitBucket. Documentation is available at http://iwasakilab.bs.s.u-tokyo.ac.jp/sonicparanoid.

5:20 PM-5:40 PM
Proceedings Presentation: Efficient Merging of Genome Profile Alignments
  • André Hennig, Center for Bioinformatics Tübingen, University of Tübingen, Germany
  • Kay Nieselt, Center for Bioinformatics Tübingen, University of Tübingen, Germany

Presentation Overview:

Motivation: Whole-genome alignment methods show insufficient scalability towards the generation of large-scale whole-genome alignments (WGAs). Profile alignment-based approaches revolutionized multiple sequence alignment construction by significantly reducing computational complexity and runtime. However, WGAs need to consider genomic rearrangements between genomes, which makes the profile-based extension of several whole genomes challenging. Currently, none of the available methods offers the possibility to align or extend WGA profiles.
Results: Here, we present GPA, an approach that aligns the profiles of WGAs and is capable of producing large-scale WGAs many times faster than conventional methods. Our concept relies on already available whole-genome aligners, which are used to compute several smaller sets of aligned genomes that are combined into a full WGA with a divide-and-conquer approach. To align or extend WGA profiles, we make use of the SuperGenome data structure, which features a bidirectional mapping between individual sequence and alignment coordinates. This data structure is used to efficiently transfer different coordinate systems into a common one based on the principles of profile alignment. The approach allows the computation of a WGA where alignments are subsequently merged along a guide tree. The current implementation uses progressiveMauve (Darling et al., 2010) and offers the possibility for parallel computation of independent genome alignments. Our results, based on various bacterial data sets of up to several hundred genomes, show that we can reduce the runtime from months to hours with a quality that is negligibly worse than the WGA computed with the conventional progressiveMauve tool.
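
The idea of a bidirectional mapping between sequence and alignment coordinates can be sketched for the simplest case of a single gapped alignment row. The class name `CoordinateMap` and its attributes are hypothetical and chosen for this example. The actual SuperGenome data structure must also handle rearranged alignment blocks, which this sketch deliberately ignores.

```python
class CoordinateMap:
    """Two-way lookup between ungapped sequence positions and alignment columns."""

    def __init__(self, aligned_row, gap="-"):
        self.seq_to_aln = []  # sequence position -> alignment column
        self.aln_to_seq = []  # alignment column -> sequence position, or None at a gap
        for column, char in enumerate(aligned_row):
            if char == gap:
                self.aln_to_seq.append(None)
            else:
                # The next sequence position equals the number of residues seen so far.
                self.aln_to_seq.append(len(self.seq_to_aln))
                self.seq_to_aln.append(column)


row = CoordinateMap("AC--GT")
column_of_third_base = row.seq_to_aln[2]    # third base sits in alignment column 4
base_at_column_three = row.aln_to_seq[3]    # column 3 is a gap in this row
```

With one such map per genome, a position in one genome can be translated to an alignment column and from there to the corresponding position in another genome.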

5:40 PM-5:50 PM
Improving classification of novel genes into known gene families via the phylo-kmers
  • Benjamin Linard, LIRMM, France
  • Vincent Ranwez, SupAgro Montpellier, France
  • Céline Scornavacca, ISEM, France
  • Fabio Pardi, LIRMM, France

Presentation Overview:

One of the most fundamental tasks in genome annotation is to classify new genes into known gene families. It is generally addressed by pairwise alignments to establish similarity scores (Blast) or profile HMM alignments. Nowadays, the latter is standard because of its scalability but it still shows limitations: i) when species sampling is biased, HMMs are less representative of the most isolated clades and ii) HMMs do not account for evolutionary distances that may be available if phylogenies were built for each gene family. Recently, we developed the concept of “phylogenetically-informed kmers” (phylo-kmers) to provide an efficient solution to the problem of alignment-free phylogenetic placement.
In the algorithm CLAPPAS, we adapted the phylo-kmer idea to the problem of protein classification into gene families. It relies on a first phase where sets of phylo-kmers are built and indexed for each gene family and a second phase where matches between k-mers from a query gene and pre-computed phylo-kmers are used to assign the gene to its most likely family of origin. Our preliminary results show that in this classification phase, CLAPPAS is already several orders of magnitude faster than HMM-based classification while keeping comparable accuracy.
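
The two phases described above (indexing k-mers per family, then matching query k-mers against the index) can be sketched in a much simplified form. Everything here is an assumption made for illustration: the function names, the toy scores, and the rule of summing per-family scores over matched k-mers. It does not reproduce how CLAPPAS computes or weights phylo-kmers.

```python
from collections import defaultdict


def kmers(sequence, k):
    """Return all overlapping k-mers of a sequence."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def build_index(family_kmer_scores):
    """Phase 1: invert {family: {kmer: score}} into {kmer: {family: score}}."""
    index = defaultdict(dict)
    for family, scores in family_kmer_scores.items():
        for kmer, score in scores.items():
            index[kmer][family] = score
    return index


def classify(query, index, k):
    """Phase 2: sum the scores of matched k-mers and return the best family."""
    totals = defaultdict(float)
    for kmer in kmers(query, k):
        for family, score in index.get(kmer, {}).items():
            totals[family] += score
    if not totals:
        return None  # no k-mer of the query matched any family
    # Highest total wins; ties go to the alphabetically first family.
    return max(sorted(totals), key=lambda family: totals[family])


# Hypothetical pre-computed scores for two toy families.
toy_index = build_index({
    "famA": {"MKV": 2.0, "KVL": 1.5},
    "famB": {"KVL": 0.5, "VLA": 1.0},
})
best_family = classify("MKVL", toy_index, k=3)
```

Because classification reduces to dictionary lookups on query k-mers, no alignment of the query is needed at this stage.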

5:50 PM-6:00 PM
IMAP: Chromosome-level genome assembler combining multiple de novo assemblies
  • Giltae Song, Pusan National University, South Korea
  • Juyeon Kim, Konkuk University, South Korea
  • Seokwoo Kang, Pusan National University, South Korea
  • Hoyong Lee, Pusan National University, South Korea
  • Daehong Kwon, Konkuk University, South Korea
  • Daehwan Lee, Konkuk University, South Korea
  • Gregory Lang, Lehigh University, United States
  • J. Michael Cherry, Stanford University, United States

Presentation Overview:

Genomic data have become major resources to understand complex mechanisms at fine-scale temporal resolution in functional and evolutionary genetic studies, including human diseases, such as cancers. Recently, a large number of whole genomes of evolving populations of yeast (Saccharomyces cerevisiae W303 strain) were sequenced in a time-dependent manner to identify temporal evolutionary patterns. For this type of study, a chromosome-level sequence assembly of the strain or population at time zero is required to compare with the genomes derived later. However, there is no fully automated computational approach to establish the chromosome-level genome assembly using unique features of sequencing data in experimental evolution studies. In this study, we developed a new software pipeline, integrative meta-assembly pipeline (IMAP), to build chromosome-level genome sequence assemblies by combining multiple initial assemblies from only short-read sequencing data. We significantly improved the continuity and accuracy of the genome assembly using a large collection of sequencing data and hybrid assembly approaches. We validated our pipeline by generating chromosome-level assemblies of several fungal strains, and compared our results with assemblies built using long-read sequencing and various assembly evaluation metrics. Our pipeline combines the strengths of reference-guided and meta-assembly approaches.