Leading Professional Society for Computational Biology and Bioinformatics
Connecting, Training, Empowering, Worldwide


EvolCompGen COSI


Schedule subject to change
Tuesday, July 14th
10:40 AM-11:00 AM
Proceedings Presentation: FastMulRFS: Fast and accurate species tree estimation under generic gene duplication and loss models
Format: Pre-recorded with live Q&A

  • Tandy Warnow, University of Illinois at Urbana-Champaign, United States
  • Erin Molloy, University of Illinois at Urbana-Champaign, United States

Motivation: Species tree estimation is a basic part of biological research but can be challenging because of gene duplication and loss (GDL), which results in genes that can appear more than once in a given genome. All common approaches in phylogenomic studies either reduce available data or are error-prone, and thus, scalable methods that do not discard data and have high accuracy on large heterogeneous datasets are needed.

Results: We present FastMulRFS, a polynomial-time method for estimating species trees without knowledge of orthology. We prove that FastMulRFS is statistically consistent under a generic model of GDL when adversarial GDL does not occur. Our extensive simulation study shows that FastMulRFS matches the accuracy of MulRF (which tries to solve the same optimization problem) and has better accuracy than prior methods, including ASTRAL-multi (the only method to date that has been proven statistically consistent under GDL), while being much faster than both methods.

11:00 AM-11:10 AM
Species tree estimation with selection
Format: Pre-recorded with live Q&A

  • Carolin Kosiol, University of St Andrews, United Kingdom
  • Rui Borges, Institute of Population Genetics, Vetmeduni Vienna, Austria

Species trees form a basis to address many biological questions, and in recent years have been crucial to our understanding of the divergence process and speciation. Although modeling the species tree is challenging, the last two decades have seen an explosion of sophisticated models for the species tree. The multispecies coalescent model has arisen as a leading framework, but alternative ones have been proposed. The polymorphism-aware phylogenetic models (PoMos), in particular, offer the advantage of naturally accounting for incomplete lineage sorting and being flexible modeling-wise.
PoMo extends any DNA substitution model to account for polymorphisms by expanding the state space to include polymorphic states. A Moran process is used to model genetic drift. PoMo performs well for species tree estimation as it can accurately and time-efficiently estimate the parameters describing evolutionary patterns for phylogenetic trees of any shape (species trees, population trees, or any combination of those).
Here, we extend PoMo to incorporate allelic directional selection. Our results show that one might overestimate the divergence among evolving species when directional selection is unaccounted for. We developed a Bayesian framework for PoMo with allelic selection, which allowed us to infer pervasive signatures of nucleotide usage biases in great ape and fruit fly genomes.
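
As an illustration of the drift component described above, the following is a minimal sketch of a neutral two-allele Moran process, the population model that PoMo uses for genetic drift. It is not the PoMo implementation: it omits mutation and the directional selection introduced in this talk, and the function names `moran_step` and `simulate_until_fixation`, the population size, and the starting count are assumptions chosen for the example.

```python
import random


def moran_step(count_a, n, rng):
    """One neutral Moran step: one individual reproduces and one dies.

    count_a is the number of copies of allele A in a population of size n.
    Parent and dying individual are drawn independently and uniformly.
    """
    parent_is_a = rng.random() < count_a / n
    dead_is_a = rng.random() < count_a / n
    return count_a + int(parent_is_a) - int(dead_is_a)


def simulate_until_fixation(count_a, n, rng):
    """Iterate Moran steps until allele A is lost (0) or fixed (n)."""
    steps = 0
    while 0 < count_a < n:
        count_a = moran_step(count_a, n, rng)
        steps += 1
    return count_a, steps


rng = random.Random(42)  # fixed seed so the run is reproducible
final_count, steps = simulate_until_fixation(5, 10, rng)
print(final_count, steps)
```

The polymorphic states that PoMo adds to the substitution model correspond to the intermediate values of `count_a`; the boundary values 0 and `n` are the fixed states.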

11:10 AM-11:20 AM
Graph Splitting: A Graph-Based Approach for Superfamily-Scale Phylogenetic Tree Reconstruction
Format: Pre-recorded with live Q&A

  • Motomu Matsui, The University of Tokyo, Japan
  • Wataru Iwasaki, The University of Tokyo, Japan

A protein superfamily contains distantly related proteins that have acquired diverse biological functions through a long evolutionary history. Phylogenetic analysis of the early evolution of protein superfamilies is a key challenge because existing phylogenetic methods show poor performance when protein sequences are too diverged to construct an informative multiple sequence alignment (MSA). Here, we propose the Graph Splitting (GS) method, which rapidly reconstructs a protein superfamily-scale phylogenetic tree using a graph-based approach. Evolutionary simulation showed that the GS method can accurately reconstruct phylogenetic trees and be robust to major problems in phylogenetic estimation, such as biased taxon sampling, heterogeneous evolutionary rates, and long-branch attraction when sequences are substantially diverged. Its application to an empirical data set of the triosephosphate isomerase (TIM)-barrel superfamily suggests rapid evolution of protein-mediated pyrimidine biosynthesis, likely taking place after the RNA world. Furthermore, the GS method can also substantially improve performance of widely used MSA methods by providing accurate guide trees. The GS method is freely available at our website: http://gs.bs.s.u-tokyo.ac.jp/

Motomu Matsui and Wataru Iwasaki. Systematic Biology, 69, 265-279. (2020)

11:20 AM-11:30 AM
Gaps and runs in syntenic alignments
Format: Pre-recorded with live Q&A

  • Zhe Yu, University of Ottawa, Canada
  • Chunfang Zheng, University of Ottawa, Canada
  • David Sankoff, University of Ottawa, Canada

Gene loss is the obverse of novel gene acquisition by a genome through a variety of evolutionary processes. It serves a number of functional and structural roles, compensating for the energy and material costs of gene complement expansion.

A type of gene loss widespread in the lineages of plant genomes is “fractionation” after whole genome doubling or tripling, where one of a pair or triplet of paralogous genes in parallel syntenic contexts is discarded. The detailed syntenic mechanisms of gene loss, especially in fractionation, remain controversial.

We focus on the frequency distribution of gap lengths (number of deleted genes – not nucleotides) within syntenic blocks calculated during the comparison of chromosomes from two genomes. We mathematically characterize a simple model in some detail and show that it adequately describes neither the Coffea arabica subgenomes nor its two progenitor genomes. We find that a mixture of two models, a random, one-gene-at-a-time model and a geometric-length distributed excision model for removing a variable number of genes, fits well.
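
To make the quantity under study concrete, here is a small sketch of how gap lengths (runs of deleted genes) could be tabulated, together with a toy version of the random one-gene-at-a-time deletion model. This is an illustration under stated assumptions, not the authors' code: a syntenic block is represented as a list of booleans (True for a retained gene), every run of deleted genes is counted including runs at the ends of the block, and the names `gap_lengths` and `delete_one_at_a_time` are hypothetical.

```python
import random
from collections import Counter
from itertools import groupby


def gap_lengths(retained):
    """Lengths of maximal runs of deleted genes (False entries)."""
    return [sum(1 for _ in run) for kept, run in groupby(retained) if not kept]


def delete_one_at_a_time(n_genes, n_deletions, rng):
    """Toy random model: delete n_deletions distinct genes, chosen uniformly."""
    retained = [True] * n_genes
    for index in rng.sample(range(n_genes), n_deletions):
        retained[index] = False
    return retained


rng = random.Random(1)  # fixed seed for reproducibility
block = delete_one_at_a_time(100, 30, rng)
histogram = Counter(gap_lengths(block))
print(sorted(histogram.items()))  # (gap length, number of gaps) pairs
```

A histogram produced this way is the kind of empirical distribution that a fitted model, such as the two-component mixture described above, would be compared against.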

11:30 AM-11:40 AM
A Solution to the Labeled Robinson-Foulds Distance Problem
Format: Pre-recorded with live Q&A

  • Samuel Briand, University of Montreal, Canada
  • Nadia El-Mabrouk, University of Montreal, Canada
  • Samuel Briand, University of Lausanne, Switzerland

Gene trees are extensively used, notably for inferring the most plausible scenario of evolutionary events leading to the observed gene family from a single ancestral gene copy. This has important implications towards elucidating the functional relationship between gene copies. For this purpose, reconciliation enables the labeling of internal nodes in the gene tree with the type of events at the origin of gene tree bifurcations.

The variety of phylogenetic inference methods, leading to different and potentially inconsistent trees for the same dataset, warrants the design of appropriate tools for comparing them. While comparing reconciled gene trees remains a largely unexplored field, a large variety of measures have been developed for comparing unlabeled evolutionary trees. Among them, despite its limitations, the Robinson-Foulds (RF) distance remains the most widely used one, mostly due to its computational efficiency.

In this paper, we report on a Labeled Robinson-Foulds edit distance, which maintains desirable properties such as being computable exactly in linear time.
Further, we show that this new distance is computable for an arbitrary number of label types, thus making it useful for applications including more label types than speciations and duplications.
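
For readers unfamiliar with the baseline, the sketch below computes the classic unlabeled RF distance between two rooted trees on the same leaf set, as the size of the symmetric difference of their clade sets. It does not implement the labeled edit distance presented in this talk and is not a linear-time algorithm. The nested-tuple tree encoding and the names `clades` and `rf_distance` are assumptions made for the example.

```python
def clades(tree):
    """Non-trivial clades of a rooted tree given as nested tuples of leaf names.

    Example encoding: (("a", "b"), ("c", "d")).
    """
    found = set()

    def leaves_below(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(leaves_below(child) for child in node))
        found.add(below)
        return below

    all_leaves = leaves_below(tree)
    found.discard(all_leaves)  # the root clade is shared by every tree
    return found


def rf_distance(tree_1, tree_2):
    """Number of clades present in exactly one of the two trees."""
    return len(clades(tree_1) ^ clades(tree_2))


balanced = (("a", "b"), ("c", "d"))
swapped = (("a", "c"), ("b", "d"))
caterpillar = ((("a", "b"), "c"), "d")
print(rf_distance(balanced, swapped), rf_distance(balanced, caterpillar))
```

A labeled variant would additionally have to account for the event type (for example speciation or duplication) attached to each internal node, which this sketch ignores.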

12:00 PM-12:20 PM
Tissue-guided LASSO for prediction of clinical drug response using preclinical samples
  • Saurabh Sinha, University of Illinois at Urbana-Champaign, United States
  • Edward Huang, University of Illinois at Urbana-Champaign, United States
  • Ameya Bhope, McGill University, Canada
  • Jing Lim, University of Illinois at Urbana-Champaign, United States
  • Amin Emad, McGill University, Canada

Predicting the clinical drug response (CDR) of cancer patients, based on their clinical parameters and their tumours' molecular profiles, can play an important role in precision medicine. While machine learning (ML) models have the potential to address this issue, their training requires data from a large number of patients treated with each drug, limiting their feasibility for many drugs. One alternative is training ML models on large databases containing molecular profiles of hundreds of preclinical cell lines and their response to hundreds of drugs. Here, we developed a novel algorithm (TG-LASSO) that explicitly incorporates information on samples' tissue of origin with gene expression profiles to predict the CDR of patients using preclinical samples. Using two large databases, we showed that TG-LASSO can accurately distinguish between resistant and sensitive patients for 7 out of 12 drugs, outperforming various other methods. Moreover, TG-LASSO identified genes associated with the drug response, including known targets and pathways involved in the drugs' mechanism of action. Additionally, genes identified by this method for multiple drugs in a tissue are associated with patient survival and can be used to predict their outcome. In summary, TG-LASSO can predict patients’ CDR and identify biomarkers of drug sensitivity and survival.

12:20 PM-12:30 PM
A Probabilistic Framework for Cell Lineage Tree Reconstruction
Format: Pre-recorded with live Q&A

  • Hazal Koptagel, KTH Royal Institute of Technology, Science For Life Laboratory, Sweden
  • Seong-Hwan Jun, KTH Royal Institute of Technology, Science For Life Laboratory, Sweden
  • Jens Lagergren, KTH Royal Institute of Technology, Science For Life Laboratory, Sweden

Single-cell DNA sequencing (scDNA-seq) technology enables a higher-resolution view of cells and has the potential to uncover the relationship between individual cells. scDNA-seq is a fundamental tool for evolutionary studies; however, it also introduces technological artefacts such as uneven coverage, allelic dropout, amplification and sequencing errors.

In this study, we focus on human cells without copy number alterations. We present a Bayesian model based on scDNA-seq that uncovers the difference between cells with the help of germline single nucleotide variations (gSNVs). The use of gSNVs as reference points enables us to accurately differentiate between somatic point mutations and the technological artefacts, especially the amplification errors. The model outputs a cell-to-cell distance matrix for each analysed pair of loci, from which we reconstruct the cell lineage tree with bootstrapping and neighbour joining. We evaluate the reconstructed tree with transfer bootstrap expectation scores of branches and the Robinson-Foulds distance to the underlying tree structure. The model is embarrassingly parallel and, with the use of dynamic programming and neighbour joining, we can analyse tens of thousands of positions in the genome.

The experiments showed high accuracy in tree reconstruction and the identification of subclones.

12:30 PM-12:40 PM
Splicing-structure-based selection of protein isoforms improves the accuracy of gene tree reconstruction
Format: Pre-recorded with live Q&A

  • Esaie Kuitche Kamela, Université de Sherbrooke, Canada
  • Wend-Yam D. Davy Ouédraogo, Université de Sherbrooke, Canada
  • Marie Degen, Université de Sherbrooke, Canada
  • Shengrui Wang, Université de Sherbrooke, Canada
  • Aida Ouangraoua, Université de Sherbrooke, Canada

Constructing accurate gene trees is important, as gene trees play a key role in several biological studies, such as species tree reconstruction, gene functional analysis and gene family evolution studies. Although several methods have provided large improvements in the construction and the correction of gene trees by making use of the relationship with a species tree in addition to multiple sequence alignments, there is still room for improvement in the accuracy of gene trees and the computing time. In particular, accounting for alternative splicing, which allows eukaryote genes to produce multiple transcripts and proteins per gene, is a way to improve the quality of multiple sequence alignments used to reconstruct gene trees. Current methods for gene tree reconstruction usually make use of a set of transcripts composed of one representative transcript per gene, to generate multiple sequence alignments which are then used to estimate gene trees. In this work, we present two new splicing-structure-based methods to estimate gene trees by selecting an accurate set of homologous transcripts, based on their splicing structure, to represent the genes of a gene family. The results show that the new methods compare favorably with the currently most used gene tree construction methods.

12:40 PM-1:00 PM
Proceedings Presentation: Inference of Population Admixture Network from Local Gene Genealogies: a Coalescent-based Maximum Likelihood Approach
Format: Pre-recorded with live Q&A

  • Yufeng Wu, Computer Science and Engineering Department, University of Connecticut, United States

Population admixture is an important subject in population genetics. Inferring population demographic history with admixture under the so-called admixture network model from population genetic data is an established problem in genetics. Existing admixture network inference approaches work with single genetic polymorphisms. While these methods are usually very fast, they don't fully utilize the information (e.g., linkage disequilibrium or LD) contained in population genetic data.

In this paper, we develop a new admixture network inference method called GTmix. Unlike existing methods, GTmix works with local gene genealogies that can be inferred from population haplotypes. Local gene genealogies represent the evolutionary history of sampled alleles and contain the LD information. GTmix performs coalescent-based maximum likelihood inference of admixture networks with inferred genealogies based on the well-known multispecies coalescent (MSC) model. GTmix utilizes various techniques to speed up likelihood computation on the MSC model and optimal network search. Our simulations show that GTmix can infer more accurate admixture networks with much smaller data than existing methods, even when these existing methods are given much larger data. GTmix is reasonably efficient and can analyze genetic datasets of current interest.

2:00 PM-2:20 PM
Proceedings Presentation: EvoLSTM: Context-dependent models of sequence evolution using a sequence-to-sequence LSTM
Format: Pre-recorded with live Q&A

  • Mathieu Blanchette, McGill University, Canada
  • Dongjoon Lim, McGill University, Canada

Motivation: Accurate probabilistic models of sequence evolution are essential for a wide variety of bioinformatics tasks, including sequence alignment and phylogenetic inference. The ability to realistically simulate the evolution of sequence is also at the core of many benchmarking strategies. Yet mutational processes have complex context dependencies that remain poorly modeled and understood.
Results: We introduce EvoLSTM, a recurrent neural network-based evolution simulator that captures mutational context dependencies. EvoLSTM uses a sequence-to-sequence LSTM model trained to predict mutation probabilities at each position of a given sequence, taking into consideration the 14 flanking nucleotides. EvoLSTM can realistically simulate primate DNA sequence and reveals unexpectedly strong long-range context dependencies in mutation rates.
Conclusion: EvoLSTM brings modern machine learning approaches to bear on sequence evolution. It will serve as a useful tool to study and simulate complex mutational processes.

2:20 PM-2:30 PM
Gene annotation refinement software using synteny based mapping
Format: Pre-recorded with live Q&A

  • Giltae Song, Pusan National University, South Korea
  • Hoyong Lee, Pusan National University, South Korea

High-throughput next-generation sequencing (NGS) substantially reduces the cost of generating genome data. To apply NGS data in genetics studies, the sequencing data is assembled into a genome sequence and annotated into genes. Gene annotation, one of the fundamental steps in understanding the function of each gene, determines the location of the gene and its coding regions. Several gene annotation tools exist, but they still have limitations in terms of ambiguity issues in the gene annotation steps. Most annotation tools are also quite difficult for novice users to install and apply in their studies.
We propose a user-friendly and practically usable gene annotation software pipeline. To this end, the ambiguity problems are resolved using synteny mapping information. The performance of our software tool is evaluated using benchmark datasets such as the sequence of Saccharomyces cerevisiae S288C strain as well as other strain data. We believe our tool improves the accuracy of gene annotations so that it can substantially reduce the effort and time required for manual curation in genome annotation. Our software package is released as an installation script and a Docker image so that users can easily install it and apply it to their own sequence data.

2:30 PM-2:40 PM
A Comprehensive Analysis of the Phylogenetic Signal in Ramp Sequences in 211 Vertebrates
Format: Pre-recorded with live Q&A

  • Lauren McKinnon, Brigham Young University, United States
  • Justin Miller, Brigham Young University, United States
  • Michael Whiting, Brigham Young University, United States
  • John Kauwe, Brigham Young University, United States
  • Perry Ridge, Brigham Young University, United States

Background: Ramp sequences increase translational speed and accuracy when rare, slowly-translated codons are found at the beginnings of genes. Here, the results of the first analysis of ramp sequences in a phylogenetic construct are presented.

Methods: Ramp sequences were compared from 211 vertebrates (110 mammalian and 101 non-mammalian). The presence and absence of ramp sequences was analyzed as a binary character in a parsimony and maximum likelihood framework. Additionally, ramp sequences were mapped to the Open Tree of Life taxonomy to determine the number of parallelisms and reversals that occurred.

Results: Parsimony and maximum likelihood analyses of the presence/absence of ramp sequences recovered phylogenies that are highly congruent with established phylogenies. Additionally, the retention index of ramp sequences is significantly higher than would be expected due to random chance (p-value = 0). A chi-square analysis of completely orthologous ramp sequences resulted in a p-value of approximately zero as compared to random chance.

Discussion: Ramp sequences recover phylogenies comparable to those of other phylogenomic methods. Although not all ramp sequences appear to have a phylogenetic signal, more ramp sequences track speciation than expected by random chance. Therefore, ramp sequences may be used in conjunction with other phylogenomic approaches.

2:40 PM-2:50 PM
Highly-regulated and diverse NTP-based biological conflict systems with implications for emergence of multicellularity
Format: Pre-recorded with live Q&A

  • Gurmeet Kaur, NCBI, NIH, United States
  • A Maxwell Burroughs, NCBI, NIH, United States
  • Lakshminarayan M Iyer, NCBI, NIH, United States
  • Aravind L., NCBI, NIH, United States

Multicellular organizations are prone to infections even if a single cell is infected. We reveal novel highly-regulated chaperone-based systems that are likely used as a survival tactic by prokaryotes with complex lifecycles. These architecturally-analogous systems have constant core modules coupled with highly-variable effector modules, which is reminiscent of known biological conflict systems and co-evolutionary arms races. The constant component is an ATPase/GTPase and/or a peptidase that is activated in response to an invasive entity and causes effector deployment, which is additionally regulated by proteolytic processing or binding of a nucleotide-derived signal. A third component senses invasive entities and transmits the signal. Effectors either: target invasive nucleic-acids or proteins; are inactive counterparts of host proteins that mediate decoy interactions with invasive molecules; or form macromolecular assemblages to cause host cell-death or containment of the invasive entity. These apoptotic and immunity properties displayed by systems in phylogenetically-disparate multicellular prokaryotes are suggestive of evolutionary convergence for kin viability in multicellular organizations. Comparable protein domains appear to have organized into systems based on common principles in eukaryotic apoptosis. Thus, a similar operational “grammar” and shared “vocabulary” of protein domains is seen in sensing and limiting infections during the multiple emergences of multicellularity across the tree of life.

2:50 PM-3:00 PM
Integrated synteny- and similarity-based inference on the polyploidization-fractionation cycle
Format: Pre-recorded with live Q&A

  • Zhe Yu, University of Ottawa, Canada
  • Chunfang Zheng, University of Ottawa, Canada
  • David Sankoff, University of Ottawa, Canada
  • Yue Zhang, University of Ottawa, Canada

Two orthogonal approaches to the study of fractionation (duplicate gene loss after polyploidization) focus on the decrease over time of the number of surviving duplicate pairs, on the one hand, and on the pattern of syntenically consecutive pairs lost at a deletion event, on the other. Here we explore a synergy between the two approaches that greatly enlarges the scope of both.

In the branching process approach to accounting for the distribution of gene pair similarities, the inference possibilities are minimal, since there is only one degree of freedom for each replication event. It is only by transcending the distribution of gene pair similarities and bringing other data to bear that we can increase the number of parameters of the branching process that can be estimated.

We greatly enlarged the possibilities of estimating the parameters of this model of the replication-fractionation cycle by considering the singletons within synteny blocks, by deriving theoretical constraints among the retention rates, and by correcting for erosion of synteny blocks over time.

3:20 PM-3:40 PM
Proceedings Presentation: Copy Number Evolution with Weighted Aberrations in Cancer
Format: Pre-recorded with live Q&A

  • Benjamin J. Raphael, Princeton University, United States
  • Ron Zeira, Princeton University, United States

Motivation: Copy number aberrations (CNAs), which delete or amplify large contiguous segments of the genome, are a common type of somatic mutation in cancer. Copy number profiles, representing the number of copies of each region of a genome, are readily obtained from whole-genome sequencing or microarrays. However, modeling copy number evolution is a substantial challenge, since CNAs alter contiguous segments of the genome and different CNAs may overlap with one another. A recent popular model for copy number evolution is the Copy Number Distance (CND), defined as the length of a shortest sequence of deletions and amplifications of contiguous segments that transforms one profile into the other. All events contribute equally to the CND; however, CNAs are observed to occur at different rates according to their length or genomic position and also vary across cancer type.
Results: We introduce a weighted copy number distance that allows events to have varying weights, or probabilities, based on their length, position and type. We derive an efficient algorithm to compute the weighted copy number distance as well as the associated transformation, based on the observation that the constraint matrix of the underlying optimization problem is totally unimodular. We demonstrate the utility of the weighted copy number distance by showing that the weighted CND: improves phylogenetic reconstruction on simulated data where copy number aberrations occur with varying probabilities; aids in the derivation of phylogenies from ultra low-coverage single-cell DNA sequencing data; and helps estimate CNA rates in a large pan-cancer dataset.

3:40 PM-3:50 PM
Reconstructing Tumor Evolutionary Histories and Clone Trees in Polynomial-time with SubMARine
Format: Pre-recorded with live Q&A

  • Linda K. Sundermann, University of Toronto, Canada
  • Jeff Wintersinger, University of Toronto, Canada
  • Gunnar Rätsch, ETH Zürich, Switzerland
  • Jens Stoye, Bielefeld University, Germany
  • Quaid Morris, Memorial Sloan Kettering Cancer Centre, United States

Tumors contain multiple subpopulations of genetically distinct cancer cells. Reconstructing their evolutionary history can improve our understanding of how cancers develop and respond to treatment. Subclonal reconstruction methods infer the ancestral relationships among the subpopulations by constructing a clone tree. However, often multiple clone trees are consistent with the data. Current methods do not effectively characterize this uncertainty, and cannot scale to cancers with many subclonal populations.

In this work we introduce a partial clone tree that defines a subset of the pairwise ancestral relationships in a clone tree, thereby implicitly representing the set of all clone trees that have these defined relationships. Also, we define a special partial clone tree, the Maximally-Constrained Ancestral Reconstruction (MAR), which summarizes all clone trees fitting the input data equally well. We describe SubMARine, a polynomial-time algorithm producing the subMAR, which approximates the MAR with specific guarantees. We also extend SubMARine to work with subclonal copy number aberrations.

We show, on both simulated data and a real lung cancer dataset, that SubMARine runs in less than 70 seconds, and that the subMAR equals the MAR in > 99.9% of cases where only a single tree exists.

SubMARine is available at https://github.com/morrislab/submarine.

3:50 PM-4:00 PM
Structural and transcriptional variation linked to protracted human frontal cortex development
Format: Pre-recorded with live Q&A

  • Jasmine Hendy, Delaware State University, United States
  • Christine Charvet, Delaware State University, United States

The human frontal cortex is unusually large compared with many other species. The expansion of the human frontal cortex is accompanied by both connectivity and transcriptional changes. Yet, the developmental origins generating variation in frontal cortex circuitry across species remain unresolved. Nineteen genes, which encode filaments, synapse, and voltage-gated channels, are especially enriched in the supragranular layers of the human cerebral cortex, which suggests enhanced cortico-cortical projections emerging from layer III. We identify species differences in connections with the use of diffusion MR tractography as well as gene expression in adulthood and in development to identify developmental mechanisms generating variation in frontal cortical circuitry. We demonstrate that increased expression of supragranular-enriched genes in frontal cortex layer III is concomitant with an expansion in cortico-cortical pathways projecting within the frontal cortex in humans relative to mice. We also demonstrate that the growth of the frontal cortex white matter and transcriptional profiles of supragranular-enriched genes are protracted in humans relative to mice. The expansion of projections emerging from the human frontal cortex arises by extending frontal cortical circuitry development. Integrating gene expression with neuroimaging-level phenotypes is an effective strategy to assess deviations in developmental programs leading to species differences in connections.

4:00 PM-4:10 PM
CoCoCoNet: Conserved and Comparative Co-expression Across a Diverse Set of Species
Format: Pre-recorded with live Q&A

  • John Lee, Cold Spring Harbor Laboratory, United States
  • Manthan Shah, Stanley Institute for Cognitive Genomics, Cold Spring Harbor Laboratory, United States
  • Sara Ballouz, Stanley Institute for Cognitive Genomics, Cold Spring Harbor Laboratory, United States
  • Megan Crow, Stanley Institute for Cognitive Genomics, Cold Spring Harbor Laboratory, United States
  • Jesse Gillis, Cold Spring Harbor Laboratory, United States

Co-expression analysis has provided insight into gene function in many organisms from Arabidopsis to Zebrafish. Comparison across species has the potential to enrich these results, for example, by prioritizing among candidate human disease genes based on their network properties, or by finding alternative model systems where their co-expression is conserved. Here, we present CoCoCoNet as a tool for identifying conserved gene modules and comparing co-expression networks. CoCoCoNet is a resource for both data and methods, providing gold-standard networks and sophisticated tools for on-the-fly comparative analyses across 14 species. We show how CoCoCoNet can be used in two use cases. In the first, we demonstrate deep conservation of a nucleolus gene module across very divergent organisms, and in the second, we show how the heterogeneity of autism mechanisms in humans can be broken down by functional group, and translated to model organisms. CoCoCoNet is free to use and available at https://milton.cshl.edu/CoCoCoNet/, providing users with convenient access to both data and methods for cross-species analyses, opening up a range of potential research questions relevant to evolution and comparative genomics.

4:10 PM-4:20 PM
Modeling gene expression evolution with EvoGeneX uncovers differences in evolution of species, organs and sexes
Format: Pre-recorded with live Q&A

  • Soumitra Pal, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, United States
  • Brian Oliver, Laboratory of Cellular and Developmental Biology, National Institute of Diabetes and Digestive and Kidney Diseases, United States
  • Teresa Przytycka, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, United States

While DNA sequence evolution is well-studied, an equally important factor, evolution of gene expression, is yet to be fully understood. The availability of recent tissue/organ-specific expression datasets spanning several organisms across the tree of life, including our new data from Drosophila, has enabled detailed studies of expression evolution.

We introduce EvoGeneX, a computational method that complements existing models for expression evolution across species using stochastic processes, maximum-likelihood estimation and hypothesis testing to differentiate three modes of evolution: 1) neutral: Brownian Motion, 2) constrained: when expression evolved toward an optimum (Ornstein-Uhlenbeck process), and 3) adaptive: when expression in different branches of the species tree evolved toward different optima. Additionally, EvoGeneX incorporates biological replicates for within-species variations. We also introduce a novel comparative analysis of evolution across tissues and sexes using Michaelis-Menten (MM) curves.
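
The difference between the neutral and constrained modes named above can be sketched with a simple Euler-Maruyama simulation of a single lineage. This is only an illustration of the two stochastic processes, not EvoGeneX itself: it has no phylogeny, replicates, or likelihood fitting, and the function name `simulate_expression` and the parameters `alpha` (pull strength), `theta` (optimum), and `sigma` (noise scale) are assumptions for the example. Setting `alpha` to zero reduces the Ornstein-Uhlenbeck update to Brownian Motion.

```python
import random


def simulate_expression(x0, theta, alpha, sigma, steps, dt=0.01, rng=None):
    """Euler-Maruyama simulation of dX = alpha * (theta - X) dt + sigma dW.

    alpha == 0 gives Brownian Motion (neutral drift of expression);
    alpha > 0 gives an Ornstein-Uhlenbeck process pulled toward theta.
    """
    rng = rng or random.Random(0)
    x = x0
    for _ in range(steps):
        noise = rng.gauss(0.0, 1.0)
        x += alpha * (theta - x) * dt + sigma * (dt ** 0.5) * noise
    return x


neutral = simulate_expression(0.0, theta=1.0, alpha=0.0, sigma=0.5, steps=2000)
constrained = simulate_expression(0.0, theta=1.0, alpha=2.0, sigma=0.5, steps=2000)
print(neutral, constrained)
```

In the adaptive mode, different branches of the species tree would use different values of `theta`, which would require simulating along a tree rather than a single lineage.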

In our simulation EvoGeneX significantly outperformed the currently available method on false discovery rate. On expression data across organs, species, and sexes of Drosophila, our generic method revealed a large fraction of constrained genes including genes constrained in all organs and sexes. Our MM-based approach revealed striking differences in evolutionary dynamics in gonads. Finally, EvoGeneX revealed compelling examples of adaptive evolution, including odor binding proteins, ribosomal proteins, and amino acid metabolism.

4:20 PM-4:40 PM
Proceedings Presentation: Phylogenetic double placement of mixed samples
Format: Pre-recorded with live Q&A

  • Siavash Mirarab, University of California San Diego, United States
  • Metin Balaban, University of California San Diego, United States

Presentation Overview:

Motivation: Consider a simple computational problem. The inputs are (i) the set of mixed reads generated from a sample that combines two organisms and (ii) separate sets of reads for several reference genomes of known origins. The goal is to find the two organisms that constitute the mixed sample. When constituents are absent from the reference set, we seek to phylogenetically position them with respect to the underlying tree of the reference species. This simple yet fundamental problem (which we call phylogenetic double-placement) has enjoyed surprisingly little attention in the literature. As genome skimming (low-pass sequencing of genomes at low coverage, precluding assembly) becomes more prevalent, this problem finds wide-ranging applications in areas as varied as biodiversity research, food production and provenance, and evolutionary reconstruction.
Results: We introduce a model that relates distances between a mixed sample and reference species to the distances between constituents and reference species. Our model is based on Jaccard indices computed between samples represented as k-mer sets. The model, built on several assumptions and approximations, allows us to formalize the phylogenetic double-placement problem as a non-convex optimization problem that deconvolutes mixture distances and performs phylogenetic placement simultaneously. Using a variety of techniques, we are able to solve this optimization problem numerically. We test the resulting method, called MISA, on a varied set of simulated and biological datasets. Despite all the assumptions used, the method performs remarkably well in practice.
Availability: The software and data are available at https://github.com/balabanmetin/misa.
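
For readers unfamiliar with the quantity the model is built on, the following minimal sketch computes a Jaccard index between two k-mer sets. It is not MISA's implementation. The helper names (`kmer_set`, `jaccard`), the toy sequences, and the treatment of a mixed sample as the plain union of its constituents' k-mer sets are assumptions made for illustration. Real genome-skim data would also involve sequencing error, uneven coverage, and canonical (strand-independent) k-mers, all of which are ignored here.

```python
def kmer_set(sequence, k):
    """Return the set of all length-k substrings (k-mers) of a sequence."""
    seq = sequence.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B|; defined as 0.0 when both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


if __name__ == "__main__":
    k = 3
    # Hypothetical toy sequences standing in for two constituent organisms.
    organism_a = kmer_set("ACGTACGTGG", k)
    organism_b = kmer_set("TTGACCAGTA", k)

    # Simplifying assumption: the mixed sample contains every k-mer of both constituents.
    mixed = organism_a | organism_b

    print("J(mixed, A) =", jaccard(mixed, organism_a))
    print("J(mixed, B) =", jaccard(mixed, organism_b))
    print("J(A, B)     =", jaccard(organism_a, organism_b))
```

Under the union assumption, the index between the mixture and a constituent is simply the fraction of the mixture's k-mers that the constituent contributes. This shows why mixture-to-reference similarities cannot be read directly as similarities to either constituent and need to be deconvoluted.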

5:00 PM-5:20 PM
Proceedings Presentation: Sampling and Summarizing Transmission Trees with Multi-strain Infections
Format: Pre-recorded with live Q&A

  • Mohammed El-Kebir, University of Illinois at Urbana-Champaign, United States
  • Palash Sashittal, University of Illinois at Urbana-Champaign, United States

Presentation Overview:

Motivation: The combination of genomic and epidemiological data holds the potential to enable accurate pathogen transmission history inference. However, the inference of outbreak transmission histories remains challenging due to various factors such as within-host pathogen diversity and multi-strain infections. Current computational methods ignore within-host diversity and/or multi-strain infections, often failing to accurately infer the transmission history. Thus, there is a need for efficient computational methods for transmission tree inference that accommodate the complexities of real data.
Results: We formulate the Direct Transmission Inference (DTI) problem for inferring transmission trees that support multi-strain infections given a timed phylogeny and additional epidemiological data. We establish hardness for the decision and counting versions of the DTI problem. We introduce TiTUS, a method that uses SATISFIABILITY to almost uniformly sample from the space of transmission trees. We introduce criteria that prioritize parsimonious transmission trees, which we subsequently summarize using a novel consensus tree approach. We demonstrate TiTUS’s ability to accurately reconstruct transmission trees on simulated data as well as a documented HIV transmission chain.
Availability: https://github.com/elkebir-group/TiTUS

5:20 PM-5:30 PM
An integrative computational evolutionary approach to understand the protein repertoire in bacterial pathogens
Format: Pre-recorded with live Q&A

  • Janani Ravi, Michigan State University, United States
  • Samuel Chen, Michigan State University, United States
  • Karn Jongnarangsin, Michigan State University, United States
  • Lauren Sosinski, Michigan State University, United States

Presentation Overview:

Evolutionary relationships, further refined using structural and functional information, can provide vital clues about pathogenic proteins and operons. However, the data required to find these relationships are diverse and reside in disconnected web-resources, requiring the arduous task of piecing them together coherently. We have developed comprehensive and systematic computational evolutionary approaches that leverage sequence-structure-function relationships and comparative pathogenomics to identify novel molecular targets that can enable and guide better prevention, diagnostic, and treatment regimens. Our framework goes beyond simple sequence comparisons by delving into constituent domains, domain architectures, genomic neighborhoods, and pangenomes. These analyses pinpoint molecular genomic features that are unique to pathogenic bacteria, which then help prioritize candidate molecular targets in poorly characterized pathogenic genomes. To demonstrate the versatility of this framework, we are currently applying these workflows to Nontuberculous Mycobacteria (NTM), Staphylococcus aureus, and Bacillus anthracis that are zoonotic pathogens causing severe and chronic pathologies in humans and animals. We are implementing these approaches as open-source software and web-based applications that will enable us, and the scientific community, to prioritize candidate genetic factors for experimental validation. Our predictions will illuminate fundamental mechanisms such as the evolution of host-specificity and metabolic differences between environmental and pathogenic bacteria.

5:30 PM-5:40 PM
Reading the book of life: the language of proteins
Format: Pre-recorded with live Q&A

  • Malay Basu, University of Alabama, Birmingham, United States

Presentation Overview:

Genomes are remarkably similar to natural language texts. From an information theory perspective, we can think of amino acid residues as letters, protein domains as words, and proteins as sentences consisting of ordered arrangements of protein domains (domain architectures). This work describes our recent efforts towards understanding the linguistic properties of genomes.

Our recent work showed that the complexity of “grammars” in all major branches of life is close to a universal constant of ~1.2 bits. This is remarkably similar to natural languages, where such an as-yet-unexplained universal information gain has been observed and is generally used to determine whether a series of symbols represents a language. Here, we describe the implications of this finding and its extension to various areas, with a particular emphasis on measuring proteome complexities in human tissues.

Our work established the similarity between natural languages and genomes, showed for the first time that there exists a “quasi-universal grammar” of protein domains, and measured the minimal complexity of the proteome required for a functional cell. We also describe the proteome complexities in human tissues and their functional significance.

5:40 PM-5:50 PM
What is the structure of the ‘evolutionary model space’ for proteins?
Format: Pre-recorded with live Q&A

  • Edward Braun, University of Florida, United States
  • Akanksha Pandy, University of Florida, United States
  • Gabrielle Scolaro, University of Florida, United States
  • Matthew Chang, University of Florida, United States
  • Emily Gordon, University of Florida, United States

Presentation Overview:

Estimates of amino acid exchangeabilities are central to models of protein evolution; there have been many efforts to estimate those parameters using large numbers of proteins. Although models trained in this way can be useful for phylogenetic analyses, they provide limited information about the process of protein evolution. Recent studies have revealed that patterns of protein evolution (assessed using amino acid exchangeability parameters) vary across the tree of life; in other words, the processes underlying protein evolution are non-homogeneous. However, optimizing a 20-state non-homogeneous model requires the estimation of many free parameters, which makes it a challenging computational problem. There are two straightforward ways to simplify this problem: 1) estimate parameters for the best-approximating time-reversible model using restricted taxon sets; or 2) reduce the number of free parameters by constraining the model using biochemical information. Different protein structural environments are also associated with distinct amino acid substitution patterns; this results in a mixture of underlying models that also differ among taxa. Efforts to use the approaches described above, in combination with structural partitioning, to understand the space of protein evolutionary models will be described.

5:50 PM-6:00 PM
On quantifying evolutionary importance of protein sites: A tale of two measures
Format: Pre-recorded with live Q&A

  • Yu Xia, McGill University, Canada
  • Avital Sharir-Ivry, McGill University, Israel

Presentation Overview:

A key challenge in evolutionary biology is the quantification of selective pressure on proteins and other biological macromolecules at single-site resolution. The evolutionary importance of a protein site under purifying selection is typically measured by the degree of conservation of the protein site itself. A possible alternative measure is the strength of the site-induced conservation gradient in the rest of the protein structure. Here, we show that despite major differences, there is a linear relationship between the two measures such that more conserved protein sites also induce a stronger conservation gradient in the protein structure. This linear relationship is universal, as it holds for different types of proteins and functional sites. Our results show that, in general, the selective pressure acting on a functional site percolates through the rest of the protein via residue-residue contacts. Surprisingly, however, catalytic sites in enzymes are the principal exception to this rule. Catalytic sites induce significantly stronger conservation gradients in the rest of the protein than expected from the degree of conservation of the site alone. The uniquely stringent requirement for the active site to selectively stabilize the transition state of the catalyzed chemical reaction imposes additional selective constraints on the rest of the enzyme.