Schedule subject to change
All times listed are in EDT
Monday, July 15th
10:40-11:00
Proceedings Presentation: Median and Small Parsimony Problems on RNA trees
Confirmed Presenter: Bertrand Marchand, Université de Sherbrooke, Canada

Room: 518
Format: In Person

Moderator(s): Lars Arvestad


Authors List:

  • Bertrand Marchand, Université de Sherbrooke, Canada
  • Yoann Anselmetti, University of Sherbrooke, Canada
  • Manuel Lafond, Université de Sherbrooke, Canada
  • Aida Ouangraoua, Université de Sherbrooke, Canada

Presentation Overview:

Motivation:
Non-coding RNAs (ncRNAs) express their functions by adopting molecular structures. Specifically, RNA secondary structures serve as a relatively stable intermediate step before tertiary structures, offering a reliable signature of molecular function. Consequently, within an RNA functional family, secondary structures are generally more evolutionarily conserved than sequences. Conversely, homologous RNA families grouped within an RNA clan share ancestors but typically exhibit structural differences. Inferring the evolution of RNA structures within RNA families and clans is crucial for gaining insights into functional adaptations over time and providing clues about the Ancient RNA World Hypothesis.
Results:
We introduce the median problem and the small parsimony problem for ncRNA families, where secondary structures are represented as leaf-labelled trees. We utilize the Robinson-Foulds (RF) tree distance, which corresponds to a specific edit distance between RNA trees, and a new metric called the Internal-Leafset (IL) distance. While the RF tree distance compares sets of leaves descending from internal nodes of two RNA trees, the IL distance compares the collection of leaf-children of internal nodes. The latter is better at capturing differences in structural elements of RNAs than the RF distance, which is more focused on base pairs. We study the theoretical complexity of the median problem and the small parsimony problem under the two distance metrics and various biologically-relevant constraints, and we present polynomial-time maximum parsimony algorithms for solving some versions of the problems. Our algorithms are applied to ncRNA families from the RFAM database, illustrating their practical utility.
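The contrast between the two distances can be illustrated with a minimal sketch. It assumes trees are written as nested Python tuples with string leaves, uses the rooted cluster form of the RF distance, and takes the IL distance to be the size of the symmetric difference between the multisets of leaf-children sets. These encodings and the helper names are illustrative assumptions, not the authors' definitions or implementation.

```python
from collections import Counter

# Assumption: a tree is a nested tuple; a leaf is a string label.

def leaves(node):
    """Return the set of leaf labels below a node."""
    if isinstance(node, str):
        return frozenset([node])
    return frozenset().union(*(leaves(child) for child in node))

def clusters(tree):
    """Leaf sets descending from each internal node (compared by RF)."""
    found = set()
    def walk(node):
        if isinstance(node, str):
            return
        found.add(leaves(node))
        for child in node:
            walk(child)
    walk(tree)
    return found

def leaf_children(tree):
    """Multiset of leaf-children sets, one per internal node (compared by IL)."""
    found = Counter()
    def walk(node):
        if isinstance(node, str):
            return
        found[frozenset(c for c in node if isinstance(c, str))] += 1
        for child in node:
            walk(child)
    walk(tree)
    return found

def rf_distance(t1, t2):
    """Number of clusters present in exactly one of the two trees."""
    return len(clusters(t1) ^ clusters(t2))

def il_distance(t1, t2):
    """Size of the multiset symmetric difference of leaf-children sets."""
    a, b = leaf_children(t1), leaf_children(t2)
    return sum(((a - b) + (b - a)).values())

flat = ("a", ("b", "c"), "d")
nested = (("a", "d"), ("b", "c"))
print(rf_distance(flat, nested), il_distance(flat, nested))
```

In this toy pair, both distances register the single extra internal node that groups "a" and "d"; on larger trees the two measures can diverge because they compare different features of each internal node.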

11:10-11:20
Inferring transcript phylogenies based on precomputed groups of conserved transcripts
Confirmed Presenter: Wend Yam Donald Davy Ouedraogo, Université de Sherbrooke, Canada

Room: 518
Format: In Person

Moderator(s): Lars Arvestad


Authors List:

  • Wend Yam Donald Davy Ouedraogo, Université de Sherbrooke, Canada
  • Aida Ouangraoua, Université de Sherbrooke, Canada

Presentation Overview:

Alternative Splicing (AS) is a mechanism in eukaryotic gene expression by which different combinations of introns are spliced to produce distinct transcript isoforms from a gene. Recent studies have highlighted that the transcript isoforms of human genes are often conserved in orthologous genes from various species. The conserved transcripts are referred to as transcript orthologs, and the identification of transcript ortholog groups provides valuable insights for studying their functions. Exploring the evolutionary histories of transcripts enhances our understanding of their protein functions and their origins. It also allows us to better understand the role of alternative splicing in transcript evolution.
In previous work (DOI: 10.1007/978-3-031-36911-7_2), we addressed the problem of inferring orthology and paralogy relations at the transcript level. In this work, we focus on the reconstruction of transcript evolutionary histories. We present a progressive supertree construction algorithm that relies on a dynamic programming approach to infer a transcript phylogeny based on precomputed clusters of orthologous transcripts.
We applied our algorithm to transcripts from simulated gene families, as well as to two case studies involving the transcripts of real gene families—specifically, the TAF6 and PAX6 gene families from the Ensembl-Compara database. The results align with those of previous studies aimed at reconstructing transcript phylogenies, while improving the computing time. The results also show that accurate transcript phylogenies can be obtained by first inferring accurately the pairwise homology relationships among transcripts and then using the latter to compute a phylogeny that agrees with the homology relationships.

11:20-11:40
A Representation for Phylogenetic Trees and Networks
Confirmed Presenter: Louxin Zhang, National University of Singapore, Singapore

Room: 518
Format: In Person

Moderator(s): Lars Arvestad


Authors List:

  • Cedric Chauve, Simon Fraser University, Canada
  • Caroline Colijn, Simon Fraser University, Canada
  • Louxin Zhang, National University of Singapore, Singapore

Presentation Overview:

Good representations for phylogenetic trees and networks are important for human-computer interfaces and for the implementation of scalable heuristic methods for inferring the evolution of genes, genomes and species. We present a new representation for phylogenetic trees. It maps every binary tree on n taxa to a string of taxa in which each taxon appears exactly twice. The new representation is i) shorter than the Newick format, ii) bijective in the space of phylogenetic trees and iii) easy to use for recovering tree edges. Using this new format, we introduce a tree operation that enables traversal of tree space in at most 2n steps and a new metric for tree comparison that is computable in linear time and correlates with the Subtree Prune and Regraft distance better than the Robinson-Foulds distance does. The new representation can be further generalized to the so-called tree-child networks.

11:40-12:00
Accurate, scalable, and fully automated inference of species trees from raw genome assemblies using ROADIES
Confirmed Presenter: Anshu Gupta, Department of Computer Science and Engineering, University of California San Diego, United States

Room: 518
Format: Live Stream

Moderator(s): Lars Arvestad


Authors List:

  • Anshu Gupta, Department of Computer Science and Engineering, University of California San Diego, United States
  • Siavash Mirarab, Department of Electrical and Computer Engineering, University of California San Diego, United States
  • Yatish Turakhia, Department of Electrical and Computer Engineering, University of California San Diego, United States

Presentation Overview:

Species tree inference is crucial in advancing our understanding of evolutionary relationships of life on Earth and has immense significance for diverse biological and medical applications. Extensive genome sequencing efforts are currently in progress across a broad spectrum of life forms, unraveling intricate branching patterns within the tree of life. However, estimating species trees starting from raw genome sequences is quite challenging, and the current cutting-edge methodologies require a series of error-prone steps involving gene annotations, orthology inference, and accounting for gene tree discordances, which are neither entirely automated nor standardized and require substantial human intervention. Therefore, we present ROADIES, a novel pipeline for species tree inference from raw genome assemblies that is fully automated, easy to use, scalable, free from reference bias, and provides flexibility to adjust the tradeoff between accuracy and runtime. The ROADIES pipeline eliminates the need to align whole genomes, choose a single reference species, or pre-select loci such as functional genes found using cumbersome annotation steps. Moreover, it leverages recent advances in phylogenetic inference to allow multi-copy genes, eliminating the need to detect orthology. Using genomic datasets released from large-scale sequencing consortia (Birds 10K Genome Project, Zoonomia) across three diverse life forms (placental mammals, pomace flies, and birds), ROADIES infers species trees that are comparable in quality with the state-of-the-art approaches while achieving >100x speedup compared to the conventional pipelines. ROADIES supports various modes of operation and is expected to improve the accuracy, speed, scalability, and reproducibility of phylogenomic analyses.

12:00-12:20
Generalized c/µ Ratio Test for Detecting Molecular Adaptation: Beyond the conventional Ka/Ks Ratio test without Assuming Synonymous Site Neutrality or Limitation to Translated Regions
Confirmed Presenter: Chun Wu, Rowan University, United States

Room: 518
Format: In Person

Moderator(s): Lars Arvestad


Authors List:

  • Chun Wu, Rowan University, United States
  • Nicholas Paradis, Rowan University, United States

Presentation Overview:

The 60-year debate in evolutionary biology over "neutralist-selectionist" views demands a robust method to measure fitness changes due to mutations, yet the Ka/Ks ratio test, despite its prominence, has significant limitations. This test, which assesses fitness changes in the genome's Translated Region (TR) based on non-synonymous (Ka) and synonymous (Ks) substitution rates, presupposes the neutrality of synonymous mutations, a notion increasingly challenged by evidence highlighting their non-neutral impacts in replication, transcription, and translation processes. Our previous work (Comp in Bio and Med 153 (2023) 106522) introduced the relative substitution rate c/µ test (c: a mean value of Ka and Ks; µ: mutation rate) as a versatile alternative, offering a broader application without assuming synonymous site neutrality. This paper derives a general equation linking c/µ with Ka/µ, Ks/µ, and Ka/Ks, demonstrating c/µ's superior accuracy in quantifying fitness changes across both TR and UTR. Through a comparative analysis of the c/µ and Ka/Ks tests across 10 genes, 11 UTRs, and significant SARS-CoV-2 mutations, using three independent genomic datasets from December 2019 to July 2021, we validate our molecular adaptation predictions with activity data from existing literature. Our findings advocate for the c/µ test as a more effective tool than the traditional Ka/Ks test, potentially resolving the longstanding debate in evolutionary biology by accommodating non-neutral effects at synonymous sites and extending applicability beyond the TR. This method was applied to over 2000 viruses with at least 50 genome sequences each; the preliminary results will be discussed.

AlphaHOGs, a protein structure-based reference classification to improve orthology inference
Confirmed Presenter: Christophe Dessimoz, University of Lausanne, Switzerland

Room: 518
Format: In Person

Moderator(s): Lars Arvestad


Authors List:

  • Stefano Pascarelli, University of Lausanne, Switzerland
  • Christophe Dessimoz, University of Lausanne, Switzerland

Presentation Overview:

The increasing availability of genomic sequences is driving forward our understanding of the diverse life forms on Earth. However, the ability to generalize organism-specific knowledge is limited by how well we relate genomes to each other. Genomes can be compared in terms of orthologous genes — genes of different species that derive from a single gene in the last common ancestor. Currently, the majority of orthology prediction software is based on the amino acid sequence, the most abundant information about proteins. However, the sequence signal of distantly related genes is weak, distributed in saturated positions, and confounded by evolutionary forces. In this work, I show how the more conserved protein 3D-structure can improve orthology prediction. I devised a method that combines sequence k-mers with k-mers generated from a local structural alphabet. The enriched sets of k-mers can be used to generate a reference classification of proteins into Hierarchical Orthologous Groups (HOGs) — a coarse-grained representation of protein families. The structure-informed reference HOGs, here named AlphaHOGs, can be exploited to infer orthology in thousands of proteomes, by using the recently developed software FastOMA. As a test case, we reconstruct the ancestral genome of the first multicellular animal with an unprecedented resolution, paving the way to higher-level analyses such as the ancestral gene content and protein interaction network, potentially shedding light on the current uncertainties about the origin of the animal lineage.

14:20-14:40
Proceedings Presentation: Joint inference of cell lineage and mitochondrial evolution from single-cell sequencing data
Confirmed Presenter: Viola Chen, Princeton University, United States

Room: 518
Format: In Person

Moderator(s): Giltae Song


Authors List:

  • Palash Sashittal, University of Illinois at Urbana-Champaign, United States
  • Viola Chen, Princeton University, United States
  • Amey Pasarkar, Princeton University, United States
  • Ben Raphael, Princeton University, United States

Presentation Overview:

Eukaryotic cells contain organelles called mitochondria that have their own genome. Most cells contain thousands of mitochondria which replicate, even in non-dividing cells, by means of a relatively error-prone process resulting in somatic mutations in their genome. Because of the higher mutation rate compared to the nuclear genome, mitochondrial mutations have been used to track cellular lineage, particularly using single-cell sequencing that measures mitochondrial mutations in individual cells. However, existing methods to infer the cell lineage tree from mitochondrial mutations do not model heteroplasmy, which is the presence of multiple mitochondrial clones with distinct sets of mutations in an individual cell. Single-cell sequencing data thus provides a mixture of the mitochondrial clones in individual cells, with the ancestral relationships between these clones described by a mitochondrial clone tree that must be concordant with the cell lineage tree. We formalize the problem of inferring a concordant pair of a mitochondrial clone tree and a cell lineage tree from single-cell sequencing data as the NESTED PERFECT PHYLOGENY MIXTURE (NPPM) problem. We derive an algorithm, MERLIN, to solve the NPPM problem. We show on simulated data that MERLIN outperforms existing methods that model neither mitochondrial heteroplasmy nor the concordance between the mitochondrial clone tree and the cell lineage tree. We use MERLIN to analyze single-cell whole genome sequencing data of 5220 cells of a gastric cancer cell line and show that MERLIN infers a more biologically plausible cell lineage tree and mitochondrial clone tree compared to existing methods.

14:50-15:00
Tracking tumorigenesis and the transition state through copy number variation-based pseudotime
Confirmed Presenter: Jonghyun Lee, National Cancer Center, South Korea

Room: 518
Format: In Person

Moderator(s): Giltae Song


Authors List:

  • Jonghyun Lee, National Cancer Center, South Korea
  • Najung Lim, National Cancer Center, South Korea
  • Dongkwan Shin, National Cancer Center, South Korea

Presentation Overview:

Driver mutations for different cancer types are extensively categorized. However, there appears to be no fixed number of driver mutations that guarantee the transition from normal to cancer cells. Therefore, we hypothesized that there are ranges of genetic alterations where both normal and cancer cells coexist, where two types of cells share similar genetic backgrounds while exhibiting vastly different phenotypes. By leveraging the technical advances of single-cell sequencing that captures the characteristics of thousands of cells, we sought to identify cells that belong to the transition state, where the tumor and normal cells share similar genetic alterations.
One of the well-established factors regarding the genetic changes during tumorigenesis is the accumulation of aneuploidy. Copy number variation (CNV) inference algorithms such as CopyKat and SCEVAN deduce aneuploidy from single-cell expression data. We utilized the accumulation of CNV to construct a pseudotime that describes the genetic changes during tumorigenesis.
We found that there is indeed genetic background overlap between the tumor and the normal epithelial cells of breast cancer, in both patient data and a mouse model. Cells within the transition state appear to share CNV events, which are represented by an NJ-based tree. These transition cells are also located between tumor and normal cell clusters in expression space. This result demonstrates that sufficient sampling can identify pre-malignant normal cells with mutation profiles similar to those of tumor cells, which may aid in early detection and prevention of oncogenesis.
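One simple way to turn CNV accumulation into an ordering of cells is sketched below. It assumes each cell already has an inferred integer copy number per genomic bin, scores a cell by the fraction of bins that deviate from the neutral state, and rescales the scores to the range 0 to 1. The abstract does not specify how the pseudotime is built, so the scoring rule, the input format and the cell names are all hypothetical.

```python
def cnv_burden(profile, neutral=2):
    """Fraction of genomic bins whose copy number differs from neutral."""
    return sum(1 for cn in profile if cn != neutral) / len(profile)

def cnv_pseudotime(cells, neutral=2):
    """Rescale per-cell CNV burden to [0, 1] and use it as a pseudotime."""
    burdens = {name: cnv_burden(p, neutral) for name, p in cells.items()}
    low, high = min(burdens.values()), max(burdens.values())
    span = (high - low) or 1.0  # avoid division by zero when all cells agree
    return {name: (b - low) / span for name, b in burdens.items()}

# Hypothetical toy profiles (copy number per bin), not real data.
cells = {
    "normal_like": [2, 2, 2, 2],
    "transition_like": [2, 3, 2, 2],
    "tumor_like": [1, 3, 2, 4],
}
pseudotime = cnv_pseudotime(cells)
for name in sorted(pseudotime, key=pseudotime.get):
    print(name, round(pseudotime[name], 3))
```

Under this toy scoring, cells with intermediate values sit between the normal-like and tumor-like ends of the ordering, which is where transition-state candidates would be sought.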

15:00-15:20
SPICE: Probabilistic reconstruction of copy-number evolution in cancer
Confirmed Presenter: Abigail Bunkum, University College London Cancer Institute, United Kingdom

Room: 518
Format: In Person

Moderator(s): Giltae Song


Authors List:

  • Abigail Bunkum, University College London Cancer Institute, United Kingdom
  • Simone Zaccaria, University College London Cancer Institute, United Kingdom

Presentation Overview:

Somatic copy number alterations (SCNAs) are frequent genetic alterations that accumulate in tumour cells during cancer evolution and amplify or delete large genomic regions. SCNAs are implicated in driving cancer progression, providing cancer cells with the ability to metastasise or resist treatment. Therefore, cancer sequencing studies aim to reconstruct the evolutionary history of SCNAs to investigate their role in cancer progression. Whilst several related phylogenetic methods have been introduced, these methods rely on the reconstruction of a single tumour phylogeny explaining SCNA evolution, discarding the innate uncertainty of this complex problem. In fact, modelling SCNA evolution is challenging and many different explanations for SCNA evolution are equally plausible. Therefore, reconstructing a single phylogeny might hinder the ability to accurately characterise SCNA evolution.

In this work, we introduce SPICE (Subclone Probability Inference of Copy-number Evolution), a novel algorithm that enumerates equally plausible explanations of SCNA evolution, enabling the estimation of the probabilities of SCNA events. We show, using a novel, realistic simulation framework, that SPICE outperforms previous methods on simulated datasets by combining multiple inferred phylogenies. To highlight the impact of our method, we applied SPICE to 49 bulk samples from metastatic prostate cancers to detect the presence of well-known driver cancer genes that appear to be recurrently affected by similar events in the same tumour, providing evidence for parallel evolution. Finally, we leverage information regarding the uncertainty of inferred phylogenetic topologies to identify novel metastatic migration patterns and characterise the probability of migrations between different tumour sites.
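The step from enumerated explanations to event probabilities can be illustrated with a minimal sketch. It assumes each equally plausible explanation has been reduced to the set of SCNA events it contains, and estimates an event's probability as the fraction of explanations that include it. The event labels and the equal-weight averaging are assumptions made for illustration; this is not the SPICE algorithm itself.

```python
from collections import Counter

def event_probabilities(solutions):
    """Fraction of equally weighted solutions that contain each event."""
    counts = Counter()
    for events in solutions:
        counts.update(set(events))  # count each event once per solution
    total = len(solutions)
    return {event: count / total for event, count in counts.items()}

# Hypothetical event sets from four equally plausible explanations.
solutions = [
    {"gain_A", "loss_B"},
    {"gain_A"},
    {"gain_A", "loss_C"},
    {"gain_A", "loss_B"},
]
for event, prob in sorted(event_probabilities(solutions).items()):
    print(event, prob)
```

An event present in every explanation gets probability 1.0, while events that appear only in some explanations get proportionally lower values, which is the kind of uncertainty a single reconstructed phylogeny would hide.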

15:20-15:40
Uncovering Cancer's Fitness Landscape
Confirmed Presenter: Meaghan Parks, Case Western Reserve University, United States

Room: 518
Format: In Person

Moderator(s): Giltae Song


Authors List:

  • Meaghan Parks, Case Western Reserve University, United States

Presentation Overview:

CRISPR-based genome editing technologies have enabled massively-parallel genomic screens, such as DepMap, a Broad Consortium effort to catalog gene knockouts in cancer cell lines. These projects find that the growth effects of a mutation depend heavily on the background genotype of a cell. Evolutionary theory has studied the effects of background genotype on mutations for generations and has uncovered general patterns across the tree of life. These patterns found in evolving populations have culminated in a ‘Geometric Model’ of adaptation that has successfully predicted the effects of novel combinations of mutations in yeast and E. coli. This model could in principle be applied to DepMap and other massively-parallel genomic screens to learn genotype to phenotype to fitness mappings and potentially predict the evolution of a population. Fitting this model to large-scale real data, however, is challenging because the model infers a latent (hidden) space of phenotypes with mathematical symmetries which confuse regression methods. Here, we present a methodology for fitting a Geometric Model of adaptation to large-scale genomic screens that is guaranteed to converge to a single inferred background genotype for any mutant. This methodology eliminates rotational, translational, and permutation symmetries in the inferred phenotype space and successfully reconstructs genotype to phenotype to fitness mappings of simulated cancer cell line knockout data. This makes comprehensive quantitative models of genotype to phenotype to fitness mappings possible in a multitude of diseases, which in turn will allow us to infer phenotypic complexity and predict treatment response.

15:40-16:00
Measuring pseudogenes' kinship to unravel overlooked evolutionary patterns
Confirmed Presenter: Valeriia Vasylieva, Sherbrooke University, Canada

Room: 518
Format: In Person

Moderator(s): Giltae Song


Authors List:

  • Valeriia Vasylieva, Sherbrooke University, Canada
  • Marie A. Brunet, Sherbrooke University, Canada

Presentation Overview:

Pseudogenes are defined as copies of protein-coding genes that have lost their ability to encode proteins and are functionless elements of our genomes. Yet, thousands are transcribed, and hundreds encode proteins. These discoveries question the definition of pseudogenes and their roles in our genome. To unravel the contribution of pseudogenes to the evolution of our genomes, we need to understand their origin. However, identification of the parental transcript and parental gene of pseudogenes is complex and currently incomplete. The PsiCube database is the most up-to-date reference for pseudogene-parental gene annotation in humans, yet it only references parental genes for 48% (8,225) of the pseudogenes currently annotated in Ensembl (17,349).
Here, we present a method based on the Mash distance, commonly used in metagenomics approaches, to measure kinships between transcripts of pseudogenes and protein-coding genes. Our strategy outperforms PsiCube, without any significant biases for sequence length, complexity, or GC content. We applied our method to unravel the evolutionary relationships between GAPDH and its pseudogenes. Mash distance was able to confidently separate unrelated sequences from the GAPDH paralog and from the parental GAPDH. Interestingly, our methodology highlighted pseudogenes with other pseudogenes as their closest related sequence. We expanded our Mash distance analysis to the whole human genome and identified pseudogenes arising from other pseudogenes amongst many gene families, including in loci associated with diseases.
Our work highlights an overlooked mechanism of gene evolution where pseudogenes can arise from existing pseudogenes and contribute to the diversity and evolution of our genomes.
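The Mash distance mentioned above is derived from a MinHash estimate j of the Jaccard index between two k-mer sets, as D = -(1/k) ln(2j / (1 + j)). A minimal sketch follows. It assumes SHA-1 from the Python standard library as the hash function and ignores reverse complements, both simplifications relative to the Mash tool; the k-mer size, sketch size and example sequences are arbitrary illustrative choices, not the settings used in this work.

```python
import hashlib
import math

def kmer_hashes(seq, k):
    """Hash every k-mer of a sequence to an integer."""
    return {int(hashlib.sha1(seq[i:i + k].encode()).hexdigest(), 16)
            for i in range(len(seq) - k + 1)}

def sketch(seq, k, size):
    """Bottom-s sketch: the `size` smallest k-mer hashes."""
    return set(sorted(kmer_hashes(seq, k))[:size])

def mash_distance(seq_a, seq_b, k=5, size=100):
    """Mash-style distance from a MinHash estimate of the Jaccard index."""
    a, b = sketch(seq_a, k, size), sketch(seq_b, k, size)
    merged = sorted(a | b)[:size]  # smallest hashes of the merged sketches
    if not merged:
        return 1.0
    shared = sum(1 for h in merged if h in a and h in b)
    j = shared / len(merged)
    if j == 0:
        return 1.0  # no shared k-mers: maximal distance by convention
    return max(0.0, -math.log(2 * j / (1 + j)) / k)

# Hypothetical toy sequences: the second differs from the first at one base.
parent = "ACGTACGTTGCAAGCTAGCTAGGATCCA"
variant = "ACGTACGTTGCAAGGTAGCTAGGATCCA"
print(mash_distance(parent, parent))
print(mash_distance(parent, variant))
```

Because the distance depends only on shared k-mers, it needs no alignment, which is what makes all-against-all comparisons between pseudogene and protein-coding transcripts tractable.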

Pseudogenes in plasmid genomes reveal past transitions in plasmid mobility
Confirmed Presenter: Dustin Hanke, Kiel University, Germany

Room: 518
Format: In Person

Moderator(s): Giltae Song


Authors List:

  • Dustin Hanke, Kiel University, Germany
  • Yiqing Wang, Kiel University, Germany
  • Tal Dagan, Kiel University, Germany

Presentation Overview:

Evidence for gene non-functionalization due to mutational processes is found in genomes in the form of pseudogenes. Pseudogenes are known to be rare in prokaryote chromosomes, with the exception of lineages that underwent an extreme genome reduction (e.g., obligatory symbionts). Much less is known about the frequency of pseudogenes in prokaryotic plasmids; those are genetic elements that can transfer between cells and may encode beneficial traits for their host. Non-functionalization of plasmid-encoded genes may alter the plasmid characteristics, e.g., mobility, or their effect on the host. Analyzing 10,832 prokaryotic genomes, we find that plasmid genomes are characterized by threefold-higher pseudogene density compared to chromosomes. The majority of plasmid pseudogenes correspond to deteriorated transposable elements. A detailed analysis of enterobacterial plasmids furthermore reveals frequent gene non-functionalization events associated with the loss of plasmid self-transmissibility. Reconstructing the evolution of closely related plasmids reveals that non-functionalization of the conjugation machinery led to the emergence of non-mobilizable plasmid types. Examples are virulence plasmids in Escherichia and Salmonella. Our study highlights non-functionalization of core plasmid mobility functions as one route for the evolution of domesticated plasmids. Pseudogenes in plasmids supply insights into past transitions in plasmid mobility that are akin to transitions in bacterial lifestyle.

16:40-17:00
Long range segmentation of prokaryotic genomes by gene age and functionality
Confirmed Presenter: Yuri Wolf, NCBI/NLM/NIH, United States

Room: 518
Format: In Person

Moderator(s): Edward Braun


Authors List:

  • Yuri Wolf, NCBI/NLM/NIH, United States
  • Ilya Schurov, Radboud University, Netherlands
  • Kira Makarova, NCBI/NLM/NIH, United States
  • Mikhail Katsnelson, Radboud University, Netherlands
  • Eugene Koonin, NCBI/NLM/NIH, United States

Presentation Overview:

Bacterial and archaeal genomes encompass numerous operons that typically consist of two to five genes. On larger scales, however, gene order is poorly conserved through the evolution of prokaryotes. Nevertheless, non-random localization of different classes of genes on prokaryotic chromosomes could reflect important functional and evolutionary constraints. We explored the patterns of genomic localization of evolutionarily conserved (ancient) and variable (young) genes across the diversity of bacteria and archaea. Nearly all bacterial and archaeal chromosomes were found to encompass large segments of 100-300 kilobases that were significantly enriched in either ancient or young genes. Similar clustering of genes with lethal knockout phenotype (essential genes) was observed as well. Mathematical modeling of genome evolution suggests that this long-range gene clustering in prokaryotic chromosomes reflects perpetual genome rearrangement driven by a combination of selective and neutral processes rather than evolutionary conservation.

17:00-17:20
The evolution of antibiotic resistance islands occurs within the framework of plasmid lineages
Confirmed Presenter: Yiqing Wang, Institute of General Microbiology, Kiel University, Germany

Room: 518
Format: Live Stream

Moderator(s): Edward Braun


Authors List:

  • Yiqing Wang, Institute of General Microbiology, Kiel University, Germany
  • Tal Dagan, Institute of General Microbiology, Kiel University, Germany

Presentation Overview:

Bacterial pathogens carrying multidrug resistance (MDR) plasmids are a major threat to human health. The acquisition of antibiotic resistance genes (ARGs) in plasmids is often facilitated by mobile genetic elements that copy or translocate ARGs between DNA molecules. The agglomeration of mobile elements in plasmids generates resistance islands comprising multiple ARGs. However, whether the emergence of resistance islands is restricted to specific MDR plasmid lineages remains understudied. Here we show that the agglomeration of ARGs in resistance islands is biased towards specific large plasmid lineages. Analyzing 6,784 plasmids in 2,441 Escherichia, Salmonella, and Klebsiella isolates, we quantify that 84% of the ARGs in MDR plasmids are found in resistance islands. We furthermore observe the rapid evolution of ARG combinations in resistance islands. Most regions identified as resistance islands are shared among closely related plasmids but rarely among distantly related plasmids. Our results suggest the presence of barriers to the dissemination of ARGs between plasmid lineages, which are related to plasmid genetic properties, host range, and the plasmid evolutionary history. The agglomeration of ARGs in plasmids is attributed to the workings of mobile genetic elements that operate within the framework of existing plasmid lineages.

Elucidating the Co-Evolution and Genetic Diversity of Acquired Phototrophy in Marine Worm Convolutriloba longifissura
Confirmed Presenter: Adena Collens, Invertebrate Zoology, National Museum of Natural History, Smithsonian; Biological Sciences, University of Maryland, United States

Room: 518
Format: In Person

Moderator(s): Edward Braun


Authors List:

  • Adena Collens, Invertebrate Zoology, National Museum of Natural History, Smithsonian; Biological Sciences, University of Maryland, United States
  • Allen Collins, Invertebrate Zoology, National Museum of Natural History, Smithsonian; Biological Sciences, University of Maryland, United States

Presentation Overview:

While instances of acquired phototrophy can be found across the eukaryotic tree of life, much about the evolution and maintenance of these endosymbiotic interactions remains unknown. Marine acoel worms are one such group that host photosynthetic algae (Tetraselmis sp.) within their tissues. For example, past work shows that unfed Convolutriloba retrogemma with endosymbiotic Tetraselmis algae lose less biomass in the light than in the dark, suggesting a transfer of photosynthates to the host. However, the mechanisms and likely benefits to host and alga have yet to be described. Further, genetic diversity of the alga T. convolutae (associated with the acoel Symsagittifera roscoffensis) is minimal even across diverse host genotypes and geographies, suggesting an intimate, long-term coevolution between acoel worm hosts and their algae.
Our study centers on the Convolutriloba longifissura-Tetraselmis sp. photosymbiosis, a case about which even less is known. We present the first genetic characterization of the intercellular algae using low-pass whole genome sequencing data. Using short-read Illumina data, we assembled and annotated organellar genomes from both the acoel worm and the Tetraselmis alga, as well as nuclear ribosomal repeats from the acoel worm. We conducted phylogenetic analyses using several assembled markers to elucidate the relationships of C. longifissura and Tetraselmis sp. to existing sequencing data from acoel worms and green algae, respectively. In light of these findings, we intend to expand to the comparative analysis of transcriptome data to illuminate possible indicators, mechanisms, and interactions of this likely acquired phototrophic interaction.

17:20-17:40
Quality assessment of gene repertoires with OMArk
Confirmed Presenter: Yannis Nevers, Université de Lausanne, Switzerland

Room: 518
Format: In Person

Moderator(s): Edward Braun


Authors List:

  • Yannis Nevers, Université de Lausanne, Switzerland
  • Alex Warwick Vesztrocy, Université de Lausanne, Switzerland
  • Victor Rossier, Université de Lausanne, Switzerland
  • Clément Marie Train, Université de Lausanne, Switzerland
  • Adrian Altenhoff, ETH Zurich, Switzerland
  • Christophe Dessimoz, Université de Lausanne, Switzerland
  • Natasha Glover, Swiss Institute of Bioinformatics, Switzerland

Presentation Overview:

The number and diversity of new genomes being sequenced across the world open the door to large-scale comparative genomics. Thus, reliably ensuring the quality of the protein-coding gene repertoire derived from these data before including them in an analysis is becoming more critical. State-of-the-art genome annotation assessment tools measure some aspects of this quality but are blind to errors such as gene over-prediction or contamination.
We developed OMArk, a method that relies on fast alignment comparisons to precomputed gene families from the OMA orthology database. By identifying differences between a gene annotation and the typical gene repertoires of closely related species, OMArk assesses not only the completeness, but also the consistency of the gene repertoire as a whole compared to closely related species. This includes classification of genes with no relatives in close species, with dubious gene models, or those resulting from contamination. Through this global assessment, OMArk helps point out flaws in any given annotation.
We validated OMArk’s performance on simulated data, then performed an analysis of over 8,000 proteomes from public reference databases (UniProt, Ensembl, RefSeq, etc.). We identified and confirmed cases of contamination in multiple proteomes, characterized the improvements in gene repertoire quality resulting from improvements in genome assemblies, and found evidence of systematic errors induced by annotation pipelines in certain datasets. OMArk is available on GitHub (https://github.com/DessimozLab/OMArk), as a Python package on PyPI, as a bioconda package, and as an interactive online webserver at https://omark.omabrowser.org.

17:40-18:00
Leveraging machine learning to predict antimicrobial resistance in ESKAPE pathogens
Confirmed Presenter: Ethan Wolfe, Department of Pathobiology and Diagnostic Investigation, Michigan State University, East Lansing, MI, USA, United States

Room: 518
Format: In Person

Moderator(s): Edward Braun


Authors List: Show

  • Jacob Krol, Department of Biomedical Informatics, University of Colorado Anschutz Medical Campus, Aurora, CO, USA, United States
  • Ethan Wolfe, Department of Pathobiology and Diagnostic Investigation, Michigan State University, East Lansing, MI, USA, United States
  • Evan Brenner, Department of Biomedical Informatics, University of Colorado Anschutz Medical Campus, Aurora, CO, USA, United States
  • Keenan Manpearl, Department of Biomedical Informatics, University of Colorado Anschutz Medical Campus, Aurora, CO, USA, United States
  • Joseph Burke, Department of Pathobiology and Diagnostic Investigation, Michigan State University, East Lansing, MI, USA, United States
  • Charmie Vang, Department of Biomedical Informatics, University of Colorado Anschutz Medical Campus, Aurora, CO, USA, United States
  • Vignesh Sridhar, Department of Pathobiology and Diagnostic Investigation, Michigan State University, East Lansing, MI, USA, United States
  • Jill Bilodeaux, Department of Biomedical Informatics, University of Colorado Anschutz Medical Campus, Aurora, CO, USA, United States
  • Karn Jongnarangsin, Department of Pathobiology and Diagnostic Investigation, Michigan State University, East Lansing, MI, USA, United States
  • Elliot Majlessi, Department of Pathobiology and Diagnostic Investigation, Michigan State University, East Lansing, MI, USA, United States
  • Janani Ravi, Department of Biomedical Informatics, University of Colorado Anschutz Medical Campus, Aurora, CO, USA, United States

Presentation Overview: Show

Since the clinical introduction of antibiotics in the 1940s, antimicrobial resistance (AMR) has become an increasingly dire threat to global public health. Pathogens acquire AMR much faster than we discover new drugs, warranting new methods to better understand the molecular underpinnings of AMR. Traditional approaches for detecting AMR in novel bacterial strains require time-consuming, labor-intensive assays. Here, we introduce a machine learning approach to identify AMR-associated features. We focus on six highly drug-resistant bacterial pathogens responsible for most nosocomial infections: the “ESKAPE” pathogens. We use all NCBI-PGAP-annotated ESKAPE genomes with known AMR phenotype data from the Bacterial and Viral Bioinformatics Resource Center (BV-BRC). Then, for all complete and WGS genomes for each ESKAPE species, we cluster similar genes and construct pangenomes with Panaroo. To uncover the molecular mechanisms behind drug-/drug family-specific resistance and cross-resistance, we train logistic regression and random forest models on our pangenomes, which include antibiotic resistance/susceptibility labels per genome. The models are tested rigorously to yield ranked lists of AMR-associated genes and protein domains. In addition to recapitulating known AMR genes, our models have identified novel candidates for individual and cross-resistance mechanisms that await experimental validation. Our holistic approach promises thorough, reliable prediction of existing or developing resistance in newly identified pathogen genomes, along with mechanistic molecular contributors of resistance.

Predicting pathogen preferences and host adaptation by leveraging microbial genomics and machine learning
Confirmed Presenter: Evan Brenner, University of Colorado Anschutz Medical Campus, United States

Room: 518
Format: In Person

Moderator(s): Edward Braun


Authors List: Show

  • Evan Brenner, University of Colorado Anschutz Medical Campus, United States
  • Janani Ravi, University of Colorado Anschutz Medical Campus, United States

Presentation Overview: Show

Most emerging infectious diseases (EIDs) of humans originate in animals and are transmitted through zoonotic spillover events. However, the genetic determinants underlying host adaptation or host switching are often unclear. We hypothesize that genomic markers of pathogen adaptation to different hosts are detectable and can yield valuable insights into EID pathobiology. Utilizing publicly available databases, millions of bacterial and viral genomes with metadata, including their hosts of origin, are accessible for study. To leverage these, we are training machine learning models that associate pathogen genetic elements (e.g., genes, k-mers) with host labels. Our models are simple and interpretable (e.g., decision trees), run with reasonable computational requirements, and have been tested on a sampling of phylogenetically distinct bacterial and viral pathogens.

Our preliminary results have yielded high predictive performance for bacterial and viral pathogens, and top-ranked features in these models often pinpoint genomic elements that are 1) associated with horizontal gene transfer elements, and 2) demonstrated in prior literature to play biologically relevant roles in host adaptation. We will expand these models to new species, build more complex models that incorporate additional levels of genomic information (e.g., protein domains), and begin testing performance across species or genera rather than solely within them. These advances offer promise in assessing threats to different host populations posed by new EIDs.

Tuesday, July 16th
8:40-9:00
Proceedings Presentation: Maximum Likelihood Phylogeographic Inference of Cell Motility and Cell Division from Spatial Lineage Tracing Data
Confirmed Presenter: Gary Hu, Princeton University, United States

Room: 518
Format: In Person

Moderator(s): Katharina Jahn


Authors List: Show

  • Uyen Mai, Princeton University, United States
  • Gary Hu, Princeton University, United States
  • Ben Raphael, Princeton University, United States

Presentation Overview: Show

Recently developed spatial lineage tracing technologies induce somatic mutations at specific genomic loci in growing cells and then measure these mutations in the sampled cells along with their physical locations. These technologies enable high-throughput studies of developmental processes over space and time. However, these applications rely on accurate reconstruction of a spatial cell lineage tree describing the history of both cell divisions and locations. We demonstrate that standard phylogeographic models based on Brownian motion are inadequate to describe the symmetric spatial displacement of cells during cell division. We introduce a new model for cell motility that includes symmetric displacements of daughter cells from the parental cell followed by independent diffusion of daughter cells. We show that this model more accurately describes the locations of cells in a real spatial lineage tracing of Drosophila melanogaster embryos. Combining the spatial model with an evolutionary model of DNA mutations, we obtain a comprehensive model for spatial lineage tracing, namely spalin. Using this model, we estimate time-resolved branch lengths, spatial diffusion rate, and mutation rate. On both simulated and real data, we show that the proposed method accurately estimates all parameters while the Brownian motion model overestimates spatial diffusion rate in all test cases. In addition, the inclusion of spatial information improves accuracy of branch length estimation compared to sequence data alone, suggesting that augmenting lineage tracing technologies with spatial information is useful to overcome the limitations of genome-editing in developmental systems.
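
The two-step motility idea described above (a symmetric displacement at division, then independent diffusion) can be illustrated with a minimal Python sketch. This is not the spalin implementation: the function name `divide`, the parameters `disp_sd` and `diff_sd`, the 2D setting, and the choice of Gaussian displacements are all assumptions made for illustration.

```python
import random


def divide(parent, disp_sd, diff_sd, rng):
    """Place the two daughters of a 2D parent cell (illustrative only).

    Step 1: daughters are displaced symmetrically, to parent + delta and
    parent - delta. Step 2: each daughter then diffuses independently.
    All parameter names and the Gaussian choice are hypothetical.
    """
    delta = (rng.gauss(0, disp_sd), rng.gauss(0, disp_sd))
    daughters = []
    for sign in (1, -1):
        x = parent[0] + sign * delta[0] + rng.gauss(0, diff_sd)
        y = parent[1] + sign * delta[1] + rng.gauss(0, diff_sd)
        daughters.append((x, y))
    return daughters


# With diffusion switched off, the daughters sit symmetrically around the parent.
rng = random.Random(0)
d1, d2 = divide((0.0, 0.0), disp_sd=1.0, diff_sd=0.0, rng=rng)
```

In a pure Brownian model the two daughters would instead move independently from the parental position, so their positions right after division would not be anti-correlated as they are here.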

9:10-9:20
Interpretable variational encoding of genotypes identifies comprehensive clonality and lineages in single cells geometrically
Confirmed Presenter: Hoi Man Chung, The University of Hong Kong, Hong Kong

Room: 518
Format: In Person

Moderator(s): Katharina Jahn


Authors List: Show

  • Hoi Man Chung, The University of Hong Kong, Hong Kong
  • Yuanhua Huang, University of Hong Kong, Hong Kong

Presentation Overview: Show

Despite the wide accessibility of genetic information in multiple omics assays, analyzing single-cell genomics remains a challenge due to its diverse high-dimensional macrostructures and many missing signals. For the sake of numerical convergence in diverse macrostructures, existing statistical methods often impose strong constraints on the form of predicted mutation patterns, and therefore easily identify underfitted or overfitted local or global optima that are biologically incomprehensible in complex contexts. To solve this problem, we developed SNPmanifold, a Python package that detects flexible mutation patterns with a shallow binomial variational autoencoder and UMAP (schematic shown in Figure 1). After reducing the allele count matrix to a lower-dimensional latent space, SNPmanifold performs three downstream analyses on the genomic geometrical manifold: 1. Clustering of cells with similar genotypes, 2. Ranking of important SNPs, and 3. Phylogenetic tree construction. Based on nuclear or mitochondrial variants, we demonstrated that SNPmanifold can effectively identify a large number of multiplexed donors of origin (k = 18), a setting in which all existing methods fail, as well as lineages of somatic clones with promising biological interpretation (detailed results of an example dataset shown in Figure 2). Compared to existing methods, SNPmanifold can better identify the optimal degree of fitting with enhanced generalizability and human-interpretability. SNPmanifold can therefore reveal insights into single-cell clonality and lineages more comprehensively and straightforwardly.

9:20-9:40
Genome streamlining: effect of mutation rate and population size on genome size reduction
Confirmed Presenter: Juliette Luiselli, INRIA Lyon, INSA Lyon, France

Room: 518
Format: In Person

Moderator(s): Katharina Jahn


Authors List: Show

  • Juliette Luiselli, INRIA Lyon, INSA Lyon, France
  • Jonathan Rouzaud-Cornabas, INRIA Lyon, INSA Lyon, France
  • Nicolas Lartillot, CNRS, Université Lyon 1, France
  • Guillaume Beslon, INRIA Lyon, INSA Lyon, France

Presentation Overview: Show

Genome size reduction, also known as genome streamlining, is observed in bacteria with very different life traits, including endosymbiotic bacteria and several marine bacteria, raising the question of its evolutionary origin. None of the hypotheses proposed in the literature is firmly established, mainly due to the many confounding factors related to the diverse habitats of species with streamlined genomes. Computational models may help overcome these difficulties and rigorously test hypotheses. We use Aevol, a platform designed to study the evolution of genome architecture, to test two main hypotheses: that an increase in population size (N) or mutation rate (μ) could cause genome reduction. Pre-evolved individuals were transferred into new conditions, characterized by an increase in population size or mutation rate. In our experiments, both conditions lead to genome reduction. However, they lead to very different genome structures. Under increased population size, genomes lose a significant fraction of non-coding sequences, but maintain their coding size, resulting in densely packed genomes (akin to streamlined marine bacteria genomes). By contrast, under increased mutation rate, genomes lose coding and non-coding sequences (akin to endosymbiotic bacteria genomes). Hence, both factors lead to an overall reduction in genome size, but the coding density of the genome appears to be determined by N × μ. Thus, a broad range of genome size and density can be achieved by different combinations of N and μ. Further analyses suggest that genome size and coding density are determined by the interplay between selection for phenotypic adaptation and selection for robustness.

Evolutionary dynamics of microRNAs pinpoint innovations in the gene regulatory network of vertebrates
Confirmed Presenter: Felix Langschied, Institute of Cell Biology and Neuroscience, Goethe University, Frankfurt, Germany, Germany

Room: 518
Format: In Person

Moderator(s): Katharina Jahn


Authors List: Show

  • Felix Langschied, Institute of Cell Biology and Neuroscience, Goethe University, Frankfurt, Germany, Germany
  • Matthias S. Leisegang, Institute for Cardiovascular Physiology, Goethe University, Frankfurt, Germany, Germany
  • Ralf P. Brandes, Institute for Cardiovascular Physiology, Goethe University, Frankfurt, Germany, Germany
  • Ingo Ebersberger, Institute of Cell Biology and Neuroscience, Goethe University, Frankfurt, Germany, Germany

Presentation Overview: Show

The evolution of the regulatory network formed by miRNAs and their target mRNAs remains poorly understood because scalable and accurate frameworks for miRNA ortholog detection are missing. We closed this methodological gap, and our tool ncOrtho identifies miRNA orthologs in large collections of unannotated genome assemblies, matching manually curated annotations in sensitivity and precision. With ncOrtho, we have investigated the plasticity of the human miRNA repertoire across 402 phylogenetically diverse vertebrates. This revealed four main bursts of miRNA acquisition, of which the oldest predates the diversification of the vertebrates, and the youngest is specific to the Simiiformes. Overall, miRNA loss is rare, which directs the attention to 16 miRNA families that are absent in the Eumuroidea (Rodentia). To investigate the impact of these losses on the corresponding gene regulatory networks, we overexpressed Mir-197 and Mir-769 in induced pluripotent stem cells (iPSCs) of human and mouse. Overlapping sets of silenced mRNAs in the two species reveal that miRNA-dependent regulatory networks remain partly intact despite the miRNA losses. The prevalence of target sites specific to either lineage indicates a considerable evolutionary flexibility of the target gene repertoire. Interestingly, human protein-coding genes with a similar history of gene loss as the 16 miRNAs are enriched for transcription factors. This indicates that mouse but also rat have substantially modified their regulatory network of gene expression at the transcriptional and post-transcriptional levels compared to other vertebrate model organisms.

9:40-10:00
PHALCON: Phylogeny-aware variant calling from large-scale single-cell panel sequencing datasets
Confirmed Presenter: Priya, Department of Computer Science and Engineering, IIT Kanpur, India

Room: 518
Format: Live Stream

Moderator(s): Katharina Jahn


Authors List: Show

  • Priya, Department of Computer Science and Engineering, IIT Kanpur, India
  • Sunkara B. V. Chowdary, Department of Computer Science and Engineering, IIT Kanpur, India
  • Hamim Zafar, Department of Computer Science and Engineering & Department of Biological Sciences and Bioengineering, IIT Kanpur, India

Presentation Overview: Show

Single-cell sequencing (SCS) technologies bring cellular resolution in resolving intra-tumor heterogeneity, which can cause drug resistance and relapse in cancer. Nonetheless, SCS methods pose several technical challenges, such as uneven coverage, allelic dropout (ADO), or artifacts subjected to erroneous amplification. Single-cell variant callers have been developed to distinguish the true variants from technical artifacts. However, recently emerging parallel sequencing methods can now sequence up to thousands of cells by targeting only disease-specific genes. Current variant callers are not scalable for such high-throughput datasets and do not effectively address the amplification biases in panel-based sequencing protocols.

To address these, we present a statistical variant caller, PHALCON, which enables scalable mutation detection from large-scale single-cell panel sequencing data by modeling their evolutionary history under a finite-sites model along a clonal phylogeny. PHALCON infers the underlying cellular sub-populations based on genotype likelihoods of candidate sites and reconstructs a clonal phylogeny and the most likely mutation history (loss and recurrence included) using a probabilistic framework that maximizes the likelihood of the observed read counts given the genotypes.

Using numerous simulated datasets across varied experimental settings, we showed that PHALCON outperforms existing state-of-the-art methods in terms of variant calling accuracy (7.29-51.67% improvement), accuracy in inferring the tumor phylogeny (410.43-32931.8% improvement) and runtime (60-70 times faster). Furthermore, we applied PHALCON to real tumor single-cell panel sequencing datasets from triple negative breast cancer patients, where PHALCON detected novel somatic mutations in important oncogenes and tumor suppressor genes with high functional impact and orthogonal support in bulk datasets.

10:40-11:00
Proceedings Presentation: A machine-learning based alternative to phylogenetic bootstrap
Confirmed Presenter: Tal Pupko, Tel Aviv University, Israel

Room: 518
Format: In Person

Moderator(s): Dannie Durand


Authors List: Show

  • Noa Ecker, Tel Aviv University, Israel
  • Tal Pupko, Tel Aviv University, Israel
  • Itay Mayrose, Tel Aviv University, Israel
  • Yishay Mansour, Tel Aviv University, Israel
  • Dorothée Huchon, Tel Aviv University, Israel

Presentation Overview: Show

Currently used methods for estimating branch support in phylogenetic analyses often rely on the classic Felsenstein's bootstrap, parametric tests, or their approximations. As these branch support scores are widely used in phylogenetic analyses, having accurate, fast, and interpretable scores is of high importance.
Here, we employed a data-driven approach to estimate branch support values with a probabilistic interpretation. To this end, we simulated thousands of realistic phylogenetic trees and the corresponding multiple sequence alignments. Each of the obtained alignments was used to infer the phylogeny using state-of-the-art phylogenetic inference software, which was then compared to the true tree. Using these extensive data, we trained machine-learning algorithms to estimate branch support values for each bipartition within the maximum-likelihood trees obtained by each software. Our results demonstrate that our model provides fast and more accurate probability-based branch support values than commonly used procedures.

11:10-11:20
Neutral variation in a protein interaction network limits predictability of protein evolution
Confirmed Presenter: Soham Dibyachintan, Université Laval, Canada

Room: 518
Format: In Person

Moderator(s): Dannie Durand


Authors List: Show

  • Soham Dibyachintan, Université Laval, Canada
  • Alexandre Dubé, Université Laval, Canada
  • David Bradley, Université Laval, Canada
  • Pascale Lemieux, Université Laval, Canada
  • Ugo Dionne, Lunenfeld-Tanenbaum Research Institute, Canada
  • Christian Landry, Université Laval, Canada

Presentation Overview: Show

The evolutionary fate of a mutation is dependent on its phenotypic effects. In recent years, multiple evolutionary models have been developed that use variation in natural sequences to predict the impact of a mutation in any given protein. However, many proteins display multiple phenotypes, most mediated by specific protein domains. Furthermore, many proteins originate from gene duplication and share most of their evolutionary history. How such factors affect the predictability of evolution is unknown owing to the lack of comprehensive experimental data on such proteins. We combined genome editing and high-throughput phenotypic assays to quantify the impact of all single-amino acid substitutions on the binding of two functionally redundant paralogous Src Homology 3 domains to their cognate interaction partners in yeast. These interaction partners have peptides which satisfy a consensus polyproline motif recognized by the SH3 domains. We observed that the effect of many mutations was not conserved across phenotypes or between the paralogs. A comparison of our experiments with evolutionary models revealed that these models capture only a few differences in the effect of mutations between paralogs. Ancestral sequence reconstruction revealed that for mutations whose effects differed between domains, there was no difference between ancestral substitutions and mutations sampled at random. Broadly, our results illustrate that neutral sequence variation over time in the components of a protein interaction network limits our ability to predict protein evolution accurately using existing methods. Our work underscores the importance of using experimental data to inform computational models and improve the prediction of protein evolution.

11:20-11:40
Simultaneously Building and Reconciling a Synteny Tree
Confirmed Presenter: Mathieu Gascon, Université de Montréal, Canada

Room: 518
Format: In Person

Moderator(s): Dannie Durand


Authors List: Show

  • Mathieu Gascon, Université de Montréal, Canada
  • Mattéo Delabre, Université de Montréal, Canada
  • Nadia El-Mabrouk, University of Montreal, Canada

Presentation Overview: Show

Our lab recently presented Synesth (for SYNteny Evolution in SegmenTal Histories), an extended reconciliation model for synteny trees accounting for fissions, losses, gains, duplications and transfers potentially going through unsampled species. Synesth takes as input a synteny tree and a species tree, and outputs a most parsimonious evolutionary history. As reconciliation is very sensitive to the input trees (a slight modification may lead to a significant difference in the inferred evolutionary scenarios), obtaining accurate trees is essential. This is particularly challenging in the case of our model, which requires a synteny tree as input, whereas phylogenetic methods on gene sequences instead output sets of gene trees, one for each gene family. If the individual gene trees are "consistent", meaning that they do not represent contradictory phylogenetic information, then a supertree (a tree displaying them all) can be obtained. Such a supertree can be used as an input for Synesth to represent the evolution of the syntenies containing the individual genes. As finding the optimal supertree has been shown to be an NP-hard problem, the solution we proposed in a previous work was to test each possible supertree and retain the one leading to the most parsimonious reconciliation. In this presentation, we explore a new way to solve this problem by simultaneously building and reconciling the optimal supertree, leading to an algorithm that is exponential in the number of gene trees rather than in the total number of genes. We compare this new algorithm to the previous one using simulated datasets.

11:40-12:00
ntSynt: multi-genome synteny detection using minimizer graph mappings
Confirmed Presenter: Inanc Birol, Canada's Michael Smith Genome Sciences Centre at BC Cancer, Canada

Room: 518
Format: In Person

Moderator(s): Dannie Durand


Authors List: Show

  • Lauren Coombe, Canada's Michael Smith Genome Sciences Centre at BC Cancer, Canada
  • Rene Warren, Canada's Michael Smith Genome Sciences Centre at BC Cancer, Canada
  • Parham Kazemi, Canada's Michael Smith Genome Sciences Centre at BC Cancer, Canada
  • Johnathan Wong, Canada's Michael Smith Genome Sciences Centre at BC Cancer, Canada
  • Inanc Birol, Canada's Michael Smith Genome Sciences Centre at BC Cancer, Canada

Presentation Overview: Show

In recent years, the landscape of reference-grade genome assemblies has seen substantial diversification. With such rich data, there is pressing demand for robust tools for scalable, multi-species comparative genomics analyses, including detecting genome synteny, which informs on the sequence conservation between genomes and contributes crucial insights into species evolution. Here, we introduce ntSynt, a scalable utility for computing large-scale multi-genome synteny blocks using a minimizer graph-based approach. After computing the initial multi-genome synteny blocks using this constructed minimizer graph, the synteny blocks are refined in multiple rounds through indel detection, merging collinear blocks and extending block coordinates using decreasing minimizer window sizes. Through extensive testing utilizing multiple ~3 Gbp genomes, we demonstrate how ntSynt produces synteny blocks with coverages between 79–100% in at most 2h using 34 GB of memory, even for genomes with appreciable (>15%) sequence divergence. In addition, we used ntSynt to compare 11 bee genomes of the genus Andrena from the Earth BioGenome Project, and achieved synteny blocks with high coverage (85% for the smallest genome) in less than 15 minutes, despite these genomes varying in both chromosome number (3–7) and genome size (247 Mbp – 443 Mbp). Compared to existing state-of-the-art methodologies, ntSynt offers enhanced flexibility to diverse input genome sequences and synteny block granularity. We expect the macrosyntenic genome analyses facilitated by ntSynt to enable critical evolutionary insights within and between species across the tree of life. ntSynt is freely available at https://github.com/bcgsc/ntsynt.
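
To illustrate the minimizer concept that the approach above builds on, here is a minimal Python sketch. It is not ntSynt's implementation: the function name `minimizers`, the parameters `k` (k-mer size) and `w` (window size, in k-mers), and the use of plain lexicographic ordering (tools often order k-mers by a hash value instead) are assumptions made for illustration.

```python
def minimizers(seq, k, w):
    """Return the sorted list of (position, k-mer) minimizers of seq.

    For every window of w consecutive k-mers, the lexicographically
    smallest k-mer (leftmost on ties) is selected. Illustrative only.
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    selected = set()
    for start in range(len(kmers) - w + 1):
        window = kmers[start:start + w]
        offset = min(range(w), key=lambda j: window[j])  # leftmost minimum
        selected.add((start + offset, window[offset]))
    return sorted(selected)


# Two sequences that share a stretch tend to share minimizers in that stretch.
a = minimizers("ACGTTGCATGCA", k=3, w=4)
b = minimizers("TTACGTTGCATG", k=3, w=4)
shared = {kmer for _, kmer in a} & {kmer for _, kmer in b}
```

Because only one k-mer per window is kept, the sketch of each genome is much smaller than its full k-mer set, and a smaller window yields denser anchors, which is consistent with the abstract's use of decreasing window sizes for refinement.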

14:20-14:40
Automated clade-level detection of Incomplete lineage sorting
Confirmed Presenter: Maureen Stolzer, Carnegie Mellon University, United States

Room: 518
Format: In Person

Moderator(s): Nadia El-Mabrouk


Authors List: Show

  • Maureen Stolzer, Carnegie Mellon University, United States
  • Yuting Xiao, Carnegie Mellon University, United States
  • Dannie Durand, Carnegie Mellon University, United States

Presentation Overview: Show

Phylogenetic population modeling, combined with sequencing of large collections of closely related taxa, has enabled unprecedented exploration of population processes in evolutionary and ecological contexts. Incomplete Lineage Sorting (ILS) and introgression can result in gene trees that disagree with the species tree. For example, the history of a gene sampled from three species with phylogeny A|BC may agree with the species tree or have one of two incongruent topologies, B|AC or C|AB. The resulting distribution of gene tree topologies provides a wealth of information for testing alternate hypotheses and estimating population parameters. Despite these advances, quantification of ILS, while excluding incongruence due to introgression and paralogy, remains a challenging problem.
Here, we present an algorithm that extracts gene tree statistics associated with ILS from all species internodes in a single computational procedure, supporting automated, large-scale phylogenomic analyses of entire clades. Characterizing ILS can help to resolve phylogenetic uncertainty and is important for understanding the relative contributions of incomplete lineage sorting, introgression, and convergent evolution to trait evolution and present-day genetic variation. Our method accounts for uncertainty due to gene loss and missing data and screens out incongruence due to distant introgression and paralogy. As such, it can be applied to both multigene families and single-copy orthologs. The algorithm is polynomial in tree size and is thus applicable to very large species trees. We demonstrate our approach through the reanalysis of several phylogenomic datasets discussed in the literature.
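
The three-species example above can be made concrete with a short counting sketch in Python. It is not the authors' algorithm: the function name `triplet_topology_counts` and the input encoding (one pair of sister species per gene tree) are hypothetical. Under ILS alone, the two discordant topologies are expected at equal frequency, so a marked asymmetry between them hints at another process such as introgression.

```python
from collections import Counter


def triplet_topology_counts(cherries, concordant=("B", "C")):
    """Count concordant and discordant gene-tree topologies for a triplet.

    `cherries` is a hypothetical encoding: one pair of sister species per
    gene tree, e.g. ("B", "C") for topology A|BC.
    """
    counts = Counter(frozenset(pair) for pair in cherries)
    n_concordant = counts.pop(frozenset(concordant), 0)
    # Remaining keys are the discordant topologies (B|AC and C|AB).
    discordant = {"".join(sorted(key)): n for key, n in counts.items()}
    return n_concordant, discordant


gene_trees = [("B", "C")] * 6 + [("A", "C")] * 2 + [("A", "B")] * 2
n_conc, disc = triplet_topology_counts(gene_trees)
```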

14:40-15:00
Sparse Neighbor Joining: rapid phylogenetic inference using a sparse distance matrix
Confirmed Presenter: Semih Kurt, KTH Royal Institute of Technology, Sweden

Room: 518
Format: In Person

Moderator(s): Nadia El-Mabrouk


Authors List: Show

  • Semih Kurt, KTH Royal Institute of Technology, Sweden
  • Alexandre Bouchard-Cote, The University of British Columbia, Canada
  • Jens Lagergren, KTH Royal Institute of Technology, Sweden

Presentation Overview: Show

Phylogenetic reconstruction is a fundamental problem in computational biology. The Neighbor Joining (NJ) algorithm offers an efficient distance-based solution to this problem, which often serves as the foundation for more advanced statistical methods. Despite prior efforts to enhance the speed of NJ, the computation of the n^2 entries of the distance matrix, where n is the number of phylogenetic tree leaves, continues to pose a limitation in scaling NJ to larger datasets. In this work, we propose a new algorithm which does not require computing a dense distance matrix. Instead, it dynamically determines a sparse set of at most O(n log n) distance matrix entries to be computed in its basic version, and up to O(n log^2 n) entries in an enhanced version. We show by experiments that this approach reduces the execution time of NJ for large datasets, with a trade-off in accuracy.
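
For context, the dense computation that the abstract aims to avoid can be sketched in Python as follows. This is the standard Neighbor Joining selection criterion, Q(i, j) = (n - 2) d(i, j) - sum_k d(i, k) - sum_k d(j, k), evaluated over a full distance matrix; it is not the sparse algorithm proposed by the authors, and the function name `nj_pick_pair` and the toy matrix are illustrative.

```python
def nj_pick_pair(d):
    """Return (Q value, i, j) for the pair minimizing the NJ Q-criterion.

    `d` is a full symmetric distance matrix (list of lists); every one of
    its n^2 entries is read, which is the cost a sparse variant avoids.
    """
    n = len(d)
    row_sums = [sum(row) for row in d]
    best = None
    for i in range(n):
        for j in range(i + 1, n):
            q = (n - 2) * d[i][j] - row_sums[i] - row_sums[j]
            if best is None or q < best[0]:
                best = (q, i, j)
    return best


toy = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
best_q, i, j = nj_pick_pair(toy)
```

Note that the pair with the smallest raw distance (taxa 3 and 4, distance 3) is not the pair selected by the criterion, which is why NJ needs row sums and hence, in its classic form, the whole matrix.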

15:00-15:20
Scalable distance-based phylogeny inference using divide-and-conquer
Confirmed Presenter: Lars Arvestad, Stockholm University, Sweden

Room: 518
Format: In Person

Moderator(s): Nadia El-Mabrouk


Authors List: Show

  • Amy Lee Jalsenius, Stockholm University, Sweden
  • Lars Arvestad, Stockholm University, Sweden

Presentation Overview: Show

Distance-based methods for inferring evolutionary trees are important subroutines in computational biology, sometimes as a first step in a statistically more robust phylogenetic method. The most popular method is Neighbor-Joining, mainly due to its relatively good accuracy. Unfortunately, Neighbor-Joining has cubic time complexity, which limits its applicability on larger datasets. Similar but faster algorithms have been suggested, but the overall time complexity of a Neighbor-Joining computation remains essentially cubic as long as the input is a distance matrix that must be computed. In practice, memory usage is today a limiting factor because distance matrix sizes grow quadratically. These constraints become a bottleneck in studies that rely on distance-based phylogeny estimation. With ever increasing data sizes, a scalable distance-based phylogeny inference method would change how scientists think about evolutionary-based studies.

We present two randomized divide-and-conquer heuristics, dnctree and dnctree-k, that selectively estimate pairwise sequence distances and infer a tree by connecting increasingly large subtrees. The divide-and-conquer approach avoids computing all pairwise distances and thereby saves both time and memory. The time complexity is at worst quadratic, and seems to scale like O(n lg n) in practice. Both algorithms have been implemented and tested, and dnctree-k shows inference accuracy similar to that of Neighbor-Joining in our experiments. Computational experiments verify that both algorithms scale very well. In fact, they are applicable to very large datasets even when implemented in Python.

A Python implementation, dnctree, is available on GitHub (https://github.com/arvestad/dnctree) and PyPI.org.

15:20-16:00
Panel: Panel session
Room: 518
Format: In Person

Moderator(s): Nadia El-Mabrouk


Authors List: Show