EvolCompGen

Schedule subject to change
All times listed are in CEST
Wednesday, July 26th
10:30-10:50
Deciphering developmentally programmed DNA elimination in Mesorhabditis nematodes
Room: Pasteur Auditorium
Format: Live from venue

Moderator(s): Dannie Durand

  • Brice Letcher, LBMC, ENS Lyon & CNRS, France
  • Lewis Stevens, Wellcome Sanger Institute, United Kingdom
  • Marie Delattre, LBMC, ENS Lyon & CNRS, France


Presentation Overview:

While we commonly assume the genome to be largely identical across different cells of a multicellular organism, a number of species undergo a developmentally regulated elimination process by which the genome in somatic cells is reduced, while the germline genome remains intact. This process, called Programmed DNA Elimination (PDE), affects a number of species including copepod crustaceans, lamprey fish, single-celled ciliates and nematode worms (though not C. elegans!).

Only with high-depth and high-quality WGS data (PacBio HiFi or ONT, and Hi-C) has PDE recently become amenable to comprehensive genomic characterisation. In this talk, I will highlight our work to unravel PDE in Mesorhabditis nematodes, including computational identification of elimination breakpoints, identification of a sequence motif specifying cut-site location, and determination of which parts of the germline genome are eliminated. We also take a cross-species comparative approach, with the goal of probing the evolution of PDE.

10:50-11:10
An extended super-reconciliation model with synteny cuts and transfers through unsampled or extinct lineages
Room: Pasteur Auditorium
Format: Live from venue

Moderator(s): Dannie Durand

  • Mattéo Delabre, University of Montreal, Canada
  • Yoann Anselmetti, University of Sherbrooke, Canada
  • Nadia El-Mabrouk, University of Montreal, Canada


Presentation Overview:

The gene tree-species tree reconciliation framework enables the inference of evolutionary histories for gene families. Various extensions of this model have been proposed to infer the histories of gene syntenies, one of which is super-reconciliation, where a synteny tree is reconciled with a species tree.

In this work, we investigate an extended model for super-reconciliation. In addition to segmental duplications, horizontal transfers, and losses, our model accounts for synteny splits and gene gains. We explicitly model the possibility of transient transfers going through unsampled or lost lineages.

We examine the combinatorial properties of this extended model and its associated parsimony optimization problem, and introduce a polynomial-time optimization algorithm. We evaluate the algorithm’s performance against other state-of-the-art approaches using simulated datasets.

Finally, we apply our algorithm to study the evolution of CRISPR-Cas systems. We discuss the challenges and solutions for constructing a synteny tree from sequence data and for defining event costs, two key steps required to run the algorithm.

11:10-11:30
Genome-scale compression-based phylogeny estimation: An improved approach that uses the physicochemical properties of amino acids.
Room: Pasteur Auditorium
Format: Live from venue

Moderator(s): Dannie Durand

  • Edward Braun, University of Florida, United States


Presentation Overview:

Phylogenomic analyses face several fundamental challenges. First, they should use methods that address sources of bias common to many phylogenetic datasets. Second, they should be robust to multiple sequence alignment error. Bias in phylogenetic estimation can reflect issues like long-branch attraction, but a major source of bias in phylogenomics is discordance among gene trees due to processes like incomplete lineage sorting (ILS). Distance-based phylogenetic methods can be a consistent estimator of the species tree under these conditions, raising the possibility that alignment-free distance methods could solve the challenges associated with both ILS and multiple sequence alignment error. Alignment-free approaches that calculate distances using the conditional Kolmogorov complexity of the genome (or proteome) of one organism given the genome (or proteome) of another organism have been suggested but have received limited use in empirical studies. This reflects the fact that: 1) models of sequence evolution cannot be incorporated into the methods; and 2) it is difficult to estimate clade support using these methods. Herein, I provide a way to incorporate information about patterns of protein sequence evolution and to estimate clade support in a manner similar to the bootstrap. The utility of this modified alignment-free distance method was demonstrated using empirical phylogenies of mammals and birds.
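
As a rough illustration of the compression-based distances mentioned above, the sketch below computes the normalized compression distance (NCD), a common practical stand-in for conditional Kolmogorov complexity. It is not the method presented in this talk: the use of zlib as the compressor, the function names, and the toy sequences are all assumptions made for illustration, and the sketch includes neither the physicochemical treatment of amino acids nor the clade-support estimation described in the abstract.

```python
import random
import zlib


def compressed_size(data: bytes) -> int:
    """Length of the zlib-compressed data, a crude proxy for Kolmogorov complexity."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance.

    NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)),
    where C is the compressed size and xy is the concatenation of x and y.
    """
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


# Toy "proteomes" (hypothetical data, for illustration only).
rng = random.Random(0)
amino_acids = "ACDEFGHIKLMNPQRSTVWY"
seq_a = b"MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ" * 20
seq_c = "".join(rng.choice(amino_acids) for _ in range(len(seq_a))).encode()

print("NCD(a, a) =", round(ncd(seq_a, seq_a), 3))
print("NCD(a, c) =", round(ncd(seq_a, seq_c), 3))
```

A matrix of such pairwise distances could then be passed to any distance-based tree-building method; which method the talk uses is not stated in the abstract.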

11:30-11:50
Tempo and mode of degeneration in independently evolved non-recombining regions
Room: Pasteur Auditorium
Format: Live from venue

Moderator(s): Louxin Zhang

  • Ricardo C. Rodriguez de la Vega, GEE - ESE, CNRS. Universite Paris Saclay, AgroParisTech, France
  • Marine Duhamel, GEE - ESE, CNRS. Universite Paris Saclay, AgroParisTech, France
  • Fantin Carpentier, GEE - ESE, CNRS. Universite Paris Saclay, AgroParisTech, France
  • Wen Juan Ma, Vrije Universiteit Brussel, Belgium
  • Ozgur Taskent, GEE - ESE, CNRS. Universite Paris Saclay, AgroParisTech, France
  • Pauline Michel, GEE - ESE, CNRS. Universite Paris Saclay, AgroParisTech, France
  • Michael E. Hood, Amherst College, United States
  • Tatiana Giraud, GEE - ESE, CNRS. Universite Paris Saclay, AgroParisTech, France


Presentation Overview:

Despite the long-term advantages of recombination, local recombination suppression has evolved repeatedly, in particular in sex chromosomes. Recombination suppression leads to genomic degeneration due to the accumulation of deleterious mutations and transposable elements (TEs), which in turn can promote structural rearrangements. The dynamics of genomic degeneration after the onset of recombination suppression are largely unknown. Here we investigated the accumulation of deleterious mutations, TEs and structural rearrangements over time, leveraging 22 independent events of recombination suppression identified on mating-type chromosomes of anther-smut fungi. We estimated degeneration levels in non-recombining regions spanning more than four million years of evolution. After controlling for differences in GC-biased gene conversion, ancestral expression, and epigenetic and mutational inactivation of TEs, we found that following recombination suppression: i) the frequency of optimal codons rapidly decreased, ii) the strength of purifying selection remained constant at an intermediate level between purifying selection and neutral evolution, iii) rapid initial TE accumulation is followed by a slowdown in later stages, iv) non-recombining regions serve as a reservoir of TEs that can then transpose to recombining regions and are involved in structural rearrangements. Our study sheds light on the processes underpinning the degeneration of non-recombining regions and its genome-wide consequences.

11:50-12:00
Fast and performant pipeline for coevolutionary analysis of eukaryotic genes
Room: Pasteur Auditorium
Format: Live from venue

Moderator(s): Louxin Zhang

  • Giulia Sassi, University of Parma-Department of Chemistry, Life Sciences and Environmental Sustainability, Italy
  • Carlo De Rito, University of Parma-Department of Chemistry, Life Sciences and Environmental Sustainability, Italy
  • Riccardo Percudani, University of Parma-Department of Chemistry, Life Sciences and Environmental Sustainability, Italy


Presentation Overview:

Coevolution can be used to predict gene function, as in the case of phylogenetic profiling (PP) methods. PP delineates, with binary (presence/absence) vectors, the evolutionary association among genes. As an alternative to assessing global similarity between profiles, we have recently described co-transition (cotr) analysis as a method to score and determine the significance of correlated transitions between gene pairs across phylogenetically ordered genomes (https://doi.org/10.1073/pnas.2218329120). Cotr analysis can find coevolutionary associations even among genes with low profile similarity. We propose an extended procedure as a first step to investigate the influence of orthologous group (OG) construction on the coevolutionary analysis. The process consists of: 1) EukProt species subselection, 2) Broccoli orthology inference, 3) ClustalΩ intra-OG multiple sequence alignment (MSA), 4) hmmbuild for HMM construction from each MSA, 5) hmmsearch to compare each HMM against OrthoDB sequences and to recover orthologs in 1,929 species, 6) PP building and metric analysis through cotr analysis. Our in-depth pipeline is able to build the OGs from scratch and assign significant coevolutionary scores (adjusted P-values < 10⁻³) to 36,541 co-transitions between gene pairs in a manageable time. This analysis revealed novel coevolutionary associations and testable gene functions.
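
The notion of a transition in a phylogenetically ordered presence/absence profile can be sketched as follows. This is a simplified counting illustration only, not the published cotr scoring or its significance test: the function names, the split into concordant and discordant changes, and the example profiles are assumptions made for this sketch.

```python
from typing import Dict, List


def transitions(profile: List[int]) -> List[int]:
    """Signed changes between consecutive genomes in an ordered binary profile.

    +1 marks a gain (0 -> 1), -1 a loss (1 -> 0), and 0 no change.
    """
    return [b - a for a, b in zip(profile, profile[1:])]


def count_cotransitions(p: List[int], q: List[int]) -> Dict[str, int]:
    """Count positions where both profiles change state at the same step."""
    if len(p) != len(q):
        raise ValueError("profiles must cover the same ordered genomes")
    concordant = discordant = 0
    for tp, tq in zip(transitions(p), transitions(q)):
        if tp != 0 and tq != 0:
            if tp == tq:
                concordant += 1  # both gained or both lost
            else:
                discordant += 1  # one gained while the other was lost
    return {"concordant": concordant, "discordant": discordant}


# Hypothetical profiles over eight phylogenetically ordered genomes.
gene_a = [1, 1, 0, 0, 1, 1, 0, 1]
gene_b = [1, 1, 0, 0, 1, 0, 0, 1]
print(count_cotransitions(gene_a, gene_b))
```

Counting shared state changes rather than comparing whole vectors is what lets two genes with dissimilar overall profiles still show an association.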

12:00-12:10
Exploring the evolution of metabolic networks in fungi
Room: Pasteur Auditorium
Format: Live from venue

Moderator(s): Louxin Zhang

  • Vahiniaina Andriamanga, Institute for Integrative Biology of the Cell (I2BC), France
  • Olivier Lespinet, Institute for Integrative Biology of the Cell (I2BC), 91198 Gif-sur-Yvette, France
  • Anne Lopes, Institute for Integrative Biology of the Cell (I2BC), 91198 Gif-sur-Yvette, France


Presentation Overview:

Metabolic networks depict the relationships among all biochemical reactions. They define the metabolic capacity of an organism to use compounds available in the environment and to synthesize new products. Consequently, the environment plays a role in constraining the evolution of metabolic networks. We studied the evolution of 910 enzyme activities in 174 fungal species using a unique combination of phylogenetic profiles and graph-based analysis. The enzyme activities were divided based on their conservation, with 454 enzyme activities conserved across species and 456 associated with specific clades or species. We then investigated their evolutionary history through phylostratigraphy approaches. Doing so, we showed that 406 lineage-specific enzyme activities were already present in fungal ancestors and subsequently lost during evolution. Moreover, 50 were novel fungal-specific enzyme activities. Regarding their location in the metabolic network, we showed that the enzyme activities associated with specific clades and species are mostly peripheral, less connected, and alternatives to common ones. In addition, enzyme activities that share similar phylogenetic profiles are proximal within the network. Network-breaking enzyme activity losses were tolerated if one subnetwork was accessory or if an alternative enzyme activity existed in other species.

12:10-12:20
Efficient homology-based annotation of transposable elements using minimizers
Room: Pasteur Auditorium
Format: Live from venue

Moderator(s): Louxin Zhang

  • Laura Natalia González García, Universidad de los Andes, Université de Montpellier, Colombia
  • Daniela Lozano Arce, Universidad de los Andes, Colombia
  • Juan Pablo Londoño, Universidad de los Andes, Colombia
  • Romain Guyot, Université de Montpellier, France
  • Jorge Duitama, Universidad de los Andes, Colombia


Presentation Overview:

Transposable elements (TEs) make up more than half of the genomes of complex plant species and can modulate the expression of neighboring genes, producing significant variability in agronomically relevant traits. The availability of long-read sequencing technologies allows the building of genome assemblies for plant species with large and complex genomes. Unfortunately, TE annotation currently represents a bottleneck in the annotation of genome assemblies. We present a new functionality of the Next-Generation Sequencing Experience Platform (NGSEP) to perform efficient homology-based TE annotation. Sequences in a TE reference library are treated as long reads and mapped to an input genome assembly using minimizers. A hierarchical annotation is then assigned by homology using the annotation of the reference library. We tested the performance of our algorithm on genome assemblies of different plant species, including Arabidopsis thaliana, Oryza sativa, Coffea humblotiana, and Triticum aestivum (bread wheat). Our algorithm outperforms traditional homology-based annotation tools in speed by a factor of three to >20, reducing the annotation time of the T. aestivum genome from months to hours, and recovering up to 80% of TEs annotated with RepeatMasker with a precision of up to 0.95.
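
The minimizer idea behind this kind of mapping can be sketched in a few lines. The code below is a minimal illustration under stated assumptions, not the NGSEP implementation: it orders k-mers lexicographically, ignores the reverse strand, and uses hypothetical function names and parameter values, whereas production mappers typically rank k-mers by a hash and handle both strands.

```python
from typing import Dict, List, Tuple


def minimizers(seq: str, k: int, w: int) -> List[Tuple[int, str]]:
    """Return (position, k-mer) pairs: the smallest k-mer in each window of w consecutive k-mers."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    selected: List[Tuple[int, str]] = []
    for start in range(len(kmers) - w + 1):
        window = kmers[start:start + w]
        # Leftmost lexicographically smallest k-mer in this window.
        offset = min(range(w), key=lambda j: window[j])
        hit = (start + offset, window[offset])
        # Adjacent windows often share a minimizer; record it only once.
        if not selected or selected[-1] != hit:
            selected.append(hit)
    return selected


def shared_seeds(query: str, target: str, k: int, w: int) -> List[Tuple[int, int]]:
    """Pairs (query_pos, target_pos) whose minimizer k-mers are identical."""
    index: Dict[str, List[int]] = {}
    for pos, kmer in minimizers(target, k, w):
        index.setdefault(kmer, []).append(pos)
    return [(qpos, tpos)
            for qpos, kmer in minimizers(query, k, w)
            for tpos in index.get(kmer, [])]


# Toy example (hypothetical sequence and parameters).
print(minimizers("ACGTACGT", k=3, w=2))
```

Because only a subset of k-mers is indexed, seed lookup touches far fewer positions than an exhaustive k-mer index, which is the usual source of the speed-up in minimizer-based mapping.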

12:20-12:30
A Splice Aware Approach to Predict Genes in Eukaryotes.
Room: Pasteur Auditorium
Format: Live-stream

Moderator(s): Louxin Zhang

  • Abigail Djossou, University of Sherbrooke, Canada
  • Aida Ouangraoua, Université de Sherbrooke, Canada


Presentation Overview:

Gene prediction is a fundamental step in genomic sequence analysis, involving the identification of genes within a DNA sequence. Existing methods for protein-coding gene prediction, categorized into model-based and model-free approaches, have various shortcomings that result in prediction errors. Although existing model-based gene prediction methods have achieved considerable success, there is still room for improvement.

We have developed a new model-based prediction approach for protein-coding genes in eukaryotic genomes based on Hidden Markov Models (HMMs). Our program advances the state of the art by integrating into the model information on alternative splicing, the mechanism by which a single gene can produce multiple protein isoforms. By considering alternative splice sites, our method allows for different exon combinations within a single gene. This integration differentiates our program from existing model-based gene prediction tools.

We trained and tested our program on human genomic sequences from the ENSEMBL 98 database. Our method offers a promising approach to gene prediction, by effectively integrating alternative splicing while providing accurate and efficient predictions. These results emphasize the potential of our model-based approach for gene prediction in eukaryotic genomes and highlight the significance of integrating alternative splicing information for precise gene identification.

13:50-14:10
Robust and platform-independent CNA calling with ASCAT v3
Room: Salle Rhone 2
Format: Live from venue

Moderator(s): Edward Braun

  • Tom Lesluyes, The Francis Crick Institute, United Kingdom
  • Maxime Tarabichi, Université Libre de Bruxelles, Belgium
  • Kerstin Haase, Max Delbrück Center for Molecular Medicine, Germany
  • Jonas Demeulemeester, VIB – KU Leuven Center for Cancer Biology, Belgium
  • Peter Van Loo, The University of Texas MD Anderson Cancer Center, United States


Presentation Overview:

Tumour initiation and evolution are fuelled by somatic changes, ranging from single nucleotide variants to whole-genome aberrations. As a key process, copy-number alterations (CNAs) have been an important field of investigation for decades and helped uncover insights in terms of diagnosis, prognosis and treatment. Therefore, obtaining accurate CNA calls is crucial for understanding tumour biology.

In 2010, ASCAT was proposed as an allele-specific CNA caller, resolving tumour purity and ploidy from array data. Since then, we have aimed to extend ASCAT, enabling efficient and robust CNA calling from sequencing data, ranging from small targeted panels to whole genomes. We considered TCGA/ICGC/PCAWG cases with patient-matched SNP6, WES and WGS data for validation and compared agreement between CNA profiles. Also, we propose metrics of interest for quality control of ASCAT results. Such features and other improvements are now available in ASCAT v3.
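
To make the link between purity, ploidy, and the measured signals concrete, the sketch below computes the expected logR and B-allele frequency (BAF) at a germline-heterozygous SNP for a given allele-specific copy-number state. It follows the general form of the mixture model commonly associated with allele-specific copy-number callers, but it is a simplified illustration under stated assumptions, not ASCAT v3 code: the function name, parameter names, and the default platform factor gamma = 1 are hypothetical.

```python
import math
from typing import Tuple


def expected_logr_baf(n_a: int, n_b: int, purity: float,
                      tumour_ploidy: float, gamma: float = 1.0) -> Tuple[float, float]:
    """Expected (logR, BAF) for tumour copy numbers n_a and n_b of the two alleles.

    Assumes normal cells carry one copy of each allele (heterozygous SNP).
    """
    if not 0.0 < purity <= 1.0:
        raise ValueError("purity must be in (0, 1]")
    normal_fraction = 1.0 - purity
    # Average number of copies at this locus across the tumour/normal mixture.
    total = 2.0 * normal_fraction + purity * (n_a + n_b)
    if total <= 0.0:
        raise ValueError("no DNA expected at this locus (pure homozygous deletion)")
    # Average ploidy of the whole sample, used to normalise the locus signal.
    average_ploidy = 2.0 * normal_fraction + purity * tumour_ploidy
    log_r = gamma * math.log2(total / average_ploidy)
    baf = (normal_fraction + purity * n_b) / total
    return log_r, baf


# Copy-neutral loss of heterozygosity (2 copies of A, 0 of B) at 50% purity.
print(expected_logr_baf(n_a=2, n_b=0, purity=0.5, tumour_ploidy=2.0))
```

Fitting purity and ploidy then amounts to searching for the values under which observed logR and BAF segments sit closest to integer copy-number states; how ASCAT v3 performs that search is not described in the abstract.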

Furthermore, powerful bioinformatics methods allow delving deeper into tumour biology using sequencing data, but they require accurate purity and CNA estimates. We demonstrate how ASCAT and other tools enable characterising tumour evolution and heterogeneity.

All combined, ASCAT accurately assesses somatic and allele-specific copy-number changes in cancer genomes, making it a central entry point into uncovering key aspects of tumour biology.

14:10-14:20
Improving genome variation calls from non-human sequencing data using machine learning
Room: Salle Rhone 2
Format: Live from venue

Moderator(s): Edward Braun

  • Jeonghoon Choi, Pusan National University, South Korea
  • Bo Zhou, Stanford University, United States
  • Gwanghoon Jung, Pusan National University, South Korea
  • Minsu Kim, Pusan National University, South Korea
  • Donggil Kang, Pusan National University, South Korea
  • Giltae Song, Pusan National University, South Korea


Presentation Overview:

DeepVariant is a pipeline for accurate genome variation detection using convolutional neural networks that incorporate known genotype information [1]. Although DeepVariant is one of the most popular tools for calling genome variation, its performance on non-human genome data remains suboptimal because some preprocessing steps, such as indel realignment and base recalibration, rely on population genome datasets that are not available for non-human genomes.

To resolve this issue, we propose a machine-learning approach for filtering out false-positive genome variation calls generated by DeepVariant for non-human sequencing data. To mimic non-human genome variation calling situations, we skip the preprocessing steps for genome alignments. We train a model to identify false positives among variations called by DeepVariant, using genome data for which ground-truth variation calls have already been determined, such as HG002 from Genome in a Bottle (GIAB) [2]. For building the model, we apply decision-tree-based ensemble approaches. We evaluate our model using other genomes for which ground-truth variation calls are available. We expect that our model can accelerate genome variation studies of non-human species.

14:20-14:30
Deciphering mammalian genomes
Room: Salle Rhone 2
Format: Live-stream

Moderator(s): Edward Braun

  • Yury Bukhman, Morgridge Institute for Research, United States
  • Huishi Toh, Neuroscience Research Institute, University of California Santa Barbara, Santa Barbara, CA 93117, United States
  • Phillip A. Morin, Southwest Fisheries Science Center, National Oceanic and Atmospheric Administration (NOAA), United States
  • Susanne Meyer, Neuroscience Research Institute, University of California, Santa Barbara, United States
  • Li-Fang Chu, Regenerative Biology, Morgridge Institute for Research, United States
  • Jeff K. Jacobsen, V.E. Enterprises, United States
  • Jessica Antosiewicz-Bourget, Regenerative Biology, Morgridge Institute for Research, United States
  • Daniel Mamott, Regenerative Biology, Morgridge Institute for Research, United States
  • Maylie Gonzales, Neuroscience Research Institute, University of California, Santa Barbara, United States
  • Cara Argus, Regenerative Biology, Morgridge Institute for Research, United States
  • Jennifer Bolin, Regenerative Biology, Morgridge Institute for Research, United States
  • Mark E. Berres, University of Wisconsin Biotechnology Center Bioinformatics Resource Center, United States
  • Chentao Yang, BGI-Shenzhen, Shenzhen 518083, China
  • Lucie A. Bergeron, Villum Centre for Biodiversity Genomics, University of Copenhagen, Denmark
  • Guojie Zhang, Villum Centre for Biodiversity Genomics, University of Copenhagen, Denmark
  • Jacqueline Mountcastle, Vertebrate Genome Lab, The Rockefeller University, New York, NY, United States
  • Bettina Haase, Vertebrate Genome Lab, The Rockefeller University, New York, NY, United States
  • Olivier Fedrigo, Vertebrate Genome Lab, The Rockefeller University, New York, NY, United States
  • Giulio Formenti, Laboratory of Neurogenetics of Language, The Rockefeller University/HHMI, New York, NY, United States
  • Arang Rhie, Genome Informatics Section, National Human Genome Research Institute, Bethesda, MD, United States
  • Robert S. Harris, Department of Biology, Pennsylvania State University, United States
  • Jo Wood, Tree of Life, Wellcome Sanger Institute, United Kingdom
  • Alan Tracey, Tree of Life, Wellcome Sanger Institute, Cambridge CB10 1SA, United Kingdom
  • Willian Chow, Tree of Life, Wellcome Sanger Institute, Cambridge CB10 1SA, United Kingdom
  • Kerstin Howe, Tree of Life, Wellcome Sanger Institute, Cambridge CB10 1SA, United Kingdom
  • Kalpana Raja, Regenerative Biology, Morgridge Institute for Research, Madison, WI, United States
  • John Steill, Bioinformatics and Regenerative Biology, Morgridge Institute for Research, Madison, WI, United States
  • Scott A. Swanson, Bioinformatics and Regenerative Biology, Morgridge Institute for Research, Madison, WI, United States
  • Peng Jiang, Center for Gene Regulation in Health and Disease (GRHD), Cleveland State University, Cleveland, OH, United States
  • John Fogg, Department of Statistics, University of Wisconsin - Madison, Madison, WI, United States
  • Aashish Jain, Department of Computer Science, Purdue University, United States
  • Daisuke Kihara, a) Department of Biological Sciences, b) Department of Computer Science, Purdue University, United States
  • Bogdan M. Kirilenko, LOEWE Centre for Translational Biodiversity Genomics, Germany
  • Chetan Munegowda, LOEWE Centre for Translational Biodiversity Genomics, Germany
  • Michael Hiller, LOEWE Centre for Translational Biodiversity Genomics, Germany
  • J. Spencer Johnston, Department of Entomology, Texas A&M University, United States
  • Alexander Ionkov, Bioinformatics and Regenerative Biology, Morgridge Institute for Research, Madison, WI, United States
  • Aimee Lang, Ocean Associates, Inc., United States
  • Magnus Wolf, Senckenberg Biodiversity and Climate Research Centre (BiK-F), Germany
  • Lily Yan, Department of Psychology & Neuroscience Program, Michigan State University, United States
  • Dennis O. Clegg, Neuroscience Research Institute, University of California, Santa Barbara, United States
  • Adam M. Phillippy, Genome Informatics Section, National Human Genome Research Institute, Bethesda, MD, United States
  • Erich D. Jarvis, The Rockefeller University, United States
  • James A. Thomson, Regenerative Biology Laboratory, Morgridge Institute for Research, United States
  • Mark J.P. Chaisson, Department of Quantitative and Computational Biology, University of Southern California, United States
  • Ron Stewart, Bioinformatics and Regenerative Biology, Morgridge Institute for Research, United States


Presentation Overview:

We present reference-quality genome assemblies of two mammals, the blue whale Balaenoptera musculus and the Nile rat Arvicanthis niloticus. The blue whale is the world’s largest animal, while the Nile rat is a promising animal model of type 2 diabetes. Both assemblies were built using multiple data types and state-of-the-art genome assembly workflows by the Vertebrate Genomes Project (VGP). The Nile rat is one of only a few diploid genomes whose two haplotypes have been fully resolved using the trio binning genome assembly workflow. We analyzed both genomes for heterozygosity and segmental duplications. We also compared them to related species in an effort to find features that may be responsible for large body size in the blue whale and diabetes susceptibility in the Nile rat.

14:30-14:40
Comparative Genomics study of Bos Genome
Room: Salle Rhone 2
Format: Live from venue

Moderator(s): Edward Braun

  • Menaka Thambiraja, SASTRA DEEMED TO BE UNIVERSITY, India
  • Ragothaman M Yennamalli, SASTRA DEEMED TO BE UNIVERSITY, India
  • Shukrruthi K Iyengar, SASTRA DEEMED TO BE UNIVERSITY, India
  • Brintha Satishkumar, SASTRA DEEMED TO BE UNIVERSITY, India
  • Sai Rohith Kavuru, SASTRA DEEMED TO BE UNIVERSITY, India
  • Aakanksha Katari, SASTRA DEEMED TO BE UNIVERSITY, India
  • Suneel K Onteru, National Dairy Research Institute, India
  • Kamlesh Kumari Bajwa, National Dairy Research Institute, India
  • Dheer Singh, National Dairy Research Institute, India


Presentation Overview:

Indigenous cattle in India are known for their economic management in comparison to exotic breeds, owing to their evolution under specific agroclimatic conditions. Their adaptation to harsh climatic conditions and resistance is attributed to the birth-and-death evolution model. One way to identify genomic variations is to catalog copy number variations (CNVs); however, the relationship between CNVs and the innate immunity of indicine cattle has not been in focus. We performed a genome-wide comparative analysis of the existing genomes of B. indicus (Nelore breed), B. indicus (Gir breed), and B. taurus. Using the SyMAP, GSAlign, and SyRI tools, we performed a chromosome-by-chromosome analysis of these genomes and identified evolutionary-based sequence variations: 97.39% SNVs, 2.06% insertions, and 0.54% deletions between B. taurus and B. indicus (Nelore breed); 91.5% SNVs, 4.2% insertions, and 4.24% deletions between B. taurus and B. indicus (Gir breed); and 93.01% SNVs, 3.2% insertions, and 3.8% deletions between B. indicus (Nelore breed) and B. indicus (Gir breed). In addition, we studied the intrachromosomal variation that involved the comparison of autosomes with allosomes. The results identify the key genes and their associated loci involved in innate immunity in each breed.

14:40-14:50
Whole genome duplication and gene evolution in the hyperdiverse venomous gastropods
Room: Salle Rhone 2
Format: Live from venue

Moderator(s): Giltae Song

  • Sarah Farhat, Institut Systématique Evolution Biodiversité (ISYEB), Muséum national d'Histoire naturelle, CNRS, Sorbonne Université, France
  • Maria Vittoria Modica, Department of Biology and Evolution of Marine Organisms (BEOM), Stazione Zoologica Anton Dohrn, Roma, Italy
  • Nicolas Puillandre, Institut Systématique Evolution Biodiversité (ISYEB), Muséum national d'Histoire naturelle, CNRS, Sorbonne Université, France


Presentation Overview:

Research on venomous organisms and the toxins they produce is increasing, but there are biases in the taxonomic coverage. Neogastropods are a diverse group of marine predators that use various feeding strategies, including producing bioactive compounds like toxins to subdue prey. However, little is known about the link between the diversity of these compounds and the hyperdiversification of neogastropod species, or about how genome evolution is related to both compound and species diversity. Only eight neogastropod genomes have been sequenced, and assembly quality is uneven among the 45 gastropod genomes sequenced so far. To address this, we generated high-quality chromosome-level assemblies of two species, Monoplex corrugatus (tonnoidean) and Stramonita haemastoma (neogastropod), and identified their gene repertoires. We inferred a whole genome duplication event, identified potential toxins in both genomes, and highlighted possible cases of gene subfunctionalization and neofunctionalization using differential gene expression. The high-quality genomes provide valuable references for their respective taxa and facilitate the identification of genome-level processes that contribute to the evolutionary success of predatory neogastropods.

14:50-15:00
Comparative Genomics of the Arthropoda
Room: Salle Rhone 2
Format: Live from venue

Moderator(s): Giltae Song

  • Sean Chun-Chang Chen, Taipei Medical University, Taiwan
  • Carol Eunmi Lee, University of Wisconsin-Madison, United States


Presentation Overview:

Arthropods comprise the largest and most speciose phylum on earth. However, comparative genomics within the subphylum Pancrustacea has focused predominantly on the insect clade (Hexapoda). As such, we lack a clear understanding of what constitutes an “insect” versus a “crustacean” genome, relative to other arthropod subphyla. Here, we present our preliminary analyses of the characteristics of arthropod genomes. We analyzed 75 arthropod genomes (15 chelicerate, nine myriapod, and 51 pancrustacean genomes), along with two tardigrade outgroups. We found intriguing differences in genomic characteristics among arthropods: (1) crustaceans are less AT-rich than insects, myriapods and chelicerates (except Parasitiformes, which exhibit low AT content); (2) chelicerates show an inverse relation between genome size and AT content at the third codon position, whereas crustaceans and insects exhibit positive relationships; and (3) gene families specific to insects alone include odorant binding proteins and odorant receptors, potentially related to terrestrial colonization by insects. We also annotated previously unknown crustacean gene families and elucidated evolutionary patterns of several gene families that are related to environmental adaptation. Our results are a first step toward revealing the evolutionary forces that shape the genome architectures of arthropod lineages and uncovering their association with ecological factors.

15:00-15:20
Predicting Cancer-Protective Variants using comparative genomics
Room: Salle Rhone 2
Format: Live from venue

Moderator(s): Giltae Song

  • Yuval Tabach, The Hebrew University-Hadassah Medical School, Israel
  • Lamis Naddaf, The Hebrew University-Hadassah Medical School, Israel


Presentation Overview:

Cancer is a leading cause of mortality. While much work has been done to identify germline and somatic mutations that increase the risk of cancer, there has been little understanding of genomic variants that reduce cancer risk. Systematically identifying cancer-protecting genetic variations is almost impossible in the human population because it requires massive genomic and clinical data that are currently very limited. However, there are notable differences in cancer rates among animal species, some of which display almost total immunity to the disease. Using comparative genomics, we identify genetic variants that are distinctive to cancer-resistant rodents and predict cancer risk across mammals. In the human population, we identify 1,000 Resistant Alleles (R-alleles; SNPs) with lower prevalence among cancer patients. We validated two R-alleles and showed reduced cancerous characteristics in cancer cell cultures, with no noticeable phenotypic changes in healthy human cell cultures. Overall, we generated a cross-species map of R-alleles that are correlated with cancer resistance across various species. These findings have significant implications for understanding the evolution of cancer resistance, understanding the genotype-phenotype relationship, and improving cancer risk assessment, diagnosis, prognosis, and the discovery of protective drugs.

15:20-15:30
SonicParanoid2: fast, accurate and comprehensive orthology inference with machine learning and language models
Room: Salle Rhone 2
Format: Live from venue

Moderator(s): Giltae Song

  • Salvatore Cosentino, Department of Integrated Biosciences, Graduate School of Frontier Sciences, The University of Tokyo, Chiba, Japan
  • Wataru Iwasaki, Department of Integrated Biosciences, Graduate School of Frontier Sciences, The University of Tokyo, Chiba, Japan


Presentation Overview:

Accurate inference of orthologous genes constitutes a prerequisite for various genomic and evolutionary studies. SonicParanoid is one of the fastest tools for orthology inference; however, its scalability and sensitivity have been hampered by time-consuming all-versus-all alignments and the existence of proteins with complex domain architectures. Here, we report an update of SonicParanoid in which machine learning is used to overcome these two limitations. An AdaBoost classifier reduced the execution time of the all-versus-all alignment by up to 42% without negative effects on accuracy. A Doc2Vec neural network model enabled orthology inference at the domain level and increased the number of predicted orthologs by one-third. Evaluation on standardized benchmark datasets and a large dataset of 2,000 MAGs showed that SonicParanoid2 is up to 18X faster and more scalable than other orthology-inference tools, and comparably accurate to well-established methods.

16:00-16:20
Proceedings Presentation: Genome-wide Scans for Selective Sweeps using Convolutional Neural Networks
Room: Salle Rhone 2
Format: Live from venue

Moderator(s): Wataru Iwasaki

  • Hanqing Zhao, University of Twente, Netherlands
  • Matthijs Souilljee, University of Twente, Netherlands
  • Pavlos Pavlidis, Foundation for Research and Technology-Hellas, Greece
  • Nikolaos Alachiotis, University of Twente, Netherlands


Presentation Overview: Show

Motivation: Recent methods for selective sweep detection cast the problem as a classification task and use summary statistics as features to capture region characteristics that are indicative of a selective sweep, thereby being sensitive to confounding factors. Furthermore, they are not designed to perform whole-genome scans or to estimate the extent of the genomic region that was affected by positive selection; both are required for identifying candidate genes and the time and strength of selection.
Results: We present ASDEC (https://github.com/pephco/ASDEC), a neural-network-based framework which can scan whole genomes for selective sweeps. ASDEC achieves similar classification performance to other CNN-based classifiers that rely on summary statistics, but it is trained 10x faster and classifies genomic regions 5x faster by inferring region characteristics from the raw sequence data directly. Deploying ASDEC for genomic scans achieved up to 15.2x higher sensitivity, 19.4x higher success rates, and 4x higher detection accuracy than state-of-the-art methods. We used ASDEC to scan human chromosome 1 of the Yoruba population (1000Genomes project), identifying 9 known candidate genes.

16:20-16:30
EvoProDom: evolutionary modeling of protein families by assessing translocations of protein domains
Room: Salle Rhone 2
Format: Live from venue

Moderator(s): Wataru Iwasaki

  • Milana Frenkel-Morgenstern, Bar-Ilan University, Israel
  • Gon Carmi, Bar-Ilan University, Israel


Presentation Overview: Show

We introduce a novel ‘evolution of protein domains’ (EvoProDom) model for describing the evolution of proteins based on the ‘mix and merge’ of protein domains. We assembled and integrated genomic and proteomic data comprising protein domain content and orthologous proteins from 109 organisms with full genomes. In EvoProDom, we characterized evolutionary events, particularly translocations, as reciprocal exchanges of protein domains between orthologous proteins in different organisms. We showed that protein domains that translocate with high frequency are generated by transcripts enriched in trans-splicing events, that is, the generation of novel transcripts from the fusion of two distinct genes. In EvoProDom, we describe a general method to collate orthologous protein annotation from KEGG, and protein domain content from protein sequences using tools such as KofamKOALA and Pfam. Thus, EvoProDom represents a novel model for protein evolution based on the ‘mix and merge’ of protein domains rather than DNA-based evolution models. This confers the advantage of considering chromosomal alterations as drivers of protein evolutionary events.

16:30-16:50
Probing domain architecture design using language models
Room: Salle Rhone 2
Format: Live from venue

Moderator(s): Wataru Iwasaki

  • Xiaoyue Cui, Carnegie Mellon University, United States
  • Maureen Stolzer, Carnegie Mellon University, United States
  • Dannie Durand, Carnegie Mellon University, United States


Presentation Overview: Show

Multidomain proteins are mosaics of structural or functional modules, called domains. The architecture of a multidomain protein - that is, its domain composition in N- to C-terminal order - is intimately related to its function, with each module playing a distinct functional role. For example, in cell signaling proteins, distinct domains are responsible for recognition and response to a stimulus. Multidomain architectures evolve via gain and loss of domain-encoding segments. This evolutionary exploration of domain architecture composition underlies the protein diversity seen in nature.
We exploit sophisticated machine learning algorithms for natural language processing, combined with a rapidly expanding repertoire of domain architecture data, to develop a framework for investigating the forces that govern this process. We represent domain architectures as vectors in a multidimensional space by applying various information retrieval and natural language processing techniques. This system provides a basis for exploratory analysis using visualization with a nonlinear dimensionality reduction method. We further provide quantitative measures that support rigorous comparison of sets of embedded domain architectures. This framework has many applications, including investigating taxonomic differences in the domain architecture complement, identifying domain "synonyms" with similar functional roles, and exploring substructure of the world of domain architectures.
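
As a rough illustration of what representing domain architectures as vectors can look like, the sketch below uses a plain TF-IDF bag-of-domains weighting with cosine similarity. This is one generic information retrieval technique chosen for illustration, not necessarily the one used in the talk; the protein names and architectures are hypothetical.

```python
import math
from collections import Counter

# Hypothetical architectures: domain names in N- to C-terminal order.
ARCHS = {
    "p1": ["SH3", "SH2", "Pkinase"],
    "p2": ["SH2", "SH3", "Pkinase"],
    "p3": ["PDZ", "PDZ", "GuKc"],
}

def tfidf_vectors(archs):
    """Weight each domain by its count times a smoothed inverse document frequency."""
    n = len(archs)
    # Document frequency: in how many architectures does each domain occur?
    df = Counter(d for doms in archs.values() for d in set(doms))
    return {
        name: {d: c * (math.log(n / df[d]) + 1.0) for d, c in Counter(doms).items()}
        for name, doms in archs.items()
    }

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[d] for d, w in u.items() if d in v)
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

VECS = tfidf_vectors(ARCHS)
print(cosine(VECS["p1"], VECS["p2"]))  # same domain content, different order
print(cosine(VECS["p1"], VECS["p3"]))  # no shared domains
```

A bag-of-domains vector ignores domain order, so p1 and p2 receive identical vectors here; order-aware embeddings would be needed to separate them.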

16:50-17:00
Zoonosis Prediction Using Language Models
Room: Salle Rhone 2
Format: Live from venue

Moderator(s): Wataru Iwasaki

  • Blessy Antony, Virginia Polytechnic Institute and State University, United States
  • Jie Bu, Virginia Polytechnic Institute and State University, United States
  • Andrew Chan, Virginia Polytechnic Institute and State University, United States
  • Anuj Karpatne, Virginia Polytechnic Institute and State University, United States
  • T. M. Murali, Virginia Polytechnic Institute and State University, United States


Presentation Overview: Show

Zoonoses are diseases that are transmitted from non-human animals to humans through the evolution of the disease-causing pathogens. Identifying which species may be infected by a novel virus is an important first step in predicting and preventing the outbreak of an infectious disease in animal and human populations. In this study, we proposed a computational framework to understand the zoonotic potential of viruses. We trained a Transformer-based model on viral protein sequences to learn the language of the constituting amino acids. For this purpose, we used the collection of protein sequences from a diverse set of viruses. Further, we used one-dimensional convolution to gather local neighborhood features in viral sequences. We evaluated the performance of this proposed model in the challenging multi-class classification setting of predicting the animal hosts of a given virus sequence. The Transformer-based model yielded substantially higher AUPRC scores compared to standard machine learning classification algorithms. In ongoing research, we are developing interpretations of model results to discover the genetic mutations that may drive viral zoonoses.

17:00-17:10
UFCG: database of universal fungal core genes and pipeline for genome-wide phylogenetic analysis of fungi
Room: Salle Rhone 2
Format: Live from venue

Moderator(s): Nadia Mabrouk

  • Dongwook Kim, Seoul National University, South Korea
  • Cameron L.M. Gilchrist, Seoul National University, South Korea
  • Jongsik Chun, Seoul National University, South Korea
  • Martin Steinegger, Seoul National University, South Korea


Presentation Overview: Show

In phylogenomics, the evolutionary relationships of organisms are studied using their genomic information. A common approach to phylogenomics is to extract related genes from each organism, build a multiple sequence alignment and then reconstruct evolutionary relationships through a phylogenetic tree. Often a set of highly conserved genes occurring in single-copy, called core genes, are used for this analysis, as they allow efficient automation within a taxonomic clade. Here we introduce the Universal Fungal Core Genes (UFCG) database and pipeline for genome-wide phylogenetic analysis of fungi. The UFCG database consists of 61 curated fungal marker genes, including a novel set of 41 computationally derived core genes and 20 canonical genes derived from literature, as well as marker gene sequences extracted from publicly available fungal genomes. Furthermore, we provide an easy-to-use, fully automated and open-source pipeline for marker gene extraction, training and phylogenetic tree reconstruction. The UFCG pipeline can identify marker genes from genomic, proteomic and transcriptomic data, while producing phylogenies consistent with those previously reported, and is publicly available together with the UFCG database at https://ufcg.steineggerlab.com.

17:10-17:20
Clade Identification and Understanding Evolutionary Trajectory of Candida auris through Genome Rearrangements
Room: Salle Rhone 2
Format: Live from venue

Moderator(s): Nadia Mabrouk

  • Pavitra Selvakumar, The Institute of Mathematical Sciences, (HBNI), Chennai, Tamil Nadu, India
  • Rahul Siddharthan, The Institute of Mathematical Sciences, (HBNI), Chennai, Tamil Nadu, India
  • Aswathy Narayanan, Jawaharlal Nehru Centre for Advanced Scientific Research, Bangalore, Karnataka, India
  • Kaustuv Sanyal, Jawaharlal Nehru Centre for Advanced Scientific Research, Bangalore, Karnataka, India


Presentation Overview: Show

Candida auris, a multidrug-resistant human fungal pathogen, has emerged and evolved as different clades across the globe in the past decade. C. auris clinical strains exhibit clade-specific features associated with virulence and drug resistance. The molecular events leading to the rapid emergence are yet to be understood. Here, chromosomal rearrangements among C. auris clades and related species are investigated with primary focus on centromeres, to understand its evolutionary trajectory. Centromeres, known to be the hotspots of breaks and downstream rearrangements, are identified using a combined approach of chromatin immunoprecipitation and comparative genomic analysis. We find that C. auris and multiple other species in the Clavispora/Candida clade share a conserved small regional GC-poor centromeric landscape that lacks pericentromeres and repeats. A centromere inactivation event has led to karyotypic alterations in the species complex. It is observed that one of the geographical clades, the East Asian Clade, has evolved along a unique trajectory compared to other clades and related species. Consequent to this rapid evolution, recently reported strains are indicating cross identification within the previously defined four distinct geographical clades. A rapid and specific colony PCR-based clade identification system (CLaID) is developed using unique DNA sequence junctions conserved in a clade-specific manner.

17:20-17:40
Multiple RNA tree Robinson-Foulds Phylogeny
Room: Salle Rhone 2
Format: Live from venue

Moderator(s): Nadia Mabrouk

  • Yoann Anselmetti, University of Sherbrooke, Canada
  • Aïda Ouangraoua, University of Sherbrooke, Canada


Presentation Overview: Show

Over the last three decades, many methods were developed to predict the secondary structure of ncRNAs and build accurate ncRNA multiple sequence alignments accounting for their secondary structure. But until now, only a few algorithms and methods were designed to study the evolution of ncRNA secondary structure, and none of them allows the reconstruction of the complete evolutionary history of the secondary structures of an ncRNA family.
In this talk, we consider the Small Parsimony and the Large Parsimony problems for families of ncRNAs whose secondary structures are represented as trees. For these two optimization problems, we have designed heuristic solutions under the Robinson-Foulds (RF) tree metric model. We study the theoretical complexity of the problems under the RF distance model, as well as the tree edit distance model, and provide efficient algorithmic solutions for the two problems under the two tree distance models. The study of the evolution of ncRNA structures has the potential to lead to interesting insights for therapeutic targeting of ncRNAs based on the comparison of their structures involved in metabolic pathways in different species, thus combining genomic, transcriptomic and metabolomic information.
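
For readers unfamiliar with the Robinson-Foulds metric, the following minimal sketch computes the standard RF distance between two rooted, leaf-labelled trees as the number of clades found in one tree but not the other. It assumes a nested-tuple tree encoding with hypothetical leaf labels, and it does not attempt to reproduce the RNA-specific tree representation or the parsimony algorithms of the talk.

```python
def clades(tree):
    """Return the leaf sets induced by the internal nodes of a nested-tuple tree."""
    found = set()

    def walk(node):
        if isinstance(node, str):          # a leaf is a plain label
            return frozenset([node])
        leaves = frozenset().union(*(walk(child) for child in node))
        found.add(leaves)                  # record the clade below this internal node
        return leaves

    walk(tree)
    return found

def rf_distance(t1, t2):
    """Size of the symmetric difference between the two clade sets."""
    return len(clades(t1) ^ clades(t2))

# Hypothetical trees on four labelled leaves.
T1 = (("a", "b"), ("c", "d"))
T2 = (("a", "c"), ("b", "d"))
T3 = ((("a", "b"), "c"), "d")

print(rf_distance(T1, T1))  # 0
print(rf_distance(T1, T2))  # 4
print(rf_distance(T1, T3))  # 2
```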

17:40-18:00
Proceedings Presentation: Phylogenomic branch length estimation using quartets
Room: Salle Rhone 2
Format: Live-stream

Moderator(s): Nadia Mabrouk

  • Yasamin Tabatabaee, University of Illinois at Urbana-Champaign, United States
  • Chao Zhang, University of California at Berkeley, United States
  • Tandy Warnow, University of Illinois at Urbana-Champaign, United States
  • Siavash Mirarab, University of California, San Diego, United States


Presentation Overview: Show

Branch lengths and topology of a species tree are essential in most downstream analyses, including estimation of diversification dates, characterization of selection, understanding adaptation, and comparative genomics. Modern phylogenomic analyses often use methods that account for the heterogeneity of evolutionary histories across the genome due to processes such as incomplete lineage sorting. However, these methods typically do not generate branch lengths in units that are usable by downstream applications, forcing phylogenomic analyses to resort to alternative shortcuts such as estimating branch lengths by concatenating gene alignments into a supermatrix. Yet, concatenation and other available approaches for estimating branch lengths fail to address heterogeneity across the genome. In this paper, we derive expected values of gene tree branch lengths in substitution units under an extension of the multi-species coalescent (MSC) model that allows substitutions with varying rates across the species tree. We present CASTLES, a new technique for estimating branch lengths on the species tree from estimated gene trees that uses these expected values, and our study shows that CASTLES improves on the most accurate prior methods with respect to both speed and accuracy.

Thursday, July 27th
8:30-8:40
Nucleotide content differences in high and low pathogenic human coronaviruses affect RNA structural features, selective constraints, and compensatory evolution
Room: Pasteur Auditorium
Format: Live from venue

Moderator(s): Dannie Durand

  • Svetlana Shabalina, National Center for Biotechnology Information, United States
  • Aleksey Ogurtsov, National Center for Biotechnology Information, United States
  • Alexey Spiridonov, AI Research Institute, FB, United States
  • Eugene Koonin, National Center for Biotechnology Information, United States


Presentation Overview: Show

We compared nucleotide content, RNA features and predicted RNA structures of three groups of highly pathogenic (HP) coronaviruses that cause severe disease in humans with those of the low pathogenic (LP) human coronaviruses (hCoVs). Comparative analysis of global and local folding of genomic and subgenomic hCoV RNAs, structural characteristics and RNA stability revealed that HP-genomes folding was enriched in short-range interactions, with a preference for G-C base pairs compared to LP-genomes, where RNA folding contained a significantly higher fraction of the wobble G-U pairs. HP-genomes folding contained a greater number and higher density of predicted structural elements, especially short, locally folded hairpins. HP-genomes are also enriched in alternative RNA conformations, so short hairpins can transiently stabilize local RNA folding but can be easily disrupted to participate in alternative interactions. We showed that compensatory evolution in the isolated conserved structural elements of the HP-genomes is much more pronounced than in the LP genomes. In part, described differences between HP- and LP-genomes can be explained by the significant difference in C- and U-nucleotide composition. Significant enrichment of C-containing dinucleotides and C-terminal codons can optimize RNA stability and accelerate translation elongation rates in some regions of the HP- compared to LP-genomes.

8:40-8:50
Reconstructing horizontal gene flow network to understand prokaryotic evolution
Room: Pasteur Auditorium
Format: Live from venue

Moderator(s): Dannie Durand

  • Soham Sengupta, University of North Texas, United States
  • Rajeev Azad, University of North Texas, United States


Presentation Overview: Show

Horizontal gene transfer (HGT) is a major source of phenotypic innovation and a mechanism of niche adaptation in prokaryotes. Quantification of HGT is critical to decipher its myriad roles in microbial evolution and adaptation. Advances in genome sequencing and bioinformatics have augmented our ability to understand the microbial world, particularly the direct or indirect influence of HGT on diverse life forms. Methods for detecting HGT can be classified into phylogenetic-based and parametric or composition-based approaches. Here, we exploited the complementary strengths of both approaches to construct a high-confidence horizontal gene flow network. Our network is unique in its ability to detect the transfer of native genes of a genome to genomes from other taxa, thus establishing donor and recipient organisms (taxa), rather than through a post hoc analysis as is the practice with several other approaches. The scale-free horizontal gene flow network presented here provides new insights into modes of transfer for the exchange of genetic information and also illuminates differential gene flow across phyla.

8:50-9:10
Improved interpretability of bacterial genome-wide associations using gene cluster centric k-mers
Room: Pasteur Auditorium
Format: Live from venue

Moderator(s): Dannie Durand

  • Hannes Neubauer, Twincore/Hannover Medical School (MHH), Germany
  • Marco Galardini, Twincore/Hannover Medical School (MHH), Germany


Presentation Overview: Show

The wide adoption of bacterial genome sequencing and encoding both core and accessory genome variation using k-mers has allowed bacterial genome-wide association studies (GWAS) to identify genetic variants associated with relevant phenotypes such as those linked to infection. Significant limitations still remain as far as the interpretation of association results is concerned, which affects the wider adoption of GWAS methods on microbial datasets. We have developed a simple computational method (panfeed) that explicitly links each k-mer to its gene cluster at base resolution level, which allows us to avoid biases introduced by a global de Bruijn graph as well as more easily map and annotate associated variants. We tested panfeed on two independent datasets, correctly identifying previously characterized causal variants, which demonstrates the precision of the method, as well as its scalable performance. panfeed is a command line tool written in the Python programming language and available at https://github.com/microbial-pangenomes-lab/panfeed.

9:10-9:30
Machine learning enables prediction of metabolic system evolution in bacteria
Room: Pasteur Auditorium
Format: Live from venue

Moderator(s): Dannie Durand

  • Naoki Konno, The University of Tokyo, Japan
  • Wataru Iwasaki, The University of Tokyo, Japan


Presentation Overview: Show

Evolution prediction is a long-standing goal in evolutionary biology, with potential impacts on strategic pathogen control, genome engineering, and synthetic biology. While laboratory evolution studies have shown the predictability of short-term and sequence-level evolution, that of long-term and system-level evolution has not been systematically examined. Here, we show that the gene content evolution of metabolic systems is generally predictable by applying ancestral gene content reconstruction and machine learning techniques to ~3000 bacterial genomes. Our framework, Evodictor, successfully predicted gene gain and loss evolution at the branches of the reference phylogenetic tree, suggesting that evolutionary pressures and constraints on metabolic systems are universally shared. Investigation of pathway architectures and meta-analysis of metagenomic datasets confirmed that these evolutionary patterns have physiological and ecological bases as functional dependencies among metabolic reactions and bacterial habitat changes. Last, pan-genomic analysis of intraspecies gene content variations proved that even "ongoing" evolution in extant bacterial species is predictable in our framework.

10:00-10:20
Proceedings Presentation: Phylogenetic Diversity Statistics for All Clades in a Phylogeny
Room: Pasteur Auditorium
Format: Live-stream

Moderator(s): Louxin Zhang

  • Siddhant Grover, Iowa State University, United States
  • Alexey Markin, USDA-ARS, United States
  • Tavis Anderson, USDA-ARS, United States
  • Oliver Eulenstein, Iowa State University, United States


Presentation Overview: Show

The classic quantitative measure of phylogenetic diversity, PD, has been used to address problems in conservation biology, microbial ecology, and evolutionary biology. PD is the minimum total length of the branches in a phylogeny required to cover a specified set of taxa on the phylogeny. A general goal in the application of PD has been identifying a set of taxa of size k that maximizes PD on a given phylogeny; this has been mirrored in active research to develop efficient algorithms for the problem. Other descriptive statistics, such as the minimum PD, average PD, and standard deviation of PD, can provide invaluable insight into the distribution of PD across a phylogeny (relative to a fixed value of k). However, there has been limited or no research on computing these statistics, especially when required for each clade in a phylogeny, enabling direct comparisons of PD between clades. We introduce efficient algorithms for computing PD and the associated descriptive statistics for a given phylogeny and each of its clades. In simulation studies, we demonstrate the ability of our algorithms to analyze large-scale phylogenies with applications in ecology and evolutionary biology.
Availability: The software is available at https://github.com/flu-crew/PD_stats.
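
The PD definition above can be made concrete with a small sketch. It assumes the unrooted reading of PD (the total length of the minimal subtree connecting the chosen taxa) and a hypothetical toy tree, and it obtains the descriptive statistics by brute-force enumeration of all size-k subsets. This is only meant to illustrate the quantities involved; it is not the efficient algorithm of the paper and does not use the PD_stats software.

```python
import statistics
from itertools import combinations

# Hypothetical toy tree: child -> (parent, branch length). "root" has no entry.
TREE = {
    "A": ("n1", 1.0),
    "B": ("n1", 2.0),
    "n1": ("root", 0.5),
    "C": ("n2", 1.5),
    "D": ("n2", 1.0),
    "n2": ("root", 3.0),
}

def leaves_below(tree):
    """Map every node to the set of leaves in its subtree."""
    parents = {parent for parent, _ in tree.values()}
    below = {}
    for node in tree:
        if node in parents:
            continue                      # internal node, handled via its leaves
        below.setdefault(node, set()).add(node)
        cur = node
        while cur in tree:                # climb to the root, registering the leaf
            cur = tree[cur][0]
            below.setdefault(cur, set()).add(node)
    return below

def pd(tree, taxa):
    """Sum the branches that separate some chosen taxa from the others."""
    taxa = set(taxa)
    below = leaves_below(tree)
    total = 0.0
    for node, (_, length) in tree.items():
        inside = below[node] & taxa
        if inside and inside != taxa:     # branch lies on a path between chosen taxa
            total += length
    return total

def pd_stats(tree, k):
    """Brute-force min, max, mean and standard deviation of PD over size-k sets."""
    leaves = sorted(set(tree) - {parent for parent, _ in tree.values()})
    values = [pd(tree, subset) for subset in combinations(leaves, k)]
    return min(values), max(values), statistics.mean(values), statistics.pstdev(values)

print(pd(TREE, {"A", "B"}))  # 3.0
print(pd(TREE, {"B", "C"}))  # 7.0
print(pd_stats(TREE, 2))
```

The enumeration grows combinatorially with the number of taxa, which is the cost that dedicated algorithms are designed to avoid.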

10:20-10:30
Back to the roots: Phylogeny of wild and cultivated beets
Room: Pasteur Auditorium
Format: Live from venue

Moderator(s): Louxin Zhang

  • Felix L. Wascher, Institute of Computational Biology, Department of Biotechnology, Universität für Bodenkultur, Vienna, Austria, Austria
  • Heinz Himmelbauer, Institute of Computational Biology, Department of Biotechnology, Universität für Bodenkultur, Vienna, Austria, Austria
  • Juliane C. Dohm, Institute of Computational Biology, Department of Biotechnology, Universität für Bodenkultur, Vienna, Austria, Austria


Presentation Overview: Show

Cultivated beets, including sugar beet, rank among the most important crops. The wild ancestor of beet crops is the sea beet Beta vulgaris ssp. maritima. Species and subspecies of wild beets are readily crossable with cultivated beets and are thus available for crop improvement. Here we present our recent work on genetic relationships within the genus Beta, based on over 750 sequenced beet genomes. For genomic comparisons we used a k-mer based approach (Ondov et al., 2016) followed by phylogenetic tree construction. This allowed us to propose the origin of domestication of cultivated beets, and to provide comprehensive insights into the phylogeny of wild and cultivated beets. To underline the usefulness of k-mer based approaches in comparative genomics we compared our results to a classical SNP-chip based approach. Our workflow allows easy classification of wild beets of unknown origin and reveals a surprisingly high abundance of misclassified (sub-)species assignments in public seed banks which has impact on downstream analyses. Furthermore, we integrated the phylogenetic data with information regarding pathogen resistance. In summary, our work showcases the multiple usages of whole-genome sequencing data in the context of crop plant genomics.
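
To give a sense of how a k-mer based genome comparison can work, the sketch below estimates the Jaccard similarity of two sequences from bottom-s MinHash sketches and converts it to a Mash-style distance. It assumes the cited approach is MinHash sketching in the style of Mash; the sequences, k-mer length and sketch size are hypothetical, and canonical (reverse-complement aware) k-mers are omitted for brevity.

```python
import hashlib
import math

def kmers(seq, k):
    """All distinct substrings of length k."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def sketch(seq, k, size):
    """Bottom-`size` MinHash sketch: the smallest hash values over all k-mers."""
    hashes = {int(hashlib.sha1(km.encode()).hexdigest(), 16) for km in kmers(seq, k)}
    return sorted(hashes)[:size]

def jaccard_estimate(s1, s2, size):
    """Fraction of the smallest merged hashes that occur in both sketches."""
    merged = sorted(set(s1) | set(s2))[:size]
    shared = set(s1) & set(s2)
    return sum(1 for h in merged if h in shared) / len(merged)

def mash_distance(j, k):
    """Mash-style distance from a Jaccard estimate; 1.0 when nothing is shared."""
    if j == 0:
        return 1.0
    return -math.log(2 * j / (1 + j)) / k

# Hypothetical toy sequences; each must be at least k long.
K, SIZE = 4, 50
SEQ_A = "ACGTACGTGGCTAGCTAGGATCCA"
SEQ_B = "ACGTACGTGGCTAGCTAGGATCGA"

j = jaccard_estimate(sketch(SEQ_A, K, SIZE), sketch(SEQ_B, K, SIZE), SIZE)
print(j, mash_distance(j, K))
```

A matrix of such pairwise distances can then be passed to a distance-based tree construction method such as neighbor joining.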

10:30-10:50
AFconverge: alignment-free phylogenetic method for predicting convergent evolution of regulatory elements
Room: Pasteur Auditorium
Format: Live-stream

Moderator(s): Louxin Zhang

  • Elysia Saputra, University of Pittsburgh, United States
  • Ali Tugrul Balci, University of Pittsburgh, United States
  • Nathan Clark, University of Utah, United States
  • Maria Chikina, University of Pittsburgh, United States


Presentation Overview: Show

Comparative genomics can unveil genetic changes that underlie morphological differences, but elucidating adaptations of regulatory elements (REs) is still challenging. Most existing methods are limited to computing element-level signals from multiple sequence alignments, which is unsuitable for REs that are modularly composed of transcription factor (TF) binding motifs that can turn over and vary in arrangement. To quantify how selective forces act on such flexible sequence space requires a model that can account for functional conservation in the presence of sequence divergence. We introduce AFconverge, an “alignment-free” method that predicts the patterns of regulatory motif adaptations underlying phenotypic convergence. AFconverge quantifies the presence of TF motifs across RE orthologs and correlates each motif with the phenotype using a phylogenetically-constrained association test. Benchmarking experiments with the convergent case of mammalian vision loss showed that AFconverge outperformed competing methods at predicting significant divergence of ocular-related promoters. We then applied AFconverge to investigate the pan-mammalian regulatory adaptations underlying extended lifespan, and identified global gains and losses of longevity-associated motif features. AFconverge also revealed correlation patterns of motif selection, highlighting pluripotency, germline development, immunity, and pancreatic functions as key drivers of longevity. Thus, AFconverge introduces new paradigms for interrogating regulatory adaptations at multiple scales.

10:50-11:00
SNPtotree – a software for sorting haploid variants into phylogenetic trees
Room: Pasteur Auditorium
Format: Live from venue

Moderator(s): Nadia Mabrouk

  • Zehra Köksal, University of Copenhagen, Denmark
  • Leonor Gusmão, State University of Rio de Janeiro (UERJ), Brazil
  • Claus Børsting, University of Copenhagen, Denmark
  • Vania Pereira, University of Copenhagen, Denmark


Presentation Overview: Show

The investigation of variants on haploid chromosomes and their geographic distribution may help to characterize the genetic evolution of a species and provide further information in forensic genetics and genetic genealogy. Here, the hierarchical order of variants in a phylogenetic tree is fundamental. Currently, there is no straightforward method for sorting haploid variants into phylogenetic trees. Most often, character-based tree construction methods, in particular those utilizing maximum likelihood (ML), are applied. With the help of manual sorting, variants are arranged into phylogenetic trees. However, a substantial amount of missing data, due to poor sequencing quality or compilation of data from different sources, causes difficulties when creating reliable phylogenetic trees.

Here, we introduce SNPtotree, the only available software which can determine the hierarchical order of haploid variants in a phylogenetic tree without requiring error-prone manual labor. The algorithm infers pairwise variant relationships, which are combined into a complete tree.

SNPtotree creates a reliable phylogenetic tree for sequences with a high amount of missing data, and this tree is closer to the expected phylogeny than those from ML-based approaches. SNPtotree enables the creation and maintenance of phylogenetic tree databases for organisms that may be either unexplored or thoroughly investigated, and allows studies on population, forensic, and evolutionary genetics.

11:00-11:20
Proceedings Presentation: Cell type matching across species using protein embeddings and transfer learning
Room: Pasteur Auditorium
Format: Live from venue

Moderator(s): Nadia Mabrouk

  • Kirti Biharie, Delft University of Technology, Netherlands
  • Lieke Michielsen, Leiden University Medical Center, Netherlands
  • Marcel Reinders, Delft University of Technology, Netherlands
  • Ahmed Mahfouz, Leiden University Medical Center, Netherlands


Presentation Overview: Show

Motivation: Knowing the relation between cell types is crucial for translating experimental results from mice to humans. Establishing cell type matches, however, is hindered by the biological differences between the species. A substantial amount of evolutionary information between genes that could be used to align the species is discarded by most of the current methods since they only use one-to-one orthologous genes. Some methods try to retain this information by explicitly including the relation between genes, but not without caveats.
Results: In this work, we present a model to transfer and align cell types in cross-species analysis (TACTiCS). First, TACTiCS uses a natural language processing model to match genes using their protein sequences. Next, TACTiCS employs a neural network to classify cell types within a species. Afterward, TACTiCS uses transfer learning to propagate cell type labels between species. We applied TACTiCS on scRNA-seq data of the primary motor cortex of human, mouse, and marmoset. Our model can accurately match and align cell types on these datasets. Moreover, our model outperforms Seurat and the state-of-the-art method SAMap. Finally, we show that our gene matching method results in better cell type matches than BLAST in our model.

11:20-11:40
Uncovering the Dynamics of CRISPR Array Evolution with a New Maximum Likelihood Approach
Room: Pasteur Auditorium
Format: Live from venue

Moderator(s): Nadia Mabrouk

  • Axel Fehrenbach, University of Tübingen, Germany
  • Alexander Mitrofanov, University of Freiburg, Germany
  • Omer Alkhnbashi, King Fahd University of Petroleum and Minerals, Saudi Arabia
  • Rolf Backofen, University of Freiburg, Germany
  • Franz Baumdicker, University of Tübingen, Germany


Presentation Overview: Show

The CRISPR-Cas technology has revolutionized gene-editing by allowing precise and efficient editing of DNA sequences. Initially discovered within bacteria and archaea, CRISPR-Cas serves as a powerful immune system that effectively defends against foreign invaders by incorporating short snippets of DNA, called spacers, into the CRISPR array within the cell’s genome. Notably, insertions occur at one end of the CRISPR array, therefore they provide a chronology of foreign invasions. The inserted spacers are utilized to identify and eliminate matching foreign DNA during subsequent invasions. CRISPR arrays rapidly evolve due to spacer insertions and deletions.
Commonly used tools for ancestral reconstruction are unsuitable for CRISPR arrays as they do not consider the insertion order.
We introduce SpacerPlacer, a tool that utilizes probabilistic models of CRISPR array evolution and a maximum-likelihood approach to reconstruct ancestral states of a group of CRISPR arrays while respecting the insertion order.
With SpacerPlacer we analyzed a large database of CRISPR arrays to estimate their evolutionary behavior and compare between CRISPR types and different species. Interestingly, we found that spacer deletions are not more frequent at the back end of the array and that multiple spacers are likely to be lost in blocks rather than exclusively individually.

11:40-11:50
Evolutionary conservation of RNA editing - a case study in Filamin genes
Room: Pasteur Auditorium
Format: Live from venue

Moderator(s): Nadia Mabrouk

  • Andrea Tanzer, Medical University of Vienna, Center for Anatomy and Cell Biology, Department of Cell and Developmental Biology, Austria


Presentation Overview: Show

The epitranscriptome provides an additional layer of regulation by chemically modifying the nucleotides of RNA. RNA modifications alter the physico-chemical properties and thus the identity of nucleotides, which affects RNA-RNA and RNA-protein interactions. They are introduced by ‘writer’ enzymes, and the stoichiometry of modified and native RNAs follows a spatio-temporal pattern. A-to-I, m6A, m5C and pseudoU are the most prominent modification types. The reasons why the cell relies on temporal modifications rather than hard-coded mutations remain largely unknown.

In this study we present our results on the evolutionary history of an A-to-I editing site in filamin genes. We used bioinformatic tools to search for homologous genes, reconstruct gene phylogenies, predict consensus structures and study compensatory mutations. The double-stranded ADAR target structure is formed between sequences in an exon and its downstream intron. While editing in the exon causes recoding of the protein sequence (Q/R), the intronic region resides in an ultraconserved element. This structural element is conserved between human and lamprey, suggesting that editing of filamins dates back to the origin of vertebrates. We confirmed the functionality of our structures by calling editing events in previously published transcriptomics data sets, including those of shark and the lamprey Petromyzon.

11:50-12:00
Bioinformatics Analysis of Mutations Sheds Light on the Evolution of Dengue NS1 Protein With Implications in the Identification of Potential Functional and Druggable Sites
Room: Pasteur Auditorium
Format: Live from venue

Moderator(s): Nadia Mabrouk

  • Abhishek Sharma, National Centre for Biological Sciences, Bengaluru, India
  • Sudhir Krishna, National Centre for Biological Sciences, Bengaluru, India
  • Ramanathan Sowdhamini, National Centre for Biological Sciences, Bengaluru, India


Presentation Overview: Show

Non-structural protein 1 (NS1) is a conserved, 350-amino-acid protein of the dengue virus. Conservation of NS1 is expected due to its importance in dengue pathogenesis. The protein is known to exist in dimeric and hexameric states. In this work, we performed extensive structure and sequence analysis of the NS1 protein and uncovered the role of NS1 quaternary states in its evolution. Three-dimensional modeling of unresolved loop regions in the NS1 structure was performed. “Conserved” and “Variable” regions within the NS1 protein were identified from sequences obtained from patient samples, and the role of compensatory mutations in selecting destabilizing mutations was identified. Molecular dynamics (MD) simulations were performed to extensively study the effect of a few mutations on NS1 structure stability and compensatory mutations. Virtual saturation mutagenesis, predicting the effect of every individual amino acid substitution on NS1 stability sequentially, revealed virtual-conserved and variable sites. The increased number of observed and virtual-conserved regions across NS1 quaternary states suggests a role for higher-order structure formation in its evolutionary conservation. Our sequence and structure analysis could enable identification of possible protein–protein interfaces and druggable sites. Virtual screening of nearly 10,000 small molecules permitted us to recognize six drug-like molecules targeting the dimeric sites.

13:20-13:40
Proceedings Presentation: A weighted distance-based approach for deriving consensus tumor evolutionary trees
Room: Pasteur Auditorium
Format: Live from venue

Moderator(s): Edward Braun

  • Ziyun Guang, Carleton College, United States
  • Matthew Smith-Erb, Carleton College, United States
  • Layla Oesper, Carleton College, United States


Presentation Overview: Show

Motivation: The acquisition of somatic mutations by a tumor can be modeled by a type of evolutionary tree. However, it is impossible to observe this tree directly. Instead, numerous algorithms have been developed to infer such a tree from different types of sequencing data. These methods can produce conflicting trees for the same patient, making it desirable to have approaches that can combine several such tumor trees into a consensus or summary tree. We introduce the Weighted m-Tumor Tree Consensus Problem (W-m-TTCP) to find a consensus tree among multiple plausible tumor evolutionary histories, each assigned a confidence weight, given a specific distance measure between tumor trees. We present an algorithm called TuLiP, based on integer linear programming (ILP), that solves the W-m-TTCP and, unlike other existing consensus methods, allows the input trees to be weighted differently.

Results: On simulated data we show that TuLiP outperforms two existing methods at correctly identifying the true underlying tree used to create the simulations. We also show that the incorporation of weights can lead to more accurate tree inference. On a Triple-Negative Breast Cancer data set we show that including confidence weights can have important impacts on the consensus tree identified.
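To make the weighted consensus objective concrete, the following is a minimal sketch and not the TuLiP algorithm itself. It assumes each tumor tree is given as a set of parent-child edges between mutations, uses the size of the symmetric difference of edge sets as the distance (an illustrative choice), and searches only among the input trees rather than solving an ILP over all trees. All names and the example trees are hypothetical.

```python
from typing import FrozenSet, List, Tuple

Edge = Tuple[str, str]  # (parent mutation, child mutation)


def parent_child_distance(a: FrozenSet[Edge], b: FrozenSet[Edge]) -> int:
    """Count parent-child edges that appear in exactly one of the two trees."""
    return len(a ^ b)


def weighted_consensus(
    trees: List[FrozenSet[Edge]], weights: List[float]
) -> Tuple[int, float]:
    """Return (index, cost) of the input tree with the smallest weighted
    sum of distances to all input trees.

    Assumption: candidates are restricted to the input trees. A full
    solver would search the whole tree space instead.
    """
    best_index, best_cost = -1, float("inf")
    for i, candidate in enumerate(trees):
        cost = sum(
            w * parent_child_distance(candidate, t)
            for t, w in zip(trees, weights)
        )
        if cost < best_cost:
            best_index, best_cost = i, cost
    return best_index, best_cost


# Hypothetical input: one chain-shaped tree and two identical branching trees.
chain = frozenset({("A", "B"), ("B", "C")})   # A -> B -> C
star = frozenset({("A", "B"), ("A", "C")})    # A -> B and A -> C
trees = [chain, star, star]

equal = weighted_consensus(trees, [1.0, 1.0, 1.0])   # majority shape wins
skewed = weighted_consensus(trees, [5.0, 1.0, 1.0])  # heavy weight on the chain
print(equal, skewed)
```

With equal weights the branching tree is selected because two of the three inputs agree with it. Giving the chain a weight of 5 flips the choice, which illustrates how confidence weights can change the consensus.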

13:40-14:00
Joint copy number and mutation phylogeny reconstruction from single-cell amplicon sequencing data
Room: Pasteur Auditorium
Format: Live from venue

Moderator(s): Edward Braun

  • Etienne Sollier, DKFZ, Germany
  • Jack Kuipers, ETH Zürich, Switzerland
  • Koichi Takahashi, MD Anderson, United States
  • Niko Beerenwinkel, ETH Zürich, Switzerland
  • Katharina Jahn, FU Berlin, Germany


Presentation Overview: Show

Reconstructing the history of somatic DNA alterations can help understand the evolution of a tumor and predict its resistance to treatment. Single-cell DNA sequencing (scDNAseq) can be used to investigate clonal heterogeneity and to inform phylogeny reconstruction. However, most existing phylogenetic methods for scDNAseq data are designed either for single nucleotide variants (SNVs) or for large copy number alterations (CNAs), or are not applicable to targeted sequencing. Here, we develop COMPASS, a computational method for inferring the joint phylogeny of SNVs and CNAs from targeted scDNAseq data. COMPASS assigns a likelihood to trees of somatic events based on a probabilistic model and uses a Markov Chain Monte Carlo approach to search for the best tree. It is applicable to targeted sequencing datasets where the coverage is not uniform across regions, and scales to datasets of more than 10,000 cells. We evaluate COMPASS on simulated data and apply it to several datasets including a cohort of 123 patients with acute myeloid leukemia. COMPASS detected clonal CNAs that could be orthogonally validated with bulk data, in addition to subclonal ones that require single-cell resolution, some of which point toward convergent evolution.

14:00-14:20
PhyClone: Accurate Bayesian reconstruction of cancer phylogenies from bulk sequencing
Room: Pasteur Auditorium
Format: Live from venue

Moderator(s): Edward Braun

  • Emilia Hurtado, The University of British Columbia, Canada
  • Alexandre Bouchard-Côté, The University of British Columbia, Canada
  • Andrew Roth, The University of British Columbia, Canada


Presentation Overview: Show

Cancer is driven by somatic mutations that result in genomically distinct sub-populations of cells called clones. Identifying the clonal composition of tumours and understanding the evolutionary relationships between clones is crucial in cancer studies. Previous methods have limitations in inferring the phylogeny and capturing the uncertainty in mutational clustering from bulk DNA sequencing data.
Leveraging the clonal population deconvolution model of PyClone, we present an accurate, efficient, and robust method for constructing clonal phylogenies — PhyClone. It uses a novel non-parametric Bayesian prior called the Forest Structured Chinese Restaurant Process (FSCRP) to capture the underlying distribution of clusters and tree topologies. A Particle Gibbs sampler based on a novel auxiliary variable construction is used to fit PhyClone. Furthermore, through outlier modelling, PhyClone is robust to violations of the infinite sites assumption common to cancer datasets.
We demonstrate the performance of PhyClone on simulated and real-world datasets and show that it outperforms previous methods in terms of accuracy and scalability. PhyClone accurately clusters mutations into clonal groups and reconstructs their phylogenetic relationships, remaining accurate when mutations which violate the infinite sites assumption are present. PhyClone thereby presents a scalable, accurate, and robust solution to inferring clonal phylogenies from bulk sequencing data.

14:20-14:40
GRITIC sheds light on the evolution of copy number gains in genome doubled tumors
Room: Pasteur Auditorium
Format: Live from venue

Moderator(s): Lars Arvestad

  • Toby Baker, The Francis Crick Institute, London, United Kingdom
  • Siqi Lai, Department of Genetics, The University of Texas MD Anderson Cancer Center, Houston, Texas, United States
  • Stefan Dentro, DKFZ, Heidelberg, Germany
  • Maxime Tarabichi, Institute for Interdisciplinary Research (IRIBHM), Université Libre de Bruxelles, Brussels, Belgium
  • Peter Van Loo, Department of Genetics, The University of Texas MD Anderson Cancer Center, Houston, Texas, United States


Presentation Overview: Show

Tumors frequently have a high degree of copy number instability and often contain genomic regions that have undergone a series of genomic gains resulting in multiple copies of both alleles. This is particularly common in tumors that have undergone a whole genome duplication (WGD).

The relative timing of gains in such complex copy number regions is complicated by the fact that there are multiple plausible evolutionary histories that could give rise to the final copy number state, with the most parsimonious history often assumed. Here we describe a method, GRITIC, that overcomes this problem by inferring the likelihood of all possible routes and associated gain timings that lead to a complex state.

By applying GRITIC to 5718 tumors with a WGD, we measure an average posterior probability of 31.7% on non-parsimonious route histories across complex copy number states. As this was measured with a penalty on non-parsimony, we are likely underestimating the true amount of non-parsimonious evolution in tumor development.

GRITIC allows for a more accurate and complete inference of evolutionary histories in different cancer types and better insights into the early copy number events in genomically unstable tumors.

14:40-14:50
Visualizing Clonal Evolution with clevRvis
Room: Pasteur Auditorium
Format: Live-stream

Moderator(s): Lars Arvestad

  • Sarah Sandmann, Institute of Medical Informatics, Germany
  • Clara Inserte, Institute of Medical Informatics, Germany
  • Julian Varghese, Institute of Medical Informatics, Germany


Presentation Overview: Show

Accurate reconstruction of clonal evolution is essential for the application of precision oncology. It allows for the early detection of newly developing, highly aggressive subclones that might be resistant to therapy and can potentially lead to relapse. However, the analysis faces several challenges. Often, data are available for only a few time points. As a consequence, the reconstructed clonal evolution is incomplete, lacking information on a tumor's development over time and its response to therapy. Furthermore, bi-allelic events, which are considered highly relevant in many cancers, cannot be depicted properly.
To address these challenges, we developed clevRvis – an R/Bioconductor package providing a wide set of innovative visualization techniques for clonal evolution in R. Three different representations are available: shark plots (graph-based), dolphin plots (fish plot-like) and plaice plots (allele-aware). Phylogeny-aware color coding is available by default. Plots are generated on the basis of a seaObject, optionally containing automatically interpolated time points and/or estimated therapy effect. Alternative trees, determined on the basis of the user-defined cancer cell fractions, can be explored interactively. A shiny interface allows for user-friendly analysis.
In conclusion, clevRvis provides novel techniques for visualizing clonal evolution, contributing to a better understanding of a tumor's development.

14:50-15:00
Enhancing Phylogenetic Data Interpretation with TreeProfiler, PhyloCloud, and ETE Toolkit
Room: Pasteur Auditorium
Format: Live from venue

Moderator(s): Lars Arvestad

  • Ziqi Deng, Centro de Biotecnología y Genómica de Plantas, Spain
  • Jorge Botas, Baylor College of Medicine, Houston, Texas, Spain
  • Jordi Burguet-Castell, Centro de Biotecnología y Genómica de Plantas, Madrid, Spain
  • Ana Hernández-Plaza, Centro de Biotecnología y Genómica de Plantas, Madrid, Spain
  • Jaime Huerta-Cepas, Centro de Biotecnología y Genómica de Plantas, Madrid, Spain


Presentation Overview: Show

Phylogenomic data analysis and interpretation require custom bioinformatic and visualization workflows that are largely inaccessible to non-expert users. Here, we present the latest advances in three bioinformatic tools: TreeProfiler, PhyloCloud, and ETE Toolkit, which aim to simplify the annotation, handling, and visualization of large phylogenies. TreeProfiler is a command-line tool that allows users to easily annotate phylogenies using any kind of additional metadata, automatically computing summary annotations for internal nodes and providing multiple predefined layouts for interactive visualization (Fig 1). PhyloCloud is an online platform for hosting, indexing, and exploring tree collections of any size, with access to common analyses and operations like tree rooting and orthology detection (Fig 2). TreeProfiler and PhyloCloud utilize the latest version of ETE Toolkit, a Python library for programmatic and interactive visualization of large trees. Together, they offer an integrated solution for comparative genomics, metagenomics, and phylogenomics. During this presentation, we will focus on showcasing the most useful features, including recent use-case examples such as the phylogenomic analysis of 28,211 Spongilla lacustris gene trees (Musser et al. 2021), the orthology prediction and visualization of 4 million phylogenies in the eggNOG v6 database (Hernández-Plaza et al. 2023), and the visual exploration of taxonomic profile results in metagenomics.

15:30-15:40
Orthology inference at scale with FastOMA
Room: Pasteur Auditorium
Format: Live from venue

Moderator(s): Lars Arvestad

  • Sina Majidian, University of Lausanne, Switzerland
  • Yannis Nevers, University of Lausanne, Switzerland
  • Ali Yazdizadeh Kharrazi, University of Lausanne, Switzerland
  • David Moi, University of Lausanne, Switzerland
  • Natasha Glover, SIB Swiss Institute of Bioinformatics, Switzerland
  • Adrian Altenhoff, ETH Zurich, Switzerland
  • Christophe Dessimoz, University of Lausanne, Switzerland


Presentation Overview: Show

Genome data keeps piling up, with efforts to sequence as many as 1.5 million eukaryotic species within a decade. This denser sampling of the tree of life could transform our understanding of evolution by allowing the reconstruction of lineage-defining biological processes: where and when they emerged, how they evolved, and what genetic changes enabled biological innovation. However, delivering on these promises requires a profound overhaul of conventional comparative genomics methods, which are suitable for comparing tens to hundreds of genomes but struggle beyond that. Inferring orthologous groups is complex and computationally demanding, and state-of-the-art methods tend to scale at best quadratically. To overcome this challenge, we developed “FastOMA”, a complete rewrite of the OMA algorithm focused on scalability from the ground up. FastOMA combines k-mer-based placement, species-tree guided subsampling, and highly parallel computing to achieve near-linear performance in the number of input genomes. This makes it possible to process all 2180 eukaryotic UniProt reference proteomes within a single day using 300 CPUs. Remarkably, FastOMA maintains the high accuracy of the well-established OMA approach in Quest for Orthologs benchmarks. FastOMA is available at https://github.com/DessimozLab/FastOMA/.
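As a rough illustration of the general idea behind k-mer-based placement, the sketch below assigns a query protein to the reference family with which it shares the most k-mers. This is a toy example under stated assumptions and is not FastOMA's implementation: the family names, sequences, function names, and the simple vote-counting rule are all hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Optional, Set


def kmers(seq: str, k: int) -> Set[str]:
    """Return the set of all length-k substrings of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_index(families: Dict[str, List[str]], k: int) -> Dict[str, Set[str]]:
    """Map every k-mer to the set of families whose members contain it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for family, sequences in families.items():
        for seq in sequences:
            for kmer in kmers(seq, k):
                index[kmer].add(family)
    return index


def place(query: str, index: Dict[str, Set[str]], k: int) -> Optional[str]:
    """Return the family sharing the most k-mers with the query, or None
    if no k-mer of the query occurs in the index."""
    votes: Counter = Counter()
    for kmer in kmers(query, k):
        for family in index.get(kmer, ()):
            votes[family] += 1
    if not votes:
        return None
    return votes.most_common(1)[0][0]


# Hypothetical reference families with made-up sequences.
families = {
    "kinase": ["MKVLAAGIVG", "MKVLSAGIVG"],
    "globin": ["MVHLTPEEKS"],
}
index = build_index(families, k=3)

print(place("KVLAAGI", index, k=3))
print(place("HLTPEE", index, k=3))
print(place("WWWWW", index, k=3))
```

Because each query is looked up against a prebuilt index instead of being aligned against every other sequence, the cost per query does not grow with the number of pairwise comparisons, which is the intuition behind avoiding quadratic scaling.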

15:40-16:00
A Probabilistic Programming Approach to Investigate the Coevolution of Genes and Phenotypes in Birds
Room: Pasteur Auditorium
Format: Live from venue

Moderator(s): Lars Arvestad

  • Viktor Senderov, École normale supérieure, France
  • Amaury Lambert, École normale supérieure, France
  • Marie Manceau, Collège de France, France
  • Carole Desmarquet, Collège de France, France
  • Caitlyn Jean-Baptiste, École normale supérieure, France
  • Ingrid Lafontaine, Sorbonne Université, France
  • Hélène Morlon, École normale supérieure, France


Presentation Overview: Show

Understanding the molecular basis of phenotypic evolution is essential, yet quantitative tools for exploring this association are limited. We present progress on a novel phylogenetic tool using TreePPL, a universal probabilistic programming language, for identifying genomic regions coevolving with phenotypes at the macroevolutionary scale. Our Monte-Carlo inference-based computational framework allows the detection of simultaneous evolution between DNA sequences and phenotypes across a phylogeny. Simulations demonstrate the ability to identify simultaneous versus independent evolutionary events.

We apply our framework to study the molecular basis of bird color patterns, using a dataset comprising high-resolution images and measurements of bird color patterns and homologous sequences of pattern formation genes (e.g., agouti). By applying our new phylogenetic tool to these data, we aim to detect simultaneous evolution between DNA sequences and phenotypes and discover associations between development and evolution. This will provide a proof of concept for our phylogenetic approach's ability to detect genes underlying phenotypic differences, with potential applications to other genes and systems. Our approach should offer a valuable tool for the scientific community.

16:00-16:10
L-shaped distribution of the relative substitution rate (c/μ) observed for SARS-COV-2’s genome, inconsistent with the selectionist theory, the neutral theory and the nearly neutral theory but a near-neutral balanced selection theory
Room: Pasteur Auditorium
Format: Live from venue

Moderator(s): Lars Arvestad

  • Chun Wu, Rowan University, United States
  • Nicholas Paradis, Rowan University, United States
  • Phillip Lakernick, Rowan University, United States
  • Mariya Hryb, Rowan University, United States


Presentation Overview: Show

The COVID-19 pandemic, which has claimed over 6 million lives, is caused by SARS-CoV-2. Understanding the evolutionary nature of this virus is critical to elucidating its origin and to updating vaccines and therapeutics to mitigate this pandemic. Yet all three existing evolution theories (the Selectionist Theory/ST, Kimura’s Neutral Theory/KNT and Ohta’s Nearly Neutral Theory/ONNT) fail to explain the evolutionary nature of this virus. In this study, we propose a new hybrid theory between ST and a nearly neutral theory to explain the observed genomic features of this virus: the Near-Neutral Balanced Selection Theory (NNBST). For the very first time, our NNBST can explain a molecular clock feature of time-independent GSR from a balanced selection mechanism rather than a neutral mechanism that has been the mainstream belief over the last 60 years. In other words, the higher substitution rates of genomic segments (e.g., genes) under positive selection are balanced out by the lower substitution rates of genomic segments under negative selection, leading to an apparent time-independent GSR under apparent neutral selection. Our relative substitution rate method provides a tool to resolve the long-standing “neutralist-selectionist” controversy. Implications of NNBST in resolving Lewontin’s Paradox are also discussed.

16:10-16:30
Panel: Panel discussion
Room: Pasteur Auditorium
Format: Live from venue

Moderator(s): Nadia Mabrouk

  • Nadia Mabrouk
  • Dannie Durand