EvolCompGen COSI

Schedule subject to change
All times listed are in CDT
Wednesday, July 13th
10:30-11:00
Proceedings Presentation: Phylovar: Towards scalable phylogeny-aware inference of single-nucleotide variations from single-cell DNA sequencing data
Room: GJ
Format: Live from venue

Moderator(s): Giltae Song

  • Mohammadamin Edrisi, Rice University, United States
  • Monica Valecha, University of Vigo, Spain
  • Sunkara B. V. Chowdary, Indian Institute of Technology Kanpur, India
  • Sergio Robledo, University of Houston, United States
  • Huw Ogilvie, Rice University, United States
  • David Posada, University of Vigo, Spain
  • Hamim Zafar, Indian Institute of Technology Kanpur, India
  • Luay Nakhleh, Rice University, United States


Presentation Overview:

Single-nucleotide variants (SNVs) are the most common variations in the human genome. Recently developed methods for SNV detection from single-cell DNA sequencing (scDNAseq) data, such as SCIΦ and scVILP, leverage the evolutionary history of the cells to overcome the technical errors associated with single-cell sequencing protocols. Despite being accurate, these methods are not scalable to the extensive genomic breadth of single-cell whole-genome (scWGS) and whole-exome sequencing (scWES) data.
Here we report on a new scalable method, Phylovar, which extends the phylogeny-guided variant calling approach to sequencing datasets containing millions of loci. Through benchmarking on simulated datasets under different settings, we show that Phylovar outperforms SCIΦ in terms of running time while being more accurate than Monovar (which is not phylogeny-aware) in terms of SNV detection. Furthermore, we applied Phylovar to two real biological datasets: an scWES triple-negative breast cancer dataset consisting of 32 cells and 3375 loci, and an scWGS dataset of neurons from a normal human brain containing 16 cells and approximately 2.5 million loci. For the cancer data, Phylovar detected somatic SNVs with high or moderate functional impact that were also supported by a bulk sequencing dataset. For the neuron dataset, Phylovar identified 5745 SNVs with non-synonymous effects, some of which were associated with neurodegenerative diseases. We implemented Phylovar and made it publicly available at https://github.com/mae6/Phylovar.git.

11:00-11:20
Phyloformer: fast and accurate phylogeny estimation with self-attention networks
Room: GJ
Format: Live-stream

Moderator(s): Giltae Song

  • Luca Nesterenko, LBBE, UMR 5558, Université Lyon 1, CNRS, France
  • Johanna Trost, LBBE, UMR 5558, Université Lyon 1, CNRS, France
  • Bastien Boussau, LBBE, UMR 5558, Université Lyon 1, CNRS, France
  • Laurent Jacob, LBBE, UMR 5558, Université Lyon 1, CNRS, France


Presentation Overview:

State-of-the-art likelihood methods for phylogenetic reconstruction
have limited applicability due to their high computational
cost. Recently, supervised learning approaches have been proposed,
with the hope of reaching the same accuracy with a faster inference time.
These approaches simulate multiple sequence alignments (MSAs) evolved
along known trees, and use this simulated data to train a deep neural
network that classifies MSAs among possible topologies. These attempts
have been mostly limited to the reconstruction of quartet trees, as
adding more leaves rapidly leads to an intractable number of possible
topologies.

Here we introduce Phyloformer, a radically different approach relying
on self-attention. Given an MSA, Phyloformer estimates all pairwise
evolutionary distances between sequences, which then allows us to
accurately reconstruct the tree topology with a classical
distance-based algorithm. Self-attention provides an expressive
mechanism that models how each pair (resp. site) should share information with the
others within the MSA. It also yields permutation-equivariant
functions and accommodates MSAs of varying sizes.

We show on simulations under different evolution models that
Phyloformer outperforms both previous supervised learning models and
distance methods, and reaches accuracies comparable to maximum
likelihood methods in a fraction of the time.
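
The permutation-equivariance property mentioned above can be illustrated with a minimal sketch. This is not Phyloformer's architecture: it assumes a single attention head with identity query, key, and value projections, and the `embeddings` values are made up for illustration. Reordering the input rows reorders the output rows in the same way, and the same function accepts any number of rows.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(rows):
    """Single-head self-attention with identity projections (an assumption).

    Each output row is a weighted average of all input rows, with weights
    given by a softmax over scaled dot products.
    """
    dim = len(rows[0])
    output = []
    for query in rows:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in rows
        ]
        weights = softmax(scores)
        output.append(
            [sum(w * value[j] for w, value in zip(weights, rows)) for j in range(dim)]
        )
    return output


# Hypothetical embeddings, one row per sequence (or per pair of sequences).
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
perm = [2, 0, 1]

out = self_attention(embeddings)
out_perm = self_attention([embeddings[i] for i in perm])
# Equivariance: out_perm[i] matches out[perm[i]] up to rounding.
```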

11:20-11:40
Studying the Evolution of CRISPR-Cas Systems using SuperDTL Reconciliation
Room: GJ
Format: Live-stream

Moderator(s): Giltae Song

  • Yoann Anselmetti, University of Sherbrooke, Canada
  • Mattéo Delabre, University of Montreal, Canada
  • Nadia El-Mabrouk, University of Montreal, Canada


Presentation Overview:

CRISPR-Cas systems are adaptive bacterial immune systems that target bacteriophages. One of them, CRISPR-Cas9, is well known as the most reliable and accurate “molecular scissor” biotechnology for genome editing. Cas genes, the CRISPR-associated genes, which are organized in clusters (or syntenies), represent important functional elements of all CRISPR-Cas systems. To the best of our knowledge, none of the studies undertaken so far on CRISPR-Cas gene syntenies takes the species tree topology into account. Duplication-Loss-Transfer (DTL) reconciliation is one of the most powerful tools for studying the evolution of microbial gene families. One of its major drawbacks, however, is that gene families are considered independently of one another, which excludes the co-evolution of genes grouped into clusters (or syntenies). In this poster, we present SuperDTL, an algorithm that we developed for solving an extended DTL reconciliation model, namely a model accounting for segmental events. In this model, in addition to inferring events as in classical reconciliation, ancestral syntenies (modeled as gene sets) also need to be inferred. We also present the evolutionary scenarios inferred by SuperDTL on the Class 1 Cas gene synteny subtypes described by Makarova et al. in bacterial genomes, using different cost values for events.

11:40-12:10
Proceedings Presentation: Reconstructing tumor clonal lineage trees incorporating single nucleotide variants, copy number alterations, and structural variations
Room: GJ
Format: Live from venue

Moderator(s): Giltae Song

  • Xuecong Fu, Carnegie Mellon University, United States
  • Haoyun Lei, Carnegie Mellon University, United States
  • Yifeng Tao, Carnegie Mellon University, United States
  • Russell Schwartz, Carnegie Mellon University, United States


Presentation Overview:

Cancer develops through a process of clonal evolution in which an initially healthy cell gives rise to progeny gradually differentiating through the accumulation of genetic and epigenetic mutations. These mutations can take various forms, including single nucleotide variants (SNVs), copy number alterations (CNAs), or structural variations (SVs), with each variant type providing complementary insights into tumor evolution as well as offering distinct challenges to phylogenetic inference. In the present work, we develop a tumor phylogeny method, TUSV-ext, that incorporates SNVs, CNAs, and SVs into a single inference framework. We demonstrate on simulated data that the method produces accurate tree inferences in the presence of all three variant types. We further demonstrate the method through application to real prostate tumor data, showing how our approach to coordinated phylogeny inference and clonal construction with all three variant types can reveal a more complicated clonal structure than is suggested by prior work, consistent with extensive polyclonal seeding or migration.

12:10-12:30
Multi-modal Transformer based deep neural network for determining false positive structural variation calls
Room: GJ
Format: Live from venue

Moderator(s): Giltae Song

  • Taeyoung Kim, Pusan National University, South Korea
  • Giltae Song, Pusan National University, South Korea


Presentation Overview:

Structural variations (SVs) alter large genomic segments and are generally considered to be associated with genetic diversity and complex diseases. Although many SV calling tools have been developed as sequencing technology has advanced, they suffer from high false discovery rates (FDRs) for complex sequence patterns, so it is important to filter out false positive SVs. Some existing filtering tools are based on random forests and convolutional neural networks, but some false positive SVs remain unresolved, with FDRs above 30%.
In this study, we propose a multi-modal Transformer-based deep neural network for filtering out false positive SVs obtained by three major SV callers. The SV calls are converted to images, signals, and tabular information in pre-processing steps, and these are fed into the multi-modal Transformer. After building the filtering model, we evaluate it on reserved test data collected from the 1000 Genomes Project phase 3, reanalyzed with GRCh38, and quantify how many false positive SVs are filtered out using the F1-score. We believe that our model can be a useful tool for validating SVs and can contribute to studies of the association between SVs and complex diseases.
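
For readers unfamiliar with the two metrics named above, the following sketch shows how they are computed from counts. The counts are invented for illustration, and treating "false positive SV call" as the positive class of the filter is an assumption, not something the abstract specifies.

```python
def false_discovery_rate(true_calls, false_calls):
    """Fraction of reported SV calls that are false positives."""
    return false_calls / (true_calls + false_calls)


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall, 0.0 when undefined."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


# Hypothetical caller output: 70 real SVs and 30 spurious ones.
fdr = false_discovery_rate(true_calls=70, false_calls=30)

# Hypothetical filter result, with "spurious call" as the positive class:
# 24 spurious calls flagged (tp), 6 real calls wrongly flagged (fp),
# 6 spurious calls missed (fn).
f1 = f1_score(tp=24, fp=6, fn=6)
```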

14:30-15:00
Proceedings Presentation: A LASSO-based approach to sample sites for phylogenetic tree search
Room: GJ
Format: Live from venue

Moderator(s): Dannie Durand

  • Noa Ecker, Tel Aviv University, Israel
  • Dana Azouri, Tel Aviv University, Israel
  • Ben Bettisworth, Heidelberg Institute for Theoretical Studies, Germany
  • Alexandros Stamatakis, Heidelberg Institute for Theoretical Studies, Germany
  • Yishay Mansour, Tel Aviv University, Israel
  • Itay Mayrose, Tel Aviv University, Israel
  • Tal Pupko, Tel Aviv University, Israel


Presentation Overview:

In recent years, full-genome sequences have become increasingly available and as a result many modern phylogenetic analyses are based on very long sequences, often with over 100,000 sites. Phylogenetic reconstructions of large-scale alignments are challenging for likelihood-based phylogenetic inference programs and usually require using a powerful computer cluster. Current tools for alignment trimming prior to phylogenetic analysis do not promise a significant reduction in the alignment size and are claimed to have a negative effect on the accuracy of the obtained tree.
Here, we propose an artificial-intelligence based approach, which provides means to select the optimal subset of sites and a formula by which one can compute the log-likelihood of the entire data based on this subset. Our approach is based on training a regularized Lasso-regression model that optimizes the log-likelihood prediction accuracy while putting a constraint on the number of sites used for the approximation. We show that computing the likelihood based on 5% of the sites already provides accurate approximation of the tree likelihood based on the entire data. Furthermore, we show that using this Lasso-based approximation during a tree search decreased running-time substantially while retaining the same tree-search performance.
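
The idea of predicting a total log-likelihood from a sparse, weighted subset of sites can be sketched with a small Lasso solver. This is a toy under stated assumptions, not the authors' implementation: the solver is plain coordinate descent, the feature matrix `site_scores` is a made-up stand-in for centered per-site log-likelihoods (rows are candidate trees, columns are sites), and sites 0/1 and 2/3 are deliberately duplicated so that the L1 penalty has redundant sites to discard.

```python
def soft_threshold(rho, alpha):
    """Shrink rho toward zero by alpha; values within [-alpha, alpha] become 0."""
    if rho > alpha:
        return rho - alpha
    if rho < -alpha:
        return rho + alpha
    return 0.0


def lasso_coordinate_descent(X, y, alpha, sweeps=20):
    """Minimize (1/2n)*||y - Xw||^2 + alpha*||w||_1 by cyclic coordinate descent."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    for _ in range(sweeps):
        for j in range(p):
            rho = 0.0
            scale = 0.0
            for i in range(n):
                # Prediction using every feature except j.
                partial = sum(w[k] * X[i][k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - partial)
                scale += X[i][j] ** 2
            w[j] = soft_threshold(rho / n, alpha) / (scale / n)
    return w


# Hypothetical data: 4 trees x 4 sites; columns 0 and 1 are identical,
# as are columns 2 and 3.
site_scores = [
    [1.0, 1.0, 1.0, 1.0],
    [1.0, 1.0, -1.0, -1.0],
    [-1.0, -1.0, 1.0, 1.0],
    [-1.0, -1.0, -1.0, -1.0],
]
total_scores = [sum(row) for row in site_scores]  # target: sum over all sites

weights = lasso_coordinate_descent(site_scores, total_scores, alpha=0.125)
selected_sites = [j for j, w in enumerate(weights) if w != 0.0]
```

In this toy, one site from each duplicated pair is kept and the other is dropped; the kept weights are slightly shrunk by the penalty, which is the usual Lasso trade-off between sparsity and bias.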

15:00-15:20
Sampling representative taxa on a phylogeny with PARNAS
Room: GJ
Format: Live from venue

Moderator(s): Dannie Durand

  • Alexey Markin, National Animal Disease Center, USDA-ARS, United States
  • Sanket Wagle, Iowa State University, United States
  • Siddhant Grover, Iowa State University, United States
  • Amy Vincent-Baker, National Animal Disease Center, USDA-ARS, United States
  • Oliver Eulenstein, Iowa State University, United States
  • Tavis Anderson, National Animal Disease Center, USDA-ARS, United States


Presentation Overview:

The broad use of next-generation sequencing technology has enabled phylogenetic studies with hundreds of thousands of taxa. Such large-scale phylogenies have become a critical component in genomic epidemiology for pathogens such as SARS-CoV-2 and influenza A virus. However, detailed phenotypic characterization of viruses or generating a computationally tractable dataset for detailed phylogenetic analyses requires bias-free subsampling of taxa. To address this need, we propose PARNAS, an objective and flexible algorithm to sample and select taxa that best represent the diversity (by solving a generalized k-medoids problem on a phylogenetic tree). PARNAS solves this problem efficiently and exactly by novel optimizations and adapting algorithms from operations research. For more nuanced studies, taxa can be weighted with metadata or genetic sequence parameters, and the pool of potential representatives can be user-constrained. Further, motivated by influenza A virus genomic surveillance and vaccine design, PARNAS can be applied to identify representative taxa that optimally cover the phylogeny within a specified distance radius. We demonstrated that PARNAS is more efficient and flexible than current approaches, and applied it to influenza A virus in swine to identify antigenically advanced viruses circulating in US pigs. PARNAS is available at https://github.com/flu-crew/parnas.
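
To make the k-medoids objective concrete, here is a brute-force sketch that tries every set of k representatives and keeps the one minimizing the total distance from each taxon to its nearest representative. It is not the PARNAS algorithm, which the abstract describes as exact and efficient; exhaustive search is only practical for tiny inputs. The taxon names and the `positions` used to derive pairwise distances are hypothetical stand-ins for distances measured on a tree.

```python
from itertools import combinations


def k_medoids_brute_force(taxa, dist, k):
    """Return (representatives, cost) minimizing the k-medoids objective."""
    best_reps, best_cost = None, None
    for reps in combinations(taxa, k):
        # Each taxon is charged its distance to the closest representative.
        cost = sum(min(dist[t][r] for r in reps) for t in taxa)
        if best_cost is None or cost < best_cost:
            best_reps, best_cost = reps, cost
    return best_reps, best_cost


# Hypothetical taxa in two well-separated groups.
positions = {"A": 0, "B": 1, "C": 2, "D": 10, "E": 11, "F": 12}
taxa = sorted(positions)
dist = {s: {t: abs(positions[s] - positions[t]) for t in taxa} for s in taxa}

representatives, total_cost = k_medoids_brute_force(taxa, dist, k=2)
```

With these distances the central taxon of each group is selected, since any other choice leaves at least one taxon farther from its representative.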

15:20-15:30
Comparative Phylogenomic Analysis of the Liberibacter Pathogens Associated with Huanglongbing and Zebra Chip
Room: GJ
Format: Live from venue

Moderator(s): Dannie Durand

  • Yongjun Tan, Saint Louis University, United States
  • Cindy Wang, Saint Louis University, United States
  • Theresa Schneider, Saint Louis University, United States
  • Huan Li, Saint Louis University, United States
  • Kylie Swisher Grimm, United States Department of Agriculture-Agricultural Research Service, United States
  • Dapeng Zhang, Saint Louis University, United States


Presentation Overview:

Liberibacter pathogens are the causative agents of several severe crop diseases worldwide, including citrus Huanglongbing and potato zebra chip. These bacteria are endophytic and nonculturable, which makes experimental approaches challenging and highlights the need for bioinformatic analysis in advancing our understanding of Liberibacter pathogenesis. Here, we performed an in-depth comparative phylogenomic analysis of the Liberibacter pathogens and their free-living, nonpathogenic, ancestral species, aiming to identify major genomic changes and determinants associated with their evolutionary transitions in living habitats and pathogenicity. By using ortholog clustering, we identified two sets of genes, which were either lost or gained in the ancestor of the pathogens. Importantly, among the gained genes, we uncovered several previously unrecognized toxins, including new toxins homologous to the EspG/VirA effectors, a YdjM phospholipase toxin, and a secreted endonuclease/exonuclease/phosphatase (EEP) protein. In addition, we analyzed genome differences among multiple strains of zebra chip-associated Liberibacter pathogens and identified several genes involved in host-pathogen interactions. Our results substantially extend the knowledge of the evolutionary events and potential determinants leading to the emergence of endophytic, pathogenic Liberibacter species, which will facilitate the design of functional experiments and the development of new methods for detection and blockage of these pathogens.

16:00-16:30
Proceedings Presentation: QuCo: Quartet-based Co-estimation of Species Trees and Gene Trees
Room: GJ
Format: Live-stream

Moderator(s): Edward Braun

  • Siavash Mirarab, University of California San Diego, United States
  • Maryam Rabiee, University of California San Diego, United States


Presentation Overview:

Motivation: Phylogenomics faces a dilemma: on the one hand, the most accurate species and gene tree estimation methods are those that co-estimate them; on the other hand, these co-estimation methods do not scale to moderately large numbers of species. The summary-based methods, which first infer gene trees independently and then combine them, are much more scalable but are prone to gene tree estimation error, which is inevitable when inferring trees from limited-length data. Gene tree estimation error is not just random noise and can create biases such as long-branch attraction. Co-estimating gene trees and the species tree is known to reduce the gene tree error.
Results: We introduce a scalable likelihood-based approach to co-estimation under the multi-species coalescent model. The method, called Quartet Coestimation (QuCo), takes as input independently inferred distributions over gene trees and computes the most likely species tree topology and internal branch length for each quartet, marginalizing over gene tree topologies. It then updates the gene tree posterior probabilities based on the species tree. The focus on gene tree topologies and the heuristic division to quartets enables fast likelihood calculations.
We benchmark our method with extensive simulations for quartet trees in zones known to produce biased species trees and further with larger trees. We also run QuCo on a biological dataset of bees. Our results show better accuracy than the summary-based approach ASTRAL run on estimated gene trees.
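
The quartet step described in the Results can be sketched as follows. This is an illustrative reading of the abstract, not QuCo's code. It relies on the standard multi-species coalescent result that, for an unrooted quartet with internal branch length t (in coalescent units), a gene tree matches the species topology with probability 1 - (2/3)e^(-t) and takes each of the two alternatives with probability (1/3)e^(-t). The per-gene distributions in `gene_dists`, the grid of branch lengths, and the choice to treat those distributions as likelihood weights are all assumptions.

```python
import math


def quartet_probs(species_topology, t):
    """MSC probabilities of the three unrooted gene-tree quartet topologies."""
    discordant = math.exp(-t) / 3.0
    probs = [discordant] * 3
    probs[species_topology] = 1.0 - 2.0 * discordant
    return probs


def log_likelihood(gene_dists, species_topology, t):
    """Sum over genes of log(sum over gene topologies of prior x weight)."""
    probs = quartet_probs(species_topology, t)
    return sum(
        math.log(sum(p * g for p, g in zip(probs, dist))) for dist in gene_dists
    )


def best_quartet(gene_dists, grid):
    """Grid search over the 3 topologies and candidate branch lengths."""
    candidates = [(s, t) for s in range(3) for t in grid]
    return max(candidates, key=lambda st: log_likelihood(gene_dists, st[0], st[1]))


def update_gene_posterior(dist, species_topology, t):
    """Reweight one gene's topology distribution by the species-tree prior."""
    probs = quartet_probs(species_topology, t)
    weighted = [p * g for p, g in zip(probs, dist)]
    total = sum(weighted)
    return [w / total for w in weighted]


# Hypothetical input: three genes, each with weights over the 3 topologies.
gene_dists = [[0.8, 0.1, 0.1], [0.6, 0.3, 0.1], [0.2, 0.5, 0.3]]
grid = [0.05 * i for i in range(1, 61)]

topology, branch_length = best_quartet(gene_dists, grid)
updated = [update_gene_posterior(d, topology, branch_length) for d in gene_dists]
```

After the update, each gene's distribution shifts toward the selected species topology, which is the sense in which co-estimation can reduce gene tree error.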

16:30-16:50
Read2Tree: scalable and accurate phylogenetic trees from raw reads
Room: GJ
Format: Live from venue

Moderator(s): Edward Braun

  • David Dylus, University of Lausanne, Switzerland
  • Adrian Altenhoff, ETH, Switzerland
  • Sina Majidian, University of Lausanne, Switzerland
  • Fritz Sedlazeck, Baylor College of Medicine, United States
  • Christophe Dessimoz, University of Lausanne, Switzerland


Presentation Overview:

The inference of phylogenetic trees from raw sequencing reads is foundational to biology. However, state-of-the-art phylogenomics requires running complex pipelines, at significant computational and labour costs, with additional constraints on sequencing coverage, assembly and annotation quality. To overcome these challenges, we present Read2Tree, which directly processes raw sequencing reads into groups of corresponding genes. In a benchmark encompassing a broad variety of datasets, our assembly-free approach was 10-100x faster than conventional approaches, and in most cases more accurate (the exception being when sequencing coverage was high and reference species very distant). To illustrate the broad applicability of the tool, we reconstructed a yeast tree of life of 435 species spanning 590 million years of evolution. Applied to Coronaviridae samples, Read2Tree accurately classified highly diverse animal samples and near-identical SARS-CoV-2 sequences on a single tree, thereby exhibiting remarkable breadth and depth. The speed, accuracy, and versatility of Read2Tree enable comparative genomics at scale.

16:50-17:20
Proceedings Presentation: Quintet Rooting: Rooting Species Trees under the Multi-Species Coalescent Model
Room: GJ
Format: Live-stream

Moderator(s): Edward Braun

  • Yasamin Tabatabaee, University of Illinois at Urbana-Champaign, United States
  • Kowshika Sarker, University of Illinois at Urbana-Champaign, United States
  • Tandy Warnow, University of Illinois at Urbana-Champaign, United States


Presentation Overview:

Rooted species trees are a basic model with multiple applications throughout biology, including understanding adaptation, biodiversity, phylogeography, and co-evolution. Because most species tree estimation methods produce unrooted trees, methods for rooting these trees have been developed. However, most rooting methods either rely on prior biological knowledge or assume that evolution is close to clock-like, which is not usually the case. Furthermore, most prior rooting methods do not account for biological processes that create discordance between gene trees and species trees.

Results: We present Quintet Rooting, a polynomial time method for rooting species trees from multi-locus datasets, which is based on a proof of identifiability of the rooted species tree under the multi-species coalescent (MSC) model established by Allman, Degnan, and Rhodes (J Math Biol, 2011). Our simulation study shows that Quintet Rooting is generally more accurate than other rooting methods, except under extreme levels of gene tree estimation error or when the number of genes is very small.

Availability and implementation: Quintet Rooting is available in open source form at https://github.com/ytabatabaee/Quintet-Rooting.
Links to datasets used in this study are available at https://tandy.cs.illinois.edu/datasets.html

17:20-17:40
Non-binary Tree Reconciliation with Endosymbiotic Gene Transfer
Room: GJ
Format: Live-stream

Moderator(s): Edward Braun

  • Nadia El-Mabrouk, University of Montreal, Canada
  • Yoann Anselmetti, Université de Sherbrooke, Canada
  • Mathieu Gascon, Université de Montréal, Canada


Presentation Overview:

Understanding how both nuclear and mitochondrial genomes have been shaped by gene loss, duplication and transfer is important to shed light on several open questions regarding the origin, evolution, and characteristics of the gene coding capacity of eukaryotes. In this presentation, we will explore various strategies for simultaneous resolution and reconciliation of a multifurcated gene tree, with reconciliation accounting for Duplication (D), Loss (L) and endosymbiotic gene transfer (EGT) events. For a given polytomy, a first strategy would be to proceed in two steps: ignoring the genome labeling of leaves, first apply PolytomySolver (an algorithm to solve a single polytomy) to output all resolutions minimizing the cost of a DL-Reconciliation and then, for each resolution (i.e. binary tree), assign a known number of mitochondrial and nuclear genome labels to the leaves in a way that minimizes EGT events. We show that the problem is NP-complete in the general case, and becomes polynomial if genes are specific to a single genome in all but one species. We present an algorithm for this special case, and apply it to a plant dataset. Another strategy would be to proceed in a single step by generalizing PolytomySolver to account simultaneously for duplication, loss and EGT events.

17:40-18:00
An Open and Continuously Updated Fern Tree of Life (FTOL)
Room: GJ
Format: Live-stream

Moderator(s): Edward Braun

  • Joel Nitta, The University of Tokyo, Japan
  • Eric Schuettpelz, Smithsonian Institution, United States
  • Santiago Ramírez-Barahona, Universidad Nacional Autónoma de México, Mexico
  • Wataru Iwasaki, The University of Tokyo, Japan


Presentation Overview:

Thoroughly sampled phylogenies are the foundation of modern evolutionary research. With the continuous growth of DNA sequences in GenBank, it is now possible to assemble maximally sampled phylogenies for nearly any study group. However, as sequences rapidly accumulate, any such phylogeny will become quickly outdated. Furthermore, many sequences in GenBank are mis-identified or poorly annotated, so producing a high-quality phylogeny is not straightforward.

Here, we develop a mostly automated, reproducible, open pipeline to generate a continuously updated phylogeny of ferns (Polypodiopsida) from data in GenBank. Our sampling strategy combines whole plastomes (few taxa, many loci) with commonly sequenced plastid regions (many taxa, few loci) to obtain a global, species-level fern tree of life (FTOL) with high resolution along the backbone and maximal sampling across the tips. We use a curated reference taxonomy in combination with a newly developed R package, ‘taxastand’, to resolve synonyms and remove erroneous accessions.

The current FTOL includes 5,582 species, or nearly half of extant fern diversity (ca. 12,000 species). FTOL and its accompanying datasets will be updated on a regular basis and are available via a web portal (https://fernphy.github.io) and R packages, enabling immediate access to the most up-to-date, comprehensively sampled fern phylogeny.

Thursday, July 14th
10:15-10:25
Comparative Genomic Analysis of Primary Medulloblastoma and Leptomeningeal Metastasis
Room: GJ
Format: Live-stream

Moderator(s): Aida Ouangraoua

  • Ana Isabel Castillo Orozco, McGill University Health Centre Research Institute, Canada
  • Niusha Khazaei, McGill University Health Centre Research Institute, Canada
  • Livia Garzia, McGill University Health Centre Research Institute, Canada


Presentation Overview:

Medulloblastoma (MB), the most common pediatric brain tumor, is highly aggressive and arises mainly in the cerebellum. MB can metastasize to the leptomeningeal space, which is known as Leptomeningeal Disease (LMD). Although LMD represents a major clinical challenge, it is a vastly understudied field, and its molecular mechanisms are poorly characterized. Accordingly, there is an urgent need to develop strategies to study metastatic medulloblastoma. We hypothesize that an in-depth knowledge of the molecular events driving subclones of the primary tumor to metastasize will offer therapeutic targets for effective therapies to treat or prevent LMD. To test this hypothesis, we have established metastatic Patient-Derived Xenografts (PDXs) that faithfully recapitulate LMD features. We have focused our efforts on performing bulk RNA-seq of PDX models to profile LMD intertumoral heterogeneity and to identify genetic drivers and pathways that sustain this compartment. Using ssGSEA, we have found that PDX models retain neoplastic subpopulations previously identified in MB single-cell sequencing studies, with slight changes between primary and leptomeningeal compartments. Furthermore, we observe profound differences in gene expression between primary tumors and LMD. Our results show that primary tumors and LMD are transcriptionally different, with various differentially expressed genes (DEGs) and signaling pathways enriched in more than one LMD PDX model.

10:25-10:55
Proceedings Presentation: Simulating Domain Architecture Evolution
Room: GJ
Format: Live from venue

Moderator(s): Aida Ouangraoua

  • Xiaoyue Cui, Carnegie Mellon University, United States
  • Yifan Xue, Carnegie Mellon University, United States
  • Collin McCormack, Carnegie Mellon University, United States
  • Alejandro Garces, Carnegie Mellon University, United States
  • Thomas Rachman, Carnegie Mellon University, United States
  • Yang Yi, Carnegie Mellon University, United States
  • Maureen Stolzer, Carnegie Mellon University, United States
  • Dannie Durand, Carnegie Mellon University, United States


Presentation Overview:

Simulation is an essential technique for generating biomolecular data with a “known” history for use in validating phylogenetic inference and other evolutionary methods. On longer time scales, simulation supports investigations of equilibrium behavior and provides a formal framework for testing competing evolutionary hypotheses. Twenty years of molecular evolution research have produced a rich repertoire of simulation methods. However, current models do not capture the stringent constraints acting on the domain insertions, duplications, and deletions by which multidomain architectures evolve. Although these processes have the potential to generate any combination of domains, only a tiny fraction of possible domain combinations are observed in nature. Modeling these stringent constraints on domain order and co-occurrence is a fundamental challenge in domain architecture simulation that does not arise with sequence and gene family simulation. Here we introduce a stochastic model of domain architecture evolution to simulate evolutionary trajectories that reflect the constraints on domain order and co-occurrence observed in nature. This framework is implemented in a novel domain architecture simulator, DomArchov, using the Metropolis-Hastings algorithm with data-driven transition probabilities. The use of a data-driven event module enables quick and easy redeployment of the simulator for use in different taxonomic and protein function contexts. Using empirical evaluation with metazoan datasets, we demonstrate that domain architectures simulated by DomArchov recapitulate properties of genuine domain architectures that reflect the constraints on domain order and adjacency seen in nature. This work expands the realm of evolutionary processes that are amenable to simulation.
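
As a rough illustration of how a Metropolis-Hastings sampler can respect adjacency constraints, the sketch below samples fixed-length domain architectures. It is not DomArchov: the proposal only substitutes one domain at a time (a symmetric move, so this is the plain Metropolis special case) rather than modeling insertions, duplications, and deletions, and the `PAIR_WEIGHT` table is invented rather than derived from data. Ordered pairs missing from the table are treated as never observed and receive zero weight.

```python
import random

DOMAINS = ["SH3", "SH2", "Kinase"]

# Hypothetical weights for ordered adjacent domain pairs.
PAIR_WEIGHT = {
    ("SH3", "SH2"): 4.0,
    ("SH2", "Kinase"): 4.0,
    ("SH3", "SH3"): 1.0,
    ("SH2", "SH2"): 1.0,
    ("SH3", "Kinase"): 0.5,
}


def target_weight(arch):
    """Unnormalized target: product of adjacent-pair weights (0 if forbidden)."""
    weight = 1.0
    for pair in zip(arch, arch[1:]):
        weight *= PAIR_WEIGHT.get(pair, 0.0)
    return weight


def propose(arch, rng):
    """Symmetric proposal: replace one uniformly chosen position."""
    i = rng.randrange(len(arch))
    candidate = list(arch)
    candidate[i] = rng.choice(DOMAINS)
    return tuple(candidate)


def metropolis(start, steps, seed=0):
    """Run the chain from an architecture with nonzero target weight."""
    rng = random.Random(seed)
    current = tuple(start)
    samples = []
    for _ in range(steps):
        candidate = propose(current, rng)
        ratio = target_weight(candidate) / target_weight(current)
        # Accept with probability min(1, ratio); zero-weight moves never pass.
        if rng.random() < min(1.0, ratio):
            current = candidate
        samples.append(current)
    return samples


samples = metropolis(("SH3", "SH2", "Kinase"), steps=2000)
```

Because a proposal with zero target weight has acceptance probability zero, the chain never leaves the set of permitted architectures, which is the property the constraints are meant to enforce.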

10:55-11:15
Domain Promiscuity Correlates with Rates of Domain Gain and Loss
Room: GJ
Format: Live from venue

Moderator(s): Aida Ouangraoua

  • Yuting Xiao, Carnegie Mellon University, United States
  • Maureen Stolzer, Carnegie Mellon University, United States
  • Dannie Durand, Carnegie Mellon University, United States


Presentation Overview:

Domain promiscuity is the propensity of a domain to form different combinations with other domains in the same protein. Domain promiscuity varies greatly among domains. One hypothesis is that domain mobility drives domain promiscuity: domains that are easily copied and inserted in new contexts tend to co-occur with many different domains. However, because domain mobility cannot be observed directly, the mobility hypothesis is difficult to test.

Here, we probe the relationship between promiscuity and mobility using estimated rates of domain gain, loss, and duplication as a proxy for domain mobility. Since many measures of domain promiscuity have been proposed, we first asked whether these measures capture different properties. Among 11 proposed measures of domain promiscuity applied to 1283 domain families in 21 selected species, we identified three groups, where the measures are highly correlated within each group, but uncorrelated across groups. Choosing one measure from each group is sufficient for representing all promiscuity measures. We next probed the relationship between domain promiscuity and domain event rates. Domain event rates were inferred using the probabilistic birth-death-gain model in COUNT. Regression analysis of the promiscuity measures and the inferred evolutionary rates revealed highly significant correlations, suggesting mobility may indeed contribute to domain promiscuity.

11:15-11:35
TranscriptDB: A transcript-centric database to study transcript conservation and evolution within gene trees
Room: GJ
Format: Live-stream

Moderator(s): Aida Ouangraoua

  • Wend Yam Donald Davy Ouédraogo, Université de Sherbrooke, Canada
  • Abigail Djossou, Université de Sherbrooke, Canada
  • Aida Ouangraoua, Université de Sherbrooke, Canada


Presentation Overview:

The increasing amount of available genomic sequences calls for effective tools for annotating biological sequences. Inferring the function of a gene from its orthologs has been of great use in comparative genomics. The ortholog conjecture states that orthologous genes diverged little during evolution and share similar functions. The interest in orthology between genes has led to the design of several gene-centred databases. They differ mainly in their method of computing orthology relations between genes, in the number of genomes incorporated, and in the tools provided by their web interfaces. Alternative splicing, which contributes widely to the diversity of transcriptomes and proteomes in eukaryotes, makes the transcript a refined level of functional homology relationships, thus calling for orthology inference methods and databases at the level of transcripts. In this work, we present a transcript-centric database and a new method based on splicing structure to compute clusters of conserved transcripts for the reconstruction of transcript and gene phylogenies.

11:35-11:45
Genomic Diversity and Associated Phenotyping of Escherichia coli Isolated from Poultry in the Southern United States
Room: GJ
Format: Live from venue

Moderator(s): Aida Ouangraoua

  • Aijing Feng, University of Missouri, United States
  • Spencer Leigh, Poultry Research Unit, USDA Agricultural Research Service, Mississippi State, United States
  • Hui Wang, Mississippi State University, United States
  • Todd Pharr, Mississippi State University, United States
  • Jeff Evans, Poultry Research Unit, USDA Agricultural Research Service, Mississippi State, United States
  • Martha Pulido Landinez, Mississippi State University, United States
  • Lanny Pace, Mississippi State University, United States
  • Xiu-Feng Wan, University of Missouri, United States


Presentation Overview:

In domestic poultry, E. coli are typically present as commensal bacteria in the gastrointestinal tract, but some avian pathogenic E. coli (APEC) strains can cause localized and systemic infections. Here we sequenced 188 E. coli isolates from sick poultry samples collected between May 2017 and July 2021. Phylogenetic analyses suggested that a large extent of genetic variation was present among these isolates at the whole-genome level, whereas their 16S rRNA genes clustered into two major groups. These isolates belong to 32 H and 61 O serotypes, and APEC-associated pathogenicity islands were found in all of them and can be isolate-dependent. Using clinical data, we grouped these samples by type of disease and by infection location in the birds. A multi-task LASSO model was used to learn genetic features associated with types of diseases and infection localization. Results showed that the 32 genes we identified had biological functions of binding, transporter activity, ion transport, transcription regulator activity, ATP-dependent activity and catalytic activity, and that four of them were independent of disease type and infection localization. The knowledge derived from this study could be useful for designing a broadly protective E. coli vaccine candidate for domestic poultry.
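
As an illustration of the multi-task LASSO idea mentioned in this abstract, the following is a minimal sketch and not the authors' pipeline. It assumes the common formulation in which an L2,1 penalty keeps or drops each feature jointly across all tasks, solved here by plain proximal gradient descent. The toy gene matrix, the two task scores, the +1/-1 coding and the value of alpha are all hypothetical.

```python
import math

def group_soft_threshold(row, threshold):
    """Shrink one feature's coefficients across all tasks; zero the row if its norm is small."""
    norm = math.sqrt(sum(v * v for v in row))
    if norm <= threshold:
        return [0.0] * len(row)
    scale = 1.0 - threshold / norm
    return [scale * v for v in row]

def multitask_lasso(X, Y, alpha, n_iter=300):
    """Minimise (1/2n)*||Y - XW||_F^2 + alpha * sum_j ||W[j]||_2 by proximal gradient."""
    n, p, k = len(X), len(X[0]), len(Y[0])
    # n / ||X||_F^2 never exceeds 1 / (Lipschitz constant), so this step size is safe.
    step = n / sum(v * v for row in X for v in row)
    W = [[0.0] * k for _ in range(p)]
    for _ in range(n_iter):
        # Residuals of the current fit: one row per sample, one column per task.
        R = [[sum(X[s][j] * W[j][t] for j in range(p)) - Y[s][t] for t in range(k)]
             for s in range(n)]
        # Gradient step on every coefficient, then row-wise (per-gene) shrinkage.
        W = [group_soft_threshold(
                 [W[j][t] - step * sum(X[s][j] * R[s][t] for s in range(n)) / n
                  for t in range(k)],
                 step * alpha)
             for j in range(p)]
    return W

# Hypothetical toy data: 4 isolates, 3 genes coded +1 (present) / -1 (absent),
# and 2 made-up task scores (for example disease type and infection site).
X = [[1, 1, 1], [-1, 1, -1], [1, -1, -1], [-1, -1, 1]]
Y = [[2.1, 1.1], [-1.9, -0.9], [1.9, 0.9], [-2.1, -1.1]]
W = multitask_lasso(X, Y, alpha=0.5)
for gene, row in enumerate(W):
    print("gene", gene, [round(v, 3) for v in row])
```

In this toy run only gene 0 keeps non-zero coefficients, and it keeps them for both tasks at once; the weakly associated gene 1 and the unassociated gene 2 are dropped from both tasks together. In practice a library implementation such as scikit-learn's `MultiTaskLasso` would normally be used instead of a hand-written solver.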

11:45-11:55
Bacterial lipoxygenases could facilitate cross-kingdom host jumps
Room: GJ
Format: Live-stream

Moderator(s): Aida Ouangraoua

  • Georgy Kurakin, Pirogov Russian National Research Medical University, Russia


Presentation Overview:

Lipoxygenases are enzymes that participate in the biosynthesis of oxylipins – oxidized PUFA derivatives. These products perform cell-to-cell signalling functions in multicellular eukaryotes. Lipoxygenases are also present in bacteria, but the functions of these enzymes remain poorly characterized. Most data are available for Pseudomonas aeruginosa, whose lipoxygenase is found to suppress the immune response through host-microbe oxylipin signalling.
In our recently published bioinformatic research, we found bacterial lipoxygenases to be associated with complex structure formation, pathogenicity and symbiosis. Here, we present follow-up research on the link between lipoxygenases, pathogenicity and symbiosis.
We performed phylogenetic analysis of lipoxygenase sequences belonging to plant symbionts, cross-kingdom (plant/animal) pathogens and animal/human pathogens. We found at least three independent series of horizontal transfers of the lipoxygenase gene that link plant symbionts, plant/animal pathogens and animal pathogens together. This means that lipoxygenases are involved in host-microbe signalling in a wide range of bacteria, in a way similar to that in Pseudomonas aeruginosa. Many of these bacteria are associated with plants; others are dangerous nosocomial pathogens. We conclude that lipoxygenases may facilitate cross-kingdom host jumps of bacteria between plants and animals/humans.

11:55-12:15
CACTUS: integrating clonal architecture with genomic clustering and transcriptome profiling of single tumor cells
Room: GJ
Format: Live from venue

Moderator(s): Aida Ouangraoua

  • Shadi Darvish Shafighi, University of Warsaw, Poland
  • Szymon Kielbasa, Leiden University Medical Center, Netherlands
  • Julieta Sepúlveda-Yáñez, Leiden University Medical Center, Netherlands
  • Ramin Monajemi, Leiden University Medical Center, Netherlands
  • Davy Cats, Leiden, Netherlands
  • Leon Mei, LUMC, Netherlands
  • Roberta Menafra, LUMC, Netherlands
  • Susan Kloet, Leiden University Medical Center, Netherlands
  • Hendrik Veelken, Leiden University Medical Center, Netherlands
  • Cornelis A.M. Van Bergen, Leiden University Medical Center, Netherlands
  • Ewa Szczurek, University of Warsaw, Poland


Presentation Overview:

Drawing genotype-to-phenotype maps in tumors is of paramount importance for understanding tumor heterogeneity. Assignment of single cells to their tumor clones of origin can be approached by matching the genotypes of the clones to the mutations found in RNA sequencing of the cells. The confidence of the cell-to-clone mapping can be increased by accounting for additional measurements. Follicular lymphoma, a malignancy of mature B cells that continuously acquire mutations in parallel in the exome and in B cell receptor loci, presents a unique opportunity to join exome-derived mutations with B cell receptor sequences as independent sources of evidence for clonal evolution.

Here, we propose CACTUS, a probabilistic model that leverages the information from an independent genomic clustering of cells and exploits the scarce single-cell RNA sequencing data to map single cells to given imperfect genotypes of tumor clones.

We apply CACTUS to two follicular lymphoma patient samples, integrating three measurements: whole exome, single-cell RNA, and B cell receptor sequencing. CACTUS outperforms a predecessor model by confidently assigning cells and B cell receptor-based clusters to the tumor clones. The integration of independent measurements is the key to improving model performance in the challenging task of charting the genotype-to-phenotype maps in tumors.
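
To make the cell-to-clone matching step described in this abstract concrete, here is a deliberately simplified sketch. It is not the CACTUS model: it ignores the B cell receptor clustering evidence and any correction of imperfect genotypes, and it only scores how well one cell's RNA read counts fit each clone's genotype under a binomial model with a uniform prior. The clone genotypes, read counts and the two rates `theta_mut` and `theta_err` are hypothetical.

```python
import math

def clone_posteriors(alt, depth, clones, theta_mut=0.5, theta_err=0.01):
    """Posterior probability of each clone for one cell, given per-mutation read counts."""
    log_lik = {}
    for name, genotype in clones.items():
        total = 0.0
        for a, d, carried in zip(alt, depth, genotype):
            # Expected fraction of variant reads: high if the clone carries the
            # mutation, otherwise only a small error rate.
            theta = theta_mut if carried else theta_err
            # Binomial log-likelihood; the binomial coefficient is omitted
            # because it is identical for every clone.
            total += a * math.log(theta) + (d - a) * math.log(1.0 - theta)
        log_lik[name] = total
    # Uniform prior over clones; subtract the maximum for numerical stability.
    top = max(log_lik.values())
    weights = {name: math.exp(v - top) for name, v in log_lik.items()}
    norm = sum(weights.values())
    return {name: w / norm for name, w in weights.items()}

# Hypothetical genotypes (1 = clone carries the mutation) for three mutations.
clones = {"clone1": [1, 0, 0], "clone2": [1, 1, 0], "clone3": [1, 1, 1]}
# Hypothetical variant-read and total-read counts for one cell.
alt, depth = [3, 2, 0], [6, 5, 4]

posterior = clone_posteriors(alt, depth, clones)
best = max(posterior, key=posterior.get)
print(best, {name: round(p, 3) for name, p in posterior.items()})
```

With these made-up counts the cell is assigned to clone2, because it shows variant reads for the first two mutations and none for the third. Sparse coverage makes many real cells far less clear-cut, which is where additional independent evidence can raise the confidence of the assignment.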

13:15-13:45
Proceedings Presentation: Bridging the gaps in statistical models of protein alignment
Room: GJ
Format: Live-stream

Moderator(s): Christophe Dessimoz

  • Dinithi Sumanaweera, Wellcome Sanger Institute, United Kingdom
  • Lloyd Allison, Monash University, Australia
  • Arun Konagurthu, Monash University, Australia


Presentation Overview:

Sequences of proteins evolve by accumulating substitutions together with insertions and deletions (indels) of amino acids. However, it remains a common practice to disconnect substitutions and indels, and infer approximate models for each of them separately, to quantify sequence relationships. Although this approach brings with it computational convenience (which remains its primary motivation), there is a dearth of attempts to unify and model them systematically and together. To overcome this gap, this paper demonstrates how a complete statistical model quantifying the evolution of pairs of aligned proteins can be constructed using a time-parameterised substitution matrix and a time-parameterised alignment state machine. Methods to derive all parameters of such a model from any benchmark collection of aligned protein sequences are described here. This has not only allowed us to generate a unified statistical model for each of the nine widely-used substitution matrices (PAM, JTT, BLOSUM, JO, WAG, VTML, LG, MIQS, and PFASUM), but also resulted in a new unified model, MMLSUM. Our underlying methodology measures the Shannon information content using each model to explain losslessly any given collection of alignments, which has allowed us to quantify the performance of all the above models on six comprehensive alignment benchmarks. Our results show that MMLSUM results in a new and clear overall best performance, followed by PFASUM, VTML, BLOSUM and MIQS, respectively, amongst the top five. We further analyse the statistical properties of the MMLSUM model and contrast it with others.
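
The two ingredients named in this abstract, a time-parameterised substitution matrix and a Shannon information measure, can be illustrated with a toy sketch. This is not MMLSUM or the authors' code: it assumes a two-letter alphabet, integer time steps obtained by matrix powers, and gap-free aligned pairs, so the alignment state machine for indels is left out entirely. The matrix `M`, the frequencies `pi` and the example pairs are hypothetical.

```python
import math

def mat_mul(A, B):
    """Plain matrix product of two square matrices given as nested lists."""
    size = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]

def mat_power(M, t):
    """M^t for integer t >= 1: substitution probabilities after t units of divergence."""
    P = M
    for _ in range(t - 1):
        P = mat_mul(P, M)
    return P

def message_length_bits(pairs, M, pi, t, alphabet):
    """Shannon code length (bits) of aligned residue pairs under the model at time t."""
    P = mat_power(M, t)
    idx = {symbol: i for i, symbol in enumerate(alphabet)}
    return -sum(math.log2(pi[idx[a]] * P[idx[a]][idx[b]]) for a, b in pairs)

def best_time(pairs, M, pi, alphabet, max_t=50):
    """Time parameter that explains the pairs in the fewest bits."""
    return min(range(1, max_t + 1),
               key=lambda t: message_length_bits(pairs, M, pi, t, alphabet))

# Hypothetical two-letter model: 10% chance of substitution per time unit.
alphabet = "AB"
M = [[0.9, 0.1], [0.1, 0.9]]
pi = [0.5, 0.5]

close = [("A", "A")] * 9 + [("A", "B")]        # 10% mismatches
diverged = [("A", "A")] * 7 + [("A", "B")] * 3  # 30% mismatches
print(best_time(close, M, pi, alphabet), best_time(diverged, M, pi, alphabet))
```

With this toy matrix the closely related pairs are encoded most compactly at t = 1 and the more diverged pairs at t = 4. The same principle, shorter lossless encodings indicating a better model, is what allows different substitution models to be compared on a common scale.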

13:45-14:05
Cross-database integration using evolution and machine learning to identify multiscale molecular building blocks for antibiotic resistance
Room: GJ
Format: Live-stream

Moderator(s): Christophe Dessimoz

  • Vignesh Sridhar, Michigan State University, United States
  • Joseph Burke, Michigan State University, United States
  • Elliot Majlessi, Michigan State University, United States
  • Karn Jongnarangsin, Michigan State University, United States
  • Arjun Krishnan, Michigan State University, United States
  • Janani Ravi, Michigan State University, United States


Presentation Overview:

Antibiotic resistance (AR) is a high-priority urgent threat. AR pathogens such as the ESKAPE group cause millions of infections and hundreds of thousands of deaths. While current strategies such as genetic and drug screens have helped identify genes and mutations critical for AR in specific pathogens, there is a broad lack of methods to help understand AR’s origin and continuous adaptation. AR can arise in a pathogen via a variety of molecular changes, including acquiring protein domains, individual genes, or metabolic capabilities. Hence, predicting and overcoming AR in emerging pathogens or discovering new AR mechanisms requires a holistic understanding of AR evolution across multiple molecular scales. However, leveraging these diverse datasets is challenging because the original databases are siloed from each other. Further, the different data types are hard to integrate in a biologically meaningful way across scales. Here, we describe a computational discovery framework combining evolutionary analyses and machine learning to integrate microbial data across multiple scales to gain deep mechanistic insights into AR in ESKAPE pathogens. Our framework will be broadly applicable to advance understanding of AR in understudied and emerging AR pathogens (beyond ESKAPE) towards ending the arms race between microbes and drugs by creating better treatment outcomes.

14:05-14:25
BUSTED-PH: distinguish adaptive innovation from reduced constraint
Room: GJ
Format: Live from venue

Moderator(s): Christophe Dessimoz

  • Maria Chikina, University of Pittsburgh, United States
  • Nathan Clark, University of Utah, United States
  • Sergei Pond, Temple University, United States
  • Avery Selberg, Temple University, United States


Presentation Overview:

Convergent traits, whereby traits evolve independently in multiple lineages, provide a powerful tool for the study of organism-level genotype-phenotype relationships. A particularly effective approach for associating convergent phenotypes with genomic features, such as genes or non-coding elements, is to use the variation in evolutionary rates. This approach has been used to investigate diverse phenotypes ranging from lifespan to hair density. However, a major limitation of rate-based methods is that a positive association between phenotype and evolutionary rate is consistent with different evolutionary scenarios. Relaxation of selection and intensification of positive selection both elevate evolutionary rates, but arise from distinct evolutionary mechanisms. While methods based on dN/dS ratios can be employed to detect positive selection, the typical approach for adapting this to the convergent setting results in a high rate of false positives by failing to account for non-specific selection. Here we propose a purpose-built likelihood framework, BUSTED-PH, that directly tests for a shift in positive selection associated with a convergent phenotype. We apply BUSTED-PH to the convergent marine habitat phenotype and show that, unlike rate-based tests, BUSTED-PH identifies developmental and body plan pathways that are good candidates for drivers of adaptive innovation underlying the trait.
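
The ambiguity of rate-based tests described in this abstract can be shown with a small numerical sketch. It is not part of BUSTED-PH; it only assumes that selection on a gene is summarised as a mixture of site classes, each with its own dN/dS ratio (omega) and weight. All omega values and weights below are hypothetical.

```python
def mean_omega(classes):
    """Mean dN/dS of a mixture given as (omega, weight) pairs."""
    return sum(omega * weight for omega, weight in classes)

def weight_under_positive_selection(classes):
    """Total weight of site classes with omega > 1."""
    return sum(weight for omega, weight in classes if omega > 1.0)

# Hypothetical omega distributions for one gene.
background = [(0.1, 0.9), (1.0, 0.1)]    # branches without the trait
relaxed = [(0.5, 0.9), (1.0, 0.1)]       # constraint weakened, no class with omega > 1
intensified = [(0.1, 0.9), (4.6, 0.1)]   # a small class under positive selection

for label, classes in [("background", background),
                       ("relaxed", relaxed),
                       ("intensified", intensified)]:
    print(label, round(mean_omega(classes), 3),
          round(weight_under_positive_selection(classes), 3))
```

Both trait-associated scenarios raise the mean rate by the same amount relative to the background, so a test on the rate alone cannot separate them. Only the share of sites with omega above 1 differs, which is why a test aimed at the omega distribution itself is needed to tell adaptive innovation from reduced constraint.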

14:25-15:15
Panel: EvolCompGen panel discussion
Room: GJ
Format: Live from venue

Moderator(s): Christophe Dessimoz

  • Aida Ouangraoua
  • Dannie Durand
  • Edward Braun


Presentation Overview:

Discussion and summary of presentations and future directions in evolutionary comparative genomics.