Posters - Schedules

Monday, July 11 and Tuesday, July 12 between 12:30 PM CDT and 2:30 PM CDT
Wednesday, July 13 between 12:30 PM CDT and 2:30 PM CDT
Session A Poster Set-up and Dismantle
Session A Posters set up:
Monday, July 11 between 7:30 AM CDT and 10:00 AM CDT
Session A Posters dismantle:
Tuesday, July 12 at 6:00 PM CDT
Session B Poster Set-up and Dismantle
Session B Posters set up:
Wednesday, July 13 between 7:30 AM CDT and 10:00 AM CDT
Session B Posters dismantle:
Thursday, July 14 at 2:00 PM CDT
Virtual: An Open and Continuously Updated Fern Tree of Life (FTOL)
COSI: EvolCompGen
  • Joel Nitta, The University of Tokyo, Japan
  • Eric Schuettpelz, Smithsonian Institution, United States
  • Santiago Ramírez-Barahona, Universidad Nacional Autónoma de México, Mexico
  • Wataru Iwasaki, The University of Tokyo, Japan


Presentation Overview:

Thoroughly sampled phylogenies are the foundation of modern evolutionary research. With the continuous growth of DNA sequences in GenBank, it is now possible to assemble maximally sampled phylogenies for nearly any study group. However, as sequences rapidly accumulate, any such phylogeny will become quickly outdated. Furthermore, many sequences in GenBank are mis-identified or poorly annotated, so producing a high-quality phylogeny is not straightforward.

Here, we develop a mostly automated, reproducible, open pipeline to generate a continuously updated phylogeny of ferns (Polypodiopsida) from data in GenBank. Our sampling strategy combines whole plastomes (few taxa, many loci) with commonly sequenced plastid regions (many taxa, few loci) to obtain a global, species-level fern tree of life (FTOL) with high resolution along the backbone and maximal sampling across the tips. We use a curated reference taxonomy in combination with a newly developed R package, ‘taxastand’, to resolve synonyms and remove erroneous accessions.

The current FTOL includes 5,582 species, or nearly half of extant fern diversity (ca. 12,000 species). FTOL and its accompanying datasets will be updated on a regular basis and are available via a web portal (https://fernphy.github.io) and R packages, enabling immediate access to the most up-to-date, comprehensively sampled fern phylogeny.

Virtual: Bacterial lipoxygenases could facilitate cross-kingdom host jumps
COSI: EvolCompGen
  • Georgy Kurakin, Pirogov Russian National Research Medical University, Russia


Presentation Overview:

Lipoxygenases are enzymes that participate in the biosynthesis of oxylipins – oxidized PUFA derivatives. These products perform cell-to-cell signalling functions in multicellular eukaryotes. Lipoxygenases are also present in bacteria, but the functions of these enzymes remain poorly characterized. Most data are available for Pseudomonas aeruginosa, whose lipoxygenase is found to suppress the immune response through host-microbe oxylipin signalling.
In our recently published bioinformatic research, we found bacterial lipoxygenases to be associated with complex structure formation, pathogenicity and symbiosis. Here, we present follow-up research on the link between lipoxygenases, pathogenicity and symbiosis.
We performed a phylogenetic analysis of lipoxygenase sequences belonging to plant symbionts, cross-kingdom (plant/animal) pathogens and animal/human pathogens. We found at least three independent series of horizontal transfers of the lipoxygenase gene that link plant symbionts, plant/animal pathogens and animal pathogens together. This indicates that lipoxygenases are involved in host-microbe signalling in a wide range of bacteria, in a way similar to that in Pseudomonas aeruginosa. Many of these bacteria are associated with plants; others are dangerous nosocomial pathogens. We conclude that lipoxygenases may facilitate cross-kingdom host jumps of bacteria between plants and animals/humans.

Virtual: Candidate Gene Prioritization and Disease-Gene Discovery Through Phylogenetic Profiling
COSI: EvolCompGen
  • Christina Canavati, The Hebrew University of Jerusalem, Israel
  • Dana Sherill-Rofe, The Hebrew University of Jerusalem, Israel
  • Idit Bloch, The Hebrew University of Jerusalem, Israel
  • Moien Kanaan, Istishari Atab Hospital, Ramallah, Palestine
  • Ephrat Levy-Lahad, Shaare Zedek Medical Center, Jerusalem, Israel
  • Fouad Zahdeh, Shaare Zedek Medical Center, Jerusalem, Israel
  • Yuval Tabach, The Hebrew University of Jerusalem, Israel


Presentation Overview:

Computational interpretations of exome sequencing and genome sequencing data have been revolutionary, providing a precise molecular diagnosis for hundreds of thousands of patients. However, a large number of individuals still have no molecular diagnosis. This could stem from our limited understanding of the genetic basis of human disease or from limitations in the experimental or computational analysis of genomic data. One way to improve computational analysis of such data is to integrate unbiased information that reaches beyond the current knowledge base. Phylogenetic profiling offers such a new perspective. This led us to develop EvoRanker, which uses clade-wise phylogenetic profiling across 1028 genomes to identify candidate disease-causing genes based on their global or local co-evolution with genes previously associated with the disease. We aim to prioritize patient candidate genes specifically in “unsolved” NGS case studies that may harbor unannotated genes or have no clear link to the patient’s phenotype. Benchmarking on previously solved exome data revealed that the “true” gene was ranked among the top 5 candidates in ~78% of the cases based on phylogenetic profiling (PP) alone. This approach yielded results comparable to those of other available tools. Yet the results show complementarity amongst the top-ranked genes, identifying genes that other existing tools could not identify. Remarkably, analysis of unsolved exome cases revealed two potential novel genes associated with previously undescribed genetic syndromes. Our platform, which scans global and local coevolution across hundreds of genomes, presents a complementary approach to pinpoint patient candidate genes that merit further investigation.
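
As a rough illustration of the phylogenetic-profiling idea described above (not EvoRanker itself), the sketch below ranks candidate genes by how well their presence/absence profiles across genomes correlate with those of known disease genes. The gene names, the toy profiles, and the use of mean Pearson correlation as the score are all assumptions made for illustration.

```python
from math import sqrt

def pearson(a, b):
    """Pearson correlation of two equal-length numeric profiles."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    if va == 0 or vb == 0:
        return 0.0  # a constant profile carries no co-evolution signal
    return cov / sqrt(va * vb)

def rank_candidates(profiles, known_genes, candidates):
    """Rank candidates by mean profile correlation with known disease genes."""
    scores = {}
    for gene in candidates:
        corr = [pearson(profiles[gene], profiles[k]) for k in known_genes]
        scores[gene] = sum(corr) / len(corr)
    return sorted(candidates, key=lambda g: scores[g], reverse=True)

# Hypothetical presence (1) / absence (0) profiles across six genomes.
profiles = {
    "KNOWN1": [1, 1, 0, 0, 1, 0],
    "KNOWN2": [1, 1, 0, 0, 1, 1],
    "CAND_A": [1, 1, 0, 0, 1, 0],  # tracks the known genes closely
    "CAND_B": [0, 0, 1, 1, 0, 1],  # roughly the opposite pattern
    "CAND_C": [1, 0, 1, 0, 1, 0],
}
ranking = rank_candidates(
    profiles, ["KNOWN1", "KNOWN2"], ["CAND_A", "CAND_B", "CAND_C"]
)
```

A clade-wise variant would presumably compute the same kind of score within subsets of genomes; that step is left out here.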

Virtual: Comparative Genomic Analysis of Primary Medulloblastoma and Leptomeningeal Metastasis
COSI: EvolCompGen
  • Ana Isabel Castillo Orozco, McGill University Health Centre Research Institute, Canada
  • Niusha Khazaei, McGill University Health Centre Research Institute, Canada
  • Livia Garzia, McGill University Health Centre Research Institute, Canada


Presentation Overview:

Medulloblastoma (MB) is a highly aggressive tumor and the most common pediatric brain tumor, arising mainly in the cerebellum. MB can metastasize to the leptomeningeal space, which is known as Leptomeningeal Disease (LMD). Although LMD represents a major clinical challenge, it is a vastly understudied field, and its molecular mechanisms are poorly characterized. Accordingly, there is an urgent need to develop strategies to study metastatic medulloblastoma. We hypothesize that an in-depth knowledge of the molecular events driving subclones of the primary tumor to metastasize will offer therapeutic targets for effective therapies to treat or prevent LMD. To test this hypothesis, we have established metastatic Patient-Derived Xenografts (PDXs) that faithfully recapitulate LMD features. We have focused our efforts on performing bulk RNA-seq of PDX models to profile LMD intertumoral heterogeneity and to identify genetic drivers/pathways that sustain this compartment. Using ssGSEA, we have found that PDX models retain neoplastic subpopulations previously identified in MB single-cell sequencing studies, with slight changes between primary and leptomeningeal compartments. Furthermore, we observe profound differences in gene expression between primary and LMD. Our results show that primary and LMD are transcriptionally different, with various differentially expressed genes (DEGs) and signaling pathways enriched in more than one LMD PDX model.

Virtual: Evolution of the non-triplet genetic code in ciliate Euplotes
COSI: EvolCompGen
  • Mikhail Moldovan, Skolkovo Institute of Science and Technology, Russia
  • Sofya Gaydukova, Skolkovo Institute of Science and Technology, Russia
  • Mikhail Gelfand, Skolkovo Institute of Science and Technology, Russia
  • Pavel Baranov, School of Biochemistry and Cell Biology, University College Cork, Ireland
  • Adriana Vallesi, School of Biosciences and Veterinary Medicine, University of Camerino, Italy
  • John Atkins, School of Biochemistry and Cell Biology, University College Cork, Ireland


Presentation Overview:

Although several variants of the standard genetic code are known, its triplet character is universal, with one exception in the ciliate Euplotes, where stop codons at internal mRNA positions specify ribosomal frameshifting. How Euplotes evolved such an unusual genetic code remains a mystery. To shed light on this, we explored the evolution of frameshifting occurrence in Euplotes genes. We sequenced and analyzed several transcriptomes from different Euplotes species to characterize the gain-and-loss dynamics of frameshift sites. Surprisingly, we found a sharp asymmetry between frameshifting gain and loss rates, with the former exceeding the latter about 25-fold. Further analysis of nucleotide substitution rates in protein-coding and non-coding regions revealed that this asymmetry is expected based on single-nucleotide mutation rates and does not require positive selection for frameshifting. We found that the number of frameshifting sites in Euplotes is increasing and is far from the steady state. The equilibrium state is expected in about 0.1 to 1 billion years, leading to an approximately 10-fold increase in the number of frameshift sites in Euplotes genes.

Virtual: Identification of potentially new CAZymes acting on mannan and xylan polysaccharides.
COSI: EvolCompGen
  • Diego Mauricio Riaño-Pachón, Center for Nuclear Energy in Agriculture, University of São Paulo, Brazil
  • Beatiz R. Estevam, Laboratory of Computational, Evolutionary, and Systems Biology - CENA/USP, Brazil
  • Danilo B. Rocha, Laboratory of Computational, Evolutionary, and Systems Biology - CENA/USP, Brazil


Presentation Overview:

Renewable energy from biomass has the potential to help meet growing energy demand. However, an important fraction of that energy is stored in compounds that are not readily available. Carbohydrate-degrading enzymes, such as some CAZymes, can be applied to release that energy from polysaccharides. Mannan and xylan are important hemicellulosic polysaccharides in monocots, and there is interest in finding novel CAZymes that can break them into mono- or oligosaccharides. Thanks to the continuously decreasing cost of genome sequencing, the genomes deposited in public repositories can be mined for potentially novel enzymes. We aimed to identify CAZymes with the potential to break down mannan or xylan by performing phylogenetic analyses and selecting sequences from less-studied, highly diverse clades. To this end, we a) used dbCAN2 to predict approximately 1 million CAZymes related to mannan and 800 thousand related to xylan in bacterial, fungal and archaeal groups from 380,522 genomes deposited at NCBI; b) extracted from the CAZy website those sequences that had information about structure or functional characterization; c) built phylogenies, using IQ-TREE, for the families of interest; and d) identified clades that have been poorly explored in structural and functional studies.

Virtual: Identification of putative Transcription Factor families involved in the regulation and modification of Plant Cell Wall biosynthesis enzymes in sugarcane cultivar SP80-3280
COSI: EvolCompGen
  • Verusca Semmler Rossi, Centro de Energia Nuclear na Agricultura, Brazil
  • Diego Mauricio Riaño-Pachón, Centro de Energia Nuclear na Agricultura, Brazil


Presentation Overview:

Brazil is the world's largest producer of sugarcane, with the state of São Paulo responsible for 54.1% of production in the 2020/21 harvest. Sugarcane cultivars produce about 80% of the world's sugar, and sugarcane is one of the main tropical crops used to produce ethanol and biomass. It has a highly polymorphic monoploid genome of 900 Mbp, with polyploidy and aneuploidy, making it highly complex. Advances in sequencing technologies made it possible to generate two genome drafts of the Brazilian genotype SP80-3280; however, these have not been extensively compared so far. In this work, we identified a non-redundant set of transcripts derived from these genome drafts, and from this database we identified the complete putative set of transcripts encoding Transcription-Associated Proteins (TAPs) and Carbohydrate-Active Enzymes (CAZymes). We identified 99 co-expressed gene clusters involving TAPs and CAZymes. Six clusters show family enrichment (Fisher test, adjusted p-value <= 0.01) for GRAS and SNF2 transcription factors (TFs) and for the AA0, GH14, GH3 and GH19 CAZyme families. In addition to the enriched families, these six clusters had 50 TFs from the families MYB, NAC, WRKY, GRAS, bHLH and bZIP, and 49 CAZymes from the families GH17, GH27, GH28, GH3, GH31, GH79, GH9, GT0, GT1, GT106, GT2, GT31, GT48, GT8 and PL4.
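
The family-enrichment test mentioned above can be sketched as a one-sided Fisher exact test, which for a 2x2 table equals the upper tail of the hypergeometric distribution. This is a minimal sketch under stated assumptions: the function name and the counts are hypothetical, and the multiple-testing adjustment that yields the adjusted p-value is not shown.

```python
from math import comb

def fisher_enrichment_p(k, cluster_size, family_size, total):
    """One-sided Fisher exact p-value: P(X >= k) under the hypergeometric null.

    k            -- family members observed in the cluster
    cluster_size -- genes in the cluster
    family_size  -- genes of that family in the whole background
    total        -- genes in the background
    """
    upper = min(cluster_size, family_size)
    denom = comb(total, cluster_size)
    tail = sum(
        comb(family_size, i) * comb(total - family_size, cluster_size - i)
        for i in range(k, upper + 1)
    )
    return tail / denom

# Hypothetical counts: 8 of 20 cluster genes belong to a family
# that makes up 50 of 2000 background genes.
p = fisher_enrichment_p(k=8, cluster_size=20, family_size=50, total=2000)
```

With these made-up counts the family is expected to contribute about 0.5 genes to the cluster by chance, so observing 8 gives a very small p-value.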

Virtual: Inference and Annotation of the Sugarcane Pan-Transcriptome
COSI: EvolCompGen
  • Felipe Vaz Peres, Center for Nuclear Energy in Agriculture, University of São Paulo, Brazil
  • Diego Mauricio Riaño-Pachón, Center for Nuclear Energy in Agriculture, University of São Paulo, Brazil
  • Jorge Mario Muñoz Perez, Center for Nuclear Energy in Agriculture, University of São Paulo, Brazil


Presentation Overview:

Sugarcane breeding programs are increasingly exploring molecular approaches for the development of new varieties. However, due to the complexity of the sugarcane genome, there is no single version of the genome that represents the multiple allelic copies (i.e., ploidy >10) and achieves chromosome scale, which hinders the identification of genes responsible for phenotypes of interest. An alternative way to identify target genes responsible for the phenotypes of interest is the large-scale sequencing of mRNA. Many of the RNA-Seq datasets generated so far are available in public repositories and can be used to generate new insights.
We downloaded RNA-Seq datasets of 48 sugarcane genotypes to generate high-quality transcriptomes that we used in the inference and annotation of the Sugarcane Pan-Transcriptome. We assembled 16,237,098 transcripts (5,240,794 of these have CDS). Clustering based on sequence similarity classified all transcripts with CDS into 153,841 groups. The total number of transcript groups increased as additional transcriptomes were added and approached a plateau when n >= 24 genotypes were included (143,290 groups and 5,077,629 transcripts). Similarly, the core transcriptome size also reached a plateau, even faster than the pan-transcriptome, when n >= 11 genotypes were included (13,978 groups and 2,853,218 transcripts).
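
The pan- and core-transcriptome curves described above can be illustrated with a small sketch: the pan size is the union of transcript groups seen so far, and the core size is their intersection. The genotype names and group identifiers below are hypothetical, and a real analysis would typically average over many random genotype orderings rather than use a single one.

```python
def accumulation_curve(groups_by_genotype, order):
    """Return (pan_size, core_size) after adding each genotype in the given order."""
    pan, core, curve = set(), None, []
    for name in order:
        groups = groups_by_genotype[name]
        pan |= groups  # pan-transcriptome: groups present in at least one genotype
        core = set(groups) if core is None else core & groups  # present in all so far
        curve.append((len(pan), len(core)))
    return curve

# Hypothetical transcript-group identifiers per genotype.
groups_by_genotype = {
    "G1": {"g1", "g2", "g3"},
    "G2": {"g1", "g2", "g4"},
    "G3": {"g1", "g3", "g4"},
    "G4": {"g1", "g2", "g4"},
}
curve = accumulation_curve(groups_by_genotype, ["G1", "G2", "G3", "G4"])
```

In this toy example both curves flatten after the third genotype, which is the kind of plateau the abstract reports at larger scale.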

Virtual: Phyloformer: fast and accurate phylogeny estimation with self-attention networks
COSI: EvolCompGen
  • Luca Nesterenko, LBBE, UMR 5558, Université Lyon 1, CNRS, France
  • Johanna Trost, LBBE, UMR 5558, Université Lyon 1, CNRS, France
  • Bastien Boussau, LBBE, UMR 5558, Université Lyon 1, CNRS, France
  • Laurent Jacob, LBBE, UMR 5558, Université Lyon 1, CNRS, France


Presentation Overview:

State-of-the-art likelihood methods for phylogenetic reconstruction
have limited applicability due to their high computational
cost. Recently, supervised learning approaches have been proposed,
with the hope of reaching the same accuracy with a faster inference time.
These approaches simulate multiple sequence alignments (MSAs) evolved
along known trees, and use this simulated data to train a deep neural
network that classifies MSAs among possible topologies. These attempts
have been mostly limited to the reconstruction of quartet trees, as
adding more leaves rapidly leads to an intractable number of possible
topologies.

Here we introduce Phyloformer, a radically different approach relying
on self-attention. Given an MSA, Phyloformer estimates all pairwise
evolutionary distances between sequences, which then allows us to
accurately reconstruct the tree topology with a classical
distance-based algorithm. Self-attention provides an expressive
mechanism that models how each pair (resp. site) should share information with the
others within the MSA. It also yields permutation-equivariant
functions and accommodates MSAs of varying sizes.
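
The permutation-equivariance property mentioned above can be checked on a toy example. The sketch below is not the Phyloformer architecture: it is a single-head self-attention layer with identity query, key and value projections (an assumption made to keep it short), applied to hypothetical embeddings. Reordering the input rows reorders the output rows in the same way, and the same function accepts any number of rows.

```python
from math import exp

def self_attention(rows):
    """Single-head self-attention with identity projections (no learned weights)."""
    dim = len(rows[0])
    scale = dim ** 0.5
    out = []
    for q in rows:
        logits = [sum(a * b for a, b in zip(q, k)) / scale for k in rows]
        top = max(logits)  # subtract the max for numerical stability
        weights = [exp(l - top) for l in logits]
        total = sum(weights)
        out.append([
            sum(w * v[d] for w, v in zip(weights, rows)) / total
            for d in range(dim)
        ])
    return out

# Hypothetical embeddings, one row per sequence (or per sequence pair).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -1.0]]
perm = [2, 0, 3, 1]

attended = self_attention(x)
out_then_perm = [attended[i] for i in perm]          # attend, then reorder
perm_then_out = self_attention([x[i] for i in perm])  # reorder, then attend
```

The two results agree up to floating-point rounding, which is what permutation equivariance means for this layer.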

We show on simulations under different evolution models that
Phyloformer outperforms both previous supervised learning models and
distance methods, and reaches accuracies comparable to maximum
likelihood methods in a fraction of the time.

Virtual: Supervised machine learning reveals high efficacy of mobile elements to predict Salmonella outbreak linkage
COSI: EvolCompGen
  • Chao Chun Liu, Simon Fraser University, Canada
  • William Hsiao, Simon Fraser University, Canada


Presentation Overview:

Existing methods to infer clonal strains in surveillance settings primarily involve the comparison of genetic variation in a stable segment of pathogen genomes known as the core genome. As a consequence, the non-conserved genetic elements are often neglected and contribute no information to the estimation of genetic distance. To demonstrate the analytical value of non-conserved elements for linking cases that share a common cause of infection, our study characterized a comprehensive set of genetic features predictive of outbreak linkage by training multivariate regularized regression models on 24 historical foodborne outbreaks caused by Salmonella enterica. In total, 5,037 genetic features of high predictive value were identified, consisting of indels, nucleotide substitutions, and carriage of extrachromosomal elements. The outbreak-predictive features included a wide range of non-conserved genetic elements that were found to be unique to specific outbreaks, such as plasmids, CRISPR arrays and phage genomes. We reasoned that these non-conserved elements have high predictive value due to the strong environmental influence on the transmission dynamics and evolution of these features in bacterial populations. Sequence comparison of the predictive elements identified in our study can complement current analytical practices for cluster detection and outbreak tracing to improve the concordance between genomic inferences and epidemiology.

Virtual: TranscriptDB: A transcript-centric database to study transcript conservation and evolution within gene trees
COSI: EvolCompGen
  • Wend Yam Donald Davy Ouédraogo, Université de Sherbrooke, Canada
  • Abigail Djossou, Université de Sherbrooke, Canada
  • Aida Ouangraoua, Université de Sherbrooke, Canada


Presentation Overview:

The increasing amount of available genomic sequences calls for effective tools for annotating biological sequences. Inferring the function of a gene from its orthologs has been of great use in comparative genomics. The orthology conjecture holds that orthologous genes diverged little during evolution and share similar functions. Interest in orthology between genes has led to the design of several gene-centred databases. They differ mainly in their method of computing orthology relations between genes, in the number of genomes incorporated and in the tools provided by their web interfaces. Alternative splicing, which contributes widely to the diversity of transcriptomes and proteomes in eukaryotes, makes the transcript a more refined level for functional homology relationships, thus calling for orthology inference methods and databases at the level of transcripts. In this work, we present a transcript-centric database and a new method based on splicing structure to compute clusters of conserved transcripts for the reconstruction of transcript and gene phylogenies.

U-001: Challenges and Inconsistencies in Type II CRISPR-Associated Nuclease Subtype Classification
COSI: EvolCompGen
  • Ariel Gispan, Emendo Biotherapeutics Ltd, Israel
  • Nurit Meron, Emendo Biotherapeutics Ltd, Israel
  • Ophir Adiv Tal, Emendo Biotherapeutics LTD., Israel
  • Anat London Drori, Emendo Biotherapeutics Ltd, Israel
  • Rachel Diamant, Emendo Biotherapeutics Ltd, Israel
  • Lior Izhar, Emendo Biotherapeutics Ltd, Israel
  • Idit Buch, Emendo Biotherapeutics Ltd, Israel


Presentation Overview:

CRISPR-associated nucleases were found and classified as proteins of the bacterial immune defense system that combat foreign DNA. The discovery of SpCas9 and its repurposing as a genome editing tool led to the discovery of additional nucleases with diverse properties that are currently utilized in various applications. Initial categorization of CRISPR-related nucleases was based on a narrow pool of nucleases and could not foresee the heterogeneity of nucleases known today. Current subtype classification is based on several methodologies; however, no differential weight was assigned to each method, and occasionally the discrepancy among them resulted in subjective and/or arbitrary classifications. Here we utilized three classification methods for type II nucleases: loci architecture, cas9 phylogeny and cas1 phylogeny. For each classification method, the distribution of nucleases by subtype was studied and the agreement between the methods was measured. We found that about 30% of the nucleases analyzed were inconsistently classified. In some cases, the nucleases did not fit any of the established subtype classifications. Overall, our findings question the present paradigm of assigning nucleases to allegedly homogeneous groups with shared properties and functions. Accordingly, we suggest that newly discovered and/or engineered nucleases should be carefully characterized before being assigned to existing classifications.
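
One simple way to measure agreement between classification methods, as described above, is to count the nucleases whose subtype label is not identical across all methods. The sketch below assumes that definition of "inconsistent"; the nuclease identifiers and labels are hypothetical, and the abstract does not state which agreement measure was actually used.

```python
def inconsistent_fraction(labels_by_method):
    """Return (fraction, ids) of nucleases whose subtype differs between methods."""
    # Only compare nucleases that every method has classified.
    shared = set.intersection(*(set(labels) for labels in labels_by_method.values()))
    inconsistent = [
        n for n in sorted(shared)
        if len({labels[n] for labels in labels_by_method.values()}) > 1
    ]
    return len(inconsistent) / len(shared), inconsistent

# Hypothetical subtype calls from the three methods named in the abstract.
labels_by_method = {
    "loci_architecture": {"n1": "II-A", "n2": "II-B", "n3": "II-C", "n4": "II-C", "n5": "II-A"},
    "cas9_phylogeny":    {"n1": "II-A", "n2": "II-B", "n3": "II-A", "n4": "II-C", "n5": "II-A"},
    "cas1_phylogeny":    {"n1": "II-A", "n2": "II-C", "n3": "II-C", "n4": "II-C", "n5": "II-A"},
}
fraction, conflicting = inconsistent_fraction(labels_by_method)
```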

U-002: Inference of phylogenetic networks from sequence data using composite likelihood
COSI: EvolCompGen
  • Sungsik Kong, The Ohio State University, United States
  • David Swofford, University of Florida, United States
  • Laura Kubatko, The Ohio State University, United States


Presentation Overview:

While phylogenies have been essential in understanding how species evolve, they do not adequately describe some evolutionary processes. For example, hybridization, a phenomenon where interbreeding between two species leads to the formation of a new species, must be depicted by a phylogenetic network, a structure that modifies a phylogeny by allowing two branches to merge into one. However, existing methods for estimating networks become computationally expensive as the dataset size and/or topological complexity increase. While summarizing sequence data into gene trees generally increases the speed of the method, it is still computationally intensive, is prone to gene tree estimation error, and results in the loss of a large amount of relevant phylogenetic signal. Here, we propose a novel method to estimate phylogenetic networks directly from sequence data. Our method achieves computational efficiency by implementing composite likelihood, and accuracy by using the full genomic data to incorporate all sources of variability. To efficiently search the network space, we implement hill-climbing and simulated annealing algorithms. Our method will be useful in practice, particularly when large data are available, as no gene tree estimation is required. This method is implemented in the Julia programming language and is evaluated on both simulated and empirical datasets.
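
The search strategy named above can be illustrated generically. The sketch below is a plain simulated annealing loop with a Metropolis acceptance rule and geometric cooling, maximizing a toy score over integers; it is not the authors' Julia implementation, and the score function, proposal move and parameter values are hypothetical stand-ins for a composite likelihood and network rearrangements. Hill-climbing corresponds to the limiting case in which worse moves are never accepted.

```python
import math
import random

def simulated_annealing(start, score, propose, steps=2000, t0=5.0, cooling=0.995, seed=0):
    """Maximize `score` with a Metropolis acceptance rule and geometric cooling."""
    rng = random.Random(seed)
    state, current = start, score(start)
    best_state, best_score = state, current
    temperature = t0
    for _ in range(steps):
        candidate = propose(state, rng)
        cand_score = score(candidate)
        delta = cand_score - current
        # Always accept improvements; accept worse moves with probability exp(delta / T).
        if delta >= 0 or rng.random() < math.exp(delta / temperature):
            state, current = candidate, cand_score
            if current > best_score:
                best_state, best_score = state, current
        temperature *= cooling
    return best_state, best_score

# Toy stand-in for a composite log-likelihood over a search space (here, integers).
def toy_score(x):
    return -(x - 7) ** 2

# Toy stand-in for a local rearrangement move.
def toy_propose(x, rng):
    return x + rng.choice((-1, 1))

best_state, best_score = simulated_annealing(0, toy_score, toy_propose)
```

Accepting some worse moves early on is what lets the search escape local optima, which a pure hill-climb cannot do.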

U-003: Comparative Phylogenomic Analysis of the Liberibacter Pathogens Associated with Huanglongbing and Zebra Chip
COSI: EvolCompGen
  • Yongjun Tan, Saint Louis University, United States
  • Cindy Wang, Saint Louis University, United States
  • Theresa Schneider, Saint Louis University, United States
  • Huan Li, Saint Louis University, United States
  • Kylie Swisher Grimm, United States Department of Agriculture-Agricultural Research Service, United States
  • Dapeng Zhang, Saint Louis University, United States


Presentation Overview:

Liberibacter pathogens are the causative agents of several severe crop diseases worldwide, including citrus Huanglongbing and potato zebra chip. These bacteria are endophytic and nonculturable, which makes experimental approaches challenging and highlights the need for bioinformatic analysis in advancing our understanding of Liberibacter pathogenesis. Here, we performed an in-depth comparative phylogenomic analysis of the Liberibacter pathogens and their free-living, nonpathogenic, ancestral species, aiming to identify major genomic changes and determinants associated with their evolutionary transitions in living habitats and pathogenicity. By using ortholog clustering, we identified two sets of genes, which were either lost or gained in the ancestor of the pathogens. Importantly, among the gained genes, we uncovered several previously unrecognized toxins, including new toxins homologous to the EspG/VirA effectors, a YdjM phospholipase toxin, and a secreted endonuclease/exonuclease/phosphatase (EEP) protein. In addition, we conducted analyses to understand genome differences among multiple strains of zebra chip-associated Liberibacter pathogens and identified several genes involved in host-pathogen interactions. Our results substantially extend the knowledge of the evolutionary events and potential determinants leading to the emergence of endophytic, pathogenic Liberibacter species, which will facilitate the design of functional experiments and the development of new methods for detection and blockage of these pathogens.

U-004: scPairtree: a fast, scalable method for inference of a cancer's evolutionary history
COSI: EvolCompGen
  • Jarrett Barber, University of Toronto, Canada
  • Philip Awadalla, Ontario Institute of Cancer Research, Canada
  • Quaid Morris, Memorial Sloan Kettering Cancer Center, United States


Presentation Overview:

Most cancers are made up of genetically distinct subpopulations called subclones. Greater levels of this intratumoral heterogeneity (ITH) correlate with increased aggressiveness and resistance, so a better understanding of ITH is essential. To this end, many algorithms have been created to determine the evolutionary history of tumors through the construction of mutation trees. Single cell sequencing-based methods promise to allow for greater reconstruction resolution than was previously possible. However, existing single cell-based methods tend to suffer from several issues: they report a single tree when a distribution of trees is more appropriate; they consider only a single mutation type; and they do not scale well with the number of mutations and cells.

Here I present scPairtree, a fast, scalable method to infer mutation trees using single-cell sequencing data. Briefly, the method works by first constructing a pairs tensor, an object which stores the probability of each mutation pair being in a particular ancestral relationship. From here, an importance sampling technique is employed to sample trees from the pairs tensor. Because of this novel approach, scPairtree is faster and more scalable to large datasets than existing methods.

U-005: Genomic Diversity and Associated Phenotyping of Escherichia coli Isolated from Poultry in the Southern United States
COSI: EvolCompGen
  • Aijing Feng, University of Missouri, United States
  • Spencer Leigh, Poultry Research Unit, USDA Agricultural Research Service, Mississippi State, United States
  • Hui Wang, Mississippi State University, United States
  • Todd Pharr, Mississippi State University, United States
  • Jeff Evans, Poultry Research Unit, USDA Agricultural Research Service, Mississippi State, United States
  • Martha Pulido Landinez, Mississippi State University, United States
  • Lanny Pace, Mississippi State University, United States
  • Xiu-Feng Wan, University of Missouri, United States


Presentation Overview:

In domestic poultry, E. coli are typically present as commensal bacteria in the gastrointestinal tract, but some avian pathogenic E. coli (APEC) strains can cause localized and systemic infections. Here we sequenced 188 E. coli isolates from sick poultry samples collected between May 2017 and July 2021. Phylogenetic analyses suggested that a large extent of genetic variation was present among these isolates at the whole-genome level, whereas their 16S rRNA genes clustered into two major groups. These isolates belong to 32 H and 61 O serotypes, and APEC-associated pathogenicity islands were found in all isolates and can be isolate-dependent. Using clinical data, we grouped these samples by type of disease and infection location in the birds. A multi-task LASSO model was used to learn genetic features associated with types of diseases and infection localization. Results showed that the 32 genes we identified had biological functions of binding, transporter activity, ion transport, transcription regulator activity, ATP-dependent activity and catalytic activity, and that four of them were independent of disease type and infection localization. The knowledge derived from this study could be useful for designing a broadly protective E. coli vaccine candidate for domestic poultry.

U-006: CACTUS: integrating clonal architecture with genomic clustering and transcriptome profiling of single tumor cells
COSI: EvolCompGen
  • Shadi Darvish Shafighi, University of Warsaw, Poland
  • Szymon Kielbasa, Leiden University Medical Center, Netherlands
  • Julieta Sepúlveda-Yáñez, Leiden University Medical Center, Netherlands
  • Ramin Monajemi, Leiden University Medical Center, Netherlands
  • Davy Cats, Leiden, Netherlands
  • Leon Mei, LUMC, Netherlands
  • Roberta Menafra, LUMC, Netherlands
  • Susan Kloet, Leiden University Medical Center, Netherlands
  • Hendrik Veelken, Leiden University Medical Center, Netherlands
  • Cornelis A.M. Van Bergen, Leiden University Medical Center, Netherlands
  • Ewa Szczurek, University of Warsaw, Poland


Drawing genotype-to-phenotype maps in tumors is of paramount importance for understanding tumor heterogeneity. Assignment of single cells to their tumor clones of origin can be approached by matching the genotypes of the clones to the mutations found in RNA sequencing of the cells. The confidence of the cell-to-clone mapping can be increased by accounting for additional measurements. Follicular lymphoma, a malignancy of mature B cells that continuously acquire mutations in parallel in the exome and in B cell receptor loci, presents a unique opportunity to join exome-derived mutations with B cell receptor sequences as independent sources of evidence for clonal evolution.

Here, we propose CACTUS, a probabilistic model that leverages the information from an independent genomic clustering of cells and exploits the scarce single-cell RNA sequencing data to map single cells to given imperfect genotypes of tumor clones.

We apply CACTUS to two follicular lymphoma patient samples, integrating three measurements: whole exome, single-cell RNA, and B cell receptor sequencing. CACTUS outperforms a predecessor model by confidently assigning cells and B cell receptor-based clusters to the tumor clones. The integration of independent measurements is the key to improving model performance in the challenging task of charting the genotype-to-phenotype maps in tumors.

U-007: Systematic evaluation of TF diversity in Heliconius erato and Heliconius melpomene
COSI: EvolCompGen
  • Diego A. Rosado-Tristani, University of Puerto Rico, Rio Piedras, Puerto Rico
  • José A. Rodríguez-Martínez, University of Puerto Rico, Rio Piedras, Puerto Rico


Transcription is a key biological process regulated in part by a set of DNA-binding proteins called transcription factors (TFs), which aid in the recruitment of the transcriptional machinery required to modulate gene expression. By turning genes on or off, they play an important role in how observable traits are derived from the genome. A first step in understanding transcription is to identify the TFs present in a species. In this work, we apply the Cis-regulatory Element-binding Protein Elucidator (CREPE) pipeline to identify and catalog the TFs of the butterflies Heliconius erato and Heliconius melpomene. These butterflies have independently evolved mimicry in their wing patterns, making them popular among scientists studying phenotypic determination. CREPE works by scanning a proteome against a database of TF family profiles. Matches are selected through orthology inferences using tree-based methods, and each putative TF is assigned the name of its closest neighbor. The purpose of the orthology inferences is to harmonize annotations, which may not be available for non-model organisms. We identified the putative TFs for Heliconius erato and Heliconius melpomene, and found that they possess similar TF family distributions. The three largest families identified were as expected in Metazoa. CREPE streamlines the task of TF identification in non-model organisms.

U-008: Read2Tree: scalable and accurate phylogenetic trees from raw reads
COSI: EvolCompGen
  • David Dylus, University of Lausanne, Switzerland
  • Adrian Altenhoff, ETH, Switzerland
  • Sina Majidian, University of Lausanne, Switzerland
  • Fritz Sedlazeck, Baylor College of Medicine, United States
  • Christophe Dessimoz, University of Lausanne, Switzerland


The inference of phylogenetic trees from raw sequencing reads is foundational to biology. However, state-of-the-art phylogenomics requires running complex pipelines, at significant computational and labour costs, with additional constraints in sequencing coverage, assembly and annotation quality. To overcome these challenges, we present Read2Tree, which directly processes raw sequencing reads into groups of corresponding genes. In a benchmark encompassing a broad variety of datasets, our assembly-free approach was 10-100x faster than conventional approaches, and in most cases more accurate, the exception being when sequencing coverage was high and reference species very distant. To illustrate the broad applicability of the tool, we reconstructed a yeast tree of life of 435 species spanning 590 million years of evolution. Applied to Coronaviridae samples, Read2Tree accurately classified highly diverse animal samples and near-identical SARS-CoV-2 sequences on a single tree, thereby exhibiting remarkable breadth and depth. The speed, accuracy, and versatility of Read2Tree enable comparative genomics at scale.

U-009: Reconstructing Cancer Phylogenies with Pairtree
COSI: EvolCompGen
  • Jeff Wintersinger, Deep Genomics, United States
  • Quaid Morris, Sloan Kettering Institute, United States
  • Ethan Kulman, Sloan Kettering Institute, United States


Cancer is a heterogeneous disease that typically originates from genetic mutations which enable cancer cells to evade cellular controls. As the disease progresses, cancers acquire further mutations which lead to genetically distinct subpopulations or subclones. By sequencing the bulk DNA in one or more samples from a single cancer, we can resolve the subclones in a cancer and construct a tree that describes the evolutionary relationship between subclones. Clone trees have clinical significance as they provide insight into how cancer progresses and responds to treatment.

Pairtree is a software package that uses bulk DNA sequencing data to construct and visualize candidate clone trees. Prior to building clone trees, Pairtree can be used to identify and remove mutations with complex inheritance patterns, and to cluster mutations into subclones. Pairtree uses Bayesian inference to compute the posterior probability of pairwise relationships between subclones, and uses these probabilities to infer clone trees via a stochastic tree search. Pairtree differs from existing clone tree reconstruction tools in that it was designed to construct clone trees using up to 100 cancer samples and/or 30 subclones. The Pairtree manuscript is now available in the AACR journal Blood Cancer Discovery.

U-010: Characterizing effector-metaeffector pairs in Legionella pneumophila
COSI: EvolCompGen
  • Ethan Wolfe, Michigan State University, United States
  • Joseph Burke, Michigan State University, United States
  • Stephanie Shames, Kansas State University, United States
  • Janani Ravi, Michigan State University, United States


Bacterial effector proteins are virulence factors critical for parasitism in eukaryotic hosts. Metaeffectors — effectors that regulate the activity of cognate effectors — were recently discovered exclusively in Legionella pneumophila (Lp). Lp, which has co-evolved extensively with its natural host amoebae, is the causative agent of Legionnaires’ Disease. This project will focus on seven effector-metaeffector (EM) pairs involved in virulence. Since little is known about metaeffectors, we first characterize these EM pairs using their sequence-structural features, such as domain architectures, and delineate their evolution using MolEvolvR (https://doi.org/10.1101/2022.02.18.461833; http://jravilab.org/molevolvr). We also quantify the coevolution of all EM pairs to discover lone effectors (occurring without cognate metaeffectors) that could potentially lead to cytotoxicity. We determine the co-occurrence (and co-evolution) of EM pairs across ~130 Legionella genomes, including ~100 from Lp. The domain/motif building blocks constituting these EM pairs will populate the first comprehensive EM feature repository that will enable the discovery of novel EM pairs in Lp and other understudied, emerging pathogens.

U-011: The properties of various amino acid distances for phylogenetic estimation
COSI: EvolCompGen
  • Edward Braun, University of Florida, United States


There is growing recognition that distance methods have desirable properties for species tree estimation when they are used to analyze data that reflect a mixture of gene trees that differ due to the multispecies coalescent. They also have the potential to be applied to data in a way that bypasses the multiple sequence alignment step, which can be a source of error. However, it is difficult to estimate distances using stochastic models that are as realistic as the models used in most maximum-likelihood analyses. I examine several distances that: 1) incorporate protein structure; 2) incorporate heterogeneity among sites in amino acid frequencies using Dirichlet mixtures to partition sites; and 3) recode amino acids based on their physicochemical properties. Those methods rely on sequence alignments, so I also examine a fundamentally different method based on conditional Kolmogorov complexity, which can be approximated using data compression. The simplest version of the compression distance is highly biased, but a simple modification motivated by the commonly used gamma distance can yield much more realistic estimates of evolutionary distances. The performance of neighbor-joining using these distances will be compared to standard maximum-likelihood analyses of concatenated sequences and two-step multispecies coalescent analyses.
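
As an illustration of how a compression-based distance can be computed without an alignment, the following is a minimal sketch of the widely used normalized compression distance (NCD), with zlib standing in for the compressor. It is an assumption that this resembles the "simplest version" mentioned in the abstract; the gamma-motivated correction is not specified there and is not reproduced. The toy sequences and all function names are hypothetical.

```python
import random
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (a crude complexity proxy)."""
    return len(zlib.compress(data, 9))


def ncd(x: str, y: str) -> float:
    """Normalized compression distance between two unaligned sequences."""
    bx, by = x.encode("ascii"), y.encode("ascii")
    cx, cy = compressed_size(bx), compressed_size(by)
    cxy = compressed_size(bx + by)  # compress the concatenation
    return (cxy - min(cx, cy)) / max(cx, cy)


# Hypothetical toy data: a random protein, a copy with ~5% of sites redrawn,
# and an independent random protein of the same length.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
rng = random.Random(0)
ancestor = "".join(rng.choice(AMINO_ACIDS) for _ in range(2000))
relative = "".join(
    rng.choice(AMINO_ACIDS) if rng.random() < 0.05 else aa for aa in ancestor
)
unrelated = "".join(rng.choice(AMINO_ACIDS) for _ in range(2000))

print("related pair:  ", round(ncd(ancestor, relative), 3))
print("unrelated pair:", round(ncd(ancestor, unrelated), 3))
```

The related pair should score lower than the unrelated pair. The raw values are not evolutionary distances in substitutions per site, which is consistent with the abstract's remark that the uncorrected distance is biased.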

U-012: Domain Promiscuity Correlates with Rates of Domain Gain and Loss
COSI: EvolCompGen
  • Yuting Xiao, Carnegie Mellon University, United States
  • Maureen Stozler, Carnegie Mellon University, United States
  • Dannie Durand, Carnegie Mellon University, United States


Domain promiscuity is the propensity of a domain to form different combinations with other domains in the same protein. Domain promiscuity varies greatly among domains. One hypothesis is that domain mobility drives domain promiscuity: domains that are easily copied and inserted in new contexts tend to co-occur with many different domains. However, because domain mobility cannot be observed directly, the mobility hypothesis is difficult to test.

Here, we probe the relationship between promiscuity and mobility using estimated rates of domain gain, loss, and duplication as a proxy for domain mobility. Since many measures of domain promiscuity have been proposed, we first asked whether these measures capture different properties. Among 11 proposed measures of domain promiscuity applied to 1283 domain families in 21 selected species, we identified three groups, where the measures are highly correlated within each group, but uncorrelated across groups. Choosing one measure from each group is sufficient for representing all promiscuity measures. We next probed the relationship between domain promiscuity and domain rates. Domain event rates were inferred using the probabilistic birth-death-gain model in COUNT. Regression analysis of the promiscuity measures and the inferred evolutionary rates revealed highly significant correlations, suggesting mobility may indeed contribute to domain promiscuity.

U-013: Multi-modal Transformer based deep neural network for determining false positive structural variation calls
COSI: EvolCompGen
  • Taeyoung Kim, Pusan National University, South Korea
  • Giltae Song, Pusan National University, South Korea


Structural variations (SVs) alter large genomic segments and are generally considered to be associated with genetic diversity and complex diseases. Although many SV calling tools have been developed as sequencing technology has advanced, they suffer from high false discovery rates (FDR) for complex sequence patterns, so it is important to filter out false positive SVs. Some filtering tools based on random forests and convolutional neural networks exist, but some false positive SVs remain unresolved, with FDR still over 30%.
In this study, we propose a multi-modal Transformer-based deep neural network for filtering out false positive SVs obtained by three major SV callers. The SV call data are converted to images, signals, and tabular information in pre-processing steps, and these are fed into the multi-modal Transformer. After building the filtering model, we evaluate it on reserved test data collected from the 1000 Genomes Project phase 3, reanalyzed with GRCh38, and use the F1-score to quantify how well false positive SVs are filtered out. We believe that our model can be a useful tool for validating SVs and can contribute to studies of the association between SVs and complex diseases.

U-014: PiMaker: Vectorized diversity statistics to measure evolutionary pressure at scale
COSI: EvolCompGen
  • Joseph Lalli, University of Wisconsin-Madison, United States
  • Donna Werling, Department of Genetics, University of Wisconsin-Madison, United States


Advances in machine learning and evolutionary modelling have opened doors for researchers looking to understand the patterns of natural selection. However, interpretations of these new methods are often complicated by insufficient understanding of these models' limitations and assumptions. In contrast, classic population diversity metrics (such as piN/piS and FST) benefit from a rich theoretical understanding of their strengths and limitations.
PiMaker is a tool designed to calculate common diversity metrics at scale. Given one or more reference FASTA sequences, a VCF of SNPs, and a GTF of annotations, it can calculate overall, synonymous, and nonsynonymous pi and FST per genetic site. Utilizing vectorized calculations and modern distributed computing methods, PiMaker is able to analyze SNV diversity within mammalian genetic populations. PiMaker calculated within-host influenza diversity statistics for 328 influenza genomes ~3000 times faster than SNPGenie and Popoolation2, with identical results. PiMaker was able to replicate population diversity analysis in Mycobacterial populations, SARS-CoV-2 populations, and Drosophila populations, and calculated the same statistics in human tumor samples. With only bulk sequencing data, this tool can measure the sites of microbial or tumor populations that are under purifying or diversifying selection at scale.
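
To make the per-site statistic concrete, here is a minimal sketch of nucleotide diversity (pi) computed from allele counts at each site, using the standard pairwise-difference definition. It is not PiMaker's interface or implementation: the function names and the input layout (one list of allele counts per site) are assumptions for illustration, and the synonymous/nonsynonymous split and FST are not covered.

```python
from typing import Sequence


def site_pi(allele_counts: Sequence[int]) -> float:
    """Probability that two reads drawn without replacement at a site differ."""
    n = sum(allele_counts)
    if n < 2:
        return 0.0  # pi is undefined with fewer than two observations
    same_pairs = sum(c * (c - 1) for c in allele_counts)
    return 1.0 - same_pairs / (n * (n - 1))


def mean_pi(sites: Sequence[Sequence[int]]) -> float:
    """Average per-site pi over a region (hypothetical helper)."""
    if not sites:
        return 0.0
    return sum(site_pi(counts) for counts in sites) / len(sites)


# Hypothetical counts in A, C, G, T order for two sites.
region = [[2, 2, 0, 0], [4, 0, 0, 0]]
print([round(site_pi(c), 4) for c in region], round(mean_pi(region), 4))
```

Because each site depends only on its own counts, the same arithmetic maps directly onto array operations over all sites at once, which is the kind of vectorization the abstract refers to.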

U-015: nRCFV: a sequence, taxon and character state-normalised metric for the pre-reconstruction evaluation of compositional heterogeneity
COSI: EvolCompGen
  • James Fleming, University of Oslo Natural History Museum, Norway
  • Torsten Struck, University of Oslo Natural History Museum, Norway


Compositional heterogeneity, when the proportions of nucleotides and amino acids are not broadly similar across the dataset, is a cause of a great number of phylogenetic artefacts. Whilst a variety of methods can identify it post-hoc, few metrics exist to quantify compositional heterogeneity prior to the computationally intensive task of phylogenetic tree reconstruction. Moreover, the influence of the number of positions and taxa on such metrics has never been properly assessed. Here we investigate the effect of taxa and positions on one such existing metric, RCFV, and find that for the number of positions, the impact decreases with increasing numbers, whilst for the number of positions, the variability of RCFV, rather than its absolute value, increases. Hence, shorter alignments in both length and breadth are more heavily affected by these biases affecting RCFV. We propose a new metric, nRCFV, which utilises new normalising formulae to overcome these biases.
The nRCFV metric has been included in a new version of BaCoCa and is available at: https://github.com/JFFleming/BaCoCa.
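
For orientation, the following is a minimal sketch of the unnormalised RCFV as it is commonly defined: the sum, over taxa and character states, of the absolute deviation of each taxon's character frequency from the mean frequency, divided by the number of taxa. Taking the mean as the average of per-taxon frequencies is an assumption here, as are the function name and input layout. The nRCFV normalisation is not given in the abstract and is not reproduced.

```python
from collections import Counter
from typing import Dict, List


def rcfv(alignment: Dict[str, str], alphabet: str = "ACGT") -> float:
    """Unnormalised RCFV of an alignment given as {taxon name: sequence}."""
    freqs: List[Dict[str, float]] = []
    for name, seq in alignment.items():
        # Count only characters in the alphabet; gaps and ambiguity codes are skipped.
        counts = Counter(ch for ch in seq.upper() if ch in alphabet)
        total = sum(counts.values())
        if total == 0:
            raise ValueError(f"taxon {name!r} has no characters from the alphabet")
        freqs.append({ch: counts[ch] / total for ch in alphabet})
    n = len(freqs)
    if n == 0:
        raise ValueError("alignment is empty")
    # Mean frequency of each character state across taxa.
    mean = {ch: sum(f[ch] for f in freqs) / n for ch in alphabet}
    return sum(abs(f[ch] - mean[ch]) for f in freqs for ch in alphabet) / n


# Hypothetical toy alignments: identical composition vs. maximally skewed.
print(rcfv({"a": "ACGT", "b": "TGCA"}))
print(rcfv({"a": "AAAA", "b": "CCCC"}))
```

In this form the value depends on the number of taxa, the alignment length, and the size of the alphabet, which is the dependence the abstract sets out to normalise away.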

U-016: Annotating Microbial Functions Using ProkFunFind
COSI: EvolCompGen
  • Keith Dufault-Thompson, National Institutes of Health, United States
  • Xiaofang Jiang, NLM/NIH, United States


The rapidly expanding catalog of microbial genomes and metagenomes has provided a wealth of information about what microbes are present in different environments and what functions are encoded in their genomes. Further analyses of these data can provide insights into the physiology of the organisms, their ecological significance, and potential clinical relevance. To facilitate the analysis of microbial genomes, we have developed a flexible and extensible annotation and search tool, ProkFunFind, that can be used to search for genes and gene clusters within collections of microbial genomes. ProkFunFind was designed to be flexible, incorporating multiple annotation tools, including eggNOG-mapper, KofamScan, and InterProScan, allowing users to perform searches based on sequence similarity, HMM profiles, or using established orthology definitions like NCBI’s COGs. Furthermore, we have designed our tool to be extensible, allowing for the future integration of additional annotation and search approaches. ProkFunFind has been successfully applied in multiple projects from our research group involving the characterization of metabolic pathways in the human gut microbiome, providing insights into their distribution and relevance to human health. Our goal is to further refine and develop ProkFunFind, providing the microbial research community with an easy-to-use and flexible platform for the annotation of new functions.

U-017: A Fast and Accurate Method for Sampling Phylogenetic Trees using the Neighbor Joining Algorithm
COSI: EvolCompGen
  • Hazal Koptagel, KTH Royal Institute of Technology, Sweden
  • Oskar Kviman, KTH Royal Institute of Technology, Sweden
  • Harald Melin, KTH Royal Institute of Technology, Sweden
  • Jens Lagergren, KTH Royal Institute of Technology, Sweden


In phylogenetic tree inference, a major challenge concerns the efficient exploration of the tree topology space. Many recent MCMC and variational inference (VI) based methods provide sophisticated techniques for approaching this problem, and advancing these methods is indeed an active area of research. However, given the complexity of the tree topology space, obtaining good initial tree topologies is often crucial. In this work, we combine the newly proposed posterior distribution over evolutionary distances for the Jukes-Cantor (JC69) substitution model with the neighbor-joining (NJ) algorithm to obtain a novel method for sampling phylogenetic trees. Our method demonstrates good results compared to a large set of popular baselines while having the lowest time complexity. For instance, in real data experiments, our method achieves the smallest total variation distance with respect to a long-run MCMC ground truth tree in comparison to existing state-of-the-art NJ methods, UFBoot, and a CSMC method. Also, our method provides a larger tree space support than the MCMC-based MrBayes algorithm.
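
As background for the distances that neighbor-joining consumes, here is a minimal sketch of the classical JC69 point estimate, d = -3/4 ln(1 - 4p/3), where p is the proportion of differing sites. This is the textbook estimator, not the posterior distribution over distances described in the abstract; sampling from that posterior and the NJ step itself are not shown, and the function names are hypothetical.

```python
import math


def p_distance(a: str, b: str) -> float:
    """Proportion of differing sites among positions where both are A/C/G/T."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [
        (x, y) for x, y in zip(a.upper(), b.upper()) if x in "ACGT" and y in "ACGT"
    ]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(x != y for x, y in pairs) / len(pairs)


def jc69_distance(a: str, b: str) -> float:
    """JC69-corrected distance in expected substitutions per site."""
    p = p_distance(a, b)
    if p >= 0.75:
        return math.inf  # saturation: the correction is undefined
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


# Hypothetical toy sequences differing at one of eight sites.
print(round(jc69_distance("ACGTACGT", "ACGTACGA"), 4))
```

A sampling method in the spirit of the abstract would replace this single value per pair with draws from a distribution over distances and run NJ on each sampled distance matrix.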

U-018: Evolink: a Phylogenetic Approach for Rapid Identification of Phenotype-Genotype Associations in Large-scale Microbial Data
COSI: EvolCompGen
  • Xiaofang Jiang, National Library of Medicine, National Institutes of Health, United States
  • Yiyan Yang, National Library of Medicine, National Institutes of Health, United States


The discovery of genetic variants underlying phenotypes is a fundamental task in microbial genomics. Phylogenetic approaches are frequently used to adjust for population structure when finding microbial genotype-phenotype associations, yet scaling them to trees with thousands of leaves representing heterogeneous populations is highly challenging. This greatly hinders the mining of prevalent genetic features that contribute to phenotypes observed in a wide diversity of species. In this study, Evolink was developed to rapidly identify genes positively or negatively associated with phenotypes on large-scale datasets. Evolink calculates the phylogenetic diversity of species with and without a given gene in the phenotype-positive and phenotype-negative trees to measure the extent to which the gene is associated with a given phenotype. Using Evolink, we analyzed a dataset of 3,034 species comprising 88,315 gene families to detect genes associated with flagella. Compared with phylogeny-naive methods such as Fisher’s exact test, Evolink provided results with higher accuracy. The analysis was completed in ~7 minutes with 2 CPUs and ~30 GB of memory, significantly faster than the other phylogeny-based tools we compared.

U-019: Genomic diversity, invasion history and local adaptation of Drosophila suzukii
COSI: EvolCompGen
  • Siyuan Feng, University of Wisconsin Madison, Laboratory of Genetics, United States
  • Samuel DeGrey, University of Wisconsin Madison, Department of Entomology, United States
  • Sean Schoville, University of Wisconsin Madison, Department of Entomology, United States
  • John Pool, University of Wisconsin Madison, Laboratory of Genetics, United States


Biological invasions are of great research interest as they often carry significant economic and ecological costs, but also constitute natural experiments that allow investigations of evolutionary processes on contemporary timescales. The fruit pest Drosophila suzukii, which has rapidly invaded the globe within the past few decades, stands out as an excellent model for studying invasion genomics and local adaptation owing to its occupation of distinct environments. However, despite the recent availability of genomic resources in D. suzukii, inference of the invasion route has only been made based on limited microsatellite markers. Here, we investigated genomic diversity, invasion history and local adaptation of D. suzukii using whole-genome sequencing data and environmental metadata from 29 population samples from four distant continents and three islands. Our results suggested strong founder event bottlenecks in invasive populations and separate Asia-sourced invasion events into America and Europe, as well as gene flow among continents following the first founder events. We also detected a large number of genomic targets of local adaptation under different environmental pressures. Our findings provide insights into the population history of D. suzukii, and have the potential to further our understanding of the genetic architecture underlying this species’ invasion success.