General Session

Schedule subject to change
All times listed are in EDT

Tuesday, May 16^th

9:00-10:00

Invited Presentation: Bioinformatics methods for precision farming: examples from dairy

Room: Leacock 132

Format: Live from venue

Abdoulaye Diallo

Presentation Overview: Show

10:30-11:00

Towards Computing Attributions for Dimensionality Reduction Techniques

Room: Leacock 132

Format: Live from venue

Matthew Scicluna, Departement de biochimie et medecine moleculaire, Universite de Montreal, Canada
Jean-Christophe Grenier, Institut de Cardiologie de Montréal, Canada
Raphael Poujol, Institut de Cardiologie de Montréal, Canada
Sebastien Lemieux, Institut de recherche en immunologie et en cancérologie, Universite de Montreal, Canada
Julie Hussin, Departement de Medecine, Universite de Montreal, Canada

Presentation Overview: Show

11:00-11:15

CLARUS: An Interactive Explainable AI Platform for Manual Counterfactuals in Graph Neural Networks

Room: Leacock 132

Format: Live from venue

Jacqueline Beinecke, Institute for Medical Informatics at the University Medical Center Goettingen, Germany
Anna Saranti, Human-Centered AI Lab, University of Natural Resources and Life Sciences, Vienna, Austria
Alessa Angerschmid, Human-Centered AI Lab, University of Natural Resources and Life Sciences, Vienna, Austria
Bastian Pfeifer, Institute for Medical Informatics, Statistics and Documentation, Medical University Graz, Austria, Austria
Vanessa Klemt, Biomedical Datascience lab, Philipps University Marburg, Germany, Germany
Andreas Holzinger, Human-Centered AI Lab, University of Natural Resources and Life Sciences, Vienna, Austria
Anne-Christin Hauschild, Institute for Medical Informatics at the University Medical Center Goettingen, Germany

Presentation Overview: Show

Lack of trust in artificial intelligence (AI) models in medicine is still the key blockage for the use of AI in clinical decision support systems (CDSS). Although AI models are already performing excellently in medicine, their black-box nature entails that it is often impossible to understand why a particular decision was made. This is especially true for very complex models such as graph neural networks (GNNs), which have arisen in recent years to tackle the problem of modelling biological networks such as protein-protein-interaction graphs (PPIs). In the field of explainable AI (XAI), many algorithms have already been developed to ""explain"" to a human expert, which input features influenced a specific prediction. However, in the clinical domain, it is essential that these explanations lead to some degree of causal understanding by a clinician in the context of a specific application.

Therefore, we developed the CLARUS platform, aiming to promote human understanding of GNN predictions by allowing the domain expert to validate and improve the GNN decision-making process. CLARUS enables the visualization of the biological networks used to train and test the GNN, where node and edge correspond to gene products and their interactions for instance. Relevance values of genes and interactions are computed by XAI models, such as GNNExplainer are highlighed in the visualized graph by color intensity and line thickness. This enables domain experts to gain deeper insights into what parts of the biological network were most influential in the GNN decision-making process. More importantly, the expert can interactively manipulate the patient PPI based on their understanding and initiate a retraining or re-prediction. This interactivity allows them to ask manual counterfactual questions and analyse the resulting effects on the GNN decision.

We will present the first interactive XAI platform prototype, CLARUS, that allows not only the evaluation of specific human counterfactual questions based on user defined alterations of patient PPI networks and a re-prediction of classes, but also a retraining of the entire GNN after changing the underlying graph structures. The platform is currently hosted by the GWDG on https://rshiny.gwdg.de/apps/clarus/ and the prepint of our paper is available in bioarxiv [1].

[1] https://doi.org/10.1101/2022.11.21.517358

11:15-11:30

Building generalised protein-protein interaction models for robust out-of-distribution, cross-species interactions using RAPPPID

Room: Leacock 132

Format: Live from venue

Joseph Szymborski, Department of Electrical and Computer Engineering, McGill University. Mila, Quebec AI Institute., Canada
Amin Emad, Department of Electrical and Computer Engineering, McGill University. Mila, Quebec AI Institute., Canada

Presentation Overview: Show

Model organisms like Homo sapiens or Mus musculus enjoy the privilege of having their protein-protein interaction (PPI) networks largely characterised through high-confidence experimental evidence. While the networks of these organisms are mature and well-studied, it would take far too much effort to perform the same experimental validation on more obscure, lesser-studied species.

In silico methods are ideal for bridging the gap between well- and lesser-studied organisms, as they typically require fewer resources and less time than their in vitro and in vivo counterparts. Machine learning (ML) models that infer PPIs have been long proposed for this purpose. Unfortunately, supervised ML methods face the challenge that the lesser-studied species which would most benefit from having their PPIs inferred lack sufficient data to train accurate models.

ML methods which exhibit strong out-of-distribution (OOD) performance can overcome the small-dataset challenges by training on PPI networks of organisms with many edges, and inferring the edges of the incomplete, lesser-studied network. Achieving strong OOD performance, however, is a very difficult task for computational PPI prediction methods as it requires that they sufficiently generalise to the overall problem space. In fact, existing PPI methods often have a demonstrably difficult time making inferences to PPIs of proteins that (1) are outside of the training set and (2) are of a different species.

We have developed RAPPPID (PMID: 35771595), a method for Regularised Automatic Prediction of Protein-Protein Interactions using Deep Learning, that makes accurate PPI predictions on OOD data samples from evolutionarily distant species. RAPPPID takes as its only input pairs of amino acid (AA) sequences. These AA sequences are encoded using a deep twin AWD-LSTM neural network which generates latent embeddings for both sequences. These embeddings are subsequently inputted into a multi-layer perceptron (MLP) classification network. RAPPPID was trained on high-confidence edges (>95%) from the STRING dataset.

RAPPPID outperforms leading PPI prediction methods, including D-SCRIPT and SPRINT, when tested and evaluated on datasets which carefully control for data leakage. RAPPPID is capable of accurately predicting the interaction between the therapeutic antibodies Trastuzumab and Pertuzumab and their target HER-2. RAPPPID’s performance does not degrade when trained and tested on the various other species tested. Further, RAPPPID models trained on human training data accurately predict edges from other species, often achieving comparable performance to models trained on those very species themselves. RAPPPID models trained on human species gain even greater performance gains when tested on other species when transfer-learning is used to fine-tune the RAPPPID model on those species.

11:30-11:45

Genome-wide inference of eukaryotic coding regions with deep learning

Room: Leacock 132

Format: Live from venue

Xavier Lapointe, Université de Sherbrooke, Canada
Marie A. Brunet, Université de Sherbrooke, Canada

Presentation Overview: Show

Recent work has presented new evidence of translation for thousands of previously unknown coding sequences (CDS), drastically expanding the potential landscape of the human proteome. However, due to experimental limitations and inherent biases, many more CDS are likely to be overlooked or underestimated, thereby limiting our understanding of the full range of proteome diversity. Accurate annotation of functional elements holds crucial implications for clinical and fundamental research, hence we need revised tools to exhaustively evaluate the functional ORFeome. The astounding success of deep learning on sequence modeling tasks combined with quality -omics data provides hope in the search for an approximation of the universal features that underlie translation. Here we present FOMOnet (Fear Of Missing ORFs neural network), a 1D convolutional neural network derived from the UNet architecture. FOMOnet is trained with one-hot encoded representations of human protein coding transcripts and performs segmentation of coding regions within a transcript. With a receiver operating characteristic (ROC) area under the curve (AUC) of 99.3% and a precision-recall (PR) AUC of 96.1%, it vastly outperforms current tools aiming to predict canonical human CDS. Interestingly, our model displays no significant decrease in performance when evaluated on non-orthologous genes (with respect to human) in five other species: Mus musculus, Danio rerio, Xenopus laevis, Caenorhabditis elegans, and Saccharomyces cerevisiae. These results suggest FOMOnet as a viable tool for annotating the proteome of distantly related eukaryotes and newly sequenced species. Moreover, several studies have recently attributed function to proteins derived from pseudogenes (e.g. NOTCH2NL, SRGAP2, UBBP4) widely defined as evolutionary relics and thereby non-functional. However, the homology of pseudogenes with their parent genes has hindered large-scale functional studies. To address this blind spot in gene annotation, we used the OpenProt proteogenomics resource to extract 1784 human pseudogene transcripts with at least one ORF exhibiting strong evidence of translation (i.e. two independent detections by Ribo-seq or mass spectrometry). Among these translated ORFs, FOMOnet confidently identifies 702 as functional CDS, while also predicting hundreds of pseudogene CDS with no previous evidence for translation. Overall, this approach provides an unbiased assessment of CDS within the human genome. It outperforms current state-of-the-art tools, is generalizable to other eukaryotes, accurately predicts some newly found CDS and predicts novel ones.

11:45-12:00

STRUCTURAL CHARACTERIZATION OF NATURAL ANTISENSE TRANSCRIPTS WITH NANOPORE SEQUENCING

Room: Leacock 132

Format: Live from venue

J. White Bear, McGill University, Canada
Grégoire De Bisschop, Institut de recherches de cliniques de Montréal (IRCM), Canada
Éric Lecuyer, Institut de recherches de cliniques de Montréal (IRCM), Canada
Jérôme Waldispühl, McGill University, Canada

Presentation Overview: Show

Natural antisense transcripts (NAT) are RNA pairs transcribed from overlapping, opposite DNA strands. NAT are expressed in all three domains of life, including retroviruses. They are involved in regulation of RNA expression including RNA maturation, stability, localization and translation. As such, they are frequently indicated in diseases pathways, such as cancer. Yet, it is unclear how NAT pairs bind as the assumption that they form long intermolecular duplexes has never been challenged. We, thus, hypothesize that NAT pairs assemble through a wide range of structures spanning from mostly intramolecular to mostly intermolecular base pairings. Many chemical probing techniques have been developed and considerable progress has been made in the investigation of RNA structure. However, they provide an average signal that is hardly amenable to deconvolution and precludes the identification of discrete structures within a complex equilibrium. We focus on, cen and ik2, two natural antisense mRNA localized to the centrosome during mitosis in Drosophila embryos. They share a 59-nucleotide long antisense region located in their 3’UTR, a prerequisite for their interaction and localized translation.
We employ direct RNA sequencing to identify adduct positions on single RNA molecules using nanopore reads. Nanopore reads utilize a 5-mer, dwell times, and current signal to characterize RNA sequences. Sequenced reads are aligned and normalized to produce reactivity profiles that are used to predict modification of unpaired nucleotides via statistical analysis or machine learning methods, such as SVM. These methods have been limited in the scope of the features set and predictions, and the size of available data. To address these challenges, we collect a large training set of both modified and unmodified cen and ik2 (n >= 20000) and their reactivity profiles. We, first, examine the consequence of information loss by expanding the feature set and incorporating reactivity profiles to create semi-supervised models which detect structural features and predict, de novo, reactivity profiles with improved correlations to ground truth. We, next, leverage these expanded feature and data sets to develop multi-class and multi-output deep learning models that jointly predict sequence, induced modifications, and secondary structural features. Preliminary results suggest that our methods yield comparable or improved identifications to standard SHAPE and existing direct RNA analyses.

13:30-14:30

Invited Presentation: Leveling Up Citizen Science

Room: Leacock 132

Format: Live from venue

Jérôme Waldispühl

Presentation Overview: Show

14:30-14:45

Advanced prediction of linear B-cell epitopes using a protein language model

Room: Leacock 132

Format: Live from venue

Minh Nguyen, University of Cincinnati, United States
Alexey Porollo, Children's Hospital Medical Center, United States

Presentation Overview: Show

The humoral immune response is a component of the adaptive immune system that involves the production and secretion of antibodies by B-cells. These antibodies are designed to recognize and bind to specific antigens, which are foreign substances originating from bacteria, viruses, or other pathogens. The majority of the actual targets of antibodies, known as epitopes, are small parts of microbial proteins, which can be continuous (linear) or discontinuous (conformational). Identification of B-cell epitopes is an important step in developing diagnostic methods, therapeutic antibodies, and epitope-based vaccines. Many methods have been developed in the past to predict B-cell epitopes based on profiles of the physico-chemical properties of amino acids in the sequence, their predicted secondary structure and relative solvent accessibility, and experimentally resolved or predicted 3D structures. However, the performance of these methods remains modest. Recently published applications of large language models to protein sequence embedding suggest that such embeddings, while independent of multiple sequence alignment, are capable of capturing long-range interactions and therefore encode structural (and potentially functional) information. Following these observations, we employed the ProtT5 protein language model to derive sequence embeddings and built a model to predict B-cell epitopes. The Immune Epitope Database (IEDB) was used to compile a non-redundant set of human pathogen antigens with all known linear epitopes mapped. This set was split into 5-fold training, validation, and control subsets with a 65:10:25% ratio, respectively. Then, the Bacterial and Viral Bioinformatics Resource Center (BV-BRC) database was used to compile a blind set of 175 antigens non-redundant between themselves and against the IEDB dataset. Our prediction model was compared with the BepiPred2 and BepiPred3 methods and showed superior performance. Other advantages of our model include independence of multiple sequence alignment and/or 3D protein structure and unlimited length of the protein sequence.

14:45-15:00

DiRLaM: Diversity-Regularized Autoencoder for Modeling Longitudinal Microbiome Data

Room: Leacock 132

Format: Live from venue

Derek Reiman, Toyota Technological Institute at Chicago, United States
Yang Dai, Univ. of Illinois at Chicago, United States

Presentation Overview: Show

Background: The human gut microbiome has been shown to impact host development, normal metabolic processes, as well as the pathogenesis of various diseases. Based on these discoveries, engineering the gut microbiome for the treatment of such diseases has become an exciting new direction in medical science. Uncovering the nature of how to precisely control a patient’s microbiome requires accurate modeling the dynamics of the microbiome community under varying conditions. However, the modeling of longitudinal microbiome data faces many challenges due to the inherent noise of microbiome data. Therefore, the development of robust and accurate models will empower the identification of microbiome-targeted therapies, as clinicians and researchers will be able to identify which factors and stimuli can be used in order to drive a patient’s microbiome to a healthier composition.

Method: Here we present DiRLaM, a deep-learning framework combining an autoencoder and deep neural network for modeling microbiome dynamics. By representing the microbiome community in a reduced latent space using an autoencoder, DiRLaM can capture the essential intrinsic community structure while making the model more robust to noise. Furthermore, DiRLaM interpolates microbiome communities within the learned latent space. In order to construct smooth transitions between different microbiome community samples, a novel regularization is applied to the Beta diversity of the observed and interpolated communities. Next, a deep neural network is trained to combine the latent microbiome community with additional information about the host and external stimuli to predict the microbiome community at the next time point. Lastly, using the trained models, DiRLaM identifies microbe-microbe interactions and significant host and external factors that contribute to the dynamic changes of the microbiome community structure.

Results: Using synthetic datasets and three real-world longitudinal datasets, we show that DiRLaM provides a more robust interpolation under increasing levels of noise compared to standard B-Spline interpolations. DiRLaM also outperforms the state-of-the-art dynamic Bayesian network model for predicting subsequent microbiome communities in longitudinal data. Additionally, we demonstrate DiRLaM’s ability to identify significant host characteristics and environmental factors contributing to the dynamics of the microbiome community.

Conclusion: We present DiRLaM, a combination of an autoencoder with a novel Beta diversity regularization and deep neural network. In both synthetic and real-world conditions, DiRLaM was both more robust and more accurate when modeling longitudinal microbiome data.

15:00-15:15

Identifying patterns and addressing pitfalls in beta diversity analyses

Room: Leacock 132

Format: Live from venue

Susan Hoops, University of Minnesota, United States
Dan Knights, University of Minnesota, United States

Presentation Overview: Show

A series of choices in microbiome beta diversity analysis can dramatically impact findings in distance-based association testing and visual representation of sample relationships [1]. Through a comparative approach to pairwise beta diversity analysis in microbiome studies, we evaluate differences among and between established common practices and newer machine-learning approaches. Comparative analysis of a range of published microbiome datasets spanning human, animal, and environmental samples reveals how some beta diversity indices resemble one another, while others accentuate distinct features of sample variation. Identifying relationships between diversity indices can aid in interpretation and comparison of beta diversity analyses across microbiome studies, including findings from distance-based statistical tests. We then visualize these indices with established ordination methods such as PCoA and t-SNE as well as newer visualization tools such as the DeepMicro autoencoder [2]. This additional comparative analysis allows us to demonstrate advantages and disadvantages of different methods based on interpretability, distortion of original data, and susceptibility to model tuning. Finally, we describe some common pitfalls in beta diversity, such as oversaturation of pairwise indices leading to the commonly seen “arch effect” or “horseshoe effect” in visualization, and explore the efficacy of a novel approach, Local Gradient Distance (LGD), for correcting oversaturated distances. As a whole, our comparative analysis provides a new, data-driven framework for choosing an appropriate beta diversity analysis approach for a particular dataset.
[1] Knight, R., Vrbanac, A., Taylor, B. C., Aksenov, A., Callewaert, C., Debelius, J., ... & Dorrestein, P. C. (2018). Best practices for analysing microbiomes. Nature Reviews Microbiology, 16(7), 410-422.
[2] Oh, M., & Zhang, L. (2020). DeepMicro: deep representation learning for disease prediction based on microbiome data. Scientific reports, 10(1), 6026.

15:15-15:30

ml-mr: Software for nonparametric and nonlinear Mendelian randomization estimation using machine learning

Room: Leacock 132

Format: Live from venue

Marc-André Legault, McGill University, Canada
Jason Hartford, Mila, Canada
Michael Lu, McGill University, Canada
Archer Y. Yang, McGill University, Canada
Joëlle Pineau, McGill University, Canada

Presentation Overview: Show

Mendelian randomization (MR) is a method to estimate the causal effect of an exposure on an outcome in the presence of unmeasured confounding variables by leveraging the framework of instrumental variable (IV) estimation. An IV is a variable that induces variability in the exposure independently from confounding variables, and MR uses genetic variants as IVs. MR is widely used to predict the effect of interventions on modifiable disease risk factors to validate drug targets or therapeutic pathways. However, assumptions are needed to estimate causal relationships. For example, MR methods typically assume that the effect of the genetic variant used as the IV on the exposure is linear and homogeneous. These assumptions may be invalid in the presence of genetic interactions (e.g. different genetic effects in men vs women) or nonlinear relationships. Methods for nonparametric IV estimation that relax the effect homogeneity and linearity assumptions have been developed. For example, the DeepIV algorithm uses neural networks to model the instrument-exposure and exposure-outcome relationships in two distinct stages, making no additional assumptions about the functional forms and allowing for heterogeneity due to observed variables (Hartford et al. 2017). Alternative approaches include machine learning estimators based on the generalized method of moments or kernel instrumental variable regression.

Few of these recent nonparametric IV estimators have been evaluated in the context of MR. This is due in part to the difficulty of training some of these models which requires the use of heuristic optimization strategies when training neural network-based models. To bridge this gap, we have developed ml-mr, a bioinformatics package that implements various nonparametric IV estimators to enable their use and evaluation in the context of MR. We also provide a framework for simulation analyses which includes previously published simulation scenarios enabling the head-to-head comparison of different methods. To assess the precision of these MR estimators, we have also included tools to estimate valid prediction intervals from black box machine learning models using conformal prediction. We used ml-mr to evaluate different formulations of the DeepIV algorithm that simplify the model for the instrument-exposure relationship. Using simulation models, we evaluated the sensitivity of the MR estimators to important parameters such as the heritability and number of samples. We also report on possible uses of these methods to estimate conditional treatment effects for drug target validation in targeted patient populations.

16:00-16:30

Signatures of co-evolution and co-regulation in the CYP3A and CYP4F genes in humans

Room: Leacock 132

Format: Live from venue

Alex Richard-St-Hilaire, Université de Montréal, Canada
Isabel Gamache, Université de Montréal, Canada
Justin Pelletier, Université de Montréal, Canada
Jean-Christophe Grenier, Montreal Heart Institute, Canada
Raphael Poujol, Montreal Heart Institute, Canada
Julie G Hussin, Université de Montréal, Canada

Presentation Overview: Show

16:30-16:45

A seed-guided topic model to improve phenotyping for PheWAS analysis in UK Biobank data

Room: Leacock 132

Format: Live from venue

Ziyang Song, McGill University, Canada
Ziqi Yang, McGill University, Canada
Ruohan Wang, McGill University, Canada
Shadi Zabad, McGill University, Canada
Marc-André Legault, McGill, Canada
Yue Li, McGill University, Canada

Presentation Overview: Show

Phenome-wide association studies (PheWAS) promise to detect shared genetic variants across a wide spectrum of phenotypes, thereby elucidating the molecular etiology of disease comorbidities. UK Biobank (UKB) provides a valuable resource to conduct PheWAS in over half a million individuals with readily available phenotype and genotype data. To aid PheWAS, we can harness the PheCode system, which maps ICD-10 codes to 1500 expert curated phenotype codes. Despite this effort, PheWAS based on the PheCode definitions may have limited statistical power because of the imperfect classification of cases and controls.

To tackle this challenge, we applied our recently developed MixEHR-SAGE [1] to infer 1500 PheCode-guided probabilistic topics from the UKB data using not only the 6800 unique ICD-10 codes but also 803 unique ATC medication codes and 2560 unique OPCS medical procedure codes. MixEHR-SAGE introduces a PheCode-driven initialization of phenotype topic priors. For each phenotype, we infer a dual-form of topic distribution: a seed-topic distribution over a core set of ICD-10 codes defined under the PheCode and a regular topic distribution over the entire ICD-10, OPCS and ATC vocabularies to capture complementary information from these additional data modalities.

To evaluate the accuracy of our phenotyping algorithm, we extracted 139 PheCodes that can be treated with any known ATC drug and removed subjects that used any of those drugs in their initial visits. For the remaining subjects, we used their PheCode-topic probabilities inferred in their initial visits as their surrogate disease severity scores. We evaluated these scores based on whether those subjects were prescribed linked medications in their follow-up visits. We achieved on average 60% area under the precision-recall curve in contrast to 40% using mixture models and 20% using the raw PheCodes.

We conducted PheWAS using the UKB data over the 1 million HapMap3 SNPs and the 1500 inferred PheCode-guided topics modelled as continuous phenotypes. Our PheWAS results identified genome-wide significant loci that did not reach significance in conventional facade-based PheWAS analyses. At least one new genome-wide significant locus was identified for 111 phenotypes. For example, we identified a novel association with our PheCode-guided topic for “preeclampsia and eclampsia,” a common complication of pregnancy. The lead variant at the locus, rs10405410 had an association p-value of 2.6 x 10-9 and was located near a cluster of carcinoembryonic antigen (CEA) genes including CEACAM1 and CEACAM8. CEACAM1 has been associated with insulin sensitivity in pregnant women with gestational diabetes and a recent prospective study showed that CEACAM1 levels were increased in the first trimester in women with preeclampsia when compared to health controls [2], providing external evidence supporting our finding.
Source code is available at https://github.com/li-lab-mcgill/MixEHR-Seed

[1] Song et al. (2022) Automatic phenotyping by a seed-guided topic model. In Proceedings of the 28th ACM SIGKDD.

[2] Mach, P. et al. Evaluation of carcinoembryonic antigen-related cell adhesion molecule 1 blood serum levels in women at high risk for preeclampsia. Am. J. Reprod. Immunol. 85, e13375 (2021)

16:45-17:15

Comparison of polygenic risk scores for coronary artery disease highlights obstacles to overcome for clinical use

Room: Leacock 132

Format: Live from venue

Holly Trochet, Quantics Biostatistics, Edinburgh,Scotland, UK, United Kingdom
Justin Pelletier, McGill University \& McGill CERC Genomic Medicine, Montreal, Canada, Canada
Rafik Tadros, Montreal Heart Institute, Department of Medicine, Université de Montréal, Montreal, QC, Canada, Canada
Julie G Hussin, Montreal Heart Institute, Department of Medicine, Université de Montréal, Montreal, QC, Canada, Canada

Presentation Overview: Show

17:15-17:30

Genomic epidemiology reveals the dominance of Hennepin County in transmission of SARS-CoV-2 in Minnesota from 2020-2022

Room: Leacock 132

Format: Live from venue

Matthew Scotch, Arizona State University, United States
Kimberly Lauer, Mayo Clinic, United States
Eric D. Wieben, Mayo Clinic, United States
Yesesri Cherukuri, Mayo Clinic, United States
Julie M. Cunningham, Mayo Clinic, United States
Eric W. Klee, Mayo Clinic, United States
Jonathan J. Harrington, Mayo Clinic, United States
Julie S. Lau, Mayo Clinic, United States
Samantha McDonough J., Mayo Clinic, United States
Mark Mutawe, Mayo Clinic, United States
John C. O'Horo, Mayo Clinic, United States
Chad E. Rentmeester, Mayo Clinic, United States
Nicole R. Schlicher, Mayo Clinic, United States
Valerie T. White, Mayo Clinic, United States
Susan K. Schneider, Mayo Clinic, United States
Peter T. Vedell, Mayo Clinic, United States
Xiong Wang, Minnesota Department of Health, United States
Joseph D. Yao, Mayo Clinic, United States
Bobbi S. Pritt, Mayo Clinic, United States
Andrew P. Norgan, Mayo Clinic, United States

Presentation Overview: Show

SARS-CoV-2 has had an unprecedented impact on human health and highlights the need for genomic epidemiology studies to increase our understanding of the evolution and spread of pathogens and to inform policy decisions. We sequenced viral genomes from over 22,000 patients tested at Mayo Clinic Laboratories between 2020-2022 and leveraged detailed patient metadata to describe county and regional spread in Minnesota via Bayesian phylodynamics. We found that spread in the state was mostly dominated by viruses from Hennepin County, which contains the state’s largest metropolis. This includes the spread of earlier clades as well as variants of concern Alpha and Delta.
The earliest introduction into Minnesota was to Hennepin from a domestic (USA) source on around January 22, 2020; six weeks before the first confirmed case in the state. The first county-to-county introductions were estimated to originate from Hennepin to bordering Ramsey County around February 23 and from Hennepin to somewhere in Central Minnesota around February 25. Both international and domestic introductions were most abundant in Hennepin (home to an international airport). Hennepin also was, by far, the most dominant source of in-state transmissions to other Minnesota locations (n=2,119) over the two-year period.
We measured the ratio of introductions to total viral flow into and out of each county/region. A value of 1 suggests a location as being a “sink” (accepts SARS-CoV-2 lineages but never exports them to other counties), while a value of 0 indicates a county as being a “source”. Most locations were ”sinks” throughout the pandemic. Central Minnesota (outside of Hennepin and Ramsey) was dominated by introductions early, but later in 2020 and early 2021 experienced brief fluctuating trends of higher virus exportation. Hennepin showed a different trend than all others as it consistently acted as a source for other Minnesota counties. However, there were brief periods of fluctuation such as an increase in the ratio of introductions towards the end of 2020 and early 2021, potentially driven by introductions from international locations.
As the virus continues to evolve, more within state genomic epidemiology studies are needed to inform local and state public health response by highlighting the roles of various counties on state-wide transmission. In addition, they can elucidate the impact of out of state introductions on local spread which can inform policies such as travel.

17:30-17:45

Complex interactions between common variants impact COVID-19 infection severity

Room: Leacock 132

Format: Live from venue

Wen Wang, University of Minnesota, United States
Mathew Fischbach, University of Minnesota, United States
Chad Myers, University of Minnesota, United States

Presentation Overview: Show

The impact of COVID-19 on our lives over the past three years has been profound. COVID-19 cases vary in severity depending on a number of factors, including age, race, gender, blood type, and underlying conditions. While a few host genetic variants (such as the 3p21.31 locus) have been identified, our knowledge of how host genetics contribute to COVID-19 outcomes is still limited. Previous COVID-19 host genetic studies primarily analyze single-variant associations. However, like many human diseases, the genetic architecture underlying COVID-19 is likely to be complex, with potential interactions between variants determining disease outcomes.

In this study, we applied BridGE, a method we previously developed, to search for pathway-level genetic interactions associated with COVID-19 severity. We applied this approach to the UK Biobank England cohort and validated the discovered pathway-level interactions using the Scotland and Wales cohorts. In total, we discovered and replicated 21 between-pathway and 5 within-pathway interactions (FDR<0.05), all of which were associated with increased COVID-19 severity. Four of these interactions could be strongly replicated in the independent cohorts (FDR<0.05), including interactions involving antigen processing and presentation, androgen receptor signaling, and interferon‐γ signaling. We also found that some of the strongest variants driving these pathway-level interactions were located in or nearby the HLA super-locus (chrom. 6p21), which is intriguing given the important role of this locus in the immune response. While several of the discovered pathways showed clear relevance to COVID-19, most of these have not been implicated by previous COVID-19-related GWAS studies, and cannot be discovered through pathway enrichment analysis of collections of individually associated variants.

Finally, we found that by incorporating the discovered genetic interactions into a COVID-19 severity prediction model, the overall performance could be significantly improved using interaction information (increased AUC-ROC from 0.68 to 0.83). Similar improvements in this machine learning model performance could not be achieved by incorporating only variants discovered through single-variant GWAS analysis or through gene-set enrichment analysis. Our results suggest that complex genetic interactions between common variants play a role in determining COVID-19 severity.

17:45-18:00

Genome-wide association study identifies novel candidate malaria resistance genes in Cameroon

Room: Leacock 132

Format: Live-stream

Kevin K Esoh, Division of Human Genetics, University of Cape Town, South Africa
Tobias O Apinjoh, Department of Biochemistry and Molecular Biology, University of Buea, Cameroon
Alfred Amambua-Ngwa, Medical Research Council Unit The Gambia at LSHTM, Gambia
Steven G Nyanjom, Department of Biochemistry, Jomo Kenyatta University of Agriculture and Technology, Kenya
Emile R Chimusa, Department of Applied Sciences, Faculty of Health and Life Sciences, Northumbria University, United Kingdom
Lucas Amenga-Etego, West African Centre for Cell Biology of Infectious Pathogens, University of Ghana, Ghana
Ambroise Wonkam, Department of Genetic Medicine, Johns Hopkins University School of Medicine, United States
Eric A Achidi, Department of Biochemistry and Molecular Biology, University of Buea, Cameroon

Presentation Overview: Show

Wednesday, May 17^th

9:00-10:00

Invited Presentation: Cell fate specification and oncogenic vulnerability: the Ying and Yang of cellular dynamics

Room: Leacock 132

Format: Live from venue

Claudia Kleinman, McGill University, Canada

Presentation Overview: Show

10:30-11:00

DepLink: an R Shiny app to systematically link genetic and pharmacologic dependencies of cancer

Room: Leacock 132

Format: Live from venue

Tapsya Nayak, Greehey Children's Cancer Research Institute, University of Texas Health San Antonio, United States
Li-Ju Wang, UPMC Hillman Cancer Center, University of Pittsburgh, United States
Michael Ning, UPMC Hillman Cancer Center, University of Pittsburgh, United States
Gabriela Rubannelsonkumar, Greehey Children's Cancer Research Institute, University of Texas Health San Antonio, United States
Eric Jin, Department of Human Ecology, The University of Texas at Austin, United States
Siyuan Zheng, Greehey Children's Cancer Research Institute, University of Texas Health San Antonio, United States
Peter Houghton, Greehey Children's Cancer Research Institute, University of Texas Health San Antonio, United States
Yufei Huang, UPMC Hillman Cancer Center, University of Pittsburgh, United States
Yu-Chiao Chiu, UPMC Hillman Cancer Center, University of Pittsburgh, United States
Yidong Chen, Greehey Children's Cancer Research Institute, University of Texas Health San Antonio, United States

Presentation Overview: Show

11:00-11:15

Computational pharmacogenomics screening identifies synergistic statin-compound combinations targeting the mevalonate pathway in breast cancer

Room: Leacock 132

Format: Live from venue

Jenna van Leeuwen, Princess Margaret Cancer Centre, University Health Network, Toronto, ON, Canada, M5G 1L7, Canada
Wail Ba-Alawi, Princess Margaret Cancer Centre, University Health Network, Toronto, ON, Canada, M5G 1L7, Canada
Emily Branchard, Princess Margaret Cancer Centre, University Health Network, Toronto, ON, Canada, M5G 1L7, Canada
Jennifer Cruickshank, Princess Margaret Cancer Centre, University Health Network, Toronto, ON, Canada, M5G 1L7, Canada
Weibke Schormann, Biological Sciences, Sunnybrook Research Institute, Toronto, ON, M4N 3M5, Canada, Canada
Joseph Longo, Princess Margaret Cancer Centre, University Health Network, Toronto, ON, Canada, M5G 1L7, Canada
Jennifer Silvester, The Campbell Family Institute for Breast Cancer Research, Toronto, ON, Canada, M5G 2C1, Canada
Peter Gross, Department of Medicine, McMaster University, Hamilton, ON, Canada, L8S 4L8, Canada
David Andrews, Biological Sciences, Sunnybrook Research Institute, Toronto, ON, M4N 3M5, Canada, Canada
David Cescon, Princess Margaret Cancer Centre, University Health Network, Toronto, ON, Canada, M5G 1L7, Canada
Benjamin Haibe-Kains, Princess Margaret Cancer Centre, University Health Network, Toronto, ON, Canada, M5G 1L7, Canada
Linda Penn, Princess Margaret Cancer Centre, University Health Network, Toronto, ON, Canada, M5G 1L7, Canada
Deena Gendoo, Institute of Cancer and Genomic Sciences, University of Birmingham, Birmingham, Birmingham B15 2TT, United Kingdom, United Kingdom

Presentation Overview: Show

Triple-negative breast cancer (TNBC) remains a difficult-to-treat and aggressive subtype of breast cancer, with poorer prognosis compared to other subtypes. Aberrant activation of the metabolic mevalonate (MVA) pathway is a hallmark of many cancer types, including TNBC, due to the production of cholesterol and non-sterol isoprenoids that promote cellular proliferation and survival. Statins are FDA-approved, cholesterol-lowering agents that demonstrate therapeutic potential, by inhibiting the rate-limiting enzyme of MVA pathway. Statins trigger tumour-specific cell death via a restorative feedback response that triggers the induction of mevalonate pathway genes, a process which ultimately dampens the pro-apoptotic activity of statins significantly.
In this work, we sought to identify synergistic statin-compound combinations that would potentiate statin-induced tumour cell death as an improved treatment strategy. Dipyridamole (DP) was previously identified as an FDA-approved agent that synergizes with statins, and potentiates statin-induced tumour cell death. However, DP’s unclear drug mechanisms, complex polypharmacology and antiplatelet activity limits its clinical use in cancer patients.
We leverage a pathway-centric, computational pharmacogenomics approach to identify new compounds that phenocopy DP, based on high-throughput integration of drug structure, drug-gene perturbation and drug sensitivity profiles into a comprehensive network. We restricted some elements of our network to only those genes that are involved in the MVA pathway. Our approach, called mevalonate drug network fusion (MVA-DNF), identified 19 drugs that phenocopy DP behaviour in the regulation of MVA pathway genes at phenotypic and molecular levels. Importantly, MVA-DNF facilitated identification of synergistic statin-compound combinations that would potentiate statin-induced tumour cell death. We validated two top-ranked compounds, Nelfinavir and Honokiol; we demonstrated that these drugs synergize with fluvastatin to potentiate tumour cell death, by testing the drugs on a panel of breast cancer cell-lines, and 3D primary breast-cancer patient-derived tumour organoids. The synergistic responses of fluvastatin-nelfinavir and fluvastatin-honokiol drug combinations share similar mechanistic behaviour in targeting of the transcriptomic and proteomic pathways as DP, presenting more targeted alternatives for statin-drug combinations in TNBC treatment.
Our computational pharmacogenomic approach presents a time- and cost-effective strategy to identify novel, actionable compounds with pathway-specific activities. We are adapting this approach to identify novel compounds that phenocopy a compound of interest while targeting various key metabolic pathways. This sets the framework for future pathway-centric identification of drug combinations as anti-cancer therapeutic strategies.

Reference: van Leeuwen et al. Computational pharmacogenomic screen identifies drugs that potentiate the anti-breast cancer activity of statins. Nature Communications 13, 6323 (2022). doi: 10.1038/s41467-022-33144-9

11:15-11:30

Selection acting on mosaic chromosomal alterations in blood impacts molecular function and cancer risk among humans

Room: Leacock 132

Format: Live from venue

Kimberly Skead, Ontario Institute for Cancer Research, Canada
David Soave, Wilfrid Laurier University, Canada
Vanessa Bruat, Ontario Institute for Cancer Research, Canada
Michelle Harwood, Ontario Institute for Cancer Research, Canada
Quaid Morris, Memorial Sloan Kettering Cancer Center, United States
Marie-Julie Fave, Ontario Institute for Cancer Research, Canada
Philip Awadalla, Ontario Institute for Cancer Research, Canada

Presentation Overview: Show

11:30-11:45

Pan cancer classification of single cells in the tumour microenvironment

Room: Leacock 132

Format: Live from venue

Ido Nofech-Mozes, Ontario Institute for Cancer Research, Canada
David Soave, Wilfrid Laurier University, Canada
Philip Awadalla, Ontario Institute for Cancer Research, Canada
Sagi Abelson, Ontario Institute for Cancer Research, Canada

Presentation Overview: Show

Single-cell RNA sequencing can reveal valuable insights into cellular heterogeneity within tumour microenvironments (TMEs), thus paving the way for a deep understanding of the cellular mechanisms contributing to cancer. However, high heterogeneity among the same cancer types and low transcriptomic variation in immune cell subsets present challenges for accurate, high-resolution confirmation of cells’ identities. Here we present scATOMIC; a modular annotation tool for malignant and non-malignant cells. We trained the core scATOMIC model on >250,000 cancer, immune, and stromal cells defining a pan-cancer reference across 19 common cancer types and employed a novel hierarchical approach, outperforming current classification methods. We extensively confirmed scATOMIC‘s accuracy in an external validation set of 198 tumour biopsies and 54 blood samples encompassing >420,000 cancer and a variety of TME cells, achieving median F1 scores of 0.99 across cell types. Compared with 6 other methods, scATOMIC is the only tool that can accurately predict cancer type among single cancer cells. Moreover, scATOMIC identifies distinct malignant subclones among multiple samples that are not identified with copy number variation inference methods. We demonstrate increased cell type resolution across epithelial, blood and stromal cell types across cancer types. We highlight scATOMIC’s practical significance as a modular method by extending a classification branch to accurately subset breast cancers into their clinically relevant molecular subtypes. In a rare ER-low breast cancer scATOMIC annotates distinct populations of ER+ breast cancer cells and triple negative breast cancer cells with different copy number variation profiles, suggesting a genome instability evolution event leading to the loss of ER on most cancer cells. Additionally, we applied scATOMIC to predict tumours’ primary origin across metastatic cancers and achieved accurate predictions in 87% of tumours. Finally, our ability to annotate samples from across cancer types with high resolution has allowed for new insights into pan cancer cell-cell communication networks. We show that increased resolution of annotations can highlight inferred interactions between malignant cells and normal epithelial cells, as well as interactions related to the PD1-PDL1 pathways between dendritic cells and exhausted T cells. Our approach represents a broadly applicable strategy to analyze multicellular cancer TMEs.

11:45-12:00

Characterizing Gene Regulatory Networks in Breast Cancer Progression Model

Room: Leacock 132

Format: Live from venue

Cong Gao, The University of Vermont, United States
Seth Frietze, The University of Vermont, United States

Presentation Overview: Show

13:30-14:30

Invited Presentation: Molecular mechanisms of evolutionary innovation

Room: Leacock 132

Format: Live from venue

Anne-Ruxandra Carvunis

Presentation Overview: Show

14:30-14:45

Microexons are pervasive misannotation in plants, evolutionarily conserved, and crucial for gene function

Room: Leacock 132

Format: Live from venue

Chi Zhang, University of Nebraska - Lincoln, United States
Huihui Yu, University of Nebraska - Lincoln, United States
Jeffrey Mower, University of Nebraska - Lincoln, United States
Bin Yu, University of Nebraska - Lincoln, United States

Presentation Overview: Show

14:45-15:00

Simulation of scRNAseq data controlled by a causal gene regulatory network

Room: Leacock 132

Format: Live from venue

Yazdan Zinati, McGill University, Electrical and Computer Engineering, Montréal, Canada, Canada
Abdulrahman Takiddeen, McGill University, Electrical and Computer Engineering, Montréal, Canada, Canada
Amin Emad, McGill University, Electrical and Computer Engineering, Montréal, Canada, Canada

Presentation Overview: Show

The reconstruction of gene regulatory networks (GRNs) from single-cell gene expression data has been a topic of interest since the previous decade. However, benchmarking GRN inference algorithms remains challenging due to the absence of gold-standard ground truth. While reference GRNs can be built based on experimental data such as ChIP-Seq, or curated from literature, interactions might only partially correspond to the biological context under investigation, requiring lengthy and expensive perturbation experiments.

To overcome these issues, we present GRouNdGAN, a single-cell RNA-seq simulator based on causal generative adversarial networks. In this model, genes are causally expressed under the control of regulating transcription factors (TFs), guided by a user-provided GRN. GRouNdGAN enables simulation of single-cell RNA-seq data, in silico perturbation experiments and benchmarking of GRN inference methods. It is trained using a reference dataset to capture non-linear TF-gene dependencies, as well as technical and biological noise in real scRNAseq data to generate realistic datasets in which GRN properties are captured and gene identities are preserved.

GRouNdGAN outperforms state-of-the-art simulators in generating realistic cells indistinguishable from real ones despite the rigid constraints of an imposed GRN. Moreover, perturbing a TF results in significant perturbation of its targets, while other genes’ expression remain unchanged. In addition, GRouNdGAN can simulate cells at different states of a biological process. Using a dataset corresponding to the differentiation of stem cells, we show that the simulated cells conserve trajectories and pseudo-time orderings consistent with those of the real dataset. We use these properties to benchmark a variety of GRN inference methods, including those that utilize the concept of pseudo-time.

GRouNdGAN learns meaningful causal regulatory dynamics and can sample from interventional in addition to observational distributions and synthesize cells under conditions that do not occur in the dataset at inference time. This property allows for predicting perturbation and TF knockdown experiments in-silico. Using a scRNA-seq dataset corresponding to 11 cell types to generate simulated data, we show that excluding top three differentially expressed TFs of each cell type results in disappearance of that cell type from generated samples. In another experiment, removing lineage-determining TFs in hematopoiesis results in cells differentiating into other lineages consistent with in vitro knockout experiments.

In summary, GRouNdGAN is a powerful scRNAseq simulator with many utilities from simulating data for GRN inference to simulating in silico knockout experiments.

15:00-15:15

Predicting medulloblastoma subtype from single-cell RNA-seq data with pair-based classifiers

Room: Leacock 132

Format: Live from venue

Steven M. Foltz, Alex's Lemonade Stand Foundation, University of Pennsylvania, United States
Chante Bethell, Alex's Lemonade Stand Foundation, United States
Casey S. Greene, University of Colorado Anschutz, University of Pennsylvania, United States
Jaclyn N. Taroni, Alex's Lemonade Stand Foundation, United States

Presentation Overview: Show

Medulloblastoma (MB) is an aggressive pediatric cancer with subtypes that each have unique molecular features and patient outcomes (Taylor et al., 2012). The four main MB subtypes – SHH, WNT, Group 3, and Group 4 – can be predicted using gene expression or methylation data from bulk samples. SHH and WNT are easy to distinguish, but existing classification methods struggle to discriminate between Group 3 and Group 4 (Weishaupt et al., 2019). Existing methods are also often applied to entire cohorts, rather than predicting subtype labels for individual samples as they are collected. Here, we introduce a single sample predictor that accurately classifies individual samples without the need to normalize values to match a training distribution. We applied k top-scoring pairs, a classification method based on the ordering of a set of paired measurements, and random forest approaches to make subtype predictions based on within-sample relative gene expression levels. We demonstrate comparable performance across RNA-seq and microarray profiling. After training models using bulk microarray and RNA-seq, we tested the performance of our single sample predictor on single-cell RNA-seq data from a set of 36 medulloblastoma samples representing all four subtypes. Our model correctly predicted the subtype in the majority of pseudo-bulked samples constructed by averaging genes’ expression levels across all cells. We applied the classifiers to individual cells in the single cell data. The predicted subtype of the majority of individual cells matched the patient’s subtype in 35 out of 36 samples. In three samples, however, the predicted subtypes were a mix of Group 3 and Group 4, with low confidence predictions suggesting an intermediate phenotype. Notably, Group 3 and Group 4 have previously been found to exist as intermediates on a transcriptomic spectrum (Williamson et al., 2022). Our results provide single-cell support for a model of Group 3 and Group 4 existing along a continuum and illustrate the value of the ability to classify individual cells. In summary, k top-scoring pairs and random forest single sample predictors accurately predict MB subtype labels across platforms and for both bulk and single-cell transcriptomic samples.

15:15-15:30

A unified model for Bayesian integration and interpretation of single-cell RNA-sequencing data

Room: Leacock 132

Format: Live from venue

Ariel Madrigal, McGill University, Canada
Tianyuan Lu, McGill University, Canada
Larisa Morales-Soto, McGill University, Canada
Adrien Osakwe, McGill University, Canada
Hamed S. Najafabadi, McGill University, Canada

Presentation Overview: Show

One of the current challenges in the analysis of single-cell data is the harmonized analysis of expression profiles across samples, where sample-to-sample variability exists and is driven by technical and biological effects. Lately, various computational methods have been developed with the aim of removing unwanted sources of technical variation. However, these methods have various limitations, including the inability to distinguish technical and biological sources of sample-to-sample variability, and low interpretability of the integrated low-dimensional space. We introduce Gene Expression Decomposition and Integration (GEDI), a model that unifies various concepts from normalization and imputation to integration and interpretation of single-cell transcriptomics data in a single framework. GEDI finds a common coordinate frame that defines a reference gene expression manifold and sample-specific transformations of this coordinate frame. The common coordinate frame can be expressed as a function of gene-level variables, enabling the projection of pathway and regulatory network activities onto the cellular state space. The coordinate transformation matrices, on the other hand, provide a compact and harmonized representation of differences in the gene expression manifolds across samples, enabling cluster-free differential gene expression analysis along a continuum of cell states, as well as machine learning-based prediction of sample characteristics. Comparison of GEDI to a panel of single-cell integration methods using different benchmark datasets and previously established metrics suggests that GEDI is consistently among the top performers in batch effect removal and cell type conservation, while it can uniquely deconvolve the effects of different sources of sample-to-sample variability. We also show GEDI's ability to learn condition-associated gene expression changes at single-cell resolution using a recent single-cell atlas of PBMCs profiled in healthy, mild, and severe COVID-19 cases. GEDI is able to reconstruct disease-associated cell state vector fields that are consistent with pseudo-bulk approaches, while offering improved reproducibility between different cohorts. By projecting the activity of multiple transcription factors (TFs) onto our reference manifold, we also identified various groups of TFs whose activity correlated with COVID-19-associated gene-expression changes in a cell-type-specific manner, including CEBPA, SP1, and AHR in monocytes. Finally, we demonstrate GEDI’s ability to generalize to different data-generating distributions, which in addition to the analysis of gene expression, allows the study of alternative splicing and mRNA stability landscapes. We showcase this capability using single-cell RNA-seq data of mouse neurogenesis, revealing cell type-specific cassette exon-inclusion events, mRNA stability changes that accompany neuronal differentiation, as well as RNA-binding proteins and microRNAs that drive these changes. Together, these analyses highlight GEDI as a unified framework for modeling sample-to-sample variability, pathway and network activity analysis, as well as analysis of both transcriptional and post-transcriptional programs of the cell.

16:00-16:15

Graphylo: A Deep Learning Approach for Predicting Transcription Factor Binding Sites from Whole-Genome Multiple Alignments

Room: Leacock 132

Format: Live from venue

Dongjoon Lim, McGill University, Canada
Mathieu Blanchette, McGill University, Canada

Presentation Overview: Show

Transcription factor binding site prediction is a fundamental aspect of understanding gene regulatory networks. Motif overrepresentation and machine learning approaches are commonly used methods to predict where transcription factors bind to DNA, but they often suffer from poor specificity. This can partially be alleviated by approaches making use of evolutionary information that can yield important clues about sequence function and has long been combined with other types of sequence-based analyses to improve the detection of functional sites, although existing approaches remain relatively crude.

This study combines information on genome sequences and evolutionary history from placental mammals to improve the prediction of transcription factor binding sites in the human genome. We introduce Graphylo, which integrates both Convolutional Neural Networks and Graph Convolutional Networks to accurately predict transcription factor binding sites. The former is ideal for identifying short sequence motifs that are essential for transcription factor binding, while the latter is well-suited for analyzing graph-structured data such as phylogenetic trees that depict the evolutionary relationships between DNA sequences. The model takes a set of orthologous and computationally reconstructed ancestral DNA sequences from various species, including a reference species of interest, such as humans, and a phylogenetic tree that represents their evolutionary history as input. By combining these inputs, Graphylo can offer a more comprehensive understanding of gene regulation and enable researchers to gain evolutionary insights into how these regulatory networks have evolved across different species without the need for direct input of conservation scores or evolutionary constraints, which is an improvement over previous models.

We show on a wide variety of data sets that Graphylo consistently outperforms both state-of-the-art single-species deep learning approaches as well as approaches where sequence analysis is complemented by inter-species conservation scores. The use of a species-based attention model enables evolutionary insights to be gained, while the integrated gradient approach provides nucleotide-level model interpretability. Overall, our results suggest that by combining convolutional neural networks with graph convolutional networks, Graphylo is able to take advantage of evidence of negative selection on TFBS binding to enhance the sequence signal observed in humans. Unlike approaches that assume functional sites remain at the same alignment positions, Graphylo only uses the alignments to extract orthologous/ancestral sequences, making it more robust to binding site turnover. Graphylo is a powerful tool for understanding gene regulation and evolution across different species.

16:15-16:30

A generalizable framework to comprehensively predict epigenome, chromatin organization, and transcriptome

Room: Leacock 132

Format: Live from venue

Zhenhao Zhang, University of Michigan, United States
Fan Feng, University of Michigan, United States
Yiyang Qiu, University of Michigan, United States
Jie Liu, University of Michigan, United States

Presentation Overview: Show

16:30-16:45

Adversarial attack identifies conserved features of enhancer chromatin architecture

Room: Leacock 132

Format: Live from venue

Jamil Gafur, University of Iowa, United States
Olivia Lang, Cornell University, United States
William Lai, Cornell University, United States

Presentation Overview: Show

The wide range of cellular phenotypes demonstrated by multi-cellular organisms is due in large part to the complex and synergistic interplay of regulatory complexes spread throughout the eukaryotic genome. These regulatory elements 'enhance' specific gene programs and have been shown to operate in diverse and distinct networks across cell types. Deep-learning approaches to enhancer prediction typically leverage information-dense DNA sequence with newer approaches additionally incorporating relevant epigenomic datasets (i.e., ATAC-seq, PRO-seq, ChIP-seq, etc.) to improve accuracy and precision. However, clonal expansion of DNA mutations in cancers limits the potential utility of these approaches biomedically as the DNA sequence used for training may no longer exist in the target material (i.e., biological overfitting).
We examined the feasibility of enhancer prediction using only epigenomic chromatin datasets. By training simultaneously across multiple cell types, we successfully generated a cell-type invariant enhancer prediction platform that utilized only the pattern of chromatin marks for inference. We demonstrated that chromatin datasets are sufficient to identify enhancers genome-wide relative to networks trained using DNA-sequence. Combined with reference-genome free alignment of epigenomic datasets, we believe this approach serves as a proof-of-concept for future biomedical applications.
We next investigated what features our epigenomic enhancer-prediction network had learned. However, deep-learning neural network approaches are considered 'black boxes' with regards to human interpretability of what the features the network uses for inference. This makes refining networks to avoid biological overfitting challenging as techniques to interpret what neural networks have learned often require a priori knowledge and/or are linked to specific network architectures.
In order to understand what our enhancer-prediction neural networks learned we applied adversarial attack by Particle Swarm Optimization (PSO). PSO is network-architecture independent, has dozens of algorithmic variants, and can be applied to any prediction engine to derive the features that drive inference. PSO was used to generate adversarial inputs that in turn were used to characterize chromatin architectures predictive of enhancers and conserved across distinct cell types. By inverting the loss function, we were also able to identify chromatin architectures that were anti-correlated to predicted enhancer function. PSO is highly computational efficient and can provide human-readable output that reflects what trained networks consider predictive. As human interpretation of neural networks is a prerequisite to trusting and therefore applying these networks, we believe adversarial PSO is a critical addition to the toolset of deep learning.

16:45-17:00

Neural-network based classification of regulatory elements active in human gliomas identifies DNA shape features as important for regulatory activity

Room: Leacock 132

Format: Live from venue

Magdalena Machnicka, University of Warsaw, Warsaw, Poland, Poland
Marlena Osipowicz, University of Warsaw, Warsaw, Poland, Poland
Julia Smolik, University of Warsaw, Warsaw, Poland, Poland
Bartek Wilczynski, Institute of Informatics, University of Warsaw, Poland

Presentation Overview: Show

Gene regulatory DNA sequences, and enhancers and promoters in particular, are very important for gene expression regulation in eukaryotes. However, even though cells of these organisms seem to have no problem in identification of these functional elements among millions of bases of DNA that has no regulatory function, our computational models have great difficulty in recognizing the functional regulatory elements from non-functional and discerning between enhancers and promoters showing activity in different conditions or cell-types proves to be even more difficult.

In the last decade, many approaches to gene regulatory element classification were proposed and tested. In the last years, we have been using Bayesian networks (Bonn et al. 2012), Random Forests (Herman-Iżycka et al. 2017), support vector machines (Podsiadło et al. 2013) to predict enhancer and promoter positions in human and model organism genomes. However, despite the fact that each of these models was quite effective at the respective dataset, it seems to be very difficult to translate the results obtained in one of the studied systems to other biological contexts. In recent years, the wave of methods based on artificial neural networks has shown great success in many areas including classification tasks originating from molecular biology. One such approach, Basset (Kelley et al. 2016) applied a particular type of convolutional neural network to predicting DNA regions of accessible (open) chromatin in different tissues.

Since we have recently published (Stepniak et al. 2021) an atlas of regulatory elements (promoters and enhancers) active in gliomas we were interested to see if we can modify the Basset model to suit our task of discerning between active and inactive enhancers and promoters in the context of glioma samples taken from multiple patients. This was an especially interesting case for study, as we did not only have the positions and activity of these elements measured (by means of ChIP-Seq of histone modifications), but we also had these patients genotyped, allowing us to ascertain the potential role of DNA variants on the activity of tested regulatory elements. After modifications of the model and creating several different training(?) datasets we can not only show that our convolutional neural network provides better classification accuracy than the classical methods (AUC>80% vs <70% for Random Forests), but we can also see that integration of patient mutations in the process of neural network training can further increase the method performance (AUC even above 90%). After careful study of the internal structure of the filters learned by the network we can also show connections between features used by our model to classify sequences and DNA sequence specificity of transcription factors as well as DNA shape parameters (as defined by Zhou et al. 2013). What came to us as a surprising outcome of this study is that many filters that are essential for the network performance are solely attributable to DNA shape rather than transcription factor binding.

In summary, we can present a novel, neural network based approach to regulatory element classification that shows performance superior to our earlier methods and allows for introspection that identifies novel features important for regulatory sequence activity.

17:00-17:15

Allo: an easily integrable multi-mapped read rescue strategy for repetitive region inclusive ChIP-seq analysis pipelines

Room: Leacock 132

Format: Live from venue

Alexis Morrissey, The Pennsylvania State University, United States
Shaun Mahony, The Pennsylvania State University, United States

Presentation Overview: Show

Several studies have demonstrated that transposable elements (TEs) and other repetitive regions can harbor gene regulatory elements such as transcription factor binding sites. Unfortunately, repetitive regions pose problems for short-read sequencing assays such as ChIP-seq. The same TE can exist in multiple genomic regions, creating what are known as multi-mapped reads. In most ChIP-seq analysis pipelines, reads that align to multiple genomic locations are discarded during preprocessing and thus regulatory signals occurring in repetitive regions have largely been overlooked. Here, we develop an approach to allocate multi-mapped ChIP-seq reads in an efficient, accurate, and user-friendly manner. Our method, Allo, combines the probabilistic mapping of ChIP-seq reads using nearby uniquely mapped read counts with a convolutional neural network that recognizes the read distribution features of potential ChIP-seq peaks. Allo not only provides increased accuracy in multi-mapping read assignment compared to previously published methods, it also allows for read level output in the form of a corrected alignment file. Therefore, the output of our method can be input into any peak-caller downstream and is easily added to existing pipelines with very few modifications. To show the utility and validity of our approach, we analyzed a CTCF ChIP-seq dataset using Allo. We used both motif analysis as well as Hi-C data at TAD boundaries to validate the thousands of new peaks found only using Allo. Additionally, we show the application of Allo in finding peaks within paralogous gene families using a collection of ENCODE datasets. The effect of multi-mapped reads on duplicated gene families has not been extensively studied before. We highlight the importance of including multi-mapped reads arising from paralogous gene families using a PARIS ChIP-seq dataset. Peaks found after the inclusion of Allo suggest a novel pattern of PARIS binding within the transcription start sites of the Wiskott-Aldrich Syndrome family of proteins.

17:15-17:30

Identifying the expression determinants of vertebrate snoRNAs based on a machine learning approach

Room: Leacock 132

Format: Live from venue

Étienne Fafard-Couture, Université de Sherbrooke, Canada
Pierre-Étienne Jacques, Université de Sherbrooke, Canada
Michelle S Scott, Université de Sherbrooke, Canada

Presentation Overview: Show

Small nucleolar RNAs (snoRNAs) are structured noncoding RNAs present in multiple copies within eukaryotic genomes. SnoRNAs guide chemical modifications on their target RNA and regulate processes like ribosome assembly and splicing. Most human snoRNAs are embedded within host gene introns, the remainder being independently expressed from intergenic regions. We recently characterized the abundance of snoRNAs and their host gene across several healthy human tissues and found that the level of most snoRNAs does not correlate with that of their host gene, with the observation that snoRNAs embedded within the same host gene often differ drastically in abundance [1]. Current knowledge on the mechanisms regulating snoRNA expression dates back to more than 20 years, where it was shown only for a small subset of snoRNAs that the formation of a terminal stem and the distance between the snoRNAs and their intronic branch point were critical features for their expression [2-3]. Considering that recent annotations comprise more than 1500 snoRNAs, it is unclear if these features are still relevant to most snoRNAs. To better understand the determinants of snoRNA expression, we trained several machine learning models to predict whether snoRNAs are expressed or not in human tissues based on more than 30 collected features related to snoRNAs and their genomic context. By interpreting the models’ predictions using SHAP values, we find that snoRNAs rely on conserved motifs, a stable global structure and terminal stem as well as on a transcribed locus to be expressed. We observe that these features explain well the varying abundance of snoRNAs embedded within the same host gene. By predicting the expression status of snoRNAs across several vertebrates, we notice that only 1/3 of all annotated snoRNAs are expressed per genome, as in human. Our results suggest that ancestral snoRNAs disseminated within vertebrate genomes, sometimes leading to the development of new functions and a probable gain in fitness, thereby conserving features favorable to the expression of these few snoRNAs, the large remainder often degenerating into snoRNA pseudogenes. This work is under revision in Genome Research.
[1] Fafard-Couture et al. 2021. Annotation of snoRNA abundance across human tissues reveals complex snoRNA-host gene relationships. Genome Biology.
[2] Darzacq et al. 2000. Processing of Intron-Encoded Box C/D Small Nucleolar RNAs Lacking a 5′,3′-Terminal Stem Structure. Molecular and Cellular Biology.
[3] Hirose et al. 2001. Position within the host intron is critical for efficient processing of box C/D snoRNAs in mammalian cells. PNAS.

Thursday, May 18^th

9:00-10:00

Invited Presentation: Modelling covariate and SNP influences on DNA methylation data

Room: Leacock 132

Format: Live from venue

Celia Greenwood, Lady Davis Institute for Medical Research, Canada
Kaiqiong Zhao, CANSSI Postdoctoral fellowship, Canada
Karim Oualkacha, UQAM, Canada
Lajmi Lakhal-Chaieb, U. Laval, Canada
Aurélie Labbe, HEC, Canada
Kathleen Klein, Lady Davis Institute for Medical Research, Canada
Marie Hudson, Lady Davis Institute for Medical Research, Canada
Archer Yang, McGill University, Canada
Ines Colmegna, McGill University, Canada
Tieyuan Zhang, McGill University, Canada
Denise Daley, UBC, Canada
Sasha Bernatsky, McGill University, Canada

Presentation Overview: Show

10:30-11:00

CANCELLED Measuring the relative contribution to predictive power of modern nucleotide substitution modeling approaches

Room: Leacock 132

Format: Live from venue

Thomas Bujaki, Carleton University, Canada
Katharine Van Looyen, Carleton University, Canada
Nicolas Rodrigue, Carleton University, Canada

Presentation Overview: Show

10:30-10:45

Exploring transcript conservation and evolution with TranscriptDB

Room: Leacock 232

Format: Live from venue

Wend Yam Donald Davy Ouedraogo, Université de Sherbrooke, Canada
Abigail Djossou, Université de Sherbrooke, Canada
Aida Ouangraoua, Université de Sherbrooke, Canada

Presentation Overview: Show

The increasing amount of available genomic sequences calls for effective tools for annotating biological sequences. Inferring the function of a gene from its orthologs has been of significant use in comparative genomics. The interest for the orthology between genes has led to the design of several databases centered on genes. Alternative splicing, that contributes widely to the diversity of transcriptomes and proteomes in eukaryotes makes the transcript a refined level of functional homology relationships, thus calling for orthology inference methods and databases at the level of transcripts.

In this work, we present a transcript-centric database and a new method based on splicing structure to compute clusters of conserved transcripts for the reconstruction of transcript and gene phylogenies. TranscriptDB contains data that were obtained from the Ensembl resource. The gene annotation associated to each transcript is also provided, including gene homology information that is used to infer transcript homology relations. The collected and computed data are loaded into a custom PosgreSQL relational database. The database provides relevant clusters of conserved transcripts, and transcript phylogenies computed thanks to a new transcript similarity measure that evaluates the quantity of homologous nucleotides and exonic regions shared by transcripts. TranscriptDB provides a user-friendly web browser interface available at https://transcriptdb.cobius.usherbrooke.ca.

TranscriptDB CLUSTERING ALGORITHM
The algorithm is a graph-clustering algorithm designed to identify conserved transcripts between homologous genes. The clustering method relies on an improved reciprocal best hits approach to identify pairs of transcripts from homologous genes that share similar splicing structure.

TRANSCRIPT PHYLOGENIES RECONSTRUCTION
The algorithm consists in first constructing transcript subtrees in congruence with gene trees for each transcript cluster resulting from transcriptDB clustering. Then, a transcript super tree is constructed by combining all subtrees using the transcript similarity measure. The resulting super tree is used to infer homology relationships between transcripts.

TranscriptDB TOOLS AND APPLICATION
The database provides quick access to transcripts information via its web interface that includes a multi-scale graphical view showing conserved transcripts within their transcript trees, particularly useful for examining the evolution of conserved transcripts, the distinct types of homology relations between transcripts and putative isoorthologous transcripts. It also provides an interactive view of the gene model and gene structure. TranscriptDB may be useful for the functional annotation of proteins between genomes and to identify the type of relationship between transcripts in multiple species. Future versions of TranscriptDB will include data from the newest versions of Ensembl and the 3D visualization of proteins. Finally, the interface allows to retrieve a set of specific genes from given species and all information about the exon-intron structure of their transcripts can be downloaded in FASTA file or CSV format.

10:45-11:00

MolEvolvR: A web application for protein analysis using molecular evolution and phylogeny

Room: Leacock 232

Format: Live from venue

Joseph Burke, Michigan State University, United States
Samuel Chen, Michigan State University, United States
Lo Sosinski, Michigan State University, United States
Jacob Krol, University of Colorado Anschutz, United States
Faisal Alquaddoomi, University Colorado Anschutz, United States
Vincent Rubinetti, University of Colorado Anschutz, United States
Cristina Zimpel, Michigan State University, United States
Shaddai Amolitos, University of Colorado Anschutz, United States
Kellen Reason, Michigan State University, United States
John Johnston, Michigan State University, United States
Janani Ravi, University of Colorado Anschutz, United States

Presentation Overview: Show

Background: Characterizing protein function is key to understanding molecular interactions in biological systems. Researchers often parameterize proteins by their sequence, structure, and function. Over the past two decades, the scope of protein study has expanded by using computational tools like BLAST (https://blast.ncbi.nlm.nih.gov/Blast.cgi) and InterProScan (https://www.ebi.ac.uk/interpro/about/interproscan/), that allow for the comparison of homologs (i.e., similar proteins), and detection of domains (i.e., structural/functional subunits of proteins), respectively. These tools have expanded protein features to include comparative homology, domains, and domain architectures. The study of domains in protein function is analogous to residues versus motifs. Motifs are strings of individual residues, smaller than domains, with common functionalities based on their biochemical makeup. Similarly, domains zoom the magnification out further than motifs to characterize protein structure and function on a slightly larger scale. A popular domain-scanning application is InterProScan which utilizes various computational methods and databases to annotate domains on query protein sequences. With domain detection tools available, there’s an open opportunity for comparative analysis of domains across proteins, i.e., capturing evolutionary context by comparing the conservation of domain architectures across protein lineages. Particularly in rapidly evolving organisms like bacteria, there is an exceptional opportunity to understand key domains that span across species, genera, and higher taxonomy classes and phyla. Furthermore, the understudied scope of domain architectures can be paired with microbial phenotypes, such as antibiotic resistance, to predict which species can develop resistance.

Approach: We developed, MolEvolvR [DOI: https://doi.org/10.1101/2022.02.18.461833; web-app: http://jravilab.org/molevolvr/], a free user-friendly web application that can analyze hundreds of proteins in parallel and focuses on domains in an evolutionary context. Input protein(s) undergo a homology search with subsequent domain detection and secondary structure/localization predictions. After this initial process, the final analysis will produce interactive queryable and downloadable publication-ready tables and figures that compare domains on a phylogenetic scale for all query protein(s). Results can be filtered to include/exclude proteins by their taxa, domain architecture, and their homology metadata, when relevant. A particularly notable analysis is a generated network of domains (nodes) connected together to generate the space of architectures of all homologous proteins. Node size and edge weights are proportional to the frequency of occurrence. Additional analyses include phylogenetic trees, MSAs, and upset plots to show the distribution of domains and their architectures across the tree of life. Amino acid sequences (as FASTA, NCBI/Uniprot accession numbers) are the typical input data type. However, the app also supports BLAST and InterProScan results, and MSAs as input.

The web-app is deployed as an R Shiny server and can be found at http://jravilab.org/molevolvr/. The backend uses a combination of R and shell scripts, while the front-end is written in R shiny. We have tested the web-app on Mac, Windows, and Linux operating systems with Chrome, Brave, Firefox, and Safari browsers.

Results: In addition to MolEvolvR, a user-friendly web-app for everyone to use for protein characterization using molecular evolution and phylogeny, the methods underlying MolEvolvR have been used to study the PSP stress response system by comparing PSP-linked domains and genomic contexts across the tree of life [Ravi et al., 2020 bioRxiv; DOI: https://doi.org/10.1101/2020.09.24.301986]. A web-app instance (https://jravilab.shinyapps.io/psp-evolution) was created that utilizes the same analysis methods as MolEvolvR. Currently, our group utilizes these methods to study several microbial phenotypes, such as antimicrobial resistance, nutrient acquisition, and host-specificity, using the lens of evolution.

11:00-11:15

Phylogenetic Inference with GFlowNets

Room: Leacock 132

Format: Live from venue

Ming Yang Zhou, McGill University, Canada
Zichao Yan, Mila, Canada
Elliot Layne, McGill University, Canada
Nikolay Malkin, Mila, Canada
Dinghuai Zhang, Mila, Canada
Moksh Jain, Mila, Canada
Mathieu Blanchette, McGill University, Canada
Yoshua Bengio, Mila, Canada

Presentation Overview: Show

Phylogenetic inference is amongst the most extensively-studied tasks in the field of computational biology. In this work, we consider specifically the NP-hard large parsimony problem, which underpins applications not only in the study of species evolution, but also tumor progression. While many heuristics exist that are capable of estimating the most parsimonious tree relating a set of taxa, the task of accurately sampling from the posterior over all possible tree topologies at scale is highly valuable, and currently under-studied.

Methodology:
We investigate the use of Generative Flow Networks (GFlowNets) [Bengio et al., 2021] for phylogenetic inference. GFlowNets learn to construct compositional object from a discrete space X by taking a series of actions sampled from a stochastic policy network, with the objective that the likelihood of sampling each x ∈ X is proportional to some pre-defined reward function.

We propose to train a GFlowNet policy to sample from the set of trees proportionately to a Boltzmann distributed reward function with an energy term corresponding to the tree’s parsimony score. This is achieved by repeatedly joining sub-groups of taxa. During the course of training, we employ a temperature annealing schedule to recover trees with lower mutation counts.

Results:
We experimentally validate our method on two types of datasets. First, as a proof of concept, we test on a small-scale dataset of 10 transfer RNAs — a case where we can exhaustively enumerate all possible trees and compare the model-learnt sampling probability to the ground truth. Second, we experiment on the DS1-DS7 benchmarking datasets [Garey et al., 1996, Hedges et al., 1990, Lakner et al., 2008, Rossman et al., 2001, Yang and Yoder, 2003, Henk et al., 2003, Zhang and Blackwell, 2001], featuring up to 64 species (i.e. 10^103 possible rooted phylogenetic trees) and up to 2500 sites. While it is intractable to examine the full tree space due to large number of species, we compare the best phylogenetic trees sampled from our GFlowNet with solution produced by the state of the art method, PAUP* (version 4.0)[Swofford et al., 2003]. We show that on the small-scale dataset, our method correctly learns the parsimony score defined distribution over phylogenetic trees. On the larger scale datasets, by learning a distribution that highly favors low mutation trees, our GFlowNet method can efficiently discover the optimal solutions identified by PAUP*.

Overall, our GFlowNet-based tree sampling method efficiently and accurately samples from parsimony score defined posterior distributions over trees. This will prove valuable in contexts where phylogenetic inference needs to capture uncertainty on inferred trees.

Increased allele-specific expression in blood can be beneficial for immune response in healthy agers

Room: Leacock 232

Format: Live from venue

Michelle Harwood, Ontario Institute for Cancer Research, Canada
Elyssa Bader, Ontario Institute for Cancer Research, Canada
Mawusse Agbessi, Ontario Institute for Cancer Research, Canada
Kimberly Skead, Ontario Institute for Cancer Research, Canada
Vanessa Bruat, Ontario Institute for Cancer Research, Canada
Nicholas Cheng, Ontario Institute for Cancer Research, Canada
Marie-Julie Fave, Ontario Institute for Cancer Research, Canada
Philip Awadalla, Ontario Institute for Cancer Research, Canada

Presentation Overview: Show

The consequences of mutations on individual health are influenced by the regulation of gene expression. Allele-specific expression (ASE) is the preferential expression of one of two alleles, and is modulated following changes in genetic variation or environmental exposures. ASE can be caused by genetic regulatory variation, post-transcriptional modifications, or epigenetic alterations, and has the potential to alter phenotype and disease risk. Although ASE is a pervasive phenomenon affecting genetic regulation across tissues and phenotypes, it remains unknown how ASE contributes to variation in aging processes. Here, we utilize bulk RNA sequencing and single-cell RNA sequencing from blood of >1,000 selected individuals from the Canadian Partnership for Tomorrow’s Health, to evaluate ASE changes in blood cells during aging. We show that the number of SNPs with ASE increases as individuals age, particularly in cells involved in adaptive immunity (CD4+ T cells, CD8+ T cells, and B cells). We demonstrate that individuals who are healthy based on a calculated risk score from blood traits show larger increases in ASE with age compared to unhealthy individuals. By stratifying ASE sites into common and rare events, we observe that aged individuals have a higher proportion of common ASE events in genes involved in immune response. We further show that aged individuals with low health risk have a larger proportion of ASE in immune genes compared to high risk individuals. Genes involved in immunity are under strong selective pressures, and ASE variability may be beneficial for adaptability and response against pathogens. We also demonstrate that increases in ASE may decrease risk of pre-treated cardiometabolic traits, including hypertension and type 2 diabetes, however, opposite relationships are observed for cancer cases and pre-cancer samples. We further show that individuals administering anti-hypertensive and statin medications have larger overall proportions of ASE, demonstrating an environmental impact medications may have on ASE, which may also contribute to variability of ASE observed during aging. Our results suggest that increases in ASE in immune processes may be beneficial during aging by reducing risk of mortality and cardiometabolic diseases.

11:15-11:30

Prior Density Learning in Variational Phylogenetic Parameters Inference

Room: Leacock 132

Format: Live from venue

Amine Remita, Université du Québec à Montréal, Canada
Golrokh Kiani, Université de Québec à Montréal, Canada
Abdoulaye Baniré Diallo, Université du Québec à Montréal, Canada

Presentation Overview: Show

The Bayesian phylogenetic community is exploring faster and more scalable alternatives to the Markov chain Monte Carlo (MCMC) approach to approximate the high dimensional Bayesian posterior. The search for other substitutes is motivated by the falling computational costs, increasing challenges in large-scale data analysis, advances in inference algorithms and implementation of efficient computational frameworks. Some alternatives are adaptive MCMC, Hamiltonian Monte Carlo, sequential Monte Carlo and variational inference (VI). Until recently, few studies were interested in applying classical variational approaches in probabilistic phylogenetic models. However, VI started to gain some attraction from the phylogenetic community taking advantage of advances that made it more scalable, generic and accurate, such as stochastic and black box VI algorithms, latent-variable reparametrization, and probabilistic programming. These advancements allowed designing of powerful and fast variational-based algorithms to infer complex phylogenetic models and analyze large-scale phylodynamic data.

Bayesian methods incorporate the practitioner's prior knowledge about the likelihood parameters through the prior distributions. Defining an appropriate and realistic prior is difficult, especially in small data regimes, similar sequences or parameters with complex correlations. Notably, the variational phylogenetic methods assign fixed prior distributions with default hyperparameters to the likelihood parameters, a similar practice in MCMC methods. However, such a choice could bias the posterior approximation and induce high posterior probabilities in cases where the data are weak, or the actual parameter values do not fall within the range specified by the priors.

Here, we show that variational phylogenetic inference can also suffer from misspecified priors on branch lengths and less severely on sequence evolutionary parameters. Further, we propose an approach and an implementation framework (nnTreeVB) to relax the rigidity of the prior densities by learning their parameters using a gradient-based method and a neural network-based parameterization. We applied this approach to estimate branch lengths and evolutionary parameters under several Markov chain substitution models. The results of performed simulations show that the approach is powerful in estimating branch lengths and evolutionary model parameters. They also show that a flexible prior model provides better results than a predefined prior model. Finally, the results highlight that using neural networks could improve the initialization of the optimization of the prior density parameters.

Reference: Remita A.M., Kiani G. and Diallo A.B. (2023) Prior Density Learning in Variational Bayesian Phylogenetic Parameters Inference. https://arxiv.org/abs/2302.02522.

RaMP-DB 2.0 & MetaboSPAN: Improving functional interpretation of metabolomic data through comprehensive functional annotation and network approaches

Room: Leacock 232

Format: Live from venue

Andrew Patt, National Center for Advancing Translational Sciences, United States
John Braisted, National Center for Advancing Translational Sciences, United States
Kevin Coombes, The Ohio State University, United States
Tara Eicher, National Center for Advancing Translational Sciences, United States
Ewy Mathé, National Center for Advancing Translational Sciences, United States

Presentation Overview: Show

Enrichment analysis can lend mechanistic insight, suggest candidate biomarkers/therapeutic targets of disease, and allow integration of findings in other ‘omes like transcriptomics. To this end, we have built two complementary tools, RaMP-DB, and MetaboSPAN. RaMP-DB is a newly renovated integrated knowledge base, API, R package and online interface for generating biological and chemical insight into metabolomic, proteomic, and transcriptomic data. The new RaMP-DB version (2.0) features several major improvements over its predecessor, including chemical structure and class annotations for metabolites, improved pathway annotation coverage for lipids, new pathway enrichment analysis visuals, and enrichment analyses supporting the inclusion of custom backgrounds. On the other hand, MetaboSPAN is a specialized metabolomic pathway analysis method that leverages RaMP-DB and aims to compensate for inconsistent coverage of the metabolome in metabolomics experiments.
RaMP-DB is implemented as a MySQL database, an R package, an API, and a user-friendly web application. Python scripts acquire data for RaMP-DB from our primary sources (HMDB/KEGG, Wikipathways, Reactome, ChEBI, LIPID MAPS, and Rhea), and parse annotations associated with pathways, reactions, ontologies, chemical structures/classes. A semi-automated entity curation system flags faulty mappings between databases for subsequent manual curation. The contents of RaMP-DB 2.0 are regularly updated, with the current version containing 256,086 distinct metabolites, 15,827 genes/enzymes, 53,831 distinct pathways, 412,775 mappings between metabolites and pathways, 401,303 mappings between genes/enzymes and pathways, and 60,476 biochemical reactions from the Rhea database. Chemical properties such as InCHIKeys and chemical class (ClassyFire) are available for 256,592 metabolites. Further, the most recent ontologies from HMDB 5.0, including relevant portions of the new chemical functional ontology (CFO), such as biofluid/tissue of origin, are now included. Lastly, RaMP-DB's new web interface now supports 8 different single and batch queries on analytes, pathways, chemical and annotations, and enzyme/metabolite reactions, as well as pathway and chemical enrichment analysis.
MetaboSPAN was designed to account for incomplete coverage of the metabolome within individual metabolomic experiments and infer additional activity to aid in hypothesis generation. MetaboSPAN builds similarity networks based on annotations within RaMP-DB 2.0, where nodes are metabolites and edges encode shared annotatons between adjacent metabolites. The algorithm then uses network topological analysis to identify clusters of metabolites related to a list of metabolites of interest (e.g. altered in a disease), which undergo pathway enrichment testing.
To validate MetaboSPAN's performance, we designed several simulation experiments comparing the performance of MetaboSPAN against existing pathway analysis strategies (Globaltest, Fisher’s exact test, NetGSA, and FELLA). Notably, identical pathway libraries were used for each approach so as to not bias results. Our results show that MetaboSPAN yields higher sensitivity for altered pathway detection without inflating false positive findings. We further evaluated Metabospan and other methods on two independent datasets generated from the same cohort on different platforms, wherein metabolite coverage was almost completely different (sharing just one metabolite in common), allowing us to compare overlap in significant pathway findings. We found that MetaboSPAN improved the concordance of pathway results obtained from each dataset as compared to several of the baseline methods we tested against.
Both RaMP-DB and MetaboSPAN are open-source, publicly available resources. The online interface for RaMP-DB can be found at https://rampdb.nih.gov/, whereas the R package for MetaboSPAN can be downloaded at https://github.com/andyptt21/metabospan. Overall, RaMP-DB is a robust, comprehensive and well-maintained resource for functional annotations for metabolites and metabolic transcripts, and MetaboSPAN is a novel functional enrichment strategy that leverages these annotations to compensate for difficulties in metabolite detection and identification.

11:30-11:45

Colorful Orthology Clustering in Bounded-Degree Similarity Graphs

Room: Leacock 132

Format: Live from venue

Alitzel López Sánchez, Université de Sherbrooke, Canada
Manuel Lafond, Université de Sherbrooke, Canada

Presentation Overview: Show

ARAX: a graph-based modular reasoning tool for translational biomedicine

Room: Leacock 232

Format: Live from venue

Amy Glen, Oregon State University, United States
Chunyu Ma, Penn State University, United States
Luis Mendoza, Institute for Systems Biology, United States
Finn Womack, Oregon State University, United States
E. C. Wood, Stanford University, United States
Meghamala Sinha, Oregon State University, United States
Liliana Acevedo, Oregon State University, United States
Lindsey Kvarfordt, Oregon State University, United States
Ross Peene, Oregon State University, United States
Shaopeng Liu, Penn State University, United States
Andrew Hoffman, Leiden University, Netherlands
Jared Roach, Institute for Systems Biology, United States
Eric Deutsch, Institute for Systems Biology, United States
Stephen Ramsey, Oregon State University, United States
David Koslicki, Penn State University, United States

Presentation Overview: Show

Databases of biomedical knowledge are rapidly proliferating and growing, with recent advances (such as the RTX-KG2 knowledge-base that we have recently developed; (Wood et al. 2022)) increasingly focusing on integration of knowledge under a standardized schema and semantic layer (i.e., controlled vocabularies for types of concepts and types of relationships, for example, the Biolink standard (Unni et al. 2022)). The rise of standardized knowledge-bases sets the stage for the development of computational systems that can systematically discover novel connections between drugs and diseases (i.e., large-scale computational drug repurposing) or to answer other kinds of translational questions (e.g., ""What anticonvulsants are likely to have drug-drug interactions with cannabinoids?"" (Vázquez et al. 2020) or ""What drugs would downregulate expression of RHOBTB2 in the central nervous system?"" (Foksinska et al. 2022)). To be able to build such a system, improved methods and representation languages for knowledge-graph-based computational reasoning are needed. Previous efforts contributed myriad tools and approaches, but progress for biomedical reasoning systems has been hindered by (1) the lack of an expressive analysis workflow language for translational reasoning and (2) the lack of an associated reasoning engine that federates semantically integrated knowledge-bases.

As a part of the NCATS Translator project (Biomedical Data Translator Consortium 2019), we have developed ARAX (Glen et al. 2023), which is a new computational reasoning system for translational biomedicine that combines (1) an innovative workflow language (ARAXi), (2) a comprehensive semantically-unified biomedical knowledge graph (RTX-KG2), and (3) a versatile and novel method for scoring search results. Users or application-builders can query ARAX via a web browser interface or a web application programming interface. ARAX enables users to encode translational biomedical questions and to integrate knowledge across sources to answer the user’s query and facilitate exploration of results. To illustrate ARAX’s application and utility in specific disease contexts, we will present and discuss several use-case examples.

The source code and technical documentation for building the ARAX server-side software and its built-in knowledge database are freely available online (https://github.com/RTXteam/RTX). We provide a hosted ARAX service with a web browser interface at arax.rtx.ai and a web application programming interface (API) endpoint at arax.rtx.ai/api/arax/v1.3/ui/.

References:
Biomedical Data Translator Consortium. 2019. “Toward A Universal Biomedical Data Translator.” Clinical and Translational Science 12 (2): 86–90.
Foksinska, Aleksandra, Camerron M. Crowder, Andrew B. Crouse, Jeff Henrikson, William E. Byrd, Gregory Rosenblatt, Michael J. Patton, et al. 2022. “The Precision Medicine Process for Treating Rare Disease Using the Artificial Intelligence Tool miniKanren.” Frontiers in Artificial Intelligence 5 (September): 910216.
Glen, Amy K., Chunyu Ma, Luis Mendoza, Finn Womack, E. C. Wood, Meghamala Sinha, Liliana Acevedo, et al. 2023. “ARAX: A Graph-Based Modular Reasoning Tool for Translational Biomedicine.” bioRxiv. https://doi.org/10.1101/2022.08.12.503810.
Unni, Deepak R., Sierra A. T. Moxon, Michael Bada, Matthew Brush, Richard Bruskiewich, J. Harry Caufield, Paul A. Clemons, et al. 2022. “Biolink Model: A Universal Schema for Knowledge Graphs in Clinical, Biomedical, and Translational Science.” Clinical and Translational Science 15 (8): 1848–55.
Vázquez, Marta, Natalia Guevara, Cecilia Maldonado, Paulo Cáceres Guido, and Paula Schaiquevich. 2020. “Potential Pharmacokinetic Drug-Drug Interactions between Cannabinoids and Drugs Used for Chronic Pain.” BioMed Research International 2020 (August): 3902740.
Wood, E. C., Amy K. Glen, Lindsey G. Kvarfordt, Finn Womack, Liliana Acevedo, Timothy S. Yoon, Chunyu Ma, et al. 2022. “RTX-KG2: A System for Building a Semantically Standardized Knowledge Graph for Translational Biomedicine.” BMC Bioinformatics 23 (400). https://doi.org/10.1186/s12859‐022‐04932‐3.

11:45-12:00

Adapting language models to explore the multidomain protein universe

Room: Leacock 132

Format: Live from venue

Xiaoyue Cui, Carnegie Mellon University, United States
Maureen Stolzer, Carnegie Mellon University, United States
Dannie Durand, Carnegie Mellon University, United States

Presentation Overview: Show

Bridging the modeling gap: Accelerating Complex Disease Drug Discovery using Integrative Quantitative Pathway Analysis between Human Subjects and Cellular Models

Room: Leacock 232

Format: Live from venue

Pourya Naderi Yeganeh, Beth Israel Deaconess Medical Center/ Harvard Medical School, United States
Sang Su Kwak, Massachusetts General Hospital/ Harvard Medical School, United States
Mehdi Jorfi, Massachusetts General Hospital/ Harvard Medical School, United States
Katjuša Koler, University of Sheffield, United Kingdom
Luisa Quinti, Massachusetts General Hospital/ Harvard Medical School, United States
Djuna von Maydell, Massachusetts Institute of Technology, United States
Younjung Choi, Massachusetts General Hospital/ Harvard Medical School, United States
Joseph Park, Massachusetts General Hospital/ Harvard Medical School, United States
Murat Cetinbas, Massachusetts General Hospital, United States
Ruslan Sadreyev, Massachusetts General Hospital, United States
Rudolph Tanzi, Massachusetts General Hospital/ Harvard Medical School, United States
Winston Hide, Beth Israel Deaconess Medical Center/ Harvard Medical School, United States
Doo Yeon Kim, Massachusetts General Hospital/ Harvard Medical School, United States

Presentation Overview: Show

Complex diseases are highly challenging to combat partly due to the interplay of molecular cascades involved in disease pathogenesis. Cellular models of disease offer great potential for exploring biological mechanisms and drug target testing but there is currently no way to determine how well a modelled disease mechanism matches actual human disease. Several clinical trials for complex diseases have failed despite successful preclinical validations in cellular and animal models. Cellular models are built to recapitulate high-level phenotypes and disease pathology. But there is currently no approach to systematically assess how well the molecular profiles of disease pathogenesis are recapitulated in models. Comparing human and model transcriptomes is attractive but integrative study of gene expression is typically confounded by cross-platform and species-specific effects. We have developed a systems approach that better integrates transcriptomes from cell models and primary human tissues.

To determine how well a modelled disease mechanism matches the actual human disease, we have developed integrated quantitative pathway analysis (iQPA); that both captures and interrogates the degree to which disease functions constructed in models match those found in common across hundreds of diseased human brains. Using annotated pathway databases and a non-parametric approach, iQPA transforms gene expression into a series of quantifiable pathway activities. These pathway activities are analyzed using linear models to define functional dysregulation. In turn, iQPA leverages dysregulation events to identify and assess consistency of functional recapitulation between human and model.

We demonstrate the utility of iQPA applied to Alzheimer's disease (AD). Brain transcriptomic datasets sampled from different brain regions of three independent cohorts, as well as multiple cell models of AD, were integrated to determine high-fidelity therapeutic target pathways. iQPA found a high level of correlation (r = 0.84) of pathway dysregulation between distinct brain regions, whereas gene-based analysis uncovered a significantly lower correlation (r = 0.51). It unbiasedly determined which cellular models most closely recapitulate human dysregulation events. iQPA identified 83 commonly dysregulated core pathways with consistent dysregulation across human brains and the most relevant cell model. The p38 MAPK pathway is the top core pathway shared between AD brains and the relevant AD cellular models. We explored its therapeutic potential we applied a clinical p38 MAPK inhibitor which dramatically ameliorated Aβ-induced tau pathology and neuronal death in 3D-differentiated human neurons. iQPA accelerates AD drug discovery by systematically identifying dysregulated core pathway activities to provide robust, validated targets that attenuate AD pathology.