Presentation Overview: Show
Precision agriculture is becoming the main approach to meeting the food needs of the world's growing population. Sustainable agriculture is key to addressing these challenges and unlocking market potential. In dairy farming specifically, cattle well-being and longevity are crucial to a sustainable approach. In this presentation, I will illustrate how bioinformatics methods that exploit the power of artificial intelligence can integrate heterogeneous data (omics and non-omics) from cow milk, blood, and movement to generate predictive indicators of welfare, longevity, and profitability at different stages of a cow's life. These indicators are derived using ontologies, knowledge graphs, data mining, genomics, metabolomics, computer vision, and spatio-temporal deep learning techniques.
We describe the problem of computing local feature attributions for dimensionality reduction methods. We apply one such method that is well established in supervised classification -- using the gradients of target outputs with respect to the inputs -- to the popular dimensionality reduction technique t-SNE, widely used in analyses of biological data. We provide an efficient implementation of the gradient computation for this dimensionality reduction technique. We show that our explanations identify significant features using a novel validation methodology on synthetic datasets and the popular MNIST benchmark dataset. We then demonstrate the practical utility of our algorithm by showing that it can produce explanations that agree with domain knowledge on a SARS-CoV-2 sequence dataset. Throughout, we provide a road map so that similar explanation methods can be applied to other dimensionality reduction techniques to rigorously analyze biological datasets.
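As an illustration of the attribution idea described above, the sketch below approximates, by finite differences, the Jacobian of one sample's embedding coordinates with respect to its input features. The paper computes these gradients analytically and efficiently for t-SNE; this generic numpy version, with a toy deterministic `embed` stand-in, is only a conceptual sketch.

```python
import numpy as np

def local_attribution(embed, X, i, eps=1e-4):
    """Finite-difference Jacobian of sample i's 2-D embedding with respect
    to its input features.

    embed : callable mapping an (n, d) data matrix to an (n, 2) embedding
            (assumed deterministic, e.g. t-SNE run with a fixed seed).
    Returns a (2, d) matrix; column j is the sensitivity of the embedding
    coordinates to feature j -- a local attribution score.
    """
    d = X.shape[1]
    base = embed(X)[i]
    J = np.zeros((2, d))
    for j in range(d):
        Xp = X.copy()
        Xp[i, j] += eps                      # perturb one feature of one sample
        J[:, j] = (embed(Xp)[i] - base) / eps
    return J

# Toy deterministic "embedding": project onto the first two features.
embed = lambda X: X[:, :2]
X = np.random.default_rng(0).normal(size=(5, 4))
J = local_attribution(embed, X, i=0)
# Per-feature importance: L2 norm over the two output coordinates.
importance = np.linalg.norm(J, axis=0)
```

For the linear toy embedding, only the first two features receive non-zero attribution, as expected.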
Lack of trust in artificial intelligence (AI) models in medicine remains the key obstacle to their use in clinical decision support systems (CDSS). Although AI models already perform excellently in medicine, their black-box nature means that it is often impossible to understand why a particular decision was made. This is especially true for very complex models such as graph neural networks (GNNs), which have arisen in recent years to tackle the problem of modelling biological networks such as protein-protein interaction (PPI) graphs. In the field of explainable AI (XAI), many algorithms have been developed to "explain" to a human expert which input features influenced a specific prediction. However, in the clinical domain, it is essential that these explanations lead to some degree of causal understanding by a clinician in the context of a specific application.
Therefore, we developed the CLARUS platform, which aims to promote human understanding of GNN predictions by allowing the domain expert to validate and improve the GNN decision-making process. CLARUS visualizes the biological networks used to train and test the GNN, where nodes and edges correspond, for instance, to gene products and their interactions. Relevance values of genes and interactions, computed by XAI methods such as GNNExplainer, are highlighted in the visualized graph by color intensity and line thickness. This enables domain experts to gain deeper insight into which parts of the biological network were most influential in the GNN decision-making process. More importantly, the expert can interactively manipulate the patient PPI network based on their understanding and initiate retraining or re-prediction. This interactivity allows them to pose manual counterfactual questions and analyse the resulting effects on the GNN decision.
We will present the first interactive XAI platform prototype, CLARUS, which allows not only the evaluation of specific human counterfactual questions based on user-defined alterations of patient PPI networks and a re-prediction of classes, but also retraining of the entire GNN after changing the underlying graph structures. The platform is currently hosted by the GWDG at https://rshiny.gwdg.de/apps/clarus/ and the preprint of our paper is available on bioRxiv [1].
[1] https://doi.org/10.1101/2022.11.21.517358
Model organisms like Homo sapiens or Mus musculus enjoy the privilege of having their protein-protein interaction (PPI) networks largely characterised through high-confidence experimental evidence. While the networks of these organisms are mature and well-studied, it would take far too much effort to perform the same experimental validation on more obscure, lesser-studied species.
In silico methods are ideal for bridging the gap between well- and lesser-studied organisms, as they typically require fewer resources and less time than their in vitro and in vivo counterparts. Machine learning (ML) models that infer PPIs have been long proposed for this purpose. Unfortunately, supervised ML methods face the challenge that the lesser-studied species which would most benefit from having their PPIs inferred lack sufficient data to train accurate models.
ML methods which exhibit strong out-of-distribution (OOD) performance can overcome the small-dataset challenges by training on PPI networks of organisms with many edges, and inferring the edges of the incomplete, lesser-studied network. Achieving strong OOD performance, however, is a very difficult task for computational PPI prediction methods as it requires that they sufficiently generalise to the overall problem space. In fact, existing PPI methods often have a demonstrably difficult time making inferences to PPIs of proteins that (1) are outside of the training set and (2) are of a different species.
We have developed RAPPPID (PMID: 35771595), a method for Regularised Automatic Prediction of Protein-Protein Interactions using Deep Learning, that makes accurate PPI predictions on OOD data samples from evolutionarily distant species. RAPPPID takes as its only input pairs of amino acid (AA) sequences. These AA sequences are encoded using a deep twin AWD-LSTM neural network which generates latent embeddings for both sequences. These embeddings are subsequently inputted into a multi-layer perceptron (MLP) classification network. RAPPPID was trained on high-confidence edges (>95%) from the STRING dataset.
RAPPPID outperforms leading PPI prediction methods, including D-SCRIPT and SPRINT, when tested and evaluated on datasets that carefully control for data leakage. RAPPPID accurately predicts the interaction between the therapeutic antibodies Trastuzumab and Pertuzumab and their target HER-2. RAPPPID's performance does not degrade when trained and tested on the other species evaluated. Further, RAPPPID models trained on human data accurately predict edges from other species, often achieving performance comparable to models trained on those very species. Performance on other species improves even further when transfer learning is used to fine-tune a human-trained RAPPPID model on those species.
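A minimal sketch of the twin-network design described above: both amino acid sequences pass through the same encoder, and the two latent vectors feed a small classification head. The random parameters, the mean-pooling encoder (standing in for RAPPPID's AWD-LSTM), and the example sequences are all illustrative assumptions, not the trained model.

```python
import numpy as np

AA = "ACDEFGHIKLMNPQRSTVWY"
rng = np.random.default_rng(42)
# Illustrative stand-ins for learned parameters: random here, trained in practice.
EMB = rng.normal(size=(len(AA), 8))     # per-residue embedding table
W1 = rng.normal(size=(8, 16))           # hidden-layer weights of the head
w2 = rng.normal(size=8)                 # output weights of the head

def encode(seq):
    """Shared ("twin") encoder: embed residues, then mean-pool to a fixed-size
    latent vector. RAPPPID uses a twin AWD-LSTM here; mean pooling keeps the
    sketch minimal."""
    idx = [AA.index(a) for a in seq]
    return EMB[idx].mean(axis=0)

def predict_interaction(seq_a, seq_b):
    """Score a protein pair: both sequences pass through the SAME encoder,
    and the concatenated latents feed a small MLP-style classification head."""
    z = np.concatenate([encode(seq_a), encode(seq_b)])   # (16,)
    h = np.maximum(W1 @ z, 0.0)                          # ReLU hidden layer
    return 1.0 / (1.0 + np.exp(-(w2 @ h)))               # interaction probability

p = predict_interaction("MKTAYIAK", "GSHMKLV")           # hypothetical sequences
```

Because the encoder is shared, the model can score pairs from any species given only amino acid sequences, which is the property that enables the out-of-distribution use case described above.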
Recent work has presented new evidence of translation for thousands of previously unknown coding sequences (CDS), drastically expanding the potential landscape of the human proteome. However, due to experimental limitations and inherent biases, many more CDS are likely to be overlooked or underestimated, thereby limiting our understanding of the full range of proteome diversity. Accurate annotation of functional elements holds crucial implications for clinical and fundamental research, hence we need revised tools to exhaustively evaluate the functional ORFeome. The astounding success of deep learning on sequence modeling tasks combined with quality -omics data provides hope in the search for an approximation of the universal features that underlie translation. Here we present FOMOnet (Fear Of Missing ORFs neural network), a 1D convolutional neural network derived from the UNet architecture. FOMOnet is trained with one-hot encoded representations of human protein coding transcripts and performs segmentation of coding regions within a transcript. With a receiver operating characteristic (ROC) area under the curve (AUC) of 99.3% and a precision-recall (PR) AUC of 96.1%, it vastly outperforms current tools aiming to predict canonical human CDS. Interestingly, our model displays no significant decrease in performance when evaluated on non-orthologous genes (with respect to human) in five other species: Mus musculus, Danio rerio, Xenopus laevis, Caenorhabditis elegans, and Saccharomyces cerevisiae. These results suggest FOMOnet as a viable tool for annotating the proteome of distantly related eukaryotes and newly sequenced species. Moreover, several studies have recently attributed function to proteins derived from pseudogenes (e.g. NOTCH2NL, SRGAP2, UBBP4) widely defined as evolutionary relics and thereby non-functional. However, the homology of pseudogenes with their parent genes has hindered large-scale functional studies. 
To address this blind spot in gene annotation, we used the OpenProt proteogenomics resource to extract 1784 human pseudogene transcripts with at least one ORF exhibiting strong evidence of translation (i.e. two independent detections by Ribo-seq or mass spectrometry). Among these translated ORFs, FOMOnet confidently identifies 702 as functional CDS, while also predicting hundreds of pseudogene CDS with no previous evidence for translation. Overall, this approach provides an unbiased assessment of CDS within the human genome. It outperforms current state-of-the-art tools, is generalizable to other eukaryotes, accurately predicts some newly found CDS and predicts novel ones.
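To illustrate the input/output format described above, here is a sketch of how a transcript might be one-hot encoded and paired with a per-nucleotide CDS segmentation target; the toy transcript and coordinates are hypothetical, not FOMOnet's actual preprocessing code.

```python
import numpy as np

BASES = "ACGU"

def one_hot(transcript):
    """One-hot encode an RNA transcript into a (4, L) matrix -- the input
    format typically fed to a 1-D CNN such as a U-Net-style model."""
    x = np.zeros((4, len(transcript)), dtype=np.float32)
    for i, b in enumerate(transcript):
        x[BASES.index(b), i] = 1.0
    return x

def cds_mask(length, cds_start, cds_end):
    """Per-nucleotide segmentation target: 1 inside the annotated CDS,
    0 elsewhere (half-open interval; hypothetical annotation)."""
    y = np.zeros(length, dtype=np.float32)
    y[cds_start:cds_end] = 1.0
    return y

tx = "GGAUGGCCUAAGG"          # toy transcript; AUG...UAA ORF at positions 2-10
x = one_hot(tx)               # shape (4, 13)
y = cds_mask(len(tx), 2, 11)  # segmentation labels for training
```

Framing CDS annotation as segmentation means the model outputs a probability per nucleotide, from which coding regions of any length can be recovered.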
Natural antisense transcripts (NATs) are RNA pairs transcribed from overlapping, opposite DNA strands. NATs are expressed in all three domains of life, as well as in retroviruses. They are involved in the regulation of RNA expression, including RNA maturation, stability, localization, and translation. As such, they are frequently implicated in disease pathways, such as cancer. Yet it is unclear how NAT pairs bind, as the assumption that they form long intermolecular duplexes has never been challenged. We thus hypothesize that NAT pairs assemble through a wide range of structures, spanning from mostly intramolecular to mostly intermolecular base pairings. Many chemical probing techniques have been developed and considerable progress has been made in the investigation of RNA structure. However, these techniques provide an average signal that is hardly amenable to deconvolution and precludes the identification of discrete structures within a complex equilibrium. We focus on cen and ik2, two natural antisense mRNAs localized to the centrosome during mitosis in Drosophila embryos. They share a 59-nucleotide-long antisense region located in their 3'UTRs, a prerequisite for their interaction and localized translation.
We employ direct RNA sequencing to identify adduct positions on single RNA molecules using nanopore reads. Nanopore reads are characterized by 5-mer context, dwell times, and current signal. Sequenced reads are aligned and normalized to produce reactivity profiles that are used to predict modification of unpaired nucleotides via statistical analysis or machine learning methods such as SVMs. These methods have been limited in the scope of their feature sets and predictions, and by the size of available data. To address these challenges, we collect a large training set of both modified and unmodified cen and ik2 molecules (n >= 20,000) and their reactivity profiles. We first examine the consequences of information loss by expanding the feature set and incorporating reactivity profiles to create semi-supervised models that detect structural features and predict, de novo, reactivity profiles with improved correlations to ground truth. We next leverage these expanded feature and data sets to develop multi-class and multi-output deep learning models that jointly predict sequence, induced modifications, and secondary structural features. Preliminary results suggest that our methods yield comparable or improved identifications relative to standard SHAPE and existing direct RNA analyses.
Over the past decade, citizen science computer games have become a popular way to engage the public in research activities. This methodology has had a noticeable impact in molecular and cell biology, where millions of online volunteers have contributed to the classification and annotation of scientific data, as well as to solving advanced optimization problems requiring human supervision. Yet, despite promising results, the deployment of citizen science initiatives through academic/professional web pages faces serious limitations. Indeed, the volume of human attention needed to process massive data sets and make state-of-the-art scientific contributions rapidly outpaces the participation and availability of online volunteers. To overcome this challenge, citizen science must transcend its "natural habitat" and reach entire gaming communities. One solution is to build partnerships with commercial video game companies that have already assembled large communities of gamers.
In this talk, we describe how this approach can transform the impact of citizen science in genomics. We discuss our experience from Phylo, an online puzzle for gene alignment, to Borderlands Science, a massively multiplayer online game for microbiome data analysis. We show how to embed citizen science tasks into a virtual universe to engage new user bases. These principles have profound implications for future citizen science initiatives seeking to meet the growing demands of biology.
The humoral immune response is a component of the adaptive immune system that involves the production and secretion of antibodies by B-cells. These antibodies are designed to recognize and bind to specific antigens, which are foreign substances originating from bacteria, viruses, or other pathogens. The majority of the actual targets of antibodies, known as epitopes, are small parts of microbial proteins, which can be continuous (linear) or discontinuous (conformational). Identification of B-cell epitopes is an important step in developing diagnostic methods, therapeutic antibodies, and epitope-based vaccines. Many methods have been developed in the past to predict B-cell epitopes based on profiles of the physico-chemical properties of amino acids in the sequence, their predicted secondary structure and relative solvent accessibility, and experimentally resolved or predicted 3D structures. However, the performance of these methods remains modest. Recently published applications of large language models to protein sequence embedding suggest that such embeddings, while independent of multiple sequence alignment, are capable of capturing long-range interactions and therefore encode structural (and potentially functional) information. Following these observations, we employed the ProtT5 protein language model to derive sequence embeddings and built a model to predict B-cell epitopes. The Immune Epitope Database (IEDB) was used to compile a non-redundant set of human pathogen antigens with all known linear epitopes mapped. This set was split into 5-fold training, validation, and control subsets with a 65:10:25% ratio, respectively. Then, the Bacterial and Viral Bioinformatics Resource Center (BV-BRC) database was used to compile a blind set of 175 antigens non-redundant between themselves and against the IEDB dataset. Our prediction model was compared with the BepiPred2 and BepiPred3 methods and showed superior performance. 
Other advantages of our model include independence from multiple sequence alignment and/or 3D protein structure, and support for protein sequences of unlimited length.
Background: The human gut microbiome has been shown to impact host development, normal metabolic processes, and the pathogenesis of various diseases. Based on these discoveries, engineering the gut microbiome to treat such diseases has become an exciting new direction in medical science. Uncovering how to precisely control a patient's microbiome requires accurately modeling the dynamics of the microbiome community under varying conditions. However, modeling longitudinal microbiome data faces many challenges due to the inherent noise of microbiome data. The development of robust and accurate models will therefore empower the identification of microbiome-targeted therapies, as clinicians and researchers will be able to identify which factors and stimuli can be used to drive a patient's microbiome to a healthier composition.
Method: Here we present DiRLaM, a deep-learning framework combining an autoencoder and deep neural network for modeling microbiome dynamics. By representing the microbiome community in a reduced latent space using an autoencoder, DiRLaM can capture the essential intrinsic community structure while making the model more robust to noise. Furthermore, DiRLaM interpolates microbiome communities within the learned latent space. In order to construct smooth transitions between different microbiome community samples, a novel regularization is applied to the Beta diversity of the observed and interpolated communities. Next, a deep neural network is trained to combine the latent microbiome community with additional information about the host and external stimuli to predict the microbiome community at the next time point. Lastly, using the trained models, DiRLaM identifies microbe-microbe interactions and significant host and external factors that contribute to the dynamic changes of the microbiome community structure.
Results: Using synthetic datasets and three real-world longitudinal datasets, we show that DiRLaM provides a more robust interpolation under increasing levels of noise compared to standard B-Spline interpolations. DiRLaM also outperforms the state-of-the-art dynamic Bayesian network model for predicting subsequent microbiome communities in longitudinal data. Additionally, we demonstrate DiRLaM’s ability to identify significant host characteristics and environmental factors contributing to the dynamics of the microbiome community.
Conclusion: We present DiRLaM, a combination of an autoencoder with a novel Beta diversity regularization and deep neural network. In both synthetic and real-world conditions, DiRLaM was both more robust and more accurate when modeling longitudinal microbiome data.
A series of choices in microbiome beta diversity analysis can dramatically impact findings in distance-based association testing and visual representation of sample relationships [1]. Through a comparative approach to pairwise beta diversity analysis in microbiome studies, we evaluate differences among and between established common practices and newer machine-learning approaches. Comparative analysis of a range of published microbiome datasets spanning human, animal, and environmental samples reveals how some beta diversity indices resemble one another, while others accentuate distinct features of sample variation. Identifying relationships between diversity indices can aid in interpretation and comparison of beta diversity analyses across microbiome studies, including findings from distance-based statistical tests. We then visualize these indices with established ordination methods such as PCoA and t-SNE as well as newer visualization tools such as the DeepMicro autoencoder [2]. This additional comparative analysis allows us to demonstrate advantages and disadvantages of different methods based on interpretability, distortion of original data, and susceptibility to model tuning. Finally, we describe some common pitfalls in beta diversity, such as oversaturation of pairwise indices leading to the commonly seen “arch effect” or “horseshoe effect” in visualization, and explore the efficacy of a novel approach, Local Gradient Distance (LGD), for correcting oversaturated distances. As a whole, our comparative analysis provides a new, data-driven framework for choosing an appropriate beta diversity analysis approach for a particular dataset.
[1] Knight, R., Vrbanac, A., Taylor, B. C., Aksenov, A., Callewaert, C., Debelius, J., ... & Dorrestein, P. C. (2018). Best practices for analysing microbiomes. Nature Reviews Microbiology, 16(7), 410-422.
[2] Oh, M., & Zhang, L. (2020). DeepMicro: deep representation learning for disease prediction based on microbiome data. Scientific reports, 10(1), 6026.
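As a concrete example of one of the pairwise beta diversity indices compared in the study above, here is a minimal Bray-Curtis implementation over a sample-by-taxon count table; the toy counts are illustrative, not from the analyzed datasets.

```python
import numpy as np

def bray_curtis(counts):
    """Pairwise Bray-Curtis dissimilarity for an (n_samples, n_taxa)
    count table: 0 for identical samples, 1 for samples sharing no taxa."""
    n = counts.shape[0]
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            num = np.abs(counts[i] - counts[j]).sum()   # total absolute difference
            den = (counts[i] + counts[j]).sum()         # total abundance of the pair
            D[i, j] = D[j, i] = num / den
    return D

counts = np.array([[10, 0, 5],     # toy samples: rows are samples, columns taxa
                   [10, 0, 5],
                   [0, 15, 0]])
D = bray_curtis(counts)
```

The resulting distance matrix `D` is exactly the kind of object that then feeds ordination methods such as PCoA or t-SNE for the visual comparisons described above.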
Mendelian randomization (MR) is a method to estimate the causal effect of an exposure on an outcome in the presence of unmeasured confounding variables by leveraging the framework of instrumental variable (IV) estimation. An IV is a variable that induces variability in the exposure independently from confounding variables, and MR uses genetic variants as IVs. MR is widely used to predict the effect of interventions on modifiable disease risk factors to validate drug targets or therapeutic pathways. However, assumptions are needed to estimate causal relationships. For example, MR methods typically assume that the effect of the genetic variant used as the IV on the exposure is linear and homogeneous. These assumptions may be invalid in the presence of genetic interactions (e.g. different genetic effects in men vs women) or nonlinear relationships. Methods for nonparametric IV estimation that relax the effect homogeneity and linearity assumptions have been developed. For example, the DeepIV algorithm uses neural networks to model the instrument-exposure and exposure-outcome relationships in two distinct stages, making no additional assumptions about the functional forms and allowing for heterogeneity due to observed variables (Hartford et al. 2017). Alternative approaches include machine learning estimators based on the generalized method of moments or kernel instrumental variable regression.
Few of these recent nonparametric IV estimators have been evaluated in the context of MR. This is due in part to the difficulty of training some of these models which requires the use of heuristic optimization strategies when training neural network-based models. To bridge this gap, we have developed ml-mr, a bioinformatics package that implements various nonparametric IV estimators to enable their use and evaluation in the context of MR. We also provide a framework for simulation analyses which includes previously published simulation scenarios enabling the head-to-head comparison of different methods. To assess the precision of these MR estimators, we have also included tools to estimate valid prediction intervals from black box machine learning models using conformal prediction. We used ml-mr to evaluate different formulations of the DeepIV algorithm that simplify the model for the instrument-exposure relationship. Using simulation models, we evaluated the sensitivity of the MR estimators to important parameters such as the heritability and number of samples. We also report on possible uses of these methods to estimate conditional treatment effects for drug target validation in targeted patient populations.
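For readers unfamiliar with IV estimation, the linear two-stage least squares baseline that these nonparametric estimators generalize can be sketched in a few lines; with a single instrument it reduces to the Wald ratio. The simulation parameters below are illustrative, not those of the ml-mr scenarios.

```python
import numpy as np

rng = np.random.default_rng(1)
n, beta = 200_000, 0.5                            # true causal effect of X on Y

Z = rng.binomial(2, 0.3, size=n).astype(float)    # genetic instrument (dosage 0/1/2)
U = rng.normal(size=n)                            # unmeasured confounder
X = 0.4 * Z + U + rng.normal(size=n)              # exposure depends on Z and U
Y = beta * X + U + rng.normal(size=n)             # outcome depends on X and U

# Naive regression of Y on X is biased upward by the confounder U.
ols = np.cov(X, Y)[0, 1] / np.var(X)

# Two-stage least squares: stage 1 predicts X from Z, stage 2 regresses Y on
# the prediction. With one instrument this is the Wald ratio cov(Z,Y)/cov(Z,X),
# which is unbiased because Z is independent of U.
iv = np.cov(Z, Y)[0, 1] / np.cov(Z, X)[0, 1]
```

The DeepIV-style estimators discussed above replace both linear stages with neural networks, relaxing the linearity and homogeneity assumptions this sketch relies on.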
Cytochromes P450 (CYP450) are hemoproteins generally involved in detoxifying the body of xenobiotic molecules. They participate in the metabolism of many drugs, and genetic polymorphisms in humans have been found to impact drug responses and metabolic functions. In this study, we investigate the genetic diversity of CYP450 genes. We found that two clusters, CYP3A and CYP4F, are notably differentiated across human populations, with evidence for selective pressures acting on both: signals of recent positive selection in CYP3A and CYP4F genes and signals of balancing selection in CYP4F genes. Furthermore, an unusual linkage disequilibrium pattern is detected in both clusters, suggesting co-evolution of genes within clusters. Several of these selective signals co-localize with expression quantitative trait loci, which suggests co-regulation and epistasis within these highly important gene families. We also found that SNPs under selection in Africans in CYP3A genes were causally associated with reticulocyte count, raising the hypothesis that the selective pressure could be linked to malaria resistance. Finally, as the CYP3A and CYP4F subfamilies are involved in the metabolism of nutrients and drugs, our findings linking natural selection and gene expression in these gene clusters are important for understanding population differences in human health.
Phenome-wide association studies (PheWAS) promise to detect shared genetic variants across a wide spectrum of phenotypes, thereby elucidating the molecular etiology of disease comorbidities. UK Biobank (UKB) provides a valuable resource to conduct PheWAS in over half a million individuals with readily available phenotype and genotype data. To aid PheWAS, we can harness the PheCode system, which maps ICD-10 codes to 1500 expert curated phenotype codes. Despite this effort, PheWAS based on the PheCode definitions may have limited statistical power because of the imperfect classification of cases and controls.
To tackle this challenge, we applied our recently developed MixEHR-SAGE [1] to infer 1500 PheCode-guided probabilistic topics from the UKB data using not only the 6800 unique ICD-10 codes but also 803 unique ATC medication codes and 2560 unique OPCS medical procedure codes. MixEHR-SAGE introduces a PheCode-driven initialization of phenotype topic priors. For each phenotype, we infer a dual-form of topic distribution: a seed-topic distribution over a core set of ICD-10 codes defined under the PheCode and a regular topic distribution over the entire ICD-10, OPCS and ATC vocabularies to capture complementary information from these additional data modalities.
To evaluate the accuracy of our phenotyping algorithm, we extracted 139 PheCodes that can be treated with any known ATC drug and removed subjects that used any of those drugs in their initial visits. For the remaining subjects, we used their PheCode-topic probabilities inferred in their initial visits as their surrogate disease severity scores. We evaluated these scores based on whether those subjects were prescribed linked medications in their follow-up visits. We achieved on average 60% area under the precision-recall curve in contrast to 40% using mixture models and 20% using the raw PheCodes.
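The evaluation above ranks subjects by their topic-derived severity scores and measures the area under the precision-recall curve; a minimal average-precision implementation, with hypothetical scores and labels, looks like this:

```python
import numpy as np

def average_precision(scores, labels):
    """Area under the precision-recall curve in its average-precision form:
    mean precision at the rank of each positive example."""
    order = np.argsort(scores)[::-1]                    # rank by descending score
    labels = np.asarray(labels, dtype=float)[order]
    tp = np.cumsum(labels)                              # true positives at each rank
    precision = tp / np.arange(1, len(labels) + 1)
    return float((precision * labels).sum() / labels.sum())

scores = [0.9, 0.8, 0.3, 0.2]   # hypothetical topic probabilities at initial visit
labels = [1, 1, 0, 0]           # prescribed the linked drug at follow-up?
ap = average_precision(scores, labels)
```

A perfect ranking of positives above negatives yields an average precision of 1.0, while random scores drift toward the positive-class prevalence; this is what makes the reported 60% vs 40% vs 20% comparison meaningful.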
We conducted PheWAS using the UKB data over the 1 million HapMap3 SNPs and the 1500 inferred PheCode-guided topics modelled as continuous phenotypes. Our PheWAS results identified genome-wide significant loci that did not reach significance in conventional PheCode-based PheWAS analyses. At least one new genome-wide significant locus was identified for 111 phenotypes. For example, we identified a novel association with our PheCode-guided topic for "preeclampsia and eclampsia," a common complication of pregnancy. The lead variant at the locus, rs10405410, had an association p-value of 2.6 x 10^-9 and was located near a cluster of carcinoembryonic antigen (CEA) genes including CEACAM1 and CEACAM8. CEACAM1 has been associated with insulin sensitivity in pregnant women with gestational diabetes, and a recent prospective study showed that CEACAM1 levels were increased in the first trimester in women with preeclampsia compared to healthy controls [2], providing external evidence supporting our finding.
Source code is available at https://github.com/li-lab-mcgill/MixEHR-Seed
[1] Song et al. (2022) Automatic phenotyping by a seed-guided topic model. In Proceedings of the 28th ACM SIGKDD.
[2] Mach, P. et al. Evaluation of carcinoembryonic antigen-related cell adhesion molecule 1 blood serum levels in women at high risk for preeclampsia. Am. J. Reprod. Immunol. 85, e13375 (2021)
Polygenic risk scores (PRS) are a tool to estimate individuals' liability to a disease or trait measurement based solely on genetic information. One commonly discussed potential use is in the clinic, to identify people at greater risk of developing a disease. Here, we compare three large PRS for coronary artery disease (CAD). In the UK Biobank, the cohort used in the creation of each score, we calculated the association between CAD, the scores, and population structure for the white British subset. After adjustment for geographic and socioeconomic factors, CAD was not associated with population structure; however, all three scores were confounded by genetic ancestry, raising questions about how these biases would impact clinical application. Furthermore, we investigated differences in risk stratification using four UK Biobank assessment centers as separate cohorts, and tested through simulation how missing genetic data affect risk stratification. We show that missing data impact the classification of individuals at the extremes of high and low risk, and that quantiles of risk are sensitive to individual-level genotype missingness. Distributions of scores varied between assessment centers, revealing that thresholding based on quantiles can be problematic for consistency across centers and populations. Based on these results, we discuss potential avenues for improving PRS methodologies for use in clinical practice.
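A PRS is simply a weighted sum of genotype dosages, so the sensitivity to missingness described above can be illustrated directly; the weights, missingness rate, and zero-imputation convention below are illustrative assumptions, not those of the three CAD scores.

```python
import numpy as np

rng = np.random.default_rng(7)
m, n = 1_000, 5_000                        # variants, individuals
w = rng.normal(0, 0.05, size=m)            # hypothetical per-variant effect weights
G = rng.binomial(2, 0.3, size=(n, m)).astype(float)   # genotype dosages (0/1/2)

prs = G @ w                                # standard weighted-sum PRS

# Simulate per-individual genotype missingness: drop 5% of calls and (as some
# pipelines effectively do) let missing dosages contribute zero to the score.
mask = rng.random(size=(n, m)) < 0.05
prs_missing = np.where(mask, 0.0, G) @ w

# Quantile-based risk classes can shift: compare top-5% membership with and
# without missingness, using the threshold from the complete data.
q = np.quantile(prs, 0.95)
reclassified = float(np.mean((prs >= q) != (prs_missing >= q)))
```

Because thresholds are defined by quantiles of a score distribution, any systematic shift from missingness or from a different assessment-center population moves individuals across the high-risk boundary, which is the consistency problem the study highlights.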
SARS-CoV-2 has had an unprecedented impact on human health and highlights the need for genomic epidemiology studies to increase our understanding of the evolution and spread of pathogens and to inform policy decisions. We sequenced viral genomes from over 22,000 patients tested at Mayo Clinic Laboratories between 2020 and 2022 and leveraged detailed patient metadata to describe county and regional spread in Minnesota via Bayesian phylodynamics. We found that spread in the state was mostly dominated by viruses from Hennepin County, which contains the state's largest metropolis. This includes the spread of earlier clades as well as the variants of concern Alpha and Delta.
The earliest introduction into Minnesota was into Hennepin County from a domestic (USA) source around January 22, 2020, six weeks before the first confirmed case in the state. The first county-to-county introductions were estimated to originate from Hennepin to bordering Ramsey County around February 23 and from Hennepin to Central Minnesota around February 25. Both international and domestic introductions were most abundant in Hennepin (home to an international airport). Hennepin was also by far the dominant source of in-state transmissions to other Minnesota locations (n=2,119) over the two-year period.
We measured the ratio of introductions to total viral flow into and out of each county/region. A value of 1 indicates a location acts as a "sink" (accepting SARS-CoV-2 lineages but never exporting them to other counties), while a value of 0 indicates a "source". Most locations were "sinks" throughout the pandemic. Central Minnesota (outside of Hennepin and Ramsey) was dominated by introductions early on, but in late 2020 and early 2021 experienced brief, fluctuating periods of higher virus exportation. Hennepin showed a different trend from all other locations, consistently acting as a source for other Minnesota counties. However, there were brief periods of fluctuation, such as an increase in the ratio of introductions towards the end of 2020 and in early 2021, potentially driven by introductions from international locations.
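The sink/source summary statistic described above can be computed directly from counts of inbound and outbound lineage movements; the counts below are hypothetical, not the study's estimates.

```python
def introduction_ratio(introductions, exports):
    """Ratio of inbound lineage introductions to total viral flow for a
    location: 1.0 = pure "sink" (only receives), 0.0 = pure "source"
    (only exports). Undefined (NaN) when there is no flow at all."""
    total = introductions + exports
    return introductions / total if total else float("nan")

# Hypothetical counts over some time window for two contrasting locations.
source_like = introduction_ratio(introductions=40, exports=160)
sink_like = introduction_ratio(introductions=30, exports=2)
```

Computing this ratio in sliding time windows per county yields the fluctuating source/sink trajectories described above.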
As the virus continues to evolve, more within-state genomic epidemiology studies are needed to inform local and state public health responses by highlighting the roles of various counties in state-wide transmission. In addition, they can elucidate the impact of out-of-state introductions on local spread, which can inform policies such as travel restrictions.
The impact of COVID-19 on our lives over the past three years has been profound. COVID-19 cases vary in severity depending on a number of factors, including age, race, gender, blood type, and underlying conditions. While a few host genetic variants (such as the 3p21.31 locus) have been identified, our knowledge of how host genetics contribute to COVID-19 outcomes is still limited. Previous COVID-19 host genetic studies primarily analyze single-variant associations. However, like many human diseases, the genetic architecture underlying COVID-19 is likely to be complex, with potential interactions between variants determining disease outcomes.
In this study, we applied BridGE, a method we previously developed, to search for pathway-level genetic interactions associated with COVID-19 severity. We applied this approach to the UK Biobank England cohort and validated the discovered pathway-level interactions using the Scotland and Wales cohorts. In total, we discovered and replicated 21 between-pathway and 5 within-pathway interactions (FDR < 0.05), all of which were associated with increased COVID-19 severity. Four of these interactions were strongly replicated in the independent cohorts (FDR < 0.05), including interactions involving antigen processing and presentation, androgen receptor signaling, and interferon-γ signaling. We also found that some of the strongest variants driving these pathway-level interactions were located in or near the HLA super-locus (chromosome 6p21), which is intriguing given the important role of this locus in the immune response. While several of the discovered pathways show clear relevance to COVID-19, most have not been implicated by previous COVID-19-related GWAS studies and cannot be discovered through pathway enrichment analysis of collections of individually associated variants.
Finally, we found that incorporating the discovered genetic interactions into a COVID-19 severity prediction model significantly improved overall performance (AUC-ROC increased from 0.68 to 0.83). Similar improvements could not be achieved by incorporating only variants discovered through single-variant GWAS analysis or through gene-set enrichment analysis. Our results suggest that complex genetic interactions between common variants play a role in determining COVID-19 severity.
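The value of modelling interactions rather than single variants can be seen in a toy example: a purely epistatic (XOR-like) effect between two pathway burden scores is invisible to marginal, single-feature association but is fully captured by an explicit interaction feature. This is a hypothetical illustration of the principle, not the BridGE method itself:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
# Hypothetical burden indicators for two pathways (not real genotype data):
pathway_a = rng.integers(0, 2, n)
pathway_b = rng.integers(0, 2, n)
# Severity driven purely by the between-pathway interaction (XOR-like model):
severe = pathway_a ^ pathway_b

# Marginal (single-feature) association is near zero...
print(abs(np.corrcoef(pathway_a, severe)[0, 1]))  # ~0: no marginal signal
# ...but the interaction feature is perfectly associated:
print(np.corrcoef(pathway_a ^ pathway_b, severe)[0, 1])  # 1.0
```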
Recent data suggest that only a small fraction of severe malaria heritability is explained by the totality of genetic markers discovered so far. The extensive genetic diversity within African populations means that significant associations are likely to be found in Africa. In their series of multi-site genome-wide association studies (GWAS) across sub-Saharan Africa, the Malaria Genomic Epidemiology Network (MalariaGEN) observed specific limitations and encouraged country-specific analyses. Here, we present findings of a GWAS of Cameroonian participants who contributed to MalariaGEN projects (n = 1103). We identified protective associations at polymorphisms within the enhancer region of CHST15 (FDR < 0.02) that are specific to populations of African ancestry and that tag strong eQTLs of CHST15 in hepatic cells. In silico functional analysis revealed a signature of epigenetic regulation of CHST15 that is preserved in populations in historically malaria-endemic regions, with haplotype analysis revealing a haplotype specific to these populations. Association analysis by ethnolinguistic group identified protective associations within SOD2 (FDR < 0.04), a gene previously shown to be significantly induced in presymptomatic malaria patients from Cameroon. Haplotype analysis revealed substantial heterogeneity within the beta-like globin (HBB) gene cluster among the major ethnic groups in Cameroon, confirming differential malaria pressure and underscoring age-old fine-scale genetic structure within the country. Our findings reveal novel insights into the evolutionary genetics of populations living under malaria pressure in Cameroon, identify new significant protective loci (CHST15 and SOD2), and emphasize the significant attenuation of genetic association signals by fine-scale genetic structure.
Cancer driver mutations often display a remarkable spatiotemporal distribution and tissue specificity. However, many of these driver genes are ubiquitously expressed and required for cellular function across cell types and tissues. Here, I will discuss our work towards understanding the reasons behind this apparent paradox in pediatric brain tumors. To systematically identify the origins of these tumors, we generated single-cell resolution maps of the developing brain for regions that are frequent tumor locations. This allowed us to catalogue the cell populations in regions where these tumors arise, their dynamics over time, and their distinct cellular states. I will discuss computational strategies for integrating tumor datasets with normal developing references, and how these point to the lineage of origin, the differentiation state, and the chromatin architecture and associated vulnerabilities of several tumor entities. Across tumor types, we find that the biological robustness of lineage and positional identity programs explains many of the mutational patterns seen in patients.
Motivation: Large-scale genetic and pharmacologic dependency maps are generated to reveal genetic vulnerabilities and drug sensitivities of cancer. However, user-friendly software is needed to systematically link such maps.
Results: Here we present DepLink, a web server to identify genetic and pharmacologic perturbations that induce similar effects on cell viability or molecular changes. DepLink integrates heterogeneous datasets of genome-wide CRISPR loss-of-function screens, high-throughput pharmacologic screens, and gene expression signatures of perturbations. The datasets are systematically connected by four complementary modules tailored to different query scenarios, allowing users to search for potential inhibitors that target a single gene (Module 1) or multiple genes (Module 2), the mechanisms of action of a known drug (Module 3), and drugs with biochemical features similar to an investigational compound (Module 4). We performed a validation analysis confirming our tool's ability to link the effects of drug treatments to knockouts of the drugs' annotated target genes. Using CDK6 as a demonstration query, the tool identified well-studied inhibitor drugs, novel synergistic gene and drug partners, and insights into an investigational drug. In summary, DepLink enables easy navigation, visualization, and linkage of rapidly evolving cancer dependency maps.
Availability: The DepLink web server, demonstrating examples, and detailed user manual are available at https://shiny.crc.pitt.edu/deplink/.
Triple-negative breast cancer (TNBC) remains a difficult-to-treat and aggressive subtype of breast cancer, with poorer prognosis compared to other subtypes. Aberrant activation of the metabolic mevalonate (MVA) pathway is a hallmark of many cancer types, including TNBC, due to the production of cholesterol and non-sterol isoprenoids that promote cellular proliferation and survival. Statins are FDA-approved, cholesterol-lowering agents that demonstrate therapeutic potential by inhibiting the rate-limiting enzyme of the MVA pathway. Statins trigger tumour-specific cell death, but a restorative feedback response induces MVA pathway genes, a process which ultimately dampens the pro-apoptotic activity of statins significantly.
In this work, we sought to identify synergistic statin-compound combinations that would potentiate statin-induced tumour cell death as an improved treatment strategy. Dipyridamole (DP) was previously identified as an FDA-approved agent that synergizes with statins and potentiates statin-induced tumour cell death. However, DP's unclear mechanism of action, complex polypharmacology and antiplatelet activity limit its clinical use in cancer patients.
We leveraged a pathway-centric, computational pharmacogenomics approach to identify new compounds that phenocopy DP, based on high-throughput integration of drug structure, drug-gene perturbation and drug sensitivity profiles into a comprehensive network. We restricted some elements of our network to genes involved in the MVA pathway. Our approach, called mevalonate drug network fusion (MVA-DNF), identified 19 drugs that phenocopy DP's behaviour in the regulation of MVA pathway genes at the phenotypic and molecular levels. Importantly, MVA-DNF facilitated the identification of synergistic statin-compound combinations that potentiate statin-induced tumour cell death. We validated two top-ranked compounds, nelfinavir and honokiol, demonstrating that both synergize with fluvastatin to potentiate tumour cell death in a panel of breast cancer cell lines and in 3D primary breast-cancer patient-derived tumour organoids. The fluvastatin-nelfinavir and fluvastatin-honokiol combinations target transcriptomic and proteomic pathways in a manner mechanistically similar to DP, presenting more targeted alternatives for statin-drug combinations in TNBC treatment.
Our computational pharmacogenomic approach presents a time- and cost-effective strategy to identify novel, actionable compounds with pathway-specific activities. We are adapting this approach to identify novel compounds that phenocopy a compound of interest while targeting various key metabolic pathways. This sets the framework for future pathway-centric identification of drug combinations as anti-cancer therapeutic strategies.
Reference: van Leeuwen et al. Computational pharmacogenomic screen identifies drugs that potentiate the anti-breast cancer activity of statins. Nature Communications 13, 6323 (2022). doi: 10.1038/s41467-022-33144-9
The age-associated accumulation of somatic mutations in blood, termed Clonal Hematopoiesis of Indeterminate Potential (CHIP), has been implicated in the development of blood cancers and cardiovascular conditions. Here, we evaluate how natural selection shapes the prevalence of clonal somatic mosaic chromosomal alterations (mCAs) in blood, and their impact on cancer risk, among 13,760 individuals from the Canadian Partnership for Tomorrow's Health. We find that mCA-inferred CHIP is three times more prevalent than previously estimated, with one in eight individuals harboring an mCA in their blood. We characterize mCA hotspots across the genome and find that hotspots are enriched for genes implicated in blood cancer. The number of mCAs in an individual's blood is associated with a greater than two-fold increase in risk of blood cancer development. Further, we observe a 136% increase in the rate of blood cancer development for each additional autosomal mCA. Negative selection appears to play a key role in regulating the frequency of mCAs in each individual's hematopoietic population. However, we find that gains, losses, and copy-number-neutral variants impact gene expression distinctly, with stabilizing selection shaping the penetrance of mCAs among blood transcriptomes. Over 30% of canonical eQTLs are impacted by mCAs, with loss events accounting for 9.1% of observed allele-specific expression, revealing a previously overlooked somatic contribution to blood expression profiles, regulatory mechanisms, and disease development. In summary, we show how different evolutionary models of selection shape clonal dynamics in healthy and pre-cancerous blood and are critical when leveraging the predictive power of somatic events for early cancer detection.
Single-cell RNA sequencing can reveal valuable insights into cellular heterogeneity within tumour microenvironments (TMEs), paving the way for a deep understanding of the cellular mechanisms contributing to cancer. However, high heterogeneity within the same cancer type and low transcriptomic variation in immune cell subsets present challenges for accurate, high-resolution confirmation of cells' identities. Here we present scATOMIC, a modular annotation tool for malignant and non-malignant cells. We trained the core scATOMIC model on >250,000 cancer, immune, and stromal cells, defining a pan-cancer reference across 19 common cancer types, and employed a novel hierarchical approach that outperforms current classification methods. We extensively confirmed scATOMIC's accuracy in an external validation set of 198 tumour biopsies and 54 blood samples encompassing >420,000 cancer and a variety of TME cells, achieving median F1 scores of 0.99 across cell types. Compared with six other methods, scATOMIC is the only tool that can accurately predict cancer type among single cancer cells. Moreover, scATOMIC identifies distinct malignant subclones across multiple samples that are not identified with copy number variation inference methods. We demonstrate increased cell type resolution across epithelial, blood and stromal cell types across cancer types. We highlight scATOMIC's practical significance as a modular method by extending a classification branch to accurately subset breast cancers into their clinically relevant molecular subtypes. In a rare ER-low breast cancer, scATOMIC annotates distinct populations of ER+ breast cancer cells and triple-negative breast cancer cells with different copy number variation profiles, suggesting a genome instability event during evolution leading to the loss of ER on most cancer cells. Additionally, we applied scATOMIC to predict tumours' primary origin across metastatic cancers and achieved accurate predictions in 87% of tumours.
Finally, our ability to annotate samples across cancer types at high resolution has enabled new insights into pan-cancer cell-cell communication networks. We show that increased annotation resolution can highlight inferred interactions between malignant cells and normal epithelial cells, as well as interactions related to the PD1-PDL1 pathway between dendritic cells and exhausted T cells. Our approach represents a broadly applicable strategy to analyze multicellular cancer TMEs.
Breast cancer progression involves a reprogramming of regulatory chromatin signatures, including enhancers that regulate gene transcription associated with tumorigenesis and metastasis. Regulatory interactions mediated by enhancers, transcription factors (TFs) and target genes make up a complex network controlling cell identity and cellular behaviors. Using network analysis to identify critical TFs in different cancer cell types holds the potential not only to understand these regulatory relationships but also to discover new therapeutic targets for treating breast cancer. To elucidate the regulatory landscape of breast cancer progression, we employed the human MCF10A progression model, composed of four cell lines representing different stages of breast tumor development. We integrated ChIP-Seq for histone modifications and transcription factors, as well as RNA-Seq, to generate the global gene regulatory network and to define the core regulatory circuitry of breast cancer progression. By comparing the gene regulatory networks between different states, we observed distinctive regulatory interactions and TF/target-gene modules that change across the progression model. Building on this understanding, we further analyzed TF regulatory circuitry to define critical direct and indirect regulatory events driven by core TFs in breast cancer progression. Our results highlight the importance of TFs such as RUNX1 that act as tumor suppressors and demonstrate the increased activity of oncogenic TFs in late stages of breast cancer. Overall, our study unravels key chromatin features associated with breast cancer progression, advances our understanding of its epigenetic basis, and provides potential therapeutic targets for inhibiting breast tumor progression.
Where do genes come from? All genomes contain genes whose sequences appear unique to a given species or lineage to the exclusion of all others. These “orphan” genes cannot be related to any known gene family; they are considered evolutionarily novel and are thought to mediate species-specific traits and adaptations. In this seminar, I will present an investigation of the evolutionary origins of orphan genes in eukaryotes. According to our results, most orphan genes may have evolved through an enigmatic process called “de novo gene birth”. I will present a series of integrated computational and experimental analyses in budding yeast that begin to shed light on the molecular mechanisms of de novo gene birth. Serendipitously, these analyses reveal the existence of thousands of previously unsuspected species-specific translated elements in the yeast genome that appear to mediate beneficial phenotypes yet are evolutionarily transient. I will discuss the implications of these findings for our understanding of molecular innovation in eukaryotes.
Microexons (exons shorter than 15 nucleotides) have not been thoroughly explored. We developed a pipeline to identify small microexons from >900 deep RNA-seq data sets of 10 diverse plants, and used this resource to examine the accuracy of microexon annotation, the effects of misannotation on inferred proteins, the mechanism and regulation of microexon splicing, and the dynamics of microexon evolution. Analysis of all identified microexons and their junction reads revealed that microexons require strong flanking intronic splicing signals and are predominantly spliced post-transcriptionally in the nucleus before transcripts are released into the cytoplasm, whereas most regular exons are spliced co-transcriptionally. We found that microexons and their associated gene structures are highly conserved among angiosperms, and many microexons can be traced back to the origin of land plants. We also developed a predictive algorithm to identify conserved coding microexons independent of RNA-seq data, and applied it to a data set of 132 plant genomes. Microexons provide a strong phylogenetic signal consistent with expected plant organismal relationships, indicating that they are useful molecular genetic markers. Our discovery of microexons provides resources for improving plant genome annotations, new insight into microexon splicing, and a new source of markers for molecular evolutionary studies in plants.
The reconstruction of gene regulatory networks (GRNs) from single-cell gene expression data has been a topic of interest for over a decade. However, benchmarking GRN inference algorithms remains challenging due to the absence of gold-standard ground truth. While reference GRNs can be built from experimental data such as ChIP-Seq, or curated from the literature, their interactions may only partially correspond to the biological context under investigation, requiring lengthy and expensive perturbation experiments.
To overcome these issues, we present GRouNdGAN, a single-cell RNA-seq simulator based on causal generative adversarial networks. In this model, genes are causally expressed under the control of regulating transcription factors (TFs), guided by a user-provided GRN. GRouNdGAN enables simulation of single-cell RNA-seq data, in silico perturbation experiments, and benchmarking of GRN inference methods. It is trained on a reference dataset to capture non-linear TF-gene dependencies, as well as the technical and biological noise in real scRNA-seq data, generating realistic datasets in which GRN properties are captured and gene identities are preserved.
GRouNdGAN outperforms state-of-the-art simulators in generating realistic cells indistinguishable from real ones, despite the rigid constraints of an imposed GRN. Moreover, perturbing a TF results in significant perturbation of its targets, while the expression of other genes remains unchanged. In addition, GRouNdGAN can simulate cells at different states of a biological process. Using a dataset corresponding to the differentiation of stem cells, we show that the simulated cells conserve trajectories and pseudo-time orderings consistent with those of the real dataset. We use these properties to benchmark a variety of GRN inference methods, including those that utilize the concept of pseudo-time.
GRouNdGAN learns meaningful causal regulatory dynamics and can sample from interventional distributions in addition to observational ones, synthesizing cells under conditions that do not occur in the dataset at inference time. This property allows perturbation and TF knockdown experiments to be predicted in silico. Using a scRNA-seq dataset corresponding to 11 cell types to generate simulated data, we show that excluding the top three differentially expressed TFs of each cell type results in the disappearance of that cell type from the generated samples. In another experiment, removing lineage-determining TFs in hematopoiesis results in cells differentiating into other lineages, consistent with in vitro knockout experiments.
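The causal logic behind such in silico knockouts can be illustrated with a toy simulator in which TFs are sampled first and target genes are then generated as noisy functions of their regulators, per an imposed GRN. This is a deliberately simplified sketch of the principle only; the GRN, gene names, and distributions below are hypothetical, and GRouNdGAN itself uses a causal GAN rather than Poisson draws:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical toy GRN: each target gene is controlled by the listed TFs.
grn = {"gene1": ["tf_a"], "gene2": ["tf_a", "tf_b"], "gene3": ["tf_b"]}

def simulate_cells(n_cells, knockout=None):
    """Sample cells causally: TFs first, then targets as functions of TFs."""
    tfs = {t: rng.gamma(2.0, 1.0, n_cells) for t in ("tf_a", "tf_b")}
    if knockout is not None:
        tfs[knockout] = np.zeros(n_cells)   # interventional distribution
    cells = dict(tfs)
    for gene, regulators in grn.items():
        drive = sum(tfs[r] for r in regulators)
        cells[gene] = rng.poisson(drive)    # noisy readout of regulatory drive
    return cells

ctrl = simulate_cells(5000)
ko = simulate_cells(5000, knockout="tf_a")
# Knocking out tf_a silences its exclusive target gene1...
print(ko["gene1"].mean())  # 0.0
# ...while a tf_b-only target is essentially unchanged:
print(round(ctrl["gene3"].mean(), 1), round(ko["gene3"].mean(), 1))  # similar
```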
In summary, GRouNdGAN is a powerful scRNA-seq simulator with many uses, from simulating data for GRN inference to performing in silico knockout experiments.
Medulloblastoma (MB) is an aggressive pediatric cancer with subtypes that each have unique molecular features and patient outcomes (Taylor et al., 2012). The four main MB subtypes – SHH, WNT, Group 3, and Group 4 – can be predicted using gene expression or methylation data from bulk samples. SHH and WNT are easy to distinguish, but existing classification methods struggle to discriminate between Group 3 and Group 4 (Weishaupt et al., 2019). Existing methods are also often applied to entire cohorts, rather than predicting subtype labels for individual samples as they are collected. Here, we introduce a single-sample predictor that accurately classifies individual samples without the need to normalize values to match a training distribution. We applied k top-scoring pairs, a classification method based on the ordering of a set of paired measurements, and random forest approaches to make subtype predictions based on within-sample relative gene expression levels. We demonstrate comparable performance across RNA-seq and microarray profiling. After training models using bulk microarray and RNA-seq data, we tested the performance of our single-sample predictor on single-cell RNA-seq data from 36 medulloblastoma samples representing all four subtypes. Our model correctly predicted the subtype in the majority of pseudo-bulk samples constructed by averaging genes' expression levels across all cells. We then applied the classifiers to individual cells; in 35 of 36 samples, the predicted subtype of the majority of individual cells matched the patient's subtype. In three samples, however, the predicted subtypes were a mix of Group 3 and Group 4, with low-confidence predictions suggesting an intermediate phenotype. Notably, Group 3 and Group 4 have previously been found to exist as intermediates on a transcriptomic spectrum (Williamson et al., 2022).
Our results provide single-cell support for a model of Group 3 and Group 4 existing along a continuum and illustrate the value of classifying individual cells. In summary, k top-scoring pairs and random forest single-sample predictors accurately predict MB subtype labels across platforms and for both bulk and single-cell transcriptomic samples.
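The core of a k top-scoring-pairs classifier is small enough to sketch: score every gene pair by how differently the two classes order that pair's expression within a sample, then predict by majority vote over the k best pairs. Because only within-sample orderings are used, no cross-platform normalization is needed. This is a generic illustration of the method on toy data, not the authors' implementation:

```python
import numpy as np
from itertools import combinations

def train_ktsp(X, y, k=3):
    """Select the k gene pairs whose within-sample ordering best separates classes.

    X: samples x genes matrix; y: binary labels. Rank-based comparisons make
    the classifier insensitive to platform-specific normalization.
    """
    scored = []
    for i, j in combinations(range(X.shape[1]), 2):
        p0 = np.mean(X[y == 0, i] < X[y == 0, j])  # P(gene_i < gene_j | class 0)
        p1 = np.mean(X[y == 1, i] < X[y == 1, j])  # P(gene_i < gene_j | class 1)
        scored.append((abs(p0 - p1), i, j, p1 > p0))
    scored.sort(reverse=True)
    return [(i, j, less_means_1) for _, i, j, less_means_1 in scored[:k]]

def predict_ktsp(pairs, x):
    """Majority vote over pairs, using only relative expression within x."""
    votes = [int((x[i] < x[j]) == less_means_1) for i, j, less_means_1 in pairs]
    return int(sum(votes) > len(votes) / 2)

# Toy data: class 0 has gene0 > gene1, class 1 the reverse.
X = np.array([[5, 1, 3], [6, 2, 3], [1, 5, 3], [2, 6, 3]])
y = np.array([0, 0, 1, 1])
pairs = train_ktsp(X, y, k=3)
print(predict_ktsp(pairs, np.array([7, 0, 3])))  # 0
print(predict_ktsp(pairs, np.array([0, 7, 3])))  # 1
```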
One of the current challenges in the analysis of single-cell data is the harmonized analysis of expression profiles across samples, where sample-to-sample variability exists and is driven by technical and biological effects. Lately, various computational methods have been developed with the aim of removing unwanted sources of technical variation. However, these methods have various limitations, including the inability to distinguish technical and biological sources of sample-to-sample variability, and low interpretability of the integrated low-dimensional space. We introduce Gene Expression Decomposition and Integration (GEDI), a model that unifies various concepts from normalization and imputation to integration and interpretation of single-cell transcriptomics data in a single framework. GEDI finds a common coordinate frame that defines a reference gene expression manifold and sample-specific transformations of this coordinate frame. The common coordinate frame can be expressed as a function of gene-level variables, enabling the projection of pathway and regulatory network activities onto the cellular state space. The coordinate transformation matrices, on the other hand, provide a compact and harmonized representation of differences in the gene expression manifolds across samples, enabling cluster-free differential gene expression analysis along a continuum of cell states, as well as machine learning-based prediction of sample characteristics. Comparison of GEDI to a panel of single-cell integration methods using different benchmark datasets and previously established metrics suggests that GEDI is consistently among the top performers in batch effect removal and cell type conservation, while it can uniquely deconvolve the effects of different sources of sample-to-sample variability. 
We also show GEDI's ability to learn condition-associated gene expression changes at single-cell resolution using a recent single-cell atlas of PBMCs from healthy individuals and from mild and severe COVID-19 cases. GEDI is able to reconstruct disease-associated cell state vector fields that are consistent with pseudo-bulk approaches, while offering improved reproducibility between different cohorts. By projecting the activity of multiple transcription factors (TFs) onto our reference manifold, we also identified various groups of TFs whose activity correlated with COVID-19-associated gene expression changes in a cell-type-specific manner, including CEBPA, SP1, and AHR in monocytes. Finally, we demonstrate GEDI's ability to generalize to different data-generating distributions, which, in addition to gene expression, allows the analysis of alternative splicing and mRNA stability landscapes. We showcase this capability using single-cell RNA-seq data of mouse neurogenesis, revealing cell-type-specific cassette-exon inclusion events, mRNA stability changes that accompany neuronal differentiation, and the RNA-binding proteins and microRNAs that drive these changes. Together, these analyses highlight GEDI as a unified framework for modeling sample-to-sample variability, analyzing pathway and network activity, and studying both the transcriptional and post-transcriptional programs of the cell.
Transcription factor binding site (TFBS) prediction is a fundamental aspect of understanding gene regulatory networks. Motif overrepresentation and machine learning approaches are commonly used to predict where transcription factors bind to DNA, but they often suffer from poor specificity. Evolutionary information can partially alleviate this problem: it yields important clues about sequence function and has long been combined with other sequence-based analyses to improve the detection of functional sites, although existing approaches remain relatively crude.
This study combines genome sequences and evolutionary history across placental mammals to improve the prediction of transcription factor binding sites in the human genome. We introduce Graphylo, which integrates convolutional neural networks (CNNs) and graph convolutional networks (GCNs) to accurately predict transcription factor binding sites. The former are ideal for identifying the short sequence motifs essential for transcription factor binding, while the latter are well suited to graph-structured data such as phylogenetic trees depicting the evolutionary relationships between DNA sequences. As input, the model takes a set of orthologous and computationally reconstructed ancestral DNA sequences from various species, including a reference species of interest such as human, together with a phylogenetic tree representing their evolutionary history. By combining these inputs, Graphylo offers a more comprehensive view of gene regulation and enables evolutionary insights into how regulatory networks have evolved across species, without requiring direct input of conservation scores or evolutionary constraints, an improvement over previous models.
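The graph-convolution component can be pictured as message passing over the phylogenetic tree: each node (an extant or ancestral sequence) aggregates features from its evolutionary neighbours before a learned transformation. A minimal sketch with random weights on a hypothetical seven-node tree; Graphylo's actual architecture and training are, of course, far more elaborate:

```python
import numpy as np

# Toy phylogeny: node 0 is the root, 1-2 internal ancestors, 3-6 extant species.
# (Hypothetical tree used only to illustrate graph convolution.)
edges = [(0, 1), (0, 2), (1, 3), (1, 4), (2, 5), (2, 6)]
n = 7
A = np.zeros((n, n))
for u, v in edges:
    A[u, v] = A[v, u] = 1.0
A += np.eye(n)                       # self-loops, as in Kipf & Welling GCNs
d = A.sum(axis=1)
A_hat = A / np.sqrt(np.outer(d, d))  # symmetric degree normalization

def gcn_layer(H, W):
    """One graph-convolution layer: aggregate neighbours, transform, ReLU."""
    return np.maximum(A_hat @ H @ W, 0.0)

rng = np.random.default_rng(0)
H0 = rng.normal(size=(n, 4))         # per-node features (e.g. CNN motif scores)
W = rng.normal(size=(4, 8))
H1 = gcn_layer(H0, W)
print(H1.shape)  # (7, 8): each node now mixes information from its relatives
```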
We show on a wide variety of data sets that Graphylo consistently outperforms both state-of-the-art single-species deep learning approaches and approaches in which sequence analysis is complemented by inter-species conservation scores. The use of a species-based attention model enables evolutionary insights to be gained, while the integrated-gradients approach provides nucleotide-level model interpretability. Overall, our results suggest that by combining convolutional neural networks with graph convolutional networks, Graphylo is able to exploit evidence of negative selection on TFBSs to enhance the sequence signal observed in humans. Unlike approaches that assume functional sites remain at the same alignment positions, Graphylo only uses the alignments to extract orthologous and ancestral sequences, making it more robust to binding site turnover. Graphylo is a powerful tool for understanding gene regulation and its evolution across species.
Many deep learning approaches have been proposed to predict epigenetic profiles, chromatin organization, and transcription activity. While these approaches achieve satisfactory performance in predicting one modality from another, the learned representations are not generalizable across predictive tasks or cell types. In this paper, we propose a deep learning approach named EPCOT which employs a pre-training and fine-tuning framework and is able to accurately and comprehensively impute multiple modalities, including the epigenome, chromatin organization, the transcriptome, and enhancer activity, for new cell types, requiring only cell-type-specific chromatin accessibility profiles. Many of the predicted modalities, such as Micro-C and ChIA-PET, are expensive to generate experimentally, making EPCOT's in silico predictions especially valuable. Furthermore, the pre-training and fine-tuning framework allows EPCOT to identify generic representations that generalize across predictive tasks. Interpreting EPCOT models also provides biological insights, including mappings between genomic modalities, TF sequence binding patterns, and cell-type-specific TF impacts on enhancer activity.
The wide range of cellular phenotypes demonstrated by multi-cellular organisms is due in large part to the complex and synergistic interplay of regulatory complexes spread throughout the eukaryotic genome. These regulatory elements 'enhance' specific gene programs and have been shown to operate in diverse and distinct networks across cell types. Deep-learning approaches to enhancer prediction typically leverage information-dense DNA sequence, with newer approaches additionally incorporating relevant epigenomic datasets (e.g., ATAC-seq, PRO-seq, ChIP-seq) to improve accuracy and precision. However, the clonal expansion of DNA mutations in cancers limits the potential biomedical utility of these approaches, as the DNA sequence used for training may no longer exist in the target material (a form of biological overfitting).
We examined the feasibility of enhancer prediction using only epigenomic chromatin datasets. By training simultaneously across multiple cell types, we successfully generated a cell-type-invariant enhancer prediction platform that utilized only the pattern of chromatin marks for inference. We demonstrated that chromatin datasets are sufficient to identify enhancers genome-wide, with performance comparable to networks trained on DNA sequence. Combined with reference-genome-free alignment of epigenomic datasets, we believe this approach serves as a proof of concept for future biomedical applications.
We next investigated what features our epigenomic enhancer-prediction network had learned. However, deep-learning neural networks are considered 'black boxes' with regard to human interpretability of the features they use for inference. This makes refining networks to avoid biological overfitting challenging, as techniques to interpret what neural networks have learned often require a priori knowledge and/or are tied to specific network architectures.
In order to understand what our enhancer-prediction neural networks learned, we applied adversarial attacks via Particle Swarm Optimization (PSO). PSO is network-architecture independent, has dozens of algorithmic variants, and can be applied to any prediction engine to derive the features that drive inference. PSO was used to generate adversarial inputs that in turn were used to characterize chromatin architectures predictive of enhancers and conserved across distinct cell types. By inverting the loss function, we were also able to identify chromatin architectures that were anti-correlated with predicted enhancer function. PSO is highly computationally efficient and can provide human-readable output that reflects what trained networks consider predictive. As human interpretation of neural networks is a prerequisite to trusting and therefore applying these networks, we believe adversarial PSO is a critical addition to the deep learning toolset.
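As an illustration of the search procedure, a bare-bones PSO can recover the input pattern a predictor rewards. This is a minimal sketch with a toy scoring function standing in for a trained network; all names, marks, and values here are hypothetical.

```python
import random

# Toy stand-in for a trained network: scores a 5-mark chromatin profile.
# In real use, any prediction engine can be wrapped here, since the method
# is architecture-independent.
def predict_enhancer(profile):
    # hypothetical scoring: rewards high marks at positions 1 and 3
    return profile[1] + profile[3] - 0.5 * (profile[0] + profile[2] + profile[4])

def pso_adversarial(fitness, dim=5, n_particles=20, iters=100, seed=0):
    """PSO over the input space: each particle is a candidate input; the
    swarm converges on inputs that maximize `fitness`."""
    rng = random.Random(seed)
    pos = [[rng.uniform(0, 1) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_f = [fitness(p) for p in pos]
    g = max(range(n_particles), key=lambda i: pbest_f[i])
    gbest, gbest_f = pbest[g][:], pbest_f[g]
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                # standard update: inertia + cognitive + social terms
                vel[i][d] = (0.7 * vel[i][d]
                             + 1.5 * rng.random() * (pbest[i][d] - pos[i][d])
                             + 1.5 * rng.random() * (gbest[d] - pos[i][d]))
                pos[i][d] = min(1.0, max(0.0, pos[i][d] + vel[i][d]))
            f = fitness(pos[i])
            if f > pbest_f[i]:
                pbest[i], pbest_f[i] = pos[i][:], f
                if f > gbest_f:
                    gbest, gbest_f = pos[i][:], f
    return gbest, gbest_f

best, score = pso_adversarial(predict_enhancer)
# the converged particle reveals which marks the model treats as predictive
```

Inverting the sign of the fitness function, as described above, would instead drive the swarm toward inputs anti-correlated with the prediction.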
Presentation Overview: Show
Gene regulatory DNA sequences, and enhancers and promoters in particular, are very important for gene expression regulation in eukaryotes. However, even though the cells of these organisms seem to have no problem identifying these functional elements among millions of bases of DNA with no regulatory function, our computational models have great difficulty distinguishing functional regulatory elements from non-functional sequence, and discerning between enhancers and promoters that show activity in different conditions or cell types proves even more difficult.
In the last decade, many approaches to gene regulatory element classification have been proposed and tested. We have previously used Bayesian networks (Bonn et al. 2012), Random Forests (Herman-Iżycka et al. 2017), and support vector machines (Podsiadło et al. 2013) to predict enhancer and promoter positions in human and model organism genomes. However, despite the fact that each of these models was quite effective on its respective dataset, it has proven very difficult to translate the results obtained in one of the studied systems to other biological contexts. In recent years, the wave of methods based on artificial neural networks has shown great success in many areas, including classification tasks originating from molecular biology. One such approach, Basset (Kelley et al. 2016), applied a particular type of convolutional neural network to predicting DNA regions of accessible (open) chromatin in different tissues.
Since we recently published (Stepniak et al. 2021) an atlas of regulatory elements (promoters and enhancers) active in gliomas, we were interested to see whether we could modify the Basset model to suit our task of discerning between active and inactive enhancers and promoters in glioma samples taken from multiple patients. This was an especially interesting case study, as we not only had the positions and activity of these elements measured (by means of ChIP-seq of histone modifications), but also had these patients genotyped, allowing us to ascertain the potential role of DNA variants in the activity of the tested regulatory elements. After modifying the model and creating several different training datasets, we can not only show that our convolutional neural network provides better classification accuracy than the classical methods (AUC>80% vs <70% for Random Forests), but also that integrating patient mutations into neural network training can further increase performance (AUC even above 90%). After careful study of the internal structure of the filters learned by the network, we can also show connections between the features used by our model to classify sequences and both the DNA sequence specificity of transcription factors and DNA shape parameters (as defined by Zhou et al. 2013). A surprising outcome of this study is that many filters essential for the network's performance are attributable solely to DNA shape rather than transcription factor binding.
In summary, we present a novel, neural-network-based approach to regulatory element classification that shows performance superior to our earlier methods and allows for introspection that identifies novel features important for regulatory sequence activity.
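To illustrate the kind of feature such filters encode, a first-layer convolutional filter can be viewed as a position weight matrix scanned along one-hot-encoded DNA, with its maximum activation indicating the presence of the learned pattern. The filter below is hand-set and hypothetical, not one learned by the described network.

```python
# Map each base to a one-hot channel (columns ordered A, C, G, T).
BASES = {"A": 0, "C": 1, "G": 2, "T": 3}

def one_hot(seq):
    return [[1.0 if BASES[b] == i else 0.0 for i in range(4)] for b in seq]

def max_activation(seq, pwm):
    """Slide the filter (rows = positions, cols = A/C/G/T weights) along the
    sequence and return the strongest response, as a conv layer + max-pool would."""
    x = one_hot(seq)
    width = len(pwm)
    return max(sum(x[p + j][i] * pwm[j][i] for j in range(width) for i in range(4))
               for p in range(len(seq) - width + 1))

# hypothetical filter matching a TATA-like pattern
pwm = [[0.0, 0.0, 0.0, 1.0],
       [1.0, 0.0, 0.0, 0.0],
       [0.0, 0.0, 0.0, 1.0],
       [1.0, 0.0, 0.0, 0.0]]
strong = max_activation("GGTATAGG", pwm)  # contains TATA -> activation 4.0
weak = max_activation("GGCCGGCC", pwm)    # no match -> activation 0.0
```

Inspecting trained filters in this PWM form is what allows comparison against known transcription factor specificities and, as noted above, against DNA shape features.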
Presentation Overview: Show
Several studies have demonstrated that transposable elements (TEs) and other repetitive regions can harbor gene regulatory elements such as transcription factor binding sites. Unfortunately, repetitive regions pose problems for short-read sequencing assays such as ChIP-seq. The same TE can exist in multiple genomic regions, creating what are known as multi-mapped reads. In most ChIP-seq analysis pipelines, reads that align to multiple genomic locations are discarded during preprocessing, and thus regulatory signals occurring in repetitive regions have largely been overlooked. Here, we develop an approach to allocate multi-mapped ChIP-seq reads in an efficient, accurate, and user-friendly manner. Our method, Allo, combines probabilistic mapping of ChIP-seq reads based on nearby uniquely mapped read counts with a convolutional neural network that recognizes the read distribution features of potential ChIP-seq peaks. Allo not only provides increased accuracy in multi-mapped read assignment compared to previously published methods, it also allows for read-level output in the form of a corrected alignment file. Therefore, the output of our method can be fed into any downstream peak caller and is easily added to existing pipelines with very few modifications. To show the utility and validity of our approach, we analyzed a CTCF ChIP-seq dataset using Allo. We used both motif analysis and Hi-C data at TAD boundaries to validate the thousands of new peaks found only with Allo. Additionally, we show the application of Allo in finding peaks within paralogous gene families using a collection of ENCODE datasets. The effect of multi-mapped reads on duplicated gene families has not been extensively studied before. We highlight the importance of including multi-mapped reads arising from paralogous gene families using a PARIS ChIP-seq dataset.
Peaks found after the inclusion of Allo suggest a novel pattern of PARIS binding within the transcription start sites of the Wiskott-Aldrich Syndrome family of proteins.
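The core allocation idea can be sketched as follows. This is a deliberately simplified illustration of probabilistic assignment guided by nearby unique coverage, not the Allo implementation itself, which additionally weighs candidate loci with a CNN over read distributions.

```python
# A multi-mapped read is assigned to its candidate loci in proportion to the
# uniquely mapped read counts observed near each locus.
def allocate_multimapped(unique_counts, pseudocount=1.0):
    """unique_counts: uniquely mapped read counts near each candidate locus.
    Returns the probability of assigning the read to each locus."""
    weights = [c + pseudocount for c in unique_counts]  # avoid zero weights
    total = sum(weights)
    return [w / total for w in weights]

# Three candidate loci for one multi-mapped read; the first has the deepest
# unique coverage and therefore receives the highest assignment probability.
probs = allocate_multimapped([30, 10, 0])
```

Writing these fractional (or sampled) assignments back into a corrected alignment file is what lets any downstream peak caller consume the result unchanged.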
Presentation Overview: Show
Small nucleolar RNAs (snoRNAs) are structured noncoding RNAs present in multiple copies within eukaryotic genomes. SnoRNAs guide chemical modifications on their target RNAs and regulate processes like ribosome assembly and splicing. Most human snoRNAs are embedded within host gene introns, the remainder being independently expressed from intergenic regions. We recently characterized the abundance of snoRNAs and their host genes across several healthy human tissues and found that the level of most snoRNAs does not correlate with that of their host gene, and that snoRNAs embedded within the same host gene often differ drastically in abundance [1]. Current knowledge of the mechanisms regulating snoRNA expression dates back more than 20 years, when it was shown, for only a small subset of snoRNAs, that the formation of a terminal stem and the distance between the snoRNA and its intronic branch point were critical features for expression [2-3]. Considering that recent annotations comprise more than 1500 snoRNAs, it is unclear whether these features remain relevant for most snoRNAs. To better understand the determinants of snoRNA expression, we trained several machine learning models to predict whether snoRNAs are expressed or not in human tissues based on more than 30 collected features related to snoRNAs and their genomic context. By interpreting the models' predictions using SHAP values, we find that snoRNAs rely on conserved motifs, a stable global structure and terminal stem, as well as on a transcribed locus, to be expressed. We observe that these features explain well the varying abundance of snoRNAs embedded within the same host gene. By predicting the expression status of snoRNAs across several vertebrates, we notice that, as in human, only about one third of all annotated snoRNAs are expressed per genome.
Our results suggest that ancestral snoRNAs disseminated within vertebrate genomes, sometimes leading to the development of new functions and a probable gain in fitness that conserved features favorable to the expression of these few snoRNAs, while the large remainder often degenerated into snoRNA pseudogenes. This work is under revision in Genome Research.
[1] Fafard-Couture et al. 2021. Annotation of snoRNA abundance across human tissues reveals complex snoRNA-host gene relationships. Genome Biology.
[2] Darzacq et al. 2000. Processing of Intron-Encoded Box C/D Small Nucleolar RNAs Lacking a 5′,3′-Terminal Stem Structure. Molecular and Cellular Biology.
[3] Hirose et al. 2001. Position within the host intron is critical for efficient processing of box C/D snoRNAs in mammalian cells. PNAS.
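As a rough illustration of model-agnostic feature attribution, the sketch below uses permutation importance as a simpler stand-in for the SHAP analysis described above: each feature is scored by how much shuffling it degrades the model's accuracy. The model and features here are hypothetical toys.

```python
import random

def accuracy(model, X, y):
    return sum(model(x) == t for x, t in zip(X, y)) / len(y)

def permutation_importance(model, X, y, n_features, seed=0):
    """Score each feature by the drop in accuracy after shuffling its column."""
    rng = random.Random(seed)
    base = accuracy(model, X, y)
    scores = []
    for j in range(n_features):
        col = [x[j] for x in X]
        rng.shuffle(col)
        Xp = [list(x) for x in X]
        for i, v in enumerate(col):
            Xp[i][j] = v
        scores.append(base - accuracy(model, Xp, y))
    return scores

# Hypothetical toy model: "expressed" iff feature 0 (say, host-locus
# transcription) is high; feature 1 is noise the model ignores.
model = lambda x: int(x[0] > 0.5)
X = [[random.Random(i).random(), random.Random(i + 99).random()] for i in range(200)]
y = [int(x[0] > 0.5) for x in X]
imp = permutation_importance(model, X, y, 2)
# feature 0 receives a large importance score; feature 1 receives none
```

SHAP values refine this idea by attributing each individual prediction to features with game-theoretic guarantees, rather than scoring features globally.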
Presentation Overview: Show
DNA methylation levels are known to change with age, exposures, cis-genetic variants and disease status. I will describe an over-dispersed quasi-binomial model with functional smoothing to analyze high-resolution bisulfite sequencing measures of methylation in small genomic regions. Two types of overdispersion are needed to capture variability, and I will also discuss feature selection in this framework. Results will be illustrated with data on people with high and low levels of anti-citrullinated protein antibodies, a known risk factor for rheumatoid arthritis.
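The need for overdispersion can be illustrated with a small sketch. This is an assumed moment-based diagnostic, not the authors' functional-smoothing model: under a plain binomial the dispersion factor is about 1, while phi > 1 signals the extra-binomial variability a quasi-binomial variance Var = phi * n * p * (1 - p) is designed to absorb.

```python
# Estimate a quasi-binomial overdispersion factor phi from methylated/total
# read counts at CpG sites, as Pearson's X^2 over its residual degrees of freedom.
def overdispersion(meth, total):
    p_hat = sum(meth) / sum(total)  # pooled methylation proportion
    pearson = sum((m - n * p_hat) ** 2 / (n * p_hat * (1 - p_hat))
                  for m, n in zip(meth, total))
    return pearson / (len(meth) - 1)

# Toy counts with wildly varying methylation across sites at equal depth:
# far more variable than a single binomial proportion allows, so phi >> 1.
phi = overdispersion(meth=[2, 28, 1, 30, 0], total=[30, 30, 30, 30, 30])
```

The model described above goes further, using two types of overdispersion and functional smoothing across positions in the region.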
Presentation Overview: Show
Traditional approaches to probabilistic phylogenetic inference have relied on information theoretic criteria to select among a relatively small set of substitution models. These model selection criteria have recently been called into question when applied to richer models, including models that invoke mixtures of nucleotide frequency profiles. At the nucleotide level, we are therefore left without a clear picture of mixture models' contribution to overall predictive power relative to other modeling approaches. Here, we utilize a Bayesian cross-validation method to directly measure the predictive performance of a wide range of nucleotide substitution models. We compare the relative contributions of free nucleotide exchangeability parameters, gamma-distributed rates across sites, and mixtures of nucleotide frequencies with both finite and infinite mixture frameworks. We find that the most important contributor to a model's predictive power is the use of a sufficiently rich mixture of nucleotide frequencies. These results suggest that mixture models should be given greater consideration in nucleotide-level phylogenetic inference.
Presentation Overview: Show
The increasing amount of available genomic sequences calls for effective tools for annotating biological sequences. Inferring the function of a gene from its orthologs has been of significant use in comparative genomics, and interest in orthology between genes has led to the design of several gene-centered databases. Alternative splicing, which contributes widely to the diversity of transcriptomes and proteomes in eukaryotes, makes the transcript a refined level of functional homology relationships, thus calling for orthology inference methods and databases at the transcript level.
In this work, we present a transcript-centric database and a new method based on splicing structure to compute clusters of conserved transcripts for the reconstruction of transcript and gene phylogenies. TranscriptDB contains data obtained from the Ensembl resource. The gene annotation associated with each transcript is also provided, including gene homology information that is used to infer transcript homology relations. The collected and computed data are loaded into a custom PostgreSQL relational database. The database provides relevant clusters of conserved transcripts, and transcript phylogenies computed using a new transcript similarity measure that evaluates the quantity of homologous nucleotides and exonic regions shared by transcripts. TranscriptDB provides a user-friendly web browser interface available at https://transcriptdb.cobius.usherbrooke.ca.
TranscriptDB CLUSTERING ALGORITHM
The algorithm is a graph-clustering algorithm designed to identify conserved transcripts between homologous genes. The clustering method relies on an improved reciprocal best hits approach to identify pairs of transcripts from homologous genes that share similar splicing structure.
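A minimal sketch of the reciprocal-best-hit pairing step is shown below. It is simplified relative to the improved approach described above, and the transcript names and similarity scores are hypothetical.

```python
# Transcripts t_a, t_b from two homologous genes are paired iff each is the
# other's best-scoring match under the splicing-structure similarity.
def reciprocal_best_hits(scores):
    """scores: dict mapping (transcript_a, transcript_b) -> similarity."""
    best_a, best_b = {}, {}
    for (a, b), s in scores.items():
        if s > best_a.get(a, (None, -1))[1]:
            best_a[a] = (b, s)
        if s > best_b.get(b, (None, -1))[1]:
            best_b[b] = (a, s)
    # keep only pairs where the best-hit relation holds in both directions
    return sorted((a, b) for a, (b, _) in best_a.items()
                  if best_b[b][0] == a)

pairs = reciprocal_best_hits({
    ("g1.t1", "g2.t1"): 0.9, ("g1.t1", "g2.t2"): 0.4,
    ("g1.t2", "g2.t1"): 0.3, ("g1.t2", "g2.t2"): 0.8,
})
# -> [("g1.t1", "g2.t1"), ("g1.t2", "g2.t2")]
```

Chaining such pairwise matches across all pairs of homologous genes yields the clusters of conserved transcripts used downstream for phylogeny reconstruction.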
TRANSCRIPT PHYLOGENIES RECONSTRUCTION
The algorithm consists of first constructing transcript subtrees congruent with gene trees for each transcript cluster resulting from the TranscriptDB clustering. Then, a transcript supertree is constructed by combining all subtrees using the transcript similarity measure. The resulting supertree is used to infer homology relationships between transcripts.
TranscriptDB TOOLS AND APPLICATION
The database provides quick access to transcript information via its web interface, which includes a multi-scale graphical view showing conserved transcripts within their transcript trees, particularly useful for examining the evolution of conserved transcripts, the distinct types of homology relations between transcripts, and putative isoorthologous transcripts. It also provides an interactive view of the gene model and gene structure. TranscriptDB may be useful for the functional annotation of proteins between genomes and for identifying the type of relationship between transcripts in multiple species. Future versions of TranscriptDB will include data from the newest versions of Ensembl and 3D visualization of proteins. Finally, the interface allows users to retrieve a set of specific genes from given species; all information about the exon-intron structure of their transcripts can be downloaded in FASTA or CSV format.
Presentation Overview: Show
Background: Characterizing protein function is key to understanding molecular interactions in biological systems. Researchers often parameterize proteins by their sequence, structure, and function. Over the past two decades, the scope of protein study has expanded through computational tools like BLAST (https://blast.ncbi.nlm.nih.gov/Blast.cgi) and InterProScan (https://www.ebi.ac.uk/interpro/about/interproscan/), which allow for the comparison of homologs (i.e., similar proteins) and the detection of domains (i.e., structural/functional subunits of proteins), respectively. These tools have expanded protein features to include comparative homology, domains, and domain architectures. The role of domains in protein function is analogous to that of motifs built from residues: motifs are strings of individual residues, smaller than domains, with common functionalities based on their biochemical makeup, while domains zoom out further than motifs to characterize protein structure and function on a slightly larger scale. A popular domain-scanning application is InterProScan, which utilizes various computational methods and databases to annotate domains on query protein sequences. With domain detection tools available, there is an open opportunity for comparative analysis of domains across proteins, i.e., capturing evolutionary context by comparing the conservation of domain architectures across protein lineages. Particularly in rapidly evolving organisms like bacteria, there is an exceptional opportunity to identify key domains that span species, genera, and higher taxonomic classes and phyla. Furthermore, the understudied scope of domain architectures can be paired with microbial phenotypes, such as antibiotic resistance, to predict which species can develop resistance.
Approach: We developed MolEvolvR [DOI: https://doi.org/10.1101/2022.02.18.461833; web-app: http://jravilab.org/molevolvr/], a free, user-friendly web application that can analyze hundreds of proteins in parallel and focuses on domains in an evolutionary context. Input protein(s) undergo a homology search with subsequent domain detection and secondary structure/localization predictions. After this initial process, the final analysis produces interactive, queryable, and downloadable publication-ready tables and figures that compare domains on a phylogenetic scale for all query protein(s). Results can be filtered to include/exclude proteins by their taxa, domain architecture, and homology metadata, when relevant. A particularly notable analysis is a generated network of domains (nodes) connected by edges to represent the space of architectures of all homologous proteins, with node size and edge weights proportional to frequency of occurrence. Additional analyses include phylogenetic trees, MSAs, and upset plots showing the distribution of domains and their architectures across the tree of life. Amino acid sequences (as FASTA, or NCBI/UniProt accession numbers) are the typical input data type; however, the app also supports BLAST and InterProScan results, and MSAs, as input.
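The domain-architecture network can be sketched in a few lines. The architectures below are hypothetical toys; in the actual application the network is built from the detected architectures of all homologous proteins.

```python
from collections import Counter

# Each architecture is an ordered list of domains along a protein.
architectures = [
    ["SH3", "Kinase"], ["SH3", "Kinase", "PDZ"], ["Kinase", "PDZ"],
]

# Nodes are domains; edges link adjacent domains in an architecture.
# The counts correspond to node size and edge weight in the rendered network.
node_counts = Counter(d for arch in architectures for d in arch)
edge_counts = Counter(pair for arch in architectures
                      for pair in zip(arch, arch[1:]))
# node_counts["Kinase"] == 3; edge_counts[("SH3", "Kinase")] == 2
```

Frequently traversed edges in such a network highlight conserved domain adjacencies across the homologs of a query protein.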
The web-app is deployed as an R Shiny server and can be found at http://jravilab.org/molevolvr/. The backend uses a combination of R and shell scripts, while the front-end is written in R Shiny. We have tested the web-app on Mac, Windows, and Linux operating systems with Chrome, Brave, Firefox, and Safari browsers.
Results: Beyond MolEvolvR itself, a user-friendly web-app for protein characterization through molecular evolution and phylogeny, the methods underlying MolEvolvR have been used to study the PSP stress response system by comparing PSP-linked domains and genomic contexts across the tree of life [Ravi et al., 2020 bioRxiv; DOI: https://doi.org/10.1101/2020.09.24.301986]. A web-app instance (https://jravilab.shinyapps.io/psp-evolution) was created that utilizes the same analysis methods as MolEvolvR. Currently, our group utilizes these methods to study several microbial phenotypes, such as antimicrobial resistance, nutrient acquisition, and host specificity, through the lens of evolution.
Presentation Overview: Show
Phylogenetic inference is among the most extensively studied tasks in computational biology. In this work, we consider specifically the NP-hard large parsimony problem, which underpins applications not only in the study of species evolution, but also in tumor progression. While many heuristics exist that are capable of estimating the most parsimonious tree relating a set of taxa, the task of accurately sampling from the posterior over all possible tree topologies at scale is highly valuable and currently under-studied.
Methodology:
We investigate the use of Generative Flow Networks (GFlowNets) [Bengio et al., 2021] for phylogenetic inference. GFlowNets learn to construct compositional objects from a discrete space X by taking a series of actions sampled from a stochastic policy network, with the objective that the likelihood of sampling each x ∈ X is proportional to a pre-defined reward function.
We propose to train a GFlowNet policy to sample from the set of trees proportionately to a Boltzmann distributed reward function with an energy term corresponding to the tree’s parsimony score. This is achieved by repeatedly joining sub-groups of taxa. During the course of training, we employ a temperature annealing schedule to recover trees with lower mutation counts.
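The reward design can be sketched as follows, with illustrative parsimony scores; the actual method trains a policy network to sample trees rather than enumerating them.

```python
import math

# Boltzmann-distributed reward: the energy term is the tree's parsimony
# score, so low-mutation trees receive exponentially larger reward.
def boltzmann_reward(parsimony_score, temperature):
    return math.exp(-parsimony_score / temperature)

def sampling_probs(scores, temperature):
    """Target sampling distribution over an (enumerable) set of trees."""
    r = [boltzmann_reward(s, temperature) for s in scores]
    z = sum(r)  # partition function over the tree space
    return [x / z for x in r]

scores = [10, 12, 15]  # parsimony scores of three hypothetical candidate trees
early = sampling_probs(scores, temperature=5.0)
late = sampling_probs(scores, temperature=0.5)
# as T anneals down, probability mass concentrates on the 10-mutation tree
```

This is why the annealing schedule helps recover trees with lower mutation counts: lowering T sharpens the target distribution toward the most parsimonious topologies while still assigning mass to near-optimal alternatives.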
Results:
We experimentally validate our method on two types of datasets. First, as a proof of concept, we test on a small-scale dataset of 10 transfer RNAs — a case where we can exhaustively enumerate all possible trees and compare the model-learnt sampling probability to the ground truth. Second, we experiment on the DS1-DS7 benchmarking datasets [Garey et al., 1996, Hedges et al., 1990, Lakner et al., 2008, Rossman et al., 2001, Yang and Yoder, 2003, Henk et al., 2003, Zhang and Blackwell, 2001], featuring up to 64 species (i.e., 10^103 possible rooted phylogenetic trees) and up to 2500 sites. While it is intractable to examine the full tree space due to the large number of species, we compare the best phylogenetic trees sampled from our GFlowNet with solutions produced by the state-of-the-art method PAUP* (version 4.0) [Swofford et al., 2003]. We show that on the small-scale dataset, our method correctly learns the parsimony-score-defined distribution over phylogenetic trees. On the larger-scale datasets, by learning a distribution that strongly favors low-mutation trees, our GFlowNet method can efficiently discover the optimal solutions identified by PAUP*.
Overall, our GFlowNet-based tree sampling method efficiently and accurately samples from parsimony-score-defined posterior distributions over trees. This will prove valuable in contexts where phylogenetic inference needs to capture uncertainty in the inferred trees.
Presentation Overview: Show
The consequences of mutations on individual health are influenced by the regulation of gene expression. Allele-specific expression (ASE) is the preferential expression of one of two alleles, and is modulated following changes in genetic variation or environmental exposures. ASE can be caused by genetic regulatory variation, post-transcriptional modifications, or epigenetic alterations, and has the potential to alter phenotype and disease risk. Although ASE is a pervasive phenomenon affecting genetic regulation across tissues and phenotypes, it remains unknown how ASE contributes to variation in aging processes. Here, we utilize bulk RNA sequencing and single-cell RNA sequencing from blood of >1,000 selected individuals from the Canadian Partnership for Tomorrow’s Health, to evaluate ASE changes in blood cells during aging. We show that the number of SNPs with ASE increases as individuals age, particularly in cells involved in adaptive immunity (CD4+ T cells, CD8+ T cells, and B cells). We demonstrate that individuals who are healthy based on a calculated risk score from blood traits show larger increases in ASE with age compared to unhealthy individuals. By stratifying ASE sites into common and rare events, we observe that aged individuals have a higher proportion of common ASE events in genes involved in immune response. We further show that aged individuals with low health risk have a larger proportion of ASE in immune genes compared to high risk individuals. Genes involved in immunity are under strong selective pressures, and ASE variability may be beneficial for adaptability and response against pathogens. We also demonstrate that increases in ASE may decrease risk of pre-treated cardiometabolic traits, including hypertension and type 2 diabetes, however, opposite relationships are observed for cancer cases and pre-cancer samples. 
We further show that individuals taking anti-hypertensive and statin medications have larger overall proportions of ASE, demonstrating an environmental impact that medications may have on ASE, which may also contribute to the variability of ASE observed during aging. Our results suggest that increases in ASE in immune processes may be beneficial during aging by reducing the risk of mortality and cardiometabolic diseases.
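A minimal sketch of how ASE can be called at a single heterozygous site is an exact binomial test against the balanced 50/50 expectation; this is a generic illustration with made-up read counts, not necessarily the study's pipeline.

```python
from math import comb

def binom_two_sided(k, n, p=0.5):
    """Two-sided exact binomial test: sum the probability of every outcome
    at least as extreme (no more probable) than the observed count k."""
    pmf = lambda i: comb(n, i) * p**i * (1 - p)**(n - i)
    pk = pmf(k)
    return sum(pmf(i) for i in range(n + 1) if pmf(i) <= pk + 1e-12)

# 28 of 40 RNA-seq reads at a heterozygous SNP carry the reference allele;
# a small p-value flags the site as showing allele-specific expression.
pval = binom_two_sided(k=28, n=40)
```

Counting sites whose p-value survives multiple-testing correction, per individual and per cell type, yields the kind of per-individual ASE burden compared across ages above.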
Presentation Overview: Show
The Bayesian phylogenetic community is exploring faster and more scalable alternatives to the Markov chain Monte Carlo (MCMC) approach for approximating the high-dimensional Bayesian posterior. The search for substitutes is motivated by falling computational costs, increasing challenges in large-scale data analysis, advances in inference algorithms, and the implementation of efficient computational frameworks. Some alternatives are adaptive MCMC, Hamiltonian Monte Carlo, sequential Monte Carlo, and variational inference (VI). Until recently, few studies had applied classical variational approaches to probabilistic phylogenetic models. However, VI has started to gain traction in the phylogenetic community, taking advantage of advances that made it more scalable, generic, and accurate, such as stochastic and black-box VI algorithms, latent-variable reparametrization, and probabilistic programming. These advancements have allowed the design of powerful and fast variational algorithms to infer complex phylogenetic models and analyze large-scale phylodynamic data.
Bayesian methods incorporate the practitioner's prior knowledge about the likelihood parameters through the prior distributions. Defining an appropriate and realistic prior is difficult, especially in small data regimes, with highly similar sequences, or for parameters with complex correlations. Notably, variational phylogenetic methods assign fixed prior distributions with default hyperparameters to the likelihood parameters, a practice shared with MCMC methods. However, such a choice could bias the posterior approximation and induce high posterior probabilities in cases where the data are weak or the actual parameter values do not fall within the range specified by the priors.
Here, we show that variational phylogenetic inference can also suffer from misspecified priors on branch lengths and, less severely, on sequence evolutionary parameters. Further, we propose an approach and an implementation framework (nnTreeVB) that relaxes the rigidity of the prior densities by learning their parameters using a gradient-based method and a neural-network-based parameterization. We applied this approach to estimate branch lengths and evolutionary parameters under several Markov chain substitution models. Simulation results show that the approach is powerful in estimating branch lengths and evolutionary model parameters. They also show that a flexible prior model provides better results than a predefined prior model. Finally, the results highlight that using neural networks could improve the initialization of the optimization of the prior density parameters.
Reference: Remita A.M., Kiani G. and Diallo A.B. (2023) Prior Density Learning in Variational Bayesian Phylogenetic Parameters Inference. https://arxiv.org/abs/2302.02522.
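The idea of learning a prior's parameters by gradient-based optimization can be reduced to a toy sketch: fitting the rate of a single exponential prior over branch lengths by gradient ascent on the log-density of observed lengths. The described framework instead learns prior parameters jointly within variational inference, with neural-network parameterizations; the values below are hypothetical.

```python
import math

def fit_exponential_rate(branch_lengths, lr=0.05, steps=500):
    """Gradient ascent on sum_i log Exp(b_i | rate), optimizing log(rate)
    so the rate stays positive."""
    log_rate = 0.0
    for _ in range(steps):
        rate = math.exp(log_rate)
        # d/d(log_rate) of sum_i [log(rate) - rate * b_i] = sum_i [1 - rate * b_i]
        grad = sum(1 - rate * b for b in branch_lengths)
        log_rate += lr * grad / len(branch_lengths)
    return math.exp(log_rate)

# Hypothetical branch lengths with mean 0.125; the fitted rate approaches
# the maximum-likelihood value 1 / mean = 8.0.
rate = fit_exponential_rate([0.1, 0.2, 0.15, 0.05])
```

In the full variational setting, the same gradient machinery updates the prior parameters alongside the variational parameters at each optimization step, which is what makes the prior "flexible" rather than fixed.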
Presentation Overview: Show
Enrichment analysis can lend mechanistic insight, suggest candidate biomarkers and therapeutic targets of disease, and allow integration of findings with other 'omes such as the transcriptome. To this end, we have built two complementary tools, RaMP-DB and MetaboSPAN. RaMP-DB is a newly renovated integrated knowledge base, API, R package, and online interface for generating biological and chemical insight into metabolomic, proteomic, and transcriptomic data. The new RaMP-DB version (2.0) features several major improvements over its predecessor, including chemical structure and class annotations for metabolites, improved pathway annotation coverage for lipids, new pathway enrichment analysis visuals, and enrichment analyses supporting the inclusion of custom backgrounds. MetaboSPAN, in turn, is a specialized metabolomic pathway analysis method that leverages RaMP-DB and aims to compensate for inconsistent coverage of the metabolome in metabolomics experiments.
RaMP-DB is implemented as a MySQL database, an R package, an API, and a user-friendly web application. Python scripts acquire data for RaMP-DB from our primary sources (HMDB/KEGG, WikiPathways, Reactome, ChEBI, LIPID MAPS, and Rhea) and parse annotations associated with pathways, reactions, ontologies, and chemical structures/classes. A semi-automated entity curation system flags faulty mappings between databases for subsequent manual curation. The contents of RaMP-DB 2.0 are regularly updated, with the current version containing 256,086 distinct metabolites, 15,827 genes/enzymes, 53,831 distinct pathways, 412,775 mappings between metabolites and pathways, 401,303 mappings between genes/enzymes and pathways, and 60,476 biochemical reactions from the Rhea database. Chemical properties such as InChIKeys and chemical classes (ClassyFire) are available for 256,592 metabolites. Further, the most recent ontologies from HMDB 5.0, including relevant portions of the new chemical functional ontology (CFO), such as biofluid/tissue of origin, are now included. Lastly, RaMP-DB's new web interface supports 8 different single and batch queries on analytes, pathways, chemical annotations, and enzyme/metabolite reactions, as well as pathway and chemical enrichment analysis.
MetaboSPAN was designed to account for incomplete coverage of the metabolome within individual metabolomic experiments and to infer additional activity to aid in hypothesis generation. MetaboSPAN builds similarity networks based on annotations within RaMP-DB 2.0, where nodes are metabolites and edges encode shared annotations between adjacent metabolites. The algorithm then uses network topological analysis to identify clusters of metabolites related to a list of metabolites of interest (e.g., those altered in a disease), which then undergo pathway enrichment testing.
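The network-expansion step can be caricatured as follows, with hypothetical annotations and a simple boundary-edge expansion standing in for the topological analysis used in practice.

```python
from itertools import combinations

# Toy annotation table: metabolites sharing any annotation are linked.
annotations = {
    "citrate": {"TCA cycle"}, "malate": {"TCA cycle"},
    "fumarate": {"TCA cycle"}, "cholesterol": {"Steroid biosynthesis"},
}
edges = {frozenset((a, b)) for a, b in combinations(annotations, 2)
         if annotations[a] & annotations[b]}

def expand(seed_metabolites):
    """Grow the measured seed set by walking shared-annotation edges,
    pulling in unmeasured neighbors before enrichment testing."""
    cluster = set(seed_metabolites)
    grew = True
    while grew:
        grew = False
        for e in edges:
            if len(e & cluster) == 1:  # edge crossing the cluster boundary
                cluster |= e
                grew = True
    return cluster

cluster = expand({"citrate"})  # malate and fumarate join; cholesterol does not
```

Running pathway enrichment on the expanded cluster, rather than on the measured metabolites alone, is what compensates for metabolites the assay failed to detect.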
To validate MetaboSPAN's performance, we designed several simulation experiments comparing MetaboSPAN against existing pathway analysis strategies (Globaltest, Fisher's exact test, NetGSA, and FELLA). Notably, identical pathway libraries were used for each approach so as not to bias results. Our results show that MetaboSPAN yields higher sensitivity for altered-pathway detection without inflating false positive findings. We further evaluated MetaboSPAN and other methods on two independent datasets generated from the same cohort on different platforms, in which metabolite coverage was almost completely different (sharing just one metabolite in common), allowing us to compare the overlap in significant pathway findings. We found that MetaboSPAN improved the concordance of pathway results obtained from each dataset compared to several of the baseline methods we tested against.
Both RaMP-DB and MetaboSPAN are open-source, publicly available resources. The online interface for RaMP-DB can be found at https://rampdb.nih.gov/, whereas the R package for MetaboSPAN can be downloaded at https://github.com/andyptt21/metabospan. Overall, RaMP-DB is a robust, comprehensive and well-maintained resource for functional annotations for metabolites and metabolic transcripts, and MetaboSPAN is a novel functional enrichment strategy that leverages these annotations to compensate for difficulties in metabolite detection and identification.
Presentation Overview: Show
Clustering genes in similarity graphs is an important step for orthology inference methods. Most algorithms group genes without considering their species, which results in clusters that contain several paralogous genes. Moreover, clustering is known to be problematic when in-paralogs arise from ancestral duplications. We have introduced a two-step process that avoids these problems: first, we infer clusters containing only orthologs (i.e., only genes from different species), and second, we infer the inter-cluster orthologs. In this work, we focus on the first step, which leads to a problem that we call Colorful Clustering. In general, this is as hard as classical clustering. However, in similarity graphs the number of species is usually small, as is the neighborhood size of genes in other species. We therefore study the clustering problem in which the number of colors (species) is bounded by k and each gene has at most d neighbors in another species.
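The defining constraint can be stated in a few lines; the check below is illustrative, with hypothetical gene and species names.

```python
# In colorful clustering, a valid cluster of candidate orthologs contains
# at most one gene per species (i.e., per color).
def is_colorful(cluster, species_of):
    colors = [species_of[g] for g in cluster]
    return len(colors) == len(set(colors))

species_of = {"geneA1": "human", "geneB1": "mouse", "geneA2": "human"}
ok = is_colorful({"geneA1", "geneB1"}, species_of)       # valid cluster
bad = is_colorful({"geneA1", "geneA2"}, species_of)      # two human paralogs
```

Bounding the number of colors by k and each gene's cross-species neighborhood by d is what makes the otherwise hard clustering problem tractable on realistic similarity graphs.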
Presentation Overview
Databases of biomedical knowledge are rapidly proliferating and growing, with recent advances (such as the RTX-KG2 knowledge base that we have recently developed (Wood et al. 2022)) increasingly focusing on integration of knowledge under a standardized schema and semantic layer (i.e., controlled vocabularies for types of concepts and types of relationships, such as the Biolink standard (Unni et al. 2022)). The rise of standardized knowledge bases sets the stage for the development of computational systems that can systematically discover novel connections between drugs and diseases (i.e., large-scale computational drug repurposing) or answer other kinds of translational questions (e.g., "What anticonvulsants are likely to have drug-drug interactions with cannabinoids?" (Vázquez et al. 2020) or "What drugs would downregulate expression of RHOBTB2 in the central nervous system?" (Foksinska et al. 2022)). Building such a system requires improved methods and representation languages for knowledge-graph-based computational reasoning. Previous efforts contributed myriad tools and approaches, but progress for biomedical reasoning systems has been hindered by (1) the lack of an expressive analysis workflow language for translational reasoning and (2) the lack of an associated reasoning engine that federates semantically integrated knowledge bases.
As a part of the NCATS Translator project (Biomedical Data Translator Consortium 2019), we have developed ARAX (Glen et al. 2023), which is a new computational reasoning system for translational biomedicine that combines (1) an innovative workflow language (ARAXi), (2) a comprehensive semantically-unified biomedical knowledge graph (RTX-KG2), and (3) a versatile and novel method for scoring search results. Users or application-builders can query ARAX via a web browser interface or a web application programming interface. ARAX enables users to encode translational biomedical questions and to integrate knowledge across sources to answer the user’s query and facilitate exploration of results. To illustrate ARAX’s application and utility in specific disease contexts, we will present and discuss several use-case examples.
The source code and technical documentation for building the ARAX server-side software and its built-in knowledge database are freely available online (https://github.com/RTXteam/RTX). We provide a hosted ARAX service with a web browser interface at arax.rtx.ai and a web application programming interface (API) endpoint at arax.rtx.ai/api/arax/v1.3/ui/.
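Questions like the examples above are posed to the API as structured query graphs. The sketch below constructs a Translator-style (TRAPI) query-graph payload asking "which chemicals treat a given disease?"; the identifiers are illustrative, the payload is only built and printed here (not sent), and the exact format accepted by the ARAX endpoint should be checked against its API documentation.

```python
import json

def build_query_graph(subject_category, predicate, object_curie):
    """Build a TRAPI-style query graph: an unbound node n0 constrained
    by category, a pinned node n1 identified by a CURIE, and an edge
    asking for the given predicate between them. Illustrative only."""
    return {
        "message": {
            "query_graph": {
                "nodes": {
                    "n0": {"categories": [subject_category]},
                    "n1": {"ids": [object_curie]},
                },
                "edges": {
                    "e0": {"subject": "n0", "object": "n1",
                           "predicates": [predicate]},
                },
            }
        }
    }

# Illustrative CURIE; MONDO:0005148 denotes type 2 diabetes mellitus.
query = build_query_graph("biolink:ChemicalEntity", "biolink:treats",
                          "MONDO:0005148")
print(json.dumps(query, indent=2))
```

A reasoning engine answers such a query by finding subgraphs of the knowledge graph that match the pattern, binding concrete chemicals to the unbound node n0 and scoring the results.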
References:
Biomedical Data Translator Consortium. 2019. “Toward A Universal Biomedical Data Translator.” Clinical and Translational Science 12 (2): 86–90.
Foksinska, Aleksandra, Camerron M. Crowder, Andrew B. Crouse, Jeff Henrikson, William E. Byrd, Gregory Rosenblatt, Michael J. Patton, et al. 2022. “The Precision Medicine Process for Treating Rare Disease Using the Artificial Intelligence Tool miniKanren.” Frontiers in Artificial Intelligence 5 (September): 910216.
Glen, Amy K., Chunyu Ma, Luis Mendoza, Finn Womack, E. C. Wood, Meghamala Sinha, Liliana Acevedo, et al. 2023. “ARAX: A Graph-Based Modular Reasoning Tool for Translational Biomedicine.” bioRxiv. https://doi.org/10.1101/2022.08.12.503810.
Unni, Deepak R., Sierra A. T. Moxon, Michael Bada, Matthew Brush, Richard Bruskiewich, J. Harry Caufield, Paul A. Clemons, et al. 2022. “Biolink Model: A Universal Schema for Knowledge Graphs in Clinical, Biomedical, and Translational Science.” Clinical and Translational Science 15 (8): 1848–55.
Vázquez, Marta, Natalia Guevara, Cecilia Maldonado, Paulo Cáceres Guido, and Paula Schaiquevich. 2020. “Potential Pharmacokinetic Drug-Drug Interactions between Cannabinoids and Drugs Used for Chronic Pain.” BioMed Research International 2020 (August): 3902740.
Wood, E. C., Amy K. Glen, Lindsey G. Kvarfordt, Finn Womack, Liliana Acevedo, Timothy S. Yoon, Chunyu Ma, et al. 2022. “RTX-KG2: A System for Building a Semantically Standardized Knowledge Graph for Translational Biomedicine.” BMC Bioinformatics 23 (400). https://doi.org/10.1186/s12859-022-04932-3.
Presentation Overview
Multidomain proteins are mosaics of structural or functional modules, called domains. The architecture of a multidomain protein (that is, its domain composition in N- to C-terminal order) is intimately related to its function, with each module playing a distinct functional role. For example, in cell signaling proteins, distinct domains are responsible for recognition and response to a stimulus. Multidomain architectures evolve via gain and loss of domain-encoding segments. This evolutionary exploration of domain architecture composition underlies the protein diversity seen in nature.
We present a framework based on information retrieval and natural language processing-inspired models for exploring the varied composition of domain architectures. Domain architectures are represented as vectors in a multidimensional space, where distances quantify the relationship between domain architectures. This extends to set-wise distances for the quantitative comparison of two sets of domain architectures. Our framework has many applications, including investigating taxonomic differences in the domain architecture complement and testing domain architecture simulators by assessing how well simulated domain architectures recapitulate properties of genuine ones. Here, we apply this framework to investigate the constraints on the formation of domain combinations. Only a tiny fraction of all possible domain combinations is observed in nature, suggesting that domain order and co-occurrence are highly constrained, but these constraints are poorly understood. We introduce a null model that generates architectures whose properties deviate from those of genuine domain architectures. Comparing the properties of domain architectures that do and do not occur in nature may shed light on the design rules of multidomain architecture composition.
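The vector-space idea can be sketched minimally: represent each architecture by counts of its domains plus adjacent-domain bigrams (so that N- to C-terminal order contributes to the representation), and compare architectures by cosine similarity. This is a simplified stand-in for the NLP-inspired models described above; the domain names are hypothetical.

```python
from collections import Counter
from math import sqrt

def architecture_vector(domains):
    """Sparse count vector over a domain architecture's unigrams and
    adjacent-domain bigrams (bigrams retain N->C order)."""
    feats = Counter(domains)
    feats.update(zip(domains, domains[1:]))  # ordered adjacent pairs
    return feats

def cosine_similarity(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[f] * v[f] for f in u.keys() & v.keys())
    norm = (sqrt(sum(x * x for x in u.values()))
            * sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

# Toy architectures: two related kinase architectures and an unrelated one.
a = architecture_vector(["SH3", "SH2", "Kinase"])
b = architecture_vector(["SH2", "Kinase"])
c = architecture_vector(["ZnF", "Homeobox"])
print(cosine_similarity(a, b), cosine_similarity(a, c))
```

Because bigrams are ordered tuples, an architecture and its reversal map to different vectors, which is one simple way a representation can respect the constraint that domain order matters.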
Presentation Overview
Complex diseases are highly challenging to combat, partly owing to the interplay of molecular cascades involved in disease pathogenesis. Cellular models of disease offer great potential for exploring biological mechanisms and testing drug targets, but there is currently no way to determine how well a modelled disease mechanism matches the actual human disease. Several clinical trials for complex diseases have failed despite successful preclinical validation in cellular and animal models. Cellular models are built to recapitulate high-level phenotypes and disease pathology, yet there is no approach to systematically assess how well they recapitulate the molecular profiles of disease pathogenesis. Comparing human and model transcriptomes is attractive, but integrative study of gene expression is typically confounded by cross-platform and species-specific effects. We have developed a systems approach that better integrates transcriptomes from cell models and primary human tissues.
To determine how well a modelled disease mechanism matches the actual human disease, we have developed integrated quantitative pathway analysis (iQPA), which both captures and interrogates the degree to which disease functions constructed in models match those found in common across hundreds of diseased human brains. Using annotated pathway databases and a non-parametric approach, iQPA transforms gene expression into a series of quantifiable pathway activities. These pathway activities are analyzed using linear models to define functional dysregulation. In turn, iQPA leverages these dysregulation events to identify and assess the consistency of functional recapitulation between human and model.
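A minimal sketch of the non-parametric transform step: score each pathway in each sample by the mean within-sample rank of its member genes, yielding quantifiable pathway activities that can then be fed to linear models. This is a toy stand-in, since the abstract does not specify iQPA's exact transform, and the gene names are hypothetical.

```python
def pathway_activity(expression, pathway_genes):
    """Non-parametric per-sample pathway activity: the mean rank of the
    pathway's genes within each sample's expression profile. A toy
    stand-in for iQPA's gene-expression-to-pathway transform."""
    activities = []
    for sample in expression:  # sample: dict mapping gene -> expression
        ranked = sorted(sample, key=sample.get)      # ascending expression
        rank = {g: i + 1 for i, g in enumerate(ranked)}
        members = [rank[g] for g in pathway_genes if g in rank]
        activities.append(sum(members) / len(members))
    return activities

# Two toy samples: pathway genes g1 and g2 are high in sample 1, low in 2.
expression = [{"g1": 5.0, "g2": 4.0, "g3": 1.0},
              {"g1": 1.0, "g2": 2.0, "g3": 9.0}]
print(pathway_activity(expression, ["g1", "g2"]))
```

Rank-based scores of this kind are robust to platform-specific scaling, which is one reason a non-parametric transform helps when comparing human tissue and cell-model transcriptomes.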
We demonstrate the utility of iQPA as applied to Alzheimer's disease (AD). Brain transcriptomic datasets sampled from different brain regions of three independent cohorts, as well as multiple cell models of AD, were integrated to determine high-fidelity therapeutic target pathways. iQPA found a high level of correlation (r = 0.84) of pathway dysregulation between distinct brain regions, whereas gene-based analysis uncovered a significantly lower correlation (r = 0.51). It also determined, in an unbiased manner, which cellular models most closely recapitulate human dysregulation events. iQPA identified 83 core pathways with consistent dysregulation across human brains and the most relevant cell model. The p38 MAPK pathway is the top core pathway shared between AD brains and the relevant AD cellular models. To explore its therapeutic potential, we applied a clinical p38 MAPK inhibitor, which dramatically ameliorated Aβ-induced tau pathology and neuronal death in 3D-differentiated human neurons. iQPA accelerates AD drug discovery by systematically identifying dysregulated core pathway activities, providing robust, validated targets that attenuate AD pathology.