Posters - Schedules


Tuesday, May 16, between 12:00 PM EDT and 1:30 PM EDT (Odd Numbered Posters)
Wednesday, May 17, between 12:00 PM EDT and 1:30 PM EDT (Even Numbered Posters)
Session A Poster Set-up and Dismantle
Session A Posters set up:
Tuesday, May 16, between 8:00 AM EDT and 8:45 AM EDT
Session A Posters dismantle:
Tuesday, May 16, at 6:00 PM EDT
Session B Poster Set-up and Dismantle
Session B Posters set up:
Wednesday, May 17, between 8:00 AM EDT and 8:45 AM EDT
Session B Posters dismantle:
Wednesday, May 17, at 6:00 PM EDT
Virtual
02: Can Androids Segment Digital Tissue Slides? A Self-Supervised Framework for Identifying Tissue Types in Cancer Whole-slide Images
Track: General Session
  • Yiran Shao, University of Toronto, Canada
  • Philip Awadalla, University of Toronto, Ontario Institute for Cancer Research, Canada


Presentation Overview:

Tissue analysis is a crucial step of tumour diagnosis, but it can be time-consuming and expensive for pathologists to manually annotate regions of interest in whole-slide images (WSIs). To address this challenge, we propose a self-supervised semantic segmentation framework that uses the state-of-the-art Vision Transformer (ViT) model to identify main tissue types in cancer WSIs.

Our approach does not require expert annotation or dedicated dense pixel-level loss, allowing it to be trained on large unannotated WSI datasets with variations in tissue appearance across samples. This framework has the potential to significantly improve the efficiency and accuracy of tissue identification in cancer for both clinical and research applications.

The ability to automatically segment and classify tissue types has broader implications for digital pathology, as it can enhance the speed and accuracy of diagnosis for a wide range of cancers. Our work is an important step towards reducing the manual cognitive workload of pathologists and enabling the creation of large-scale annotated WSI datasets for computational pathology pipelines.

04: RNA editing landscape in response to SARS-CoV-2 infection
Track: General Session
  • Aiswarya Mukundan Nair, Kent State University, United States
  • Dr. Helen Piontkivska, Kent State University, United States


Presentation Overview:

The ongoing COVID-19 pandemic has resulted in a staggering number of confirmed cases and deaths, surpassing 700 million and 6.8 million respectively as of March 2023 (WHO COVID-19 dashboard). Understanding the host immune response to SARS-CoV-2 is critical for developing effective therapies to combat the virus. The innate immune response, which serves as the first line of defense during viral infections, is one of the factors determining the severity and spread of the disease. When SARS-CoV-2 enters the cell, it triggers an interferon response, leading to a cascade of downstream signaling that results in the activation of interferon-stimulated genes, including ADAR p150. ADARs are enzymes that modify RNA molecules by deaminating adenosine residues in double-stranded RNA and can act on both viral and endogenous RNAs. These editing events are highly dynamic and tightly regulated. However, the role of RNA editing in SARS-CoV-2 infection is not well understood.
In this study, we examined the impact of SARS-CoV-2 infection on the expression of ADAR p150 and on host ADAR editing patterns. We used a customized computational pipeline to analyze a publicly available longitudinal RNA-seq dataset of age- and sex-matched subjects without comorbidities at three different stages of infection: pre-infection, during infection, and post-infection. Our results revealed elevated expression of ADAR p150 and global changes in RNA editing levels during infection. Importantly, we observed that while ADAR p150 expression returned to pre-infection levels in post-infection samples, the editing levels did not. We further aim to explore the implications of such changes to gain mechanistic insights into virus-host interactions and host immune responses. Our findings highlight the dynamic nature of RNA editing in response to SARS-CoV-2 infection and provide new insights into potential mechanisms underlying these changes. These results could inform the development of novel therapeutic strategies targeting RNA editing processes to combat COVID-19.
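For readers unfamiliar with the metric, the editing level at a candidate A-to-I site is conventionally the fraction of RNA-seq reads carrying the edited base (read as G) among all reads covering the site, i.e. G / (A + G). The sketch below is an editorial illustration with invented read counts, not the authors' pipeline.

```python
# Per-site RNA-editing level: ADAR editing appears as A->G mismatches
# in RNA-seq reads, so the editing level at a site is G / (A + G).
def editing_level(a_reads: int, g_reads: int) -> float:
    total = a_reads + g_reads
    return g_reads / total if total else 0.0

# Toy counts for one site across the study's three timepoints
# (pre-infection, during infection, post-infection); values are made up.
for stage, (a, g) in {"pre": (90, 10), "during": (60, 40), "post": (70, 30)}.items():
    print(f"{stage}-infection editing level: {editing_level(a, g):.2f}")
```

Aggregating such per-site levels genome-wide is what would reveal the global shifts in editing described above.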

06: A new GWAS method to unravel CNVs associated with ASD and cognitive ability
Track: General Session
  • Cécile Poulain, Université de Montréal, CHU Sainte Justine, Canada
  • Catherine Proulx, Université de Montréal, CHU Sainte Justine, Canada
  • Elise Douard, Université de Montréal, CHU Sainte Justine, Canada
  • Jean Louis Martineau, CHU Sainte Justine, Canada
  • Zohra Saci, CHU Sainte Justine, Canada
  • Zdenka Pausova, Hospital for Sick Children, University of Toronto, Canada
  • Tomas Paus, CHU Sainte Justine, Canada
  • Laura Almasy, Children's Hospital of Philadelphia, United States
  • David Glahn, Boston Children's Hospital/Harvard Medical School, United States
  • Guillaume Huguet, Université de Montréal, CHU Sainte Justine, Canada
  • Sébastien Jacquemont, Université de Montréal, CHU Sainte Justine, Canada


Presentation Overview:

Autism spectrum disorder (ASD) is a complex neurodevelopmental disorder (NDD) characterised by large clinical, genetic and biological heterogeneity. The prevalence of ASD is 2% in the general population, which makes it an important public health issue. ASD often co-occurs with many comorbidities; moreover, 30% of autism patients have intellectual disability (ID).
Among the variants associated with ASD, copy number variants (CNVs) are the most frequently identified in the clinic. CNVs are deletions or duplications of genomic regions of more than 1,000 nucleotides that may involve one or many genes.
ASD-associated variants also have a significant impact on ID risk. To date, no study of rare variants has been able to clearly separate their effects on ASD risk from their effects on cognitive ability.
This is primarily due to a lack of research investigating the impact of CNVs or other rare variants on both of these conditions simultaneously: controls in genetic studies rarely receive a cognitive assessment.
Characterizing the impact of common and specific genetic variants associated with ASD and ID is crucial to better understand the biological mechanisms involved and to improve diagnosis.


Hypothesis: The study of genes included in CNVs can be used to estimate the risks associated with ASD and ID separately, based on their biological characteristics and pathways.
We aim to identify rare CNVs that confer ASD risk, while carefully controlling for their effects on cognition, in order to identify common and specific biological pathways associated with ASD and ID.
Methods: We performed an association study based on genes within CNVs (CNV-GWAS) on an aggregate dataset (3 cohorts with ASD and 6 control cohorts) of ~466,000 individuals, to identify CNVs implicated in ASD and ID. For genome-wide significant CNVs associated with ASD, we recomputed the association study after adjusting for cognitive ability.
Results: We replicated previous associations of 33 recurrent CNVs with ASD. We also identified 28 new regions (3 deletions and 25 duplications) and corresponding genes not previously associated with ASD.
When adjusting for cognitive ability, 3 of the new regions (1 deletion and 2 duplications) remained significantly associated with ASD.

Conclusion:
These datasets allowed us to detect ultra-rare variants with better precision than before. This unique design, including controls with cognitive assessments, demonstrated that CNV deletions and duplications remained associated with ASD even after adjusting for their effects on cognitive ability. The identification of the biological pathways associated with these genes will lead to a better understanding of the common and specific features of ASD and ID.

08: Immune-relevant 3-lncRNA signature with prognostic implications across multiple cancers
Track: General Session
  • Raghvendra Mall, Biotechnology Research Center, Technology Innovation Institute, P.O. Box 9639, Abu Dhabi, United Arab Emirates
  • Shimaa Sherif, Human Immunology Division, Research Branch, Sidra Medicine, Doha, Qatar
  • Jessica Roelands, Leiden University Medical Center, Netherlands
  • Davide Bedognetti, Human Immunology Division, Research Branch, Sidra Medicine, Doha, Qatar
  • Wouter Hendrickx, Human Immunology Division, Research Branch, Sidra Medicine, Doha, Qatar
  • Julie Decock, Translational Cancer and Immunity Center, QBRI, HBKU, QF, Doha, Qatar


Presentation Overview:

Background: Advances in our understanding of the tumor microenvironment have radically changed the cancer field, highlighting the emerging need for biomarkers of an active, favorable tumor immune phenotype to aid treatment stratification and clinical prognostication. Numerous immune-related gene signatures have been defined; however, their prognostic value is often limited to one or a few cancer types. Moreover, non-coding RNAs remain largely unexplored as biomarkers, although their numbers and biological roles are rapidly expanding.

Methods: We developed a multi-step process to identify immune-relevant long non-coding RNA signatures with prognostic implications in multiple TCGA solid cancer datasets.

Results: Using the breast cancer dataset as a discovery cohort, we found 2,988 differentially expressed lncRNAs between immune favorable and unfavorable tumors, as defined by the immunologic constant of rejection (ICR) 20-gene signature. Mapping of the lncRNAs to a coding-non-coding network identified 127 proxy protein-coding genes that are enriched in immune-related diseases and functions. Next, we defined a 20-lncRNA signature that showed a stronger effect on overall survival than the ICR signature in multiple solid tumors. Furthermore, we found a 3-lncRNA signature that demonstrated prognostic significance across 5 solid cancer types with a stronger association with clinical outcome than ICR. Moreover, this 3-lncRNA signature showed additional prognostic significance in uterine corpus endometrial carcinoma and cervical squamous cell carcinoma and endocervical adenocarcinoma as compared to ICR.

Conclusion: We identified an immune-related 3-lncRNA signature with prognostic connotation in multiple solid cancer types which performed equally well and in some cases better than the 20-gene ICR signature, indicating that it could be used as a minimal informative signature for clinical implementation.

Reference:
Sherif, Shimaa, Raghvendra Mall, Hossam Almeer, Adviti Naik, Abdulaziz Al Homaid, Remy Thomas, Jessica Roelands, et al. "Immune-related 3-lncRNA signature with prognostic connotation in a multi-cancer setting." Journal of Translational Medicine 20, no. 1 (2022): 442.

10: Gene duplication, exon duplication, and elaboration of splicing contribute to Atlastin family evolution.
Track: General Session
  • Maureen Stolzer, Carnegie Mellon University, United States
  • Ruijin Yang, Carnegie Mellon University, United States
  • Samantha Bryce, Carnegie Mellon University, United States
  • Tina Lee, Carnegie Mellon University, United States
  • Dannie Durand, Carnegie Mellon University, United States


Presentation Overview:

Gene families evolve through duplication, displacement, and modification of DNA on a range of scales, from single exons to entire genes. Recent work on spliced alignment highlights the challenges of integrating analyses of gene architecture evolution and protein evolution. In this case study, we examine the evolution of the Atlastin (ATL) GTPases, a family in which gene duplication, exon duplication, and elaboration of splice variants all contributed to functional and regulatory innovation.

The Atlastins mediate membrane fusion in the endoplasmic reticulum (ER) in metazoans. While Drosophila, like most invertebrates, has a single Atlastin (ATLi), humans and most vertebrates have three paralogs (ATL1/2/3). Two companion studies [1,2] have shown that ATL1 and ATL2 are autoinhibited, dependent on the presence of an extension in the C-terminus. C-terminal autoinhibition likely acts as a regulatory mechanism that allows for rapid response to ER emergencies.

Our survey of ATL homologs and their exon structures in a broad sample of metazoan genomes, combined with phylogenetic analysis, reveals a complex history of gene duplication, exon shuffling, and elaboration of alternate splicing. ATL2 harbors two adjacent C-terminal exons that are alternatively spliced. One of these splice forms exhibits autoinhibition; the other does not. Sequence comparison indicates that these exons are homologous to each other and to the C-terminal exons in ATL1 and ATLi.

Reconciliation of exon, gene, and species trees with Notung-DM [3], combined with a comparison of the gene architecture of metazoan ATL homologs, reveals a history of gene duplications and exon structure remodeling. Surprisingly, despite their spatial organization, the two adjacent C-terminal exons in ATL2 did not arise by tandem duplication. Rather, one of these exons is a copy of the C-terminal exon from ATL1. When combined with a parsimony reconstruction of phenotypic changes, our results indicate that autoinhibition is a recent evolutionary innovation that likely arose twice independently in vertebrates.

This case study illustrates how duplication on multiple scales – genes and exons – contributes to a robust functional and regulatory repertoire. Phylogenetic reconciliation on three levels of organization reveals a more complex evolutionary history than the tandem arrangement of exons would suggest.


[1] Bryce S, Stolzer M, Crosby D, Yang R, Durand D, and Lee TH. (In Press) "Human atlastin-3 is a constitutive ER membrane fusion catalyst." Journal of Cell Biology. https://doi.org/10.1083/jcb.202211021

[2] Crosby D, Mikolaj MR, Nyenhuis SB, Bryce S, Hinshaw JE, and Lee TH. (2022) "Reconstitution of Human Atlastin Fusion Activity Reveals Autoinhibition by the C Terminus." Journal of Cell Biology. 221(2).

[3] Stolzer M, Siewert K, Lai H, Xu M, and Durand D. (2015) "Event inference in multidomain families with phylogenetic reconciliation." BMC Bioinformatics. 16 Suppl 14(Suppl 14):S8.

14: Neurodevelopmental disorder PRS for all rare variant types using IQ as quantitative proxy for disorder severity.
Track: General Session
  • Thomas Renne, Université de Montréal, Canada
  • Zohra Saci, Université de Montréal, Canada
  • Martineau Jean-Louis, Université de Montréal, Canada
  • David Glahn, Boston Children's Hospital, United States
  • Tomas Paus, CHU Sainte Justine, Canada
  • Zdenka Pausova, Hospital for Sick Children, Toronto, Canada
  • Laura Almasy, Children's Hospital of Philadelphia, United States
  • Guillaume Huguet, Université de Montréal, Canada
  • Sébastien Jacquemont, Université de Montréal, Canada


Presentation Overview:

Neurodevelopmental disorders (NDDs) are a set of disorders of the brain and nervous system that arise during neurodevelopment. They are diagnosed in 7-14% of the general population. NDD etiology stems partly from genetic variants carried by the patient and partly from environmental factors. Among these variants, the most important are copy number variants (CNVs), followed by rare single nucleotide variants (SNVs). Such deleterious variants are identified in 10-40% of patients diagnosed with an NDD. However, only a few of these variants can be used to explain the disease, and it is therefore difficult for clinicians to estimate which genetic variant may contribute to a patient's neurodevelopmental symptoms. This is due to two major issues: 1) most "pathogenic" or "likely pathogenic" variants have been observed only once in patients, so conducting individual association studies on them is impossible; and 2) the cognitive mechanism underlying the association between variants and phenotype remains unknown.

To address this issue of undocumented mutations, we proposed a novel strategy to understand, for the first time, the general principles linking variants to cognitive dimensions. This model, based on the predicted loss-of-function effects of the genes impacted, quantifies the impact of any CNV on general intelligence with 78% accuracy. As the amount of available data increases, more sensitive studies are now possible.

We are now quantifying the effect sizes of genes over-expressed in specific contexts (cell types, Gene Ontology terms, etc.). For genes over-expressed in cortical cell types of the fetal brain, effect sizes are studied individually; each cell type's effect size indicates how strongly that cell type is associated with cognition. As expected, excitatory and inhibitory neurons have a high impact on cognitive ability, but accessory cell types responsible for brain metabolism also strongly impact cognition compared to a naive model estimating the effect size of the whole genome. A similar analysis of the ~19,000 human GO terms identifies specific biological processes that impact cognitive ability negatively or positively.

These results show that a more sensitive analysis of gene impact on cognition brings more information about the impact of CNVs on cognition. A new model incorporating this molecular information may therefore improve the accuracy with which the impact of a CNV on general intelligence is quantified. Finally, these analyses will be applied to SNVs, more complex variants with weaker effects. Once both models are developed and optimized, they will be merged into a complete model describing the effect on cognition of the large majority of patients' variants.

16: Using Machine Learning on single-cell RNAseq and clinical data to disentangle response to treatment uncertainty in Rheumatoid Arthritis
Track: General Session
  • Jean Vencic, University of Sherbrooke, Canada
  • Sophie Roux, University of Sherbrooke, Canada
  • Michelle Scott, University of Sherbrooke, Canada
  • Hugues Allard-Chamard, University of Sherbrooke, Canada


Presentation Overview:

Rheumatoid arthritis (RA) is a chronic, inflammatory and autoimmune disease affecting 1% of the worldwide population. It is characterized by symptomatic flares during which significant inflammation and destruction of the joints appear. The pathophysiology of the disease remains poorly understood and the available treatments can only alleviate its symptoms and rarely induce long-term remission. We currently lack effective tools to predict RA progression and response to treatments.

RA is a heterogeneous entity: disease aggressiveness and clinical outcomes vary widely, and not all patients respond to the same treatments. We posit that this heterogeneity is due to a plurality of immune alterations that produce a common layer of phenotypes leading to the RA diagnosis, but also several distinct immune cell expression profiles, namely the RA endophenotypes, which lead to the differences in treatment response and disease development observed among patients.

To conduct this study, single-cell RNA sequencing data were generated from blood mononuclear cell samples of patients presenting with RA, prior to the initiation of any treatment. Using bioinformatics methods, we aim to discriminate and define specific RA endophenotypes correlated with RA-specific clinical outcomes. To do so, we investigated transcriptomic differences among 18 matched control and patient samples with a standard single-cell RNA-seq analysis pipeline (read mapping, read and cell filtering, dimensionality reduction, automated annotation of immune cell types), followed by analysis of differentially expressed (DE) genes between pairs of matched samples, revealing various DE genes and regulated biological processes, especially in B cells and T cells. We then integrated clinical data obtained at the patients' first examination with longitudinal information from medical follow-ups, such as disease development and treatment response (mean enrollment time of around 20 months). This allowed us to apply supervised machine learning methods (Random Forest, SVM, Naive Bayes) to identify features, among both transcriptomic data and clinical information, that are statistically tied to RA outcomes and treatment response.

The long-term objectives of this ongoing project will be to investigate more thoroughly the transcriptomic and clinical features selected by the machine learning methods to better characterize the distinct possible immune endophenotypes found. Moreover, we aim to continue enrolling more newly diagnosed RA patients to improve the robustness of our findings and better grasp the complexity of RA biology. Finally, developing a tool capable of linking a new patient to a characterized endophenotype - and thus its clinical outcomes - based on a few meaningful biological characteristics would pave the way for personalized medicine in RA.
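The supervised classifiers named above (Random Forest, SVM, Naive Bayes) all follow the same fit-then-predict pattern. As a self-contained editorial sketch, not the authors' code, here is a minimal Gaussian Naive Bayes on invented two-feature data; the feature values and the "responder"/"non-responder" labels are purely illustrative.

```python
import math
from collections import defaultdict

# Minimal Gaussian Naive Bayes: fit per-class feature means/variances,
# then pick the class maximizing log prior + sum of log Gaussian likelihoods.
class GaussianNB:
    def fit(self, X, y):
        groups = defaultdict(list)
        for row, label in zip(X, y):
            groups[label].append(row)
        self.stats, self.priors = {}, {}
        for label, rows in groups.items():
            self.stats[label] = []
            for col in zip(*rows):          # per-feature statistics
                mu = sum(col) / len(col)
                var = max(sum((v - mu) ** 2 for v in col) / len(col), 1e-9)
                self.stats[label].append((mu, var))
            self.priors[label] = len(rows) / len(X)
        return self

    def predict(self, X):
        def log_gauss(x, mu, var):
            return -0.5 * (math.log(2 * math.pi * var) + (x - mu) ** 2 / var)

        return [max(self.stats,
                    key=lambda lb: math.log(self.priors[lb]) + sum(
                        log_gauss(x, mu, var)
                        for x, (mu, var) in zip(row, self.stats[lb])))
                for row in X]

# Toy data: two hypothetical "endophenotypes" separable on the first feature.
X = [[1.0, 0.2], [1.1, 0.1], [0.9, 0.3], [3.0, 0.2], [3.2, 0.1], [2.9, 0.3]]
y = ["responder"] * 3 + ["non-responder"] * 3
clf = GaussianNB().fit(X, y)
print(clf.predict([[1.05, 0.2], [3.1, 0.2]]))  # ['responder', 'non-responder']
```

In practice the feature rows would combine per-sample expression summaries with clinical covariates, and a Random Forest or SVM would be swapped in behind the same interface.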

18: phyDBSCAN: phylogenetic tree density-based spatial clustering of applications with noise and without hyperparameters
Track: General Session
  • Fadi Abu Salem, University of Sherbrooke, Canada
  • Boris Morosov, Astrakhan State University, Russia
  • Nadia Tahiri, University of Sherbrooke, Canada


Presentation Overview:

Motivation: Each gene has its own evolutionary history, which may differ substantially from that of other genes, due to processes such as horizontal gene transfer or recombination. Here, we present an innovative approach for detecting one or multiple clusters of genes with similar evolutionary histories using the density-based spatial clustering of applications with noise (DBSCAN) method. A comparative study shows that DBSCAN is effective for biological data, as it is efficient for low-dimensional data and robust against outliers and noise. Additionally, DBSCAN can be adapted to any type of metric space and is sometimes related to median procedures; for example, in the case of the Robinson and Foulds (RF) distance in phylogenetics, it can lead to both median trees and Euclidean distances.


Results: We present a novel and efficient method for inferring multiple consensus trees and alternative supertrees to accurately represent the most significant evolutionary patterns in a set of genetic phylogenies. We demonstrate how an adapted version of the DBSCAN clustering algorithm, which does not require any hyperparameters, can be used based on the unique properties of the Robinson and Foulds distance. This adapted algorithm can be used to partition a given set of trees into one cluster (for homogeneous data) or multiple clusters (for heterogeneous data) of trees.
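To make the clustering idea concrete, here is classic DBSCAN run directly on a precomputed pairwise distance matrix, the shape of input one would obtain from pairwise RF distances between gene trees. This is an editorial sketch of the standard algorithm only: it still takes the eps and min_samples hyperparameters that the authors' adaptation removes, and the distances below are toy numbers, not real RF values.

```python
# Classic DBSCAN over a precomputed distance matrix (e.g. pairwise
# Robinson-Foulds distances between gene trees). Labels: cluster index,
# or -1 for noise (outlier trees, e.g. from HGT or recombination).
def dbscan(dist, eps, min_samples):
    n = len(dist)
    NOISE, UNSEEN = -1, None
    labels = [UNSEEN] * n

    def neighbors(i):
        return [j for j in range(n) if dist[i][j] <= eps]

    cluster = 0
    for i in range(n):
        if labels[i] is not UNSEEN:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_samples:
            labels[i] = NOISE            # tentatively noise
            continue
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster      # noise reclaimed as border point
            if labels[j] is not UNSEEN:
                continue
            labels[j] = cluster
            js = neighbors(j)
            if len(js) >= min_samples:   # core point: keep expanding
                queue.extend(js)
        cluster += 1
    return labels

# Toy symmetric "RF distance" matrix: trees 0-2 similar, 3-4 similar, 5 noise.
D = [[0, 1, 1, 8, 8, 9],
     [1, 0, 1, 8, 8, 9],
     [1, 1, 0, 8, 8, 9],
     [8, 8, 8, 0, 1, 9],
     [8, 8, 8, 1, 0, 9],
     [9, 9, 9, 9, 9, 0]]
print(dbscan(D, eps=2, min_samples=2))  # [0, 0, 0, 1, 1, -1]
```

Each resulting cluster would then be summarized by its own consensus tree or supertree, with noise trees set aside.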

20: Heterogeneous Domain Adaptation for Species-Agnostic Transfer Learning
Track: General Session
  • Youngjun Park, University Medical Center Göttingen, Germany
  • Nils Paul Muttray, Georg-August-Universität Göttingen, Germany
  • Anne-Christin Hauschild, University Medical Center Göttingen, Germany


Presentation Overview:

In preclinical biomedical research, model organisms such as mice or zebrafish are the primary source for developing and validating novel hypotheses. Studies investigating disease development, progression, or treatment response often focus on such models. However, a major challenge in translating these findings to clinical practice is the biological and technological heterogeneity between species, resulting from varying sets of genes and their functions. To address this need for cross-species data integration, current algorithms for species domain adaptation in biomedical research rely on external knowledge, such as homologous genes, which require additional experimental validation. Moreover, such validated knowledge often exists only for well-studied model organisms, resulting in significant information loss during gene mapping. In this study, we present the first algorithm that enables species-agnostic transfer learning (SATL) with heterogeneous domain adaptation, allowing for knowledge integration and translation across various species' datasets without relying on external knowledge. This SATL approach is an extension of the cross-domain structure-preserving projection algorithm. We evaluated and compared the novel approach with standard knowledge-guided methods and other generalized zero-shot learning models from computer vision, focusing on cell-type label projection tasks with four different single-cell sequencing datasets. Our results demonstrate that the novel SATL approach does not rely on any external experimental knowledge and has minimal information loss, making it more suitable for general biological research by allowing knowledge transfer beyond the barriers between model organisms.

22: Inverted Repeats in Viral Genomes
Track: General Session
  • George M. Rivera, Global Society for Philippine Nurse Researchers, Inc., Philippines
  • Jingxiang Gao, Carnegie Mellon University, Qatar
  • Madhura Sen, Vellore Institute of Technology, India
  • Matthew Shtrahman, UCSD School of Medicine, United States
  • Madhavi Ganapathiraju, Carnegie Mellon University, Qatar


Presentation Overview:

An inverted repeat (IR) in DNA is a sequence of nucleotides that is followed by its complementary bases but in reverse order (e.g., CACGGATTGTCCGTG, with CACGGA being followed by its reverse complement TCCGTG). If the two complementary sequences occur without a gap between them, they are referred to as DNA palindromes (e.g., CACGGATCCGTG, without TTG occurring between the two complementary parts). IRs and palindromes create fragile sites that endanger genomic stability. IRs in viruses serve in mechanisms of entry into host cells and other purposes, such as gene silencing, initiating duplication, and genomic evolution in zoonotic viruses, including SARS-CoV-2. In contrast to palindromes, IRs have been less explored, which also stems from the scarcity of sequence analysis tools allowing accurate detection across large numbers of viral genomes. Here, we developed a software application over the Biological Language Modelling Toolkit that uses augmented suffix arrays to efficiently identify IRs, and studied over 14 thousand viral genomes, cataloguing their IRs and palindromes. The algorithm employed here finds IRs in linear time without limitations on the length of or the distance between the two halves, and it is shown to outperform prior methods, including IUPACpal. We found over 19 million IRs longer than 20 bases, which amounts to an average of 1,300 inverted repeats per virus, including 134 that are longer than 2 kilobases. Among the virus species with large IRs found here are herpesviruses and poxviruses, which are well studied for their IRs, while other viruses remain to be explored; there is a prevalence of large terminal IRs in bacteriophages, such as mycobacterium phage and clostridium phage.
In particular, we identified large repeats in common disease-causing viruses, such as African swine fever virus (lethal to domestic pigs), Paramecium bursaria chlorella virus (important for the termination of algal blooms, and found able to infect humans, decreasing motor skills and reaction speed), Yaba-like disease virus (important in cancer gene therapy), lumpy skin disease virus (one of the most important animal poxviruses), and human herpesviruses. We found 54 viruses with high IR density (number of IRs per 1,000 bases of the genome), including disease-causing viruses such as poxviruses, herpesviruses, orf virus, lymphocystis disease virus, and a large number of bacteriophages. We also catalogued millions of palindromes in the viral genomes that remain to be studied further. This catalogue of inverted repeats and palindromes in viral genomes serves as a valuable resource for discovering the mechanisms of action of some of these viruses.
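The IR and palindrome definitions above can be checked directly on the abstract's own examples. This editorial sketch only tests a fixed arm length against the ends of a sequence; the actual tool uses augmented suffix arrays for linear-time search over whole genomes.

```python
# Check the inverted-repeat (IR) definition: a segment followed (after an
# optional gap) by its reverse complement; with no gap it is a palindrome.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]

def is_inverted_repeat(seq: str, arm: int) -> bool:
    """True if the first `arm` bases are mirrored by the last `arm` bases."""
    return seq[-arm:] == reverse_complement(seq[:arm])

def is_palindrome(seq: str) -> bool:
    """A DNA palindrome is an IR with no gap between the two halves."""
    return len(seq) % 2 == 0 and is_inverted_repeat(seq, len(seq) // 2)

# Examples from the abstract:
print(is_inverted_repeat("CACGGATTGTCCGTG", 6))  # CACGGA ... TCCGTG -> True
print(is_palindrome("CACGGATCCGTG"))             # -> True
```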

24: Simulation of scRNAseq data controlled by a causal gene regulatory network
Track: General Session
  • Yazdan Zinati, McGill University, Electrical and Computer Engineering, Montréal, Canada, Canada
  • Abdulrahman Takiddeen, McGill University, Electrical and Computer Engineering, Montréal, Canada, Canada
  • Amin Emad, McGill University, Electrical and Computer Engineering, Montréal, Canada, Canada


Presentation Overview:

The reconstruction of gene regulatory networks (GRNs) from single-cell gene expression data has been a topic of interest for over a decade. However, benchmarking GRN inference algorithms remains challenging due to the absence of a gold-standard ground truth. While reference GRNs can be built from experimental data such as ChIP-seq, or curated from the literature, their interactions might only partially correspond to the biological context under investigation, requiring lengthy and expensive perturbation experiments to confirm.

To overcome these issues, we present GRouNdGAN, a single-cell RNA-seq simulator based on causal generative adversarial networks. In this model, genes are causally expressed under the control of regulating transcription factors (TFs), guided by a user-provided GRN. GRouNdGAN enables simulation of single-cell RNA-seq data, in silico perturbation experiments and benchmarking of GRN inference methods. It is trained using a reference dataset to capture non-linear TF-gene dependencies, as well as technical and biological noise in real scRNAseq data to generate realistic datasets in which GRN properties are captured and gene identities are preserved.

GRouNdGAN outperforms state-of-the-art simulators in generating realistic cells indistinguishable from real ones despite the rigid constraints of an imposed GRN. Moreover, perturbing a TF results in significant perturbation of its targets, while the expression of other genes remains unchanged. In addition, GRouNdGAN can simulate cells at different states of a biological process. Using a dataset corresponding to the differentiation of stem cells, we show that the simulated cells conserve trajectories and pseudo-time orderings consistent with those of the real dataset. We use these properties to benchmark a variety of GRN inference methods, including those that utilize the concept of pseudo-time.

GRouNdGAN learns meaningful causal regulatory dynamics and can sample from interventional as well as observational distributions, synthesizing cells under conditions that do not occur in the dataset at inference time. This property allows for predicting perturbation and TF knockdown experiments in silico. Using a scRNA-seq dataset corresponding to 11 cell types to generate simulated data, we show that excluding the top three differentially expressed TFs of each cell type results in the disappearance of that cell type from generated samples. In another experiment, removing lineage-determining TFs in hematopoiesis results in cells differentiating into other lineages, consistent with in vitro knockout experiments.

In summary, GRouNdGAN is a powerful scRNAseq simulator with many utilities from simulating data for GRN inference to simulating in silico knockout experiments.

26: Unravelling the Complexity of Cellular Interactions using Underlying Graph Representations of Single-Cell Transcriptomics Data
Track: General Session
  • Akram Vasighizaker, University of Windsor, Canada
  • Sheena Hora, University of Windsor, Canada
  • Luis Rueda, University of Windsor, Canada


Presentation Overview:

Recent advancements in single-cell RNA sequencing (scRNA-seq) have allowed researchers to study intercellular signaling networks with greater ease. Traditional methods of studying cell-cell communication using link prediction approaches on graph-structured data are limited in their effectiveness due to their assumptions about the probability of node interaction, which only apply to specific networks. To overcome this limitation, we propose a novel method that uses an attributed graph convolutional neural network to predict cell-cell communication from scRNA-seq data. Our method captures both latent and explicit attributes of undirected, attributed graphs created from the gene expression profiles of individual cells.

The proposed method was evaluated on six datasets obtained from human and mouse pancreas tissue. Compared to other latent and implicit methods, the proposed method achieved improved performance in terms of AUC and accuracy. It also obtained the lowest FPR (0.0135) among all approaches, implying a very low probability of predicting non-interacting cells as interacting. Furthermore, the proposed method performed best at recovering actual interactions: when an interaction between cells exists, the method predicts it. Overall, the comparative analysis showed that the proposed method outperforms other latent feature-based approaches and the current state-of-the-art link prediction method, WLNM, with a ROC AUC of 0.99 and 99% prediction accuracy.
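For readers unfamiliar with the reported metrics, accuracy and false-positive rate follow directly from the confusion matrix. This is a generic sketch, not the authors' evaluation code:

```python
def confusion_rates(y_true, y_pred):
    """Accuracy and false-positive rate for binary link prediction.
    FPR = FP / (FP + TN): the chance of calling a non-interacting
    cell pair 'interacting'."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    accuracy = (tp + tn) / len(y_true)
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return accuracy, fpr
```

A low FPR matters here because most cell pairs do not interact, so even a small per-pair error rate would otherwise produce many spurious interactions.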

Additionally, to identify potential underlying interactions, we ran the GeneMANIA algorithm on the input list of the top 20 genes. The results on the BHuman1 dataset showed key regulators and effectors of the functional relationships between cells in three sub-networks according to GO annotation: growth factor binding, insulin-like growth factor binding, and type 1 interferons.

The proposed method has significant potential to aid in the understanding of complex cellular processes and inform the development of new therapeutic interventions. The datasets used in the study can be found at https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE84133, and the code is available on GitHub.

28: A unified model for Bayesian integration and interpretation of single-cell RNA-sequencing data
Track: General Session
  • Ariel Madrigal, McGill University, Canada
  • Tianyuan Lu, McGill University, Canada
  • Larisa Morales-Soto, McGill University, Canada
  • Adrien Osakwe, McGill University, Canada
  • Hamed S. Najafabadi, McGill University, Canada


Presentation Overview: Show

One of the current challenges in the analysis of single-cell data is the harmonized analysis of expression profiles across samples, where sample-to-sample variability exists and is driven by technical and biological effects. Lately, various computational methods have been developed with the aim of removing unwanted sources of technical variation. However, these methods have various limitations, including the inability to distinguish technical and biological sources of sample-to-sample variability, and low interpretability of the integrated low-dimensional space. We introduce Gene Expression Decomposition and Integration (GEDI), a model that unifies various concepts from normalization and imputation to integration and interpretation of single-cell transcriptomics data in a single framework. GEDI finds a common coordinate frame that defines a reference gene expression manifold and sample-specific transformations of this coordinate frame. The common coordinate frame can be expressed as a function of gene-level variables, enabling the projection of pathway and regulatory network activities onto the cellular state space. The coordinate transformation matrices, on the other hand, provide a compact and harmonized representation of differences in the gene expression manifolds across samples, enabling cluster-free differential gene expression analysis along a continuum of cell states, as well as machine learning-based prediction of sample characteristics. Comparison of GEDI to a panel of single-cell integration methods using different benchmark datasets and previously established metrics suggests that GEDI is consistently among the top performers in batch effect removal and cell type conservation, while it can uniquely deconvolve the effects of different sources of sample-to-sample variability. 
We also show GEDI's ability to learn condition-associated gene expression changes at single-cell resolution using a recent single-cell atlas of PBMCs profiled in healthy, mild, and severe COVID-19 cases. GEDI reconstructs disease-associated cell state vector fields that are consistent with pseudo-bulk approaches, while offering improved reproducibility between cohorts. By projecting the activity of multiple transcription factors (TFs) onto our reference manifold, we also identified various groups of TFs whose activity correlated with COVID-19-associated gene-expression changes in a cell-type-specific manner, including CEBPA, SP1, and AHR in monocytes. Finally, we demonstrate GEDI's ability to generalize to different data-generating distributions, which, in addition to the analysis of gene expression, allows the study of alternative splicing and mRNA stability landscapes. We showcase this capability using single-cell RNA-seq data of mouse neurogenesis, revealing cell-type-specific cassette exon-inclusion events, mRNA stability changes that accompany neuronal differentiation, and RNA-binding proteins and microRNAs that drive these changes. Together, these analyses highlight GEDI as a unified framework for modeling sample-to-sample variability, pathway and network activity analysis, and analysis of both transcriptional and post-transcriptional programs of the cell.

30: Predicting medulloblastoma subtype from single-cell RNA-seq data with pair-based classifiers
Track: General Session
  • Steven M. Foltz, Alex's Lemonade Stand Foundation, University of Pennsylvania, United States
  • Chante Bethell, Alex's Lemonade Stand Foundation, United States
  • Casey S. Greene, University of Colorado Anschutz, University of Pennsylvania, United States
  • Jaclyn N. Taroni, Alex's Lemonade Stand Foundation, United States


Presentation Overview: Show

Medulloblastoma (MB) is an aggressive pediatric cancer with subtypes that each have unique molecular features and patient outcomes (Taylor et al., 2012). The four main MB subtypes – SHH, WNT, Group 3, and Group 4 – can be predicted using gene expression or methylation data from bulk samples. SHH and WNT are easy to distinguish, but existing classification methods struggle to discriminate between Group 3 and Group 4 (Weishaupt et al., 2019). Existing methods are also often applied to entire cohorts, rather than predicting subtype labels for individual samples as they are collected. Here, we introduce a single sample predictor that accurately classifies individual samples without the need to normalize values to match a training distribution. We applied k top-scoring pairs, a classification method based on the ordering of a set of paired measurements, and random forest approaches to make subtype predictions based on within-sample relative gene expression levels. We demonstrate comparable performance across RNA-seq and microarray profiling. After training models using bulk microarray and RNA-seq, we tested the performance of our single sample predictor on single-cell RNA-seq data from a set of 36 medulloblastoma samples representing all four subtypes. Our model correctly predicted the subtype in the majority of pseudo-bulked samples constructed by averaging genes’ expression levels across all cells. We applied the classifiers to individual cells in the single-cell data. The predicted subtype of the majority of individual cells matched the patient’s subtype in 35 out of 36 samples. In three samples, however, the predicted subtypes were a mix of Group 3 and Group 4, with low-confidence predictions suggesting an intermediate phenotype. Notably, Group 3 and Group 4 have previously been found to exist as intermediates on a transcriptomic spectrum (Williamson et al., 2022).
Our results provide single-cell support for a model of Group 3 and Group 4 existing along a continuum and illustrate the value of the ability to classify individual cells. In summary, k top-scoring pairs and random forest single sample predictors accurately predict MB subtype labels across platforms and for both bulk and single-cell transcriptomic samples.
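The within-sample pair comparisons that make k top-scoring pairs normalization-free can be sketched as follows. The gene names and decision rules are hypothetical (real rules are learned from training data), and pseudo-bulking is shown as the simple per-gene average described above:

```python
from collections import Counter

def ktsp_predict(expr, rules):
    """k top-scoring pairs (sketch): `expr` maps gene -> value for ONE
    sample; each rule (a, b, label_hi, label_lo) votes label_hi when
    expr[a] > expr[b], else label_lo. Only within-sample rank order is
    used, so no cross-platform normalization is needed."""
    votes = Counter()
    for a, b, label_hi, label_lo in rules:
        votes[label_hi if expr[a] > expr[b] else label_lo] += 1
    return votes.most_common(1)[0][0]  # majority vote

def pseudobulk(cells):
    """Average each gene's expression across single cells (dicts) to
    build a pseudo-bulk profile for the bulk-trained classifier."""
    genes = cells[0].keys()
    return {g: sum(c[g] for c in cells) / len(cells) for g in genes}
```

The same `ktsp_predict` call can be applied either to a pseudo-bulk profile or to each individual cell, which is what enables the per-cell subtype calls described above.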

32: High-resolution characterization of transcription factor binding in S. cerevisiae
Track: General Session
  • Justin Cha, Cornell University, United States
  • William Km Lai, Cornell University, United States
  • B Franklin Pugh, Cornell University, United States


Presentation Overview: Show

The eukaryotic genome is regulated in part through the coordinated binding of specific proteins to DNA that in turn influence how and when genes are transcribed. This class of proteins are known as transcription factors (TFs) and operate through a multitude of distinct and overlapping mechanisms. Many TFs are sequence-specific, meaning they preferentially bind to a particular nucleotide sequence (i.e., motif), while others are recruited to locations in the genome through other mechanisms. Understanding where on our genome these TFs bind would accelerate research on regulatory diseases, drug development, and identification of biomarkers.
We have previously determined the genome-wide binding of ~400 yeast transcription factors using ChIP-exo (Rossi et al., 2021), a high-resolution variant of ChIP-seq that uses lambda exonuclease to achieve near-base-pair resolution of protein binding across the genome. We used the protein-binding locations of each TF from this dataset to identify enriched sequence motifs (using MEME). While the original study was comprehensive in its identification of the dominant TF assemblages and their cognate DNA recognition motifs, additional associated motifs were identified but not explored. The ChIP-exo assay can distinguish between distinct types of interactions of a TF with DNA, for example, direct vs. indirect interactions. This is manifested as distinct patterns of exonuclease cleavage sites concentrated at associated motifs, monitored genome-wide by deep sequencing of DNA fragments (“tags”) that are resistant to exonuclease digestion.
We have now begun to characterize these interactions. By performing iterative motif discovery on sets of binding locations stratified by occupancy, we identified motifs enriched specifically in protein-binding locations with lower occupancy. Furthermore, we developed a method to distinguish motifs by the sequencing tag distribution shape alone and identified instances of the same motif that possessed distinct tag distribution shapes depending on genomic context. For example, we found that the sequence-specific TFs Abf1 and Cbf1 have distinct binding patterns at their canonical motifs depending on whether the motif occurs in a telomere. Additionally, using shared motif and binding shape information, we grouped TFs into assemblages. This analysis was only possible due to the base-pair resolution that ChIP-exo provides, which allowed us to call narrow peaks and to characterize the shape of those peaks. While experimental work needs to be done to fully validate TF binding to these motifs and assess their biological implications, these secondary (noncognate) motifs and characterizations of binding shapes provide deep insight into associations between TFs as well as the fundamental structure of chromatin.

34: Multiple RNA tree Robinson-Foulds Phylogeny
Track: General Session
  • Yoann Anselmetti, University of Sherbrooke, Canada
  • Aïda Ouangraoua, University of Sherbrooke, Canada


Presentation Overview: Show

In multicellular organisms, it has been shown that only a small fraction of RNAs lead to the production of proteins; most RNAs are non-coding RNAs (ncRNAs). With more than 4,000 families referenced in the Rfam database, ncRNAs represent a wide diversity of molecules whose structures interact with other molecules in numerous metabolic pathways essential to the functioning of cells.
Over the last three decades, many methods have been developed to predict the secondary structure of ncRNAs and to build accurate ncRNA multiple sequence alignments accounting for their secondary structure. Until now, however, only a few algorithms and methods have been designed to study the evolution of ncRNA secondary structure (e.g., Indiegram [Bradley and Holmes, 2009] and achARNement [Tremblay-Savard et al., 2016]), and none of them allows reconstruction of the complete evolutionary history of the secondary structures of a ncRNA family.

In this talk, we consider the Small Parsimony and the Large Parsimony problems for families of ncRNAs whose secondary structures are represented as trees. For these two optimization problems, we have designed heuristic solutions under the Robinson-Foulds (RF) tree metric model. We study the theoretical complexity of the problems under the RF distance model, as well as the tree edit distance model, and provide efficient algorithmic solutions for the two problems under both tree distance models. For the Small Parsimony problem, we test different phylogenetic tree inferences for the ncRNA families to infer ancestral ncRNA secondary structures. We consider the phylogenetic tree available in the Rfam database and phylogenetic trees for each of the RNA substitution models (S6A, S6B, S6C, S6D, S6E, S7A, S7B, S7C, S7D, S7E, S7F, S16, S16A and S16B) available with the maximum likelihood phylogenetic inference software RAxML ([Stamatakis, 2014] and [Pignatelli et al., 2016]). We then use our implementation of the Large Parsimony problem to infer phylogenetic trees of the ncRNA families according to ncRNA secondary structure under the RF distance model and compare this inference to the previous ncRNA family tree inferences. The study of the evolution of ncRNA structures has the potential to lead to interesting insights for therapeutic targeting of ncRNAs based on the comparison of their structures involved in metabolic pathways in different species, thus combining genomic, transcriptomic and metabolomic information.
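The Robinson-Foulds metric used here counts the clades found in exactly one of two trees. A minimal sketch for rooted trees encoded as nested tuples (an illustrative encoding, not the authors' RNA-structure-tree representation):

```python
def leaf_set(tree):
    """Leaves under a node; leaves are any non-tuple labels."""
    if not isinstance(tree, tuple):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(c) for c in tree))

def clades(tree):
    """All internal-node leaf sets (clades) of a nested-tuple tree."""
    if not isinstance(tree, tuple):
        return set()
    out = {leaf_set(tree)}
    for child in tree:
        out |= clades(child)
    return out

def rf_distance(t1, t2):
    """Robinson-Foulds distance: number of clades present in exactly
    one of the two trees (symmetric difference of clade sets)."""
    return len(clades(t1) ^ clades(t2))
```

For unrooted trees the same idea applies to bipartitions rather than rooted clades.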

36: GenoPipe: identifying the genotype of origin within (epi)genomic datasets
Track: General Session
  • Olivia Lang, Cornell University, United States
  • Divyanshi Srivastava, Pennsylvania State University, United States
  • B Franklin Pugh, Cornell University, United States
  • William Km Lai, Cornell University, United States


Presentation Overview: Show

Confidence in experimental results is critical for discovery. As the scale of data generation in genomics has grown exponentially, experimental error has likely kept pace despite the best efforts of many laboratories. Technical mistakes can and do occur at nearly every stage of a genomics assay (e.g., cell line contamination, reagent swapping, tube mislabeling) and are often difficult to identify post-execution. The rise of faster and more convenient technologies for genetic modification (e.g., CRISPR) has resulted in an explosion of possibilities for an experimental sample’s genetic background, which in turn increases the opportunity for mix-ups and contamination with other modified and control strains.

The DNA sequenced in genomic experiments contains certain markers (e.g., indels) encoded within, which can often be ascertained forensically from experimental datasets. We developed the Genotype validation Pipeline (GenoPipe), a suite of heuristic tools that operate together directly on raw and aligned sequencing data from individual high-throughput sequencing experiments to characterize the underlying genome of the source material. GenoPipe validates and can rescue erroneously annotated experiments by identifying unique markers inherent to an organism’s genome (e.g., epitope insertions, gene deletions, and SNPs).

We have determined the minimum recommended sequencing depth needed for GenoPipe to predict genetic backgrounds and detect sample cross-contamination accurately and reliably. However, our analysis of real data shows that even when these baselines are not met, GenoPipe can still report biologically meaningful results.

Our extensive analysis of real genomic data, ranging from yeast to human, demonstrates that this pipeline can identify and precisely localize synthetic fusion proteins in the genome, identify genomic deletions, and predict the identity of common cell lines. We found multiple instances of sample mislabeling and potential sample contamination across several small and large datasets, including over 9,000 WGS samples of the Yeast Knockout Collection (YKOC) and thousands of samples generated by the ENCODE consortium. Additionally, the application of GenoPipe to genomically localize viral DNA in lentivirally infected cells demonstrates GenoPipe’s capabilities for exploratory research.

The reliability of scientific findings that future discoveries build upon is critically dependent on our trust in the data. Having proper quality control pipelines like GenoPipe in place prevents enormous resource waste by flagging problematic samples before processing progresses too far, and can potentially rescue samples that have already been generated. We recommend that all labs generating sequencing data use GenoPipe to screen their samples.

GenoPipe: identifying the genotype of origin within (epi)genomic datasets
Olivia Lang, Divyanshi Srivastava, B. Frank Pugh, William KM Lai
bioRxiv 2023.03.14.532660; doi: https://doi.org/10.1101/2023.03.14.532660

38: Novel Cancer Driver Gene Identification by Inference of Selective Pressure on Biallelic Inactivation Events
Track: General Session
  • Octavia Maria Dancu, McGill University, Canada
  • Rached Alkallas, McGill University, Canada
  • Mathieu Lajoie, McGill University, Canada
  • Hamed Shateri Najafabadi, McGill University, Canada


Presentation Overview: Show

Identifying cancer-driver genes is an ongoing challenge in cancer research. The identification of previously unknown cancer-driver genes is critical both for advancing our understanding of the underlying cellular processes driving oncogenesis and for improving our cancer treatment strategies. Here, we describe a novel statistical model to systematically identify and analyze the patterns of co-occurrence and mutual exclusivity of gene-inactivating events to identify cancer driver genes. This model is based on the hypothesis that if a gene is “cancer-essential”, there is a negative selective pressure against biallelic inactivation of that gene. Conversely, if a gene is a tumor suppressor, that gene’s inactivation would be beneficial for the cancer cell, hence the co-occurrence of deleterious events such as damaging mutations and loss of heterozygosity (LOH) is more likely (positive selection for biallelic gene inactivation). Our model integrates mutation features such as mutation-impact assessment scores, mutation context, and copy number alteration (CNA) status, as well as tumor characteristics such as total mutation burden and mutation signature subtypes, to identify cancer type-specific and pan-cancer signatures of positive or negative selection of biallelic inactivation. Application of this model to mutation and CNA data from 8,719 tumor samples from The Cancer Genome Atlas (TCGA) has revealed a substantial number of putative cancer-driver genes, including known and novel candidates.
The cancer gene list emerging from this work has clinically actionable potential, providing novel targets for drug development and drug repurposing. This work also helps prioritize existing lists of candidate driver genes based on their likelihood of being true cancer drivers.

40: DiRLaM: Diversity-Regularized Autoencoder for Modeling Longitudinal Microbiome Data
Track: General Session
  • Derek Reiman, Toyota Technological Institute at Chicago, United States
  • Yang Dai, Univ. of Illinois at Chicago, United States


Presentation Overview: Show

Background: The human gut microbiome has been shown to impact host development and normal metabolic processes, as well as the pathogenesis of various diseases. Based on these discoveries, engineering the gut microbiome for the treatment of such diseases has become an exciting new direction in medical science. Uncovering how to precisely control a patient’s microbiome requires accurately modeling the dynamics of the microbiome community under varying conditions. However, modeling longitudinal microbiome data faces many challenges due to the inherent noise of microbiome data. Therefore, the development of robust and accurate models will empower the identification of microbiome-targeted therapies, as clinicians and researchers will be able to identify which factors and stimuli can be used to drive a patient’s microbiome to a healthier composition.

Method: Here we present DiRLaM, a deep-learning framework combining an autoencoder and deep neural network for modeling microbiome dynamics. By representing the microbiome community in a reduced latent space using an autoencoder, DiRLaM can capture the essential intrinsic community structure while making the model more robust to noise. Furthermore, DiRLaM interpolates microbiome communities within the learned latent space. In order to construct smooth transitions between different microbiome community samples, a novel regularization is applied to the Beta diversity of the observed and interpolated communities. Next, a deep neural network is trained to combine the latent microbiome community with additional information about the host and external stimuli to predict the microbiome community at the next time point. Lastly, using the trained models, DiRLaM identifies microbe-microbe interactions and significant host and external factors that contribute to the dynamic changes of the microbiome community structure.
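Beta diversity between two communities is commonly quantified with Bray-Curtis dissimilarity. The abstract does not state which measure DiRLaM regularizes, so this is only an assumed, illustrative choice:

```python
def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two abundance vectors:
    1 - 2*sum(min) / (sum(u) + sum(v)), ranging from 0 (identical
    communities) to 1 (no shared taxa)."""
    shared = sum(min(a, b) for a, b in zip(u, v))
    return 1 - 2 * shared / (sum(u) + sum(v))
```

A regularizer of this kind could penalize interpolated communities whose dissimilarity to the observed endpoints is inconsistent with a smooth transition.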

Results: Using synthetic datasets and three real-world longitudinal datasets, we show that DiRLaM provides a more robust interpolation under increasing levels of noise compared to standard B-Spline interpolations. DiRLaM also outperforms the state-of-the-art dynamic Bayesian network model for predicting subsequent microbiome communities in longitudinal data. Additionally, we demonstrate DiRLaM’s ability to identify significant host characteristics and environmental factors contributing to the dynamics of the microbiome community.

Conclusion: We present DiRLaM, a framework combining an autoencoder with a novel Beta diversity regularization and a deep neural network. In both synthetic and real-world conditions, DiRLaM was both more robust and more accurate when modeling longitudinal microbiome data.

42: Altered circadian rhythms in Luminal A breast tumors modulate chemotherapeutic targets, metastatic potential, and tumor prognosis
Track: General Session
  • Jan Hammarlund, University of Pennsylvania, United States


Presentation Overview: Show

The molecular circadian clock regulates thousands of genes in a tissue-specific manner. The influence of circadian time on cell division, and by extension cancer, is particularly strong. Data from both shift workers and model organisms suggest that circadian disruption can increase the risk of breast cancer. Individual tumors likely use distinct machinery to disrupt the normal circadian regulation of cell division. While chronomedicine promises to improve therapy, our inability to describe rhythms in specific human tissues and tumors has been a key barrier to clinical translation.

We modified CYCLOPS (CYClic-Ordering-by-Periodic-Structure), an established method for circadian data ordering, to better account for confounding variables. We combined RNA-seq data from 26 time-stamped clinical breast biopsy pairs with data from the Genotype-Tissue Expression (GTEx) project and The Cancer Genome Atlas (TCGA). For time-stamped, non-cancerous samples, the CYCLOPS ordering was well correlated with collection time. Predicted acrophases of core circadian genes were in good accord with known physiology. Cycling was observed in pathways related to inflammation, hormone responsiveness, and DNA repair.

Both co-expression analysis and experimental measures from patient-derived organoids suggested continued, albeit reduced, core clock rhythms in Luminal-A breast cancer samples. Application of CYCLOPS to Luminal-A data revealed disrupted rhythms, with some output pathways gaining and others losing rhythmicity. Epithelial-mesenchymal transition (EMT), a pathway critical to metastasis, demonstrated increased cycling. Among Luminal-A samples, there was marked variability in CYCLOPS magnitude, a composite measure of global circadian rhythm strength. Compared to tumors with lower rhythm strength, tumors with higher circadian magnitude demonstrated increased cycling of EMT pathway genes and a higher risk of metastasis (relative risk 1.8). Experiments with three-dimensional Luminal-A breast cancer cultures showed that circadian disruption following knockdown of the core clock gene ARNTL resulted in increased cell division but reduced matrix invasion and cellular spread.
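The reported relative risk compares metastasis incidence between the high- and low-magnitude tumor groups. The counts below are invented solely to show the arithmetic, not the study's data:

```python
def relative_risk(events_exposed, n_exposed, events_unexposed, n_unexposed):
    """Relative risk: incidence in the exposed group (here, tumors with
    high circadian magnitude) divided by incidence in the unexposed
    (low-magnitude) group."""
    return (events_exposed / n_exposed) / (events_unexposed / n_unexposed)

# Hypothetical example: 9/50 metastases vs. 5/50 gives a relative risk
# of (0.18 / 0.10) = 1.8.
```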

These findings demonstrate the clinical importance of determining subtype-specific circadian rhythms. We find rhythms in normal breast tissue that can guide chronotherapy to minimize local toxicity. Strikingly, high-magnitude molecular rhythms in Luminal-A tumors may predict distant metastasis while, in vitro, circadian disruption reduces tumor spread and migration.

44: Integrative single-cell genomic analysis identifies a new type of skeletal stem cells in bone marrow endosteum
Track: General Session
  • Jialin Liu, University of Michigan, United States
  • Yuki Matsushita, University of Texas Health Science Center at Houston School of Dentistry, United States
  • Angel Ka Yan Chu, University of Michigan, United States
  • Joshua D Welch, University of Michigan, United States
  • Noriaki Ono, University of Texas Health Science Center at Houston School of Dentistry, United States


Presentation Overview: Show

Skeletal stem cells (SSCs) provide an important source of bone-forming osteoblasts and support overall bone health. In the bone marrow of endochondral bones, chondrocyte-to-osteoblast transformation plays a significant role in bone formation in fetal life, while leptin receptor (LepR)-expressing perivascular bone marrow stromal cells (BMSCs) participate in bone formation mainly in adulthood and aging. However, the identity of SSCs during the transition, and how they coordinate active osteogenesis particularly at young ages, remain unclear. To identify putative SSCs in young bone marrow, we performed integrative single-cell genomic analyses of Prrx1-cre-marked BMSCs at young (P21) and old (18M) stages, with data integration by LIGER. Single-cell RNA-seq analyses revealed that a small group of cells with osteoblast-chondrocyte transitional (OCT) identities was abundant in young bone marrow and was predicted as a cell-of-origin of osteoblasts and reticular stromal cells by RNA velocity and CellRank. Analyses of the isogenic single-nucleus ATAC-seq dataset, summarized in a 3-dimensional simplex plot, revealed that cells in the OCT cluster demonstrated “trilineage” potential toward all three fates, predominantly toward osteoblast and reticular fates. Transcription factor (TF) binding motif enrichment analysis revealed that chromatin accessibility peaks in OCT cells are enriched for chondrocyte-related TF binding motifs but have lower levels of accessible osteoblast-related motifs, suggesting that the OCT stromal cells are still being regulated primarily by chondrocyte-related TFs, supporting their transitional identities. Subsequent validation by mouse transgenic lines revealed that these OCT stromal cells expressed fibroblast growth factor receptor 3 (Fgfr3), resided in the endosteal space, and robustly generated osteoblasts in homeostasis and regeneration.
Additionally, when isolated ex vivo, these Fgfr3+ stromal cells were highly enriched for skeletal stem cell activities, and single-cell-derived clones of these cells possessed serial transplantability. Therefore, our integrative single-cell genomic analysis identifies a new type of bone marrow stromal cells with osteoblast-chondrocyte transitional identities, establishing these cells as endosteal SSCs that are particularly abundant in young bones and coordinate active osteogenesis.

46: Potential impact of unmeasured transcription factor ChIP-seq data on human omics studies
Track: General Session
  • Saeko Tahara, Faculty of Medicine, University of Tsukuba, Japan
  • Haruka Ozaki, Bioinformatics Laboratory, Faculty of Medicine, University of Tsukuba, Japan


Presentation Overview: Show

The ongoing accumulation of omics measurements in public databases provides opportunities to systematically generate and verify biochemical and molecular hypotheses for a comprehensive understanding of biological phenomena. Among omics measurements, collections of transcription factor (TF) ChIP-seq data have been utilized for enrichment analysis and data mining to estimate regulatory factors for differentially expressed genes (DEGs). Additionally, statistical analysis and machine learning have also used these ChIP-seq data as training data to estimate candidate TFs that regulate specific gene expression patterns and that are potentially associated with disease-related variations.
The human genome comprises tens of thousands of genes, but research attention on human genes is imbalanced and concentrates on a few specific genes [1]. Indeed, a similar trend applies to TFs: our preliminary survey found that only 10% of TFs account for half of all TF-related publications in PubMed. This prompted us to speculate that imbalanced research attention also applies to TF ChIP-seq experiments, leading to imbalanced ChIP-seq data among TFs, impairing the accuracy of downstream analyses, and hindering the discovery of new regulatory TFs.
Here, using a large-scale dataset of human TF ChIP-seq, RNA-seq, knowledge of cell-type-specific marker genes, and literature data, we systematically investigated the prevalence of measured and "unmeasured" ChIP-seq across different TFs and tissue/cell types. First, we found that 25-50% of ChIP-seq samples (combinations of TFs and cell types) were unmeasured even though the TFs were sufficiently expressed in the corresponding cell types. Next, we analyzed DEGs from knockout/knockdown experiments and a literature-based database of marker genes, and identified unmeasured but potentially functional combinations of TFs and cell types. Subsequently, we estimated that a certain proportion of the unmeasured ChIP-seq samples were likely to be functionally relevant. Moreover, we investigated the effect of these unmeasured combinations of TFs and cell types. Based on these results, we will discuss the potential impacts of unmeasured TF ChIP-seq data on omics studies and the prioritization of unmeasured TF ChIP-seq experiments.
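The prevalence analysis described above reduces, at its core, to a set computation over TF/cell-type pairs. A minimal sketch, with invented TF and cell-type names purely for illustration:

```python
# Hypothetical illustration of the coverage calculation: given the TFs
# expressed in each cell type and the (TF, cell type) pairs that already
# have a ChIP-seq sample, report the unmeasured fraction.
# All names and numbers below are invented for this sketch.

def unmeasured_fraction(expressed, measured):
    """expressed: dict cell_type -> set of TFs expressed there;
    measured: set of (tf, cell_type) pairs with existing ChIP-seq data."""
    candidates = {(tf, ct) for ct, tfs in expressed.items() for tf in tfs}
    unmeasured = candidates - measured
    return len(unmeasured) / len(candidates), unmeasured

expressed = {"HepG2": {"FOXA1", "HNF4A", "TP53"},
             "K562": {"GATA1", "TAL1", "TP53"}}
measured = {("FOXA1", "HepG2"), ("GATA1", "K562"), ("TP53", "HepG2")}

frac, missing = unmeasured_fraction(expressed, measured)
print(f"{frac:.2f} of expressed TF/cell-type pairs lack ChIP-seq")  # 0.50
```

In the actual study the candidate set is of course defined from RNA-seq expression thresholds rather than hand-picked sets.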

[1] Stoeger, Thomas, Martin Gerlach, Richard I. Morimoto, and Luís A. Nunes Amaral. 2018. “Large-Scale Investigation of the Reasons Why Potentially Important Genes Are Ignored.” PLoS Biology 16 (9): e2006643.

50: Characterizing the targets of transcription regulators by aggregating ChIP-seq and perturbation expression data sets
Track: General Session
  • Alexander Morin, University of British Columbia, Canada
  • Eric Chu, University of British Columbia, Canada
  • Aman Sharma, University of British Columbia, Canada
  • Alex Adrian-Hamazaki, University of British Columbia, Canada
  • Paul Pavlidis, University of British Columbia, Canada


Presentation Overview: Show

Mapping the gene targets of chromatin-associated transcription regulators (TRs) is a major goal of genomics research. ChIP-seq of TRs, and experiments that perturb a TR and measure the differential abundance of gene transcripts, are the primary means by which direct regulatory relationships are tested on a genomic scale. It has been reported that there is poor overlap in the evidence across gene regulation strategies, emphasizing the need to integrate results from multiple experiments. While research consortia interested in gene regulation have produced a valuable trove of high-quality data, there is an even greater volume of TR-specific data throughout the literature. In this study, we demonstrate a workflow for the identification, uniform processing, and aggregation of ChIP-seq and TR perturbation experiments for the ultimate purpose of ranking human and mouse TR-target interactions. Focusing on an initial set of eight regulators (ASCL1, HES1, MECP2, MEF2C, NEUROD1, PAX6, RUNX1, and TCF4), we identified 497 experiments suitable for analysis. We used this corpus to examine data concordance, to identify systematic patterns of the two data types, and to identify putative orthologous interactions between human and mouse. We build upon commonly used strategies to put forward a procedure for aggregating and combining these two genomic methodologies, assessing the resulting rankings against independent literature-curated evidence. Beyond a framework extensible to other TRs, our work also provides empirically ranked TR-target listings, as well as transparent experiment-level gene summaries for community use.
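One commonly used aggregation strategy of the kind the abstract builds upon is mean-rank aggregation: rank candidate targets within each experiment, then order genes by their average rank across experiments. The sketch below illustrates that idea only; the scores and gene names are invented, and the authors' actual procedure may differ:

```python
# Toy mean-rank aggregation across heterogeneous experiments.
# Each experiment is a dict gene -> evidence score (higher = stronger).

def aggregate_rankings(experiments):
    mean_ranks = {}
    for exp in experiments:
        ordered = sorted(exp, key=exp.get, reverse=True)
        for rank, gene in enumerate(ordered, start=1):
            mean_ranks.setdefault(gene, []).append(rank)
    # Genes absent from an experiment are averaged over the ones they appear in.
    return sorted(mean_ranks, key=lambda g: sum(mean_ranks[g]) / len(mean_ranks[g]))

chip = {"DLL3": 3.1, "HES6": 2.4, "GAPDH": 0.2}      # e.g. ChIP-seq binding scores
perturb = {"DLL3": 5.0, "HES6": 1.1, "GAPDH": 0.3}   # e.g. |log fold change|
print(aggregate_rankings([chip, perturb]))  # ['DLL3', 'HES6', 'GAPDH']
```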

52: Type III IFN driven ADAR editing during Human Norovirus infections
Track: General Session
  • Caroline Nitiraharjo, Kent State University, United States
  • Violet Hutchison Goldinger, Kent State University, United States
  • Sarah Melen, Kent State University, United States
  • Dr. Helen Piontkivska, Kent State University, United States


Presentation Overview: Show

Viral gastroenteritis due to human norovirus (HuNoV) remains prevalent globally and appears innocuous, as typical cases clear within three days in healthy adults in industrialized countries. Yet it severely and disproportionately impacts the young, the elderly, and the immunocompromised, who risk recurring infections or death. During many viral infections the host innate immune response is driven primarily by type I interferons (IFNs) and may result in unintended dysregulation of ADAR (adenosine deaminases acting on RNA) editing of key host transcripts, with subsequent neurological and otherwise unascribed symptoms. However, it is not known whether HuNoV infection, which activates primarily the type III IFN response, is also capable of eliciting changes in ADAR editing patterns. Thus, we examined patterns of ADAR editing using RNA-seq data from HuNoV infection (from the Lin et al. 2020 study). The results showed not only overexpression of ADAR genes in infected cells, but also shifts in editing patterns across multiple host transcripts, including differences in editing from early to later stages of the infection. Interestingly, in addition to expected changes in genes relevant to the immune response, there were also changes in genes involved in myogenesis, the nervous system, and amyloid fiber formation, suggesting that the consequences of HuNoV infection may spread beyond the gastrointestinal system. Overall, our findings showed that the type III IFN response elicited by HuNoV is capable of inducing significant changes in ADAR editing patterns of host transcripts, with potentially far-reaching consequences for patients' health. Considering the long-term risks, transmissibility, and prevalence of HuNoV, we urge the development of stronger prevention measures and further studies of HuNoV risks.
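The basic quantity behind ADAR editing analyses of this kind is the A-to-G editing ratio at a candidate site. A minimal sketch with invented read counts (real pipelines compute this from aligned RNA-seq reads, with filtering for SNPs and sequencing error, which is omitted here):

```python
# Toy A-to-G editing ratio at a genomically adenosine position.
# ADAR deaminates A to inosine, which the sequencer reads as G.

def editing_ratio(base_counts):
    """base_counts: dict base -> read count at the site."""
    a, g = base_counts.get("A", 0), base_counts.get("G", 0)
    return g / (a + g) if a + g else 0.0

mock_site = {"A": 70, "G": 30, "C": 1}  # invented pileup counts
print(editing_ratio(mock_site))  # 0.3
```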

54: Endophenotypes and prognostic model of hospitalized SARS-CoV-2 positive participants of Biobanque québécoise de la COVID-19
Track: General Session
  • Antoine Soulé, McGill University, Canada
  • William Ma, McGill University, Canada
  • Simon Rousseau, McGill University, Canada
  • Amin Emad, McGill University, Canada


Presentation Overview: Show

A minority of people infected with SARS-CoV-2 will develop severe COVID-19, sometimes without any common risk factors for COVID-19 severity. To help physicians detect patients who are more likely to require intensive care, we conducted an unsupervised stratification of the circulating proteome of 731 SARS-CoV-2 PCR-positive hospitalized participants in the Biobanque québécoise de la COVID-19. We identified six endophenotypes (EPs) associated with varying degrees of disease severity and need for intensive care. One endophenotype, EP6, was associated with a greater proportion of intensive care unit (ICU) admission, ventilation support, acute respiratory distress syndrome (ARDS), and death. Clinical features of EP6 included increased levels of C-reactive protein, D-dimers, interleukin-6, ferritin, and soluble fms-like tyrosine kinase-1, elevated neutrophils, and depleted lymphocytes, whereas another endophenotype (EP5) was associated with cardiovascular complications, congruent with elevated blood biomarkers of cardiovascular disease such as N-terminal pro B-type natriuretic peptide (NT-proBNP), Growth Differentiation Factor-15 (GDF-15), and Troponin T. Importantly, a prognostic model based solely on clinical laboratory measurements was developed and validated on another group of 903 patients. This prognostic model generalizes the EPs to new patients using only standard blood test results, and creates new opportunities for automated identification of high-risk groups in the clinic. This novel way of addressing pathogenesis leverages detailed phenotypic information but ultimately relies on routinely available information in the clinic. Furthermore, it may find applications in other diseases beyond COVID-19.

A circulating proteome-informed prognostic model of COVID-19 disease activity that relies on routinely available clinical laboratories
William Ma, Antoine Soulé, Karine Tremblay, Simon Rousseau, Amin Emad
medRxiv 2022.11.02.22281834; doi: https://doi.org/10.1101/2022.11.02.22281834

Molecular pathology of acute respiratory distress syndrome, mechanical ventilation and abnormal coagulation in severe COVID-19
Antoine Soulé, William Ma, Katelyn Yixiu Liu, Catherine Allard, Salman Qureshi, Karine Tremblay, Amin Emad, Simon Rousseau
medRxiv 2023.03.09.23286797; doi: https://doi.org/10.1101/2023.03.09.23286797

56: Classifying the Post-Duplication Fate of Paralogous Genes
Track: General Session
  • Reza Kalhor, Department of Computer Science, Université de Sherbrooke, Sherbrooke, Canada, Canada
  • Guillaume Beslon, Université de Lyon, INSA-Lyon, INRIA, CNRS, LIRIS UMR5205, Lyon, France, France
  • Manuel Lafond, Department of Computer Science, Université de Sherbrooke, Sherbrooke, Canada, Canada
  • Celine Scornavacca, Institut des Sciences de l’Evolution de Montpellier (Université de Montpellier, CNRS, IRD, EPHE), Montpellier, France, France


Presentation Overview: Show

Gene duplication is one of the main drivers of evolution. It is well known that copies arising from duplication can undergo multiple evolutionary fates, but little is known about their relative frequencies, or about how environmental conditions affect them. In this paper we provide a general framework to characterize the fate of duplicated genes and formally differentiate the different fates. To test our framework, we simulate the evolution of populations using aevol, an in silico experimental evolution platform. When classifying the resulting duplications, we observe several patterns that, in addition to confirming previous studies, exhibit new tendencies that may open up new avenues for a better understanding of the role of duplications.

58: Modeling and Simulation of Cancer Evolution in Single Cells
Track: General Session
  • Judah Engel, Columbia University, United States
  • Khanh Dinh, Columbia University, United States
  • Simon Tavare, Columbia University, United States


Presentation Overview: Show

Single-cell DNA sequencing technology has the potential to facilitate an improved understanding of tumor evolution by elucidating the mechanisms underlying mutational and copy number processes. Recent advances have been made in this field with the development of a new sequencing technology called Direct Library Preparation+ (DLP+) [1]. Single-cell DNA sequencing methods like DLP+ are ideal for studying clonal evolution in cancers exhibiting intra-tumor heterogeneity, in contrast with bulk sequencing methods. We constructed a stochastic model of ovarian cancer that takes in several parameters and produces simulated single-cell sequencing data, along with a phylogeny detailing the evolution of subclones and their copy number variants. These parameters include known variables such as cell turnover rate and population size, along with unknown parameters such as the rates of certain mutation and copy number aberration classes (whole genome duplications, missegregations, focal amplifications, etc.) and the selection rates of specific genotypes, each of which can be described using a probability distribution. We attempted to fit the distributions of the unknown parameters using a couple of methods. Conventional likelihood-based methods are intractable due to the complexity of the model, so we primarily relied on numerical approaches such as Approximate Bayesian Computation (ABC) [2] to calculate posterior distributions for the parameters. We used a Euclidean distance metric as our summary statistic for traditional ABC before turning to a variant of ABC that utilizes a random forest [3] to circumvent the need for an explicitly defined summary statistic. We are currently working to integrate data on clonal ancestry into the inference method using deep learning techniques [4,5,6]. Modeling cancer is potentially valuable because it can reveal mechanistic features of its evolution.
Understanding the complex pathological mechanisms of cancer evolution could improve our ability to design treatments to help those suffering from this disease and others like it.
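The rejection-ABC idea underlying the inference described above can be shown with a deliberately tiny toy model (not the authors' tumor simulator): draw parameters from a prior, simulate data, and keep the draws whose summary statistic lands close to the observed one.

```python
# Toy rejection ABC. The "simulator" here is a stand-in: it counts events
# in n cell divisions, each occurring with probability `rate`.
import random

def simulate(rate, n=200, rng=random):
    return sum(rng.random() < rate for _ in range(n))

def rejection_abc(observed, prior, n_draws=5000, tol=5, seed=1):
    rng = random.Random(seed)
    accepted = []
    for _ in range(n_draws):
        rate = prior(rng)
        if abs(simulate(rate, rng=rng) - observed) <= tol:
            accepted.append(rate)   # this draw is consistent with the data
    return accepted                 # a sample from the approximate posterior

# Observed: 40 events out of 200 divisions, so the true rate is near 0.2.
post = rejection_abc(observed=40, prior=lambda r: r.uniform(0, 0.5))
print(len(post), sum(post) / len(post))  # posterior mass concentrates near 0.2
```

The random-forest variant [3] replaces the hand-chosen distance/summary with a forest trained on simulated (parameter, data) pairs, removing the need to define the summary statistic explicitly.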

Works Cited:
[1] Laks, E., McPherson, A., et al. (2019). Clonal decomposition and DNA replication states defined by scaled single-cell genome sequencing. Cell, 179(5), 1207–1221.e22.
[2] Tavaré, S., Balding, D. J., Griffiths, R. C., & Donnelly, P. (1997). Inferring coalescence times from DNA sequence data. Genetics, 145(2), 505–518.
[3] Raynal, L., Marin, J.-M., Pudlo, P., Ribatet, M., Robert, C. P., & Estoup, A. (2019). ABC random forests for Bayesian parameter inference. Bioinformatics, 35(10), 1720–1728. https://doi.org/10.1093/bioinformatics/bty867
[4] Ramesh, P., Lueckmann, J.-M., Boelts, J., Tejero-Cantero, A., Greenberg, D., Gonçalves, P., & Macke, J. (2022). GATSBI: Generative adversarial training for simulation-based inference.
[5] Koch, G. R. (2015). Siamese neural networks for one-shot image recognition. ICML Deep Learning Workshop, vol. 2.
[6] Greenberg, D., Nonnenmacher, M., & Macke, J. (2019). Automatic posterior transformation for likelihood-free inference. Proceedings of the 36th International Conference on Machine Learning, PMLR 97, 2404–2414.

60: Deep Learning on Genetic Data with Diet Networks for Complex Phenotype Prediction
Track: General Session
  • Camille Rochefort-Boulanger, University of Montreal, Montreal Heart Institute, Mila, Canada
  • Matthew Scicluna, University of Montreal, Montreal Heart Institute, Mila, Canada
  • Léo Choiniere, University of Montreal, Canada
  • Jean-Christophe Grenier, Montreal Heart Institute, Canada
  • Raphaël Poujol, Montreal Heart Institute, Canada
  • Pierre-Luc Carrier, Mila, Canada
  • Yoshua Bengio, University of Montreal, Mila, Canada
  • Julie Hussin, University of Montreal, Montreal Heart Institute, Canada


Presentation Overview: Show

Diet Networks (DN) is a deep learning approach proposed to accommodate the large number of genetic variants used as features in genomic prediction tasks, which would otherwise cause overfitting. The DN architecture is composed of an auxiliary network, trained on a task-relevant representation of the genetic variants simultaneously with a main network dedicated to the prediction task, to reduce the number of free parameters in the main network and prevent overfitting. Given the heterogeneity of genomic data collection protocols and the high amount of missing data in genomic datasets, we evaluated the generalization capability of the DN, previously trained on the 1000 Genomes Project (1KGP) dataset for a population stratification task, in the independent CARTaGENE dataset, the biobank of the province of Quebec. Our results show that, in addition to generalizing its predictions to an independent dataset, the DN can also generalize to a population never seen during training, the French-Canadian population found in CARTaGENE but not in the 1KGP dataset.

The DN approach was tested for the prediction of complex phenotypes, including height and obesity, using White British participants from the UK Biobank. In a classification task distinguishing normal-weight from obese individuals, the DN reaches a test-set accuracy comparable to that achieved with current polygenic risk scores based on linear models. Providing the genotype frequencies per class as a representation of genetic variants to the auxiliary network has proven effective in the classification tasks of population stratification and obesity prediction mentioned above, and we are currently exploring new representations of genetic variants adapted for regression tasks. We find that, in the context of height prediction, genetic variant representations based on summary statistics derived from genome-wide association studies are promising for improving DN performance. The generalization capability of the DN and the results obtained for obesity and height prediction show the DN's potential to handle large numbers of genetic variants, and are a first step towards the integration of data from other sources, such as other 'omics' modalities, clinical information and environmental factors, to improve the prediction of complex phenotypes.
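The parameter-reduction argument above can be made concrete with a back-of-the-envelope count, using invented sizes: a main network whose first layer maps d variants to h hidden units learns d*h free weights directly, whereas an auxiliary network that predicts each variant's h-dimensional weight vector from an r-dimensional variant representation (e.g. per-class genotype frequencies) needs only on the order of r*h parameters.

```python
# Invented sizes for illustration: d variants, h hidden units,
# r-dimensional variant representation fed to the auxiliary network.
d, h, r = 300_000, 100, 32

direct_params = d * h   # first-layer weights learned directly by the main net
aux_params = r * h      # auxiliary net that emits those weights instead

print(direct_params, aux_params)   # 30000000 3200
print(direct_params // aux_params) # 9375: vastly fewer free parameters
```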

62: Adapting language models to explore the multidomain protein universe
Track: General Session
  • Xiaoyue Cui, Carnegie Mellon University, United States
  • Maureen Stolzer, Carnegie Mellon University, United States
  • Dannie Durand, Carnegie Mellon University, United States


Presentation Overview: Show

Multidomain proteins are mosaics of structural or functional modules, called domains. The architecture of a multidomain protein - that is, its domain composition in N- to C-terminal order - is intimately related to its function, with each module playing a distinct functional role. For example, in cell signaling proteins, distinct domains are responsible for recognition and response to a stimulus. Multidomain architectures evolve via gain and loss of domain-encoding segments. This evolutionary exploration of domain architecture composition underlies the protein diversity seen in nature.
We present a framework based on information retrieval and natural language processing-inspired models for exploring the varied composition of domain architectures. Domain architectures are represented as vectors in a multidimensional space. Distances in this space quantify the relationship between domain architectures and can be extended to set-wise distances for the quantitative comparison of two sets of domain architectures. Our framework has many applications, including investigating taxonomic differences in the domain architecture complement and testing domain architecture simulators by assessing how well simulated domain architectures recapitulate properties of genuine ones. Here, we apply this framework to investigate the constraints on the formation of domain combinations. Only a tiny fraction of all possible domain combinations is observed in nature, suggesting that domain order and co-occurrence are highly constrained, but these constraints are poorly understood. We introduce a null model that generates architectures whose properties deviate from those of genuine domain architectures. Comparing the properties of domain architectures that do and do not occur in nature may shed light on the design rules of multidomain architecture composition.
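A minimal sketch of the vector-space idea, assuming a simple bag-of-domains embedding with cosine distance (the embedding actually used in the work may well differ):

```python
# Each architecture (an N- to C-terminal list of domains) becomes a count
# vector over a shared domain vocabulary; cosine distance then quantifies
# how related two architectures are. Domain names are illustrative.
import math

def to_vector(architecture, vocab):
    return [architecture.count(d) for d in vocab]

def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1 - dot / norm

a = ["SH3", "SH2", "Pkinase"]   # e.g. a Src-like signaling architecture
b = ["SH2", "Pkinase"]          # the same minus the SH3 module
vocab = sorted(set(a) | set(b))
print(round(cosine_distance(to_vector(a, vocab), to_vector(b, vocab)), 3))  # 0.184
```

Set-wise comparisons, as mentioned above, then reduce to aggregating such pairwise distances between two collections of architecture vectors.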

64: Identifying efficacious dietary additives to improve poultry gut health
Track: General Session
  • Maxine Ty, University of Toronto, Canada
  • John Parkinson, University of Toronto, Canada


Presentation Overview: Show

Background: Antibiotic growth promoters (AGPs) are commonly used within poultry production to improve feed conversion and bird growth, and to reduce morbidity and mortality from clinical and subclinical diseases. The overuse of AGPs in livestock production has been linked to the ability of pathogens to acquire novel antimicrobial resistance mechanisms, highlighting an urgent need to identify alternatives capable of promoting the development of a healthy microbiome. To address the global bans on AGPs, the poultry industry has explored the use of probiotics, natural microbes that confer a beneficial effect on their host, to mitigate infections associated with food safety and food security.

Ty et al. (Animal Microbiome, 2022, 4(2)) compared the efficacy of four microbial interventions (Pediococcus acidilactici, Saccharomyces cerevisiae boulardii, and the complex consortia Aviguard and CEL) to bacitracin, a commonly used AGP, in modulating the gut microbiota and subsequently impacting Campylobacter jejuni infection in poultry. While the different treatments did not significantly decrease C. jejuni burden relative to untreated controls, both complex consortia resulted in significant decreases relative to treatment with bacitracin. Analysis of 16S rDNA profiles from cecal content samples revealed a distinct microbial signature associated with each treatment. For example, Aviguard and CEL increased the relative abundance of Bacteroidaceae and Rikenellaceae, respectively, both major producers of short-chain fatty acids (SCFAs), key molecules involved in host homeostasis and disease states.

In the second phase of this research, we compared microbial interventions (Aviguard, Bacillus pumilus, Saccharomyces cerevisiae boulardii) in poultry by employing 16S rRNA surveys and metatranscriptomics to analyze community composition and function. Gene expression profiles of the gut microbiome will be generated for each sample, detailing the microbial genes present and their relative expression levels. From these, we plan to identify upregulated enzymes in the metabolic pathways that produce SCFAs.

Significance: Ultimately this research will lay the foundation for designing next-generation products to replace AGPs. The application of effective alternatives in poultry production will ultimately improve bird health, maintain food security and safety, and reduce economic loss in the industry.

66: RobusTAD: nonparametric test detects hierarchical topologically associating domains
Track: General Session
  • Yanlin Zhang, McGill University, Canada
  • Rola Dali, McGill University, Canada
  • Mathieu Blanchette, McGill University, Canada


Presentation Overview: Show

Topologically associating domains (TADs) are fundamental in forming hierarchically organized 3D genomes and facilitating cellular functions. Many TAD annotation tools have been proposed; however, identifying TAD boundaries and hierarchies at high resolution remains challenging [1,2]. Most algorithms use only the study sample to annotate TADs. Recently, we proposed RefHiC [3] for topological structure annotation, which overcomes the data sparsity issue by augmenting the input with a panel of reference Hi-C samples; however, it cannot detect TAD hierarchies. Here, we introduce RobusTAD, a set of TAD annotation algorithms that provide accurate and robust TAD annotation at high resolution. RobusTAD detects TADs from the study sample alone, while RobusTAD-LMCC improves TAD boundary annotations by leveraging external Hi-C data and achieves superior performance by exploiting locally matched chromosome conformations (LMCC).

RobusTAD takes a Hi-C matrix as input and calls TADs in three steps: (i) study-sample-based TAD boundary identification; (ii) reference-panel-based refinement of TAD boundaries; and (iii) pairing of left and right boundaries into a nested domain hierarchy. Study-sample-based boundary identification seeks local maxima in a 1D track of TAD boundary scores. RobusTAD assigns left and right TAD boundary scores to each locus by performing a genomic-distance-stratified rank-sum test between upstream/downstream inter- and intra-domain interactions. RobusTAD-LMCC refines boundary calls made on the study sample by utilizing a panel of reference Hi-C samples. For a given candidate TAD boundary at position p, we define the locally matched chromosome conformations LMCC(p) as the collection of Hi-C samples in which a TAD boundary occurs within 25 kb of p. It then computes refined boundary scores for the 50 kb region by averaging boundary scores from LMCC(p) and the study sample itself; the position that reaches the maximum score is the final high-resolution boundary prediction. Last, RobusTAD assembles TADs by pairing left and right boundary candidates using a dynamic programming algorithm that maximizes the sum of TAD scores. RobusTAD computes the TAD score as the distance-stratified rank-sum statistic of intra- versus inter-domain interactions in both the upstream and downstream directions. As sub-TADs inflate the TAD score, we exclude any sub-TADs from the TAD score calculation. We designed our dynamic programming algorithm under the assumption that TADs are fully nested or disjoint; it is guaranteed to produce the globally optimal TAD hierarchy without partially overlapping TADs.
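The boundary-scoring idea can be sketched with a plain (unstratified) rank-sum statistic; RobusTAD's actual score additionally stratifies by genomic distance, which is omitted here. The contact counts below are invented:

```python
# Simplified rank-sum boundary score: positive when intra-domain contact
# counts tend to exceed the inter-domain counts crossing the boundary.

def rank_sum_score(intra, inter):
    pooled = sorted(intra + inter)
    positions = {}
    for i, v in enumerate(pooled, start=1):
        positions.setdefault(v, []).append(i)
    avg_rank = {v: sum(p) / len(p) for v, p in positions.items()}  # ties -> mean rank
    r_intra = sum(avg_rank[v] for v in intra)
    n, m = len(intra), len(inter)
    expected = n * (n + m + 1) / 2           # E[rank sum] under no difference
    return (r_intra - expected) / (n * m / 2)  # scaled into [-1, 1]

intra = [9, 7, 8, 6]   # toy contacts within the candidate domain
inter = [2, 3, 1, 4]   # toy contacts crossing the candidate boundary
print(rank_sum_score(intra, inter))  # 1.0: a clean boundary
```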

We compared the performance of RobusTAD and RobusTAD-LMCC to 13 other TAD callers. Our results indicate that RobusTAD-LMCC and RobusTAD are the most accurate TAD callers, as both are among the top five TAD callers in all accuracy comparisons.

References
[1] Lee, D. I., & Roy, S. (2021). GRiNCH: simultaneous smoothing and detection of topological units of genome organization from sparse chromatin contact count matrices with matrix factorization. Genome biology, 22(1), 164.
[2] Rao, S. S., Huntley, M. H., Durand, N. C., Stamenova, E. K., Bochkov, I. D., Robinson, J. T., ... & Aiden, E. L. (2014). A 3D map of the human genome at kilobase resolution reveals principles of chromatin looping. Cell, 159(7), 1665-1680.
[3] Zhang, Y., & Blanchette, M. (2022). Reference panel guided topological structure annotation of Hi-C data. Nature Communications, 13(1), 7426.

68: Bridging the modeling gap: Accelerating Complex Disease Drug Discovery using Integrative Quantitative Pathway Analysis between Human Subjects and Cellular Models
Track: General Session
  • Pourya Naderi Yeganeh, Beth Israel Deaconess Medical Center/ Harvard Medical School, United States
  • Sang Su Kwak, Massachusetts General Hospital/ Harvard Medical School, United States
  • Mehdi Jorfi, Massachusetts General Hospital/ Harvard Medical School, United States
  • Katjuša Koler, University of Sheffield, United Kingdom
  • Luisa Quinti, Massachusetts General Hospital/ Harvard Medical School, United States
  • Djuna von Maydell, Massachusetts Institute of Technology, United States
  • Younjung Choi, Massachusetts General Hospital/ Harvard Medical School, United States
  • Joseph Park, Massachusetts General Hospital/ Harvard Medical School, United States
  • Murat Cetinbas, Massachusetts General Hospital, United States
  • Ruslan Sadreyev, Massachusetts General Hospital, United States
  • Rudolph Tanzi, Massachusetts General Hospital/ Harvard Medical School, United States
  • Doo Yeon Kim, Massachusetts General Hospital/ Harvard Medical School, United States
  • Winston Hide, Beth Israel Deaconess Medical Center/ Harvard Medical School, United States


Presentation Overview: Show

Complex diseases are highly challenging to combat, partly due to the interplay of molecular cascades involved in disease pathogenesis. Cellular models of disease offer great potential for exploring biological mechanisms and testing drug targets, and are built to recapitulate high-level phenotypes and disease pathology. Yet several clinical trials for complex diseases have failed despite successful preclinical validation in cellular and animal models, and there is currently no approach to systematically assess how well the molecular profiles of disease pathogenesis are recapitulated in models. Comparing human and model transcriptomes is attractive, but integrative study of gene expression is typically confounded by cross-platform and species-specific effects. We have developed a systems approach that better integrates transcriptomes from cell models and primary human tissues.

To determine how well a modelled disease mechanism matches the actual human disease, we have developed integrated quantitative pathway analysis (iQPA), which both captures and interrogates the degree to which disease functions constructed in models match those found in common across hundreds of diseased human brains. Using annotated pathway databases and a non-parametric approach, iQPA transforms gene expression into a series of quantifiable pathway activities. These pathway activities are analyzed using linear models to define functional dysregulation. In turn, iQPA leverages dysregulation events to identify and assess the consistency of functional recapitulation between human and model.

We demonstrate the utility of iQPA applied to Alzheimer's disease (AD). Brain transcriptomic datasets sampled from different brain regions of three independent cohorts, as well as multiple cell models of AD, were integrated to determine high-fidelity therapeutic target pathways. iQPA found a high correlation (r = 0.84) of pathway dysregulation between distinct brain regions, whereas gene-based analysis uncovered a significantly lower correlation (r = 0.51). It determined, in an unbiased manner, which cellular models most closely recapitulate human dysregulation events. iQPA identified 83 commonly dysregulated core pathways with consistent dysregulation across human brains and the most relevant cell model. The p38 MAPK pathway is the top core pathway shared between AD brains and the relevant AD cellular models. To explore its therapeutic potential, we applied a clinical p38 MAPK inhibitor, which dramatically ameliorated Aβ-induced tau pathology and neuronal death in 3D-differentiated human neurons. iQPA accelerates AD drug discovery by systematically identifying dysregulated core pathway activities, providing robust, validated targets that attenuate AD pathology.
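One simple non-parametric transform of gene expression into a pathway activity, in the spirit of the approach described above (the authors' exact transform may differ), is the normalized mean rank of a pathway's member genes within a sample:

```python
# Hedged sketch: rank genes by expression within one sample, then score a
# pathway by the mean rank of its members, normalized to [0, 1].
# Expression values and pathway membership below are invented.

def pathway_activity(expression, pathway):
    ordered = sorted(expression, key=expression.get)
    rank = {g: i + 1 for i, g in enumerate(ordered)}
    members = [g for g in pathway if g in rank]
    return sum(rank[g] for g in members) / (len(members) * len(expression))

sample = {"MAPK14": 9.1, "MAPK11": 8.7, "ACTB": 12.0, "GAPDH": 11.5, "TP53": 3.2}
p38_pathway = ["MAPK14", "MAPK11"]  # illustrative member list only
print(round(pathway_activity(sample, p38_pathway), 2))  # 0.5
```

Such per-sample activities can then be compared across conditions with linear models, as the abstract describes, to call dysregulation events.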

72: Forkhead transcription factors diversify their DNA-binding targets via differential abilities to engage inaccessible chromatin
Track: General Session
  • Shaun Mahony, The Pennsylvania State University, United States


Presentation Overview: Show

While transcription factors (TFs) are central to the establishment of cell fates, we still know little about how cell-specific TF regulatory activities result from the interplay between a TF's sequence preference and cell-specific chromatin environments. To understand the determinants of TF DNA-binding specificity, we need to examine how newly activated TFs interact with sequence and preexisting chromatin landscapes to select their binding sites. Here, we present a principled approach to model the sequence and preexisting chromatin determinants of TF binding. Specifically, we have developed a neural network that jointly models sequence and prior chromatin data to interpret the binding specificity of TFs that have been induced in well-characterized chromatin environments. The network architecture allows us to quantify the degree to which sequence and prior chromatin features explain induced TF binding, both at individual sites and genome-wide.
Here, we apply our approach to characterize differential binding activities across a selection of Forkhead-domain TFs when each is expressed in mouse embryonic stem cells. We show that the preexisting chromatin landscape is an important determinant of differential Fox TF binding specificity. While all share similar Forkhead DNA-binding domains with the prototypical “pioneer” factor FoxA1, paralogous Fox TFs display a wide range of abilities to engage relatively inaccessible chromatin. Thus, despite having similar DNA-binding preferences, paralogous Fox TFs can bind to different DNA targets, and drive differential gene expression patterns, even when expressed in the same chromatin environment. We propose that modifying preferences for preexisting chromatin states is an important strategy by which evolution enables the functional diversification of paralogous TFs.

74: C-less is K-more: k-mers as an alternative to gene expression in low-coverage RNA-seq
Track: General Session
  • Carl Munoz, University of Montreal, Canada
  • Sébastien Lemieux, Insitut de recherche en immunologie et en cancérologie (IRIC), Canada


Presentation Overview: Show

Standard RNA-seq protocols require a certain amount of coverage, and the resulting reads are then aligned to a reference transcriptome. However, lowering sequencing coverage can reduce costs and allow for more replicates, while reference-free methods, such as k-mer counts, can capture raw sequencing information not limited to canonical transcripts. While data pre-treatment is identical in both methods, there currently exists no approach combining them in the context of bulk RNA-seq. In this study, TCGA RNA-seq data are subsampled at various levels to simulate lower sequencing coverage. At each coverage level, gene expression and k-mer counts are quantified and then compared against each other following variance filtration and t-SNE dimensionality reduction. The quality of the clustering of these data is quantified with a silhouette score and inspected visually. These data will also be fed through a neural network to assess their capacity for cancer type prediction. We found that, at all coverage levels, filtering k-mers on variance significantly improves clustering between cancer types, and that k-mers performed as well as, if not better than, gene expression. K-mers are also expected to perform similarly to gene expression for cancer type prediction. These results show that we can significantly reduce sequencing coverage while still attaining a level of performance similar to gene expression, allowing for potential new RNA-seq standards.
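The variance-filtration step described above can be sketched as a simple feature selection: keep only the k-mers whose counts vary most across samples before dimensionality reduction. The toy count table is invented:

```python
# Toy variance filter over a k-mer count table (feature -> counts per sample).

def top_variance_features(table, keep):
    def var(xs):
        mu = sum(xs) / len(xs)
        return sum((x - mu) ** 2 for x in xs) / len(xs)
    return sorted(table, key=lambda f: var(table[f]), reverse=True)[:keep]

kmer_counts = {
    "ACGTA": [120, 118, 119, 121],  # flat across samples -> uninformative
    "TTGCA": [5, 200, 8, 190],      # varies strongly between sample groups
    "GGGCC": [40, 42, 160, 155],    # also varies between sample groups
}
print(top_variance_features(kmer_counts, keep=2))  # ['TTGCA', 'GGGCC']
```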

76: Investigating the Influence of Heterophily on Cell Type Prediction of Single-cell RNA Sequencing Data Using Graph Neural Networks
Track: General Session
  • Lian Duan, University of Windsor, Canada
  • Mahshad Hashemi, University of Windsor, Canada
  • Luis Rueda, University of Windsor, Canada


Presentation Overview: Show

Introduction and motivations: Graph neural networks (GNNs) have gained increasing popularity as a powerful tool for node classification in complex networks, among other tasks. However, the traditional design of GNNs assumes homophily, where connected nodes have similar class labels and features. In the real world, connected nodes commonly have different class labels and dissimilar features, a scenario known as heterophily, which can degrade the performance of GNNs.
Methods and datasets: To address this issue, recent studies have proposed design paradigms to enhance the representation power of GNNs under heterophily, including higher-order neighborhoods, ego- and neighbor-embedding separation, and the combination of intermediate representations. However, it is unclear whether these approaches are effective on real-world datasets with high heterophily. In this study, the main task is to predict cell types in the Baron Human Pancreas dataset, which contains single-cell RNA sequencing data of pancreatic cells from a healthy human donor. We evaluate the effectiveness of the approach proposed by Zhang et al. on this highly heterophilous dataset and compare its performance against state-of-the-art GNN methods as well as methods that disregard the graph structure.
Results: The proposed approach significantly improves the performance of GNNs on the Baron Human Pancreas dataset. H2GCN, which incorporates the proposed designs, outperforms all other GNN methods, including GraphSAGE, MixHop, GAT, and GCN. These findings demonstrate the potential of the proposed approach to improve GNN performance on heterophilous datasets such as the Baron Human Pancreas dataset.
Conclusions and implications: This study emphasizes the need to evaluate GNNs rigorously on real-world datasets with high heterophily. It also provides insights into the limitations and potential of different GNN models, contributing to the development of more effective and generalizable GNN methods applicable to domains such as social network analysis, drug discovery, and recommendation systems. Overall, the study highlights the importance of designing GNNs that are robust to heterophily, shows that the proposed designs can enhance the representation power of GNNs under heterophilous conditions, and provides a foundation for further research in this area.
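The degree of heterophily the abstract refers to is commonly quantified with the edge homophily ratio: the fraction of edges whose endpoints share a label, with values near zero indicating strong heterophily. A minimal, dependency-free sketch (node labels here would be cell types in a cell-cell graph; this is an illustration, not the study's code):

```python
def edge_homophily(edges, labels):
    """Fraction of edges whose two endpoints share a label.

    edges:  iterable of (u, v) node pairs
    labels: mapping from node id to class label (e.g., cell type)
    Values near 0 indicate strong heterophily; near 1, strong homophily.
    """
    same = sum(1 for u, v in edges if labels[u] == labels[v])
    return same / len(edges)
```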

78: Exploration of Large Scale Differential Expression Datasets
Track: General Session
  • Neera Patadia, The University of British Columbia, Canada
  • Paul Pavlidis, The University of British Columbia, Canada


Presentation Overview: Show

A major goal in genomics is to identify and interpret patterns of gene expression. A common approach is to perform differential expression (DE) analysis, which compares the expression of genes between a baseline and a biological condition of interest. While individual studies provide some insight into these patterns, aggregating thousands of differential expression studies could reveal patterns reflecting modules of gene regulatory programs. A guiding question for this work is how many such programs exist: do all experimental conditions have a unique signature, or does differential expression “cluster” into a finite number of patterns? Identifying modules of gene regulation can provide powerful insight into the relatedness of biological conditions and help elucidate the molecular bases that link these conditions together. The Gemma database, developed in the Pavlidis lab, contains over 15,000 manually curated expression profiling datasets based on RNA-sequencing and microarray data from mouse and human (over 400,000 samples). These datasets have been annotated with approximately 40,000 condition comparisons derived from formal ontologies, spanning categories such as drug treatments, diseases, developmental stages, genetic manipulations, and tissue types, among others. An algorithm developed in the lab, GemmaDE, uses Gemma’s data to perform “condition enrichment” analyses: it takes a gene list of interest (in our case, DE genes) as input and outputs a vector of ranked scores for all condition comparisons within the database, based on their relevance to the hit list. In this work, we use the GemmaDE algorithm to analyze all differential expression datasets within Gemma and examine relationships between different biological conditions within the database.
The relationships are conceptualized as a condition-similarity network, with the expectation that experiments with similar patterns of differential expression will cluster together. To interpret the patterns that emerge from the GemmaDE analysis, we compare our findings to a semantic similarity network based on annotated experimental ontology terms. This work contributes to our understanding of the expression and regulation landscape of the human and mouse genomes.
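A toy version of such a condition-similarity network can be built by thresholding cosine similarities between per-condition DE score vectors. The names, scores, and threshold below are illustrative stand-ins, not GemmaDE's actual procedure:

```python
import numpy as np

def condition_similarity_edges(scores, names, threshold=0.8):
    """Connect conditions whose DE-score vectors are similar.

    scores:    conditions x genes matrix of DE evidence (rows nonzero)
    names:     condition labels, one per row
    threshold: minimum cosine similarity for an edge
    Returns a list of (name_i, name_j, similarity) edges.
    """
    Z = scores / np.linalg.norm(scores, axis=1, keepdims=True)
    S = Z @ Z.T  # pairwise cosine similarities
    n = len(names)
    return [(names[i], names[j], float(S[i, j]))
            for i in range(n) for j in range(i + 1, n)
            if S[i, j] >= threshold]
```

Clusters in such a network would correspond to the hypothesized finite set of differential expression patterns; the study compares the resulting structure against a semantic similarity network of ontology terms.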

80: MPAQT: A novel data integration framework for isoform quantification with short-read and long-read RNA-seq
Track: General Session
  • Michael Apostolides, McGill University, Canada
  • Albertas Navickas, UCSF, United States
  • Benedict Choi, UCSF, United States
  • Hani Goodarzi, UCSF, United States
  • Hamed Najafabadi, McGill University, Canada


Presentation Overview: Show

Transcript quantification is a long-standing problem in biomedical research that is not yet fully solved. RNA sequencing with short reads (SRs) is currently the leading approach due to its low cost, high depth, and the many software tools available for downstream analysis. Short reads, however, are often unable to resolve complex splicing events among highly similar transcripts, while long reads (LRs) provide full-length transcript sequences, allowing accurate assignment of reads to transcripts, but usually at lower depth due to higher cost. New computational methods are needed for joint analysis of SR and LR data. We introduce MPAQT (Multi-Platform Aggregation and Quantification of Transcripts), a novel statistical framework that takes advantage of the high depth of SRs and the high accuracy and unambiguity of LRs. MPAQT’s generative model explicitly connects the transcript abundance profile of a sample to the expected SR and LR distributions, allowing maximum-likelihood estimation of transcript abundances from SR data alone or from SR and LR data combined. Using various experimental and simulated benchmarking datasets, we show that MPAQT quantifies transcripts more accurately than other leading tools such as kallisto, salmon, and RSEM; this improvement holds at both the gene and transcript levels. Using SR data alone, MPAQT captures quantification information from lowly expressed transcripts that is often missed by other tools. When combining SRs and LRs, MPAQT improves the quantification of select transcripts relative to using SRs alone; the transcripts with improved quantification often come from longer genes with more exons, have more splicing variants, and are enriched for neuronal differentiation and brain-related processes. Finally, we analyzed human embryonic stem cells (hESCs) undergoing in vitro differentiation toward cortical neurons using paired SR and LR data.
We highlight MPAQT’s improved quantification of transcripts related to neuronal differentiation, including isoform-switch events between immature and mature neurons that are not captured with SR data alone, demonstrating MPAQT’s ability to detect transcript abundance changes accompanying neuronal differentiation. Many differentially quantified transcripts contain alternative 5’ and/or 3’ untranslated regions, suggesting possible changes in transcript stability or cellular localization. Differentially quantified transcripts tend to be highly similar, differing by only one or two exons; LRs can detect such small differences thanks to their complete transcript coverage. MPAQT’s ability to integrate SR and LR data, and its improved quantification of transcripts from longer genes with more exons and splicing variants, make it a novel tool for studying transcript quantification in tissues with complex splicing patterns, such as the brain and cancer.
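The maximum-likelihood estimation described above can be illustrated, for the short-read term only, with a toy EM over read equivalence classes (sets of transcripts a read is compatible with). This is a generic sketch of the standard RNA-seq likelihood, not MPAQT's actual model, which additionally conditions on LR counts:

```python
import numpy as np

def em_abundance(eq_classes, n_transcripts, n_iter=200):
    """Toy EM for transcript abundances from SR equivalence classes.

    eq_classes: list of (set_of_transcript_ids, read_count) pairs
    Returns a probability vector theta over transcripts.
    """
    theta = np.full(n_transcripts, 1.0 / n_transcripts)
    for _ in range(n_iter):
        expected = np.zeros(n_transcripts)
        for transcripts, count in eq_classes:
            idx = list(transcripts)
            w = theta[idx]
            # E-step: split ambiguous reads proportionally to abundances
            expected[idx] += count * w / w.sum()
        # M-step: renormalize expected counts into abundances
        theta = expected / expected.sum()
    return theta
```

Ambiguous classes (reads compatible with several highly similar transcripts) are exactly where LR evidence helps, since full-length reads collapse those classes to single transcripts.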

82: Factorized Embeddings demonstrate that transcriptomic profiles can be summarized into very few genetic components useful for sample-related and biological feature detection
Track: General Session
  • Léonard Sauvé, IRIC / Université de Montréal, Canada
  • Sebastien Lemieux, IRIC / Université de Montréal, Canada


Presentation Overview: Show

Gene expression profiles comprise up to 20,000 counts and can be summarized into a much smaller set of components. The machine learning models used to generate these components (PCA, t-SNE, Factorized Embeddings) are able to faithfully reconstruct the original data and to retrieve impactful sample-related and biological features. To verify these assumptions, empirical analyses were conducted on gene expression data from The Cancer Genome Atlas (TCGA), comprising 10,345 samples from 33 cancer types, and on GTEx data. In this study, Principal Component Analysis (PCA), Factorized Embeddings (FE), t-distributed Stochastic Neighbour Embedding (t-SNE), and random signatures were used to perform dimensionality reduction on RNA-seq profiles. We then compared the prediction accuracy of a deep neural network (DNN) and logistic regression for cancer-type prediction based on the RNA-seq profiles, using 5-fold cross-validation across various dimensionality reduction sizes. Results revealed that, at equal dimensionality, the DNN outperforms the linear model, and that methods that factorize data into a smaller set of components (PCA, t-SNE, FE) always outperform random signatures, which was expected but had never been properly demonstrated. Results show, however, that given a large number of randomly picked genes, random signatures can identify cancer types with very high accuracy. We repeated this task on the TCGA breast cancer subset of 1,051 patients to identify the PAM50 subtype (luminal A/B, basal-like, Her2, normal-like), as these features are correlated with treatment strategy and survival. Surprisingly, in our setup, the genes participating in the PAM50 signature themselves did not reach perfect accuracy but were still used as an upper bound of around 80%, indicating discrepancies between the available annotations and the actual expression data, or limitations of our testing setup.
Nevertheless, the best performance among the strategies for this task was obtained using 20 PCs from the PCA combined with a DNN. This work shows that, for sample classification on large, high-dimensional datasets such as RNA-seq profiles, proper dimensionality reduction can be applied and will consistently outclass random gene signatures; carefully selected signatures can also perform well in this context, but require considerable effort to generate. In future work, we plan to implement dimensionality reduction and feature prediction simultaneously within a non-linear model, as we think combining gene expression with survival data could offer new and powerful machine learning detection systems for better prognosis and survival for patients.
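The two dimensionality-reduction baselines compared above can be sketched with an SVD-based PCA and a random gene signature. This is an illustration of the comparison's ingredients under assumed shapes (samples x genes), not the study's cross-validation pipeline or DNN:

```python
import numpy as np

def pca_reduce(X, n_components=20):
    """Project a samples x genes matrix onto its top principal components."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

def random_signature(X, n_genes=20, seed=0):
    """Baseline from the abstract: a random subset of gene columns."""
    rng = np.random.default_rng(seed)
    cols = rng.choice(X.shape[1], size=n_genes, replace=False)
    return X[:, cols]
```

Both reductions yield a samples x k matrix that can be fed to any classifier, which is what makes the head-to-head accuracy comparison at equal dimensionality possible.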

84: Elucidating the genomic landscape of pediatric Metastatic Medulloblastoma (Leptomeningeal Disease, LMD).
Track: General Session
  • Ana Isabel Castillo Orozco, McGill University, Canada
  • Masoumeh Aghababazadeh, McGill University, Canada
  • Marjan Khatami, McGill University, Canada
  • Niusha Khazaei, McGill University, Canada
  • Geoffroy Louis Yvon Danieau, McGill University, Canada
  • Livia Garzia, McGill University, Canada


Presentation Overview: Show

Medulloblastoma (MB) is a highly aggressive tumor and the most common pediatric brain tumor, arising mainly in the cerebellum. MB presents high intertumoral heterogeneity: at least four molecular subgroups (SHH, WNT, Group 3, and Group 4) have been identified, which can be further split into 12 subtypes. These molecular identities are clinically relevant, as subgroups and subtypes may determine disease outcome. MB can metastasize to the leptomeningeal space, a condition known as leptomeningeal disease (LMD). The presence of metastatic dissemination is a universal predictor of poor outcome among MB patients, and metastatic MB is predominantly found in Group 3. Although LMD represents a major clinical challenge, it is a vastly understudied field, and its molecular mechanisms are poorly characterized. Recent research has shown that primary tumors and their metastases diverge dramatically.
Thus, therapies based on targets identified in primary tumors might be ineffective in metastatic patients, and there is an urgent need for strategies to study metastatic medulloblastoma. We hypothesize that in-depth knowledge of the molecular events driving subclones of the primary tumor to metastasize will offer therapeutic targets for effective therapies to treat or prevent LMD. To test this hypothesis, we have focused on expanding therapy-naïve Group 3 PDXes that faithfully replicate these compartments. Primary Group 3 PDXes were generated by orthotopic implantation, whereas LMD Group 3 MB PDXes were obtained by serial selection of tumors from flank to brain. This model builds on experimental evidence for a hematogenous route of MB metastasis. We have focused our efforts on comparative genomic analyses between primary and LMD Group 3 MB PDX models (Med114FH, Med411FH, and MMB) to profile intertumoral LMD heterogeneity and to identify genetic drivers and pathways that sustain this compartment. Using ssGSEA and deconvolution approaches, we have identified PDX models that retain neoplastic subpopulations previously described in MB single-cell sequencing studies. Similarly, we have identified slight differences in cell subpopulation proportions between the primary and leptomeningeal compartments.
Furthermore, we observe profound differences in gene expression between primary tumors and LMD. Our results show various signaling pathways enriched across LMD models, such as protein secretion, oxidative phosphorylation, MTORC1 signaling, and coagulation. We have also identified differentially expressed genes (DEGs), a few of which are found in more than one PDX (members of the solute carrier family, such as SLC44A3 and SLC17A9). Interestingly, we detected Fc Fragment of IgG Binding Protein (FCGBP) as the single DEG expressed in all Group 3 LMD models. This finding was concordant with differential expression results and single-cell atlases from Gene Expression Omnibus (GEO) datasets for breast and lung cancers metastatic to the leptomeninges.
Interestingly, retrieval of short variants from RNA-seq data has thus far revealed no LMD-enriched mutations to which these expression changes could be ascribed; the changes may instead be attributed to the disruption of epigenetic mechanisms. Work is currently underway to profile the epigenome of LMD by correlating transcriptomic data with active chromatin marks such as H3K27ac in primary and metastatic PDX samples.
In conclusion, our results support the notion that primary tumors and LMD retain the subpopulation clusters present in MB Group 3 tumors, with slight changes. We also show that primary tumors and LMD are transcriptionally distinct, with various pathways enriched in the metastatic compartment. Our work also reveals several DEGs shared among LMD Group 3 PDX models, which are currently undergoing functional validation. These results further suggest a scarcity of point mutations driving leptomeningeal disease and underscore the relevance of ChIP-seq studies to support the transcriptomic analysis. Through these approaches, we aim to elucidate the genetic dependencies of metastatic medulloblastoma, which will help guide targeted therapies.

86: Creating a Novel Deep Learning Pipeline to Generate and Screen Novel Molecules for Hormone-Positive Breast Cancer Treatment
Track: General Session
  • Nishank Raisinghani, Dougherty Valley High School, United States
  • David DiStefano, Tufts University, United States


Presentation Overview: Show

Considerable research has explored the implementation of neural networks in bioinformatics, specifically for drug discovery, yet much work remains to be done in this field. Here, we design a novel architecture that aims to generate novel molecules to treat hormone-receptor-positive breast cancer. These molecules are designed to inhibit the aromatase, CDK4, CDK6, PI3K, and mTOR proteins. To do this, we used a natural-language-processing (NLP)-based variational autoencoder. Our model is trained on the open-source ZINC dataset, chosen for its vast library of approximately 250k drug molecules and the wide variety of metrics describing them. To generate our molecules, we compiled a test set of about 68 molecules already proven to bind to the mentioned target proteins. To measure the initial viability of the generated molecules, we used RDKit's quantitative estimate of drug-likeness (QED) score, which provides insight into the drug-likeness of our generated data. Supplementary NLP-based models predicted other properties of the generated molecules, specifically solubility, synthetic accessibility, and toxicity, to further strengthen our screening process. We used the AutoDock Vina framework to predict the Gibbs free energy of binding between each molecule and the desired target enzymes. Our experimentation expanded and improved upon a previous solubility prediction architecture to obtain more accurate results for both solubility and synthetic accessibility. The goal of our research is to develop a novel high-throughput process to generate and screen hormone-positive breast cancer drug molecules that can be feasible in the real world.
Since the drug discovery space is so large (approximately 10^60 molecules), neural networks are a valuable tool for cutting down the time and cost of finding such molecules. Through our experimentation, we added a novel improvement to a working VAE framework by refining certain layers of the network's decoder, leading to the generation of three molecules that passed our screening process and show strong potential to suppress hormone-receptor-positive breast cancer tumor growth.
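The multi-property screening stage described above can be sketched as a generic filter over predicted properties. The predictor callables and threshold ranges below are hypothetical stand-ins for the QED, solubility, synthetic-accessibility, and toxicity models, not the authors' implementations:

```python
def screen(smiles_list, predictors, thresholds):
    """Keep generated molecules whose predicted properties all pass.

    smiles_list: candidate molecules as SMILES strings
    predictors:  dict mapping property name -> callable(smiles) -> float
    thresholds:  dict mapping the same names -> (low, high) bounds
    Returns a list of (smiles, properties) pairs that pass every filter.
    """
    passed = []
    for smi in smiles_list:
        props = {name: f(smi) for name, f in predictors.items()}
        if all(lo <= props[name] <= hi
               for name, (lo, hi) in thresholds.items()):
            passed.append((smi, props))
    return passed
```

In the actual pipeline, RDKit's QED and the supplementary NLP-based property models would supply the predictor callables, with docking scores from AutoDock Vina applied to the survivors.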