Posters - Schedules

Tuesday, May 16, between 12:00 PM EDT and 1:30 PM EDT (Odd Numbered Posters)
Wednesday, May 17, between 12:00 PM EDT and 1:30 PM EDT (Even Numbered Posters)
Session A Poster Set-up and Dismantle
Session A Posters set up:
Tuesday, May 16, between 8:00 AM EDT and 8:45 PM EDT
Session A Posters dismantle:
Tuesday, May 16, at 6:00 PM EDT
Session B Poster Set-up and Dismantle
Session B Posters set up:
Wednesday, May 17, between 8:00 AM EDT and 8:45 PM EDT
Session B Posters dismantle:
Wednesday, May 17, at 6:00 PM EDT
Virtual
01: Deriving spatial features from in situ proteomics imaging to enhance cancer survival analysis
Track: General Session
  • Monica Dayao, Carnegie Mellon University, United States
  • Alexandro Trevino, Enable Medicine, United States
  • Honesty Kim, Enable Medicine, United States
  • Matthew Ruffalo, Carnegie Mellon University, United States
  • H. Blaize D'Angio, Enable Medicine, United States
  • Ryan Preska, Enable Medicine, United States
  • Umamaheswar Duvvuri, University of Pittsburgh, United States
  • Aaron Mayer, Enable Medicine, United States
  • Ziv Bar-Joseph, Carnegie Mellon University, United States


Presentation Overview:

Spatial proteomics data has been used to map cell states and improve our understanding of tissue organization. More recently, these methods have been extended to study the impact of such organization on disease progression and patient survival. However, to date, the majority of supervised learning methods utilizing these data types did not take full advantage of the spatial information, impacting their performance and utilization. Taking inspiration from ecology and epidemiology, we developed novel spatial feature extraction methods for use with spatial proteomics (CODEX) data. We used these features to learn prediction models for cancer patient survival. As we show, using the spatial features led to consistent improvement over prior methods that used the spatial proteomics data for the same task. In addition, feature importance analysis revealed new insights about the cell interactions that contribute to patient survival.

03: Interpretable factor decomposition of single-cell transcriptomics map of the rat liver
Track: General Session
  • Delaram Pouyabahar, The Donnelly Centre, University of Toronto, Canada
  • Sai Chung, Ajmera Transplant Centre, Toronto General Hospital Research Institute, Canada
  • Olivia Pezzutti, Ajmera Transplant Centre, Toronto General Hospital Research Institute, Canada
  • Catia Perciani, Ajmera Transplant Centre, Toronto General Hospital Research Institute, Canada
  • Sherry Wang, Ajmera Transplant Centre, Toronto General Hospital Research Institute, Canada
  • Xue-Zhong Ma, Ajmera Transplant Centre, Toronto General Hospital Research Institute, Canada
  • Chao Jiang, Ajmera Transplant Centre, Toronto General Hospital Research Institute, Canada
  • Damra Camat, Ajmera Transplant Centre, Toronto General Hospital Research Institute, Canada
  • Trevor Chung, Ajmera Transplant Centre, Toronto General Hospital Research Institute, Canada
  • Manmeet Sekhon, Ajmera Transplant Centre, Toronto General Hospital Research Institute, Canada
  • Justin Manuel, Ajmera Transplant Centre, Toronto General Hospital Research Institute, Canada
  • Xu-Chun Chen, Ajmera Transplant Centre, Toronto General Hospital Research Institute, Canada
  • Ian McGilvray, Multi-Organ Transplant Program, Toronto General Hospital Research Institute, Canada
  • Sonya MacParland, Ajmera Transplant Centre, Toronto General Hospital Research Institute, Canada
  • Gary Bader, The Donnelly Centre, University of Toronto, Lunenfeld-Tanenbaum Research Institute, Canada


Presentation Overview:

Single-cell RNA-sequencing is able to identify the gene expression heterogeneity within complex biological systems, though interpretation is challenging due to a mix of biological and technical factors. Previous studies have demonstrated the utility of reduced dimensional representations to identify shared cellular attributes and unique biological processes across single-cell datasets. However, in many cases, the inferred dimensions from standard matrix factorization methods may not align with biologically meaningful gene expression programs, and nonlinear methods often lack interpretability. Here, we developed a computational pipeline based on varimax-PCA to identify and interpret the biological and technical sources of variation in single-cell transcriptomics maps. We demonstrate the utility of this pipeline on a novel single-cell map of the healthy rat liver, leading to key insights on strain-specific differences in this liver transplantation model. Our pipeline guided the cell-type annotation of a single-cell rat liver map that was confounded by ambient RNA and highlighted the enrichment of pro-inflammatory signals in myeloid cells of the Lewis rat strain, which is important for understanding the biology of this model system. These findings were experimentally validated by performing ex vivo LPS stimulation experiments followed by intracellular cytokine staining to measure myeloid cell inflammatory response. We have also evaluated the current methods in the literature to infer hidden factors from single-cell datasets, including scCoGAPs, NIFA, f-scLVM, cNMF, and LDVAE. Next, we plan to expand our approach to jointly and scalably model the known and unknown factors at sample and cell levels in single-cell RNA-seq maps.
This manuscript is under review: https://www.biorxiv.org/content/10.1101/2022.11.11.516225v1

05: Pre-diagnosis blood profiling up to seven years prior to diagnosis reveals early cell-free DNA signatures of cancer
Track: General Session
  • Nicholas Cheng, Ontario Institute for Cancer Research, Canada
  • Kimberly Skead, Ontario Institute for Cancer Research, Canada
  • Tom Ouellette, Ontario Institute for Cancer Research, Canada
  • Dave Cescon, Ontario Institute for Cancer Research, Canada
  • Scott Bratman, Princess Margaret Cancer Centre, University Health Network, Canada
  • Daniel DeCarvalho, Princess Margaret Cancer Centre, University Health Network, Canada
  • David Soave, Wilfrid Laurier University, Canada
  • Philip Awadalla, Ontario Institute for Cancer Research, Canada


Presentation Overview:

Liquid biopsies have been well demonstrated as a non-invasive screening tool for detecting chronic diseases at an early stage, when treatment is most effective. However, evaluating the clinical utility of emerging biomarkers for early disease detection requires application of new technologies to biologics collected from asymptomatic individuals prior to a diagnosis. To demonstrate the value of utilizing longitudinal population health cohorts for early biomarker characterization and disease prediction, we leverage the Ontario Health Study (OHS) to generate over 400 cell-free DNA (cfDNA) methylation profiles from cohort participants up to seven years prior to a cancer diagnosis. We identified incident cancer cases among OHS participants who had no history of cancer at the time of enrollment but were later diagnosed with breast, prostate or pancreatic cancer during study follow-up. Cell-free DNA methylation profiling was performed using cell-free methylated DNA immunoprecipitation (cfMeDIP-Seq) on blood plasma collected at enrollment from incident cancer cases and matched cancer-free controls. We identified cfDNA methylation and fragmentomic features that discriminated cancer-free controls from pre-diagnosis cancer cases over five years before diagnosis, and demonstrate that these markers were reflective of signatures from the originating cancer tissue and potentially lifestyle factors such as alcohol consumption. Further, using machine learning we developed predictive models from cfDNA methylation markers to predict future cancer risk, achieving an AUROC of 0.852 among held-out pre-diagnosis breast cancer cases and controls. Likewise, predictive models trained solely with pre-diagnosis cfDNA methylation samples were also generalizable and predictive of established prostate and pancreatic cancers, achieving average test AUROCs of 0.95 and 0.96.
By leveraging longitudinal population health cohort resources to interrogate pre-diagnosis biologic samples, our findings reveal that early cfDNA signatures of cancers can be detected up to seven years prior to diagnosis and can be similarly extended to other conditions and alternative emerging methodologies interrogating blood biomarkers.
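
For readers unfamiliar with the AUROC values quoted above, the following is a minimal sketch of how the metric can be computed from held-out risk scores. It is not the authors' pipeline: the function name `auroc`, the labels, and the scores are hypothetical, and the pairwise formulation is chosen only for clarity.

```python
def auroc(labels, scores):
    """Probability that a random case scores above a random control (ties count 0.5)."""
    cases = [s for y, s in zip(labels, scores) if y == 1]
    controls = [s for y, s in zip(labels, scores) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for c in cases:
        for k in controls:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5
    return wins / (len(cases) * len(controls))


# Hypothetical held-out scores: 1 = later diagnosed, 0 = cancer-free control.
labels = [0, 0, 1, 1]
scores = [0.10, 0.40, 0.35, 0.80]
print(auroc(labels, scores))
```

An AUROC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation of cases from controls.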

07: Including location and scale information reduces false positive inference with ALDEx2 when analyzing HTS datasets
Track: General Session
  • Michelle Pistner Nixon, College of Information Sciences and Technology, Pennsylvania State University, United States
  • Justin Silverman, College of Information Sciences and Technology, Pennsylvania State University, United States
  • Greg Gloor, Department of Biochemistry, The University of Western Ontario, Canada


Presentation Overview:

The analysis of complicated high throughput sequence (HTS) datasets, such as metatranscriptomes, is fraught with overt and hidden problems. Statistical inference relies on proper estimation of both location and scale. Extant HTS analysis tools do a good job of identifying location in simple datasets, but do not account for scale. This leads to both Type 1 and Type 2 errors. Two issues in particular are problematic. The first is that analysis tools for differential abundance are very prone to Type 1 errors, especially as additional data are added. In common usage both p-values and differences in location are used to jointly decide which of the features (genes, microbes, etc.) are different between groups. We argue that Type 1 errors arise in many instances because, by design, the normalizations used are unable to incorporate information about the difference in scale between the groups. The second is that while differences in location are easy to observe, defining the appropriate midpoint between two groups is open to interpretation. We will demonstrate issues with inappropriate location and scale using a vaginal metatranscriptomic dataset and show how ALDEx2 can be modified to incorporate appropriate estimates. We will further show how including this information reduces false positive identification of differentially abundant features, providing the ability to extract useful system-wide information from such complicated datasets.
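
The point about normalizations being unable to carry scale information can be illustrated with the centered log-ratio (CLR) transformation that ALDEx2 is built around. The sketch below is not ALDEx2's implementation and does not show the proposed modification; the function name `clr` and the example values are hypothetical. It only shows that a ten-fold change in total abundance leaves the CLR values unchanged, so any difference in scale between groups is invisible after the transform.

```python
import math


def clr(values):
    """Centered log-ratio: log of each value minus the mean of the logs."""
    if any(v <= 0 for v in values):
        raise ValueError("CLR needs strictly positive values")
    logs = [math.log(v) for v in values]
    mean_log = sum(logs) / len(logs)
    return [x - mean_log for x in logs]


# Hypothetical feature abundances for one sample, and the same sample at 10x scale.
sample = [10.0, 20.0, 70.0]
ten_fold = [v * 10 for v in sample]

same = all(
    math.isclose(a, b, abs_tol=1e-12)
    for a, b in zip(clr(sample), clr(ten_fold))
)
print(same)  # the overall scale difference is not recoverable from CLR values
```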

09: Automated morphometrics across diverse bio-imaging datasets
Track: General Session
  • Tom Ouellette, Ontario Institute for Cancer Research, Canada
  • Philip Awadalla, Ontario Institute for Cancer Research, Canada


Presentation Overview:

The ability to quantify the morphological variation of cells, or sub-cellular structures, has greatly enhanced biological discovery across many imaging workflows, including, but not limited to, high-content screens, perturbation experiments, and even histopathology images. An important prerequisite for most morphological analyses is instance segmentation which involves the delineation and assignment of each pixel to individual objects within an image. Notably, instance segmentation facilitates morphological analysis since segments provide direct information on the geometric form of each detected biological object. Evidently, existing methods use segments to quantify morphological variation employing either geometric descriptors or, more recently, deep representation learning. However, while geometric descriptors provide interpretability when dissecting results, they may not optimally encode cell or nuclear form in lower dimensions, are often highly correlated, and tend to contain limited information about higher-level features such as shape. Furthermore, deep representation learning for morphological analysis is still nascent and, although promising, requires further development and benchmarking to enable automated, principled, and interpretable analysis of different axes of morphological variation. To overcome these limitations, we present a new toolkit for the automated and unsupervised morphological analysis from cellular images. Importantly, we show how a combination of both classical methods, using statistical shape analysis, and modern methods, leveraging deep generative models, can be used to isolate and study biological form (shape and size) across diverse bio-imaging datasets.

11: GLKB: a Genomic Literature Knowledge Base
Track: General Session
  • Yuanhao Huang, University of Michigan, United States
  • Chengfan Li, University of Michigan, United States
  • Feitong Tang, University of Michigan, United States
  • Jiyu Chen, University of Michigan, United States
  • Jiyue Zhu, University of Michigan, United States
  • Shaochun Zheng, University of Michigan, United States
  • Xuteng Luo, University of Michigan, United States
  • Dongyu Zhu, University of Michigan, United States
  • Jiahao Qiu, Princeton University, United States
  • Xinyu Lu, Carnegie Mellon University, United States
  • Jie Liu, University of Michigan, United States


Presentation Overview:

In recent years, efforts have been made to improve the accessibility and usefulness of genomic research resources.
However, there remains a need for a resource that integrates knowledge from both genomic literature and databases in a way that enables easy access and comparison.
In this article, we developed the Genomic Literature Knowledge Base (GLKB).
GLKB consolidates genomic knowledge from over 33 million PubMed abstracts and manually curated databases using a strict schema. Users can select and compare results with different confidence levels through the Python API and user-friendly web interface.
By consolidating literature and database knowledge in a single resource, the GLKB provides researchers with a powerful tool to accelerate the pace of genomic research and discovery.

13: Molecular Circadian Rhythms in Aged and Alzheimer’s Diseased Brains
Track: General Session
  • Henry Hollis, Drexel University, United States
  • Erik Musiek, Department of Neurology, Knight Alzheimer's Disease Research Center, Washington University School of Medicine, United States
  • Ron Anafi, Division of Sleep Medicine, Chronobiology & Sleep Institute, University of Pennsylvania, United States


Presentation Overview:

Introduction:
Circadian, or daily, rhythms are everywhere in nature, influencing various behaviors beyond sleep and wakefulness. Depending on the tissue, molecular circadian clocks govern thousands of genes and proteins. Hundreds of disease genes and drug targets are among those that display this circadian rhythmicity.
Neurodegenerative diseases are known to alter circadian rhythms in behavior and physiology. Studies have found that disrupted autonomic rhythms, behavioral fragmentation and circadian delay are all common among patients with Alzheimer’s disease (AD) (Abbott & Zee, 2015; Musiek et al., 2018; Weissová et al., 2016).
Patients with AD are more active at night and sleep more during the day. Aberrant behavioral rhythms are the most common reason for people institutionalizing their AD family members (Pollak & Perlick, 2016).
Nevertheless, translation of circadian biology to AD clinical research has remained minimal. Clinically important molecular rhythms in AD are unknown and as such opportunities for chronotherapy have gone unrealized. The following work is meant to address these gaps in knowledge.

Experimental Design:
We processed pseudo bulk (total counts from thousands of single cells) RNA-Seq expression data from The Religious Order Study/Memory and Aging Project (ROS/MAP). These data comprise 405 dorsolateral prefrontal cortex samples from both AD and control subjects (248 control, 157 AD). I applied an informatic ordering tool, CYCLOPS (Cyclic Ordering by Periodic Structure), to infer an internal circadian phase for each pseudo bulk sample. I also estimated circadian phases in the individual cell types that compose the pseudo bulk data (microglia, inhibitory neurons, excitatory neurons). Biological and technical validation was performed on the circadian orderings. Once subject phases were assigned, I reconstructed the circadian transcriptome using cosinor regression. Using nested regression models, I identified transcripts that showed differential rhythmicity (change in amplitude or peak phase) between control and AD subjects. I then conducted pathway analysis with NCBI’s DAVID on these differentially rhythmic genes to frame them in the bigger picture of AD biology.
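
As an illustration of the cosinor regression step, the sketch below fits y = mesor + b1*cos(phase) + b2*sin(phase) by ordinary least squares and converts the coefficients to an amplitude and a peak phase. It is a minimal sketch under stated assumptions, not the analysis code: the phases are assumed to be already available in radians (for example, from an ordering tool), the helper names `solve3` and `cosinor_fit` are hypothetical, the data are synthetic, and the nested-model significance test is not shown.

```python
import math


def solve3(a, b):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, 3):
            f = m[r][col] / m[col][col]
            for c in range(col, 4):
                m[r][c] -= f * m[col][c]
    x = [0.0, 0.0, 0.0]
    for i in (2, 1, 0):
        x[i] = (m[i][3] - sum(m[i][j] * x[j] for j in range(i + 1, 3))) / m[i][i]
    return x


def cosinor_fit(phases, expression):
    """Least-squares cosinor fit; returns (mesor, amplitude, peak phase in radians)."""
    design = [[1.0, math.cos(p), math.sin(p)] for p in phases]
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * y for r, y in zip(design, expression)) for i in range(3)]
    mesor, b1, b2 = solve3(xtx, xty)
    amplitude = math.hypot(b1, b2)                   # A = sqrt(b1^2 + b2^2)
    acrophase = math.atan2(b2, b1) % (2 * math.pi)   # phase at which the fit peaks
    return mesor, amplitude, acrophase


# Synthetic, noise-free transcript: mesor 10, amplitude 3, peak at phase pi/2.
phases = [2 * math.pi * k / 12 for k in range(12)]
expression = [10 + 3 * math.cos(p - math.pi / 2) for p in phases]
mesor, amplitude, acrophase = cosinor_fit(phases, expression)
print(round(mesor, 6), round(amplitude, 6), round(acrophase, 6))
```

Comparing amplitude and peak phase between two groups fitted this way is one way to think about differential rhythmicity, although a formal test needs the residual variance as well.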

Results: I find ~10% of genes in control subjects exhibit significant cycling via nested linear models in the pseudo bulk data (Bonferroni p < 0.05 with amplitude/mean ratio > 0.25). Furthermore, I find differences in which transcripts significantly cycle between individual cell types. Hundreds of transcripts exhibit significant differential cycling (BH.Q < 0.15) between AD and control subjects in pseudo bulk. Of these, CIART is the only core clock gene. Most of the differentially cycling genes lose amplitude with AD (195 genes), although some gain amplitude with AD (61 genes).


Conclusions: The genes that cycle most differently between AD and control subjects are overrepresented in Ribosome, Oxidative Phosphorylation, and neurodegenerative diseases pathways, among others.

References:
Abbott, S. M., & Zee, P. C. Sleep Medicine Clinics. 2015; 4, (517– 522).
Musiek, E. S. JAMA Neurol. 2018; 5 (582–590).
Pollak, C. P., & Perlick, D. Journal of Geriatric Psych. And Neur. 2016; 4 (204-210).
Weissová, K. PLOS One; 2016; 1

Acknowledgements:
I would like to acknowledge Jan Hammarlund whose help with the CYCLOPS algorithm was invaluable. I also acknowledge my funding source, the NIH.

15: Integrating ENCODE and Context-Specific H3K27ac ChIP-Seq and RNA-Seq Data to Infer Transcription Factor Activity
Track: General Session
  • Brandon Lukas, Department of Biomedical Engineering, University of Illinois at Chicago, United States
  • Yongchao Huang, Department of Biomedical Engineering, University of Illinois at Chicago, United States
  • Yang Dai, Department of Biomedical Engineering, University of Illinois at Chicago, United States


Presentation Overview:

Background: Understanding the context-specific activities of transcription factors (TFs) is critical for the identification of molecular mechanisms that govern normal biological processes or that are implicated in disease. TF activities often need to be inferred computationally using gene expression data and an underlying gene regulatory network (GRN).

Motivation: The accuracy of inferred TF activity depends heavily on the quality of the GRN. Unfortunately, most GRNs are incomplete and not context-specific. They contain a large number of false positives and false negatives in certain biological settings. Herein, we take the myometrium as a biological source of interest. Leiomyomas (benign tumors) arise from the smooth muscle cells of the myometrium and can affect as much as 80% of women in their reproductive years. While certain TFs, such as progesterone and estrogen receptors, have been implicated in the disease, a deep understanding of TF dysregulation in leiomyoma remains elusive.

Methods: We integrated H3K27ac ChIP-seq data collected from leiomyoma and adjacent myometrium tissues from 20 patients [1] with annotated ENCODE 3 data summarizing the TF binding sites of 338 factors across 130 cell types to create a custom GRN, which consists of 337 unique TFs and 239,289 inferred TF-target gene pairs. We also considered the DoRothEA network [2], which consists of 430 unique TFs and 32,455 TF-gene connections across non-specific cell types. We used both GRNs together with matching RNA-seq transcriptomes in the decoupleR framework [3] to generate inferred TF activities.

Results: Applying the univariate linear, multivariate linear, and weighted sum models in the decoupleR framework with our custom GRN, we obtained the activities for 302 TFs based on consensus scoring. In contrast, using the DoRothEA network we obtained activities for 171 TFs. Some 91 TFs were inferred with both networks. Differential activity analysis between leiomyoma and myometrium samples, carried out by paired t-test, revealed 34 TFs whose differential activities were consistent across both GRNs, including ESR1 and three members of the E2F family: E2F1, E2F6, and E2F7.
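
The paired comparison can be sketched as follows for a single TF. This is an illustrative example only, not the authors' code: the function name `paired_t` and the activity scores are hypothetical, each position in the two lists is assumed to come from the same patient, and the p-value lookup against a t distribution is omitted.

```python
import math
from statistics import mean, stdev


def paired_t(tumor, normal):
    """Paired t statistic and degrees of freedom for patient-matched samples."""
    if len(tumor) != len(normal) or len(tumor) < 2:
        raise ValueError("need at least two matched pairs")
    diffs = [t - n for t, n in zip(tumor, normal)]
    sd = stdev(diffs)  # sample standard deviation of the within-patient differences
    if sd == 0:
        raise ValueError("differences have zero variance")
    t_stat = mean(diffs) / (sd / math.sqrt(len(diffs)))
    return t_stat, len(diffs) - 1


# Hypothetical inferred activities of one TF in three patients.
leiomyoma = [2.0, 3.5, 5.0]
myometrium = [1.0, 1.5, 2.0]
t_stat, dof = paired_t(leiomyoma, myometrium)
print(round(t_stat, 4), dof)
```

Pairing removes between-patient variation, which matters when the tumor and adjacent tissue come from the same individual.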

Conclusions: We inferred TF activities from gene expression profiles using two independent networks, including the context-specific GRN generated by integrating ENCODE TF ChIP-seq data with experimental H3K27ac ChIP-seq data. Focusing on the common TFs obtained from using both networks, we provide a refined and prioritized list of potential TFs that may deepen our understanding of the mechanisms governing leiomyoma development and maintenance.

[1]: Leistico, Jacob R., et al. "Epigenomic tensor predicts disease subtypes and reveals constrained tumor evolution." Cell reports 34.13 (2021): 108927.
[2]: Garcia-Alonso, Luz, et al. "Benchmark and integration of resources for the estimation of human transcription factor activities." Genome research 29.8 (2019): 1363-1375.
[3]: Badia-i-Mompel, Pau, et al. "decoupleR: ensemble of computational methods to infer biological activities from omics data." Bioinformatics Advances 2.1 (2022): vbac016.

17: Bayesian group sequential enrichment designs based on adaptive regression of response on baseline biomarkers
Track: General Session
  • Ying Yuan, University of Texas MD Anderson Cancer Center, United States


Presentation Overview:

Precision medicine relies on the idea that, for a particular targeted agent, only a subpopulation of patients are sensitive to it and thus may benefit from it therapeutically. In practice, it often is assumed based on pre-clinical data that a treatment-sensitive subpopulation is known, and moreover that the agent is substantively efficacious in that subpopulation. Due to important differences between pre-clinical settings and human biology, however, data from patients treated with a new targeted agent often show that one or both of these assumptions are false. This paper provides a Bayesian randomized group sequential enrichment design that compares an experimental treatment to a control based on survival time, and uses genomic biomarkers to assist with adaptive variable selection and enrichment. Initially, the design enrolls patients under broad eligibility criteria. At each interim decision, submodels for regression of response and survival time on genomic biomarkers and treatment are fit, and variable selection is used to identify a biomarker panel that characterizes treatment-sensitive patients. Enrollment of each subsequent cohort is restricted to the most recently identified treatment-sensitive subpopulation. Group sequential decision cutoffs are calibrated to control overall type I error and account for the adaptive enrollment restriction. The design provides a basis for precision medicine by identifying a treatment-sensitive subpopulation, if it exists, and determining whether the experimental treatment is superior to the control in that subpopulation. A simulation study shows that the proposed design reliably identifies a sensitive subpopulation, yields much higher generalized power compared to several existing enrichment designs and a conventional all-comers group sequential design, and is robust.

19: Stromal-Cancer cell PKC-β signaling in the tumor microenvironment
Track: General Session
  • Julius Herzog, University of Vermont, United States
  • Seth Frietze, Department of Biomedical and Health Sciences, University of Vermont, United States


Presentation Overview:

Tumors are complex tissues composed of networks of different cell types that support the growth and survival of the tumor. Studying the cellular and molecular composition of the tumor microenvironment (TME) may provide an opportunity to identify disease mechanisms and therapeutic targets. The protein kinase C-β (PKC-β) has emerged as an important signaling mechanism in stromal cells that is required to support cancer cell survival and drug resistance. In this project, we analyzed RNA-seq data to derive a stromal-cancer cell PKC-β gene signature. We then systematically explored the PKC-β signature in available single cell RNA-seq from human tumors. We evaluated the distribution of stromal PKC-β gene signatures in single cell populations of cancer-associated fibroblast (CAF) cells and endothelial cells from datasets including pancreatic ductal adenocarcinoma (PDAC) and breast cancer (BRCA). Gene regulatory network analysis identified distinctive transcriptional mechanisms associated with PKC-β enriched cell populations surrounding the tumor. Ongoing work analyzes available spatial transcriptomic data derived from human tumor tissues to study PKC-β signaling in the context of the TME, including reconstruction of key cell-cell interactions mediated by distinctive ligand-receptor pairs.

21: Differential RNA editing patterns in host transcriptome in response to reovirus (ReoV) infection
Track: General Session
  • Ayesha Tariq, Kent State University, United States
  • Helen Piontkivska, Kent State University, United States


Presentation Overview:

Reoviruses (ReoV) are double-stranded RNA viruses from the family Reoviridae that can infect a wide range of hosts, including plants, insects and mammals. In animal hosts, after entry into the host cell, viral RNA is often subjected to editing by the host’s Adenosine Deaminases Acting on RNA (ADARs), which catalyze the conversion of Adenosine (A) to Inosine (I), interpreted as a G, as part of the innate immune response. However, in addition to viral RNA editing, ADARs can also interact with and edit host transcripts. A previous report showed that reovirus infection, despite strong activation of ADARp150, does not influence editing of some of the major known editing targets, while likely editing others, suggesting a potentially nuanced editing pattern. However, the results were based on a handful of selected editing sites, and did not cover the entire transcriptome. Thus, to determine whether and how ReoV infection affects host ADAR editing patterns, we analyzed a publicly available RNA-seq dataset from murine fibroblasts that allowed us to examine changes in editing patterns on a transcriptome-wide scale. Our results showed that, similar to other RNA viruses (e.g., Piontkivska et al., 2021), ReoV infection also elicits significant changes in host editing patterns, including in genes enriched in cell signaling, membrane trafficking and RNA metabolism.

23: Biological pattern discovery in GBM using computer vision.
Track: General Session
  • Shamini Ayyadhury, The Donnelly Centre, University of Toronto, Canada
  • Patty Sachamitr, Blue Rock Therapeutics, Canada
  • Michelle Kushida, The Arthur and Sonia Labatt Brain Tumor Research Centre, The Hospital for Sick Children, Canada
  • Nicole Park, Princess Margaret Cancer Centre, University Health Network, Canada
  • Fiona Coutinho, Brain Canada, Canada
  • Owen Whitley, Genentech, Canada
  • Laura Richards, Celcius Therapeutics, Canada
  • Panagiotis Prinos, Structural Genomics Consortium, University of Toronto, Canada
  • Cheryl Arrowsmith, Structural Genomics Consortium, University of Toronto, Canada
  • Peter Dirks, The Arthur and Sonia Labatt Brain Tumor Research Centre, The Hospital for Sick Children, Canada
  • Trevor Pugh, Princess Margaret Cancer Centre, University Health Network, Canada
  • Gary Bader, The Donnelly Centre, University of Toronto, Canada


Presentation Overview:

Glioblastoma (GBM) is an adult glioma with the wildtype IDH gene and an abysmal prognosis. Sequencing technologies show us the existence of a neurodevelopmental framework, upon which heterogeneous molecular patterns are overlaid in GBM.

The tumor lives within the brain. Its functional capacity is the net result of extrinsic and intrinsic molecular factors interacting with each other. These factors affect how cells communicate and organize themselves into complex functional tissue architectures that support their evolution, survival and resistance. Understanding the phenotypic relevance of these molecular signatures, in the context of the tumour’s 3D architecture, will allow us to conduct an improved screening of biomarkers with high utility and thereafter drive innovative translational research.

Here, we propose that biologically relevant signatures are imprinted within the spatial organization of cells and that spatial pixel analysis and other computer vision methods can complement and improve our functional understanding of GBM organization. We use a phase contrast image dataset of glioblastoma stem cells grown in culture, imaged every 4-12 hrs over 12-16 days. We apply 29 hand-engineered pixel features per image, deriving spatial pixel signatures for each of our 17,601 phase-contrast images. Using principal component analysis, canonical correlation analysis and regression methods, we show that these pixel features follow biologically relevant phenotypes. Using cell-type signatures obtained from a matched bulk RNA dataset, we show that samples of images with high PC2 scores had overall higher mesenchymal and microglia scores than samples of images with low PC2 scores, which showed higher neurodevelopmental signatures. We also provide evidence that distinct pixel feature families explain spatial pixel relationships of images from distinct patient-derived samples along PC2 (i.e., granularity signals are more distinct in images with high PC2 scores, whereas the GLCM pixel models are more distinct in images with low PC2 scores).

We thus show that cellular organization in culture is biologically motivated and that cell organization and growth exhibit distinct patterning regardless of the random non-biological geometric shifts (i.e., rotations, translations) in 2D cultures.
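As a rough illustration of one of the feature families named above, the following is a minimal sketch of a gray-level co-occurrence matrix (GLCM) and two texture features commonly derived from it. It is not the authors' feature pipeline: the function names, the single horizontal offset, and the toy images are assumptions made purely for illustration.

```python
from collections import Counter


def glcm(image, dx=1, dy=0, symmetric=True):
    """Normalised co-occurrence probabilities of gray levels at offset (dx, dy).

    `image` is a list of rows of integer gray levels (a toy stand-in for a
    quantised phase-contrast image).
    """
    counts = Counter()
    rows, cols = len(image), len(image[0])
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dy, c + dx
            if 0 <= r2 < rows and 0 <= c2 < cols:
                a, b = image[r][c], image[r2][c2]
                counts[(a, b)] += 1
                if symmetric:
                    counts[(b, a)] += 1
    total = sum(counts.values())
    if total == 0:  # image too small for the chosen offset
        return {}
    return {pair: n / total for pair, n in counts.items()}


def glcm_contrast(p):
    """Large when neighbouring pixels differ strongly in gray level."""
    return sum(prob * (i - j) ** 2 for (i, j), prob in p.items())


def glcm_homogeneity(p):
    """Close to 1 when neighbouring pixels have similar gray levels."""
    return sum(prob / (1 + abs(i - j)) for (i, j), prob in p.items())


# Toy images: a uniform patch and a checkerboard-like patch.
flat = [[1, 1, 1], [1, 1, 1]]
checker = [[0, 3, 0], [3, 0, 3]]

flat_contrast = glcm_contrast(glcm(flat))
checker_contrast = glcm_contrast(glcm(checker))
checker_homogeneity = glcm_homogeneity(glcm(checker))
```

On real images, a vector of such features per image (computed over several offsets and quantisation levels) could then feed into PCA, which is the kind of downstream analysis the abstract describes.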

25: RaMP-DB 2.0 & MetaboSPAN: Improving functional interpretation of metabolomic data through comprehensive functional annotation and network approaches
Track: General Session
  • Andrew Patt, National Center for Translational Sciences, United States
  • John Braisted, National Center for Advancing Translational Sciences, United States
  • Kevin Coombes, The Ohio State University, United States
  • Tara Eicher, Division of Preclinical Informatics, National Center for Advancing Translational Sciences, Rockville, MD 20850, United States
  • Ewy Mathé, Division of Preclinical Informatics, National Center for Advancing Translational Sciences, Rockville, MD 20850, United States


Presentation Overview: Show

We have built two complementary tools, RaMP-DB and MetaboSPAN, to aid in pathway enrichment analysis of metabolites. RaMP-DB is a newly renovated integrated knowledge base, API, R package and online interface for generating biological and chemical insight into metabolomic, proteomic, and transcriptomic data. The new RaMP-DB version (2.0) features several major improvements over its predecessor, including chemical structure and class annotations for metabolites, improved pathway annotation coverage for lipids, new pathway enrichment analysis visuals, and enrichment analyses supporting the inclusion of custom backgrounds. MetaboSPAN, in turn, is a specialized metabolomic pathway analysis method that leverages RaMP-DB and aims to compensate for inconsistent coverage of the metabolome in metabolomics experiments.
The contents of RaMP-DB 2.0 are regularly updated, with the current version containing 256,086 distinct metabolites, 15,827 genes/enzymes, 53,831 distinct pathways, 412,775 mappings between metabolites and pathways, 401,303 mappings between genes/enzymes and pathways, and 60,476 biochemical reactions from the Rhea database. Chemical properties such as InChIKeys and chemical class (ClassyFire) are available for 256,592 metabolites. MetaboSPAN builds similarity networks based on annotations within RaMP-DB 2.0, where nodes are metabolites and edges encode shared annotations. The algorithm uses network topological analysis to identify clusters of metabolites related to a list of metabolites of interest (e.g., altered in a disease), which then undergo pathway enrichment testing. We designed several simulation experiments comparing the performance of MetaboSPAN against existing pathway analysis strategies (Globaltest, Fisher’s exact test, NetGSA, and FELLA). Our results show that MetaboSPAN yields higher sensitivity for altered pathway detection without inflating false positive findings.
Both RaMP-DB and MetaboSPAN are open-source, publicly available resources. RaMP-DB is a robust, comprehensive and well-maintained resource for functional annotations for metabolites and metabolic transcripts, and MetaboSPAN is a novel functional enrichment strategy that leverages these annotations to compensate for difficulties in metabolite detection and identification.
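The idea of a network whose nodes are metabolites and whose edges encode shared annotations can be sketched as follows. This is an illustration only and not the MetaboSPAN implementation: the Jaccard similarity, the 0.5 threshold, the one-step neighbour expansion (a stand-in for the topological clustering described above), and the toy annotation sets are all assumptions.

```python
from itertools import combinations


def jaccard(a, b):
    """Fraction of annotations shared between two annotation sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def build_similarity_network(annotations, min_similarity=0.5):
    """Return {(metabolite_1, metabolite_2): similarity} for sufficiently similar pairs."""
    edges = {}
    for m1, m2 in combinations(sorted(annotations), 2):
        score = jaccard(annotations[m1], annotations[m2])
        if score >= min_similarity:
            edges[(m1, m2)] = score
    return edges


def neighbours_of(edges, seeds):
    """Metabolites directly connected to any seed metabolite (seeds excluded)."""
    seeds = set(seeds)
    found = set()
    for m1, m2 in edges:
        if m1 in seeds:
            found.add(m2)
        if m2 in seeds:
            found.add(m1)
    return found - seeds


# Hypothetical annotations; real ones would come from a knowledge base such as RaMP-DB.
toy_annotations = {
    "glucose": {"glycolysis", "gluconeogenesis"},
    "pyruvate": {"glycolysis", "gluconeogenesis", "TCA cycle"},
    "citrate": {"TCA cycle"},
    "leucine": {"BCAA degradation"},
}

toy_edges = build_similarity_network(toy_annotations)
expanded = neighbours_of(toy_edges, ["glucose"])
```

Expanding a list of altered metabolites to their network neighbours before enrichment testing is one simple way to compensate for metabolites that were not detected in an experiment, which is the motivation the abstract gives.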

29: Comparing penalized methodologies for significant SNP identification on whole-genome data
Track: General Session
  • Nikita Kohli, Thompson Rivers University, Canada
  • Jabed Tomal, Thompson Rivers University, Canada
  • Yan Yan, Thompson Rivers University, Canada


Presentation Overview: Show

Genome-Wide Association Studies (GWAS) aim to identify the relationship between genetic variations, usually Single Nucleotide Polymorphisms (SNPs), and physical traits. Since the whole-genome SNP data is typically high-dimensional, detecting significant SNPs is challenging [1]. Feature selection algorithms based on statistical and machine learning methods are often used to tackle the problem.

This study presents an algorithm that combines five penalized methods - Ridge, LASSO, Elastic Net, Group LASSO, and Sparse Group LASSO (SGL) - to identify potentially significant SNPs. It aims to increase confidence in the selected SNPs by leveraging the beneficial properties of all five methodologies. The algorithm runs in two phases. First, the data are used to train Ridge, LASSO, and Elastic Net. Then the union of the SNPs output by these three methods is passed to Group LASSO and SGL. Finally, the combined SNPs from Group LASSO and SGL form the final output of the proposed algorithm.
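The two-phase flow described above can be sketched as plain set logic. This is a hypothetical outline, not the authors' R code: each selector is a stand-in callable that would wrap a fitted penalized model in practice, and reading "combined" as a set union in the second phase is an assumption.

```python
def two_phase_selection(snps, phase_one_selectors, phase_two_selectors):
    """Union phase-one selections, then re-screen that union in phase two."""
    # Phase 1: every selector sees the full SNP list; keep the union.
    candidates = set()
    for select in phase_one_selectors:
        candidates |= set(select(snps))

    # Phase 2: selectors only see phase-one candidates; combine their output
    # (assumed here to mean a union).
    ordered_candidates = sorted(candidates)
    final = set()
    for select in phase_two_selectors:
        final |= set(select(ordered_candidates))
    return sorted(final)


def keep(allowed):
    """Toy selector factory: keeps SNPs that appear in a fixed set."""
    return lambda snps: [s for s in snps if s in allowed]


all_snps = ["snp1", "snp2", "snp3", "snp4", "snp5", "snp6"]

# Hypothetical stand-ins for Ridge, LASSO, and Elastic Net.
phase_one = [keep({"snp1", "snp2"}), keep({"snp2", "snp3"}), keep({"snp3", "snp4"})]
# Hypothetical stand-ins for Group LASSO and SGL.
phase_two = [keep({"snp1", "snp3", "snp6"}), keep({"snp3", "snp4"})]

selected = two_phase_selection(all_snps, phase_one, phase_two)
```

Note that `snp6` never reaches the output even though a phase-two selector would accept it, because phase two only sees SNPs that survived phase one.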

The performance of the proposed algorithm was analyzed using data from a model plant, Arabidopsis thaliana, with three different phenotypes: anthocyanin (binary) - the presence or absence of anthocyanin, width (continuous) - the plant diameter, and germination days (categorical) - the number of days to germinate. Results showed that the proposed algorithm produced a better list of significant SNPs with high confidence for follow-up analysis and used similar or less computational time compared to different single penalized methods.

Significant SNPs from the algorithm were mapped to genes and compared with the results from a GWAS program, GAPIT [2]. The comparison revealed a shared gene, AT3G08970, for anthocyanin. This finding further increases confidence in the gene’s association with the phenotype. Some other genes located by the proposed algorithm also showed functions contributing to the phenotypes; for instance, AT3G43357 [3], AT1G65300 [4], and AT1G65300 are relevant to anthocyanin, plant diameter, and germination days, respectively. These new findings suggest that the proposed algorithm is complementary to GWAS, as the two use different models (generalized or mixed linear models in GAPIT versus penalized models in the proposed algorithm).

The study concludes that combining multiple penalized methods led to improved robustness, accuracy, and confidence compared to using a single penalized method. It also finds that combining the proposed algorithm and GWAS software can identify more potential genes associated with the phenotype. The proposed algorithm is written in R, and the code is available at https://github.com/nkofficial-1005/Penalized-methodologies-for-significant-SNP-identification.

References:
[1] A. Korte and A. Farlow, “The advantages and limitations of trait analysis with GWAS: a review,” Plant Methods, vol. 9, no. 1, pp. 1–9, 2013.
[2] A. E. Lipka, F. Tian, Q. Wang, J. Peiffer, M. Li, P. J. Bradbury, M. A. Gore, E. S. Buckler, and Z. Zhang, “GAPIT: genome association and prediction integrated tool,” Bioinformatics, vol. 28, no. 18, pp. 2397–2399, 2012.
[3] R. Stoppel, N. Manavski, A. Schein, G. Schuster, M. Teubner, C. Schmitz-Linneweber, and J. Meurer, “RHON1 is a novel ribonucleic acid-binding protein that supports RNase E function in the Arabidopsis chloroplast,” Nucleic Acids Research, vol. 40, no. 17, pp. 8593–8606, 2012.
[4] M. Waqas, L. Shahid, K. Shoukat, U. Aslam, F. Azeem, and R. M. Atif, “Role of DNA-binding with one finger (Dof) transcription factors for abiotic stress tolerance in plants,” in Transcription Factors for Abiotic Stress Tolerance in Plants. Elsevier, 2020, pp. 1–14.

31: Wastewater surveillance uncovers regional diversity and dynamics of SARS-CoV-2 variants across nine states in the USA
Track: General Session
  • Rafaela Fontenele, National Institutes of Health, United States
  • Yiyan Yang, National Institutes of Health, United States
  • Erin Driver, Arizona State University, United States
  • Arjun Magge, Arizona State University, United States
  • Simona Kraberger, Arizona State University, United States
  • Joy Custer, Arizona State University, United States
  • Keith Dufault-Thompson, National Institutes of Health, United States
  • Erin Cox, Arizona State University, United States
  • Melanie Newell, Arizona State University, United States
  • Arvind Varsani, Arizona State University, United States
  • Rolf Halden, Arizona State University, United States
  • Matthew Scotch, Arizona State University, United States
  • Xiaofang Jiang, National Institutes of Health, United States


Presentation Overview: Show

Wastewater-based epidemiology (WBE) is a non-invasive and cost-effective approach for monitoring pathogen spread within a community. WBE has been adopted as one of the methods to monitor the spread and population dynamics of the SARS-CoV-2 virus, but significant challenges remain in the bioinformatic analysis of WBE-derived data. Here, we have developed a new distance metric, CoVdist, and an associated analysis tool that facilitates the application of ordination analysis to WBE data and the identification of viral population changes based on nucleotide variants. We applied these new approaches to a large-scale dataset from 18 cities in nine states of the USA using wastewater collected from July 2021 to June 2022. We found that the trends in the shift between the Delta and Omicron SARS-CoV-2 lineages were largely consistent with what was seen in clinical data, but that wastewater analysis offered the added benefit of revealing significant differences in viral population dynamics at the state, city, and even neighborhood scales. We also were able to observe the early spread of variants of concern and the presence of recombinant lineages during the transitions between variants, both of which are challenging to analyze based on clinically-derived viral genomes. The methods outlined here will be beneficial for future applications of WBE to monitor SARS-CoV-2, particularly as clinical monitoring becomes less prevalent. Additionally, these approaches are generalizable, allowing them to be applied for the monitoring and analysis of future viral outbreaks.

33: Sequencing Artifacts Influencing SARS-CoV-2 Intra-host Analysis
Track: General Session
  • Fatima Mostefai, Montreal Heart Institute, UdeM Département de Biochimie et Médecine Moléculaire, Canada
  • Jean-Christophe Grenier, Montreal Heart Institute Research Center, Canada
  • Raphael Poujol, Montreal Heart Institute Research Center, Canada
  • Julie Hussin, Montreal Heart Institute Research Center, UdeM Département de Médecine, Canada


Presentation Overview: Show

SARS-CoV-2 has been sequenced at an unprecedented scale leading to a vast amount of viral population genomic data. SARS-CoV-2 evolved into several variants of concern by accumulating beneficial mutations at the population level during transmission (inter-host) and at the host level during infection (intra-host). Mutations arise in viral genomes during this intra-host phase of infection, and analyzing these intra-host mutations may allow us to predict variants emerging at the population level. Intra-host single nucleotide variants (iSNVs) can be captured by analyzing next-generation sequencing (NGS) reads. However, sequencing artifacts can be introduced during the NGS process and can result in an accumulation of errors generating false iSNVs. Here, we aim to identify true intra-host mutations from SARS-CoV-2 NGS data and propose the most relevant metrics to distinguish them from false iSNVs. After applying whole genome quality control and mutation calling filters, we built an SQL database containing 12,121,836 iSNVs over 128,423 SARS-CoV-2 sequencing libraries downloaded from NCBI. We used unsupervised learning methods, such as PCA and interpretable tSNE, to determine the structure caused by sequencing centers and establish the best metrics to differentiate between true and false iSNVs. First, we identify two strand-bias artifacts: (a) false low and high recurrence iSNVs that are only observed on one strand; (b) unbalanced strand mapping on the viral genome caused by amplicon sequencing. For each iSNV, we computed the probability of observing an alternative allele on a given read strand using a binomial test with the probability of success given by the Alternative Allele frequency (AAF). Second, we found that iSNVs with AAF below 5% are enriched in center-specific errors, which were filtered out. Finally, we identified over 4,000 outlier samples with a high iSNV count and unique mutational patterns. 
Interestingly, a subset of these outliers had an excess of G > U substitutions that are recurrent across many samples, attributed to library preparation for NGS. Our findings show that careful pre-processing is essential to distinguish true intra-host viral mutations from sequencing artifacts in SARS-CoV-2 genomic data, an important first step to studying the intra-host evolution during infection. This robust bioinformatics methodology will also be instrumental in rapid response to other harmful viruses when future outbreaks inevitably occur.
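The per-strand binomial test and the 5% AAF cut-off mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function names, the two-sided form of the test, and the read counts are hypothetical, and the exact-sum approach only suits modest read depths (a log-space or library implementation would be needed for deep coverage).

```python
from math import comb

MIN_AAF = 0.05  # iSNVs below this alternative allele frequency are filtered out


def binom_pmf(k, n, p):
    """Probability of exactly k successes in n trials with success probability p."""
    return comb(n, k) * p ** k * (1.0 - p) ** (n - k)


def binom_test(k, n, p):
    """Two-sided exact binomial p-value: sum of outcomes no more likely than k."""
    observed = binom_pmf(k, n, p)
    total = 0.0
    for i in range(n + 1):
        pmf = binom_pmf(i, n, p)
        if pmf <= observed * (1 + 1e-9):  # tolerance for floating-point ties
            total += pmf
    return min(1.0, total)


def strand_check(alt_fwd, depth_fwd, alt_rev, depth_rev):
    """Test each strand's alternative-allele count against the overall AAF."""
    depth = depth_fwd + depth_rev
    if depth == 0:
        raise ValueError("no reads cover this position")
    aaf = (alt_fwd + alt_rev) / depth
    return {
        "aaf": aaf,
        "p_fwd": binom_test(alt_fwd, depth_fwd, aaf),
        "p_rev": binom_test(alt_rev, depth_rev, aaf),
        "passes_aaf": aaf >= MIN_AAF,
    }


# Hypothetical read counts: one balanced iSNV, one seen on a single strand only.
balanced = strand_check(alt_fwd=10, depth_fwd=100, alt_rev=10, depth_rev=100)
one_strand = strand_check(alt_fwd=20, depth_fwd=100, alt_rev=0, depth_rev=100)
rare = strand_check(alt_fwd=1, depth_fwd=100, alt_rev=1, depth_rev=100)
```

Both examples have the same overall AAF, but only the single-strand case yields small per-strand p-values, which is the signature of the strand-bias artifact described in the abstract.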

35: Unraveling the role of alternative splicing in driving tumor development at a pancancer scale
Track: General Session
  • Larisa M. Soto, McGill University, Canada
  • Rached Alkallas, McGill University, Canada
  • Hamed S. Najafabadi, McGill University, Canada


Presentation Overview: Show

Dysregulation of alternative splicing (AS) is a key mechanism that underlies the extensive cellular programming during cancer development and progression. Studies on large patient cohorts have found major splicing alterations in tumors, such as changes in the relative abundance of pro-oncogenic isoforms often triggered by expression changes of core splicing factors and RNA binding proteins. Despite the large-scale dysregulation of AS described so far, it is almost impossible to use this knowledge to infer direct causal relationships and derive actionable targets. All approaches undertaken so far are confounded by experimental variables–such as tumor purity or sample batch processing–in the same way as differential gene expression tests are. Even though these are known issues affecting the analysis of RNA sequencing data, there is a scarcity of methods that can account for such confounding variables when comparing splicing changes across conditions. Here, we describe TRex, a computational framework to quantify exon-centric AS events from transcript abundances and associate differential splicing rates with experimental variables of interest in datasets with complex designs. To assess the performance of TRex, we compared it against two state-of-the-art methods, rMATS and SUPPA2, in a series of simulated datasets with increasing batch effect strengths and a range of metrics for defining the ground-truth. TRex demonstrated superior performance in all settings, with an average increase in AUC of 0.2 across methods and cutoffs. The performance of TRex remained virtually unchanged in the presence of confounding variables, whereas both rMATS and SUPPA2 showed decreased performance in the presence of simulated confounding factors. We applied TRex to ~10,000 samples across 30 cancer types from The Cancer Genome Atlas while accounting for confounders when contrasting cassette exon inclusion rates in tumor vs normal samples. 
We found 9,468 exons differentially included in at least one cancer type, of which 3,364 were cancer-specific and 37 were differentially included in at least 10 cancer types. These results support the existence of multiple mechanisms driving splicing dysregulation in cancer. When we did not account for any experimental variables, we found 10,536 exons differentially included, suggesting that at least 11% of the splicing changes associated with tumors in prior analyses could likely be explained by purity, age, and/or sex. Overall, our work represents a major methodological advance in quantification of splicing in the presence of confounding factors, enabling us to discover the mechanisms underlying alternative splicing programs driving cancer progression.
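To make "quantifying exon-centric events from transcript abundances" concrete, here is a generic sketch of a percent-spliced-in (PSI) calculation for a cassette exon. It is not TRex's actual model: the function name, transcript identifiers, and TPM values are hypothetical, and the confounder-aware association testing described above is not shown.

```python
def exon_psi(tpm, including, skipping):
    """Inclusion rate of a cassette exon from transcript-level abundances.

    `including` lists transcripts that contain the exon, `skipping` lists
    transcripts that span the region but skip it. Returns None when no
    relevant transcript is expressed.
    """
    inclusion = sum(tpm.get(t, 0.0) for t in including)
    exclusion = sum(tpm.get(t, 0.0) for t in skipping)
    total = inclusion + exclusion
    return inclusion / total if total > 0 else None


# Hypothetical transcript abundances (TPM) for one gene in two samples.
tumor_tpm = {"tx_with_exon": 30.0, "tx_without_exon": 10.0, "tx_unrelated": 5.0}
normal_tpm = {"tx_with_exon": 10.0, "tx_without_exon": 30.0, "tx_unrelated": 5.0}

psi_tumor = exon_psi(tumor_tpm, ["tx_with_exon"], ["tx_without_exon"])
psi_normal = exon_psi(normal_tpm, ["tx_with_exon"], ["tx_without_exon"])
delta_psi = psi_tumor - psi_normal
```

A difference in PSI between conditions is the raw quantity; a framework like the one described would then model that difference jointly with covariates such as purity, age, and sex rather than comparing groups directly.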

39: Multi-modal inference of phenotype-relevant gene regulatory networks in human endoderm formation
Track: General Session
  • Chen Su, McGill University, Canada
  • Amin Emad, McGill University, Canada
  • William Pastor, McGill University, Canada


Presentation Overview: Show

Human endoderm formation is a critical process in human embryogenesis that gives rise to internal organs such as the lungs and pancreas. However, the regulatory mechanisms underlying the lineage specification of human embryonic stem cells (hESC) to definitive endoderm are not yet fully understood. Using a novel computational model, InPheRNo-ChIP, we integrated RNA-seq transcriptomic data from three independent studies with ChIP-seq DNA-protein interactions and phenotypic labels to reconstruct the gene regulatory network (GRN) associated with this process.
InPheRNo-ChIP first estimates two sets of summary statistics capturing TF-gene associations (one based on RNA-seq and one based on ChIP-seq). In parallel, a set of summary statistics corresponding to gene-phenotype associations are obtained by comparing RNA-seq data and phenotypic labels. A probabilistic graphical model (PGM) is then used to model the conditional distributions of different random variables, while incorporating the distribution of the observed summary statistics. The posterior probabilities obtained from this PGM are then used to form a phenotype-relevant GRN.
Our analysis showed that InPheRNo-ChIP recovers both novel and known regulatory mechanisms of endoderm formation in the context of hESC differentiation. We validated the inferred network using an in-vitro scRNA-seq-based CRISPRi screening dataset involving multiple molecular drivers of human endoderm differentiation. Notably, by examining the target genes of FOXA2, SOX17, and SMAD2 - transcription factors involved in endoderm differentiation - we discovered that InPheRNo-ChIP-identified genes are highly enriched for early endoderm markers. Additionally, validation using large databases such as the Gene Transcription Regulation Database (GTRD) and the Library of Integrated Network-Based Cellular Signatures (LINCS), further confirmed the ability of this model in identifying gene regulatory relationships.
Overall, this study identified a core set of TF-gene edges involved in endoderm formation and demonstrated InPheRNo-ChIP’s ability to uncover transcriptional mechanisms in human embryogenesis.

41: MicroPET: a Snakemake workflow for tissue-dependent quantification and microRNA/isomiR target prediction with an equine example
Track: General Session
  • Jonah Cullen, Department of Veterinary Population Medicine, University of Minnesota, United States
  • Laura Figueroa, Department of Veterinary Population Medicine, University of Minnesota, United States
  • Anna Lytle, Department of Veterinary Population Medicine, University of Minnesota, United States
  • Robert Schaefer, Department of Veterinary Population Medicine, University of Minnesota, United States
  • James Mickelson, Department of Veterinary and Biomedical Sciences, University of Minnesota, United States
  • Molly McCue, Department of Veterinary Population Medicine, University of Minnesota, United States


Presentation Overview: Show

MicroRNAs (miRNA) modulate gene expression in a temporal and tissue-specific manner across a broad range of biological processes in mammals. As such, dysregulation of miRNAs has been observed in numerous diseases. MiRNAs act as post-transcriptional regulators through reduced stability or translation of mRNA targets via sequence complementarity. The complexity and diversity of miRNA-based gene regulation was recognized with the discovery of miRNA isoforms (isomiRs) which may possess altered tissue-specific expression and mRNA targets impacting different biological pathways. Despite the development of many isomiR-capable processing tools, “gold standards” do not exist and isomiRs are thus often excluded from miRNA profiling studies resulting in profiling biases and potential mischaracterization of biological significance. Thus characterizing expression and mRNA targeting across normal tissues in a reproducible manner is a key component in understanding the biological significance of miRNAs and their functional isoforms on health and disease. Toward that end we constructed a comprehensive quantitative equine miRNA expression atlas using 230 small RNA-seq libraries from 31 tissues of 12 healthy Quarter Horses. To generate the atlas we developed a publicly available, containerized pipeline for the processing and analysis of high-throughput small RNA-sequencing datasets, enabling reproducible miRNA profiling and integration with publicly available data. From a total of nearly 3 billion reads, canonical miRNAs (associated isomiRs) and predicted novel miRNAs were profiled per-tissue. Unsupervised analyses suggested a fair amount of overlap in expression profiles across tissues with notable standouts in the pituitary, heart, and various muscle tissues. We identified ~10% of miRNAs expressed in a tissue-specific manner with the most observed in the pituitary, jejunum, and liver. 
Consistent with previous research, the majority of isomiRs contained modifications to the 3’ end relative to the canonical sequence. Alterations likely to cause target redirecting (5’ end or miRNA-mRNA binding region) were observed in 18% of isomiRs, with ongoing analysis confirming modified target gene sets compared to the canonical miRNA. This study represents the most comprehensive characterization of miRNA expression and mRNA targeting in normal equine tissue to date, expanding the current understanding of tissue-specific miRNA-based regulatory routines.
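The distinction between 3’ and 5’ isomiRs can be illustrated with a small sketch that compares an isomiR's coordinates on the precursor with those of the canonical miRNA. This is not part of the MicroPET workflow: the function names, the toy precursor sequence, and the coordinates are assumptions, and internal substitutions are not handled.

```python
def seed(precursor, start):
    """Seed region: nucleotides 2-8 counted from the 5' end of the mature sequence."""
    return precursor[start + 1:start + 8]


def classify_isomir(precursor, canonical, isomir):
    """Compare (start, end) coordinates (0-based, end-exclusive) on the precursor."""
    offset_5p = isomir[0] - canonical[0]
    offset_3p = isomir[1] - canonical[1]
    labels = []
    if offset_5p:
        labels.append("5p")
    if offset_3p:
        labels.append("3p")
    return {
        "kind": "+".join(labels) or "canonical",
        "offset_5p": offset_5p,
        "offset_3p": offset_3p,
        # A shifted 5' end changes the seed and therefore the likely targets.
        "seed_changed": seed(precursor, isomir[0]) != seed(precursor, canonical[0]),
    }


# Toy precursor fragment and coordinates, for illustration only.
toy_precursor = "GGCUGAGGUAGUAGGUUGUAUAGUUUGG"
canonical_span = (3, 25)

three_prime = classify_isomir(toy_precursor, canonical_span, (3, 26))
five_prime = classify_isomir(toy_precursor, canonical_span, (4, 25))
```

Only the 5’ variant alters the seed, which matches the abstract's point that 5’ end changes are the ones likely to redirect targeting.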

43: Building generalised protein-protein interaction models for robust out-of-distribution, cross-species interactions using RAPPPID
Track: General Session
  • Joseph Szymborski, Department of Electrical and Computer Engineering, McGill University. Mila, Quebec AI Institute., Canada
  • Amin Emad, Department of Electrical and Computer Engineering, McGill University. Mila, Quebec AI Institute., Canada


Presentation Overview: Show

Model organisms like Homo sapiens or Mus musculus enjoy the privilege of having their protein-protein interaction (PPI) networks largely characterised through high-confidence experimental evidence. While the networks of these organisms are mature and well-studied, it would take far too much effort to perform the same experimental validation on more obscure, lesser-studied species.

In silico methods are ideal for bridging the gap between well- and lesser-studied organisms, as they typically require fewer resources and less time than their in vitro and in vivo counterparts. Machine learning (ML) models that infer PPIs have been long proposed for this purpose. Unfortunately, supervised ML methods face the challenge that the lesser-studied species which would most benefit from having their PPIs inferred lack sufficient data to train accurate models.

ML methods which exhibit strong out-of-distribution (OOD) performance can overcome the small-dataset challenges by training on PPI networks of organisms with many edges, and inferring the edges of the incomplete, lesser-studied network. Achieving strong OOD performance, however, is a very difficult task for computational PPI prediction methods as it requires that they sufficiently generalise to the overall problem space. In fact, existing PPI methods often have a demonstrably difficult time making inferences to PPIs of proteins that (1) are outside of the training set and (2) are of a different species.

We have developed RAPPPID (PMID: 35771595), a method for Regularised Automatic Prediction of Protein-Protein Interactions using Deep Learning, that makes accurate PPI predictions on OOD data samples from evolutionarily distant species. RAPPPID takes pairs of amino acid (AA) sequences as its only input. These AA sequences are encoded using a deep twin AWD-LSTM neural network which generates latent embeddings for both sequences. These embeddings are subsequently fed into a multi-layer perceptron (MLP) classification network. RAPPPID was trained on high-confidence edges (>95%) from the STRING dataset.

RAPPPID outperforms leading PPI prediction methods, including D-SCRIPT and SPRINT, when evaluated on datasets which carefully control for data leakage. RAPPPID is capable of accurately predicting the interaction between the therapeutic antibodies Trastuzumab and Pertuzumab and their target HER-2. RAPPPID’s performance does not degrade when it is trained and tested on the other species evaluated. Further, RAPPPID models trained on human data accurately predict edges from other species, often achieving performance comparable to models trained on those very species. RAPPPID models trained on human data achieve even greater performance on other species when transfer learning is used to fine-tune the model on those species.
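
The abstract above stresses evaluation on proteins that are outside the training set. One common way to control for this kind of data leakage is to split by protein rather than by pair. The sketch below is only an illustration of that idea under assumed inputs; the function name, the protein identifiers, and the choice to discard pairs that straddle the split are hypothetical and are not taken from the RAPPPID paper.

```python
import random


def split_pairs_by_protein(pairs, test_fraction=0.2, seed=0):
    """Split interaction pairs so that no protein appears in both train and test.

    Illustrative only: pairs with one protein on each side of the split
    are set aside rather than assigned to either set.
    """
    proteins = sorted({protein for pair in pairs for protein in pair})
    rng = random.Random(seed)
    rng.shuffle(proteins)
    n_test = max(1, int(len(proteins) * test_fraction))
    test_proteins = set(proteins[:n_test])

    train, test, dropped = [], [], []
    for a, b in pairs:
        a_in_test, b_in_test = a in test_proteins, b in test_proteins
        if a_in_test and b_in_test:
            test.append((a, b))
        elif not a_in_test and not b_in_test:
            train.append((a, b))
        else:
            dropped.append((a, b))  # pair straddles the split
    return train, test, dropped


# Made-up protein identifiers, for demonstration only.
example_pairs = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E"),
                 ("E", "F"), ("A", "F"), ("G", "H")]
train_pairs, test_pairs, dropped_pairs = split_pairs_by_protein(
    example_pairs, test_fraction=0.5, seed=1)
```

A protein-level split is stricter than a random pair-level split, at the cost of losing the pairs that cross the boundary.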

45: A deep learning model of preclinical-to-clinical anti-cancer drug response prediction and biomarker identification
Track: General Session
  • David Earl Hostallero, McGill University, Canada
  • Lixuan Wei, Mayo Clinic, United States
  • Liewei Wang, Mayo Clinic, United States
  • Junmei Cairns, Mayo Clinic, United States
  • Amin Emad, McGill University, Canada


Two major tasks of precision medicine include the prediction of clinical drug response and the identification of biomarkers of drug sensitivity. However, the scarcity of clinical drug response data is a significant bottleneck in the development of more complex and sophisticated machine learning pipelines for these two tasks. In a recent study [1], we developed a deep learning pipeline called TINDL, which is trained on preclinical cancer cell lines to predict the patient tumors’ responses to different treatments. Through our proposed tissue-informed normalization, TINDL leverages the prior knowledge of the distribution of the tissue (and cancer) types to reduce the statistical discrepancies between cell lines and patient tumors.

Additionally, the TINDL pipeline includes an explanation submodule that provides interpretability for our deep learning model. This submodule identifies a small subset of genes whose expression contributes considerably to the trained model’s prediction as potential novel biomarkers of drug response. TINDL was trained on a large-scale database of cancer cell lines and drug responses (Genomics of Drug Sensitivity in Cancer (GDSC)) and was evaluated on patient tumor data from The Cancer Genome Atlas (TCGA). Our results demonstrated TINDL’s capability to segregate resistant and sensitive tumors for 10 out of 14 drugs, outperforming various other machine learning models. We performed small interfering RNA (siRNA) knockdown experiments on 10 genes identified by our pipeline for tamoxifen, one of the drugs for which our model has shown predictive power. Our experiments confirmed that for MCF7 cells, all of these identified genes influence tamoxifen sensitivity, while for T47D cells, seven of these genes have a significant influence on the cells’ tamoxifen sensitivity. Moreover, genes identified as potential biomarkers for multiple drugs suggest some similarity in the mechanisms of action among drugs and implicate several important signaling pathways.

This abstract is based on our 2023 published article [1]. The code can be accessed at https://github.com/ddhostallero/tindl.

[1] Hostallero, D. E., Wei, L., Wang, L., Cairns, J., & Emad, A. (2023). Preclinical-to-clinical Anti-cancer Drug Response Prediction and Biomarker Identification Using TINDL. Genomics, Proteomics & Bioinformatics. (DOI: 10.1016/j.gpb.2023.01.006)

47: Benchmarking Taxonomic Classification Metagenomics Tools for Virus Variant and Strain Identification
Track: General Session
  • Harbinder Kaur, Jawaharlal Nehru University, India
  • Trishala Das, Jawaharlal Nehru University, India
  • Shailendra Niboriya, Jawaharlal Nehru University, India
  • Andrew Lynn, Jawaharlal Nehru University, India


Metagenomics has emerged as an important field of research that uses genetic material recovered directly from environmental or clinical samples to infer the taxonomic variation and abundance of its constituent microbiome. Identifying microbial taxa present in complex biological and environmental samples is an emerging challenge in microbiology, with common applications ranging from determining the etiology of an infection from a patient’s blood sample to examining the bacterial diversity of an environmental soil sample. Metagenomics, by directly inferring the community composition from a microbiome sample, enables more rapid species detection and the discovery of novel species without requiring culture-dependent approaches. Taxonomic classification, the assignment to biological clades with shared ancestry, is mainly based on a genome similarity search of large genome databases. Current methods for taxonomic classification from genome read sequences have varying computational resource requirements and, depending on their algorithm, differing sensitivity to evolutionary divergence. Widely used metagenomic classifiers include Kraken2, Centrifuge, GOTTCHA, and MetaPhlAn2. Kraken2 uses a database of k-mers and the lowest common ancestor (LCA) approach to map the query sequence against the database for taxonomic classification. Centrifuge uses the Burrows-Wheeler transform (BWT) indexing scheme to make the database compact and enable rapid identification of the reads. GOTTCHA (Genomic Origins Through Taxonomic Challenge) uses a hierarchical suite (database) of unique signatures obtained from prokaryotic and viral genomes and maps the split reads to the database for species identification. MetaPhlAn2 relies on a database of unique clade-specific marker genes for taxonomic classification. We benchmark the above methods for both their computational requirements and taxonomic resolution using a simulated virome dataset.
While all methods provide rapid taxonomic identification at the species level, none can provide sufficient resolution of variants within a species. We therefore recommend using a variant mapping tool like UShER (Ultrafast Sample placement on Existing tRees), which uses the maximum parsimony method to quickly add new samples to a preexisting phylogeny, in tandem with Centrifuge, which achieves the highest sensitivity and precision with the lowest computational resources at the species level, to provide high-resolution taxonomic classification of evolutionarily divergent samples.
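
The k-mer and lowest common ancestor (LCA) approach described above can be sketched on a toy example. This is a simplified illustration, not Kraken2's implementation: the taxonomy, the sequences, and all function names are hypothetical, and the scoring is a reduced form of root-to-leaf path scoring. It also shows why reads drawn from regions shared by two variants can only be resolved to the parent taxon.

```python
from collections import Counter

# Hypothetical toy taxonomy: child -> parent (the root has no parent).
PARENT = {"root": None, "virusX": "root",
          "variantA": "virusX", "variantB": "virusX"}


def lineage(taxon):
    """Return the path from a taxon up to the root."""
    path = []
    while taxon is not None:
        path.append(taxon)
        taxon = PARENT[taxon]
    return path


def lca(a, b):
    """Lowest common ancestor of two taxa in the toy taxonomy."""
    ancestors = set(lineage(a))
    for taxon in lineage(b):
        if taxon in ancestors:
            return taxon
    raise ValueError("taxa do not share a root")


def kmers(seq, k):
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_db(genomes, k):
    """Map each k-mer to the LCA of all genomes that contain it."""
    db = {}
    for taxon, seq in genomes.items():
        for kmer in kmers(seq, k):
            db[kmer] = lca(db[kmer], taxon) if kmer in db else taxon
    return db


def classify(read, db, k):
    """Assign a read to the hit taxon whose root-to-taxon path has most k-mer hits."""
    hits = Counter(db[kmer] for kmer in kmers(read, k) if kmer in db)
    if not hits:
        return None

    def path_score(taxon):
        return sum(hits.get(ancestor, 0) for ancestor in lineage(taxon))

    # Ties are broken in favour of the deeper (more specific) taxon.
    return max(hits, key=lambda t: (path_score(t), len(lineage(t))))


# Made-up sequences that differ only at their 3' ends.
GENOMES = {"variantA": "ACGTACGTTTGACC", "variantB": "ACGTACGTTTGAGG"}
DB = build_db(GENOMES, k=5)
shared_read_call = classify("ACGTACGTT", DB, k=5)
specific_read_call = classify("GTTTGACC", DB, k=5)
unknown_read_call = classify("GGGGGGG", DB, k=5)
```

In this toy setting, a read taken from the shared region is assigned to the parent taxon, while a read covering a variant-specific k-mer is assigned to that variant.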

49: Identification and characterization of new unannotated human snoRNAs using an integrative transcriptomics strategy.
Track: General Session
  • Alphonse Birane Thiaw, Université de Sherbrooke, Canada
  • Étienne Fafard-Couture, Université de Sherbrooke, Canada
  • Danny Bergeron, Université de Sherbrooke, Canada
  • Michelle Scott, Université de Sherbrooke, Canada


Small nucleolar RNAs (snoRNAs) are non-coding RNAs (ncRNAs) present in all eukaryotes and best known for their involvement in the biogenesis of ribosomes. Generally, snoRNAs are located in the introns of longer genes; however, some are expressed from intergenic regions. Based on their motifs and functions, snoRNAs are classified into two groups: box H/ACA and box C/D snoRNAs, guiding rRNA pseudouridylation and methylation respectively. Besides these canonical functions, some are also known for their involvement in the regulation of gene expression at several levels, including alternative splicing, alternative polyadenylation and transcript stability. During the last two decades, more and more studies have indicated the involvement of snoRNAs in human diseases, in particular cancers. Despite the breadth of functionality described for snoRNAs, the majority are poorly characterized, and recent transcriptomic studies have identified new unannotated human snoRNAs and shown that many annotated human snoRNAs are not expressed. For the identification of snoRNAs in different eukaryotes, there are currently a small number of predictors, but these typically suffer from high rates of false positives and false negatives. We have recently demonstrated that TGIRT-seq, an RNA-seq approach that minimizes structure bias through the use of a thermostable reverse transcriptase, enables the accurate quantification of known snoRNAs and also facilitates the identification of snoRNAs missing from annotations. While we had originally considered only one cancer cell line for the discovery of new snoRNAs, we now aim to carry out a much wider screen to test the completeness of human snoRNA annotation. To do so, we have employed StringTie on TGIRT-seq data from diverse normal human tissues to identify expressed genes missing from current annotations, having an intronic or intergenic location and a size between 50 and 200 nucleotides.
These data were integrated with immunoprecipitation sequencing studies of core snoRNA binding proteins (PAR-CLIP and eCLIP) as well as RNA-seq studies following the depletion of these same proteins to identify the most likely snoRNA candidates. This methodology allowed us to identify 119 potential snoRNAs including 65 box C/D and 55 box H/ACA snoRNAs which we are currently further validating and characterizing. These results further demonstrate that the annotation of snoRNAs in humans is far from exhaustive, underscoring the need for more efficient and more reliable pipelines for their identification.

51: From Gene to Cognition: Mapping the effects of genomic deletions and duplications on cognitive ability
Track: General Session
  • Sayeh Kazem, University of Montreal, Canada
  • Kuldeep Kumar, CHU Sainte-Justine, Canada
  • Guillaume Huguet, CHU Sainte-Justine, Canada
  • Myriam Lizotte, Mila Quebec AI Institute, Canada
  • Thomas Renne, University of Montreal, Canada
  • Jakub Kopal, CHU Sainte-Justine, Canada
  • Stefan Horoi, Mila Quebec AI Institute, Canada
  • Zohra Saci, CHU Sainte-Justine, Canada
  • Martineau Jean-Louis, CHU Sainte-Justine, Canada
  • Guy Wolf, Mila Quebec AI Institute, Canada
  • David Glahn, Boston Children's Hospital, United States
  • Laura Almasy, Children’s Hospital of Philadelphia, United States
  • Guillaume Dumas, University of Montréal, Canada
  • Sebastien Jacquemont, CHU Sainte-Justine, Canada


Copy Number Variants (CNVs) are genomic deletions and duplications, which may encompass one or more genes. CNVs are major contributors to neurodevelopmental disorders and affect cognitive ability. Cognitive ability is a major trait assessed in the developmental pediatric and psychiatric clinic because of the high prevalence of mental disorders among people with intellectual disabilities (range 30–50%). Recent research estimated that scoring genes based on their intolerance to loss of function could predict the effects of CNVs on cognitive ability. However, these models remain inaccurate because they carry no information on the function of the protein encoded by the gene.
Overarching knowledge gap: The effects of most CNVs on cognition/risk for neurodevelopmental disorders, using the information on the function of the genes they disrupt, remain undocumented.
Hypothesis: The information of where (space) and when (time) genes are expressed in the brain and in what cell types and tissues (cells & tissues), can explain CNVs' effects on cognition.
Overarching aim: To understand and predict the effect of CNVs on cognition/risk for neurodevelopmental disorders using the information of space, time, cells, and tissues.
Methods: The one-by-one analysis of rare CNVs is limited to a few variants; to estimate the risk conferred by rare CNVs, we need to investigate CNVs in the aggregate and use the information on the nature and function of the genes they disrupt. Our analytical approach consists of two main steps:
1. Partitioning the genome into gene sets based on the information of space, time, or cell expression.
2. Estimating the mean effect size on cognition per category using a linear regression model, for deletion and duplication separately. In this model, cognitive ability is considered as a function of the number of genes deleted or duplicated, inside and outside of each category.
Based on this approach, we explain the effects of CNVs on cognitive ability using the information on:
1-The spatial gene expression across the cortex (space). We partition genes based on the information of common spatial brain expression pattern (anatomical hierarchy) and intolerance to loss of function. Then, we estimate the effect sizes on cognition (step 2).
2-The levels of gene expression across brain cell types & tissues (cells & tissues). We use the information on the levels of expression across tissues, cell types (adult and fetal), and GO terms. As this is a multidimensional problem, we first need to reduce the dimension. Then, on the reduced dimension, we look for neighbors of genes that are functionally close (using the searchlight algorithm). Finally, we estimate the effect size on cognition (step 2).
3-Neurodevelopmental brain (spatiotemporal) expression patterns (time). Using a conceptually similar approach to that outlined in parts 1 & 2, we partition the genome into gene sets based on the neurodevelopmental brain (spatiotemporal) expression patterns. Then we estimate the effect size on cognition per gene set.
We use a large dataset including the genotypic and phenotypic information of 258k healthy individuals, which is the result of 5 years of data processing by the Jacquemont lab. By taking advantage of this carefully curated archival data set, we explore how CNVs influence cognitive ability. Preliminary results suggest that the genes that are more expressed in neuro cells decrease cognitive ability the most when deleted or duplicated. Moreover, deletion and duplication alter cognitive ability by affecting genes in opposing spatial patterns of expression in the cortex.
If we succeed in all stages of our project, it will be possible to provide new computational methods to discover the impact of genetic variants on human cognition, which could be used in the neurodevelopment clinic to provide personalized counseling.
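
Step 2 of the methods above models cognitive ability as a function of the number of genes deleted (or duplicated) inside and outside a given gene category. A minimal sketch of such a regression, fitted by ordinary least squares, might look as follows. The function names and the numbers are synthetic and purely illustrative; they are not results or code from the study, and a real analysis would include covariates and a separate fit for deletions and duplications.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        known = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - known) / m[i][i]
    return x


def fit_category_effects(n_inside, n_outside, scores):
    """Fit score ~ intercept + b_in * n_inside + b_out * n_outside by least squares."""
    rows = [[1.0, float(i), float(o)] for i, o in zip(n_inside, n_outside)]
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * y for r, y in zip(rows, scores)) for i in range(3)]
    intercept, beta_in, beta_out = solve_linear_system(xtx, xty)
    return {"intercept": intercept, "inside": beta_in, "outside": beta_out}


# Synthetic example: genes deleted inside / outside one category per individual,
# with scores generated from assumed effects of -0.5 (inside) and -0.1 (outside).
genes_inside = [0, 1, 2, 0, 3, 1]
genes_outside = [0, 0, 1, 4, 2, 5]
z_scores = [-0.5 * i - 0.1 * o for i, o in zip(genes_inside, genes_outside)]
effects = fit_category_effects(genes_inside, genes_outside, z_scores)
```

Comparing the "inside" and "outside" coefficients across categories is what lets a per-category effect size be interpreted relative to the rest of the genome.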

53: STRUCTURAL CHARACTERIZATION OF NATURAL ANTISENSE TRANSCRIPTS WITH NANOPORE SEQUENCING
Track: General Session
  • J White Bear, McGill University, Canada
  • Grégoire De Bisschop, Institut de recherches cliniques de Montréal (IRCM), Canada
  • Eric Lécuyer, Institut de Recherches Cliniques de Montréal, Canada
  • Jerome Waldispuhl, McGill University, Canada


Natural antisense transcripts (NAT) are RNA pairs transcribed from overlapping, opposite DNA strands. NAT are expressed in all three domains of life and in retroviruses. They are involved in the regulation of RNA expression, including RNA maturation, stability, localization and translation. As such, they are frequently implicated in disease pathways, such as cancer. Yet it is unclear how NAT pairs bind, as the assumption that they form long intermolecular duplexes has never been challenged. We thus hypothesize that NAT pairs assemble through a wide range of structures spanning from mostly intramolecular to mostly intermolecular base pairings. Many chemical probing techniques have been developed and considerable progress has been made in the investigation of RNA structure. However, they provide an average signal that is hardly amenable to deconvolution and precludes the identification of discrete structures within a complex equilibrium. We focus on cen and ik2, two natural antisense mRNAs localized to the centrosome during mitosis in Drosophila embryos. They share a 59-nucleotide-long antisense region located in their 3’UTR, a prerequisite for their interaction and localized translation.
We employ direct RNA sequencing to identify adduct positions on single RNA molecules using nanopore reads. Nanopore reads utilize a 5-mer, dwell times, and current signal to characterize RNA sequences. Sequenced reads are aligned and normalized to produce reactivity profiles that are used to predict modification of unpaired nucleotides via statistical analysis or machine learning methods, such as SVM. These methods have been limited in the scope of their feature sets and predictions, and in the size of available data. To address these challenges, we collect a large training set of both modified and unmodified cen and ik2 (n >= 20000) and their reactivity profiles. We first examine the consequence of information loss by expanding the feature set and incorporating reactivity profiles to create semi-supervised models which detect structural features and predict, de novo, reactivity profiles with improved correlations to ground truth. We next leverage these expanded feature and data sets to develop multi-class and multi-output deep learning models that jointly predict sequence, induced modifications, and secondary structural features. Preliminary results suggest that our methods yield comparable or improved identifications relative to standard SHAPE and existing direct RNA analyses.

55: GenomicKB: a knowledge graph for the human genome
Track: General Session
  • Fan Feng, University of Michigan, United States
  • Feitong Tang, University of Michigan, United States
  • Yijia Gao, University of Michigan, United States
  • Shuyuan Yang, University of Michigan, United States
  • Tianjun Li, University of Michigan, United States
  • Dongyu Zhu, University of Michigan, United States
  • Yuan Yao, University of Michigan, United States
  • Yuanhao Huang, University of Michigan, United States
  • Jie Liu, University of Michigan, United States


Genomic Knowledgebase (GenomicKB) is a graph database for researchers to explore and investigate the human genome, epigenome, transcriptome, and 4D nucleome with simple and efficient queries. The database uses a knowledge graph to consolidate genomic datasets and annotations from over 30 consortia and portals, including 347 million genomic entities, 1.36 billion relations, and 3.9 billion entity and relation properties. GenomicKB is equipped with a web-based query system (https://gkb.dcmb.med.umich.edu/) which allows users to query the knowledge graph with customized graph patterns and specific constraints on entities and relations. Compared with traditional tabular-structured data stored in separate data portals, GenomicKB emphasizes the relations among genomic entities, intuitively connects isolated data matrices, and supports efficient queries for scientific discoveries. GenomicKB transforms complicated analyses among multiple genomic entities and relations into coding-free queries, and will facilitate data-driven genomic discoveries in the future.

Paper: Feng, Fan, et al. "GenomicKB: a knowledge graph for the human genome." Nucleic Acids Research 51.D1 (2023): D950-D956.
https://academic.oup.com/nar/article/51/D1/D950/6786196

57: Major changes in the myocardial metabolome following nicotinamide riboside supplementation in a heart failure rat model
Track: General Session
  • Pamela Mehanna, Montreal Heart Institute, Research Center, Montreal, QC, Canada
  • Selma Lopez Vaquera, Department of Signalling and Cardiovascular Pathophysiology, UMR-S 1180, INSERM, Université Paris-Saclay, France
  • Florence Castelli, Département Médicaments et Technologies pour la Santé, MetaboHUB, Université Paris-Saclay, CEA, INRAE, France
  • François Fenaille, Département Médicaments et Technologies pour la Santé, MetaboHUB, Université Paris-Saclay, CEA, INRAE, France
  • Jean-Christophe Grenier, Montreal Heart Institute, Research Center, Montreal, QC, Canada
  • Matthieu Ruiz, Montreal Heart Institute, Research Center, Montreal, QC & Department of Nutrition, Université de Montréal, Canada
  • Julie Gandouet Hussin, Montreal Heart Institute, Research Center, Montreal, QC & Department of Medicine, Université de Montréal, Canada
  • Mathias Mericksay, Department of Signalling and Cardiovascular Pathophysiology, UMR-S 1180, INSERM, Université Paris-Saclay, France


Background: Nicotinamide riboside (NR) is a precursor of nicotinamide adenine dinucleotide (NAD), a central regulator of human metabolism. Alterations of NAD homeostasis are widely studied in the context of cardiac diseases, including heart failure (HF). NAD stimulation has emerged as a new avenue for the development of metabolic therapy of HF. Aim: Here, we aim to investigate the impact of oral NR supplementation on metabolite levels in four tissues using a rat model of HF. Methods: Forty male rats were distributed into four groups; they were subjected either to a mock surgery (Sham) or to a permanent ligation of the left anterior descending coronary artery (MI). Both groups were fed either a control diet (CD) or an NR-supplemented diet. Untargeted metabolomics was performed on left ventricular (LV), liver and kidney tissues as well as on plasma extractions for each rat. LC-MS was used to measure metabolite levels using both hydrophilic interaction (HILIC) and reverse phase (C18) columns. Results: Unsupervised learning analyses show a clear effect of the diet (CD vs NR), most notably in LV and kidney tissues. Linear regression models further highlight LV as the tissue with the strongest changes in the metabolome when we compare diets, with a larger effect in MI than in Sham. A total of 1563/3151 metabolite features were differentially quantified in the NR diet compared to the CD one in the MI condition (FDR < 0.05) and 1106/3151 in the Sham condition. Furthermore, we detected interaction effects in plasma and LV highlighting potential metabolite candidates that could be regulated by the NR diet. Finally, in an effort to better characterise the metabolites’ modulation between tissues, we used metabolites shared between LV and the other three tissues and identified tissue-specific patterns of effects.
Metabolites shared between LV and plasma are consistently modulated after NR supplementation, whereas kidney and liver signatures are less consistent with the LV patterns. Conclusions: In this study, we identified tissue-specific metabolic signatures in all four tissues investigated and found overwhelming changes in the cardiac metabolome following NR supplementation in rats, with a larger effect in MI than in Sham. Our interaction analysis highlights putative candidate metabolites that could be regulated by the NR diet. Significance: These results provide a better understanding of how NR supplementation can alter the myocardial and extra-myocardial metabolome in the context of HF.

59: SEM: size-based expectation maximization for characterizing nucleosome positions and types
Track: General Session
  • Jianyu Yang, The Pennsylvania State University, United States
  • Shaun Mahony, The Pennsylvania State University, United States
  • Kuangyu Yen, Southern Medical University, China


Genome-wide nucleosome positions are most popularly characterized using a combination of micrococcal nuclease and high-throughput sequencing (MNase-seq). MNase-seq typically shows the existence of nucleosome-free regions (NFRs) upstream of transcription start sites (TSSs). Traditional MNase-seq employs extensive MNase digestion and performs size-selection to specifically enrich mono-nucleosome-sized DNA fragments. However, depending on their chemical composition, nucleosomes do not always protect ~147bp of DNA. For example, nucleosomes engaged by Pol II transiently lose one H2A-H2B dimer and become hexamers. Nucleosomes containing the histone variant H2A.Z could have looser DNA wrapping than canonical nucleosomes.
Although there is growing attention to various nucleosome types, currently available nucleosome analysis packages mainly focus on characterizing nucleosome dyad locations, occupancy, and positioning fuzziness, and usually assume that nucleosomes protect a standard 147bp DNA fragment, which makes them unsuitable for research on nucleosomes that do not fit this assumption. To address the need for approaches that can characterize different nucleosome types from MNase-seq data, we developed the Size-based Expectation Maximization (SEM) nucleosome analysis package. SEM models the distribution of MNase-seq fragments around nucleosomes via a two-component Gaussian mixture model. In addition, SEM takes the distribution of protected DNA fragment lengths into consideration to distinguish nucleosome types. Benchmark analysis shows that SEM achieves performance competitive with existing nucleosome calling packages in predicting nucleosome dyad location, occupancy, and fuzziness.
To demonstrate the ability of SEM to distinguish unique nucleosome types, we studied a special type of “fragile” nucleosome residing in MNase-seq-defined NFRs. Fragile nucleosomes are proposed to protect relatively short DNA fragments under low MNase concentrations and cannot resist the higher concentrations of MNase typically used in MNase-seq. Since previously characterized fragile nucleosomes are located within promoter regions and may play regulatory roles, further characterizing fragile nucleosomes is of critical importance. We expected that SEM could help determine the genome-wide distribution profile of fragile nucleosomes.
Applied to a low MNase-concentration H2B MNase-ChIP-seq dataset from mouse embryonic stem cells, SEM discovers three nucleosome types: short-fragment nucleosomes, canonical nucleosomes, and di-nucleosomes. Short-fragment nucleosomes can be further classified into two subtypes, dependent on chromatin accessibility. One set of short-fragment nucleosomes is located within accessible regions, exhibits relatively high MNase sensitivity, and displays distribution patterns around TSS and CTCF peaks similar to those of the previously reported fragile nucleosomes. This subtype of short-fragment nucleosome is located not only in promoters, but also in enhancer regions and other accessible regions. Further exploration suggests that they colocalize with the chromatin remodelers Chd6, Chd8 and Ep400. Another set of short-fragment nucleosomes (hereafter called “non-canonical” nucleosomes) is located outside accessible regions and displays a high enrichment of weak nucleotides at the exit/entry sites. Directly related to this A/T enrichment, we observed a relatively high enrichment of several transcription factor binding motifs, such as those of Fox family factors. Although A/T-biased MNase digestion may account for the sensitivity of these non-canonical nucleosomes, we suggest that the motifs at the entry/exit sites of non-canonical nucleosomes could potentially serve as transcription factor engaging sites.
In summary, SEM provides an effective platform for characterizing distinct nucleosome subtypes and will facilitate a deeper characterization of fragile nucleosomes.
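
To illustrate the kind of model mentioned above, the following is a minimal sketch of expectation maximization for a generic one-dimensional, two-component Gaussian mixture, applied here to simulated fragment lengths. It is not SEM's implementation: the function names, the initialisation, and the simulated lengths (a hypothetical short-fragment population near 100bp and a canonical population near 147bp) are assumptions made for the example.

```python
import math
import random


def normal_pdf(x, mean, sd):
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))


def fit_two_gaussians(values, n_iter=100):
    """Fit a two-component 1D Gaussian mixture with plain EM."""
    lo, hi = min(values), max(values)
    means = [lo + 0.25 * (hi - lo), lo + 0.75 * (hi - lo)]
    sds = [(hi - lo) / 4 or 1.0] * 2
    weights = [0.5, 0.5]
    for _ in range(n_iter):
        # E-step: responsibility of each component for each value.
        resp = []
        for x in values:
            p = [w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, sds)]
            total = sum(p) or 1e-300
            resp.append([pk / total for pk in p])
        # M-step: re-estimate weights, means and standard deviations.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            if nk < 1e-12:
                continue
            weights[k] = nk / len(values)
            means[k] = sum(r[k] * x for r, x in zip(resp, values)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, values)) / nk
            sds[k] = max(math.sqrt(var), 1e-3)
    return weights, means, sds


# Simulated fragment lengths (made-up parameters, for demonstration only).
rng = random.Random(0)
lengths = ([rng.gauss(100, 8) for _ in range(300)]
           + [rng.gauss(147, 6) for _ in range(700)])
weights, means, sds = fit_two_gaussians(lengths)
```

The fitted component weights give the estimated share of each fragment population, and the responsibilities computed in the E-step can be used to assign individual fragments to a component.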

61: Systematic pan-cancer analysis to reveal the prognostic significance of driver mutations and the tumor immune microenvironment
Track: General Session
  • Masroor Bayati, Ontario Institute for Cancer Research (OICR), Canada
  • Michael Slobodyanyuk, Ontario Institute for Cancer Research (OICR), Canada
  • Jüri Reimand, Ontario Institute for Cancer Research (OICR), Canada


Introduction: Driver mutations in cancer genomes and the tumor immune microenvironment (TIME) have been studied extensively to elucidate biological mechanisms and develop biomarkers. However, the functional interactions of driver mutations with TIME and patient phenotypes remain less explored [1]. Our pan-cancer study suggests that the co-occurrences of specific driver mutations and TIME features provide complementary insights into cancer biology and can lead to biomarker development. We focused on hepatocellular carcinoma (HCC) of the liver, a major cancer type with poor prognosis that is significantly increasing in incidence in Canada [2].

Method: We developed a machine learning framework to discover prognostic biomarkers that integrates cancer driver mutations (SNVs, CNAs) with immune cell infiltration (ICI) profiles as TIME features [3]. The method finds functional interactions of cancer drivers and ICIs (driver-ICIs) such that the interactions are more powerful predictors of patient survival compared to either class of features alone. We applied our method to transcriptomic, genomic and clinical data of ~7000 primary tumors of 24 cancer types from The Cancer Genome Atlas (TCGA) [4]. To understand the functional implications of the identified driver-ICI interactions, we studied matching transcriptomics datasets using pathway and network analyses.

Results: We found 82 driver-ICI interactions across 18 cancer types that define novel groups of high-risk or low-risk tumors. Prognostic models of driver-ICI interactions outperformed baseline models of clinical variables, indicating the complementary value of our models. For example, in liver cancer, TP53 driver mutations coupled with high infiltration of neutrophils in the tumors indicated poor patient prognosis. These high-risk tumors were associated with the dysregulation of dozens of genes in key metabolic pathways. In particular, our pathway analyses suggested that this driver-ICI interaction may contribute to mitochondrial dysfunction in HCC [5].

Conclusion: Integrative analysis of genomic and immune microenvironmental properties provides complementary basic and translational insights in multiple cancer types.

Outcome / Impact: Analysis of driver-ICI interactions highlights potential prognostic biomarkers, helps elucidate biological mechanisms of oncogenesis and tumor progression, and may reveal clues towards precision therapeutic strategies.


References:

[1] Wellenstein, M.D., de Visser, K.E. “Cancer-Cell-Intrinsic Mechanisms Shaping the Tumor Immune Landscape”. Immunity (2018). https://doi.org/10.1016/j.immuni.2018.03.004

[2] Yang, J.D., Hainaut, P., Gores, G.J. et al. “A global view of hepatocellular carcinoma: trends, risk, prevention and management”. Nat Rev Gastroenterol Hepatol (2019). https://doi.org/10.1038/s41575-019-0186-y

[3] Newman, A.M., Steen, C.B., Liu, C.L. et al. “Determining cell type abundance and expression from bulk tissues with digital cytometry”. Nat Biotechnol (2019). https://doi.org/10.1038/s41587-019-0114-2

[4] The Cancer Genome Atlas Research Network., Weinstein, J., Collisson, E. et al. “The Cancer Genome Atlas Pan-Cancer analysis project”. Nat Genet (2013). https://doi.org/10.1038/ng.2764

[5] Li, X., Ramadori, P., Pfister, D. et al. “The immunological and metabolic landscape in primary and metastatic liver cancer”. Nat Rev Cancer (2021). https://doi.org/10.1038/s41568-021-00383-9

63: Neural-network based classification of regulatory elements active in human gliomas identifies DNA shape features as important for regulatory activity
Track: General Session
  • Magdalena Machnicka, University of Warsaw, Poland
  • Marlena Osipowicz, University of Warsaw, Poland
  • Julia Smolik, University of Warsaw, Poland
  • Bartek Wilczynski, Institute of Informatics, University of Warsaw, Poland


Presentation Overview:

Gene regulatory DNA sequences, particularly enhancers and promoters, are central to the regulation of gene expression in eukaryotes. Cells of these organisms seem to have no problem identifying these functional elements among millions of bases of DNA that have no regulatory function. Our computational models, however, have great difficulty distinguishing functional regulatory elements from non-functional sequence, and discerning between enhancers and promoters that are active in different conditions or cell types proves even more difficult.

In the last decade, many approaches to gene regulatory element classification have been proposed and tested. In recent years, we have used Bayesian networks (Bonn et al. 2012), Random Forests (Herman-Iżycka et al. 2017) and support vector machines (Podsiadło et al. 2013) to predict enhancer and promoter positions in human and model organism genomes. However, although each of these models was quite effective on its respective dataset, it seems to be very difficult to translate the results obtained in one of the studied systems to other biological contexts. More recently, methods based on artificial neural networks have shown great success in many areas, including classification tasks originating from molecular biology. One such approach, Basset (Kelley et al. 2016), applied a particular type of convolutional neural network to predicting DNA regions of accessible (open) chromatin in different tissues.

Since we have recently published (Stepniak et al. 2021) an atlas of regulatory elements (promoters and enhancers) active in gliomas, we were interested to see whether we could modify the Basset model to suit our task of discerning between active and inactive enhancers and promoters in glioma samples taken from multiple patients. This was an especially interesting case for study: not only did we have the positions and activity of these elements measured (by means of ChIP-Seq of histone modifications), but we also had these patients genotyped, allowing us to assess the potential effect of DNA variants on the activity of the tested regulatory elements. After modifying the model and creating several different training datasets, we can show that our convolutional neural network provides better classification accuracy than the classical methods (AUC>80% vs <70% for Random Forests), and that integrating patient mutations into neural network training can further increase performance (AUC even above 90%). After careful study of the internal structure of the filters learned by the network, we can also show connections between the features our model uses to classify sequences and both the DNA sequence specificity of transcription factors and DNA shape parameters (as defined by Zhou et al. 2013). A surprising outcome of this study is that many filters that are essential for the network's performance are attributable solely to DNA shape rather than to transcription factor binding.

In summary, we present a novel, neural-network-based approach to regulatory element classification that outperforms our earlier methods and allows for introspection that identifies novel features important for regulatory sequence activity.

65: Overture: An Open Source Platform for Scalable Genomics Data Infrastructure
Track: General Session
  • Mitchell Shiell, Ontario Institute for Cancer Research (OICR), Canada
  • Jon Eubank, Ontario Institute for Cancer Research (OICR), Canada
  • Justin Richardson, Ontario Institute for Cancer Research (OICR), Canada
  • Brandon Chan, Ontario Institute for Cancer Research (OICR), Canada
  • Puneet Bajwa, Ontario Institute for Cancer Research (OICR), Canada
  • Robin Haw, Ontario Institute for Cancer Research (OICR), Canada
  • Christina Yung, Ontario Institute for Cancer Research (OICR), Canada
  • Lincoln Stein, Ontario Institute for Cancer Research (OICR), Canada
  • Melanie Courtot, Ontario Institute for Cancer Research (OICR), Canada
  • Overture Team, Ontario Institute for Cancer Research (OICR), Canada


Presentation Overview:

Large-scale and sustainable data repositories (data commons) are essential resources that accelerate scientific discoveries by facilitating the creation and consumption of unified genomics datasets. Unfortunately, building out and maintaining data commons is a resource-intensive pursuit requiring a team of software engineers, cloud infrastructure specialists and bioinformaticians. With advances in next-generation sequencing, genomics data flows have propelled both small and large genomics research efforts to the cloud, exacerbating the need for software infrastructures that securely connect and unify large and often distributed genomics data. Overture [1,2] addresses this with an extensible open-source platform of modular components that can be assembled into scalable genomics data infrastructures.

The five core microservices of Overture (Ego, Score, Song, Maestro and Arranger) work in concert to create fully functional and scalable data commons. Ego handles authentication and authorization through single-sign-on identity providers and an administrative Ego UI component. Score is our file transfer and object storage microservice with integrated SAMtools [3] functionalities, including BAM and CRAM slicing. Metadata management is handled by Song, which tracks and validates file metadata across distributed servers and against user-defined schemas. Maestro handles the indexing of Song repositories into a single Elasticsearch index. Arranger, our data-agnostic search API with pre-built and configurable UI components, then consumes these indexes. This configurable UI enables researchers to create and customize interactive data portals for filtering, querying and collaborating on large datasets.

Our experiences working on the International Cancer Genome Consortium (ICGC) [4,5] and the NCI Genomic Data Commons (GDC) Data Portal [6] initially informed Overture's core capabilities. Today, Overture powers and informs all our projects, notably ICGC-Accelerating Research in Genomic Oncology (ICGC-ARGO) [7], a data commons that will analyze specimens from 100,000 cancer patients alongside high-quality clinical data. Applications powered by Overture are used by researchers worldwide and have collectively handled approximately 2.5 petabytes of data, equalling around three million genomes. In addition, we aim to expand our community of bioinformaticians and software engineers dedicated to advancing genomics research. We will continue to share, support, and expand Overture services to achieve this.

References

1. Overture- Software for big data genomic science. https://www.overture.bio/.
2. Overture source code. https://github.com/overture-stack.
3. Twelve years of SAMtools and BCFtools. Petr Danecek, James K Bonfield, Jennifer Liddle, John Marshall, Valeriu Ohan, Martin O Pollard, Andrew Whitwham, Thomas Keane, Shane A McCarthy, Robert M Davies, Heng Li. GigaScience, Volume 10, Issue 2, February 2021, giab008 https://doi.org/10.1093/gigascience/giab008
4. Zhang, J. et al. The International Cancer Genome Consortium Data Portal. Nature Biotechnology vol. 37 367–369 (2019).
5. ICGC Data Portal. http://dcc.icgc.org.
6. Wilson, S. et al. Developing Cancer Informatics Applications and Tools Using the NCI Genomic Data Commons API. Cancer Res. 77, e15–e18 (2017).
7. ICGC ARGO Data Portal. https://platform.icgc-argo.org/.

69: RiboInfo: a database of experimentally assessed riboswitches.
Track: General Session
  • Papa Mamadou Ndoye, Intern, Canada
  • Meryem Raies, Researcher, Canada
  • Emre Yurdusev, Researcher, Canada
  • Jonathan Perreault, Researcher - Lab Director, Canada


Presentation Overview:

Riboswitches are important RNA elements for understanding gene regulation in microorganisms. They are located in untranslated regions (UTRs) upstream of the genes or groups of genes whose expression they control. They are structured RNA elements composed of a natural aptamer and an expression platform. The aptamer can bind specifically to a given ligand. Binding of the ligand to the aptamer causes a conformational change that acts on the secondary and tertiary structures of the expression platform. This results in regulation of the gene downstream of the riboswitch.
Based on the conservation of sequence and structure through evolution, tens of thousands of sequences have been annotated bioinformatically. However, only a few hundred have been experimentally evaluated. It is therefore possible that many of the annotated riboswitches do not function exactly as expected. To better evaluate this hypothesis, the first step is to establish an exhaustive list containing relevant information on the experimentally evaluated riboswitches, hence the creation of a database. To collect the desired information, we first gathered experimental results and methods from articles. We then combined these with the RiboGap database [Naghdi et al. 2017], using Java scripts to collect the corresponding intergenic regions (IGRs) and extract the exact riboswitch sequences experimentally assessed in the articles. After collecting the essential data, a MySQL database was set up with six interconnected tables. The database currently gathers information on 175 riboswitch sequences, including 71 wild-type sequences and 104 induced mutants, representing a total of 48 riboswitches studied in 44 articles, with many more in the process of being included.
The information gathered may help provide new interpretations of the mechanisms of riboswitches and also help identify knowledge gaps. Similarly, it puts into perspective our limited knowledge of the tens of thousands of annotations available, especially with regard to less studied aspects such as expression platforms. Furthermore, the database can contribute to a better knowledge of aptamers thanks to the information obtained on aptamer-ligand binding, including for mutant versions. Later, the collected information could be used to help design new tools for the prediction and design of riboswitches, and could serve as a resource for future tools that define with more rigor and certainty the real function of annotated riboswitches.


Reference
Naghdi et al. “Search for 5'-leader regulatory RNA structures based on gene annotation aided by the RiboGap database.” Methods. vol. 117 (2017): 3-13. doi:10.1016/j.ymeth.2017.02.009

71: Dimensional reduction to represent multi-layer epigenomic data
Track: General Session
  • Joseph Boyd, University of Vermont, United States
  • Seth Frietze, University of Vermont, United States


Presentation Overview:

The epigenome is comprised of multiple layers of epigenetic modifications that occur in cell-type specific patterns. Epigenomic profiling methods have been developed to provide one-dimensional genomic maps of distinct chromatin marks in different cell types, but due to data interpretation and visualization challenges, the central question of how combinatorial patterns of different chromatin marks superimpose across the genome remains difficult to interpret. Here we present ChIP-tsne as a visualization tool for the discovery and understanding of varied epigenomic patterns from chromatin state maps. ChIP-tsne is an R-based chromatin pattern comparison method that applies dimensionality reduction, namely t-Distributed Stochastic Neighbor Embedding (t-SNE), to multi-layer epigenomic data. We used ChIP-tsne to explore the patterns of interaction among multiple epigenetic modifications between different cell types. We find that distinct sets of epigenomic features superimpose to provide cell type specific chromatin maps, revealing distinct enriched pathways and gene expression differences. ChIP-tsne should be broadly applicable for epigenomic comparisons and provides a powerful new tool for studying multidimensional chromatin differences at the genome scale.

73: Treatment-associated mutation rate variation in regulatory elements in metastatic cancer genomes
Track: General Session
  • Kevin Cheng, University of Toronto, Canada
  • Jüri Reimand, University of Toronto, Canada


Presentation Overview:

Most mutations in cancer genomes are selectively neutral mutations called passengers, which accumulate during tumor evolution. Although passenger mutations do not directly drive carcinogenesis, they reflect the mutational processes that contributed to tumor evolution. Cancer therapies are often mutagenic and are reflected in the catalogue of mutations in cells that evade therapy to propagate as recurrent or metastatic tumors. It is known that mutation rate is not uniform in cancer genomes. Some functional elements, such as transcription factor binding sites, are especially susceptible to somatic mutations.
The contribution of cancer therapies to the mutational landscape in the non-coding genome has been understudied. Here, we show that radiotherapy exposure is associated with a high mutation rate in constitutively active CTCF binding sites in metastatic colorectal cancer (CRC), independent of patient clinical factors. Mutational signature analysis revealed an enrichment of select mutational processes in CTCF sites in tumors with a history of radiotherapy treatment. We found a subset of CTCF sites with an aberrantly high mutation rate. Pathway analysis showed that these sites are associated with genes in cancer-related pathways.
Our results show that radiotherapy exposure is associated with increased activity of a subset of mutational processes. These mutational processes may target active CTCF binding sites, and they aggregate on sites that are associated with cancer-related pathways. These results reveal a potential side effect of radiotherapy through non-coding mutations and provide motivation for a more detailed understanding of the possible consequences of cancer therapies.

75: A generalizable Cas9/sgRNA prediction model using machine transfer learning with small high-quality datasets
Track: General Session
  • Tyler Browne, Western University (Ontario), Canada
  • Dalton Ham, Western University (Ontario), Canada
  • Tyler Wilson, Tesseraqt Optimization Inc., Canada
  • Richard Michael, Tesseraqt Optimization Inc., Canada
  • Pooja Banglorewala, Western University (Ontario), Canada
  • David Edgell, Western University (Ontario), Canada
  • Gregory Gloor, Western University (Ontario), Canada


Presentation Overview:

The bacterial adaptive immune system CRISPR has shown promise as a tool for use in genetic engineering, and more recently as a next-generation antimicrobial agent. To perform these functions, the CRISPR-Cas9 nuclease is targeted to a site by a single guide RNA molecule (sgRNA), where it then introduces a double-strand DNA break. The success of this tool is limited by the wide variation in activity of the targeting sgRNA sequence. Over the past decade there have been several data-driven attempts to predict the on- and off-target activity of sgRNAs, primarily in eukaryotic cells. However, current sgRNA activity prediction models do not generalize well to SpCas9/sgRNA activity prediction in bacteria, possibly because the underlying datasets used to train the models do not accurately measure SpCas9/sgRNA cleavage activity and cannot distinguish cleavage activity from toxicity. We solved this problem by using a two-plasmid positive selection system to generate high-quality, biologically relevant data that more accurately reports on SpCas9/sgRNA cleavage activity and that separates activity from toxicity. We then developed a new machine learning model for sgRNA activity prediction that is designed to enhance predictive performance through transfer learning with smaller amounts of high-quality data. Our unique dual-branch deep learning architecture, comprised of convolutional and recurrent neural network layers, optimizes the information transfer from the initial model (trained on a larger prior sgRNA activity dataset) to smaller datasets used for fine-tuning the performance. We present crisprHAL, an sgRNA activity prediction model that recapitulates known SpCas9/sgRNA-target DNA interactions and provides a pathway to a generalizable bacterial sgRNA activity prediction tool.

77: Systems genomics and network approaches uncover how single nucleotide polymorphisms in inflammatory bowel disease affect common processes through different mechanisms in a patient-specific way
Track: General Session
  • Dezso Modos, Quadram Institute Bioscience, United Kingdom
  • Martina Poletti, Earlham Institute, United Kingdom
  • Johanne Brooks, East and North Hertfordshire NHS Trust, United Kingdom
  • Matthew Madgwick, Earlham Institute, United Kingdom
  • Balazs Bohar, Imperial College London, United Kingdom
  • Simon Carding, Quadram Institute Bioscience, United Kingdom
  • Severine Vermeire, KU Leuven, Belgium
  • Bram Verstockt, KU Leuven, Belgium
  • Tamas Korcsmaros, Imperial College London, United Kingdom


Presentation Overview:

Background: Inflammatory bowel disease (IBD) is a life-long, chronic disease of the gut. It has two types: ulcerative colitis (UC) and Crohn’s disease (CD). Both types are associated with single nucleotide polymorphisms (SNPs). Some of these SNPs are shared between the two diseases, and many of them lie in non-coding regions of the genome, making their functional interpretation challenging. We previously developed the iSNP pipeline to predict how SNPs in transcription factor (TF) binding sites and miRNA target sites affect signaling networks and contribute to pathogenesis. We applied this method to study UC pathomechanisms. Here, we further developed iSNP to investigate the pathomechanisms of UC and CD at downstream regulatory levels, and to compare these mechanisms between individual patients.
Methods: The iSNP pipeline was used to analyze, in a patient-specific manner, the genes affected by IBD-associated SNPs in 1695 CD and 941 UC patients. To uncover downstream signaling pathways and regulatory processes affected by patient-specific SNPs, we used a heat propagation model to connect SNP-affected proteins to other proteins via a signaling network (using OmniPath), ultimately reaching affected TFs and their target genes through TF-target gene networks (using DoRothEA).
Results: In our cohorts, we found 47 regulatory SNPs in CD and also 47 SNPs in UC (including 8 SNPs overlapping). The iSNP pipeline predicted these SNPs may regulate the expression of 83 proteins in CD and 121 in UC. Network propagation analysis further identified 518 and 330 proteins affected significantly in CD and UC, respectively; with only 99 proteins shared in both diseases. The functional analysis of the shared proteins indicated the importance of ubiquitin ligases, cell cycle, T-cell mediated immunity and the WNT pathway. In CD-associated signaling networks, processes such as antigen presentation, C-type lectin signaling and cell cycle arrest were uniquely enriched, while T-cell activation and interleukin-2 production were enriched uniquely in UC. Analysis of affected TFs and their target genes identified a downstream regulatory network containing 2067 and 3352 genes in CD and UC, respectively. In CD, affected genes were involved in T cell activation, other immune functions, DNA repair and autophagy. Interestingly, a few genes were affected in over 70% of patients (MMP9, RARA and SSB2 in CD; TNFR13 and EGFR in UC).
Conclusion: The iSNP pipeline listed directly and indirectly affected processes in IBD for nearly 3000 patients. Our analysis confirmed the importance of autophagy in CD and T-cell activation in UC, and additionally revealed the different, patient-specific ways in which these processes are perturbed. The patient-specific analysis showed the expected heterogeneity of affected genes but unexpectedly pointed to a few key genes common to most patients. We showed how mostly different CD- and UC-associated SNPs use the same biological mechanisms to influence similar immune and developmental pathways with distinguishable outcomes.
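
The network propagation step described in the Methods can be illustrated with a minimal sketch. The abstract does not specify which heat propagation variant iSNP uses, so this example assumes one common formulation, an iterative propagation with restart on an undirected network; the function name `propagate`, the parameter `alpha`, and the toy four-node network are hypothetical and are not taken from the iSNP pipeline or OmniPath.

```python
def propagate(adjacency, seeds, alpha=0.5, iterations=50):
    """Spread 'heat' from seed nodes over an undirected network.

    adjacency: dict mapping each node to the set of its neighbours
               (assumed symmetric, with no isolated nodes).
    seeds:     dict mapping seed nodes (e.g. SNP-affected proteins)
               to their initial heat.
    alpha:     fraction of heat taken from neighbours at each step;
               the remaining (1 - alpha) is re-injected from the seeds.
    """
    nodes = sorted(adjacency)
    initial = {n: seeds.get(n, 0.0) for n in nodes}
    heat = dict(initial)
    for _ in range(iterations):
        updated = {}
        for n in nodes:
            # Each neighbour shares its heat equally among its own neighbours.
            incoming = sum(heat[m] / len(adjacency[m]) for m in adjacency[n])
            updated[n] = alpha * incoming + (1 - alpha) * initial[n]
        heat = updated
    return heat


# Toy chain network A - B - C - D with all initial heat placed on A.
network = {
    "A": {"B"},
    "B": {"A", "C"},
    "C": {"B", "D"},
    "D": {"C"},
}
heat = propagate(network, {"A": 1.0})
ranked = sorted(heat, key=heat.get, reverse=True)
```

In this toy case the heat decays with distance from the seed, so nodes closer to the SNP-affected protein rank higher. A real analysis would also need a significance test (for example against randomized seeds) to decide which proteins count as affected.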

79: The use of ChatGPT in wastewater-based epidemiology: an example with SARS-CoV-2 data collected in Northern Ontario
Track: General Session
  • Gustavo Ybazeta, Health Sciences North Research Institute, Canada
  • James Knockleby, Health Sciences North Research Institute, Canada
  • Aleksandra Mloszewska, Health Sciences North Research Institute, Canada
  • Dania Andino, Health Sciences North Research Institute, Canada
  • Anu Nair, Health Sciences North Research Institute, Canada


Presentation Overview:

Our work explores the ChatGPT chatbox and its potential to use the R language to help users produce visualizations, data analysis tools, dashboards and summary reports. Furthermore, we use a customized graphical user interface based on a chatbot that utilizes natural language processing to aid users in creating informative and comprehensive visualizations in R. In this case, we use SARS-CoV-2 data collected over two years of the pandemic at two northeastern locations in Ontario, Canada. This approach allows correlation analysis of various factors (flow, precipitation, hospitalizations and active cases) with the N1 and N2 genetic markers used as a proxy for SARS-CoV-2 concentrations in the wastewater. We also use the Public Health Environmental Surveillance Open Data Model (PHES-ODM, or ODM) in Ontario's Wastewater Surveillance Initiative (WSI) spreadsheets to formalize data entry and formats. This poster shows the queries and results produced on the chatbox platform. We observed a few errors and occasional formatting problems when copying the code to the R environment. However, we could correct them with minimal knowledge of R grammar, sometimes just by fixing formatting mistakes. ChatGPT can help users use R and produce quick statistical analyses, visualizations, and dashboards. In addition, it helps to summarize reports under the user's control. It can also help new R users with knowledge of epidemiology and statistics to harness the language rapidly and efficiently and increase their practical knowledge. The same holds for users with an intermediate to advanced understanding of R. In this way, ChatGPT will rapidly become a helper for users to achieve results that are ready to share with public health units and other healthcare providers, thus potentially accelerating data-based decision making.
This approach can also be applied to future databases for other pathogens of interest, and the code can be re-used in the form of functions and eventually in new R packages that target these particular epidemiological data. We demonstrate the potential of the ChatGPT chatbox to use the R language in wastewater monitoring of SARS-CoV-2. The same can be done with other data frames from future wastewater-based epidemiology projects. In future work, we will test different computer languages and the production of APIs and apps. This approach leads to faster data analysis and potentially accelerates the sharing of results with other stakeholders. This work is part of broader efforts to explore this platform's use in bioinformatics and biocomputing.
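
The correlation analysis mentioned above, relating wastewater marker levels to factors such as active cases or flow, can be illustrated with a short sketch. The abstract describes generating R code; Python is used here only for illustration. The function name `pearson` and all numbers are made up for the example and are not data from the study.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("series must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return covariance / (spread_x * spread_y)


# Hypothetical weekly values (illustrative only, not study data).
n1_marker = [1.0, 2.0, 3.0, 4.0, 5.0]       # N1 marker signal
active_cases = [2.0, 4.0, 6.0, 8.0, 10.0]   # rises with the marker
flow = [5.0, 4.0, 3.0, 2.0, 1.0]            # falls as the marker rises

r_cases = pearson(n1_marker, active_cases)
r_flow = pearson(n1_marker, flow)
```

With these toy series the marker is perfectly positively correlated with active cases and perfectly negatively correlated with flow. Real surveillance data would be noisier and may call for lagged or rank-based correlations.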

81: Construction of an Ultraviolet Light-Response Gene Signature to Predict the Prognosis of Uveal Melanoma
Track: General Session
  • Alejandro Mejia Garcia, Universidad de Antioquia, Colombia
  • Diego Bonilla, Research Division, Dynamical Business & Science Society – DBSS International S.A.S, Colombia
  • Claudia Ramírez, Health and Sport Sciences Research Group, School of Health and Sport Sciences, Fundación Universitaria del Área Andina, Colombia
  • Johana Gonzales, Fundación Universitaria del Área Andina, Colombia
  • Diego Forero, Fundación Universitaria del Área Andina, Colombia
  • Luis Castro Vega, Paris Brain Institute, France
  • Richard B Kreider, Texas A&M University, United States
  • Carlos Orozco, Fundación Universitaria del Área Andina, Colombia


Presentation Overview:

Background and methods: Uveal melanoma (UM) is one of the most common eye cancers in adults. It is a highly aggressive cancer with a poor prognosis and limited therapeutic options. The development of effective treatments for UM relies heavily on understanding the molecular mechanisms underlying its progression and metastasis. To this end, we propose a novel gene signature based on the UV-response-related genes of UM. This gene signature could be used to identify patients with poor prognosis and to develop novel therapies targeting some of these genes. We selected a panel of 158 UV-light response genes from the Hallmark_UV_Response_UP gene set in the Gene Set Enrichment Analysis (GSEA) database. The signature was then constructed using the TCGA uveal melanoma (TCGA-UVM) dataset, which includes transcriptomic and survival data for 80 patients. We implemented a ridge regression using the “glmnet” R package, which assigned a weight to each gene so as to keep the genes most relevant to overall survival (11 genes were selected). Risk scores were then calculated from the ridge regression coefficients and the expression of the UV-light response genes using the following formula: risk score = sum(expression of each gene * ridge coefficient for that gene). Patients were divided into high-risk and low-risk groups using the median risk score. We conducted a survival analysis for the high-risk and low-risk groups defined by this prognostic model using the UCSC Xena browser (http://xena.ucsc.edu/). We performed GO enrichment analysis on the DEGs between high-risk and low-risk patients. We estimated immune and stromal infiltration using the xCell R package and then compared the high-risk and low-risk groups. Finally, we used the “pRRophetic” R package to predict the IC50 of drugs in high-risk and low-risk patients.
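
The risk-score formula and median split described above can be sketched as follows. This is a minimal illustration and not the authors' code: the gene names, coefficients and expression values are hypothetical placeholders, and assigning a patient whose score equals the median to the low-risk group is an assumption, since the abstract does not say how ties are handled.

```python
from statistics import median


def risk_scores(expression, coefficients):
    """Risk score per patient: sum of (gene expression * ridge coefficient).

    expression:   dict of patient -> dict of gene -> expression value
    coefficients: dict of gene -> ridge regression coefficient
    """
    return {
        patient: sum(coefficients[gene] * values[gene] for gene in coefficients)
        for patient, values in expression.items()
    }


def split_by_median(scores):
    """Label patients 'high' if their score is above the median, else 'low'."""
    cutoff = median(scores.values())
    # Assumption: a score exactly equal to the median counts as low risk.
    return {p: ("high" if s > cutoff else "low") for p, s in scores.items()}


# Hypothetical coefficients and expression values (placeholders only).
coefficients = {"GENE_A": 0.5, "GENE_B": -0.25}
expression = {
    "P1": {"GENE_A": 2.0, "GENE_B": 4.0},
    "P2": {"GENE_A": 6.0, "GENE_B": 0.0},
    "P3": {"GENE_A": 0.0, "GENE_B": 8.0},
    "P4": {"GENE_A": 4.0, "GENE_B": 4.0},
}

scores = risk_scores(expression, coefficients)
groups = split_by_median(scores)
```

In this toy example a positive coefficient raises the score as expression rises and a negative one lowers it, so P2 and P4 fall in the high-risk group and P1 and P3 in the low-risk group.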

Results: Univariate Cox regression results revealed that upregulation of CXCL2, IL6, TCHH, PDAP1, and CHRNA5 and downregulation of CDK2, RXRB, CCND3, POLR2H, and WIZ were associated with worse survival outcomes. A survival analysis was conducted to evaluate the prognostic model with these genes (11 genes). The Kaplan-Meier (KM) survival curves for the low-risk and high-risk groups showed that patients in the high-risk group experienced a more significant decrease in survival rate (p-value=0.0003). We validated the prognostic model with dataset GSE22138 (63 patients), using metastasis-free survival as the outcome. We estimated risk scores for each patient based on expression and ridge coefficients. Results showed that low-risk patients live longer without metastasis (HR: 0.37, CI 0.18 – 0.74, p-value=0.003). We found that cytokine signaling pathways, cellular response to cytokine stimulus, and regulation of immune response biological processes are enriched in high-risk patients. High-risk patients also had significantly higher levels of stromal cell infiltration, such as endothelial cells, chondrocytes and melanocytes. Antitumor drug sensitivity analysis showed significantly lower IC50 for Gefitinib (a), JNK inhibitor (b), Lapatinib (c), and Temsirolimus (d) in the high-risk group.

Conclusions: The UV-light gene signature we constructed is able to predict survival in the TCGA and validation cohorts. Moreover, immune microenvironment composition and response differ between high-risk and low-risk patients. High-risk patients could benefit from treatment with JNK inhibitor, Lapatinib, or Temsirolimus due to their expression profile.

83: Tracking the Elusive Path of Cancer Stem Cells: A Study on Their Trajectory and Implications in Cancer Evolution
Track: General Session
  • Zixuan Lan, Ontario Institute for Cancer Research, Canada
  • Philip Awadalla, Ontario Institute for Cancer Research, Canada


Presentation Overview:

There are a number of similarities between stem cells and some cancer cells, including heterogeneity and the ability to proliferate and differentiate. Cancer stem cells (CSCs) disrupt signalling pathways that regulate normal stem cell renewal. The presence of CSCs in tumour populations explains tumour recurrence, tumour dormancy, metastasis, and drug resistance. Cancer cells possess a hierarchy determined by differentiation capacity. CSCs reside at the apex, which comprises the most stem-like cells, and play a role in initiating prodigious tumours, driving tumour heterogeneity, and facilitating tumour growth. Both genetic and epigenetic changes contribute to the plasticity of the hierarchy, such that differentiated cells can dedifferentiate to replace lost stem cells. Single-cell RNA sequencing (scRNA-seq) is a powerful tool for identifying heterogeneous populations, while single-cell epigenomics can add another layer of heterogeneity. Current studies of CSCs have focused on developing mathematical models for aligning cells to developmental maps and identifying potential therapeutic targets. However, these studies were limited to a specific cancer type and to scRNA-seq datasets, while the dedifferentiation associated with cancer progression suggests that models incorporating multiple tissues would be highly beneficial for mapping cell types and determining both the clonality and stemness of biopsies as cancers evolve. Thus, there is an urgent need to build a developmental trajectory reference for cancer across tissues and omics layers. Given the potential of integrating transcriptomic and epigenomic datasets, we propose to construct a map deciphering the progression of cancer cells between states. Using unsupervised and supervised models, we aim to generate a model that stratifies tissues and an indicator of cancer stemness.
Our proposed research carries implications for the progression and proportion of tumours, with potential applications in clinical settings such as prognosis and personalized treatment strategies.

85: An Interactive Visualization Interface to Query Gene Expression Time Series Datasets
Track: General Session
  • Hoang Le Tran, Grand Valley State University, United States
  • Roshan Shrestha, Grand Valley State University, United States
  • Rahat Ibn Rafiq, Grand Valley State University, United States
  • Guenter Tusch, Grand Valley State University, United States


Presentation Overview: Show

Graph-based prediction and natural language processing have been used in bioinformatics for several years, for example to model protein-protein interactions or to understand textual meaning. At the same time, a wealth of gene expression data has accumulated in public repositories like NCBI Gene Expression Omnibus (GEO) or EBI's ArrayExpress. Since annotation of these datasets originally followed the MIAME and MINSEQE guidelines and more recently the FAIR Guiding Principles for scientific data management and stewardship, it is much easier to find datasets on the Internet. However, a researcher who has conducted a time series experiment, for example to study and model dynamic biological processes, would like to find datasets that exhibit metabolic pathway dynamics similar to those of the experimental data under investigation. A GEO query does not provide sufficient detail or annotation to be specific enough. PubMed abstracts, however, are well annotated, for example by MeSH terms. MeSH is the National Library of Medicine's controlled vocabulary, or subject heading list, used by human indexers to annotate the subject content of journal articles. In addition, in recent years NCBI has significantly improved the annotation of PubMed abstracts with PubTator Central, a Web-based system that automatically annotates biological concepts like genes and mutations in both PubMed abstracts and PMC full-text articles.

We present an interactive Web application powered by Python’s Streamlit library and the NCBI APIs that facilitates the search for relevant datasets. Instead of searching for the datasets directly, the application first searches for abstracts and then finds the attached datasets, assuming that most datasets of value to the researcher will have been published. The user first creates a PubMed query that incorporates MeSH (Medical Subject Headings) terms, such as “Time Factors” or “Gene Expression Profiling”, that best describe their publication or dataset. The web application then extracts, digests, and presents the query results through an interactive graph-based visualization, including abstracts, genes, diseases, and other relevant information. While g:Profiler returns all possible metabolic pathways, the system identifies the important pathways by intersection with several genes. The user can then identify similar datasets by interacting with the graph to select all pathways with similar pathway dynamics. We are currently testing the interactive web application, with promising early results.

The system can be used for a variety of applications, for example transcriptome meta-analyses that identify genes, signaling pathways, and biomarkers; GWAS or PheWAS studies with time-course measurements; or the identification of potential pathways for drug repurposing.
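
As an illustration of the PubMed query step described in this abstract, the following is a minimal sketch, not the authors' implementation, of how a MeSH-based query could be assembled for the NCBI E-utilities ESearch endpoint. The function names `build_mesh_query` and `build_esearch_url` are hypothetical, and combining the terms with AND is an assumption; the abstract does not say how terms are combined. The sketch only builds the request URL and does not contact NCBI.

```python
from urllib.parse import urlencode

# Public NCBI E-utilities ESearch endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_mesh_query(mesh_terms):
    """Join MeSH headings into one PubMed query string.

    Assumption: terms are combined with AND; each term is restricted
    to the MeSH field with the [MeSH Terms] tag.
    """
    return " AND ".join(f'"{term}"[MeSH Terms]' for term in mesh_terms)


def build_esearch_url(mesh_terms, retmax=20):
    """Build (but do not send) an ESearch request URL for PubMed."""
    params = {
        "db": "pubmed",
        "term": build_mesh_query(mesh_terms),
        "retmode": "json",
        "retmax": retmax,
    }
    return f"{ESEARCH_URL}?{urlencode(params)}"


# The two example MeSH terms quoted in the abstract.
query = build_mesh_query(["Time Factors", "Gene Expression Profiling"])
url = build_esearch_url(["Time Factors", "Gene Expression Profiling"])
print(query)
print(url)
```

The resulting URL could be fetched with any HTTP client; the returned PubMed identifiers would then be the starting point for retrieving abstracts and their linked datasets.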

87: Cross-species knowledge transfer by jointly modeling genome-wide molecular networks across multiple species
Track: General Session
  • Christopher Mancuso, University of Colorado Anschutz Medical Campus, United States
  • Kayla Johnson, University of Colorado Anschutz Medical Campus, United States
  • Renming Liu, Michigan State University, United States
  • Arjun Krishnan, University of Colorado Anschutz Medical Campus, United States


Presentation Overview: Show

Predicting novel un(der)-characterized genes associated with cellular functions, complex traits, and multifactorial diseases based on an underlying genome-scale gene network is a preeminent challenge in the field of network biology. Coupled with this challenge is the fact that our ability to transfer such gene-level knowledge from one species to another is lacking, often causing failures in clinical trials. We present GenePlexusZoo, a framework for network-based gene classification that leverages the high evolutionary molecular, functional, and phenotypic conservation across species by combining network information from multiple species into a single model. As part of this framework, we developed a novel method to create a concise representation of genome-scale networks from multiple species that can be used to train ML models on-the-fly for any molecular context. Creating this representation entails connecting genes across the networks of different species with directional edges that are weighted by the node degree of the source node. We show that these degree-informed connections lead to a highly aligned embedding space across species, as they encourage the random walks used to generate the low-dimensional embedding space to more readily visit nodes of differing species. Additionally, our method seamlessly handles one-to-many and many-to-many ortholog mappings, allowing a more complete transfer of knowledge across species compared to analyses restricted to one-to-one orthologs. Our extensive evaluations based on study-bias holdout demonstrate that projecting low-dimensional embeddings of multi-species networks always outperforms more naive methods.
Furthermore, we illustrate how GenePlexusZoo can effectively transfer knowledge across species by training classifiers based on human gene annotations for a disease and predicting relevant genes, biological processes, and phenotypes in model species. We additionally highlight the power of this network-based method by providing examples of how GenePlexusZoo can discover disease-related biological processes and phenotypes in model species, even when they have no orthologous overlap with the genes associated with the human disease.
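
The cross-species connection step described in this abstract can be sketched as follows. This is a toy illustration under stated assumptions, not the GenePlexusZoo implementation: the function names `weighted_degree` and `add_cross_species_edges` are hypothetical, the gene identifiers and edge weights are made up, and setting each cross-species edge weight equal to the source node's weighted degree is an assumption, since the abstract only says the edges are "weighted by node degree of the source node".

```python
from collections import defaultdict


def weighted_degree(network):
    """Sum of edge weights per node for an undirected network {(u, v): w}."""
    degree = defaultdict(float)
    for (u, v), w in network.items():
        degree[u] += w
        degree[v] += w
    return dict(degree)


def add_cross_species_edges(networks, orthologs):
    """Build one directed edge dict covering all species.

    networks:  {species_name: {(u, v): weight}} undirected within-species edges
    orthologs: iterable of (gene_a, gene_b) pairs from different species;
               one-to-many and many-to-many mappings are simply repeated pairs.
    Assumption: a cross-species edge a -> b gets weight degree(a).
    """
    directed = {}
    degree = {}
    for net in networks.values():
        degree.update(weighted_degree(net))
        for (u, v), w in net.items():
            # Keep within-species edges symmetric.
            directed[(u, v)] = w
            directed[(v, u)] = w
    for a, b in orthologs:
        if a in degree and b in degree:
            # Directional edges: each direction uses its own source degree.
            directed[(a, b)] = degree[a]
            directed[(b, a)] = degree[b]
    return directed


# Toy data (illustrative only).
human = {("hs:TP53", "hs:MDM2"): 1.0, ("hs:TP53", "hs:ATM"): 0.5}
mouse = {("mm:Trp53", "mm:Mdm2"): 1.0}
orthologs = [("hs:TP53", "mm:Trp53"), ("hs:MDM2", "mm:Mdm2")]

joint = add_cross_species_edges({"human": human, "mouse": mouse}, orthologs)
print(joint[("hs:TP53", "mm:Trp53")], joint[("mm:Trp53", "hs:TP53")])
```

In this sketch the two directions of an ortholog link can carry different weights, which is what makes the edges directional; a random-walk embedding method would then be run on the joint edge set.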