Leading Professional Society for Computational Biology and Bioinformatics
Connecting, Training, Empowering, Worldwide


POSTER PRESENTATIONS



P01
Temporal variation in lymphocyte proteomics

Subject: Inference and Pattern Discovery

Presenting Author: Carolyn Allen, Brigham Young University, United States

Co-Author(s):
Michaela McCown, Brigham Young University, United States
Daniel Machado, Brigham Young University, United States
Hannah Boekweg, Brigham Young University, United States
Yiran Liang, Brigham Young University, United States
Andikan Nwosu, Brigham Young University, United States
Ryan Kelly, Brigham Young University, United States
Samuel Payne, Brigham Young University, United States

Abstract:

Chronic Lymphocytic Leukemia (CLL) is a slow-progressing disease, characterized by a long asymptomatic stage followed by a symptomatic stage during which patients receive treatment. While proteomic studies have discovered differential pathways in CLL, the proteomic evolution of CLL during the asymptomatic stage has not been studied. In this pilot study, we show that by using small sample sizes comprising ~145 cells, we can detect important features of CLL necessary for studying tumor evolution. Our small samples are collected at two time points and reveal large proteomic changes in healthy individuals over time. A meta-analysis of two CLL proteomic papers showed little commonality in differentially expressed proteins, which demonstrates the need for larger control populations sampled over time. To account for proteomic variability between time points and individuals, large control populations sampled at multiple time points are necessary for understanding CLL progression.



P02
Incorporating prior knowledge in breast cancer subtype classifier improves performance on shifted test sets

Subject: Machine Learning

Presenting Author: Paul Anderson, California Polytechnic State University, United States

Co-Author(s):
Richa Gadgil, California Polytechnic State University, United States
Ella Schwab, California Polytechnic State University, United States
William Johnson, California Polytechnic State University, United States
Jean Davidson, California Polytechnic State University, United States

Abstract:

Deep learning neural networks have improved performance in many cancer informatics problems, including breast cancer subtype classification. However, many networks experience underspecification, where multiple combinations of parameters achieve similar performance in both training and validation. Additionally, certain parameter combinations may underperform expectations when the test distribution differs from the training distribution. Embedding prior knowledge from the literature may address this issue by boosting predictive models that provide crucial, in-depth information about a given disease. We draw on past research on breast cancer subtype biomarkers, label propagation, and neural graph machines to present a case study for embedding knowledge into machine learning systems. Our results show that this methodology reduces predictor variability on state-of-the-art deep learning architectures and increases predictor consistency, leading to improved interpretation. We also find that our model produces a higher and more consistent accuracy on shifted test sets.



P03
The Cultural Evolution of Vaccine Hesitancy: Modeling the Interaction between Beliefs and Vaccination Behaviors

Subject: Qualitative Modelling and Simulation

Presenting Author: Kerri-Ann Anderson, Vanderbilt University, United States

Co-Author(s):
Nicole Creanza, Vanderbilt University, United States

Abstract:

Vaccine-preventable diseases (VPDs), such as measles, pertussis, and polio, have resurged in the developed world as a result of decreasing vaccination coverage due to increased vaccine hesitancy. The current COVID-19 pandemic demonstrates the complexities of health behaviors and underscores the relevance of these behaviors to public health. Society, culture, and individual motivations affect health-related decisions, and health perceptions and behaviors can change as cultures evolve. In recent years, mathematical models of disease dynamics have begun to incorporate aspects of human behavior; however, they do not address how cultural beliefs influence these behaviors, or how these behaviors in turn impact cultural beliefs. Using a mathematical modeling framework, we explore the effects of cultural evolution on vaccine hesitancy and vaccination behavior. With this model, we shed light on the facets of cultural evolution that facilitate vaccine hesitancy, ultimately affecting levels of vaccination coverage and VPD outbreak risk. We show vaccine confidence and cultural selection pressures are driving forces of vaccination behavior, leading to a general pattern in which the spread of vaccine confidence leads to high vaccination coverage. We then demonstrate that an assortative preference among vaccine-hesitant individuals can lead to increased vaccine hesitancy and lower vaccination coverage. Further, we show that vaccine mandates can foster vaccine hesitancy despite high vaccination coverage, whereas vaccine scarcity can result in the opposite pattern of high vaccine confidence but low vaccination coverage. We present our model as a generalizable framework for exploring cultural evolution when beliefs influence, but do not strictly dictate, human behaviors.



P04
Rigorous Benchmarking of HLA Callers for RNA Sequencing Data

Subject: Metagenomics

Presenting Author: Ram Ayyala, University of Southern California, United States
Dottie Yu, University of Southern California, Department of Clinical Pharmacy, United States

Co-Author(s):
Sergey Kynavez, Georgia State University, United States
Junghyun Jung, University of Southern California, United States
Serghei Mangul, University of Southern California, United States
Dottie Yu, University of Southern California, United States

Abstract:

Although precise identification of the human leukocyte antigen (HLA) allele is crucial for various clinical and research applications, HLA typing remains challenging due to the high polymorphism of the HLA loci. With Next-Generation Sequencing (NGS) data becoming widely accessible, many computational tools have been developed to predict HLA types from RNA sequencing (RNA-seq) data. However, there is a lack of comprehensive and systematic benchmarking of RNA-seq HLA callers using large-scale and realistic gold standards. To address this limitation, we rigorously compared the performance of 12 HLA callers over 50,000 HLA tasks, including searching 30 pairwise combinations of HLA callers and references in over 1,500 samples. In each case, we measured accuracy as the percentage of correctly predicted alleles (at two- and four-digit resolution) based on six gold standard datasets spanning 650 RNA-seq samples. To determine how read length over the HLA region influences the prediction quality of each tool, we considered read lengths in the range 37-126 bp, which were available in our gold standard datasets. Moreover, using the Genotype-Tissue Expression (GTEx) v8 data, we calculated the concordance of the HLA types predicted across different tissues from the same individual to evaluate how consistently the HLA callers perform across tissues. This study offers crucial information for researchers regarding appropriate choices of methods for an HLA analysis.
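The accuracy metric described above can be illustrated with a minimal sketch. It assumes alleles are written in colon-separated HLA nomenclature (for example A*02:01:01), that two-digit resolution corresponds to the first field and four-digit resolution to the first two fields, and that alleles are compared per sample as multisets. The function names and the sample calls are hypothetical and are not taken from the study.

```python
from collections import Counter

def truncate(allele, fields):
    """Keep the gene name and the first `fields` colon-separated fields."""
    gene, _, rest = allele.partition("*")
    return gene + "*" + ":".join(rest.split(":")[:fields])

def accuracy(predicted, truth, fields):
    """Percentage of gold-standard alleles matched by a prediction.

    Alleles are compared as multisets so that homozygous calls and
    allele order are handled consistently.
    """
    pred = Counter(truncate(a, fields) for a in predicted)
    gold = Counter(truncate(a, fields) for a in truth)
    matched = sum((pred & gold).values())
    return 100.0 * matched / sum(gold.values())

# Hypothetical calls for one sample at the HLA-A and HLA-B loci.
truth = ["A*02:01:01", "A*24:02:01", "B*07:02:01", "B*15:01:01"]
predicted = ["A*02:01", "A*24:03", "B*07:02", "B*15:01"]

two_digit = accuracy(predicted, truth, fields=1)   # compares A*02, A*24, ...
four_digit = accuracy(predicted, truth, fields=2)  # compares A*02:01, A*24:02, ...
```

In this toy case the second HLA-A allele agrees at two-digit resolution but not at four-digit resolution, which is why the two metrics are reported separately.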



P05
Exploring hypotheses of small cell lung cancer growth mechanisms using Bayesian multimodel inference

Subject: Simulation and Numeric Computing

Presenting Author: Samantha Beik, Vanderbilt University, United States

Co-Author(s):
Leonard Harris, University of Arkansas, United States
Vito Quaranta, Vanderbilt University, United States
Carlos Lopez, Vanderbilt University, United States

Abstract:

Small cell lung cancer (SCLC) is a phenotypically heterogeneous disease, comprising multiple cellular subtypes within a tumor that exhibit differential sensitivity to drug treatments. SCLC heterogeneity is hypothesized to be responsible for rapid development of chemotherapy resistance, leading to the dismal 6% five-year survival rate for this disease. Experimental results from several studies suggest that treatment alters tumor composition from an initial makeup of phenotypic subtype(s) to different, less-treatment-sensitive subtypes. We hypothesize that this change arises from phenotypic transitions, rather than outgrowth of subclone(s) selected for by treatment, and that interactions between subtypes are key for tumor survival. We set out to use mathematical modeling to investigate the theoretical basis for SCLC tumor growth, but soon realized that analysis of only one interpretation of SCLC data (one model) would be flawed, and turned to multimodel inference (MMI) to address this issue. We move beyond traditional information theoretic MMI to a fully Bayesian approach, applying MMI to population dynamics models fit to SCLC tumor steady-state data. We extend our findings beyond a ranking of models toward a probabilistic view of subtype behaviors, determining that the existence of reversible phenotypic transitions is highly likely in SCLC. Our results highlight what knowledge is supported by the data and where more experiments are needed, with an aim to modulate tumor composition and decrease treatment resistance. This is sorely needed in SCLC, for which the survival rate has barely improved in decades. With sensible treatment options, the burden of this aggressive disease can be decreased.
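The step from ranking models to a probabilistic statement about a behavior can be sketched as follows. This is a generic illustration of Bayesian model averaging, not the authors' pipeline: it assumes each candidate model has a log marginal likelihood available, that priors over models are equal, and that each model either includes or excludes the feature of interest. The model names and numbers are hypothetical.

```python
import math

def posterior_model_probs(log_evidence, prior=None):
    """Posterior probability of each model from its log marginal likelihood."""
    names = list(log_evidence)
    if prior is None:
        prior = {m: 1.0 / len(names) for m in names}  # equal priors (assumption)
    log_post = {m: log_evidence[m] + math.log(prior[m]) for m in names}
    shift = max(log_post.values())  # subtract the max for numerical stability
    unnorm = {m: math.exp(v - shift) for m, v in log_post.items()}
    total = sum(unnorm.values())
    return {m: u / total for m, u in unnorm.items()}

def feature_probability(post, has_feature):
    """Total posterior mass of the models that include a given feature."""
    return sum(p for m, p in post.items() if has_feature[m])

# Hypothetical log evidences for three candidate models (illustrative only).
log_evidence = {"M1": -100.0, "M2": -100.0, "M3": -100.0 + math.log(2.0)}
# Whether each hypothetical model includes reversible phenotypic transitions.
has_transitions = {"M1": False, "M2": True, "M3": True}

post = posterior_model_probs(log_evidence)
p_transitions = feature_probability(post, has_transitions)
```

Summing posterior mass over models that share a feature is what allows a statement about the feature itself rather than about any single best-ranked model.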



P06
COPD subtypes identified from blood RNA-seq data using single sample gene network perturbations

Subject: Inference and Pattern Discovery

Presenting Author: Panayiotis (Takis) Benos, University of Pittsburgh, United States

Co-Author(s):
Kristina Buschur, Columbia University, United States
Craig Riley, University of Pittsburgh, United States
Aabida Saferali, Brigham Women's Hospital, United States
Peter Castaldi, Brigham Women's Hospital, United States
Grace Zhang, University of Pittsburgh, United States
R. Graham Barr, Columbia University, United States
Frank Sciurba, University of Pittsburgh, United States
Craig Hersh, Brigham Women's Hospital, United States

Abstract:

Chronic Obstructive Pulmonary Disease (COPD) diagnosis is based on spirometric measures. However, COPD is heterogeneous in the rate of progression, response to treatment, and symptom burden. Identifying COPD subtypes from easily accessible tissue is thus very important for disease management.

A network perturbation approach was used to identify gene expression network changes in single samples from COPD patients. The single sample perturbation vectors were used to cluster patients into subtypes. We identified 4 COPD subtypes in a training cohort of 617 former smokers from COPDGene. The four subtypes differ in their symptom severity, clinical characteristics, and mortality. Two of the clusters are considered "mild", but they differ in the use of corticosteroids. Another cluster contains the most severe patients, while the last is "intermediate". These results were validated in a second cohort (n=769), also from COPDGene. Additionally, we identified several significantly deregulated genes across subtypes, including DSP and GSTM1, which have been previously associated with COPD through GWAS. These findings may constitute a significant step towards COPD subtyping. The identified subtypes can be used for new patient stratification and disease prognosis.



P08-A
A Systems Biology Approach for Identifying Escherichia coli Genes and Pathways Involved in Response to Multiple Stressors

Subject: System Integration

Presenting Author: Ramy Aziz, Children's Cancer Hospital, Egypt

Co-Author(s):
Eman Abdelwahed, Faculty of Pharmacy, Cairo University, Egypt
Nahla Hussein, National Research Center, Egypt
Ahmed Moustafa, The American University in Cairo, Egypt
Nayera Moneib, Faculty of Pharmacy, Cairo University, Egypt

Abstract:

Stress response helps bacteria survive extreme environmental conditions and other stresses, e.g., host immunity. It may lead to the development of drug resistance or augmentation of virulence, posing a threat to human health. Although dozens of studies have used reductionist approaches to study specific stress response genes or systems approaches to study all genes involved in specific stress conditions, little is known about genes involved in response to multiple stressors. Here, we used a systems biology approach, analyzing hundreds of transcriptomic data sets to delineate key genes and pathways involved in the stress tolerance of Escherichia coli, as a model for bacterial response to multiple stressors. Specifically, we investigated the E. coli K-12 MG1655 transcriptome under five types of stresses: heat, cold, oxidative stress, nitrosative stress, and antibiotic treatment. We implemented percentile rank normalization to minimize batch effects among separate transcriptomic data sets, and the comparative analysis revealed large overlaps of transcriptional changes between studies of each stress factor and between different stressors. Overall, functions such as energy-requiring metabolic pathways, transport, and motility are typically downregulated to conserve energy, while genes related to survival, bona fide stress response, biofilm formation, and DNA replication and repair are mainly upregulated. Of interest, 15 genes with uncharacterized functions are upregulated in response to multiple stressors, which suggests they may play novel roles in stress response. In conclusion, we identified a set of E. coli stress response genes and pathways, which could be potential targets for overcoming antibiotic tolerance or multidrug resistance.
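As an illustration of why percentile rank normalization helps with batch effects, the sketch below converts each data set to within-data-set percentile ranks. The abstract does not give the exact formula used, so this follows one common definition (rank divided by the number of values, with ties sharing their mean rank). The function name and the expression values are hypothetical.

```python
def percentile_ranks(values):
    """Map each value to its percentile rank (0-100) within one data set.

    Ties receive the mean of the ranks they span.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over a run of tied values.
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = 100.0 * mean_rank / n
        i = j + 1
    return ranks

# Two hypothetical data sets measuring the same four genes on different scales.
study_a = [5.0, 20.0, 10.0, 40.0]
study_b = [500.0, 2100.0, 900.0, 3800.0]

ranks_a = percentile_ranks(study_a)
ranks_b = percentile_ranks(study_b)
```

Because only the ordering within each data set matters, the two studies map to the same ranks despite their very different raw scales, which is the property that makes values comparable across separately generated data sets.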



P08-B
COCTAIL: The COVID-19 Clinical Trial and Intervention Literature Hub

Subject: Text Mining

Presenting Author: Ramy Aziz, Faculty of Pharmacy, Cairo University, Egypt

Co-Author(s):
Mahmoud Faheem, Faculty of Pharmacy, Cairo University, Egypt
Shaimaa Hegazy, Children's Cancer Hospital Egypt 57357, Egypt

Abstract:

The sudden emergence of the COVID-19 pandemic, its formidable impact, and the subsequent lockdowns directly and indirectly led to a surge in scientific publications, many of them about COVID-19. Because of the global crisis, laboratory closures, and resource prioritization that limited research practice to essential topics, specialists and non-specialists became engaged in COVID-19 research, generating the current unprecedented research output about almost every aspect of the outbreak. In fewer than 18 months, the National Center for Biotechnology Information (NCBI) PubMed database indexed 100,000 literature records with “COVID-19” as a keyword, and this number exceeded 180,000 in October 2021. The immediate challenge of this literature deluge is the inability of interested researchers to keep up with the flow of information. A more important challenge is the difficulty of assessing the reliability of results, weighing the scientific evidence behind different interventions, or distinguishing rigorous from precarious studies. To address this problem, we launched COCTAIL: The COVID-19 Clinical Trial and Intervention Literature Hub (URL: https://covid.host-pathogen.net/coctail.html), which shortlists and summarizes all published clinical studies and meta-analyses related to COVID-19 diagnosis, prevention, or intervention measures. COCTAIL is based on a core manually curated evidence table, and each included study is evaluated according to different criteria, including number of subjects, type of design, study duration, and data consistency. Five releases have been launched since June 2020, and a sixth release is scheduled in October 2021, with 1,200 analyzed publications. Ongoing improvements include a natural language processing engine to support curators’ efforts.



P09
Characterizing Important Algorithmic Features of Single Cell Proteomic Data

Subject: Inference and Pattern Discovery

Presenting Author: Hannah Boekweg, Brigham Young University, United States

Co-Author(s):
Samuel Payne, Brigham Young University, United States
Daisha vanderwatt, Brigham Young University, United States

Abstract:

The goal of proteomics is to identify and quantify the complete set of proteins in a biological sample. Single cell proteomics specializes in identification and quantitation of proteins for individual cells, often used to elucidate cellular heterogeneity. Because the field is still emerging, there are no software tools specifically designed for single cell proteomics. All peptide identification algorithms have been developed on spectra from bulk samples with millions of cells and the associated ion-rich spectra. The significant reduction in ions introduced into the mass spectrometer for single cell samples could alter the features of MS2 fragmentation spectra and might be challenging for current algorithms. The accuracy and coverage of single cell proteomics could be greatly improved by optimizing and developing algorithms specifically for single cell data. We analyze thousands of spectra from both single cell data and bulk data and find important differences between the two data types. Specifically, we examine three fundamental spectral features that are likely to affect peptide identification performance. All features show significant changes in single cell spectra, including loss of annotated fragment ions, blurring of signal and background peaks due to diminishing ion intensity, and a distinct fragmentation pattern compared to bulk spectra. As each of these features is a foundational part of peptide identification algorithms, it is critical to adjust algorithms to compensate for these losses.



P10
NetPropR: an unsupervised network propagation method

Subject: System Integration

Presenting Author: Samuel Boyd, University of Kansas Medical Center, United States


Abstract:

The integration of high-throughput ‘omics data and molecular networks has been shown to be an effective way to discover novel disease-related genes and gene modules. However, many of these network propagation methods are supervised or semi-supervised, relying on arbitrary thresholds to filter out nodes.

We present a novel, unsupervised network propagation method that combines gene expression data with a protein-protein interaction network in order to find biologically relevant gene subnetworks. This is an iterative process, which uses the Random Walk with Restart algorithm to obtain node scores that are then used to find a maximally scoring subnetwork. This subnetwork is the input for the next iteration. At each iteration, the subnetwork is given a score S_i, which is the product of the mean weight derived from the experimental data (e.g., fold change) and the mean clustering coefficient. This is intended to represent desirable characteristics of a subnetwork, namely that it has many clusters of genes with large fold change (or any other gene-wise summary of the experimental data). If S_{i+1} < S_i, the process stops and the result from the ith iteration is the final subnetwork.

We demonstrate this method on a GLUT4 knockout/over-expression experiment in mice, which aimed to determine the effect of GLUT4 on insulin sensitivity in adipose tissue. We arrived at a subnetwork (126 nodes, 620 edges) that was enriched for many biologically relevant pathways pertaining to cellular metabolism. Notably, this unsupervised method can potentially be applied to many types of ‘omics data and molecular networks.
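The iterative loop described above can be sketched in a few lines. This is a simplified illustration and not the NetPropR implementation: the abstract does not say how the maximally scoring subnetwork is extracted, so the sketch substitutes a plain stand-in (keep nodes whose propagated score is at least the mean score). The restart probability, the function names, and the toy network are all assumptions.

```python
def rwr(adj, seeds, restart=0.5, tol=1e-10, max_iter=1000):
    """Random Walk with Restart on an undirected graph (node -> set of neighbours)."""
    total = sum(seeds.values())
    p0 = {v: seeds[v] / total for v in adj}
    p = dict(p0)
    for _ in range(max_iter):
        nxt = {}
        for v in adj:
            # Probability flowing into v; each neighbour u splits its mass over its edges.
            inflow = sum(p[u] / len(adj[u]) for u in adj[v])
            nxt[v] = (1 - restart) * inflow + restart * p0[v]
        delta = sum(abs(nxt[v] - p[v]) for v in adj)
        p = nxt
        if delta < tol:
            break
    return p

def clustering(adj, v):
    """Local clustering coefficient: fraction of neighbour pairs that are linked."""
    nbrs = list(adj[v])
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for i in range(k) for j in range(i + 1, k) if nbrs[j] in adj[nbrs[i]])
    return 2.0 * links / (k * (k - 1))

def subnetwork_score(adj, weights):
    """S = (mean node weight) * (mean clustering coefficient)."""
    n = len(adj)
    mean_w = sum(weights[v] for v in adj) / n
    mean_cc = sum(clustering(adj, v) for v in adj) / n
    return mean_w * mean_cc

def propagate(adj, weights):
    """Shrink the network until the next subnetwork would score lower."""
    current, score = adj, subnetwork_score(adj, weights)
    while True:
        p = rwr(current, {v: weights[v] for v in current})
        cutoff = sum(p.values()) / len(p)
        # Stand-in for the maximally scoring subnetwork search (assumption).
        keep = {v for v in current if p[v] >= cutoff}
        if not keep or len(keep) == len(current):
            return current, score
        candidate = {v: current[v] & keep for v in keep}
        cand_score = subnetwork_score(candidate, weights)
        if cand_score < score:  # S_{i+1} < S_i: stop and keep iteration i
            return current, score
        current, score = candidate, cand_score

# Toy network: a triangle of strongly changed genes plus a weakly changed tail.
adj = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c", "e"}, "e": {"d"}}
weights = {"a": 3.0, "b": 3.0, "c": 3.0, "d": 0.1, "e": 0.1}  # e.g. |log fold change|
final, final_score = propagate(adj, weights)
```

On this toy input the loop drops the weakly changed tail and settles on the triangle, whose score is higher because both the mean weight and the mean clustering coefficient increase.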



P11
Adventures in Integrating Biobank Scale Data with Genomic, Phenotypic and Imaging Data from Other Sources

Subject: Data Management Methods and Systems

Presenting Author: Ben Busby, DNAnexus, United States


Abstract:

We are creating a system to extend the cohort level analysis of a biobank to additional data types in order to enable mechanistic understanding. While all public data should be FAIR, privacy concerns prevent full access to all metadata fields for much of the data in question. Therefore, we have harmonized the metadata on select fields of several hundred thousand pilot datasets, allowing researchers to see all of the tangential datasets available for their cohorts in what we are calling a “Boolean Knowledge Graph”. This “try before you buy” approach allows researchers to understand the available data landscape surrounding their cohort before investing the paperwork and/or computational work necessary to import said datasets. Once datasets are selected and access is granted, models can be generated that make predictions about the granular facets of these datasets, and the results of these models can be appended to the original data set. In one specific example, we have crossed phenotypes from the UK Biobank with refine.bio data to see all RNAseq data available for any particular disease cohort. If a salient hypothesis were made from a given set of RNAseq data, we can import a variety of other biomedical data using OMOP and other standards, leveraging the same phased approach. While the tools we have developed are useful, it is our hope that the insights we have gained in exploring these use cases are more widely applicable to the integration of biomedical data.



P12
Neoantigen immunogenicity is influenced by parent protein subcellular location

Subject: Inference and Pattern Discovery

Presenting Author: Andrea Castro, University of California San Diego, United States

Co-Author(s):
Hannah Carter, University of California San Diego, United States

Abstract:

Antigen presentation via the major histocompatibility complex (MHC) is essential for anti-tumor immunity; however, the rules that determine which tumor-derived peptides will be immunogenic are still incompletely understood. Current approaches to predict neopeptide immunogenicity (i.e., which tumor mutations will result in neopeptides that are both displayed and recognized by T cells as foreign) largely focus on peptide-MHC affinity, with non-standardized, and sometimes controversial, incorporation of other peptide characteristics such as peptide-MHC stability, agretopicity, foreignness, hydrophobicity, mutation position within the neopeptide, and neopeptide RNA abundance. Here we investigate whether constraints on the accessibility of peptides to the MHC, driven by protein subcellular location, are associated with the potential for peptide immunogenicity. Analyzing thousands of peptides from studies of MHC presentation and peptide immunogenicity, we find clear spatial biases in both eluted and immunogenic peptides. We find that including parent protein location improves prediction of peptide immunogenicity in multiple datasets. In human immunotherapy cohorts, location was associated with response to a neoantigen vaccine, and immune checkpoint blockade responders generally had a higher burden of neopeptides from accessible locations. We conclude that protein subcellular location adds important information for optimizing immunotherapies.



P13
RNA-seq data science: landscape for modern RNA-seq tools

Subject: other

Presenting Author: Karishma Chhugani, University of Southern California, United States

Co-Author(s):
Dhrithi Deshpande, University of Southern California, United States
Serghei Mangul, University of Southern California, United States

Abstract:

Despite the tremendous efforts of the bioinformatics community, there is a lack of comprehensive cataloging efforts of RNA-seq computational tools. Most of the existing papers focus only on specific domains or methods, such as read alignment or differential gene expression. Our paper provides a comprehensive overview of over 200 RNA-seq tools published between 2008 and 2020, which have varying purposes and capabilities based on the type of biological analysis. The survey summarizes the evolution of multiple features (e.g., computational expertise required) of these tools across various domains of RNA-seq analysis. Our survey suggests that tool developers have slowed the pace of producing new tools since the late 2010s: the average annual growth rate of computational tools developed for RNA-seq analysis was 114.4% from 2008-2014 and 8.97% from 2015-2020. Additionally, we assessed the usability of these tools and observed that tools which are available through package managers had significantly more citations per year compared with tools which are not (p-value = 1.43 × 10^-7). We also plan to survey and verify the consistency of the output format of each tool across every domain, and evaluate how new tools are benchmarked against previously published tools. Along with the manuscript, we created an interactive resource on GitHub (https://github.com/Mangul-Lab-USC/RNA-seq) for the scientific community. We hope this resource will provide systematic cataloging of RNA-seq tools and will help researchers make informed decisions when selecting a tool for a specific type of research question.



P14
DEGAS: Mapping clinical metrics to spatial transcriptomics with deep learning

Subject: Machine Learning

Presenting Author: Justin Couetil, Indiana University School of Medicine, United States

Co-Author(s):
Justin Couetil, Indiana University School of Medicine, United States
Jie Zhang, Indiana University School of Medicine, United States
Kun Huang, Indiana University School of Medicine, United States
Travis Johnson, Indiana University School of Medicine, United States

Abstract:

To search for links between cancer genotype and phenotype, we developed the DEGAS framework to map disease information to spatially resolved tumors.

In the era of precision medicine, spatial transcriptomics (ST) offers a unique opportunity to characterize tumor morphology and transcriptional heterogeneity simultaneously. We train deep transfer learning networks on ST and bulk RNA-seq with disease information (i.e., survival, treatment response, disease status, risk factors) to infer these characteristics spatially on the ST slide. Using the breast cancer data from TCGA, normal tissue from GTEx, and three 10x Genomics ST data sets of breast ductal adenocarcinoma, we identify high-risk regions of tumor tissue that align with 76-84% of the clusters derived from ST data alone. This shows that we can infer clinical attributes while maintaining the transcriptional differences in the ST slide.

Our methodology includes gold-standard preprocessing, feature selection, model training, post-processing, and data visualization tools. This represents a robust framework to use clinical data to identify regions of tumor which may reflect resistance to certain therapies, have certain mutations, or RNA signatures corresponding to lifestyle risk factors like smoking. As spatial transcriptomics become higher resolution and less costly, we hope our framework can be used as a “spotlight” to show researchers which subpopulations and spatial organizations of tumor cells may contribute to a patient’s clinical trajectory.

We plan to develop multimodal DEGAS models, allowing researchers to use this framework to link clinical phenotype to genomic (e.g. circulating tumor DNA), histologic, transcriptional, and proteomic data.



P15
Research Assistant Professor

Subject: Machine Learning

Presenting Author: Yulin Dai, University of Texas Health Science Center at Houston, United States

Co-Author(s):
Junke Wang, MD Anderson Cancer Center UTHealth Graduate School of Biomedical Sciences, United States
Hyun-Hwan Jeong, University of Texas Health Science Center at Houston, United States
Wenhao Chen, Houston Methodist Research Institute and Institute for Academic Medicine, United States
Peilin Jia, University of Texas Health Science Center at Houston, United States
Zhongming Zhao, University of Texas Health Science Center at Houston, United States

Abstract:

The coronavirus disease 2019 (COVID-19) is an infectious disease that mainly affects the host respiratory system, with ~80% asymptomatic or mild cases and ~5% severe cases. Recent genome-wide association studies (GWAS) have identified several genetic loci associated with severe COVID-19 symptoms, such as the 3p21.31 locus (SLC6A20, LZTFL1, CCR9, FYCO1, CXCR6, and XCR1).

We implemented integrative approaches, including transcriptome-wide association studies (TWAS), colocalization, and functional element prediction analyses, to interpret the genetic risks in lung, whole blood, and immune cells using two independent GWAS datasets from Host Genetics Initiative round 4 A2 and Severe COVID-19 GWAS Group. To understand the context-specific molecular alteration, we further performed deep learning-based single cell transcriptomic analyses on a bronchoalveolar lavage fluid (BALF) dataset from moderate and severe COVID-19 patients.

In TWAS, colocalization, and functional analysis, we discovered that CXCR6 has a protective effect in lung and a risk effect in whole blood. In lung resident memory CD8+ T (TRM) cells, we found a 2.24-fold decrease in cell proportion and lower expression of CXCR6 (fold change = 0.56, two-sided Wilcoxon p = 1.8 × 10^-18) in severe patients compared with moderate patients. Pro-inflammatory transcriptional programs, apoptosis, and hypoxia pathways were highlighted in the transition of TRM cells from the moderate to the severe group. We illustrated one potential mechanism by which host genetic variants or other unknown risks might impact the severity of COVID-19: altering the expression of CXCR6 and the proportion and stability of lung TRM cells, thereby impairing the first-line defense in lung.



P16
BuDDI: Bulk Deconvolution with Domain Invariance

Subject: Machine Learning

Presenting Author: Natalie Davidson, CU Anschutz, United States

Co-Author(s):
Casey Greene, CU Anschutz, United States

Abstract:

Single-cell experiments provide greater resolution within a single sample; however, they currently lack coverage across diseases, tissues, and perturbations. To bridge this gap, we propose BuDDI (BUlk Deconvolution with Domain Invariance), which estimates cell-type proportions in bulk samples using single-cell expression estimates. BuDDI 1) models structured and unstructured noise; 2) is a generative model that can assist in downstream counterfactual analysis; 3) is semi-supervised, enabling additional training on external bulk samples; 4) does not require matched single-cell and bulk profiles. Features 1-3 are novel to BuDDI.

Reconciling the variability between single-cell and bulk experiments is a major methodological challenge. BuDDI addresses this by disentangling experimental and biological noise from a target signal (cell-type proportion) by learning three independent latent spaces within a single variational autoencoder: 1) target signal, 2) structured noise, 3) remaining variation.

To train BuDDI and compare against existing methods (BayesPrism and CIBERSORTx), we generated pseudo-bulks with ground truth cell-type proportions. We used three noise settings: unseen pseudo-bulks generated from 1) the same single-cell profiles used in training, 2) a different biological replicate, 3) a different biological replicate and sequencing platform. In quantifying cell-type proportions, BuDDI performed better than CIBERSORTx and equal to or better than BayesPrism in all settings. In addition, since BuDDI can learn disentangled latent representations, we validated that the latent representation of cell-type proportions was disentangled from the latent representations of structured and unstructured noise. In future work, we will independently perturb each latent sub-representation to apply counterfactual reasoning to biomarker prediction tasks.
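Generating a pseudo-bulk with known cell-type proportions can be sketched as follows. This is a generic illustration, not BuDDI's actual procedure: it assumes a pseudo-bulk is the sum of expression vectors from cells sampled with replacement according to chosen proportions. The function name, cell types, and expression values are hypothetical.

```python
import random

def make_pseudobulk(cells_by_type, proportions, n_cells, rng):
    """Sum the expression of cells sampled to match the given proportions.

    cells_by_type: dict mapping cell type -> list of expression vectors
    proportions:   dict mapping cell type -> fraction of the pseudo-bulk
    Returns the summed expression vector and the realized proportions.
    """
    n_genes = len(next(iter(cells_by_type.values()))[0])
    bulk = [0.0] * n_genes
    counts = {}
    for cell_type, frac in proportions.items():
        counts[cell_type] = round(frac * n_cells)
        # Sample with replacement so small reference pools still work.
        for cell in rng.choices(cells_by_type[cell_type], k=counts[cell_type]):
            for g, value in enumerate(cell):
                bulk[g] += value
    total = sum(counts.values())
    truth = {t: c / total for t, c in counts.items()}
    return bulk, truth

# Hypothetical single-cell reference: two cell types, two genes.
cells_by_type = {
    "T": [[10.0, 0.0], [12.0, 0.0]],
    "B": [[0.0, 5.0], [0.0, 7.0]],
}
rng = random.Random(0)  # fixed seed for reproducibility
bulk, truth = make_pseudobulk(cells_by_type, {"T": 0.75, "B": 0.25}, 100, rng)
```

The returned proportions serve as the ground truth against which a deconvolution method's estimates can be scored.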



P17
Robust Distribution of Cancer Data

Subject: Data Management Methods and Systems

Presenting Author: Corbin Day, Brigham Young University, United States

Co-Author(s):
Samuel Payne, Brigham Young University, United States
Caleb Lindgren, Brigham Young University, United States
Robert Oldroyd, Brigham Young University, United States

Abstract:

Cancer biology is producing enough data to take a seat at the data science and machine learning table. Cancer cohorts typically contain tumors from >100 patients, and tumors are frequently characterized with genomics, proteomics, imaging, and clinical data. This wealth of data is ripe for machine learning analyses to advance diagnosis and treatment. However, for this abundance of data to be useful and readily available for analysis by these tools, it must be organized in a manner consistent with the expectations of the computational data science community. The CPTAC Python package serves proteogenomic data for several cancers and is still under active development by nearly 20 contributors with over 1,700 commits. The package has extensive community use, with 70k hits/downloads. Robustness is crucial for any software that attempts to provide useful data organization for a large community, but ensuring this for a large and growing package can be difficult. Here we show how unit testing, regression testing, and continuous integration bulletproofed our software. Metaprogramming reduced test-code duplication while allowing the test suite to grow with the package. We found that expanding package functionality became safer for the package and easier for individual developers with easy-to-use testing and integration tools. By creating a robust testing suite for the CPTAC package, we created a dependable product for external researchers, including data scientists and other computational researchers. With a robustly tested solution, our product can withstand a high volume of usage and greatly expand the reach of cancer data.
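As an illustration of how metaprogramming can cut test-code duplication, the sketch below generates one test method per dataset and table from a single template. The loader functions, the `REGISTRY` table, and the table names are hypothetical stand-ins chosen for this example; they are not the CPTAC package's actual API, which the abstract does not show.

```python
import unittest


# Hypothetical stand-ins for per-cancer dataset loaders (assumed for
# illustration only; the real package may be organized differently).
def load_brca():
    return {"proteomics": [[1.0, 2.0]], "clinical": [["P1", 54]]}


def load_luad():
    return {"proteomics": [[0.5, 1.5]], "clinical": [["P2", 61]]}


REGISTRY = {"brca": load_brca, "luad": load_luad}
REQUIRED_TABLES = ("proteomics", "clinical")


def _make_test(loader, table):
    """Build one test method that checks a single table of a single dataset."""
    def test(self):
        data = loader()
        self.assertIn(table, data)
        self.assertGreater(len(data[table]), 0)
    return test


class DatasetTests(unittest.TestCase):
    """Test methods are attached below, one per (dataset, table) pair."""


# Registering a new loader automatically adds its tests; no test code is copied.
for _name, _loader in REGISTRY.items():
    for _table in REQUIRED_TABLES:
        setattr(DatasetTests, f"test_{_name}_{_table}", _make_test(_loader, _table))


def run_suite():
    """Run the generated tests and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(DatasetTests)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Calling `run_suite()` executes the four generated tests. In a continuous integration setting the same class would typically be picked up by the test runner on every commit.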



P18
Ecology Data Browser: An Introductory Data Tool For Ecology Students

Subject: Remote Applications

Presenting Author: Louisa Dayton, Brigham Young University, United States

Co-Author(s):
Samuel Payne, Brigham Young University, United States

Abstract:

Data literacy is an important skill for ecology students, but it is frequently underdeveloped. Ecology students and educators need a tool that can help students learn how to understand, analyze, and visualize real-world ecological data. We solve this problem with the Ecology Data Browser, an R Shiny application tailored to the needs of ecology students. This tool gives students easy access to R functionality in working with ecological data, which would ordinarily require programming experience. Specifically, the Ecology Data Browser allows students to tidy data, calculate species diversity indices, graph and map data, and run statistical tests on their data or a preloaded dataset. Additionally, we conducted a survey among students to evaluate the serviceability, benefits, and weaknesses of the Ecology Data Browser. The responses indicate an increase in time spent on data interpretation instead of data formatting and chart creation and an expansion in the quantity of data analyzed. Students also find greater ease of use in the Shiny application than in Excel or Google Sheets. Through this tool, students experience a strong reduction in anxiety while working with data. Most importantly, this application provides students without a computational background exposure to developmental tools in computational ecology. In the future, we hope to expand this application to encompass diverse types of analysis compatible with student-generated data.
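To make "species diversity indices" concrete, here is a minimal sketch of two widely used ones, the Shannon index and the Gini-Simpson index. The abstract does not say which indices the Ecology Data Browser computes, and the tool itself is written in R; Python is used here only for illustration, and the function names are assumptions.

```python
import math


def shannon_index(counts):
    """Shannon diversity H' = -sum(p_i * ln(p_i)) over species with nonzero counts."""
    counts = list(counts)
    total = sum(counts)
    if total <= 0:
        raise ValueError("at least one individual is required")
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def simpson_index(counts):
    """Gini-Simpson diversity 1 - sum(p_i^2): chance two random individuals differ."""
    counts = list(counts)
    total = sum(counts)
    if total <= 0:
        raise ValueError("at least one individual is required")
    return 1.0 - sum((c / total) ** 2 for c in counts)


# Made-up survey plot: species name -> number of individuals observed.
plot_counts = {"oak": 10, "maple": 10, "pine": 10, "birch": 10}
shannon = shannon_index(plot_counts.values())
simpson = simpson_index(plot_counts.values())
```

For a perfectly even community of four species, the Shannon index equals ln(4) and the Gini-Simpson index equals 0.75; both fall as one species comes to dominate.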



P19
Co-expression Network and Change Point Detection Using Combined Gene Expression and miRNA Profiles to Identify Functionally Important Circulating miRNA Markers Overlap in Humans and Mice

Subject: other

Presenting Author: Lianghao Ding, University of Texas Southwestern Medical Center, United States

Co-Author(s):
Michael Weil, Colorado State University, United States
Christina Fallgren, Colorado State University, United States
Michael Story, University of Texas Southwestern Medical Center, United States

Abstract:

Mouse models are essential for studying human diseases, including cancer. This is especially true for biomarker discovery at very early stages of cancer, when human experimental samples are difficult to obtain. Circulating miRNAs are promising biomarkers because they are stable in biofluids and can be collected with minimally invasive approaches. The question remains whether the data obtained from mice are applicable to humans. We hypothesize that functionally important miRNAs are conserved across species and that their applications in human diseases can be studied in animal models.
Using a weighted gene co-expression network analysis (WGCNA) approach, we combined circulating miRNA profiles with whole-genome expression data to classify co-expression modules and subsequently identify “hub” miRNAs that were strongly associated with the expression of target genes in a radiation-induced mouse hepatocellular carcinoma (HCC) model. Our analysis identified 16 miRNA-containing network modules and 71 “hub” miRNAs within these networks that were associated with HCC.
A previous study identified 10 human circulating miRNAs that were significantly elevated in the plasma of HCC patients. We compared the HCC-associated hub miRNAs from our mouse model with the human data and found that all 10 human miRNA markers were included in the 71 hub miRNAs in mice. We further validated these miRNAs in a time-series collection of plasma samples in the HCC model from another mouse strain and confirmed that they were increased in tumor-bearing mice. Using change-point detection models, we identified 4 miRNAs that were elevated 1 year before tumor diagnosis, suggesting the potential to identify miRNA markers for early diagnosis of cancer.
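The abstract does not specify which change-point model was used. As a rough illustration of the idea, the sketch below finds a single shift in the mean of a time series by least squares: it tries every split and keeps the one with the smallest within-segment squared error. The function name, the `min_segment` parameter, and the example values are all assumptions made for this example.

```python
def best_change_point(values, min_segment=2):
    """Return (index, cost) for the split minimizing within-segment squared error.

    Index k means values[:k] is the 'before' segment and values[k:] is 'after'.
    """
    def sse(segment):
        mean = sum(segment) / len(segment)
        return sum((x - mean) ** 2 for x in segment)

    n = len(values)
    if n < 2 * min_segment:
        raise ValueError("series too short for the requested segment length")
    candidates = range(min_segment, n - min_segment + 1)
    scored = ((k, sse(values[:k]) + sse(values[k:])) for k in candidates)
    return min(scored, key=lambda pair: pair[1])


# Made-up plasma miRNA levels at consecutive sampling times (not study data).
levels = [1.0, 1.1, 0.9, 1.0, 3.0, 3.2, 2.9, 3.1]
change_index, cost = best_change_point(levels)
```

Here the detected change falls at index 4, the first elevated sample. Real analyses would usually also test whether the shift is statistically significant and allow for more than one change point.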



P20
Age-dependent machine learning model improves head shape characterization and enables longitudinal evaluation of patients with craniosynostosis

Subject: Machine Learning

Presenting Author: Connor Elkhill, University of Colorado Anschutz Medical Campus, United States

Co-Author(s):
Marius Linguraru, Children's National Hospital, United States
Jiawei Liu, University of Colorado Anschutz Medical Campus, United States
Scott LeBeau, University of Colorado Anschutz Medical Campus, United States
Brooke French, Children's Hospital Colorado, United States
Aaron Mason, Children's Hospital Colorado, United States
Antonio Porras, University of Colorado Anschutz Medical Campus, United States

Abstract:

PURPOSE: Current methods to quantify head shape abnormalities in patients with craniosynostosis using three-dimensional (3D) images do not account for the normal head shape changes that occur during childhood. We present an age-dependent head shape model and a new method to quantify head abnormalities, achieving increased robustness in the detection of craniosynostosis.
METHODS: We created an age-specific model using segmentations of the external head surface in retrospective CT images of 631 subjects without cranial pathology (age 35.04 ± 35.05 months). We performed principal component analysis and used temporal regression of the projections to establish age-specific normative cranial shape ranges. We computed the statistical distance of 286 CT images of patients with craniosynostosis (age 10.73 ± 18.31 months) to the normative model and trained a machine learning classifier to estimate the probability of patients having abnormal head shapes associated with craniosynostosis. Finally, we evaluated our model on the 3D photographs of 11 patients (age 3.83 ± 0.80 months) before and after endoscopic surgery.
RESULTS: Cross-validation provided an accuracy of 92.37% in detecting craniosynostosis in our 917 subjects, which was significantly higher than that of an age-agnostic model trained on the same data (accuracy 90.51%, p < 0.01). On the independent 3D photographs, the classifier probability of an abnormal head shape decreased from 92.23% ± 7.2% before surgery to 62.60% ± 22.07% at 8.11 ± 0.71 months after surgery, indicating improvement.
CONCLUSIONS: Our age-dependent statistical head shape model and machine learning methods provide a significant improvement in head shape assessment compared to age-agnostic models and were successfully validated using an independent 3D photogrammetry dataset.



P21
The Systems Biology Simulation Core Library

Subject: Simulation and Numeric Computing

Presenting Author: Andreas Dräger, University of Tübingen, Germany

Co-Author(s):
Hemil Panchiwala, Indian Institute of Technology, India
Shalin Shah, Duke University, United States
Hannes Planatscher, Signatope GmbH, Germany
Mykola Zakharchuk, University of Tübingen, Germany
Matthias König, Humboldt University of Berlin, Germany

Abstract:

Summary:
Studying biological systems generally relies on computational modeling and simulation, e.g., model-driven discovery and hypothesis testing. Progress in standardization efforts led to the development of interrelated file formats to exchange and reuse models in systems biology, such as SBML, the Simulation Experiment Description Markup Language (SED-ML), or the Open Modeling EXchange format (OMEX). Conducting simulation experiments based on these formats requires efficient and reusable implementations to make them accessible to the broader scientific community and ensure the reproducibility of the results. The Systems Biology Simulation Core Library (SBSCL) provides interpreters and solvers for these standards as a versatile open-source API in Java™. The library simulates even complex biomodels and supports deterministic Ordinary Differential Equations (ODEs); Stochastic Differential Equations (SDEs); constraint-based analyses; recent SBML and SED-ML versions; exchange of results, and visualization of in silico experiments; open modeling exchange formats (COMBINE archives); hierarchically structured models; and compatibility with standard testing systems, including the Systems Biology Test Suite and published models from the BioModels and BiGG databases.

Availability:
SBSCL is freely available at https://draeger-lab.github.io/SBSCL/ and via Maven Central.



P22
Mapping DNA methylation trajectories in culture models of senescence

Subject: Inference and Pattern Discovery

Presenting Author: Jamie Endicott, Van Andel Institute, United States

Co-Author(s):
Paula Nolte, Van Andel Institute, United States
Hui Shen, Van Andel Institute, United States
Peter Laird, Van Andel Institute, United States

Abstract:

Methylation clocks estimating chronological or biological age have recently emerged as proxies for evaluating age reversal therapies. However, the biological functions that drive clock CpG behaviors are often poorly understood.
Here, we performed longitudinal DNA methylation profiling of primary human cell lines cultured through replicative senescence and identified various behaviors. We suggest biological drivers for multiple methylation trajectories, via experiments manipulating senescence and enrichment analyses. Additionally, the various methylation trajectories informed improved modeling of a methylation 'clock' that estimates cumulative mitoses.



P23
Characterizing Regulatory Mechanisms of NPEPPS in Bladder Cancer Chemoresistance

Subject: other

Presenting Author: Lily Elizabeth Feldman, University of Colorado Anschutz Medical Campus, United States


Abstract:

The current standard of care for eligible muscle-invasive bladder cancer (MIBC) patients is cisplatin-based neoadjuvant chemotherapy followed by radical cystectomy. Chemotherapy is effective for about 25% of patients, resulting in a 5-year survival of ~80%, while the remaining patients, for whom chemotherapy is ineffective, have a 5-year survival of under 30%. Our goal is to identify mechanisms that explain why neoadjuvant chemotherapy is ineffective in 75% of patients, and to determine if these mechanisms offer a means to improve treatment effectiveness. Recent bioinformatics analyses from our group identified puromycin-sensitive aminopeptidase, NPEPPS, as a novel regulator of cisplatin response. We have further shown that NPEPPS mRNA and protein are upregulated across multiple cisplatin-resistant human MIBC cell lines. Additionally, patients with MIBC and lower NPEPPS expression have better average five-year survival than individuals with higher NPEPPS expression. Based on these findings, here we present the results of investigating upstream regulation of NPEPPS expression. First, we used the ENCODE and JASPAR databases to assess transcription factor binding motifs that may contribute to NPEPPS transcriptional regulation. Next, we show the chromatin signatures associated with NPEPPS and the results of correlating genomic dependencies with NPEPPS expression, including pathway analysis. Finally, we show the results of comparative genomic analysis evaluating conservation of genetic elements across the mammalian lineage. These analyses provide insights into the possible mechanisms by which NPEPPS is upregulated to drive cisplatin resistance. Going forward, we will expand our knowledge of the system by applying this workflow to additional proteins implicated in the resistance pathway.



P24
A Novel Transcriptomics-based Approach to the Identification of Drug Combinations in Personalized Medicine

Subject: Machine Learning

Presenting Author: Lon Wolf Fong, University of Texas MD Anderson Cancer Center, United States


Abstract:

Drug combinations are a potential solution to the problem of drug resistance in cancer therapy; using multiple drugs with different sets of targets decreases the chances of cancer cells developing resistance to a given regimen. With an overwhelming number of possible drug combinations, the problem of identifying the best ones for a given patient is ideal for an in silico approach.

We first developed a workflow to process gene-expression data from individual patients, identify key sets of genes (disease module) that may be driving their disease and represent therapeutic targets, and finally use these disease-module genes to identify dysregulated pathways. Information about dysregulated pathways is then combined with drug-target data and input into either of two different types of algorithms to predict the efficacy of drug combinations for an individual patient: the first is knowledge-based and makes predictions based on rules derived from known principles, and the second uses machine-learning models, which make predictions based on statistical models trained on data sets.

We first validate our workflow and then compare the performance of several different types of knowledge-based and machine-learning models in correctly ranking drug combinations by efficacy for a given cancer cell line. Our results show that the disease-module genes identified by our workflow have the properties expected of disease-related genes; the pathways identified as dysregulated are in agreement with the literature; and machine-learning models outperform knowledge-based models, as evaluated by rank correlation of predicted values with the actual values, and have a relatively high accuracy.
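The evaluation by rank correlation can be sketched as follows. The abstract does not name the coefficient; Spearman's rho (Pearson correlation of the ranks, with ties given averaged ranks) is assumed here as one common choice, and the scores are invented for illustration.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # sorted positions i..j share one rank
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (sd_x * sd_y)


def spearman(predicted, observed):
    """Spearman rank correlation between predicted and observed efficacies."""
    return pearson(average_ranks(predicted), average_ranks(observed))


# Made-up efficacy scores for four drug combinations on one cell line.
predicted = [0.9, 0.7, 0.4, 0.1]
observed = [80.0, 60.0, 30.0, 10.0]
rho = spearman(predicted, observed)
```

A value near 1 means the model orders the combinations the same way the measurements do, regardless of the scale of the predicted values.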



P25
A workflow for variable selection in longitudinal microbiome studies

Subject: Inference and Pattern Discovery

Presenting Author: Jennifer Fouquier, University of Colorado, Anschutz Medical Campus, United States

Co-Author(s):
Catherine Lozupone, University of Colorado Anschutz Medical Campus, United States

Abstract:

Gut microbiome imbalances are linked to diseases including HIV, autism, and obesity. Because the gut microbiome can be modified via diet or other environmental influences, there is hope for disease treatment. The gut microbiome is commonly described by the relative abundance of bacterial species, as assessed using next-generation sequencing of the 16S rRNA gene region in DNA extracted from fecal material. Longitudinal gut microbiome studies are increasingly common and hold promise for understanding how microbiome changes relate to factors they may influence, such as metabolites, immune factors, and symptoms of disease. However, longitudinal studies involve analytical challenges when trying to integrate multivariate data. Furthermore, longitudinal studies can be observational (i.e., measurements made over time, related to natural fluctuations in symptoms) or involve interventions, such as diet modifications. In an observational study of autistic individuals, we identified relationships between intraindividual changes in microbial communities over time and changes in behavior using mixed effects models. In an interventional study, we have been exploring agrarian diet effects on inflammatory and metabolic phenotypes in HIV+ individuals using Random Forests. We are developing a workflow for multivariate longitudinal microbiome analysis using machine learning methods to address challenges such as feature reduction for high-dimensional data, compositionality, and diverse data (clinical, omic, numeric, categorical, etc.). After integration, changes in variables are calculated for each individual between timepoints. Next, Mixed Effects Random Forest regression identifies variables that explain changes in a response. Finally, post-hoc testing is performed. This bioinformatics workflow will streamline longitudinal microbiome analysis.
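The step of calculating changes for each individual between timepoints might look like the sketch below. It assumes changes are taken between consecutive timepoints, which the abstract does not state, and the record layout (`subject`, `timepoint`) and variable names are hypothetical.

```python
from collections import defaultdict


def per_individual_changes(records, variables):
    """Compute within-subject differences between consecutive timepoints.

    records: dicts holding 'subject', 'timepoint', and numeric variables.
    Returns one dict per (subject, consecutive timepoint pair).
    """
    by_subject = defaultdict(list)
    for row in records:
        by_subject[row["subject"]].append(row)

    changes = []
    for subject, rows in by_subject.items():
        ordered = sorted(rows, key=lambda r: r["timepoint"])
        for earlier, later in zip(ordered, ordered[1:]):
            delta = {
                "subject": subject,
                "from": earlier["timepoint"],
                "to": later["timepoint"],
            }
            for var in variables:
                delta[f"delta_{var}"] = later[var] - earlier[var]
            changes.append(delta)
    return changes


# Made-up measurements; 'bacteroides' is a relative abundance, 'score' a symptom score.
records = [
    {"subject": "A", "timepoint": 0, "bacteroides": 1.0, "score": 10},
    {"subject": "A", "timepoint": 1, "bacteroides": 1.5, "score": 8},
    {"subject": "B", "timepoint": 0, "bacteroides": 2.0, "score": 5},
    {"subject": "A", "timepoint": 2, "bacteroides": 1.0, "score": 9},
]
deltas = per_individual_changes(records, ["bacteroides", "score"])
```

Subject B contributes no rows because it has only one timepoint. The resulting delta table is the kind of input a downstream regression on changes in a response would take.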



P26
Doctor PEPper: Hemolysis Aware Antiviral Peptide Generation with Machine Learning

Subject: Machine Learning

Presenting Author: Andrew Gao, Canyon Crest Academy, United States


Abstract:

Antiviral peptides (AVPs) are short sequences of amino acids with antiviral effects. Interest has arisen in AVPs due to their lower toxicity and higher specificity than conventional drugs. Furthermore, due to broad-spectrum mechanisms of action, evidence indicates that AVPs induce less resistance than conventional antiviral drugs. The rise of viruses like SARS-CoV-2 and Ebola makes clear the importance of antiviral therapeutics. One bottleneck to the development of AVP treatments is the limited number of known AVPs (~3,000). Other obstacles are low efficacy at low concentrations and undesired hemolytic activity. To address these challenges, a type of recurrent neural network, the Gated Recurrent Unit (GRU), was trained on existing short AVPs (5-30 amino acids) and used to generate thousands of novel AVPs. These generated AVPs were then evaluated with four machine learning models to assess antiviral potential and to predict IC50 bioactivity values and the likelihood of unwanted hemolytic effects. The fittest AVPs based on these models were used to retrain the GRU, from which the second generation of novel AVPs was created. Finally, the most promising second-generation AVPs were analyzed. 82.6% of second-generation peptides were non-hemolytic (compared to 70.7% of the original peptides and 70.2% of first-generation peptides). 36.0% had predicted antiviral activity. In the field of AVPs, this study is the first to apply GRUs and integrate a hemolysis classifier. Future steps include testing the most promising potential AVPs in vitro. The results have implications for combatting devastating viruses, such as SARS-CoV-2 and HIV.



P27
Identifying Transcriptional Patterns Specific to Down Syndrome Using Generative Neural Networks

Subject: Inference and Pattern Discovery

Presenting Author: Lucas Gillenwater, University of Colorado Anschutz Medical Campus, United States

Co-Author(s):
Alexandra Lee, University of Pennsylvania, United States
Casey Greene, University of Colorado Anschutz Medical Campus, United States
Matthew Galbraith, University of Colorado Anschutz Medical Campus, United States
Joaquin Espinosa, University of Colorado Anschutz Medical Campus, United States
James Costello, University of Colorado Anschutz Medical Campus, United States

Abstract:

Down syndrome (DS), the result of a triplication of chromosome 21 (T21), is the most common chromosomal disorder in humans, occurring at a frequency of 1 in 700. Individuals with DS and their families face significant health challenges from co-occurring conditions that are poorly understood in the context of T21. Understanding the molecular underpinnings leading to the spectrum of co-occurring conditions is complicated by the upregulation of ~225 genes located on chromosome 21 in combination with the (epi-)genetic context of the individual. Disentangling the molecular mechanisms for any specific condition requires a novel approach that takes into account the genetic complexity of T21. Here, we present an analysis of T21 transcriptome data using the “Specific cOntext Pattern Highlighting in Expression data” (SOPHIE) method, which uses deep learning to distinguish between common and specific transcriptional patterns. A variational autoencoder (VAE) was trained on uniformly normalized RNA-seq data from a diverse set of 49,651 human samples. Gene expression profiles from 8 individuals (4 T21; 4 euploid) were encoded into the latent space of the trained VAE and shifted to create simulated experimental profiles. These simulated experiments contain the same magnitude of change between the T21 and euploid individuals in different biological contexts. Experimental results identified T21-specific differential expression in early estrogen response, MYC targeting, and interferon pathways, supporting the interferonopathy hypothesis of DS. Future analyses will identify T21-specific alterations in co-occurring conditions through translation of VAE-derived latent space representations of single conditions and combinations of conditions.



P28
Characterizing mRNA/Protein Relationships in Cancer

Subject: Inference and Pattern Discovery

Presenting Author: Jose Giraldez Chavez, Brigham Young University, United States

Co-Author(s):
Samuel Payne, Brigham Young University, United States
Nathaniel Barton, Brigham Young University, United States
Benjamin Kimball, Brigham Young University, United States

Abstract:

Proteogenomic analyses of cancer cells measure protein and mRNA abundance, which are used to infer biological activities in the tumor. The dysfunctional phenotypes of tumors are most often described by comparing these abundances between tumor and normal tissue. Differentially expressed protein/mRNA molecules are asserted to be driving cancer behavior. Differential expression as a metric ignores the relationship between protein and mRNA abundance, a regulatory relationship that might be altered in cancer. Here we describe the creation of a new metric to help find changes in the relationship between mRNA and protein, indicative of changes in the cellular regulatory mechanisms governing proteostasis. We calculated the change in a gene’s mRNA/protein relationship by observing a change in mRNA/protein abundance correlation between normal and cancer tissues. This new metric, Δ Correlation, measures the difference in protein/mRNA correlations, showing a change in the regulatory mechanisms of the cell as well as the direction of that change. We used data from the CPTAC consortium and analyzed 5 different tissues using our metric. Our results show that this new way of measuring tumors can find genes that are affected in cancer but do not show up in differential expression or somatic mutation analysis.
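One way to read the Δ Correlation metric is as the per-gene difference between the mRNA/protein correlation in tumor samples and the same correlation in normal samples. The sketch below assumes Pearson correlation and a tumor-minus-normal sign convention; the abstract states neither, and the abundance values are invented.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (sd_x * sd_y)


def delta_correlation(mrna_tumor, protein_tumor, mrna_normal, protein_normal):
    """Assumed convention: r(tumor) - r(normal) for one gene across samples."""
    return pearson(mrna_tumor, protein_tumor) - pearson(mrna_normal, protein_normal)


# Made-up abundances for one gene across four samples per tissue type.
mrna_normal = [1.0, 2.0, 3.0, 4.0]
protein_normal = [2.0, 4.0, 6.0, 8.0]   # protein tracks mRNA in normal tissue
mrna_tumor = [1.0, 2.0, 3.0, 4.0]
protein_tumor = [8.0, 6.0, 4.0, 2.0]    # relationship inverted in tumor tissue
delta = delta_correlation(mrna_tumor, protein_tumor, mrna_normal, protein_normal)
```

In this toy case the correlation flips from +1 to -1, so the metric is -2 even though the mean mRNA and protein abundances are identical in both tissues, which is the situation a differential expression test would miss.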



P29
Incorporating Metabolomic and Metagenomic Data to Improve Protein Function Annotations

Subject: Metagenomics

Presenting Author: Lucia Guatney, University of Colorado, United States

Co-Author(s):
Mikayla Borton, Pacific Northwest National Laboratory, United States
Rebecca Daly, Colorado State University, United States
Kelly Wrighton, Colorado State University, United States
Christopher Miller, University of Colorado, United States

Abstract:

Shotgun metagenomic data has the potential to inform numerous applications (human gut health, sustainable agriculture, etc.) but faces the bottleneck of inferring gene functions. Specifically, the high number of previously unstudied organisms from many environments makes laboratory validation of protein function impractical for metagenomic data, generating a need for other means of functional inference. Most gene annotations are assigned based on homology to genes of known function, though this method has limitations: shared evolutionary ancestry does not always translate to shared function, previous computational inferences have populated databases/ontologies with unvalidated annotations, and databases/ontologies overrepresent knowledge from certain well-studied phyla. One avenue for non-homology-based annotation of metagenomic data is the use of paired metabolomic data. This data may be used to find relationships between metabolite occurrence and genes to infer or corroborate protein function predictions. A potential advantage of using metabolite data is that it may relate more directly to phenotype than sequence similarity does. Here, we are working to integrate metabolomic data into probabilistic models to inform protein function annotations. We compare the performance of standard metabolite-naïve homology-based approaches to that of probabilistic models using gene/metabolite co-occurrence data from linear methods or neural networks. As a test dataset, we are using the Genome Resolved Open Watersheds (GROW) database, which consists of multi-omic sampling of microbial communities from over 250 sampling sites at rivers worldwide, including paired samples with FT-ICR MS metabolomic data and metatranscriptomic data. Improved protein function annotation of this dataset should aid modeling of microbial metabolism in rivers.



P30
Whole genome assembly and analysis of clinical mycobacterial species using Oxford Nanopore Technologies and Illumina sequencing

Subject: other

Presenting Author: Jo Hendrix, University of Colorado, United States

Co-Author(s):
Elaine Epperson, National Jewish Health, United States
Nabeeh Hasan, National Jewish Health, United States
Michael Strong, National Jewish Health, United States

Abstract:

Cystic fibrosis (CF) is an autosomal recessive disease in which the cells in the lungs are less able to clear mucus from the airways, causing a buildup that provides an ideal environment for opportunistic bacterial species such as nontuberculous mycobacteria (NTM). NTM are opportunistic pathogens with innate resistance to many antimicrobials, so treatment often involves multiple-drug cocktails administered over years. Sequencing efforts to produce complete genome assemblies of clinically relevant NTM species are important for tracking disease spread and for better understanding the genomic capacity of NTM isolates. When selecting a sequencing platform, there is a tradeoff between read length and accuracy. Illumina Next Generation Sequencing is highly accurate, but reads tend to be 250-300 bp in length. Oxford Nanopore Technologies (ONT) sequencing is less accurate overall but generates reads that average tens of thousands of bases. Here, we used a hybrid approach to capitalize on the accuracy of Illumina sequencing and the read length of ONT to produce highly connected and accurate genome assemblies of seven clinical NTM isolates from patients with CF. Four assemblies were closed with no remaining gaps. We identified the isolates as M. abscessus (n=3), M. avium (n=2), M. chimaera (n=1), and M. intracellulare (n=1) and placed each within its phylogenetic context. Further, we showed that, compared to the type strains, patient-specific genome assemblies better detected genomic changes across longitudinal samples. The research detailed here demonstrates how complete genome assemblies can further our understanding of difficult-to-treat and diverse mycobacterial infections.



P31
The systematic assessment for the completeness of metadata information accompanying omics studies

Subject: System Integration

Presenting Author: Yu-Ning Huang, University of Southern California, United States

Co-Author(s):
Serghei Mangul, University of Southern California, United States
Anushka Rajesh, University of Southern California, United States
Jieting Hu, University of Southern California, United States
Ruiwei Guo, University of Southern California, United States
Man Yee Wong, University of Southern California, United States
Jiaqi Fu, University of Southern California, United States
Elizabeth Ling, University of Southern California, United States
Irina Nakashidze, Batumi Shota Rustaveli State University, Georgia
Steven Beringer, University of Southern California, United States
Aditya Sarkar, Indian Institute of Technology Mandi, India

Abstract:

Genomic data is easily accessible and available, owing to the ubiquity of public genomic repositories that allow researchers to share their study datasets. However, improperly annotated and incomplete metadata accompanying the raw data make it almost impossible for researchers to reuse the data directly from public repositories for secondary analysis and might slow the progress of biomedical discovery. Our study aims to assess the completeness of metadata accompanying omics studies in both publications and their related online repositories and to make observations about how the process of data sharing could be made reliable. The study involved an initial literature survey to find studies in seven therapeutic fields: sepsis, tuberculosis, cystic fibrosis, cardiovascular disease, acute myeloid leukemia, inflammatory bowel disease, and Alzheimer’s disease. We used computational tools (Python scripts) to extract metadata from the public repositories, manually checked the availability of metadata in both publications and repositories, and then visualized the results of the analysis. By comparing metadata availability on both platforms, original publications and online repositories, we observed discrepancies between omics data and the corresponding metadata on public repositories. Based on our results, we advocate a standardized "checklist" for researchers to follow when submitting their study results and data to public repositories. Our study opens a comprehensive discussion about this potential solution to bridge the gap between omics data and metadata on repositories.
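A completeness assessment of this kind could be computed along the lines of the sketch below, which reports, for each checklist field, the fraction of samples with a usable value. The field list, the set of placeholder strings treated as missing, and the sample records are hypothetical; the authors' actual scripts and checklist are not shown in the abstract.

```python
# Hypothetical checklist of fields a sample record is expected to carry.
REQUIRED_FIELDS = ("organism", "tissue", "age", "sex", "disease_status")

# Placeholder strings assumed to mean "no information provided".
MISSING_TOKENS = {"", "na", "n/a", "none", "unknown", "not available"}


def is_present(value):
    """True when a metadata value carries real information."""
    return value is not None and str(value).strip().lower() not in MISSING_TOKENS


def completeness(samples, fields=REQUIRED_FIELDS):
    """Return {field: fraction of samples with a usable value for that field}."""
    if not samples:
        raise ValueError("no samples to assess")
    return {
        field: sum(is_present(s.get(field)) for s in samples) / len(samples)
        for field in fields
    }


# Made-up sample records, as they might be extracted from a repository.
samples = [
    {"organism": "Homo sapiens", "tissue": "blood", "age": "54",
     "sex": "F", "disease_status": "sepsis"},
    {"organism": "Homo sapiens", "tissue": "N/A", "age": "", "sex": "M"},
]
report = completeness(samples)
```

Running the same function on metadata transcribed from the publication and on metadata pulled from the repository gives two reports whose per-field differences expose the kind of discrepancy the abstract describes.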



P32
Cell-type deconvolution of sequencing-based DNA methylation data: how to leverage the power of read-level methylomes for estimating the composition of complex tissues and tumors

Subject: Machine Learning

Presenting Author: Yunhee Jeong, DKFZ (German Cancer Research Center), Germany

Co-Author(s):
Marlene Ganslmeier, DKFZ (German Cancer Research Center), Germany
Kersten Breuer, DKFZ (German Cancer Research Center), Germany
Reka Toth, DKFZ (German Cancer Research Center), Germany
Christoph Plass, DKFZ (German Cancer Research Center), Germany
Pavlo Lutsik, DKFZ (German Cancer Research Center), Germany

Abstract:

Analysis of cell-type heterogeneity based on cell mixture (bulk) omic profiles is an active area of research, known as cell-type deconvolution. DNA methylation (DNAm) is an accessible and highly cell type-specific epigenetic mark. Sequencing-based DNAm data, in particular, has high capacity for cell-type deconvolution through leveraging read-level information. Although diverse deconvolution methods have been developed for heterogeneity analysis in sequencing-based methylomes, a comprehensive, standardized, and unbiased assessment of these approaches is still lacking.

To bridge this gap, we thoroughly evaluated five previously published methods: Bayesian Epiallele detection (BED), PRISM, csmFinder & coMethy, ClubCpG and MethylPurify, together with two array-based methods, MeDeCom and Houseman’s method, as a comparison group. Sequencing-based deconvolution methods consist of two steps, region selection and cell-type composition estimation; thus, we assessed the performance of each step individually and demonstrated the impact of the former step upon the latter. Finally, we comprehensively assessed the limitations of current approaches in dealing with sequencing data.

The increasing availability of single-cell DNAm sequencing datasets warrants a new paradigm of omics analysis with clear cell type-specific features. To utilize these data for deconvolution, we propose a Hidden Markov Model trained with single-cell methylomes to classify read-level methylation patterns into cell types. Since this model has shown successful performance in read classification of single-cell data, we further designed a statistical pipeline to extend the read classification to cell-type composition estimation in bulk DNA methylomes.



P33
Visual Interactive Model Selection for Determining Predictors of Lung Function

Subject: Qualitative Modelling and Simulation

Presenting Author: Ying Jin, Colorado School of Public Health, CU Anschutz Medical Campus, United States

Co-Author(s):
Fernando Holguin, School of Medicine, CU Anschutz Medical Campus, United States
Ryan Peterson, Colorado School of Public Health, CU Anschutz Medical Campus, United States
Carsten Görg, Colorado School of Public Health, CU Anschutz Medical Campus, United States

Abstract:

Selecting models for predicting clinical outcomes is a crucial task; however, with increasingly large and complex datasets spanning hundreds of dimensions, the process can become complicated and cumbersome. While a common strategy of simply selecting a set of expected confounders ahead of time based on prior knowledge remains popular, issues still often arise when there is uncertainty regarding multi-dimensional relationships or the observed variability of confounding variables. To address these issues, we designed and implemented an interactive tool for guiding the model selection process. It presents correlation in the form of a network visualization, where nodes represent variables of interest and edges indicate the direction of correlation with color hue and the magnitude with saturation and thickness. Inter-correlated variables are clustered close together and significant correlations are indicated with overlays on existing edges. Additional tables are included in separate tabs to present covariate coefficients of determination, variance inflation factors and a pairwise correlation matrix. Users can interactively include or remove variables of interest, and change the thresholds of significance and correlation for inclusion in the network. In our preliminary application to pulmonary function using data from an obese asthmatic cohort, we found that asthma control variables (the asthma quality of life questionnaire, AQLQ, and the asthma control questionnaire, ACQ) are strongly correlated. The pulmonary function outcomes, including the forced vital capacity (FVC) and the forced expiratory volume in 1 second (FEV1), are clustered and negatively correlated with age. Finally, the known biomarkers are somewhat inter-related, with a small but significant correlation.



P35
Crossing complexity of space-filling curves reveals new principles of genome folding

Subject: Qualitative Modelling and Simulation

Presenting Author: Nicholas Kinney, Virginia College of Osteopathic Medicine, United States

Co-Author(s):
Molly Hickman, Virginia Tech, United States
Ramu Anandakrishnan, Virginia College of Osteopathic Medicine, United States
Harold Garner, Virginia College of Osteopathic Medicine, United States

Abstract:

Space-filling curves have been used for decades to study the folding principles of globular proteins, compact polymers, and chromatin. Different types of curves can be distinguished by their folding principles. Random (equilibrium) curves tend to have abundant knots and tangles; on the other hand, crumpled (Hilbert) curves lack knots and tangles. This latter class of curves is thought to be biologically favorable, particularly as models of genome folding in actively dividing cells. Indeed, cell division requires robust segregation of DNA. The present work investigates a new property of space-filling curves: the crossing complexity. Briefly, chain crossings are tallied in the same way that strand breaks and ligations occur during mitosis. Crossing complexity is then compared for equilibrium and Hilbert curves with two main results. First, Hilbert curves limit entanglement between chromosomes. Second, Hilbert curves do not limit entanglement in a rudimentary model of S-phase DNA. Our second result is particularly surprising yet easily rationalized with a geometric argument. The future direction of this work seeks to reconstruct space-filling curves directly from chromosome proximity ligation experiments. A candidate algorithm is discussed. If successful, this will lead to a better understanding of the folding principles that govern the human genome.



P36
Predicting protein level changes from transcript level data

Subject: Inference and Pattern Discovery

Presenting Author: Edward Lau, University of Colorado School of Medicine, United States


Abstract:

Proteins perform the majority of biological functions. It follows that gene signatures from transcriptomics data would have different biological relevance based on how well they predict protein levels. We revisit how well transcript level changes predict protein level changes at gene-wise granularity, using current sequencing and mass spectrometry data sets and comparing several statistical learning approaches. The result adds to emerging evidence for a biological basis of RNA-protein non-correlation that varies by cellular components and pathways. We identified proteins whose levels are nonlinearly related to transcript levels, as well as proteins better predicted by different transcripts than their own gene's. We propose a strategy to analyze and prioritize transcript signatures in RNA sequencing data and apply it to examine striated muscle aging mechanisms.



P38
Alzheimer's Disease: Where Are the Remaining Disease-Altering Variants?

Subject: other

Presenting Author: Emily Little, Brigham Young University, United States

Co-Author(s):
John Kauwe, Brigham Young University, United States
Perry Ridge, Brigham Young University, United States

Abstract:

Alzheimer’s disease (AD) is a neurodegenerative disorder with significant genetic risk factors. Despite the prevalence of Alzheimer’s in our communities, an effective treatment is yet to be developed. Recent studies have determined that about half of the phenotypic variance in AD is explained by genetics, but to date we have only identified genomic variation that accounts for a third of heritable AD. We have two objectives in our current analyses: 1) determine genomic regions that harbor unidentified disease variants, and 2) determine potential differences in disease variants in males and females. We used Genome-wide Complex Trait Analysis to estimate the phenotypic variance explained by genetics within individual 1,000,000 base pair segments. We ran these analyses both for females and for males, as well as a combined analysis. We identified numerous regions with genomic variation that explain significant genetic variance. Using these results, we can better limit the genomic regions we interrogate to identify the remaining variants that influence risk for AD.



P39
Generalized Tensor Canonical Correlation Analysis for Network Inference Using Multi-Omics Data

Subject: Machine Learning

Presenting Author: Weixuan Liu, University of Colorado Anschutz Medical Campus, United States


Abstract:

Integrating and inferring biological networks from multiple omic data sets is becoming more commonplace for studying biological systems and diseases. However, many methods are limited to handling only two omics data sets or focus only on pairwise correlations between omics and/or phenotypes, therefore failing to consider higher-order correlation among the data sets. To address these limitations, we have developed a sparse generalized version of tensor canonical correlation network analysis (SGTCCA-Net) to perform network analysis on multi-omics data by simultaneously extracting all higher-order (tensor) and pairwise relationships between omics data and phenotype(s) of interest. In simulation studies of three omics (e.g., genomics, proteomics, metabolomics) and phenotype data with higher-order as well as lower-order correlation, SGTCCA-Net improves the detection of networks compared with other multi-omics network analysis pipelines. Application to a chronic obstructive pulmonary disease multi-omics data set identifies networks of omics features that are highly associated with lung function and emphysema. The sparse version of this approach not only gives sparse solutions to the pipeline but also significantly improves the computational efficiency of the algorithm. Future improvements will include improving the pruning of inferred networks to remove noisy nodes and edges. In summary, SGTCCA-Net provides a framework for inferring multi-omic networks associated with phenotypes that capture both pairwise and higher-order tensor relationships occurring in the data.



P40
Benchmarking the performance of six variant callers on synthetic and real ctDNA datasets

Subject: Data Management Methods and Systems

Presenting Author: Rugare Maruzani, University of Liverpool, United Kingdom

Co-Author(s):
Anna Auer-Fowler, University of Liverpool, United Kingdom
Liam Brierley, University of Liverpool, United Kingdom
Andrea Jorgensen, University of Liverpool, United Kingdom

Abstract:

Tissue biopsies are routine for informing personalised treatment options for cancer patients. However, tissue biopsies can be invasive and therefore have limited potential in disease monitoring. The utility of circulating tumour DNA (ctDNA) in blood as a non-invasive alternative has recently been investigated. However, the fraction of ctDNA in plasma is typically low compared to healthy cell free DNA, increasing the challenge of detecting true cancer variants. ctDNA mutations can occur at frequencies of <0.1%. Therefore, the feasibility of using ctDNA for personalising treatment depends on the ability of computational tools to call mutations at low frequencies.

We benchmarked the performance of six variant callers (Mutect2, FreeBayes, LoFreq, Octopus, Platypus and bcftools) on ctDNA datasets. We utilised a synthetic dataset to assess sensitivities at seven mutant allele frequencies (MAF). As synthetic datasets may not accurately represent the complexities of real data, we also assessed the performance of callers on two real datasets, utilising cancer mutation databases in the absence of a truth set.

In the synthetic dataset, Mutect2 performed the best, achieving 100% sensitivity at frequencies of ≥ 5%. Ultimately, all callers perform inadequately on synthetic ctDNA. Only Mutect2 called more than 20% of true variants at 5% MAF. On average 40% of variants called in the real dataset were reported in the COSMIC database. The results of this study suggest new bioinformatic tools and NGS practices are required for calling variants in ctDNA datasets. This is essential to fully realise the potential of ctDNA in personalised treatment of cancer.
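As a rough illustration of how per-frequency sensitivity can be tallied against a synthetic truth set, the following minimal sketch groups true variants by their spiked-in MAF and counts how many a caller recovered. It is not the authors' benchmarking code: the function name, the variant key layout (chromosome, position, reference allele, alternate allele) and the toy data are all assumptions made for this example.

```python
from collections import defaultdict


def sensitivity_by_maf(truth, called):
    """Return {maf: fraction of true variants at that MAF that were called}.

    truth  -- dict mapping a variant key to the MAF at which it was spiked in
    called -- set of variant keys reported by a caller
    Both structures are hypothetical, chosen only for illustration.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for variant, maf in truth.items():
        totals[maf] += 1
        if variant in called:
            hits[maf] += 1
    return {maf: hits[maf] / totals[maf] for maf in totals}


# Toy truth set: two variants at 5% MAF, two at 1% MAF (made-up values).
truth = {
    ("chr1", 100, "A", "T"): 0.05,
    ("chr1", 200, "G", "C"): 0.05,
    ("chr2", 300, "C", "T"): 0.01,
    ("chr2", 400, "T", "G"): 0.01,
}
# Toy caller output: finds three true variants plus one false positive.
called = {
    ("chr1", 100, "A", "T"),
    ("chr1", 200, "G", "C"),
    ("chr2", 300, "C", "T"),
    ("chr3", 1, "A", "G"),
}
result = sensitivity_by_maf(truth, called)
```

False positives such as the `chr3` call do not affect sensitivity; measuring them would need a separate precision calculation.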



P41
MRI Mediators: How Causal Mediation Analysis Can Explain the Relationship Between Genomic Data and Survival Outcomes

Subject: Inference and Pattern Discovery

Presenting Author: Emily Mastej, University of Colorado Anschutz, United States


Abstract:

While machine learning models have recently become popular due to their ability to make accurate predictions, there is still a black box nature to the models that prevents users from understanding how features affect the outcome. Causal mediation analysis could help users understand the causal relationship between an exposure and an outcome. Mediation allows for the dissection of an exposure’s total effect into direct and indirect effects, with the indirect effect representing the amount of the total effect that works through a mediator. Although causal mediation analysis has been widely used in psychological research, mediation analysis with MRI radiomic data as the mediator has been underexplored. To gain a deeper understanding of the causal pathway from genetic mutation to survival time, we used a high dimensional mediation analysis tool to identify T2 MRI radiomic features that mediate the relationship between genetic data and survival outcomes. We applied the mediation model to a cohort of TCIA subjects with glioblastoma or a lower grade glioma (N = 203) to find networks of radiomic features that mediate the pathway between IDH mutation and survival time. Mediation analysis can be an important modeling tool to help users understand the causal pathway of how genomic mutations affect the survival outcome through brain phenotypes.
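The decomposition of a total effect into direct and indirect parts can be made concrete with a minimal sketch. It assumes the simplest possible setting: one mediator, linear models fitted by ordinary least squares, and an uncensored continuous outcome. This is far simpler than the high dimensional tool described in the abstract and is not the authors' method. All names and data values are hypothetical.

```python
def _mean(values):
    return sum(values) / len(values)


def _cov(u, v):
    """Population covariance of two equal-length sequences."""
    mu, mv = _mean(u), _mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / len(u)


def mediation_effects(x, m, y):
    """Product-of-coefficients mediation for exposure x, mediator m, outcome y.

    Fits M ~ X (slope a), Y ~ X (total effect) and Y ~ X + M
    (direct effect and mediator slope b) by least squares.
    """
    sxx, smm, sxm = _cov(x, x), _cov(m, m), _cov(x, m)
    sxy, smy = _cov(x, y), _cov(m, y)
    a = sxm / sxx                    # effect of exposure on mediator
    total = sxy / sxx                # total effect of exposure on outcome
    det = sxx * smm - sxm ** 2       # zero if x and m are perfectly collinear
    direct = (sxy * smm - smy * sxm) / det
    b = (smy * sxx - sxy * sxm) / det
    return {"total": total, "direct": direct, "indirect": a * b}


# Made-up toy data: x = mutation status, m = one radiomic feature,
# y = a continuous outcome with no censoring.
x = [0, 0, 0, 0, 1, 1, 1, 1]
m = [1.0, 1.2, 0.9, 1.1, 2.0, 2.3, 1.8, 2.1]
y = [10.0, 11.0, 9.5, 10.5, 14.0, 15.5, 13.0, 14.2]
effects = mediation_effects(x, m, y)
```

In this linear setting the total effect equals the direct effect plus the indirect effect exactly; real survival data with censoring would require a survival model rather than plain least squares.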



P43
Using Image Recognition to Predict the Proteogenomic Status of Breast Cancer

Subject: Machine Learning

Presenting Author: Bryn Mendenhall, Brigham Young University, United States


Abstract:

Standard of care for cancer patients includes a core needle biopsy taken from the tumor. These biopsies are the primary source of information for diagnosis and initial treatment. But previous studies have found that pathologists have a 45.6% accuracy rate in diagnosing tumor grade and subtype with biopsy images alone. Genomic and proteomic characterization of tumors has driven the emergence of molecular subtypes, which can lead to personalized medicine and more effective treatments. Unfortunately, proteogenomics is not available for many patients. Using the proteogenomic dataset from triple negative breast cancer tumors in CPTAC, our goal is to use machine learning to predict molecular subtypes from histology images. This project uses deep convolutional neural networks (DCNN) for feature recognition in biopsy cell images to predict triple negative breast cancer. The results show that the DCNN diagnosed triple negative breast cancer from cell images with an average accuracy of 80%. We anticipate that the model used to diagnose the breast cancer subtypes will eventually be able to diagnose all cancer subtypes. These findings are a stepping stone to more efficient and accurate diagnostics by pathologists, which would give better treatment options for patients and increase overall survival rates.



P44
Multispecies cities in the Anthropocene: bioremediation and biomining potential of the Gowanus Canal Microbiome, an urban Superfund site

Subject: Metagenomics

Presenting Author: Chandrima Bhattacharya, Weill Cornell Medicine, United States

Co-Author(s):
Rupobrata Panja, CSIR-Institute of Minerals and Materials Technolog, India
Ian Quate, ,
Matthew Seibert, ,
Ellen Jorgensen , ,
Christopher Mason, Weill Cornell Medicine, United States
Sergios-Orestis Kolokotronis, SUNY, United States
Elizabeth Henaff, NYU, United States

Abstract:

The environment of the Gowanus Canal in New York City is emblematic of the many post-industrial Superfund sites across the country. Many of these locations were important hubs for manufacturing industries or research and development, and have now been abandoned, leaving a legacy of toxicity and pollutants not only in the canal itself but also in the surrounding areas. We explore microbial bioremediation of hazardous polluted sites as a promising field of study, especially when it is possible to potentially mine the microbes for novel secondary metabolites, including identification of molecules related to microbial multi-drug resistance as well as species harboring extreme adaptability characteristics. We present the largest metagenomic analysis consisting of both longitudinal study and depth-based study of sediment from the Gowanus Canal. We identify extremophiles as well as marine and freshwater sediment species and demonstrate enrichment of bioremediation-related metabolic pathways. These metabolisms include remediation of industrial pollutants of historical significance to the industrialization of the area including heavy metals and organic pollutants. We identify a cluster of genes related to antimicrobial resistance present in the Canal microbiome. Our findings on the Gowanus Canal microcosm usher in the potential of discovery and research on other extreme environments for novel species and secondary metabolites from biosynthetic gene clusters. We can conclude microbes associated with Extreme Environments including those in Superfund Sites can show adaptation to not only remediate and clean up hazardous material but also produce significant secondary metabolites with prospective biological significance to make life better in the Anthropocene.



P45
The Ramp Atlas: Facilitating tissue-specific ramp sequence analyses across humans and SARS-CoV-2

Subject: Web Services

Presenting Author: Justin Miller, University of Kentucky; Brigham Young University, United States

Co-Author(s):
Taylor Meurs, Brigham Young University, United States
Matthew Hodgman, University of Kentucky, United States
Benjamin Song, Brigham Young University, United States
Kyle Miller, Utah Valley University, United States
Mark Ebbert, University of Kentucky, United States
John Kauwe, Brigham Young University, United States
Perry Ridge, Brigham Young University, United States

Abstract:

Ramp sequences are essential genetic regulatory regions that counterintuitively function to slow initial translation, which ultimately maximizes overall translational efficiency by evenly spacing ribosomes and limiting downstream ribosomal collisions. Since widespread tissue-specific differences in relative codon adaptiveness occur, we predicted that the existence of a gene-specific ramp sequence would change between tissues without altering the underlying genetic code and would ultimately result in differential tissue-specific gene expression. Here, we present the first comprehensive analysis of tissue-specific ramp sequences, and report 3,108 genes with ramp sequences that change between tissues. The Ramp Atlas (https://ramps.byu.edu/) is an accompanying web portal that shows that the presence of a ramp sequence significantly correlates with higher gene expression in The Functional Annotation of Mammalian Genomes (FANTOM5) dataset (odds=1.1152; p-value=3.00x10^-99), The Genotype-Tissue Expression (GTEx) Project dataset (odds=1.1578; p-value=9.48x10^-155), The Human Protein Atlas dataset (odds=1.1947; p-value=1.27x10^-306), and a consensus dataset (odds=1.1477; p-value=1.00x10^-254). We also identified seven SARS-CoV-2 genes and seven human SARS-CoV-2 entry factor genes with tissue-specific ramp sequences that are present more frequently in tissues that the virus is known to infect (p-value=0.009918), which may help explain viral proliferation within those tissues. The Ramp Atlas facilitates wider adoption and application of ramp sequences through interactive graphics and an online programmatic interface. We propose that future ramp sequence calculations should consider ramp sequence variability that may occur within an organism based on tissue-specific codon optimality, and variable ramp sequences might be an additional mechanism for regulating tissue and cell type-specific differential gene expression that warrants further exploration.



P46
Automated and systematic verification and validation increases quality and long-term reuse of models

Subject: Qualitative Modelling and Simulation

Presenting Author: Natasa Miskov-Zivanov, University of Pittsburgh, United States


Abstract:

Although modeling is an important component of a research pipeline in biology, most often there is no systematic or standardized approach for quality assessment and annotation of models, reducing their trustworthiness and reuse potential. Moreover, most of the model design and documenting steps are still done manually. Creating useful and reliable models of cellular signaling requires thorough and careful information extraction, knowledge assembly, comprehensive model verification and validation, which can take months, sometimes even years. The verification step, assessing whether the model structure is correct by finding support for all its elements and interactions, and the validation step, evaluation of model behavior against experimental observations and data, usually occur iteratively with model expansion before the model can be used to make predictions or explanations. The objective of our work is to develop an architecture that will allow researchers to automatically verify, assess the quality, annotate and expand their models, utilizing available literature and model databases. We have developed several methods and tools to automatically verify models using the information from literature and databases, and to test for contradictions between new knowledge and existing models. Our tools are able to process large amounts of information from literature and compare with models within seconds, a task that would take days to complete manually, and would likely be prone to errors. Outcomes of this work will contribute to increasing the long-term reuse of models and aid computational and systems biology researchers in assembling or selecting models with trusted quality.



P47
PROFILING GENE FUSIONS OF INFLAMMATORY BREAST CANCER USING A NEW BIOINFORMATIC APPROACH HIGHLIGHTS KEY DIFFERENCES BETWEEN MOLECULAR SUBTYPES

Subject: Inference and Pattern Discovery

Presenting Author: Carlos Morales, University of Puerto Rico Río Piedras, Puerto Rico

Co-Author(s):
Esther Peterson, University of Puerto Rico Rio Piedras, Puerto Rico
Josue Perez, University of Puerto Rico, Puerto Rico

Abstract:

Inflammatory breast cancer (IBC) is the rarest and most lethal subtype of breast cancer with a 5-year survival rate below 50%. Its prognosis is challenging due to the lack of definitive molecular markers, requiring new approaches for early onset identification. Genomic translocations have been shown to be beneficial for cancer progression and malignancy, disrupting gene regulation and even joining exons from spatially unrelated genes. Sequencing data from cancer subtypes can be used to identify translocations by using powerful bioinformatic approaches. We developed FAV_CTS, which leverages STAR, STAR-Fusion, FusionInspector, and Trinity to identify, visualize and assemble fusion transcripts from RNA-seq data. Using the IBC cell line SUM149PT as a model, we identified and validated experimentally a model fusion gene ZDHHC5—EPB41L5. Sequence assemblies cross-validated in silico the presence of this fusion and its multiple splice sites, one of which was predicted to encode a fusion peptide. Differential gene expression between SUM149PT and SUM190PT, an IBC cell line without evidence for the tested translocation, showed loss of expression of ZDHHC5 but no change in EPB41L5, suggesting the latter retained its expression in its new genomic region. Successful identification and assembly of fusion transcripts from data for the oral cancer cell lines OC3 and OECM-1 showcased the usefulness of FAV_CTS across a diversity of cancer subtypes. Although further studies must be conducted to determine the function of this novel fusion peptide and its role in IBC progression, identifying and validating translocations in IBC are necessary to understand this complex disease.



P48
Scalable estimation of microbial co-occurrence networks with Variational Autoencoders

Subject: Machine Learning

Presenting Author: James Morton, Simons Foundation, United States

Co-Author(s):
Justin Silverman, Pennsylvania State University, United States
Gleb Tikhonov, University of Helsinki, Finland
Harri Lähdesmäki, University of Aalto, Finland
Richard Bonneau, Simons Foundation, United States

Abstract:

Estimating microbe-microbe interactions is critical for understanding the ecological laws governing microbial communities. Rapidly decreasing sequencing costs have promised new opportunities to estimate microbe-microbe interactions across thousands of uncultured, unknown microbes. However, typical microbiome datasets are very high dimensional, and accurate estimation of microbial correlations requires tens of thousands of samples, exceeding the computational capabilities of existing methodologies. Furthermore, the vast majority of microbiome studies collect compositional metagenomics data, which enforces a negative bias when computing microbe-microbe correlations. The Multinomial Logistic Normal (MLN) distribution has been shown to be effective at inferring microbe-microbe correlations; however, scalable Bayesian inference of these distributions has remained elusive. Here, we show that carefully constructed Variational Autoencoders (VAEs) augmented with the Isometric Log-ratio (ILR) transform can estimate low-rank MLN distributions thousands of times faster than existing methods. These VAEs can be trained on tens of thousands of samples, enabling co-occurrence inference across tens of thousands of microbes without regularization. The latent embedding distances computed from these VAEs are competitive with existing beta-diversity methods across a variety of mouse and human microbiome classification and regression tasks, with notable improvements on longitudinal studies.
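For readers unfamiliar with the ILR transform mentioned above, the sketch below shows one standard way to compute it, using a Helmert-style orthonormal basis. It covers only the transform, not the VAE or the MLN inference, and the basis choice is an assumption for this example; the authors may use a different (for instance tree-based) basis. Inputs must be strictly positive, so zero counts would need a pseudocount first.

```python
import math


def clr(x):
    """Centered log-ratio: log of each part minus the mean log."""
    logs = [math.log(v) for v in x]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]


def ilr(x):
    """Isometric log-ratio coordinates of a strictly positive composition.

    Coordinate i compares the geometric mean of the first i parts
    with part i + 1, scaled so that the basis is orthonormal.
    A composition with D parts yields D - 1 coordinates.
    """
    logs = [math.log(v) for v in x]
    coords = []
    for i in range(1, len(x)):
        log_geometric_mean = sum(logs[:i]) / i
        coords.append(math.sqrt(i / (i + 1)) * (log_geometric_mean - logs[i]))
    return coords


def norm(v):
    return math.sqrt(sum(c * c for c in v))


composition = [0.2, 0.3, 0.5]      # made-up relative abundances
coords = ilr(composition)          # two unconstrained real coordinates
```

Two properties make the transform useful for compositional data: it is scale invariant, so raw counts and proportions give the same coordinates, and it preserves distances, so the norm of the ILR coordinates equals the norm of the CLR vector.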



P49
Mitochondrial transcriptome and genetic risk for schizophrenia

Subject: other

Presenting Author: Sreya Mukherjee, Lieber Institute for Brain Development, United States

Co-Author(s):
Claudia Calabrese, University of Cambridge, United Kingdom
Gianluca Ursini, Lieber Institute for Brain Development, United States

Abstract:

Mitochondria, the cellular powerhouses, contain copies of mitochondrial DNA (mtDNA), which follows maternal inheritance. Mitochondrial dysfunctions have been associated with schizophrenia, with mtDNA proposed as an underlying etiologic genetic factor. This study aimed to extract mitochondrial genome data from RNA sequencing (RNAseq) data from post-mortem dorsolateral prefrontal cortex (DLPFC) brain samples of controls (N=225) and patients with schizophrenia (N=143), to evaluate the relationship of mitochondrial genomics with schizophrenia risk. We first quantified mitochondrial gene expression and performed a differential expression analysis of mitochondrial genes in controls and patients with schizophrenia. We then analyzed the relationship of genetic risk for schizophrenia with mitochondrial gene expression, with analyses corrected for sex, age, genetic principal components, and RNA quality. We further developed a pipeline for the detection of mitochondrial genetic variants, the quantification of mitochondrial heteroplasmies, and the prediction of mitochondrial haplogroups, using MtoolBox. Several mitochondrial genes were upregulated or downregulated in schizophrenia, while genetic risk for schizophrenia was also associated with gene expression. Genes of the electron transport chain (MT-CYB, MT-ATP6, MT-ATP8) were consistently associated with schizophrenia case-control status (t=3.657, 3.215, 2.473; p=0.000295, 0.00143, 0.01381) and genomic risk (t=4.206, 3.151, 2.573; p=3.35E-5, 0.00178, 0.0105, respectively). The number of heteroplasmies was not significantly different between patients and controls; however, a strong association between genetic risk for schizophrenia and the number of pathogenic mitochondrial heteroplasmies suggested that susceptibility to schizophrenia may be linked to mitochondrial biology. These results indicate that RNAseq data can provide insight into the role of mitochondrial biology in risk for schizophrenia.



P50
Computational mapping of inter-brain-region gene networks dysregulated in Alzheimer's disease

Subject: Inference and Pattern Discovery

Presenting Author: Manikandan Narayanan, Indian Institute of Technology (IIT) Madras, India

Co-Author(s):
Sanga Mitra, IIT Madras, India
Kailash BP, IIT Madras, India
Srivatsan CR, IIT Madras, India
Naga Venkata Saikumar, IIT Madras, India
Philge Phillip, IIT Madras, India

Abstract:

A fundamental challenge in neuroscience is to understand how the activities across brain regions are coordinated. Genomic investigation can systematically uncover genes involved in such coordination, but current genomic studies have focused mostly on within-region analysis of healthy vs. disease states. This study performed an inter-region differential correlation (DC) analysis on post-mortem human brain RNAseq data (87 control vs. 131 Alzheimer's disease (AD) individuals) from Mount Sinai Brain Bank, and identified how AD rewires the gene-gene correlation structure across four brain regions. The systematic computational approach we employ first adjusts the data for predicted frequencies of major cell types in these brain regions and then performs inter-region DC analysis, so that the resulting network of DC gene pairs is less likely to be confounded by variation in cellular composition across individuals. Examining these DC networks, we found that each brain region uses a different set of DC genes while interacting with other brain regions, and clustering such DC gene networks into bipartite (region-region) communities revealed that synaptic signaling and transporter activities are the most altered biological processes, further uncovering genes that have previously not been identified as AD biomarkers. Thus, inter-region comparison provides a new perspective for comprehending AD aetiology.

This work is supported by the Wellcome Trust/DBT India Alliance Intermediate Fellowship Award IA/I/17/2/503323 awarded to MN. Please note the equal contributions by co-authors Sanga Mitra and BP Kailash.



P51
Transferring Biological Networks by Solving Clique-based Graph Alignments

Subject: Optimization and Search

Presenting Author: Maryam Nazarieh, Alumni of Saarland University, Germany


Abstract:

The graph edit distance problem, which is NP-hard, measures the edit distance between two graphs as the minimum number of operations needed to transform one graph into the other. The problem becomes more complicated when multiple graphs are to be considered. Here, I suggest mapping the problem to the polynomial problem of finding cliques of a fixed size, where the size of the clique is defined by the number of graphs under consideration. The nodes of the graphs can be genes, proteins, DNA sequences, protein sequences, and so on, and two nodes from two graphs are placed in the same clique if their dissimilarity is below a specified threshold. This means that all n nodes taken from n graphs to construct one clique satisfy the distance threshold in every pairwise comparison. This approach can be used to transfer biological networks while reducing the running time and maintaining high accuracy.
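The pairwise-threshold condition described above can be sketched as a brute-force enumeration: pick one node from each graph and keep the tuple only if every pair is closer than the threshold. This is an illustrative sketch, not the author's implementation; the function names, the Hamming dissimilarity and the toy sequences are assumptions. For a fixed number of graphs n, the number of candidate tuples is polynomial in the graph sizes.

```python
from itertools import combinations, product


def aligned_cliques(graphs_nodes, dissimilarity, threshold):
    """Enumerate n-tuples (one node per graph) forming a clique.

    A tuple qualifies when every pair of its nodes has a
    dissimilarity strictly below the threshold.
    """
    cliques = []
    for candidate in product(*graphs_nodes):
        if all(dissimilarity(u, v) < threshold
               for u, v in combinations(candidate, 2)):
            cliques.append(candidate)
    return cliques


def hamming(a, b):
    """Number of mismatching positions between equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))


# Hypothetical node labels: short DNA sequences from three graphs.
graph_1 = ["ACGT", "TTTT"]
graph_2 = ["ACGA", "GGGG"]
graph_3 = ["ACGT", "TTTA"]
matches = aligned_cliques([graph_1, graph_2, graph_3], hamming, threshold=2)
```

Only node similarity is checked here; a fuller alignment would also compare the edges among matched nodes.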



P52
Assessing Equivalent and Inverse Change in Genes between Diverse Experiments

Subject: Inference and Pattern Discovery

Presenting Author: Lisa Neums, University of Kansas Medical Center, United States


Abstract:

It is important to identify when two exposures impact a molecular marker (e.g., a gene’s expression) in similar ways, for example, to learn that a new drug has a similar effect to an existing drug. Currently, statistically robust approaches for making comparisons of equivalence of effect sizes obtained from two independently run treatment versus control comparisons have not been developed. Here, we propose two approaches for evaluating the question of equivalence between effect sizes of two independent studies: a bootstrap test of the Equivalent Change Index (ECI), which we previously developed, and performing Two One-Sided t-Tests (TOST) on the difference in log-fold changes directly. We used a series of simulation studies to compare the two tests on the basis of balanced accuracy and F1-score. We found that TOST is not efficient for identifying equivalently changed genes (F1-score = 0) because it is too conservative, while the ECI bootstrap test shows good performance (F1-score = 0.96). Furthermore, applying the ECI bootstrap test and TOST to publicly available microarray expression data from pancreatic cancer of tumor tissue and peripheral blood mononuclear cells (PBMC) showed that, while TOST was not able to identify any equivalently or inversely changed genes, the ECI bootstrap test identified genes associated with pancreatic cancer. In conclusion, a bootstrap test of the ECI is a promising new statistical approach for determining if two diverse studies show similarity in the differential expression of genes and can help to identify genes which are similarly influenced by a specific treatment or exposure.
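The TOST idea can be illustrated with a minimal sketch for a single gene. Note the simplification: it uses a normal (z) approximation instead of the t-distribution named in the abstract, and it takes the standard errors and the equivalence margin as given. The function name, parameters and numbers are hypothetical, and the ECI bootstrap test is not shown.

```python
from math import sqrt
from statistics import NormalDist


def tost_equivalence(lfc1, se1, lfc2, se2, margin, alpha=0.05):
    """Two one-sided tests on the difference of two log-fold changes.

    Equivalence is declared when the difference is shown to lie
    inside (-margin, +margin). Uses a normal approximation,
    which is an assumption of this sketch.
    """
    diff = lfc1 - lfc2
    se = sqrt(se1 ** 2 + se2 ** 2)           # independent studies
    standard_normal = NormalDist()
    # H0: diff <= -margin, rejected when diff is well above -margin.
    p_lower = 1.0 - standard_normal.cdf((diff + margin) / se)
    # H0: diff >= +margin, rejected when diff is well below +margin.
    p_upper = standard_normal.cdf((diff - margin) / se)
    p_value = max(p_lower, p_upper)
    return p_value, p_value < alpha


# Similar effects, precise estimates: equivalence is declared.
p_similar, similar = tost_equivalence(1.0, 0.1, 1.05, 0.1, margin=0.5)
# Opposite effects: equivalence is not declared.
p_opposite, opposite = tost_equivalence(1.0, 0.1, -1.0, 0.1, margin=0.5)
# Identical effects but noisy estimates: still not declared equivalent.
p_noisy, noisy = tost_equivalence(1.0, 0.5, 1.0, 0.5, margin=0.5)
```

The third call shows why TOST can be conservative: even identical log-fold changes fail the test when the standard errors are large relative to the margin.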



P53
Ramp Sequences in Alzheimer's Disease and Related Dementias

Subject: other

Presenting Author: Alyssa Nitz, Brigham Young University, United States

Co-Author(s):
Perry Ridge, Brigham Young University, United States

Abstract:

Many SNPs associated with Alzheimer’s disease and related dementias (ADRD), including Lewy body dementia, frontotemporal dementia, and vascular dementia, have been identified using genome-wide association studies (GWAS). Despite the many successes of GWAS, no effective treatments have yet been developed and a majority of identified GWAS SNPs are not believed to be functional. Ramp sequences are short stretches of slowly translated, or inefficient, codons at the 5’ end of genes that, counterintuitively, increase overall translational efficiency by evenly spacing ribosomes on the mRNA. Since ramp sequences are defined as inefficient codons, even synonymous SNPs can affect a ramp sequence. The effects of GWAS SNPs in ADRDs on the ramp sequences of the associated genes are unknown. Thus, by determining whether these SNPs modify ramp sequences, we may be able to identify the functional effects these SNPs are tagging. Using the software package PLINK and the algorithm ExtRamp, we are compiling lists of SNPs associated with ADRDs and variants in linkage disequilibrium (LD) with known variants, and isolating which SNPs affect the ramp sequences in the relevant genes. First, using the GWAS Catalog, we gathered 7 607 variants/risk alleles associated with ADRDs. Next, using PLINK to calculate the r² values between different SNPs, we compiled a list of 49 949 SNPs that includes SNPs associated with ADRDs and variants in LD with the known variants. Finally, using ExtRamp, we will be able to determine which SNPs modify the ramp sequences of the associated genes.



P54
Comprehensive Compendium of Breast Cancer Gene-Expression Datasets

Subject: Data Management Methods and Systems

Presenting Author: Ifeanyichukwu Nwosu, Brigham Young University, United States

Co-Author(s):
Stephen Piccolo, Brigham Young University, United States

Abstract:

Before a researcher can perform statistical inferences or make biological interpretations from transcriptomic data, they must download the data, perform quality checks, clean, and standardize the data. Thus, despite the wide availability of transcriptomic data in the public domain, it is difficult for researchers, especially those with limited computational expertise, to perform these processing steps and then be able to make sound interpretations of the data.

We have curated 50 publicly available breast-cancer datasets representing 15,137 patients, uniformly processed them, and standardized the metadata variables against the National Cancer Institute Thesaurus, a popular reference standard with unique codes for biomedical terms. This is useful because it makes it easier to infer semantic meaning when researchers analyze and combine datasets. Our curated datasets have a wide range of metadata variables that are not included in existing packages. Such variables include hormone receptor status, race, disease stage, tumor size, etc. We plan to make the curated data freely available for other researchers to analyze. We believe that having this resource together in one place will minimize time spent on data preparation and allow researchers to focus on answering biomedical questions rather than on developing computational pipelines to process the data, thus potentially accelerating biomedical research. An additional benefit is that this data resource will not be bound to any programming language or statistical environment, making it easier for researchers from diverse training backgrounds to use.



P55
Improving the interpretability of random forest models of genetic association in the presence of non-additive interactions

Subject: Machine Learning

Presenting Author: Alena Orlenko, University of Pennsylvania, United States

Co-Author(s):
Jason Moore, University of Pennsylvania, United States

Abstract:

Non-additive interactions among genes are frequently associated with a number of phenotypes, including known complex diseases such as Alzheimer’s, diabetes, and cardiovascular disease. Detecting interactions requires careful selection of analytical methods, and some machine learning algorithms are unable or underpowered to detect or model feature interactions that exhibit non-additivity. The Random Forest (RF) method is often employed in these efforts due to its ability to detect and model non-additive interactions. RF has the built-in ability to estimate feature importance scores, a characteristic that allows the model to be interpreted with the order and effect size of the feature association with the outcome. This characteristic is very important for epidemiological and clinical studies where results of predictive modeling could be used to define the future direction of the research efforts. Alternative ways to interpret the model are the permutation feature importance metric, which employs a permutation approach, and Shapley additive explanations, which employ a cooperative game theory approach. Currently, it is unclear which RF feature importance metric provides a superior estimation of the true informative contribution of features in genetic association analysis.

To address this issue, and to improve interpretability of RF predictions, we compared different methods for feature importance estimation in real and simulated datasets with non-additive interactions. As a result, we detected a discrepancy between the metrics for the real-world datasets and further established that the permutation feature importance metric provides more precise feature importance rank estimation for the simulated datasets with non-additive interactions.



P56
Quantification and visualization of the tumor microenvironment heterogeneity from spatial transcriptomic experiments

Subject: other

Presenting Author: Oscar Ospina, Moffitt Cancer Center, United States

Co-Author(s):
Alex Soupir, Moffitt Cancer Center, United States
Christopher Wilson, Moffitt Cancer Center, United States
Anders Berglund, Moffitt Cancer Center, United States
Inna Smalley, Moffitt Cancer Center, United States
Kenneth Tsai, Moffitt Cancer Center, United States

Abstract:

Spatially-resolved transcriptomics (ST) allows for a better assessment of tissue structure and function. In the context of cancer research, ST promises to deepen our understanding of the tumor microenvironment and lead to improved cancer prognosis and therapies. We present spatialGE, an R package for the visualization and quantification of gene expression heterogeneity from ST experiments. Our software has adapted geostatistical methods for 1) the generation of high-resolution gene expression surfaces via spatial interpolation and 2) the quantification of spatial heterogeneity measures that can be compared against clinical information (e.g., patient survival). In addition, spatialGE includes 3) cell deconvolution methods at the spot level; 4) a fast spatially-informed clustering approach (STClust); and 5) a new data structure that allows storage and analysis of multiple ST samples simultaneously. To demonstrate the utility of spatialGE, we used a publicly available ST data set from stage III melanoma lymph node biopsies [Thrane et al. (2018); Cancer Research]. Spatial variation in gene expression was observed in a number of genes, including key cancer and immune-related genes such as PMEL and IGLL5. After applying deconvolution methods (e.g., xCell, ESTIMATE), B cells showed high spatial variation across the sampled locations. Moreover, tissue sections showing the most non-uniform spatial distributions of B cells (as quantified by Moran’s I and Geary’s C) were extracted from a patient with the highest survival time. These results provide support to the hypothesis that spatial heterogeneity in the tumor microenvironment is a potential predictor of patient outcomes.
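
As an illustration of the spatial heterogeneity statistics mentioned above, the following is a minimal sketch of Moran's I in plain Python. It is not spatialGE code (spatialGE is an R package); the function name, the toy values, and the simple chain-adjacency weight matrix are assumptions chosen for illustration.

```python
def morans_i(values, weights):
    """Moran's I for values at n locations with an n x n spatial weight matrix."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    w_total = sum(sum(row) for row in weights)
    # Cross-products of deviations, weighted by spatial proximity
    num = sum(weights[i][j] * dev[i] * dev[j] for i in range(n) for j in range(n))
    den = sum(d * d for d in dev)
    return (n / w_total) * num / den


# Four spots on a line; each spot neighbors the adjacent spot(s)
chain = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
clustered = morans_i([1, 1, 5, 5], chain)    # similar values sit together
alternating = morans_i([1, 5, 1, 5], chain)  # neighbors always differ
```

Positive values indicate that similar expression or cell-type scores cluster in space, while negative values indicate a dispersed, checkerboard-like arrangement.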



P57
Predicting functional consequences of mutations using molecular interaction network features

Subject: Graph Theory

Presenting Author: Kivilcim Ozturk, University of California San Diego, United States

Co-Author(s):
Hannah Carter, University of California San Diego, United States

Abstract:

Variant interpretation remains a central challenge for precision medicine. Missense variants are particularly difficult to understand as they change only a single amino acid in a protein sequence yet can have large and varied effects on protein activity. Numerous tools have been developed to identify missense variants with putative disease consequences from protein sequence and structure. However, biological function arises through higher order interactions among proteins and molecules within cells. We therefore sought to capture information about the potential of missense mutations to perturb protein-protein interaction networks by integrating protein structure and interaction data. We developed 16 network-based annotations for missense mutations that provide orthogonal information to features classically used to prioritize variants. We then evaluated them in the context of a proven machine-learning framework for variant effect prediction across multiple benchmark datasets and demonstrated their potential to improve variant classification. Interestingly, network features resulted in larger performance gains for classifying somatic mutations than for germline variants, possibly due to different constraints on what mutations are tolerated at the cellular versus organismal level. Our results suggest that modeling variant potential to perturb context-specific interactome networks is a fruitful strategy to advance in silico variant effect prediction.



P58
Germline modifiers of the tumor immune microenvironment reveal drivers of immunotherapy response

Subject: other

Presenting Author: Meghana Pagadala, UCSD, United States

Co-Author(s):
Victoria Wu, Moores Cancer Center, United States
Hyo Kim, UCSD, United States
Andrea Castro, UCSD, United States
James Talwar, UCSD, United States
Cristian Gonzalez-Colin, La Jolla Institute of Immunology, United States
Steven Cao, UCSD, United States
Benjamin Schmiedel, La Jolla Institute of Immunology, United States
Shervin Goudarzi, Canyon Crest Academy, United States
Divya Kirani, UCSD, United States
Rany Salem, UCSD, United States
Gerald Morris, UCSD, United States
Olivier Harismendy, Moores Cancer Center, United States
Sandip Patel, Moores Cancer Center, United States
Jill Mesirov, UCSD, United States
Maurizio Zanetti, Moores Cancer Center, United States
Chi-Ping Day, National Institutes of Health, United States
Chun Fan, UCSD, United States
Wesley Thompson, UCSD, United States
Glenn Merlino, National Institutes of Health, United States
Eva Pérez-Guijarro, National Institutes of Health, United States
J Silvio Gutkind, Moores Cancer Center, United States
Pandurangan Vijayanand, La Jolla Institute of Immunology, United States
Hannah Carter, UCSD, United States

Abstract:

With the continued promise of immunotherapy as an avenue for treating cancer, understanding how host genetics contributes to the tumor immune microenvironment (TIME) is essential to tailoring cancer risk screening and treatment strategies. Using genotypes from over 8,000 European individuals in The Cancer Genome Atlas and 137 heritable tumor immune phenotype components (IP components), we identified and investigated 532 TIME-SNPs. Focusing on 77 variants that were relevant to cancer risk, survival, or treatment response, we explored their potential to reveal novel targets for immunotherapy. Many variants overlapped regions with histone marks indicating active transcription, and influenced gene activities in specific immune cell subsets, such as macrophages and dendritic cells. Genes implicated by TIME-SNPs, such as LAIR1, TREX1, CTSS, CTSW, and LILRB2, were differentially expressed between responders and non-responders to immune-checkpoint blockade (ICB) in preclinical studies. Of these, LILRB2 and LAIR1 have already been identified as putative targets for immunotherapy. Here we found that inhibition of CTSS led to better tumor control and survival in murine models, alone or in combination with anti-PD-1. Collectively, we show that through an integrative approach, it is possible to link host genetics to TIME characteristics, informing novel biomarkers for cancer risk and target identification in immunotherapy.



P60
Geographical Support Vector Machines (GSVM) for the Analysis of Spatial Data

Subject: Machine Learning

Presenting Author: Shachi Patel, University of Kansas Medical Center, United States


Abstract:

Analysis of geographical data presents a unique challenge because of the spatial correlation among the variables. Geographically Weighted Regression (GWR) has been developed as a tool to capture the strong effect of local variations. However, GWR is a parametric technique, and therefore, it requires an assumption about the functional form between the response and the independent variables. On the other hand, Support Vector Machines (SVM) do not require a specific relationship between response and independent variables. However, not many studies have incorporated spatial weights of the geographical data with SVM. Therefore, we developed a method called Geographical Support Vector Machines (GSVM), which combines geographically related data with SVM. This approach creates a separate SVM for each local context and weights observations based on their distance to the local context. We tested our method on two different datasets: a urologist dataset and an election results dataset. For the urologist dataset, we built a model to predict the counties that exhibit an increase in urologist availability from 2010 to 2018, using the increase in urologist availability from 2000 to 2010 as the training dataset and socioeconomic variables of each county as predictors. For the election dataset, we used election results of 2012 and population-socioeconomic variables of each county to predict the election results of 2016. In both datasets, the GSVM model performs significantly better than SVM. In conclusion, we have developed a non-parametric spatial analysis technique that can estimate an arbitrary functional relationship among predictors and responses to analyze the geographically correlated data.
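
The distance-based weighting step can be sketched as follows. The abstract does not state which kernel GSVM uses, so the Gaussian kernel, the bandwidth value, the planar coordinates, and the function name below are assumptions for illustration only.

```python
from math import exp, hypot


def local_weights(center, points, bandwidth):
    """Hypothetical helper: Gaussian-kernel weight of each observation
    relative to one local context (an assumed kernel, common in GWR)."""
    cx, cy = center
    return [
        exp(-0.5 * (hypot(x - cx, y - cy) / bandwidth) ** 2)
        for x, y in points
    ]


# Toy county centroids (arbitrary planar coordinates)
counties = [(0.0, 0.0), (1.0, 0.0), (3.0, 4.0)]
weights = local_weights(center=(0.0, 0.0), points=counties, bandwidth=2.0)
```

One possible way to use such weights is to pass them as per-observation weights when fitting the local model, for example through the `sample_weight` argument that scikit-learn's `SVC.fit` accepts, and to repeat the fit once per local context.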



P61
Deep probabilistic generative modeling and visual evaluation of conditional single cell RNA-sequencing expression data

Subject: Machine Learning

Presenting Author: Eric Prince, University of Colorado Anschutz Medical Campus, United States

Co-Author(s):
Todd Hankinson, University of Colorado Anschutz Medical Campus, United States
Rajeev Vibhakar, University of Colorado Anschutz Medical Campus, United States
Debashis Ghosh, Colorado School of Public Health, United States
Carsten Goerg, Colorado School of Public Health, United States

Abstract:

Background: Single-cell RNA-sequencing (scRNA-seq) is a common assay in biomedical research. While offering high-resolution data, there are known technical issues such as sampling bias and dropout. A variational autoencoder (VAE) is a type of deep learning model that can learn the data-generating process of a given scRNA-seq expression matrix. We present an unsupervised probabilistic deep learning method and corresponding visual analytics software for developing synthetic scRNA-seq experiments.

Methods: We developed a VAE with 2 hidden layers using TensorFlow and trained separate VAEs under a Bayesian paradigm on three separate human scRNA-seq datasets. We generated synthetic expression data cell-wise in triplicate and used the Seurat package for preprocessing and analysis. We used the Harmony package to perform batch-effect correction. Synthetic matrices were contrasted with real expression matrices using differential enrichment analysis, and hierarchical clustering of sample-wise distances. We developed an R/Shiny tool for generation and visual inspection of synthetic scRNA-seq expression data. Visual inspection includes clustering, differential enrichment, and information theory-based analyses, enabling selection of simulated scRNA-seq expression matrices with specific transcriptional signatures.

Results: Synthetic expression data is clustered homogeneously with real expression data in both PCA and UMAP embedding space. Differential expression and functional enrichment analysis between real and synthetic data revealed rational signatures for each dataset context. Analysis of cell- and feature-level reconstruction error identified which signals in the dataset were best learned. Hierarchical clustering of various distance matrices between real and synthetic data revealed the wide range of overall quality of synthetic data for the respective experimental conditions.



P62
TidyGEO - Web-based, open-source tool for downloading, tidying, and restructuring data series from Gene Expression Omnibus (GEO)

Subject: Data Management Methods and Systems

Presenting Author: Badí Israel Quinteros, Brigham Young University, United States

Co-Author(s):
Elizabeth Anderson, Brigham Young University, United States
Ashlie Johnson, Brigham Young University, United States
Avery Bell, Brigham Young University, United States
Stephen Piccolo, Brigham Young University, United States

Abstract:

TidyGEO is a Web-based, open-source tool for downloading, tidying, and restructuring data series from Gene Expression Omnibus (GEO). As biological data are made available in public repositories, important discoveries can be made via secondary research. As a freely accessible repository with data from over 4 million biological samples across more than 4,000 organisms, GEO provides diverse opportunities for secondary research. Gene-expression data are most common in GEO, but other assay types are also prevalent, including DNA methylation, comparative genomic hybridization, and chromatin-accessibility profiles. GEO's diversity and expansiveness present opportunities and challenges. Although scientists may find assay data relevant to a given research question, most analyses require sample annotations, such as a sample's treatment group, disease subtype, or age. In GEO, such annotations are stored alongside assay data in consistently formatted, text-based files. However, the structure and semantics of the annotations vary widely from one data series to another. Thus, before annotations can be used in quantitative analyses, they must be preprocessed. This is typically accomplished using manual approaches, which are labor intensive and error prone. These efforts take time away from other research tasks and may be daunting to scientists with limited computational experience. TidyGEO can perform essential data-cleaning tasks for sample-level annotations, such as selecting informative columns, renaming columns, splitting or merging columns, standardizing data values, and filtering samples. Additionally, users can integrate these annotations with assay data, restructure assay data, and generate code that enables others to reproduce these steps. TidyGEO can be found at https://github.com/srp33/TidyGEO.



P63
CREPE: A pipeline for streamlined transcription factor identification and cataloguing

Subject: Inference and Pattern Discovery

Presenting Author: Diego A Rosado-Tristani, University of Puerto Rico, Puerto Rico


Abstract:

Gene regulation is essential to all organisms. The unraveling of gene regulatory networks that control phenotypic determination is an ongoing endeavor. A first step in deciphering these networks is the identification and cataloguing of transcription factors (TFs), because they serve as nodes in these networks. TFs are proteins with DNA-binding domains (DBD) that bind to regulatory elements in the genome in a sequence-specific manner and modulate transcription. As such, TFs act as mediators from genome to phenome. In this work we present the Cis-regulatory element-binding Protein Elucidator (CREPE) pipeline, capable of parsing TFs from proteomes and annotating them. In brief, a proteome is scanned against a database of Eukaryotic transcription factor DBD Hidden Markov Models (HMM) which are then extracted and the TF family distributions visualized. This is followed by orthology inferences of the putative TFs made using tree-based methods. Each putative TF is assigned the gene name with the shortest patristic distance relative to it. The purpose of the orthology inferences is to harmonize annotations because in many non-model organisms annotations may not be available. To measure the performance of CREPE we applied the pipeline to the human proteome available from Ensembl (~100,000 proteins) and compared the results against Cis-BP, a comprehensive database of TFs (1546 proteins). Our results show near identical TF family distributions between CREPE and Cis-BP. CREPE identified 1400 proteins as putative TFs, all of them included in Cis-BP, for an accuracy of 90%. CREPE streamlines TF identification, highlighting its potential use on non-model organisms.
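
The reported figures can be checked with a short worked calculation. This sketch treats Cis-BP as the reference set and interprets the counts as precision and recall; that interpretation, and the variable names, are assumptions rather than definitions given in the abstract.

```python
# Counts taken from the abstract
predicted_tfs = 1400   # proteins CREPE called putative TFs
reference_tfs = 1546   # TFs listed in Cis-BP
true_positives = 1400  # all CREPE calls were present in Cis-BP

# Fraction of CREPE calls that are in the reference set
precision = true_positives / predicted_tfs
# Fraction of the reference set that CREPE recovered
recall = true_positives / reference_tfs

print(f"precision = {precision:.3f}, recall = {recall:.3f}")
```

Under this reading, the 90% figure in the abstract corresponds to the share of Cis-BP TFs that were recovered, while every call made was present in the reference.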



P64
Understanding the Microbiome and Disease Using Vector Embeddings

Subject: Metagenomics

Presenting Author: Brook Santangelo, University of Colorado Anschutz Medical Campus, United States

Co-Author(s):
Larry Hunter, University of Colorado Anschutz Medical Campus, United States
William Baumgartner, University of Colorado Anschutz Medical Campus, United States
Mike Bada, University of Colorado Anschutz Medical Campus, United States
Tiffany Callahan, University of Colorado Anschutz Medical Campus, United States
Catherine Lozupone, University of Colorado Anschutz Medical Campus, United States

Abstract:

Vector embeddings are used to represent large-scale networks and enable meaningful predictions to be made on a complex knowledge base. Understanding the gut microbiome in the context of disease benefits from this global context and abstraction because it allows for the embedding of multi-omic clinical data which can provide mechanistic insight into patient phenotypes. We hypothesized that introducing microbes to a biologically relevant knowledge graph would stratify individuals by disease and highlight the role of microbiome composition and immune signatures on disease phenotypes. We manually extracted microbial relationships from PubMed and placed them into a biomedical knowledge graph using novel logical definitions that incorporate 9 primary biological ontologies. Vector embeddings of each entity were generated from the knowledge graph using Node2Vec. A dataset of HIV positive and negative individuals was represented by these embeddings and scaled according to microbiome and immune data. Cosine similarity scores against gold standard coordinates of HIV phenotypes in PCA space were then evaluated to understand the patient clustering. We found that the vector representation of individuals from this biomedical knowledge graph clearly stratified them by disease phenotypes. Furthermore, the stratification supports the idea that ART treatment improves HIV disease status based on microbiome and immune signatures. Including novel microbiome-relevant nodes in the knowledge graph did not significantly affect the performance. This suggests that biomedical entities can be used to represent microbes, allowing for vector embeddings from existing biomedical knowledge graphs to be used as a method for improving our understanding of the microbiome and disease.
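
The comparison of patient vectors against reference coordinates relies on cosine similarity, which can be sketched in a few lines. The vectors below are invented toy values, not data from the study, and the function name is an assumption.

```python
from math import sqrt


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy embeddings: one patient and two hypothetical phenotype reference points
patient = [0.9, 0.1, 0.0]
reference_a = [1.0, 0.0, 0.0]
reference_b = [0.0, 0.0, 1.0]

closest = max(
    [("reference_a", reference_a), ("reference_b", reference_b)],
    key=lambda item: cosine_similarity(patient, item[1]),
)[0]
```

Because cosine similarity ignores vector length, scaling an embedding by abundance data changes its contribution to a weighted sum but not its direction on its own.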



P65
Adaptive evolution of conserved non-coding elements underlying the high-altitude phenotype among mammalian lineages

Subject: Inference and Pattern Discovery

Presenting Author: Elysia Saputra, University of Pittsburgh, United States

Co-Author(s):
Allie Graham, University of Utah, United States
Maria Chikina, University of Pittsburgh, United States
Nathan Clark, University of Utah, United States

Abstract:

Distantly related species that adapt to common selective pressures can develop similar phenotypic traits, driven by convergent molecular changes. A case in point is the convergent mammalian adaptation to the high-altitude environment, in which species independently acquired morphological and molecular changes to adapt to the hypobaric hypoxia, UV exposure, and extreme cold that accompany living at altitude. While convergent evolution of the hypoxia-inducible factor (HIF) pathway, the modulator of hypoxia response, has been observed in high-altitude species, the regulatory adaptations underlying the high-altitude phenotype are still not completely understood. We performed genome-wide comparative analyses to identify conserved non-coding elements (CNEs) that drive high-altitude adaptation in mammals. Using a multi-species genome alignment of 120 mammals, we identified 22 extant and ancestral species that resided at high altitude. We then used our new computational approach to evaluate 1,222,955 CNEs for departures from the neutral evolutionary rate, and annotated 25,689 CNEs that exhibited significant rate deceleration, and 15,344 CNEs that exhibited significant rate acceleration, among high-altitude species. Decelerated CNEs, which experienced heightened selective constraint, were enriched for pathways that were regulators, effectors, or collaborators of the HIF pathway. Meanwhile, accelerated CNEs, which experienced relaxed constraint or positive selection, were enriched for pathways inhibited by the HIF pathway, and others whose inhibition facilitated hypoxia response. Additionally, accelerated CNEs were enriched for functions related to hormone metabolism, ocular phenotypes, ischemia, reduced glomerular filtration rate, flaring of rib cage, and metabolic alkalosis, which were relevant to high-altitude adaptation.



P66
A comprehensive benchmarking of WGS-based structural variant callers

Subject: Simulation and Numeric Computing

Presenting Author: Varuni Sarwal, University of California Los Angeles, United States

Co-Author(s):
Sebastian Niehus, Berlin Institute of Health (BIH), Germany
Eleazar Eskin, University of California Los Angeles, United States
Jonathan Flint, University of California Los Angeles, United States
Serghei Mangul, serghei.mangul@gmail.com , United States

Abstract:

Advances in whole-genome sequencing promise to enable accurate and comprehensive structural variant (SV) discovery. Dissecting SVs from whole-genome sequencing (WGS) data presents a substantial number of challenges and a plethora of SV-detection methods have been developed. Currently, there is a paucity of evidence which investigators can use to select appropriate SV-detection tools. In this project, we evaluated the performance of SV-detection tools on mouse and human WGS data using a comprehensive PCR-confirmed gold standard set of SVs and the GIAB variant set, respectively. In contrast to the previous benchmarking studies, our mouse gold standard dataset included a complete set of SVs allowing us to report both precision and sensitivity rates of SV-detection methods. Our study investigates the ability of the methods to detect deletions, thus providing an optimistic estimate of SV detection performance, as the SV-detection methods that fail to detect deletions are likely to miss more complex SVs. We found that SV-detection tools varied widely in their performance, with several methods providing a good balance between sensitivity and precision. Manta was the top-performing tool for both mouse and human data, with F-score values consistently above 0.6. Additionally, we have determined the SV callers best suited for low and ultra-low pass sequencing data as well as for different deletion length categories. We hope that the results reported in this benchmarking study can help researchers choose appropriate variant calling tools based on the organism, data coverage, and deletion length.
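
The balance between sensitivity and precision is summarized by the F-score. The sketch below assumes the balanced F1 variant (the abstract does not specify the weighting), and the input values are invented for illustration rather than taken from the benchmark.

```python
def f_score(precision, sensitivity):
    """Balanced F-score (F1): harmonic mean of precision and sensitivity."""
    if precision + sensitivity == 0:
        return 0.0  # avoid division by zero when a caller finds nothing correct
    return 2 * precision * sensitivity / (precision + sensitivity)


# Hypothetical callers: one balanced, one precise but insensitive
balanced_caller = f_score(precision=0.7, sensitivity=0.6)
conservative_caller = f_score(precision=0.95, sensitivity=0.2)
```

Because the harmonic mean is dominated by the smaller of the two inputs, a caller with very high precision but low sensitivity still receives a low score.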



P67
Multi-omics data integration of early lung adenocarcinoma reveals an association between radiomics features and immune response

Subject: Machine Learning

Presenting Author: Maria-Fernanda Senosain, Vanderbilt University, United States

Co-Author(s):
Yong Zou, Vanderbilt University Medical Center, United States
Khushbu Patel, Vanderbilt University Medical Center, United States
Vera Pancaldi, INSERM, France
Carlos Lopez, Vanderbilt University, United States
Pierre Massion, Vanderbilt University Medical Center, United States

Abstract:

Lung adenocarcinoma (ADC) is a heterogeneous group of tumors associated with different survival rates, even when detected at an early stage. Here, we aim to investigate the biological determinants of early ADC indolence or aggressiveness using radiomics as a surrogate of behavior.

We present a set of 93 ADC patients with data collected across different methodologies. Patients were risk-stratified by analyzing their CT scans using the SILA software (continuous score; 0 = least aggressive, 1 = most aggressive). Using CyTOF we identified epithelial, mesenchymal, and immune subpopulations characterized by high HLA-DR expression that were associated with indolent behavior. In the RNA-Seq dataset, pathways related to immune response were downregulated in aggressive tumors compared to indolent tumors, while pathways related to cell cycle and proliferation were upregulated. We used HealthMyne (HM) software to extract radiomics features and computed a pairwise correlation with SILA.

For the data integration step, we selected features that were significantly associated with SILA. Feature clusters were obtained by computing a squared dissimilarity matrix and applying the K-medoids algorithm (k=6). Cluster 4 (C4) was composed of proteomics and transcriptomics features associated with immune response, antigen presentation, HLA-DR expression, and HM features negatively associated with SILA, such as percent of ground glass opacity. C5 was associated with proliferation and cell cycle. C6 was associated with defective metabolic pathways and HM features positively associated with SILA. Patients with high C4 had low C5 and C6 and vice versa. In conclusion, we found a bridge between radiomics and tumor biology which could improve the discrimination between indolent and aggressive ADC tumors.



P68
Sequential Imputation of Sparse Metabolomic Data

Subject: other

Presenting Author: Elin Shaddox, University of Colorado Anschutz Medical Campus, United States

Co-Author(s):
Katerina Kechris, University of Colorado Anschutz Medical Campus, United States
Debashis Ghosh, University of Colorado Anschutz Medical Campus, United States

Abstract:

Missing quantitative measurements are commonly encountered during the comprehensive profiling of metabolite abundances, or metabolomics. Three mechanisms of missingness can occur in metabolomics: 1) metabolite abundance signals lower than the limit of detection for mass spectrometry technologies, 2) environmental factors such as lab technician or instruments and variation in processing, and 3) random sparsity. In practice, missing data is handled by omission of missing abundances or by imputation approaches. Standard imputation methods tend to be sensitive to missingness mechanisms, and often cannot handle all three mechanisms of missingness. We propose a sequential imputation approach for handling general missingness encountered in metabolomics data that is more robust to the missingness mechanism. Our sequential imputation approach incorporates an empirical Bayesian procedure to estimate the posterior distribution of the data and performs well without iterations. We demonstrate the performance of our approach compared to alternative imputation approaches through simulated datasets of varying patterns and levels of missingness. Our model is also applied to an untargeted metabolomics study in plasma for chronic obstructive pulmonary disease (COPD).



P69
Using Sub-Pattern Frequency Pruning to Find Multi-Gene Expression Patterns in Alzheimer Disease

Subject: Optimization and Search

Presenting Author: Kenneth Smith, University of Missouri - St. Louis, United States

Co-Author(s):
Jamie Lea, University of Missouri – St. Louis, United States
Carlos Cruchaga, Washington University School of Medicine, United States
Sharlee Climer, University of Missouri – St. Louis, United States

Abstract:

Biomarkers, such as gene expression levels, are commonly used to aid disease prediction and treatment. Traditionally, gene expression analysis focuses on differentially expressed genes, while ignoring combinations of genes whose co-expression is associated with disease risk. Recently, step-wise regression and machine learning methods, such as random forests and gradient boosting, have been proposed to select features for classification models. We present a method for discovering candidate risk patterns called Sync. Sync is applied to cerebrospinal fluid profiles from an Alzheimer case-control study (n = 279). The continuous expression data are discretized on a per-analyte basis and patterns associated with disease risk are identified for combinations of analytes, ranging from a single feature up to seven co-expressed features. For each pattern size, the strength of the risk association is measured by the difference in pattern frequency between cases and controls. The optimal pattern and a set of near-optimal solutions are determined for each pattern size. Using the patterns identified from the discrete data, logistic regression models are trained on the training set and then validated on a holdout set (n = 210). From the 705 analytes that remained after QC, our program yielded 5285 significant patterns, 548 of which outperformed the best differential expression model (AUC=0.8325). Our approach is computationally efficient, and our open-source software is freely available. This method holds potential for biomarker discovery for diverse phenotypes and is also applicable for identifying patterns hidden within non-biological real-valued data sets.
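The scoring idea described above (difference in pattern frequency between cases and controls) can be illustrated with a small Python sketch. This is not the Sync implementation: it enumerates patterns exhaustively instead of pruning, and the function names `pattern_counts` and `best_risk_pattern`, the analyte names, and the toy data are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def pattern_counts(samples, size):
    """Count how many samples carry each pattern of `size` (analyte, level) pairs."""
    counts = Counter()
    for sample in samples:
        items = sorted(sample.items())
        for pattern in combinations(items, size):
            counts[pattern] += 1
    return counts


def best_risk_pattern(cases, controls, size):
    """Return (pattern, score) maximizing case frequency minus control frequency."""
    case_counts = pattern_counts(cases, size)
    control_counts = pattern_counts(controls, size)
    best, best_score = None, float("-inf")
    for pattern, n_case in case_counts.items():
        score = n_case / len(cases) - control_counts.get(pattern, 0) / len(controls)
        if score > best_score:
            best, best_score = pattern, score
    return best, best_score


# Toy, already-discretized data (hypothetical analytes A, B, C).
cases = [
    {"A": "high", "B": "low", "C": "high"},
    {"A": "high", "B": "low", "C": "low"},
    {"A": "high", "B": "high", "C": "low"},
]
controls = [
    {"A": "high", "B": "high", "C": "low"},
    {"A": "low", "B": "low", "C": "high"},
    {"A": "low", "B": "high", "C": "low"},
]

pattern, score = best_risk_pattern(cases, controls, size=2)
print(pattern, round(score, 3))
```

Exhaustive enumeration grows combinatorially with pattern size, which is presumably why a pruning strategy is needed for patterns of up to seven features across hundreds of analytes.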



P70
Gender, Sex, and Sexual Orientation Project

Subject: Optimization and Search

Presenting Author: Devorah Stucki, Brigham Young University, United States


Abstract:

The electronic health record stores patient data as unformatted text input describing the visit, discussion, and diagnoses. These notes are most useful for data analysis when they use a controlled vocabulary of terms. An ontology is a network of concepts and categories explaining relationships between terms. Such systems promote accuracy and transparency and increase data use and analysis. Although there are hundreds of biomedical ontologies, none cover LGBTQIA+ subject areas, or the areas of gender, sex, and sexual orientation. The goal of this project was to implement the Gender, Sex, and Sexual Orientation Ontology (GSSO) in a clinical setting and compare its efficacy to that of current identification systems. This was done by a patient chart review that compared the GSSO and similar systems (i.e., natural language processing, manual provider input) to ICD10 codes. We have shown that while the GSSO has high accuracy, its sensitivity does not meet the threshold of other identification systems. Effective extraction of clinical data facilitates accurate patient identification. Improving the accuracy of the GSSO will help facilitate better communication between LGBTQIA+ persons and health care professionals to improve health outcomes.



P71
Co-evolution based machine-learning for predicting functional interactions between human genes

Subject: Machine Learning

Presenting Author: Yuval Tabach, The Hebrew University-Hadassah Medical School, Israel


Abstract:

Over the next decade, more than a million eukaryotic species are expected to be fully sequenced. This has the potential to improve our understanding of genotype and phenotype crosstalk, gene function and interactions, and answer evolutionary questions. Here, we develop a machine-learning approach for utilizing phylogenetic profiles across 1154 eukaryotic species. This method integrates co-evolution across eukaryotic clades to predict functional interactions between human genes and the context for these interactions. We benchmarked our approach and found a 14% performance increase (auROC) compared to previous methods. Using this approach, we enabled functional annotation for less studied genes. We focused on DNA repair and verified that 9 of the top 50 predicted genes have been identified elsewhere, with others previously prioritized by high-throughput screens. Overall, our approach enables better annotation of function and functional interactions and facilitates the understanding of evolutionary processes underlying co-evolution. This work is accompanied by a webserver available at: https://mlpp.cs.huji.ac.il.



P72
Online Tools for Teaching Cancer Bioinformatics

Subject: other

Presenting Author: Mason Taylor, Brigham Young University, United States

Co-Author(s):
Bryn Mendenhall, Brigham Young University, United States
Calvin Woods, Brigham Young University, United States
Madeline Rasband, Brigham Young University, United States
Milene Vallejo, Brigham Young University, United States
Elizabeth Bailey, Brigham Young University, United States
Samuel Payne, Brigham Young University, United States

Abstract:

The rise of deep molecular characterization with omics data as a standard in biological sciences has highlighted a need for expanded instruction in bioinformatics curricula. Many large biology data sets are publicly available and offer an incredible opportunity for educators to help students explore biological phenomena with computational tools, including data manipulation, visualization, and statistical assessment. However, logistical barriers to data access and integration often complicate their use in undergraduate education. Here, we present a cancer bioinformatics module that is designed to overcome these barriers through six authentic, biologically motivated computational exercises that demonstrate how modern omics data are used in precision oncology. Upper-division undergraduate students develop advanced Python programming and data analysis skills with real-world oncology data that integrate proteomics and genomics. The module is publicly available and open source at https://paynelab.github.io/biograder/bio462. These hands-on activities include explanatory text, code demonstrations, and practice problems and are ready to implement in bioinformatics courses.



P73
Best Practices for the Analysis of Whole Mitochondrial Genomes

Subject: Data Management Methods and Systems

Presenting Author: Nicholas Tenney, Brigham Young University, United States

Co-Author(s):
Brady Neeley, Brigham Young University, United States
Perry Ridge, Brigham Young University, United States

Abstract:

Despite the large variety of tools to analyze DNA from next generation sequencing (NGS), relatively little research has been done with respect to mitochondrial DNA (mtDNA). The differences between the mitochondrial and nuclear genomes generally require that unique tools be used to analyze them. Our purpose was to evaluate available tools built to analyze mitochondrial genomes from NGS in order to establish best practices. We sequenced 40 whole human genomes using NGS and Sanger sequenced whole mitochondrial genomes for each of these samples. We tested each software tool on the NGS-sequenced mtDNA and validated the results against those from Sanger sequencing. The tools tested include seven for variant calling, four for detecting heteroplasmy, five for annotating haplogroups, and three for finding the mtDNA copy number. We found that, with respect to variant calling, heteroplasmy, and mtDNA copy number, no software performed perfectly, with a trade-off between false positive and false negative rates. The variant calling software that performed the best was mitoCaller, with the highest overall accuracy and precision. Haplogrep was able to find the haplogroup of the mtDNA with 100% accuracy. Analyses are currently being performed to find the best of the heteroplasmy and mtDNA copy number tools. Knowing the best software for each category of analysis will allow us to provide an evidence-based pipeline for the analysis of mtDNA from NGS projects.



P74
Modeling Mechanisms of Cellular Plasticity in SCLC using Boolean Gene Regulatory Network

Subject: Qualitative Modelling and Simulation

Presenting Author: Perry Wasdin, Vanderbilt University, United States

Co-Author(s):
Bryan Glazer, Vanderbilt University, United States
Carlos F Lopez, Vanderbilt University, United States

Abstract:

Despite the immense amount of single cell data introduced by next generation -omics technologies, a quantitative understanding of how the transcriptome shapes cellular identity, function, and reprogramming is still lacking. Here we present a pipeline designed to construct a data-driven Boolean network in order to model phenotypic transitions and investigate how patterns of gene regulation enable increased plasticity in small cell lung cancer (SCLC) cells following treatment. Previous studies have demonstrated that SCLC tumors acquire treatment resistance when treated with platinum-based therapies due to increased intratumoral heterogeneity that gives rise to multiple resistance mechanisms, but specific mechanisms have yet to be uncovered. Our approach leverages previously published scRNA-seq data from a series of circulating tumor cell-derived xenograft models before and after treatment with cisplatin to construct and train a gene regulatory network. Relevant transcription factors (TFs) were identified by finding significantly enriched gene ontology and pathway terms at the single cell level. Differential terms between the two treatment groups were then identified, yielding a group of pertinent TFs which were used as nodes for the network. Edges between these nodes were constructed using Enrichr to find enriched interactions between the TFs using data from ChIP-seq databases to ensure that the network model is based on physical interactions, enabling us to infer causal relationships between genes. This network will be used to simulate transitions between different steady states observed in the data, elucidating concurrent resistance mechanisms to understand how the tumor cells are able to de-differentiate and transition between these states.
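As a rough illustration of how steady states of a Boolean gene regulatory network can be found, the Python sketch below enumerates every state of a toy three-node network under synchronous updates and keeps the fixed points. The nodes `A`, `B`, `C` and their rules are hypothetical placeholders, not the transcription factors or the trained rules from this work.

```python
from itertools import product

# Hypothetical update rules: A and B repress each other, and A activates C.
RULES = {
    "A": lambda s: not s["B"],
    "B": lambda s: not s["A"],
    "C": lambda s: s["A"],
}


def step(state):
    """Apply every rule to the current state at once (synchronous update)."""
    return {node: bool(rule(state)) for node, rule in RULES.items()}


def steady_states():
    """Enumerate all 2^n states and keep those that map to themselves."""
    nodes = sorted(RULES)
    found = []
    for values in product([False, True], repeat=len(nodes)):
        state = dict(zip(nodes, values))
        if step(state) == state:
            found.append(state)
    return found


for state in steady_states():
    print(state)
```

Exhaustive enumeration is only feasible for small networks; larger models typically rely on simulation from sampled initial states or on asynchronous update schemes, which this sketch does not cover.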



P75
Morphology and gene expression profiling provide complementary information for mapping cell state

Subject: Inference and Pattern Discovery

Presenting Author: Gregory Way, University of Colorado Anschutz, United States

Co-Author(s):
Ted Natoli, Broad Institute of MIT and Harvard, United States
Adeniyi Adeboye, Broad Institute of MIT and Harvard, United States
Lev Litichevskiy, Broad Institute of MIT and Harvard, United States
Andrew Yang, Broad Institute of MIT and Harvard, United States
Xiaodong Lu, Broad Institute of MIT and Harvard, United States
Juan Caicedo, Broad Institute of MIT and Harvard, United States
Beth Cimini, Broad Institute of MIT and Harvard, United States
Kyle Karhohs, Broad Institute of MIT and Harvard, United States
David Logan, Pfizer, United States
Mohammad Rohban, Imaging Platform, United States
Maria Kost-Alimova, Center for the Development of Therapeutics, United States
Kate Hartland, Center for the Development of Therapeutics, United States
Michael Bornholdt, Imaging Platform, United States
Niranj Chandrasekaran, Imaging Platform, United States
Marzieh Haghighi, Imaging Platform, United States
Shantanu Singh, Imaging Platform, United States
Aravind Subramanian, Cancer Program, United States
Anne Carpenter, Imaging Platform, United States

Abstract:

Deep profiling of cell states can provide a broad picture of biological changes that occur in disease, mutation, or in response to drug or chemical treatments. Morphological and gene expression profiling, for example, can cost-effectively capture thousands of features in thousands of samples across perturbations, but it is unclear to what extent the two modalities capture overlapping versus complementary mechanistic information. Here, using both the L1000 and Cell Painting assays to profile gene expression and cell morphology, respectively, we perturb A549 lung cancer cells with 1,327 small molecules from the Drug Repurposing Hub across six doses. We determine that the two assays capture some shared and some complementary information in mapping cell state. We find that as compared to L1000, Cell Painting captures a higher proportion of reproducible compounds and mechanisms and has more diverse samples, but measures fewer distinct groups of features. In a deep learning analysis, L1000 predicted more compound mechanisms of action (MOA). In general, the two assays together provide a complementary view of drug mechanisms for follow-up analyses. Our analysis answers fundamental biological questions comparing the two biological modalities and, given the numerous applications of profiling in biology, provides guidance for planning experiments that profile cells for detecting distinct cell types, disease phenotypes, and response to chemical or genetic perturbations.



P76
Microbial growth dynamic estimation in metagenomic data as a tool for differentiating health-promoting bacterial strains from healthy and diseased populations.

Subject: Metagenomics

Presenting Author: Brendan Wee, Second Genome Inc., United States

Co-Author(s):
Irina Shilova, Second Genome Inc., United States

Abstract:

Peak-to-Trough Ratio (PTR) is a powerful approach for in silico estimation of bacterial DNA replication dynamics in genomic and metagenomic sequence data. The method serves as a proxy for cell growth rate and makes it possible to distinguish active from inactive and potentially dead cells in metagenomics datasets. Since its first application [1], the method has shown that PTR of bacterial species in the human gut might differentiate healthy from diseased human populations [2]. Second Genome’s discovery platform is based on a meta-analysis of omics datasets to identify bacterial strains that are significantly more abundant in healthy populations versus populations with a disease, such as Inflammatory Bowel Disease (IBD). From these strains, we then identify potentially active peptides that are subsequently tested in vitro and in vivo for desired activity. As a proof-of-concept, we are analyzing whether the PTR approach improves the identification of health-promoting bacterial strains in IBD.
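The core of the PTR idea is that, in a replicating population, read coverage is highest near the origin of replication (peak) and lowest near the terminus (trough). The Python sketch below is a simplified illustration under that assumption: it smooths binned coverage along a circular genome and takes the ratio of the maximum to the minimum. The function names, the window size, and the toy coverage values are hypothetical, and published PTR tools apply additional filtering and model fitting that are not shown here.

```python
def smooth_circular(coverage, window):
    """Moving average over a circular genome, so bins wrap around the ends."""
    n = len(coverage)
    half = window // 2
    return [
        sum(coverage[(i + k) % n] for k in range(-half, half + 1)) / (2 * half + 1)
        for i in range(n)
    ]


def peak_to_trough_ratio(coverage, window=3):
    """Ratio of the highest to the lowest smoothed coverage bin."""
    smoothed = smooth_circular(coverage, window)
    trough = min(smoothed)
    if trough == 0:
        raise ValueError("trough coverage is zero; ratio is undefined")
    return max(smoothed) / trough


# Toy binned coverage: highest at bin 0 (origin), lowest at bin 5 (terminus).
coverage = [20, 18, 16, 14, 12, 10, 12, 14, 16, 18]
ptr = peak_to_trough_ratio(coverage)
print(round(ptr, 3))
```

A ratio near 1 would suggest little active replication, while larger values suggest a faster-growing population.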



P77
Lower sequencing coverage of germline and tumor exomes of Black patients in The Cancer Genome Atlas

Subject: other

Presenting Author: Daniel Wickland, Mayo Clinic, United States

Co-Author(s):
Mark Sherman, Mayo Clinic, United States
Derek Radisky, Mayo Clinic, United States
Aaron Mansfield, Mayo Clinic, United States
Yan Asmann, Mayo Clinic, United States

Abstract:

In the U.S., cancer disproportionately impacts Blacks / African Americans. Identifying genetic factors underlying this cancer disparity has been an important research focus and requires data that are equitable in both quantity and quality across racial groups. It is widely recognized that DNA databases quantitatively underrepresent minorities. However, the differences in data quality between racial groups have not been well studied. We analyzed germline and tumor exomes from Black and White patients in The Cancer Genome Atlas (TCGA) of 7 cancers that profiled at least 50 Black tumors, comparing them in the context of sequencing depth, tumor purity, and the qualities of germline variants and somatic mutations. Germline and tumor exomes from Black patients were sequenced at significantly lower depth in 6 out of the 7 cancer types studied. For 3 of the cancers, the majority of White patients were studied in early sample batches and sequenced at significantly higher coverage, whereas Black patients were concentrated in later batches and sequenced at much lower depth. For the other 3 cancers, the reasons underlying lower sequencing coverage of Blacks remain unknown. Furthermore, even when the total sequencing depths were comparable, Black exomes had a disproportionately higher percentage of positions with insufficient coverage, likely due to the known White bias in the human reference genome that impacted exome capture kit design. Overall and positional lower sequencing depths of Black exomes in TCGA led to under-detection and lower quality of variants, highlighting the need to consider epidemiological factors for future genomics studies.



P78
ProTaxa: software to easily perform phylogenomic analyses for prokaryotic taxa

Subject: Data Management Methods and Systems

Presenting Author: Joseph Wirth, Harvey Mudd College, United States

Co-Author(s):
Eliot Bush, Harvey Mudd College, United States

Abstract:

The nucleotide sequences of 16S ribosomal RNA (rRNA) genes have been used to inform the taxonomic placement of prokaryotes for several decades, but recent research has demonstrated that whole-genome approaches can better resolve the evolutionary relationships of organisms, especially when taxa are closely related. However, the vast number of publicly available 16S rRNA gene sequences makes this gene useful for obtaining a rough estimate of the phylogeny for a given taxon. Unfortunately, the reliance on 16S rRNA as the sole phylogenetic marker often causes closely related organisms to be omitted from taxonomic analyses. In addition, NCBI Taxonomy is not an authoritative database for prokaryotic taxonomy. Although it is roughly accurate, the database has many erroneous entries, especially when it comes to the accurate designation of type material. While there are existing tools available to facilitate taxonomic placement, the genome-selection methods leave much to be desired. For example, the TYpe strain Genome Server (TYGS) uses a proprietary genome database, and the Microbial Genome Atlas (MiGA) relies on relationships provided by NCBI's Taxonomy database. ProTaxa was developed to resolve these issues in a freely accessible (open source) way. NCBI's Taxonomy database is cross-referenced with the List of Prokaryotes with Standing in Nomenclature (LPSN), a more definitive resource for prokaryotic taxonomy, which allows easy linking to NCBI’s sequence databases. This software also employs a unique strategy to identify closely related genomes that were omitted by identifying and utilizing phylogenetic markers specific to the input genome. These approaches greatly improve taxonomic placements and are largely automated.



P79
sciCAN: Single-cell chromatin accessibility and gene expression data integration via Cycle-consistent Adversarial Network

Subject: Machine Learning

Presenting Author: Yang Xu, The University of Tennessee, Knoxville, United States


Abstract:

As booming single-cell sequencing technologies bring a surge of high-dimensional data that come from different sources and represent cellular systems with different features, there is an equivalent increase in the challenges of integrating single-cell sequencing data across modalities. Here, we present a novel adversarial approach (sciCAN) to integrate single-cell chromatin accessibility and gene expression data in an unsupervised manner. We benchmarked sciCAN against 3 state-of-the-art (SOTA) methods on 5 different ATAC-seq/RNA-seq datasets, and we demonstrated that sciCAN handled data integration with a better balance of mutual transfer between modalities than the other 3 SOTA methods. sciCAN, along with Seurat, has the best integration performance. Next, we applied sciCAN to both PBMC RNA-seq and ATAC-seq data and showed that the integrated representation learned by sciCAN preserved the HSC-centered hematopoiesis hierarchy in both modalities. Finally, we used sciCAN to jointly cluster single-cell CRISPR-screened K562 RNA-seq and ATAC-seq data, and we identified a subcluster enriched for similar markers in both modalities, suggesting a common effect after CRISPR perturbation.



P80
NLP Sandbox: Model-to-data ecosystem for utilizing clinical notes and benchmarking NLP tools

Subject: Inference and Pattern Discovery

Presenting Author: Yao Yan, Sage Bionetworks, United States

Co-Author(s):
Thomas Yu, Sage Bionetworks, United States
George Kowalski, Medical College of Wisconsin, United States
Sijia Liu, Mayo Clinic, United States
Connor Boyle, University of Washington, United States
Jiaxin Zheng, Sage Bionetworks, United States
James Eddy, Sage Bionetworks, United States
Hongfang Liu, Mayo, United States
Bradley Taylor, Medical College of Wisconsin, United States
Justin Guinney, Sage Bionetworks, United States
Thomas Schaffter, Sage Bionetworks, United States

Abstract:

Critical patient information derived from academic research, health care, and clinical trials is off limits for traditional data-to-model (whereby data is transferred/downloaded into a new environment to be colocated with the executable model) benchmarking of NLP tools. Existing barriers include restricted access to prohibitively large or sensitive data. In addition to data access constraints, we also lack effective frameworks for assessing the performance and generalizability of NLP tools.

The NLP Sandbox adopts a model-to-data approach to enable NLP developers to assess the performance of their tools on public and private datasets. When a developer submits a tool, partner organizations (e.g., hospitals, universities) automatically provision the tool, execute it, and evaluate its performance against their private data in a secure environment. Upon successful completion, the partner organization reports the performance of the tool, and this report is automatically published in the NLP Sandbox leaderboards.

The first series of NLP tasks that the NLP Sandbox supports is the annotation of Protected Health Information (PHI) in clinical notes. These tasks have been identified through our collaboration with the National Center for Data to Health (CD2H). Submitted tools are currently evaluated on the dataset of the 2014 i2b2 NLP De-identification Challenge and private data from MCW. Additional data sites are currently being onboarded (Mayo Clinic, UW).



P81
Identifying Viruses from Host Genomes and Deep Learning for Prediction of Viral Integration Sites

Subject: Machine Learning

Presenting Author: Zhongming Zhao, University of Texas Health Science Center at Houston, United States


Abstract:

Viral infections are commonly observed in nature. Effective and efficient detection of viruses in host genomes, together with tracking how viruses interact with host genomes, are major challenges. We recently developed an algorithm called VERSE: Virus intEgration sites through iterative Reference SEquence customization, which can effectively detect viruses with viral mutations from next generation sequencing data. VERSE improves detection through customizing reference genomes. Using 19 human tumors and cancer cell lines as test data, we demonstrated that VERSE substantially enhanced the sensitivity of virus integration site detection. VERSE has been used by some large network projects such as The International Cancer Genome Consortium (ICGC, 25k whole genome sequencing data). We next manually collected and curated viral integration sites (VISs, total 77,632 sites) from published works and made them publicly available through VISDB: Viral Integration Site DataBase. Furthermore, we developed a deep learning method, DeepVISP, for viral integration site prediction and motif discovery. DeepVISP is based on a deep convolutional neural network (CNN) model with an attention architecture. We demonstrated that DeepVISP can accurately predict oncogenic VISs in the human genome using our curated benchmark integration data of three viruses: hepatitis B virus (HBV), human papillomavirus (HPV), and Epstein-Barr virus (EBV). Compared to six classical machine learning methods, DeepVISP achieves higher accuracy and more robust performance for all three viruses by automatically learning informative features and essential genomic positions from the DNA sequences alone. A user-friendly web server has been developed for predicting putative oncogenic VISs in the human genome using DeepVISP.
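Sequence-based CNNs usually need DNA converted to a numeric matrix before training. One common choice, assumed here for illustration and not confirmed by the abstract as the DeepVISP input format, is one-hot encoding, sketched below in Python. The function name `one_hot` and the handling of ambiguous bases are hypothetical choices.

```python
BASES = "ACGT"


def one_hot(sequence):
    """Encode a DNA string as rows of 4 values (A, C, G, T).

    Bases outside ACGT, such as N, become all-zero rows.
    """
    return [[1 if base == b else 0 for b in BASES] for base in sequence.upper()]


encoded = one_hot("acgN")
for row in encoded:
    print(row)
```

Each row corresponds to one position, so a window around a candidate integration site becomes a length-by-4 matrix that a convolutional layer can scan for motifs.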



P82
NF-κB signaling alterations associated with the aggressive prostate cancer subtype defined by MAP3K7 and CHD1 co-loss

Subject: other

Presenting Author: Michael Orman, U Colorado | AMC, United States

Co-Author(s):
Jim Costello, U Colorado | AMC, United States

Abstract:

Recent work by our group identified an aggressive prostate cancer (PCa) subtype defined by co-loss of the tumor suppressor genes MAP3K7 and CHD1. This subtype constitutes 10 – 20% of localized PCa and 20 – 25% of metastatic PCa cases; patients have higher relapse rates and worse overall survival, and the subtype is associated with neuroendocrine differentiation. The majority of experiments have been performed in models of the primary tumor, thus the mechanisms that drive metastasis in CHD1 and MAP3K7-deleted PCa remain unstudied. To identify molecular factors that are associated with metastasis in this subtype, we analyzed multiple independent cohorts of primary compared to metastatic PCa tumors. We integrated genome-wide measurements of copy number alteration and mRNA expression to identify gene alterations that are enriched in metastatic tumors. We focused on the most statistically significant locus, at chromosome 10q23-10q24, which includes PTEN, an established PCa-associated gene. However, another gene in this region with reported tumor suppressive activity is CHUK, or inhibitor of NFκB kinase subunit alpha (IKKα). NFκB signaling enhances PCa metastasis and is induced by inflammatory signals through MAP3K7.



P83
Identification and evolutionary diversification of novel immunoglobulin and ion channel proteins in SARS-CoV-2 and related viruses

Subject: other

Presenting Author: Yongjun Tan, Saint Louis University, United States

Co-Author(s):
Dapeng Zhang, Saint Louis University, United States
Theresa Schneider, Saint Louis University, United States
Matthew Leong, Saint Louis University, United States
L Aravind, National Institutes of Health, United States

Abstract:

The ongoing COVID-19 pandemic strongly emphasizes the need for a better understanding of the function and evolution of its causative agent SARS-CoV-2. Despite intense scrutiny, several structural/accessory proteins of SARS-CoV-2 remain enigmatic. By using a series of dedicated computational methods, we have successfully uncovered several previously unrecognized families of immunoglobulin (Ig) proteins and ion channel proteins in SARS-CoV-2 and many other viruses. The novel Ig proteins include the mysterious ORF8 proteins from SARS-CoV/SARS-CoV-2 related viruses, as well as many proteins from alpha-CoVs and unrelated animal viruses. We show that the ORF8 proteins from the SARS-CoV/SARS-CoV-2 clade are rapidly evolving, which suggests that they might function as immune modulators to delay/attenuate the host immune response against viruses. In addition, we unified the SARS-CoV ORF3a family with several families of viral proteins, including ORF5 from MERS-CoVs, ORF3c from beta-CoVs, ORF3b in alpha-CoVs, and, most importantly, the Matrix proteins from all CoVs, as well as more distant homologs from other nidoviruses. We present computational evidence that these viral families might utilize specific conserved polar residues to constitute an aqueous pore within the membrane-spanning region. This suggests that the novel coronavirus Matrix/ORF3 ion channel proteins might play a role in virion assembly and membrane budding.





International Society for Computational Biology
525-K East Market Street, RM 330
Leesburg, VA, USA 20176
