Candida auris, the multidrug-resistant human fungal pathogen, emerged as four major distinct geographical clades (clade 1-clade 4) in the past decade. Though isolates of the same species, C. auris clinical strains exhibit clade-specific properties associated with virulence and drug resistance. In this study, we report the identification of unique DNA sequence junctions by mapping clade-specific regions through comparative analysis of whole-genome sequences of strains belonging to different clades. These unique DNA sequence stretches are used to identify C. auris isolates at the clade level in subsequent in silico and experimental analyses. We develop a colony PCR-based clade-identification system (ClaID), which is rapid and specific. In summary, we demonstrate a proof-of-concept for using unique DNA sequence junctions conserved in a clade-specific manner for the rapid identification of each of the four major clades of C. auris. C. auris was first isolated in Japan in 2009 as an antifungal drug-susceptible pathogen causing localized infections. Within a decade, it simultaneously evolved in different parts of the world as distinct clades exhibiting resistance to antifungal drugs at varying levels. Recent studies hinted at the mixing of isolates belonging to different geographical clades in a single location, suggesting that the area of isolation alone may not indicate the clade status of an isolate. We propose the utilization of whole-genome sequence data to extract clade-specific sequences for clade-typing.

Jeffrey Thompson, University of Kansas Medical Center, United States

Smoking, drinking, and obesity are all major cancer risk factors. A recent study showed that obesity is more prevalent among adults living in nonmetropolitan counties than among those living in metropolitan counties. However, the authors assumed the variables were stationary and did not consider the local relationship between the predictors and the outcome of interest. To overcome this limitation of the global set of estimates, we explored the multiscale geographically weighted regression (MSGWR) model to investigate the effects of rurality and other demographic variables on the following cancer risk factors: obesity, alcohol usage, and cigarette smoking. Prevalence estimates of cancer risk factors for 3,140 counties were collected from the Institute for Health Metrics. In the MSGWR model, we assume the variables are nonstationary and model the local relationship between the predictors and an outcome of interest. Our results from MSGWR show that, on average, men living in metropolitan counties have lower rates of obesity than men living in nonmetropolitan counties, after adjusting for income, education, and race. In contrast, women in metropolitan counties have higher rates of obesity than women living in nonmetropolitan counties, after adjusting for income, education, and race. Within metropolitan counties, adults in the western region of the US tend to have a higher rate of obesity than those in the southern region of the US. In contrast, these adults have lower rates of binge drinking than those in the southern region of the US, after adjusting for income, education, and race.

In next-generation paired-end sequencing, a fragment is read from both ends, generating paired-end reads. In many modern protocols that sample short fragments (e.g., ATAC-seq, CUT&RUN), these read pairs often overlap. Merging overlapping read pairs can provide longer, higher-quality single reads for use in downstream bioinformatics pipelines, especially in low-quality datasets. While a multitude of publicly available tools exist to join overlapping paired-end reads, such as USEARCH, PEAR, FLASH, COPE, PANDAseq, CASPER, and BBMerge, these tools prioritize speed over accuracy. They often ignore quality scores and, capitalizing on the low insertion/deletion rate of Illumina sequencing, ignore or discard alignments that include indels. However, there are datasets where indels are unusually common and pipelines where accuracy is of paramount importance. We examine the performance of several state-of-the-art read-pair merging tools on simulated and real datasets, representing various indel rates, overlap sizes, and read qualities. We also present a highly accurate read-pair merger based on Needleman-Wunsch alignment with a custom quality-based scoring system and demonstrate its superior performance on noisy datasets.
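
To illustrate the general idea of quality-aware alignment, the following is a minimal sketch of Needleman-Wunsch alignment in which match and mismatch scores are weighted by Phred qualities. It is not the scoring system of the tool described above: the weighting rule, the gap penalty, the choice to keep unpaired bases, and all function names are assumptions made for illustration. It also assumes the overlapping region of the two reads has already been extracted and that the second read has been reverse-complemented.

```python
def phred_to_error(q):
    """Convert a Phred quality score to an error probability."""
    return 10 ** (-q / 10)


def pair_score(a, b, qa, qb):
    """Hypothetical scoring rule: weight by the chance both calls are right."""
    confidence = (1 - phred_to_error(qa)) * (1 - phred_to_error(qb))
    return confidence if a == b else -confidence


def merge_overlap(s1, q1, s2, q2, gap=-1.0):
    """Globally align two overlapping segments and build a consensus."""
    n, m = len(s1), len(s2)
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = score[i - 1][0] + gap
    for j in range(1, m + 1):
        score[0][j] = score[0][j - 1] + gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1]
                + pair_score(s1[i - 1], s2[j - 1], q1[i - 1], q2[j - 1]),
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )

    # Traceback: at aligned columns keep the higher-quality base;
    # at gap columns keep the base that is present (an assumed policy).
    merged = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair_score(
            s1[i - 1], s2[j - 1], q1[i - 1], q2[j - 1]
        ):
            merged.append(s1[i - 1] if q1[i - 1] >= q2[j - 1] else s2[j - 1])
            i, j = i - 1, j - 1
        elif j == 0 or (i > 0 and score[i][j] == score[i - 1][j] + gap):
            merged.append(s1[i - 1])
            i -= 1
        else:
            merged.append(s2[j - 1])
            j -= 1
    return "".join(reversed(merged)), score[n][m]


if __name__ == "__main__":
    consensus, total = merge_overlap("ACGT", [30, 30, 30, 30],
                                     "ACTT", [30, 30, 2, 30])
    print(consensus, round(total, 3))
```

In this sketch a low-quality mismatch costs little, so the alignment prefers a substitution over two gaps and the consensus takes the higher-quality base. A full merger would also need free end gaps so that only the overlapping ends of the reads are aligned.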

Tracey Lamb, University of Utah, United States

Understanding small transcriptional changes in rare cell populations through techniques such as single-cell RNA sequencing (scRNA-seq) is critical to understanding defects in the immune response. The mechanisms behind diseases such as malaria, which kills nearly 500,000 people each year, that explain the lack of an efficacious and broadly reactive humoral immune response after infection are poorly elucidated by scRNA-seq, because the noisy, sparse, and biased representations created during clustering obscure biologically informative signals in cells with a high degree of transcriptional similarity. To fill this gap, we present SpliceCluster, a novel pipeline that integrates gene regulatory information to better capture the genetic variation in immune single-cell data sets and provide improved discriminative clustering utility for scRNA-seq analysis. Using the power of deep learning, SpliceCluster utilizes an autoencoder latent representation of attenuated RNA splicing data to isolate communities of interest. Initial benchmarking of SpliceCluster indicates an improvement in classification accuracy with a notable decrease in several heuristic biases of current methods on sequence-orthogonally validated data sets. Additionally, SpliceCluster has revealed important alterations to cyclic affinity maturation pathways in the malaria germinal center that are implicated in defective memory B cell formation. SpliceCluster offers immunologists a way to work with highly similar immune scRNA-seq data to assemble and characterize previously hidden subpopulations.

AMPylation is an emerging post-translational modification that occurs on the hydroxyl group of threonine, serine, or tyrosine via a phosphodiester bond. AMPylators catalyze this process as the covalent attachment of adenosine monophosphate to the amino acid side chain of a peptide. Recent studies have shown that this post-translational modification is directly responsible for the regulation of neurodevelopment and neurodegeneration and is also involved in many physiological processes. Despite the importance of this post-translational modification, there is no peptide sequence dataset available for conducting computational analysis. Consequently, no computational approach has so far been proposed for predicting AMPylation. In this study, we introduce a new dataset of this distinct post-translational modification and develop a new machine learning tool, called DeepAmp, that uses a deep convolutional neural network to predict AMPylation sites in proteins. DeepAmp achieves 77.7%, 79.1%, 76.8%, 0.55, and 0.85 in terms of Accuracy, Sensitivity, Specificity, Matthews Correlation Coefficient, and Area Under Curve, respectively, for the AMPylation site prediction task. As the first machine learning model for this task, DeepAmp demonstrates promising results that highlight its potential to solve this problem. Our dataset and DeepAmp as a standalone predictor are publicly available at https://github.com/MehediAzim/DeepAmp.
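
For readers less familiar with the reported metrics, the sketch below shows how Accuracy, Sensitivity, Specificity, and the Matthews Correlation Coefficient are derived from a binary confusion matrix. The counts in the usage example are hypothetical and are not taken from the DeepAmp evaluation; Area Under Curve is omitted because it requires ranked prediction scores rather than a single confusion matrix.

```python
import math


def classification_metrics(tp, fp, tn, fn):
    """Compute standard binary-classification metrics from raw counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    sensitivity = tp / (tp + fn)   # true positive rate (recall)
    specificity = tn / (tn + fp)   # true negative rate
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is undefined when any marginal total is zero; report 0.0 then.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "mcc": mcc,
    }


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    for name, value in classification_metrics(tp=80, fp=25, tn=75, fn=20).items():
        print(f"{name}: {value:.3f}")
```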

Mallika Varkhedi, University of South Florida, United States
Dhruv Patel, University of South Florida, United States
Monica Hsiang, University of South Florida, United States
Andrea Chobrutskiy, University of South Florida, United States
Boris Chobrutskiy, Oregon Health and Science University, United States
George Blanck, University of South Florida, United States

Despite comprising over 60% of all ovarian cancers, ovarian serous cystadenocarcinoma (OSC) has a poor survival rate, highlighting the need for novel treatment approaches. Using The Cancer Genome Atlas’s Ovarian (TCGA-OV) dataset, we applied a previously described algorithm for detecting chemical complementarity between candidate cancer antigens and complementarity determining region-3 (CDR3) amino acid sequences from tumor-resident T-cell receptors. We overlapped the complementarity information with gene expression, copy number (CN), and survival data. Current literature demonstrates an association between high CDR3-cancer antigen complementarity and improved survival outcomes. However, we found that CDR3-cancer antigen chemical complementarity in OSC was largely associated with worse outcomes. Specifically, high CDR3-MAGEB4 and CDR3-TDRD1 electrostatic complementarity was associated with lower OSC disease-free survival (DFS) (high CS DFS = 15.54 months vs. low CS DFS = 21.06 months, p = 0.0056). Additionally, high CDR3-MAGEB4 and CDR3-TDRD1 electrostatic complementarity was associated with decreased gene expression and gene CN in MAGEB4 and TDRD1, respectively. Conversely, when we split the TDRD1 AA sequence into equal-sized fragments, high CDR3-TDRD1 hydro CS for one specific AA fragment was associated with increased DFS rates (high CS DFS = 22.08 months vs. low CS DFS = 17.54 months, p = 0.0387) and higher immune marker expression levels. These results highlight the many opportunities in immunogenomics for risk stratification and identification of potential actionable cancer antigens for future immunotherapies.

Chris Miller, University of Colorado Denver, United States

Small subunit (16S) ribosomal RNA gene sequencing is a popular method for gaining insight into microbial communities. 16S sequencing studies contain two forms of variation: biological and technical. Biological variation in community composition can yield hypothesis-generating correlations with phenotypes or traits of interest. On the other hand, technical variation occurs due to methodological limitations and, if not accounted for, can confuse the interpretation of biological variation. For example, PCR can include stochastic selection of first templates by DNA polymerase, and sequencing errors can introduce false variation that is not accounted for by analysis software. Here, we develop methods for identifying samples with high technical variation by using highly replicated 16S sequencing of microbial communities growing on media in a 1,4-dioxane remediation bioreactor. We began by examining the effect that different parameterizations of software used in 16S pipelines implemented in the popular QIIME2 package can have on downstream analyses. We then computed distance matrices using various beta-diversity metrics to quantify the expected distribution of distances for true replicates, which should be closely related. We show that we are able to distinguish between replicates and non-replicates by comparing the distributions of beta-diversity distances for each category. Parameterization and sampling depth had a significant effect on our ability to distinguish between these two sets of distances. These results suggest we can use replicates to probabilistically identify specific samples with high technical variability for removal, as well as to empirically derive parameter values that minimize technical variation overall.
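
The comparison of replicate and non-replicate distances can be sketched as follows. This is an illustration only: Bray-Curtis is used as one example of a beta-diversity metric, the sample names, counts, and replicate labels are hypothetical, and the counts are assumed to have already been brought to a common sampling depth.

```python
from itertools import combinations


def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two count vectors."""
    num = sum(abs(a - b) for a, b in zip(u, v))
    den = sum(a + b for a, b in zip(u, v))
    return num / den if den else 0.0


def split_distances(samples, groups):
    """Split pairwise distances into replicate and non-replicate sets.

    samples: dict mapping sample name -> list of feature counts
    groups:  dict mapping sample name -> replicate group label
    """
    within, between = [], []
    for a, b in combinations(sorted(samples), 2):
        d = bray_curtis(samples[a], samples[b])
        (within if groups[a] == groups[b] else between).append(d)
    return within, between


if __name__ == "__main__":
    # Hypothetical counts for two replicate pairs.
    samples = {
        "A1": [50, 30, 20], "A2": [48, 32, 20],
        "B1": [10, 10, 80], "B2": [12, 8, 80],
    }
    groups = {"A1": "A", "A2": "A", "B1": "B", "B2": "B"}
    within, between = split_distances(samples, groups)
    print("replicate distances:", within)
    print("non-replicate distances:", between)
```

A replicate whose distances to its own group fall in the range typical of non-replicates would be a candidate for high technical variation under this kind of comparison.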

Stephen Piccolo, Brigham Young University, United States

Genomic, medical, and other types of biological data are frequently stored in tab-delimited files as a means of exchanging data among researchers. Such files are commonly gigabytes in size. Tabular data are typically repetitive in nature, but mainstream compression algorithms do not explicitly account for tabular structures. We are designing a custom compression algorithm that specifically targets tabular data. Our goal is to compress tabular files in a way that allows users to query the data in compressed form rather than decompressing the whole file before querying it. This concept is essential for reducing disk usage and for enabling researchers to query compressed files remotely, without having to transfer the whole file to their computer. We compare our algorithm against bgzip, an existing compression tool that also allows users to access portions of data without having to decompress the entire file; however, it does so by compressing the file in ~64 kilobyte blocks (not necessarily corresponding to rows and columns). Our algorithm uses a two-phase approach. First, it identifies unique values in each column and uses Huffman (binary) codes to ensure that the most frequent values have the shortest codes. Second, it compresses the data using zstandard, a fast lossless compression algorithm. For a ~676 megabyte file with genomic features, this approach reduced storage requirements by 4 orders of magnitude for the coded values. Next steps are to evaluate options for optimizing query speeds and to develop a compression scheme for the Huffman dictionary.
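
The first phase described above can be illustrated with a textbook Huffman construction over the values of a single column. This is a minimal sketch, not the implementation under development: the function names and the example column are hypothetical, the codes are kept as bit strings for readability, and the second phase (zstandard compression) is not shown.

```python
import heapq
from collections import Counter
from itertools import count


def huffman_codes(values):
    """Assign prefix-free bit strings so frequent values get short codes."""
    freq = Counter(values)
    if len(freq) == 1:
        # A column with one distinct value still needs a one-bit code.
        return {next(iter(freq)): "0"}
    tiebreak = count()  # unique counter so heap tuples never compare dicts
    heap = [(n, next(tiebreak), {v: ""}) for v, n in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        n1, _, left = heapq.heappop(heap)
        n2, _, right = heapq.heappop(heap)
        merged = {v: "0" + code for v, code in left.items()}
        merged.update({v: "1" + code for v, code in right.items()})
        heapq.heappush(heap, (n1 + n2, next(tiebreak), merged))
    return heap[0][2]


def encode_column(values, codes):
    """Concatenate the codes for every value in the column."""
    return "".join(codes[v] for v in values)


def decode_column(bits, codes):
    """Invert encode_column by reading bits until a code is recognized."""
    lookup = {code: v for v, code in codes.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in lookup:
            out.append(lookup[current])
            current = ""
    return out


if __name__ == "__main__":
    column = ["chr1"] * 6 + ["chr2"] * 3 + ["chrX"]  # hypothetical column
    codes = huffman_codes(column)
    bits = encode_column(column, codes)
    print(codes, len(bits), "bits")
```

Because the codes are prefix-free, a reader can decode a column without delimiters, which is one reason this kind of dictionary coding suits repetitive tabular values.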

Elena Casiraghi, University of Milan, Italy
Bryan Laraway, University of Colorado, United States
Ben Coleman, Jackson Laboratory, United States
Hannah Blau, Jackson Laboratory, United States
Adnin Zaman, University of Colorado, United States
Nomi Harris, Lawrence Berkeley National Lab, United States
Kenneth Wilkins, National Institute of Diabetes and Digestive and Kidney Diseases, United States
Blessy Antony, Virginia Tech, United States
Michael Gargano, Jackson Laboratory, United States
Giorgio Valentini, University of Milan, Italy
David Sahner, Axle Informatics, United States
Melissa Haendel, University of Colorado, United States
Peter Robinson, Jackson Laboratory, United States
Carolyn Bramante, University of Minnesota, United States
Justin Reese, Lawrence Berkeley National Laboratory, United States

Studies suggest that metformin use is associated with reduced COVID-19 severity in individuals with diabetes compared to other antihyperglycemic medications. This observational study analyzed COVID-19 severity in patients with polycystic ovary syndrome (PCOS) and prediabetes prescribed either metformin or control (levothyroxine or ondansetron) prior to COVID onset. In the prediabetes cohort, metformin use was associated with lower incidence of COVID-19 with “mild ED” or worse (OR: 0.636, 95% CI 0.455-0.888, p = 0.007) and “moderate” or worse severity (OR: 0.493, 95% CI 0.339-0.718, p = 0.0002) versus levothyroxine. Metformin was associated with lower incidence of “mild ED” or worse severity (OR: 0.039, 95% CI 0.026-0.057, p = 0), “moderate” or worse (OR: 0.045, 95% CI 0.03-0.069, p = 0), “severe” or worse (OR: 0.183, 95% CI 0.077-0.431, p = 1e-04), and “hospice/death” (OR: 0.223, 95% CI 0.071-0.694, p = 0.0096) compared with ondansetron. For PCOS, we found no significant association between metformin use and COVID-19 severity versus levothyroxine, but saw a significantly lower incidence of “mild ED” or worse (OR: 0.101, 95% CI 0.061-0.166, p = 0), and “moderate” or worse (OR: 0.094, 95% CI 0.049-0.18, p = 0) COVID infection versus ondansetron. Metformin use was associated with less severe COVID-19 in patients with prediabetes and PCOS. Further observational and prospective studies will clarify the relationship between metformin and COVID-19 severity in patients with prediabetes and PCOS.
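
As a reading aid for the odds ratios above, the sketch below back-calculates an approximate standard error, z statistic, and two-sided p-value from a reported odds ratio and its 95% confidence interval. It assumes the intervals are Wald-type (symmetric on the log-odds scale), which the abstract does not state; the function name is hypothetical, and results will differ slightly from the reported p-values because the inputs are rounded.

```python
import math


def wald_from_ci(odds_ratio, lower, upper, z_crit=1.959964):
    """Recover SE, z, and a two-sided p-value from an OR and its 95% CI.

    Assumes a Wald interval: log(OR) +/- z_crit * SE.
    """
    se = (math.log(upper) - math.log(lower)) / (2 * z_crit)
    z = math.log(odds_ratio) / se
    # Two-sided normal tail probability: 2 * (1 - Phi(|z|)).
    p = math.erfc(abs(z) / math.sqrt(2))
    return se, z, p


if __name__ == "__main__":
    # Values quoted in the abstract (prediabetes cohort).
    for label, args in {
        "vs levothyroxine, mild ED or worse": (0.636, 0.455, 0.888),
        "vs ondansetron, mild ED or worse": (0.039, 0.026, 0.057),
    }.items():
        se, z, p = wald_from_ci(*args)
        print(f"{label}: SE={se:.3f}, z={z:.2f}, p={p:.3g}")
```

Under this assumption, the ondansetron comparisons yield p-values far smaller than any fixed number of printed decimals, which is one plausible reason they appear as "p = 0" after rounding.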

Seungwan Hong, New York Genome Center, United States
Gamze Gürsoy, Columbia University, United States
Daniel Joo, Columbia University, United States

Finding associations between genetic variants and disease phenotypes via machine learning has been a rapidly growing field of study in recent years. However, statistically significant inferences from these studies require a massive amount of sensitive genotype and phenotype information from thousands of patients, creating concerns related to patient privacy. These concerns are exacerbated when machine learning models themselves leak information about the patients in the training dataset. As a result, privacy concerns are constantly in conflict with the need for widespread access to patient information for research purposes. Homomorphic encryption is a potential solution because it allows computations to be performed directly on ciphertexts. While many privacy-preserving methods based on homomorphic encryption have been developed to address the privacy of input (genotype) and output (phenotype) during inference, none has implemented mitigations for model privacy. This is largely due to the need for cleaning and pre-processing of large-scale genotype data, which is computationally challenging when model parameters are encrypted. Here we implemented a privacy-preserving inference model using homomorphic encryption for five different phenotype prediction tasks, where genotypes, phenotypes, and model parameters are all encrypted. Our implementation ensures no privacy leakage at any point during inference. We show that we can achieve high accuracy for all five models (≥ 94% for all phenotypes, equivalent to plaintext inference), with each inference taking less than ten seconds for ∼200 genomes. Our study shows that it is possible to achieve high-quality machine learning predictions while protecting patient confidentiality against any kind of membership inference attack with theoretical privacy guarantees.

Caiden Lukan, Butler University, United States
Christopher Bristow, MD Anderson Cancer Center, United States
Kim-Anh Do, MD Anderson Cancer Center, United States

The NanoTube is an open-source pipeline that simplifies the processing, normalization, and analysis of NanoString nCounter gene expression data. It is implemented as an extensible R library, as well as an R-Shiny web application that allows analysis by those without computer programming experience. Both versions perform standard gene expression analysis techniques, and additional functions are provided for integration with other R libraries that perform advanced NanoString analysis techniques. The NanoTube R package is available on Bioconductor under the GPL-3 license (https://www.bioconductor.org/packages/NanoTube/). The web application can be downloaded at https://github.com/calebclass/Shiny-NanoTube, or a simplified version can be run on all major browsers, at https://research.butler.edu/nanotube/.

Ella Nysetvold, Brigham Young University, United States
Perry Ridge, Brigham Young University, United States

Codon bias is an overrepresentation of, or preference for, a codon compared to its synonymous codons, which is due, in part, to different concentrations of tRNAs within a cell. Inefficient codons (i.e., those with relatively lower concentrations of their cognate tRNAs) are translated more slowly than synonymous codons with higher concentrations of their cognate tRNAs. Historically, synonymous codon mutations were considered inconsequential because they do not alter the amino acid sequence, or were ignored because their effects are often not obvious. However, mutations that substitute an inefficient for an efficient synonymous codon can alter translation speed, folding, and mRNA half-life. “Codon-islands” are conserved regions of consecutive relatively inefficient codons that are believed to slow translation so that time-sensitive events can occur (e.g., co-translational folding). These regions are especially vulnerable to synonymous mutations that substitute an inefficient for an efficient codon. We developed ExtCodonIslands to identify, extract, and display codon islands. ExtCodonIslands uses precalculated organism-specific efficiency (i.e., tAI) values or, if these are unavailable, estimates efficiency values from gene sequences for the species. Efficiencies are smoothed using a sliding window to simulate the ribosomal footprint. Codon islands are outlier regions in average codon speed in the gene and are graphed against known protein domains for better contextualization. We anticipate researchers will use ExtCodonIslands to identify islands, which can be considered for downstream studies and analyses, including phylogenetic studies, genotype/phenotype correlation studies, providing plausible functional explanations of common synonymous variants identified in association studies (i.e., fine mapping), protein folding modeling, etc.
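
The smoothing and outlier steps can be sketched as below. This is not the ExtCodonIslands implementation: the window size, the z-score cutoff, the outlier rule, and the function names are all assumptions chosen for illustration, and the efficiency values in the usage example are synthetic.

```python
from statistics import mean, stdev


def sliding_means(values, window):
    """Average per-codon efficiencies over a fixed-width sliding window."""
    return [mean(values[i:i + window]) for i in range(len(values) - window + 1)]


def slow_windows(efficiencies, window=9, z_cutoff=-1.5):
    """Return start indices of windows that are unusually inefficient.

    A window is flagged when its smoothed efficiency lies at least
    |z_cutoff| standard deviations below the gene-wide mean of all windows.
    """
    smoothed = sliding_means(efficiencies, window)
    if len(smoothed) < 2:
        return []
    mu, sigma = mean(smoothed), stdev(smoothed)
    if sigma == 0:
        return []
    return [i for i, v in enumerate(smoothed) if (v - mu) / sigma <= z_cutoff]


if __name__ == "__main__":
    # Synthetic gene: efficient codons with one stretch of inefficient ones.
    efficiencies = [0.8] * 30 + [0.1] * 9 + [0.8] * 30
    print(slow_windows(efficiencies))
```

Adjacent flagged windows would then be merged into a single candidate island before being compared with known protein domains.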

Timothy Putman, University of Colorado Anschutz, United States
Ellen Elias, University of Colorado Anschutz, United States
Justin Reese, Lawrence Berkeley National Laboratory, United States
Melissa Haendel, University of Colorado Anschutz, United States

Ehlers-Danlos syndrome (EDS) is a heritable connective tissue disorder with 14 subtypes. This investigation focuses on the hypermobile subtype (hEDS). hEDS presents with various symptoms and medical problems, such as cardiological, neurological, and gastrointestinal disorders. hEDS represents 80-90% of EDS cases and is thought to affect 1 in 5000 people worldwide. Despite its prevalence, it has no known gene association or underlying mechanism. This lack of knowledge makes diagnosis and treatment difficult. We used knowledge graphs (KGs) to interrogate the underlying mechanisms of hEDS. KGs combine heterogeneous, pre-existing data to represent relationships of interest. We used these relationships to explore potential hypotheses for the mechanism of action of hEDS, and worked with a clinical geneticist to evaluate the results. We interrogated the Monarch KG, which contains information about genes, phenotypes, diseases, and their relationships, with a set of previously identified hEDS genes of interest, and expanded the list to include mouse orthologs. We then assessed phenotypes associated with hEDS to identify new phenotypically similar genes using semantic similarity algorithms. We focused on genes not already associated with hEDS or other subtypes. Two of the genes returned were LOX and ATP7A, both of which are related to copper metabolism. Notably, heterozygous LOX mutations were found by the clinical geneticist in two patients with hEDS. Mutations in both genes produce phenotypes often seen in hEDS patients, indicating that copper metabolism should be further explored in trying to determine the underlying causes of hEDS.

Casey Greene, University of Colorado School of Medicine, United States

Existing public molecular cancer datasets such as TCGA and CCLE include data from diverse cancer types and tissues of origin. In these datasets, all cancer types are not represented equally: some cancer types occur more frequently or are sampled more frequently than others, and some cancer types are more closely related than others. The ability to extract common signal from large pan-cancer datasets, while also generalizing accurately to rare cancer types and heterogeneous individual patients, will be crucial to future precision medicine efforts. However, most existing pan-cancer modeling studies do not measure generalization in this sense, instead reporting performance on a stratified test set with the same proportions of each cancer type observed in the dataset as a whole. In this study, we describe an experimental design for measuring generalization performance within and across cancer types. We applied our experimental setup to several pan-cancer prediction problems across TCGA and CCLE, including driver mutation presence/absence classification and drug response regression from gene expression data. We observed that in most cases generalizing to unseen cancer types (“domain generalization”) is more difficult than generalizing to held-out samples in cancer types represented in the training data (“supervised domain adaptation”), and this effect is often stronger in cancer types that are less closely related to the training data, such as sarcomas and gliomas in TCGA when the training dataset is primarily composed of carcinomas.
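
The difference between the two evaluation settings can be made concrete with a small sketch of the corresponding data splits. This is an illustration under stated assumptions, not the study's code: the function names, the test fraction, and the example labels are hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(cancer_types, test_fraction=0.2, seed=0):
    """Hold out samples from every cancer type (types seen in training)."""
    rng = random.Random(seed)
    by_type = defaultdict(list)
    for idx, cancer_type in enumerate(cancer_types):
        by_type[cancer_type].append(idx)
    train, test = [], []
    for indices in by_type.values():
        rng.shuffle(indices)
        k = max(1, round(len(indices) * test_fraction))
        test.extend(indices[:k])
        train.extend(indices[k:])
    return sorted(train), sorted(test)


def held_out_type_split(cancer_types, held_out):
    """Hold out one entire cancer type (type unseen in training)."""
    train = [i for i, t in enumerate(cancer_types) if t != held_out]
    test = [i for i, t in enumerate(cancer_types) if t == held_out]
    return train, test


if __name__ == "__main__":
    # Hypothetical sample labels using TCGA-style cancer type codes.
    labels = ["BRCA"] * 10 + ["LUAD"] * 10 + ["SARC"] * 5
    print(stratified_split(labels)[1])
    print(held_out_type_split(labels, "SARC")[1])
```

Training a model on each `train` set and comparing its performance on the matching `test` sets is one way to contrast generalization to held-out samples with generalization to an unseen cancer type.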

Derick Singleton, University of Colorado Denver, United States
Christopher Miller, University of Colorado Denver, United States
Mikayla Borton, Colorado State University, United States
Rebecca Daly, Colorado State University, United States
Kelly Wrighton, Colorado State University, United States

Pathogenic bacteria from environmental sources pose a risk to public health. Surface waters have been shown to contain possible human pathogens, but pathogen sources and breadth of distribution remain understudied with modern methods. With high-throughput DNA sequencing methods such as metagenomics, large-scale microorganism identification can be completed in an unbiased manner, including for potential pathogens. In addition, metagenomics reveals the functional and metabolic potential of these communities, and may offer insight into the way pathogens persist in surface waters. Using hundreds of surface water samples from the Genome Resolved Open Watersheds (GROW) project, we are characterizing the pathogenic potential of samples taken from freshwater ecosystems impacted by a range of human activities. Samples are highly varied and come from diverse locations, ranging from relatively pristine rivers, to sites downstream of rural agriculture, to sites near urban wastewater treatment plant effluent. These samples provide the potential for broadly characterizing virulence factor diversity. Using a set of 2093 high-quality Metagenome Assembled Genomes, candidate virulence-associated proteins were identified using homology searches against the Virulence Factor Database (VFDB) for an array of medically significant pathogenic bacteria. The VFDB encompasses 14 categories of major bacterial virulence factors stemming from 32 genera. We identified 27,177 unique candidate virulence factor proteins using specific search criteria: an e-value of 1E-10, a similarity score of 60%, and a bit score of 60. We are analyzing the geographic and phylogenetic distribution of these protein-coding genes to offer insight into microbial pathogen surveillance at scale in freshwater ecosystems.
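
A filter applying the stated thresholds might look like the sketch below. It rests on several assumptions not confirmed by the abstract: that the homology search produces BLAST-style 12-column tabular output, that the "similarity score" corresponds to the percent-identity column, and that the thresholds are applied as an upper bound on e-value and lower bounds on the other two. The function name and example hits are hypothetical.

```python
def passes_thresholds(line, max_evalue=1e-10, min_identity=60.0,
                      min_bitscore=60.0):
    """Check one tab-separated hit against the three search criteria.

    Assumed column layout (BLAST tabular default): qseqid, sseqid, pident,
    length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore.
    """
    fields = line.rstrip("\n").split("\t")
    identity = float(fields[2])
    evalue = float(fields[10])
    bitscore = float(fields[11])
    return (evalue <= max_evalue
            and identity >= min_identity
            and bitscore >= min_bitscore)


def filter_hits(lines):
    """Keep only hits that satisfy every criterion; skip blanks/comments."""
    return [ln for ln in lines
            if ln.strip() and not ln.startswith("#") and passes_thresholds(ln)]


if __name__ == "__main__":
    # Hypothetical hits, for illustration only.
    hits = [
        "\t".join(["MAG1_p1", "VF0001", "72.5", "210", "50", "2",
                   "1", "210", "5", "214", "1e-40", "250"]),
        "\t".join(["MAG1_p2", "VF0002", "45.0", "180", "90", "4",
                   "1", "180", "3", "182", "1e-12", "75"]),
    ]
    for hit in filter_hits(hits):
        print(hit.split("\t")[:2])
```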

Robert Schaefer, University of Minnesota, United States
James Mickelson, University of Minnesota, United States
Molly McCue, University of Minnesota, United States

Although microRNAs (miRNAs) were first identified nearly 30 years ago, our understanding of their central role as post-transcriptional regulators remains incomplete. Over 2,600 canonical or mature miRNAs have been identified in the human genome, where they are pivotal in regulating gene networks spanning nearly all biological processes. As such, dysregulation of miRNAs has been associated with numerous diseases. Further, the complexity and diversity of miRNA-based gene regulation was recognized with the discovery of miRNA isoforms (isomiRs), which may possess tissue-specific expression and an altered targetome. While much progress has been made in identifying and characterizing miRNAs in humans, the same cannot be said across agricultural species, and in particular for the horse. To address this deficiency, we have generated a gene expression catalog of all equine microRNAs and isoforms based on nearly 30 tissues from a cohort of 12 healthy American Quarter horses. A publicly available, containerized pipeline was developed, employing the most current computational approaches to identify and quantify known canonical miRNAs and associated isomiRs, and to predict novel miRNAs in a tissue-specific manner. Preliminary analyses have demonstrated the predominance of isomiRs in healthy tissue, including significant differential expression between tissue types, and their potential importance as components of co-expression networks. Additionally, hundreds of unannotated novel miRNAs have been predicted, bringing the number of equine miRNAs closer to that of humans. The significance of this study is two-fold: it provides both a resource for interrogating miRNA expression in healthy tissues and a tool for processing and analyzing small RNA-seq data via the containerized pipeline.

Cassidy Andrasz, California Polytechnic State University, United States
Nicholas Zarate, California Polytechnic State University, United States
Crow White, California Polytechnic State University, United States
Jean Davidson, California Polytechnic State University, United States
Paul Anderson, California Polytechnic State University, United States

DNA sequencing is affordable and accessible, resulting in a rapid expansion in the number of available assembled genomes and transcriptomes. Non-model organisms have unique qualities and characteristics which may help bridge gaps in current biological understanding. However, they are intrinsically complex and difficult to study because they lack established references, have unpredictable genome composition, and may require extensive protocol optimization to overcome various sequencing inhibitors. The Kellet’s whelk, Kelletia kelletii, is a non-model marine gastropod which has proven to be an extremely difficult organism to sequence. However, it is a fascinating keystone species with interesting population dynamics possibly related to climate change. We propose overcoming the complexities of this non-model organism genome by creating a draft genome assembly using three sequencing technologies (PacBio, Illumina, and Oxford Nanopore Technologies), and a draft transcriptome assembly using Illumina. Multiple de novo genome and transcriptome assemblers were employed and benchmarked to yield the most complete assemblies. Metrics for comparison included draft assembly quality and assembler performance. The Kellet’s whelk genome and transcriptome will provide a model for future studies evaluating range expansion associated with a changing climate in coastal marine invertebrates. Additionally, this study proposes a pipeline for improved de novo genome assembly in non-model species. Opening the discussion on how to improve sequencing and assembly in non-model species will guide researchers wanting to explore de novo genomes and transcriptomes across the tree of life.

Giuliano Costa, University of Colorado, Boulder, United States
Casey Greene, CU Anschutz, United States

Epigenetic mechanisms guide the progression of immature to mature cell types. One such mechanism is chromatin accessibility, which is highly dynamic and directly influences gene expression by modulating the accessibility of specific genes. Several studies have explored the relationship between chromatin state and expression. These studies focus either on the temporal aspect of a single datatype or on a single timepoint with joint measurements. Both methods and data for exploring chromatin state and expression over time remain limited. However, a recent Multiome dataset from the Multimodal Single-Cell Integration Kaggle competition has a wealth of data in both the time and paired 'omics dimensions. Since this dataset has paired measurements at single-cell resolution, we can rephrase the goal of identifying the epigenetic mechanisms that lead to changes in gene expression and cell maturation as a regression task. The task is to predict gene expression from chromatin accessibility using biologically interpretable features at single-cell resolution. Traditional ATAC-Seq features are extracted using Latent Semantic Indexing, which is difficult to interpret. We instead selected interpretable feature summarization techniques, such as motif scores and the use of cell-type-specific regions. We implemented multiple time-point-agnostic baseline regression methods: dummy, ridge, MLP, and deep neural networks. Using Spearman correlation as our evaluation metric, we found that we could not yet beat the baseline dummy regressor (correlation: 0.68) with any of the feature sets. We believe that adding temporal features will allow for a more accurate characterization of how accessibility affects expression.
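
To make the baseline comparison concrete, the sketch below scores a dummy regressor with Spearman correlation. It is an illustration under assumptions that the abstract does not confirm: the dummy regressor is taken to predict the per-gene mean of the training cells, and the correlation is computed per cell across genes. The function names and the tiny expression matrices are hypothetical.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5 if vx and vy else 0.0


def spearman(x, y):
    """Spearman correlation is the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def dummy_predict(train_cells):
    """Predict every cell's expression as the per-gene training mean."""
    n = len(train_cells)
    return [sum(cell[g] for cell in train_cells) / n
            for g in range(len(train_cells[0]))]


if __name__ == "__main__":
    train = [[1.0, 5.0, 9.0], [2.0, 6.0, 10.0]]   # cells x genes, hypothetical
    test_cell = [0.0, 4.0, 12.0]
    print(spearman(dummy_predict(train), test_cell))
```

Under these assumptions, a dummy regressor can score well whenever the rank order of genes is broadly similar across cells, which suggests one reason such a baseline may be hard to beat.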

Sayed Mehedi Azim, Rutgers University, United States

The use of therapeutic peptides for the treatment of cancer has received tremendous attention in recent years. Anticancer peptides (ACPs) are considered new anticancer drugs which have several advantages over chemistry-based drugs including high specificity, strong tumor penetration capacity, and low toxicity level for normal cells. Due to the rise of experimentally verified bioactive peptides, several in silico approaches became imperative for the investigation of the characteristics of ACPs. In this paper, we proposed a new machine learning tool named iACP-RF that uses a combination of several sequence-based features and an ensemble of three heterogeneously trained Random Forest classifiers to accurately predict anticancer peptides. Experimental results on the Anticancer dataset show that our proposed model achieves an accuracy of 75.9% which outperforms other state-of-the-art methods by a significant margin. We also achieve 0.52, 75.6%, and 76.2% in terms of Matthews Correlation Coefficient (MCC), Sensitivity, and Specificity, respectively. iACP-RF as a standalone tool and its source code are publicly available at: https://github.com/MLBC-lab/iACP-RF.

Varsha Sreekanth, University of Colorado, United States
Hunter Carroll, University of Colorado, United States
Teemu Laajala, University of Turku, Finland
Svitlana Tyekucheva, Harvard University, United States
James Costello, University of Colorado, United States

Estimating the proportion of tumor cells in a bulk tissue sample (tumor purity) is crucial to accurately interpret molecular profiles, which impacts downstream pathway analysis and clinical decisions. Multiple algorithms exist to assess tumor purity, but it has been shown that published methods frequently fail to recapitulate expert pathologist estimates in prostate cancer (PCa) (Pearson's R between pathology and published methods = 0.13 to 0.39; Haider et al. 2020). To address this issue, we developed PROSTIMATE, which builds on the ESTIMATE algorithm to optimize tumor purity calculations in the specific context of PCa. ESTIMATE, developed by Yoshihara et al., infers the proportions of stromal, immune, and tumor cells in a biopsy using single-sample gene set enrichment analysis (ssGSEA) of cell type-specific transcriptional profiles. While ESTIMATE works well in some tumor types, we observe that it does not reliably predict cellularity in PCa, where non-cancerous epithelial cell content is significant. We hypothesize that retraining the ESTIMATE algorithm using PCa datasets and including a non-cancer epithelial cell transcriptional signature from the adjacent normal prostate tissue improves predictions of PCa tumor purity. We derived novel transcriptional signatures for stromal, immune, and non-cancer epithelial cells using gene expression data from PCa primary tumors, benign prostate hyperplasia, and adjacent normal tissue. We used ssGSEA to apply these signatures to publicly available PCa gene expression data to generate PROSTIMATE predictions, which we compared against matched pathologist estimations to show improved prediction accuracy compared to ESTIMATE and other published algorithms.

Carolina Heimann, Institute for Systems Biology, United States
Ilya Shmulevich, Institute for Systems Biology, United States
David Gibbs, Institute for Systems Biology, United States
Vesteinn Thorsson, Institute for Systems Biology, United States
Andrew Lamb, Sage Bionetworks, United States
Yooree Chae, Sage Bionetworks, United States
Amy Heiser, Sage Bionetworks, United States
Dante Bortone, University of North Carolina, United States
Benjamin Vincent, University of North Carolina, United States
Sarah Dexheimer, University of North Carolina, United States
Steven Vensko, University of North Carolina, United States

The Cancer Research Institute (CRI) iAtlas (www.cri-iatlas.org) is a platform for interactive data exploration and discovery in immuno-oncology (IO), originating in a pan-cancer working group study by The Cancer Genome Atlas. At present, iAtlas provides 17 analysis modules to explore immune-cancer interactions, immunotherapy treatment, and outcomes in over 12,000 participant samples. To meet evolving demands of data heterogeneity and volume, we have built a scalable, cloud-hosted relational database and a GraphQL-based API layer — both of which are driven by a thoroughly documented and standards-compliant data model. We provide an R client library for bioinformaticians and analysts who wish to programmatically explore and access data. The app itself (built using the Shiny framework) leverages the API client for data query and retrieval to drive visualizations and other functionality, including a rich selection of filters and conditions for working with custom cohorts. We continue to expand the breadth of data in iAtlas, including new immune checkpoint inhibition (ICI) trials with accompanying outcome data, as well as large-scale cancer -omics datasets such as the Pan-Cancer Analysis of Whole Genomes and the Human Tumor Atlas Network. For each, we have robust pipelines for data processing, encoded as CWL or Nextflow workflows — all of which are fully open and reusable. We aim to make iAtlas a platform that both democratizes analysis for IO researchers and enables contributions from tool developers or data scientists. Everything we build is intended to be as FAIR as possible to advance and accelerate discovery in combating cancer.

Gamze Gursoy, Columbia University, United States

Precision medicine has the potential to provide more accurate diagnosis, appropriate treatment and timely prevention strategies by considering patients’ biological makeup. However, this cannot be realized without integrating clinical and omics data in a data sharing framework that achieves large sample size. Due to their distinct data types and privacy and data ownership issues, integrated clinical and genetic data systems are lacking, leading to missed opportunities. Here we present a secure framework that harmonizes storage and querying of clinical and genomic data using blockchain technology. Our platform combines clinical and genomic data under a unified framework using novel data structures. It supports combined genotype-phenotype queries, gives institutions decentralized control of their data, and provides user access logs, improving transparency into how and when health information is used. We demonstrate the value of our framework for precision medicine by creating genotype-phenotype cohorts and examining relationships within them. We show that combining data across institutions using our secure platform increases statistical power, enabling discovery of novel connections between genetics and clinical observations in Amyotrophic Lateral Sclerosis. Overall, by providing an integrated, secure and decentralized framework, we envision more communities can participate in data sharing to advance medical discoveries and enhance reproducibility.

James Costello, University of Colorado Anschutz Medical Campus, United States
Charlene Tilton, University of Colorado Anschutz Medical Campus, United States
Robert Jones, University of Colorado Anschutz Medical Campus, United States
Nathaniel Xander, University of Colorado Anschutz Medical Campus, United States
Dan Theodorescu, Cedars Sinai Medical Center, United States

The standard of care for eligible patients with muscle-invasive bladder cancer (MIBC) is cisplatin-based neoadjuvant chemotherapy followed by radical cystectomy. However, platinum-based treatments leave up to 70% of MIBC patients with residual disease, and the 5-year survival rate for these individuals is under 30%. Understanding cisplatin resistance mechanisms will improve MIBC treatment by developing chemotherapy response biomarkers and precision medicine strategies. We previously identified NPEPPS, an M1 aminopeptidase, as a novel mediator of platinum drug response through its regulation of volume-regulated anion channels (VRACs), which control platinum drug import. We have shown that NPEPPS is upregulated across 5 cisplatin-resistant human MIBC cell lines, and that loss of NPEPPS expression from resistant cell lines is sufficient to restore normal levels of platinum uptake and improve sensitivity. These findings point to NPEPPS as a novel therapeutic target to improve platinum-based drug response rates in patients with MIBC, but the mechanisms driving increased NPEPPS expression in response to platinum-based drugs remain unknown. We recently used an integrative bioinformatics approach to generate and prioritize a list of transcription factors (TFs) that may respond to cisplatin and induce differential NPEPPS mRNA expression. The next steps will be to suppress the top ten prioritized TFs, then treat with cisplatin and measure whether the normal induction of NPEPPS is blocked in the context of TF knockdown. This will provide a novel characterization of cisplatin-induced transcription factor activity and improve our understanding of the upstream regulatory pathways controlling NPEPPS expression in the context of cisplatin treatment.

Larry Hunter, University of Colorado, United States
Dan Denman, University of Colorado, United States

Visual perception is a complex phenomenon. The brain must process sensory input, extract relevant features, and generate a behavioral response. Each step involves intricate neural computations requiring the coordinated action of large groups of neurons distributed across several brain areas. These interactions are governed by both the physical connectivity and functional connectivity of the areas. Anatomical connections provide a substrate for signal transmission, but how information flow is regulated across this substrate is a topic of much debate. Recent work in the mammalian visual system has shown that subpopulations within an upstream source area appear to drive responses in downstream target areas. The activity of these subpopulations defines an m-dimensional hyperplane in the source area n-dimensional activity space (m < n). Additional work has validated the importance of subspaces to inter-areal communication and suggested that inter-population subspaces can be dynamically reoriented or reshaped, for example by stimulus onset or attention. Little is known about how subspaces behave under different contexts, such as different stimulus classes. Characterizing subspace dynamics under different stimulus classes is a key next step in understanding how representations of neural space are transformed between brain areas. Using multi-task learning in an artificial neural network model, I assess whether different stimulus classes are represented by the same subspace and how stimulus classes are related in representation space.

James Costello, University of Colorado Anschutz Medical Campus, United States
Larry Hunter, University of Colorado Anschutz Medical Campus, United States

People with Down syndrome (DS) experience multiple co-occurring conditions, which can include but are not limited to leukemia, dementia, endocrine abnormalities, autoimmune disorders, and obesity. Many of these conditions are managed with pharmacotherapy. However, systemic dysregulation due to trisomy 21 may alter drug disposition and response and lead to adverse reactions. A method for accurately predicting adverse reactions in people with DS would increase the safety of prescribing drugs in this population. Biomedical knowledge graphs compile complex interactions from biological databases into a relational structure. These graphs can include edges between proteins and diseases, drugs and proteins, as well as drugs and adverse reactions. Knowledge graphs are inherently incomplete and do not contain edges for specific conditions. For example, pediatric patients with DS and leukemia experience higher rates of toxicities like cardiotoxicity, myelosuppression, and mucositis when treated with methotrexate or thioguanine, yet those relationships are not included in the BioKG knowledge graph. Here we present a novel framework for predicting adverse drug reactions in the context of DS. Our approach integrates the BioKG knowledge graph with DS transcriptomes by combining latent embeddings learned using a Graph Variational Autoencoder (GraphVAE) and a VAE, respectively. The combined embeddings are decoded into a simulated graphical representation using the trained GraphVAE decoder. Pathways between diseases, targets, drugs, and adverse reactions in the simulated graph represent DS-specific predictions. The framework we have developed has led to new insights into DS-specific adverse events and provides an approach to explore adverse events in other genetic diseases.

Michael Strong, National Jewish Health, United States

Small changes to the bacterial genome can result in major changes to the appearance and behavior of an infecting bacterial population. In the bacterium Mycobacterium abscessus (MAB), the transition from a smooth to a rough colony morphology correlates with a more aggressive infection and poorer patient outcomes. Previous work has implicated disruptions to the glycopeptidolipid (GPL) synthesis genes and differences in the methylome as potential causes. Tools exist to compare genome sequences; however, custom scripts and manual work are required to associate sequence differences with particular annotations, and the results are difficult to reproduce. Here, we present a reference-free genome comparison tool called Kable. Kable uses a colored de Bruijn graph structure and index to query multiple bacterial genomes for variation in sequence, annotation boundaries, and methylation status. As a proof of concept, we applied Kable to four pairs of clonal smooth/rough MAB isolate genome assemblies using the corresponding gene annotation and methylation metadata. Kable identified the sequence variants and automatically found the gene that harbored or was adjacent to the variant. In all four pairs, the rough isolate contained a variant in the GPL locus that disrupted a gene, which could then not be annotated. Even though the annotation was missing in the rough isolate, Kable could still determine which gene was disrupted and detect the mutation. With Kable, we present the first methylome maps of smooth/rough pairs of clonal MAB isolates and show that methylomes were more similar within pairs than between isolates of the same morphotype.
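The idea behind a colored de Bruijn graph can be sketched in a few lines: every k-mer is a node, and each node carries the set of genomes ("colors") in which it occurs. This is a minimal illustration, not Kable's implementation; the sequences, the labels, and the function names are invented, and the real tool additionally indexes annotations and methylation status.

```python
from collections import defaultdict

def colored_kmers(genomes, k):
    """Map each k-mer to the set of genome labels (colors) that contain it."""
    colors = defaultdict(set)
    for label, seq in genomes.items():
        for i in range(len(seq) - k + 1):
            colors[seq[i:i + k]].add(label)
    return colors

def private_kmers(colors, label):
    """Return the sorted k-mers found only in the genome with this label."""
    return sorted(kmer for kmer, labels in colors.items() if labels == {label})

# Two made-up sequences that differ by a single base (A -> T at position 5).
genomes = {
    "smooth": "ACGTACGGA",
    "rough":  "ACGTTCGGA",
}

colors = colored_kmers(genomes, k=4)
rough_only = private_kmers(colors, "rough")
smooth_only = private_kmers(colors, "smooth")
```

A single-base difference produces k private k-mers in each genome (a "bubble" in the graph), flanked by k-mers that carry both colors. Because the comparison works on shared and private k-mers, no reference genome is needed to locate the variant.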

Casey Greene, University of Colorado Anschutz Medical Campus, United States
Jennifer Doherty, University of Utah, United States
Stephanie Hicks, Johns Hopkins University, United States
Lukas Weber, Johns Hopkins University, United States

Bulk RNA-seq is an efficient, scalable method for profiling gene expression in a tissue, but it loses information about the cell type composition of the tissue sample. Deconvolution allows for the computational estimation of cell type proportions in bulk samples. Many deconvolution software packages have been created in recent years, with single-cell RNA-seq allowing for more precisely defined reference cell type expression profiles. However, many deconvolution methods were designed for normal tissue and have not been benchmarked on a cancer dataset. Also, experimental design decisions can introduce bias in deconvolution results. We have generated a novel dataset of high-grade serous ovarian tumors, with paired expression profiles from single-cell and bulk methods, as a benchmark designed to directly assess the impacts of measurement technology on bulk assays. We assess the viability of pooling samples for single-cell sequencing and subsequently demultiplexing to identify samples of origin for all cells, including cancer cells. We identify how the dissociation process introduces specific cell type biases that can impede accurate deconvolution of ovarian tumors. We also outline how the difference in mRNA enrichment method (typically ribosomal RNA depletion in bulk RNA-seq and poly-A capture in scRNA-seq) introduces discrepancies between the two data types. In benchmarking existing deconvolution methods based on their robustness to these technical biases, we present best-practice recommendations for scientists looking to use deconvolution to study tumor heterogeneity in bulk RNA-seq datasets.

Emi Ford, Brigham Young University, United States
Stephen Piccolo, Brigham Young University, United States

Classification algorithms are useful for training computers to discriminate between groups. However, no single algorithm is best for every dataset. Accordingly, researchers have explored ways to combine algorithms using multiple-classifier systems. Two well-known packages that support such learning are AutoSklearn and Dynamic Ensemble Selection Library (DESLib). Typically, the more complex the algorithm, the longer it takes to run. Thus, when working with larger biological datasets, researchers must consider the tradeoff between time spent classifying and the accuracy of the predictions. We compared 4 traditional classification algorithms, 4 variations of AutoSklearn, 4 algorithms within DESLib, and 3 prediction-combination approaches (majority vote, mean probability, and extreme probability). We applied each method to 6 biological datasets that each have 2 classes and vary in the number of observations (100-19000), features (4-1000), and imbalance ratios (1:1-1:5). Across all datasets, AutoSklearn had better predictive performance than all other classifiers but required at least 4 times as much computational time, while Random Forest performed nearly as well as AutoSklearn and was fast to execute. Algorithms from DESLib performed similarly to the prediction-combination approaches: relatively fast, with high or moderate predictive ability. We conclude that when time is not a major concern, researchers should use AutoSklearn for classification. However, when datasets are large or decisions are needed quickly, the Random Forest algorithm should be considered a competitive alternative.
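The three prediction-combination approaches can be sketched for a single observation as follows. Majority vote and mean probability follow their usual definitions. The abstract does not define "extreme probability", so the version here (keep the base prediction farthest from 0.5) is an assumption, as are the function names and the example probabilities.

```python
def majority_vote(probs, threshold=0.5):
    """Return class 1 if more than half of the base classifiers predict class 1."""
    votes = sum(p >= threshold for p in probs)
    return 1 if votes * 2 > len(probs) else 0  # ties resolve to class 0

def mean_probability(probs):
    """Average the class-1 probabilities of the base classifiers."""
    return sum(probs) / len(probs)

def extreme_probability(probs):
    """ASSUMED definition: keep the most confident prediction (farthest from 0.5)."""
    return max(probs, key=lambda p: abs(p - 0.5))

# Hypothetical class-1 probabilities from three base classifiers for one sample.
probs = [0.55, 0.60, 0.05]

vote_label = majority_vote(probs)                          # two weak "yes" votes win
mean_label = 1 if mean_probability(probs) >= 0.5 else 0    # averaged probability
extreme_label = 1 if extreme_probability(probs) >= 0.5 else 0  # one confident "no"
```

The example is chosen so that the rules disagree: two weakly positive classifiers win the vote, while one strongly negative classifier dominates both the mean and the extreme-probability rule.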

Jay Shendure, University of Washington, United States

As single-cell sequencing atlases are increasing in size and complexity, community input and domain expertise are essential to extract full value from these datasets after data release or publication. To facilitate this, we created a framework to support the interactive web-based exploration of single-cell atlases comprising millions of cells. This lightweight framework includes a pipeline to create back-end databases that store single-cell datasets in an easily searchable fashion. The front-end web page connects with the database and allows interactive exploration of the single-cell datasets. Specifically, the pages allow easy selection of cell groups of interest, fast plotting of gene expression and metadata on 3-dimensional or 2-dimensional embedded spaces as interactive scatter plots, and display of summary statistics of the data. Examples of our framework include interactive browsers that allow exploration of mutant-specific effects on gene expression in trajectories and sub-trajectories in a mouse mutant cell atlas (https://atlas.gs.washington.edu/mmca_v2/), and exploration of time point-specific gene expression and motif enrichment scores in a Drosophila embryonic development cell atlas (https://atlas.gs.washington.edu/deap_v2/), both with over 1.5 million cells. A similar server supporting exploration of a >10 million cell dataset is under development.

Yiyan Yang, NLM, United States

Gastrointestinal (GI) tracts are colonized with abundant and complex bacterial communities that are often distinct from free living communities of bacteria in the environment. The genetic features and the molecular mechanisms responsible for niche differentiation and host GI tract preference in these bacteria have been understudied, particularly on larger scales. In this study, we developed Evolink, a phylogeny-aware tool for the rapid identification of genotype-phenotype associations across large-scale microbial datasets. Evolink was applied to over 30,000 bacterial species from the Genome Taxonomy Database with habitat metadata annotated in publicly available microbial databases. We identified genes positively associated with GI adaptation shared by Bacteroidota and Firmicutes including genes that are involved in responding to host oxidative immune responses, acid-resistance, biofilm formation, and xenobiotic/endobiotic metabolism. Negatively associated genes were often related to non-host environmental stress response such as chemotaxis proteins and superoxide dismutases. We found that genes related to quorum sensing may contribute to GI tract adaptation in Bacteroidota. Additionally, loss of flagellar-related genes in the Firmicutes phylogeny was found to coincide with the emergence of GI-associated clades, indicating that their loss may have a role in GI tract adaptation within this phylum. Overall, our study presents a systematic analytical tool for identifying genotype-phenotype associations and provides a global view of the adaptive strategies employed by different GI-associated bacterial phyla, thus providing a rich resource for further study of this topic.

Age and sex are historically understudied factors in biomedical studies even though many complex traits and diseases vary by these factors in their incidence and presentation. Hence, there is a critical need for analytical frameworks that can aid scientists in systematically bridging gaps in understanding age- and sex-specific genetic and molecular mechanisms. Hundreds of thousands of publicly available gene expression profiles present an invaluable, yet untapped, opportunity for addressing this need. However, the bottleneck is that a vast majority of these profiles do not have age and sex labels. Therefore, we first curated ~30,000 samples associated with age and sex information and then trained machine learning (ML) models to predict these variables from gene expression values. Specifically, we trained one-vs-rest logistic regression classifiers with elastic-net regularization to classify transcriptome samples into age groups separately for females and males. Overall, the classifiers are able to discriminate between age groups in a biologically meaningful way in each sex across technologies. The weights of these predictive models also serve as ‘gene signatures’ characteristic of different age groups in males and females. We also inferred genome-wide sex-biased genes within each age group. Enrichment analysis of these gene signatures helped us identify age- and sex-associated multi-tissue and pan-body molecular phenomena (e.g. general immune response, inflammation, metabolism, hormone response). Our curated dataset, gene signatures, and enrichment results will be valuable resources to aid scientists in studying age- and sex-specific health and disease processes.

Victoria Leventman, University of North Florida, United States
Samuel Mikell, University of North Florida, United States
Daniel Tatem, University of North Florida, United States
Julia Gabel, University of North Florida, United States
Alexander Bartkowiak, University of North Florida, United States
Marie Mooney, University of North Florida, United States

The growing number of published biomedical articles stored in literature databases is far outpacing the rate at which bio-researchers can manually examine and annotate the literature to best meet their research needs. The curation of articles relevant to various biological subjects supports further research and fosters new developments and hypotheses as biologists enhance their domain knowledge. In this work in progress, we are investigating the feasibility of biocuration for the neuropharmacology field, with a focus on the model organism zebrafish (Danio rerio), and its potential to be utilized for automated annotations. The study first verified that literature related to genes in the GABA pathway, glutamate transmitter, the drug Ivermectin, and zebrafish was readily available in biomedical literature databases such as PubMed and PubMed Central. Then we developed a novel relevancy score metric by incorporating co-occurrences of query terms and their frequencies, especially focusing on the relationship between GABA and Ivermectin drug/glutamate. After selecting the top 15 abstracts from this new ranking, a biologist manually annotated them and compared them with the results of two standard baseline retrieval methods to determine whether the new method was able to provide more relevant results. We are finding that the new rankings are qualitatively better and that they are introducing new potential topics of research and directions around GABA. With information retrieval and annotated literature, our method can form the foundation for a knowledge base on neuroactive drug discovery that is made available to bio-researchers.

Steven D. Leavitt, Brigham Young University, United States

While DNA barcoding as a method for identifying species has gained popularity due to its ease of implementation and relative cost efficiency, it is limited by poor representation of many understudied organismal groups. This results in low success of species identification when using general databases. These limitations are notable for lichens – a group of organisms that are difficult to identify but important for biomonitoring and conservation. We propose that developing a focused, regional DNA database will improve identification success through DNA barcoding. To test this, we developed a custom regional lichen DNA database of the Intermountain West, USA, comprising more than 600 species with 4862 ITS-region DNA sequences. Using both bulk and vouchered lichen community samples collected from sites across the Intermountain West, we generated more than 180,000 DNA sequences, combined these into 678 OTUs and assessed species identification rates (defined as successfully reaching the taxonomic level of species) with both our custom database and the standard UNITE database. We show that our regional database successfully identifies more species, on average, than UNITE. This trend is present whether considering identification within taxonomic families or within habitats. We suggest that efforts to improve DNA barcoding identification success should be more focused on building comprehensive regional databases, which will, in turn, allow general databases to improve identification success.

Skip Garner, Orbit Genomics, United States

The clinical significance of microsatellites – a type of repetitive DNA – is hard to pin down. On one hand, mutations in tri-nucleotide microsatellites are sufficient to cause diseases such as Machado-Joseph, Huntington’s, and various ataxias. On the other hand, microsatellites have been dismissed from most studies of complex diseases. For over a decade the Dark Matter Lab (Skip Garner, PI) has made inroads by repurposing techniques developed for single nucleotide polymorphisms (SNPs); published results have identified new susceptibility loci for lung cancer, breast cancer, and medulloblastoma. These discoveries recently helped secure funding for Orbit Genomics: a Colorado-based company focused on developing an aid to diagnosis to confirm a positive low dose CT scan. The results of our first study have revealed 277 candidate loci for lung cancer. The candidates include 23 loci in coding regions and recapitulate well-established gene associations such as ARID1B and REL. The set of 277 loci identifies lung cancer with high sensitivity (.90) and specificity (.88). Primers designed for amplicon sequencing capture 98.26% of the candidate regions and provide a route to low-cost validation studies. Up and coming techniques will increase sensitivity and ensure reproducibility with minor alleles and neural networks.

Bradley Bowles, Mayo Clinic, United States
Rory Olson, Mayo Clinic, United States
Chris Schmitz, Mayo Clinic, United States
Gavin Oliver, Mayo Clinic, United States
Karl Clark, Mayo Clinic, United States

A rare genetic disease affects fewer than 200,000 individuals in the United States. Clinical exome and genome sequencing have improved the diagnosis rates for patients afflicted with these rare genetic diseases, and yet most of these patients remain undiagnosed. Innovations in multiomic approaches and DNA analytics have improved outcomes for these patients. Here, we present our initial work examining another relatively unexplored area of the genome, upstream open reading frames (uORFs). We collected pathogenic 5’UTR variants from the Human Gene Mutation Database and benign variants from ClinVar. We identified variant interpretation annotations (such as CADD, DANN, or population allele frequency) that best correlated with variant pathogenicity and an additional set of annotations that correlated specifically with 5’UTR variants altering mRNA translational efficiency. We used this to screen a set of 777 undiagnosed rare disease patients for deleterious 5’UTR variation and functionally characterized the variants using luciferase assays. Allele frequency, CADD score, and DANN are moderate predictors of 5’UTR variant pathogenicity status, while a variant’s predicted effect on 5’UTR regions and predicted change in transcript ribosome binding ability are associated with translationally disruptive 5’UTR variation. Using these insights, we identified uORF-disrupting 5’UTR variants predicted to alter gene expression in our rare disease patient cohort and experimentally assessed this using in vitro luciferase constructs. Based on this, we establish a prototype process for identifying pathogenic 5’UTR variation and functionally assessing the effect of this variation. These findings can be used to efficiently prioritize 5’UTR variation for manual review of functional characterization.

Chronic obstructive pulmonary disease (COPD) is a disease that causes airflow obstruction, which causes difficulty in breathing, and it is currently the fourth leading cause of mortality in the United States. High-throughput technologies have been used to generate data at the molecular level for COPD subjects in the blood to help identify biomarkers that predict disease severity and progression. In addition to individual biomarkers, understanding relationships between molecular features in the context of COPD can give insight into biological mechanisms and move away from the reductionist approach of studying single genes or proteins at a time. Constructing multi-omics networks with respect to traits of interest gives an opportunity to understand relationships in a pathway, but integration is challenging because it involves multiple data sets with high dimensionality. SmCCNet has been implemented to construct pairwise multi-omics network modules by integrating proteins, metabolites, and COPD-related phenotypes. However, true biological mechanisms may be more complicated than pairwise relationships. Instead, higher-order relationships may be associated with COPD-related phenotypes, and these can be practically represented by a hypergraph (in which an edge can connect more than two nodes) rather than a regular graph. To explore the complex biological structure between molecular features, we implemented a novel tensor hypergraph-based multi-omics network analysis pipeline on data from the COPDGene cohort for network inference towards COPD traits using transcriptomics, proteomics, and metabolomics. Network modules are interpreted with various downstream analyses, including their association with the COPD phenotypes and pathway enrichment. Additionally, the network summarization score will be extracted to conduct mediation analysis.
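The difference between a hypergraph and a regular graph can be made concrete with a small sketch. It is only an illustration of the data structure, not the tensor-based pipeline described above; the node and module names are invented.

```python
from itertools import combinations

# Each hyperedge is a set of molecular features that act together (made-up names).
hyperedges = {
    "module_1": frozenset({"gene_A", "protein_B", "metabolite_C"}),
    "module_2": frozenset({"protein_B", "metabolite_D"}),
}

def incidence(hyperedges):
    """Return the sorted node list and one 0/1 membership row per hyperedge."""
    nodes = sorted(set().union(*hyperedges.values()))
    rows = {
        name: [1 if node in members else 0 for node in nodes]
        for name, members in hyperedges.items()
    }
    return nodes, rows

def clique_expansion(hyperedges):
    """Flatten hyperedges into pairwise edges (what a regular graph would store)."""
    pairs = set()
    for members in hyperedges.values():
        pairs.update(combinations(sorted(members), 2))
    return pairs

nodes, rows = incidence(hyperedges)
pairwise = clique_expansion(hyperedges)
```

The pairwise expansion of `module_1` yields three separate edges and no longer records that the three features were observed as one joint relationship, which is the information a hypergraph retains.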

Understanding the genetic basis of human diseases and their associated genes is a vital task. Prior works using a differential gene expression approach in controlled studies have paved the way towards identifying genes related to the disease of interest. Despite a plethora of publicly available gene expression datasets, systematically studying human diseases using all these data is challenging because (1) gene expression measurements are inherently noisy and (2) only a few gene expression datasets have clear disease annotations. Here, we propose a graph signal processing approach to uncover disease genes and infer disease-study correspondence simultaneously. In particular, we consider each gene expression sample as a signal over the graph defined by functional gene interactions, which allows us to filter and denoise the expression data. The filtered gene expression samples are then used to train a linear model that aims to recapitulate disease-associated genes. For each disease, we consider the gene expression studies that best predict the corresponding disease genes to be related to the disease. Our evaluation results indicate that the proposed approach can recover previously known disease-study associations. Moreover, we show that different frequency information resulting from graph signal filtering serves different purposes; while low-frequency information excels at disease gene predictions, high-frequency information helps better uncover disease-study correspondence.
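Treating an expression sample as a signal on a gene network can be illustrated with one of the simplest graph filters, a single diffusion step with the graph Laplacian L = D - A. The specific filter used in the work is not given in the abstract, so the choice of filter, the step size `alpha`, and the toy network are all assumptions.

```python
def laplacian_apply(adj, x):
    """Return L x for the combinatorial Laplacian L = D - A of a dense adjacency."""
    return [
        sum(adj[i]) * x[i] - sum(a * xj for a, xj in zip(adj[i], x))
        for i in range(len(x))
    ]

def low_pass(adj, x, alpha=0.25, steps=1):
    """Smooth the signal over the graph: x <- x - alpha * L x, repeated `steps` times."""
    for _ in range(steps):
        lx = laplacian_apply(adj, x)
        x = [xi - alpha * li for xi, li in zip(x, lx)]
    return x

def high_pass(adj, x, alpha=0.25, steps=1):
    """Keep what the low-pass filter removed: the part that differs from neighbors."""
    smooth = low_pass(adj, x, alpha, steps)
    return [xi - si for xi, si in zip(x, smooth)]

# Toy network: three genes in a chain (gene 0 - gene 1 - gene 2).
adj = [[0, 1, 0],
       [1, 0, 1],
       [0, 1, 0]]
expression = [1.0, 5.0, 1.0]  # made-up values; the middle gene is an outlier

low = low_pass(adj, expression)    # neighbors pull the outlier toward them
high = high_pass(adj, expression)  # the locally unusual component
```

In this framing, the low-frequency component carries expression shared among interacting genes and the high-frequency component carries sample-specific deviations, which matches the distinction the abstract draws between the two.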

Kayla Johnson, Michigan State University, United States
Sneha Sundar, Michigan State University, United States
Renming Liu, Michigan State University, United States
Hao Yuan, Michigan State University, United States
Arrjun Krishnan, University of Colorado-Denver Anschutz Medical Campus, United States

Network-based machine learning is a powerful approach for leveraging the cellular context of genes to computationally predict novel/under-characterized genes that are functionally similar to a set of known genes of interest. One powerful network-based gene classification method that is gaining popularity is to use supervised learning algorithms where the features for each gene are determined by that gene’s connections in a molecular network. In this work, we explore how networks from multiple species can be jointly leveraged to improve this gene classification method. We first build multi-species networks by connecting nodes (genes/proteins) in different species if they belong to the same orthologous group. Then, we create feature representations by directly considering a gene's connection to all other genes in the entire multi-species network or considering a low-dimensional embedding for the entire network. We find that adding information across species improves performance for the tasks of predicting human and model species gene annotations across a set of non-redundant gene ontology biological processes. In addition to providing better predictions, we show how this approach casts genes across species into the same “space” where they can be used to improve how knowledge is transferred from one species to another.

The ability to identify and track T cell receptor sequences from patient samples is becoming central to the field of cancer research and immunotherapy. Tracking genetically engineered T cells expressing TCRs that target specific tumor antigens is important to determine the persistence of these cells and quantify tumor responses. The available high-throughput method to profile T cell receptor repertoires is generally referred to as TCR sequencing. However, the available TCR-Seq data is limited compared to RNA sequencing (RNA-Seq). In this paper, we have benchmarked the ability of RNA-Seq-based methods to profile TCR repertoires by examining 19 bulk RNA-Seq samples across four cancer cohorts including both T cell rich and poor tissue types. We have performed a comprehensive evaluation of the existing RNA-Seq-based repertoire profiling methods using targeted TCR-Seq as the gold standard. We also highlighted scenarios under which the RNA-Seq approach is suitable and can provide comparable accuracy to the TCR-Seq approach. Our results show that RNA-Seq-based methods are able to effectively capture the clonotypes and estimate the diversity of TCR repertoires, as well as provide relative frequencies of clonotypes in T cell rich tissues and low diversity repertoires. However, RNA-Seq-based TCR profiling methods have limited power in T cell poor tissues, especially in highly diverse repertoires of T cell poor tissues. The results of our benchmarking provide an additional appealing argument to incorporate RNA-Seq into the immune repertoire screening of cancer patients as it offers broader knowledge into the transcriptomic changes that exceed the limited information provided by TCR-Seq.

Hanne H Henriksen, Copenhagen University Hospital, Denmark
Sigurður T Karvelsson, University of Iceland, Iceland
Óttar Rolfsson, University of Iceland, Iceland
Morten H Bestle, North Zealand Hospital, Denmark
Pär I Johansson, Capital Region Blood Bank, Copenhagen University Hospital, Denmark

Sepsis is a major cause of death worldwide, with a mortality rate that has remained stubbornly high. The current gold standard for risk-stratifying sepsis patients provides limited mechanistic insight for therapeutic targeting. An improved ability to predict sepsis mortality and to understand the risk factors would allow better treatment targeting. Sepsis causes metabolic dysregulation in patients; therefore, metabolomics offers a promising tool to study sepsis. It is also known that sepsis affects endothelial cells, altering their function in blood clotting and vascular permeability. We integrated metabolomics data from patients admitted to an ICU for sepsis with commonly collected clinical features of their cases and two measures of endothelial function relevant to blood vessel function, PECAM and thrombomodulin. First, we performed differential analysis of metabolites in surviving vs non-surviving patients using the LIMMA R package to account for age and gender of patients. Next, we used penalized regression, enrichment analysis, and pathway analysis to identify the features most able to predict 30-day survival. The features important to sepsis survival include TCA cycle metabolites and amino acids, as well as endothelial proteins and medical history. To understand how this relates to other clinical features, we used a combination of penalized regression and correlation analysis. This showed links between medical history and fatty acid metabolites, suggesting that pre-existing metabolic dysregulation may be a contributory factor to sepsis response. By exploring sepsis metabolomics data in conjunction with clinical features and endothelial proteins, we have gained a better understanding of sepsis risk factors.

David R. Ziehr, Massachusetts General Hospital and Harvard Medical School, United States
Clary B. Clish, Massachusetts Institute of Technology, United States
Joseph Loscalzo, Brigham and Women’s Hospital and Harvard Medical School, United States
William M. Oldham, Brigham and Women’s Hospital and Harvard Medical School, United States

The cascade of metabolic changes needed to survive in hypoxia is well studied in cancer, but there is now a recognition that similar changes contribute to nonmalignant diseases, including idiopathic pulmonary fibrosis and pulmonary arterial hypertension. To study hypoxia in this setting, we combined transcriptomics with metabolomics in three human lung cell types (endothelial cells [ECs], smooth muscle cells [SMCs], and lung fibroblasts [LFs]) cultured in normoxia or hypoxia, alone or in co-culture. Analyzing the metabolomic and transcriptomic data separately showed both expected and less expected changes. As expected, “Hallmark” gene-set enrichment analysis of differentially expressed genes in hypoxic cells showed “hypoxia” as the most enriched set. Less expectedly, metabolomics revealed fatty acid metabolism as a novel area of change; further, alterations to EC metabolites following co-culture with SMCs mimic those involved in hypoxia. To investigate the regulatory mechanisms involved in hypoxia, we employed network-based strategies to integrate the metabolomic and transcriptomic datasets. First, we mapped our differentially regulated metabolites and transcripts to the STITCH protein/chemical database and used a community detection algorithm to identify enriched KEGG pathways. Shared subnetworks relevant to the hypoxia response included known metabolic responses to hypoxia and more novel pathways involved in redox stress. Second, we used the CARNIVAL/COSMOS approach to identify the relationships between metabolite alterations and transcription factor activity, revealing the opposing roles of HIF1α and MYC. This work represents the first broad multi-omics examination of metabolic adaptations to hypoxia in human primary cells and offers an opportunity to identify and exploit pathways relevant to lung diseases.

Sarah Gehrke, CU Anschutz, United States
Anh Nguyen, CU Anschutz, United States
Anita Walden, CU Anschutz, United States
Sruthi Magesh, CU Anschutz, United States
Julie Bletz, Sage, United States
Anne Thessen, CU Anschutz, United States
Shawn O’Neil, CU Anschutz, United States
James Eddy, Sage, United States
Monica Munoz-Torres, CU Anschutz, United States
Melissa Haendel, CU Anschutz, United States
Kaitlin Flynn, Sage, United States

The most pressing biomedical challenges of our time require collaboration across disciplinary and institutional boundaries. Over the last two decades it has become clearer how to more successfully approach this; however, there are often few resources and infrastructure available to apply known team science best practices to data-intensive research. Further, methods for evaluating collaboration and the multivariate effects of individual and team characteristics on collaboration efficacy are active areas of research. In our experience on dozens of such projects (and based on years of team science research literature), the most successful programs have clear governance, a shared understanding about goals, and incentives aligned to those goals. Additionally, this healthy triad (Goals, Roles and Incentives) is supported by sound operational infrastructure. We share our experiences and resources in creating and supporting successful transdisciplinary collaborations, from strategies to building healthy collaborative communities to technological support for knowledge exchange and resource sharing. Our playbook includes collaborative agreements to foster sound governance, training and guidance around team science, and evaluation approaches to team health; together these will support the vital work of transdisciplinary science.

Heterogeneity and gene-environment interactions limit our ability to treat many complex diseases such as Alzheimer's disease. Recent Alzheimer's disease spatial proteomics data show substantial disruptions in proteins responsible for tRNA transfer and synthetase, which modify the availability of optimal and suboptimal tRNA, resulting in distinct changes in gene regulation. We modeled how regulatory regions such as ramp sequences (i.e., slowly-translated codons at the 5' end of genes that evenly space ribosomes) are impacted by changes in tRNA levels. Disrupted tRNA pools change where ribosome stalling and collisions occur, decrease translation efficiency during protein synthesis, increase mRNA degradation via ribosome-associated protein quality control, and effectively decrease both protein and transcript levels. We show that tRNA pools alone significantly alter cell-specific gene expression without changing the genetic code by impacting codon translational efficiencies and ribosome stalling (odds = 1.2072; P = 2.64 × 10⁻⁶), with population-specific effects that we present through web interfaces at https://ramps.byu.edu and https://cubap.byu.edu. We also found that genes associated with Alzheimer's disease are 1.248x more likely to have a ramp sequence in the cerebellum than genes not associated with Alzheimer's disease (P = 0.005639). Metabolic gene dysregulation in glycolytic and ketolytic pathways associated with Alzheimer's disease further impacts ramp sequences, suggesting a potential therapeutic target. In summary, changes to tRNA pools alter gene-specific ramp sequences with broad disease implications. Additional modeling of ramp sequence interactions with other regulatory elements will further improve our ability to predict how tRNA pools impact transcript and protein levels a priori.

Nicolas Matentzoglu, Semanticly, Greece
Halie Rando, University of Colorado, United States
Nicole Vasilevsky, University of Colorado, United States
Melissa Haendel, University of Colorado, United States
Zhi-Liang Hu, Iowa State University, United States
Gregoire Leroy, Food and Agricultural Organization of the United Nations, Italy
Imke Tammen, The University of Sydney, Australia
Frank Nicholas, The University of Sydney, Australia
Sabrina Toro, University of Colorado Anschutz Medical Campus, United States

In the current era of biomedical big data, advances in diagnostics and treatments can leverage a wealth of information from research and health records. This approach requires integrating and comparing data related to genotypes, phenotypes, and diseases from disparate sources. Though species agnostic, this process is currently optimized for diagnosis and treatment of human patients since the harmonized data mostly comes from human health records and animal model databases. Including all non-human animal data would improve translational research, and expand diagnostic support and treatment discovery for non-human animals. Here, we report the creation of the Vertebrate Breed Ontology (VBO) as a single source for data standardization and integration of all breed names. VBO was created using standard semantic engineering tools including the Ontology Development Kit. Breeds are added in VBO when they are recognized as such by international organizations, communities, and/or experts. VBO metadata can include common names and synonyms, country of existence, breed recognition status, domestication status, breed identifiers/name codes, reference in other databases, and description of origin. Provenance of all VBO information is recorded. Currently, livestock and cat breeds are available in VBO, with the addition of dog breeds and animals bred for laboratory purposes underway. The adoption of VBO as the source of breed names in databases and veterinary electronic health records is one step in making information more computable and consistent. This will enhance data interoperability, and support data integration and comparison, and ultimately diagnosis and treatments for both humans and other animals.

Surya Saha, Seven Bridges, United States
Timothy Putman, University of Colorado Anschutz Medical Campus, United States
Pierrette Lo, Sage Bionetworks, United States
Joaquin Espinosa, University of Colorado Anschutz Medical Campus, United States
Adam Resnick, Children’s Hospital of Philadelphia, United States
Brian OConnor, Sage Bionetworks, United States
Jack DiGiovanna, Seven Bridges, United States
Melissa Haendel, University of Colorado Anschutz Medical Campus, United States

The push to address increasingly complex problems of ever-increasing scale has driven investment in data infrastructure designed to overcome social and technical barriers between disciplines and data types. Whereas data repositories typically specialize in a single discipline or data type, data commons aim to break down these siloes to support more innovative, multi-modal, and multi-disciplinary work. Aggregating large amounts of heterogeneous data is only part of what needs to be done – researchers need to be able to use these often large and unwieldy data sets. Data sources, whether individual datasets or mature databases, are not typically designed for interoperability and data reuse, from the perspective of both technical interoperability and legal reuse permissions. Developing best practices for designing, building, and sustaining data infrastructure that delights users and serves diverse stakeholders is no easy task. In this presentation, we will share ongoing efforts towards building a data commons for the Down syndrome research community – the INCLUDE DCC Data Hub. Ultimately, the goal is for any commons to have the capability of handling complex data requests that require querying across multiple sources and data resources – and this cannot happen without some harmonization of systems, data, and governance across the ecosystem. We will provide a progress update on ongoing work, with details from establishing the need for the data commons ecosystem, to conducting a landscape and requirements assessment, to developing a harmonized model and disseminating the final product through a secure data hub.

Angela Ting, University of Texas at Austin, United States
Dooti Roy, Boehringer Ingelheim Pharmaceuticals, Inc., United States
Oliver Sailer, Boehringer Ingelheim Pharma GmbH & Co. KG, Germany

A critical question both during and after drug development is whether or not a drug is safe. Typically, tabulated safety data is reviewed as it accumulates to identify early safety signals. However, identifying safety signals of interest can often be a challenging task, especially if there are too few events of interest to make a confident decision. Here, we investigate the safety of drug X, which was tested in different trials with different indications, dosing regimens, treatment durations, and routes of administration. We use a hierarchical Bayesian model with nested indications and trials to answer how likely an adverse event (AE) is to occur in a future trial and whether the risk of an AE is higher for drug X than for placebo. Furthermore, the identification of potential prognostic factors is of interest. The model predicts an AE with a balanced accuracy of 0.77. Additionally, we see an increase in the risk for an AE with increasing treatment duration, number of doses, or dose concentration. On the other hand, the risk for an AE decreases with increasing age. In conclusion, this model helps identify prognostic markers associated with AEs and generates predictions of AEs with proven balanced accuracy. With appropriate use, this model can help researchers make better informed safety decisions with the ultimate goal of advancing patient safety and drug innovation.

Alex Kunz, Brigham Young University, United States
Nicholas Tenney, Brigham Young University, United States
Raya Esplin, Brigham Young University, United States
John Jacobsen, Brigham Young University, United States
Samuel Payne, Brigham Young University, United States
Andrea Kokkonen, Brigham Young University, United States
Dennis Shiozawa, Brigham Young University, United States
R. Paul Evans, Brigham Young University, United States
Perry Ridge, Brigham Young University, United States

Many closely-related species are nearly identical morphologically and therefore difficult to distinguish. DNA sequencing has been an effective approach to differentiate these species. Unfortunately, whole-genome sequencing and assembly are expensive and computationally complex. Comparing genotypes for a large number of single nucleotide polymorphisms (SNPs) that vary across the species of interest is an effective and comparatively simple and inexpensive approach. The process can be further streamlined by developing custom SNP-Chips. Several custom approaches have been used to identify species-distinguishing SNPs among closely related species, but there are no generalized methods that can be applied to new groups of species. Identifying SNPs from raw sequencing reads requires a mix of complex software tools and custom scripts to process data, and prepare and convert files between steps. Automating the identification of species-defining SNPs in a generic manner would reduce the misidentification of species and make the approach accessible to researchers with little, or no, molecular or computational training. We designed Speciel to identify sets of SNPs that differentiate between related input species. Speciel uses next-generation sequencing reads from each of the target species, assembles the reads, and identifies SNPs relative to a reference sequence. These SNPs can be genotyped individually in sample tissue or, alternatively, used to develop probes on a custom SNP-Chip. When tested on five subspecies of cutthroat trout, Speciel identified 15,998 potential SNPs which together discriminate between the different subspecies. Speciel is open-source and freely downloadable in a Docker container with all required dependencies, software, and scripts.

Garrett Jenkinson, Mayo Clinic, United States
Eric Klee, Mayo Clinic, United States

DNA sequencing diagnoses 18-40% of unsolved rare genetic disease cases, and the recent incorporation of RNA-Seq has been shown to generate significant numbers of previously unattainable diagnoses. Multiple inborn diseases resulting from disorders of genomic imprinting are well characterized, and a growing body of literature suggests a causative or correlative role of DNA methylation in rare inherited conditions. Therefore, the application of genome-wide methylation-based sequencing for undiagnosed cases of rare disease is a logical progression from current testing. Following the rationale exploited in RNA studies of rare disease, we can assume that disease-associated methylation will demonstrate significant differences from individuals with unrelated phenotypes. Thus, aberrantly methylated sites will be outliers from a heterogeneous cohort. Based on this rationale, we developed BOREALIS: Bisulfite-seq OutlieR MEthylation At SingLe-SIte ReSolution, which is available as a Bioconductor package. It uses a beta-binomial model to identify outlier methylation at CpG site resolution from bisulfite sequencing. BOREALIS addresses a need unmet by standard differential methylation analyses based on case-control groups. Utilizing a cohort of 94 undiagnosed rare disease patients, we show that BOREALIS can identify outlier methylation linked to phenotypically relevant genes, providing a new avenue of exploration in the quest for increased diagnoses in rare disease. We highlight one patient with previously unrecognized hypermethylation patterns that are now informing clinical decisions. Furthermore, we touch upon a new multimodal-multiomics method we are now implementing to include single-patient DNA, RNA, and methylation data, which is yielding previously unrecognized findings that promise to increase diagnosis of rare disease.
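As a rough illustration of outlier detection under a beta-binomial model, the sketch below scores one CpG site in one patient by the upper-tail probability of the observed methylated read count. It is not the BOREALIS implementation: the function names are hypothetical, and the shape parameters `a` and `b` are assumed to be given (in practice they would be estimated per site from the rest of the cohort, a step the abstract does not describe).

```python
from math import exp, lgamma


def log_beta(a, b):
    """Natural log of the beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def betabinom_pmf(k, n, a, b):
    """P(k methylated reads out of n) under a beta-binomial(n, a, b) model."""
    log_choose = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return exp(log_choose + log_beta(k + a, n - k + b) - log_beta(a, b))


def upper_tail_pvalue(k, n, a, b):
    """Probability of seeing k or more methylated reads (hypermethylation test)."""
    total = sum(betabinom_pmf(j, n, a, b) for j in range(k, n + 1))
    return min(1.0, total)


# Hypothetical site: cohort methylation centered near 10% (a=2, b=18),
# while one patient shows 28 methylated reads out of 30.
p_outlier = upper_tail_pvalue(28, 30, a=2.0, b=18.0)
p_typical = upper_tail_pvalue(3, 30, a=2.0, b=18.0)
```

A small `p_outlier` relative to `p_typical` flags the site as a candidate hypermethylation outlier for that patient; a lower-tail version would flag hypomethylation in the same way.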

Charisse Madlock-Brown, University of Tennessee Health Science Center, United States
Kenneth Wilkins, NIH National Institute of Diabetes and Digestive and Kidney Diseases, United States
Parya Zareie, University of Tennessee Health Science Center, United States
Brenda McGrath, OCHIN, Inc., United States

Long COVID, or Post-Acute Sequelae of COVID-19 (PASC), is characterized by persistent symptoms and conditions after the acute phase of a COVID-19 infection. Mounting evidence suggests these span a wide array of body systems with significant heterogeneity across patients. We apply an unsupervised document analysis method, topic modeling (Latent Dirichlet Allocation), to identify clusters of co-occurring conditions within 480 million electronic health records of 9 million patients available from the National COVID Cohort Collaborative (N3C). These data, representing 62 contributing healthcare organizations across the US, generate hundreds of detailed clinical ‘topics.’ Using these as guides, we identify a number of new-onset conditions strongly associated with COVID-19 or suspected PASC patients compared to those with no known infection. Finally, we investigate a novel statistical modeling of patient-topic assignment pre- and post-infection, with covariates to identify PASC-associated topics in query cohorts such as adolescents or females. This method identifies a distinctive Long COVID topic, and several others significant for specific demographic groups. Overall we demonstrate that topic modeling is especially effective for large-scale EHR datasets, and with longitudinal analyses can inform how patient groups migrate toward, or away from, clinical topics in response to significant events like COVID-19 infection.

Jason Moore, Cedars Sinai Medical Center, United States

Methodology for predictive analysis is a subject of extensive investigation, with the most common analytical approaches focusing on fixed-effects models and models that assume linear associations, as exemplified in oncologic and radiologic pathology imaging and detection. However, investigation of health outcomes with more complex agnostic machine learning modeling frameworks in observational studies can be challenging due to data quality issues and other complexities. Such datasets often contain outliers, missing values, and mixed data types, necessitating the development of quality data-cleaning pipelines for assessing and addressing data quality issues. Additionally, many complex health outcomes tend to exhibit differing degrees of clinical heterogeneity, and this drives the need for more thorough phenotypic subtyping. We developed an integrated data cleaning-subtype discovery pipeline with unsupervised learning algorithms to facilitate the analysis and visualization of data patterns and data outcomes. We apply this pipeline to the National Health and Nutrition Examination Survey (NHANES), one of the largest curated repositories of population-level health-related indicators, which includes physical examination, blood biochemistry, and self-reported environmental and nutrition data. We focused our investigations on dental caries, which remains the most prevalent chronic disease, affecting more than 3.5 billion people worldwide. Our approach reveals data patterns that led to the discovery of previously unrecognized subtypes and variables associated with the clinical phenotype heterogeneity of dental caries. We observed diverging patterns of similarity within different age groups and different variable subsets, which can further guide the development of more precise and robust machine learning predictive models.

Varsha Sreekanth, UC Anschutz, United States
Scott Cramer, UC Anschutz, United States
James Costello, UC Anschutz, United States

The lethality of prostate cancer (PCa) is driven by its transition from localized to metastatic disease. In recent years, several tumor profiling studies in PCa patients have revealed the molecular characteristics of both localized and metastatic PCa tumors. These studies have provided an abundance of molecular and clinical information; however, the molecular determinants driving aggressiveness in PCa remain unclear. To address this gap, we have performed a meta-analysis that integrates genetic, transcriptomic, and clinicopathologic data across four independent PCa cohorts. Our approach begins by determining a set of common, clinically significant alterations observed in primary PCa. This analysis revealed MAP3K7 and USP10 loss-of-function alterations as frequently occurring alterations that are associated with progression-free survival. Next, our approach compares primary and metastatic tumors harboring either USP10 or MAP3K7 alteration. This analysis identified distinct sets of genes associated with aggressiveness in patients harboring either USP10 or MAP3K7 loss. Further inspection of these genes confirmed that some have been previously linked to cancer while others remain unstudied. Results from this work may guide future studies of the molecular pathways regulating aggressiveness in USP10-deleted or MAP3K7-deleted PCa. Additionally, the analysis pipeline generated in this work is flexible enough to accommodate user-defined molecular subtypes. Thus, this work also yields a generalizable tool for identifying novel regulators of aggressiveness within molecularly defined PCa subtypes.

Invasive species threaten the survival of native island ecosystems. On the island of Molokai, Hawaii, the invasive tree Prosopis juliflora (Kiawe) has overtaken all coastal dune areas, resulting in the loss of native vegetation habitat and abundance. Our study focuses on evaluating the effectiveness of restoration projects across the Mokio Preserve on the island of Molokai. To do this, we conducted a land survey by collecting UAV (unmanned aerial vehicle) imagery and processing it into a species-level classified orthophotomosaic, with which we compared species richness across time-segmented restoration areas. Kiawe removal was the primary restoration method utilized across restoration projects. We found species richness to be highest in areas where Kiawe removal occurred earlier. The latest removal areas exhibited lower species richness than earlier removal areas, yet their richness remained significantly higher than in areas where there was no removal. We conclude that Kiawe removal and other restoration methods utilized in the Mokio Preserve are effective in restoring species richness. In addition, longer time intervals since the initial removal events are associated with higher levels of species richness. These results support the use of Kiawe removal and the other restoration measures in similar environments to restore native ecosystems. Our results also support the further use of UAVs and classification software as effective tools in ecosystem restoration.

Marylyn Ritchie, University of Pennsylvania, United States
Diego Milone, Universidad Nacional del Litoral, Argentina
Casey Greene, University of Colorado, United States

Correlation coefficients are widely used to identify relevant patterns in data. In transcriptomics, genes with correlated expression often share functions or are part of disease-relevant biological processes. Here we introduce the Clustermatch Correlation Coefficient (CCC), an efficient, easy-to-use and not-only-linear coefficient based on machine learning models. We show that CCC can capture biologically meaningful linear and non-linear patterns missed by standard, linear-only correlation coefficients. CCC efficiently captures general patterns in data by applying clustering algorithms and automatically adjusting the model complexity. When applied to human gene expression data, CCC identifies robust linear relationships while detecting non-linear patterns associated with sex differences that are not captured by the Pearson or Spearman correlation coefficients. Gene pairs highly ranked by CCC but not detected by linear-only coefficients showed high probabilities of interaction in tissue-specific networks built from diverse data sources including protein-protein interaction, transcription factor regulation, and chemical and genetic perturbations. CCC is much faster than state-of-the-art not-only-linear coefficients such as the Maximal Information Coefficient (MIC). CCC is a highly-efficient, next-generation not-only-linear correlation coefficient that can readily be applied to genome-scale data and other domains across different data types.
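The clustering-based idea can be sketched in a few lines. The version below is a simplified illustration, not the published CCC code: it assumes each variable is split into rank-based quantile bins for several values of k, that agreement between two partitions is measured with the adjusted Rand index, and that the coefficient is the best agreement found. The function names, the default `k_max`, and the naive handling of ties are all assumptions.

```python
from collections import Counter
from math import comb


def quantile_partition(values, k):
    """Assign each value to one of k roughly equal-sized rank-based bins."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [0] * len(values)
    for rank, index in enumerate(order):
        labels[index] = rank * k // len(values)
    return labels


def adjusted_rand_index(a, b):
    """Chance-corrected agreement between two labelings of the same items."""
    together = sum(comb(c, 2) for c in Counter(zip(a, b)).values())
    pairs_a = sum(comb(c, 2) for c in Counter(a).values())
    pairs_b = sum(comb(c, 2) for c in Counter(b).values())
    expected = pairs_a * pairs_b / comb(len(a), 2)
    maximum = (pairs_a + pairs_b) / 2
    if maximum == expected:
        return 0.0  # degenerate partitions carry no information
    return (together - expected) / (maximum - expected)


def clustering_coefficient(x, y, k_max=4):
    """Best partition agreement over all bin counts from 2 to k_max."""
    ks = range(2, k_max + 1)
    best = max(
        adjusted_rand_index(quantile_partition(x, kx), quantile_partition(y, ky))
        for kx in ks
        for ky in ks
    )
    return max(0.0, best)


# Toy data: a linear and a quadratic (non-monotonic) relationship.
x = list(range(-10, 11))
linear_score = clustering_coefficient(x, [2 * v + 1 for v in x])
quadratic_score = clustering_coefficient(x, [v * v for v in x])
```

In this toy case the quadratic relationship has a Pearson correlation of zero, yet the partition-based score is clearly positive, because a coarse split of y (small versus large values) lines up with a finer split of x.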

Arturo Pineda, Galatea Bio, Mexico
Ming Ta Michael Lee, Galatea Bio, United States
Sonia Moreno, Galatea Bio, Spain

Background: Elevated levels of liver enzymes in the bloodstream are indicators of potential illnesses. Developing a patient's risk profile regarding these enzymes enables adjustments in lifestyle or treatments to best serve the patient. Polygenic risk scores (PRS) evaluate a patient’s risk of medical conditions based on their genetic profile. The aim of this study is to evaluate the applicability of PRS models to a direct-to-consumer cohort. Methods: In this analysis, four key liver enzymes (ALP, AST, ALT, and GGT) were examined. Enzyme models from the PGS Catalog were evaluated in conjunction with data from the 1000 Genomes Project. The resulting Z-scores were filtered to include only individuals of European ancestry and transformed into decile risk per model. Results: Four models were evaluated: Privé1, Privé2, Sinnott-Armstrong, and Pazoki. The highest correlation coefficient was between the two Privé models (r = 0.93), which also had the highest number of variants. The success rates of developing a risk assessment were 77.7% (ALP), 61.6% (AST), 64% (ALT), and 74.2% (GGT). Discussion: The majority of individuals were able to receive a risk assessment. When the models were in clear agreement, an individual received a risk profile; otherwise, no assessment was made. All models displayed clear concordance with the exception of the Pazoki model. We recommend using the Sinnott-Armstrong and Privé models due to their high concordance and the high number of variants accounted for. In the future, we would like to prospectively test the effectiveness of these models in a clinical setting.

Tejasvene Ramesh, University of Southern California, United States
Karishma Chhugani, University of Southern California, United States
Yutong Chang, University of Southern California, United States
Brian Nadel, University of Southern California, United States
Yesha Patel, University of Southern California, United States
Annie Wong-Beringer, University of Southern California, United States
Serghei Mangul, University of Southern California, United States

Sepsis is a life-threatening, dysregulated host response to infection that is a major cause of mortality worldwide. Current research approaches usually over-simplify the interactions between pathogens and the innate and adaptive immune systems. We developed and benchmarked transcriptomics-based bioinformatics methods to accurately infer the features of the innate and adaptive immune systems and the virulence of infectious agents. We combined publicly available transcriptomics data into the largest trans-ancestry retrospective sepsis cohort with a rich set of clinical phenotypes. Our cohort includes over 3705 individuals diagnosed with sepsis across 33 individual studies. We focused on identifying immune features, such as cell type composition, and sepsis types associated with the severity of the infection and poor outcomes in sepsis, and on studying the complex interplay between the immune system and the infection type (viral or bacterial), facilitating insights into clinical severity scores and survival status of patients. To achieve this goal, we assembled a large-scale multicentre and trans-ancestry retrospective sepsis (MCMERS) cohort, including individuals diagnosed with sepsis from publicly available datasets. We compared the cell type composition of samples derived from survivor and non-survivor groups using our recently developed tool (GEDIT) and predicted mortality (summary AUROC = 0.725). Our results suggest that the relative abundance of various immune cells (e.g., mast cells and neutrophils) was significantly different between survivors and non-survivors (p-value < 10⁻⁴). Results from our study will improve our understanding of the relationship between the immune system and infection type (viral or bacterial) across individuals of diverse ancestry backgrounds.

Aaron Fabelico, Brigham Young University, United States
James Wengler, Brigham Young University, United States
Stephen Piccolo, Brigham Young University, United States

Data-sharing requirements have led to the wide availability of genomic datasets in public repositories. Researchers can reuse and combine these datasets to validate findings and address novel hypotheses. However, it is challenging to identify which datasets are relevant to a particular research question because of the large quantity of available datasets and the heterogeneity in the way that researchers describe their data and study designs. In this study, we focus specifically on Gene Expression Omnibus (GEO), a repository that contains genomic data from hundreds of thousands of experiments and is commonly used in biomedical analyses. Notable efforts have been made to manually annotate the data, but these efforts are unable to keep pace with daily dataset submissions. To address this problem, we turn to language representation models. Under the assumption that a researcher has manually identified a few datasets related to a discrete research topic, we seek to identify more datasets that are likely related, using word-embedding representations of the text. This is done by summarizing dataset descriptions using methodologies such as bag of words, skip-gram, or transformers to generate vectors, which we then compare using cosine similarity. With a systematic benchmark comparison among algorithms and model corpora, we evaluate whether it is most effective to use models pre-trained on large, generic corpora; models pre-trained on smaller biomedical corpora; or models trained on (out-of-sample) GEO titles and abstracts. Preliminary results suggest that training on discipline-specific text and using either transformers or skip-gram models yields the best results.
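
The vector-comparison step described above can be illustrated with a minimal Python sketch. It covers only the simplest representation mentioned (bag of words) and ranks candidate descriptions by their mean cosine similarity to a few seed descriptions. The function names, the whitespace tokenizer, the mean-over-seeds scoring, and the example texts are all assumptions made for illustration; the abstract does not specify these details.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Lowercase, whitespace-tokenize, and count tokens."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_candidates(seed_texts, candidate_texts):
    """Rank candidates by mean cosine similarity to the seed descriptions."""
    seeds = [bag_of_words(t) for t in seed_texts]
    scored = []
    for name, text in candidate_texts.items():
        vec = bag_of_words(text)
        score = sum(cosine_similarity(vec, s) for s in seeds) / len(seeds)
        scored.append((name, score))
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical descriptions, invented for illustration only.
seeds = [
    "breast cancer tumor gene expression",
    "gene expression in breast tumor samples",
]
candidates = {
    "dataset_A": "gene expression profiling of breast tumor biopsies",
    "dataset_B": "soil microbial community sequencing",
}
ranking = rank_candidates(seeds, candidates)
```

In this toy example the breast tumor description ranks first and the unrelated description scores zero. Skip-gram or transformer embeddings would replace `bag_of_words` with a dense vector, while the cosine comparison stays the same.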

Catherine Lozupone, University of Colorado Anschutz Medical Campus, United States
Lawrence Hunter, University of Colorado Anschutz Medical Campus, United States

A mechanistic understanding of microbial processes in the gut would greatly improve our ability to diagnose and treat disease. Knowledge graphs (KGs) have the potential to provide this mechanistic understanding as a centralized resource of complex concepts, though a biomedically relevant KG for the microbiome field has not yet been built. We developed a microbiome-relevant KG, MGMLink, which integrates manually curated microbe-host gene and microbe-metabolite relationships from the literature into a biomedically relevant KG built using the PheKnowLator framework. We compare the accuracy of content, evaluated through competency questions, and the diversity of concepts, evaluated by quantifying node and edge types, of MGMLink and two other microbe-relevant KGs. The first combines microbial traits from kg-microbe with phenotypes across species from kg-phenio. The second KG is MiKG4MD, which represents relationships between microbes, neurotransmitters, and diseases to evaluate mental disorders. We hypothesize that MGMLink will enable more useful scientific hypotheses about the role of microbes in disease since it captures specific microbe-host interactions. To test this, we present a novel methodology for hypothesizing mechanisms of microbes in disease using a cosine-similarity-based path search. In addition to the novel microbiome-relevant knowledge base that we developed, this approach can quantifiably examine the content of KGs such that they can become better suited for mechanistic inference.

Ali Tugrul Balci, University of Pittsburgh, United States
Nathan Clark, University of Utah, United States
Maria Chikina, University of Pittsburgh, United States

Understanding the genomic underpinnings of longevity is crucial for preventing age-related pathologies. Mammals exhibit a vast variety of lifespans that evolved repeatedly, making longevity a convergent phenotype that can be studied with comparative methods. While morphological diversity is increasingly attributed more to changes in non-coding regulatory elements (REs) than to changes in protein sequences, an analytical phylogenetic framework that is tailored to the functional and structural properties of REs is lacking. Here, we develop AFconverge, an ‘alignment-free’ phylogenetic method that predicts the patterns of regulatory motif adaptations underlying phenotypic evolution. By modularly computing the phenotypic association of transcription factor (TF) binding motifs in REs, AFconverge introduces new and flexible paradigms for deciphering the complexity of regulatory adaptations at multiple scales. Applying AFconverge to study promoter adaptations underlying mammalian longevity, we find widespread gains and losses of motifs that are consistent with known associations of longevity with pluripotency maintenance, circadian regulation, immunity, and dietary restriction. We also characterize 28 gene families involved in pluripotency regulation and germline development that exhibit family-specific motif selection patterns. Additionally, AFconverge’s TF-centric signals enable inference on latent factors underlying the observed promoter-motif selection patterns genome-wide, revealing that promoter-motif selection underlying longevity is strongly driven by mechanisms in stem and progenitor cells of germlines, liver, adipose tissues, connective tissues, and cardiovascular and immune systems. Finally, we show that promoters implicated in mTOR signaling, insulin signaling, extracellular matrix regulation, and cancer evolve under relaxed constraint in long-lived mammals. Thus, AFconverge offers new, powerful approaches for interrogating how selection acts on regulatory machinery.

Serghei Mangul, University of Southern California, United States
Eleazar Eskin, UCLA, United States

Structural variation (SV) refers to insertions, deletions, inversions, and duplications in human genomes. With advances in whole genome sequencing (WGS) technologies, a plethora of SV detection methods have been developed. However, dissecting SVs from WGS data presents a number of challenges: the majority of SV detection methods suffer from a high false-positive rate, and no existing method is able to detect the full range of SVs present in a sample. Here, we report an integrated structural variant calling framework, SVPred, that combines the outputs of individual callers using a newly devised filtering and merging algorithm. Previous studies have shown that the performance of SV callers is significantly affected by variant length. SVPred exploits this difference by dividing the outputs of individual callers into bins based on variant length and combining the best-performing tools per bin. SVPred executes various combinations of Pindel, MistrVar, indelMINER, Manta, GRIDSS, BreakDancer, LUMPY, DELLY, CREST, RDXplorer, PopDel, GASV, and GenomeSTRiP to generate SV events. We evaluated the performance of SVPred on data from different organisms and at varying coverage. We ran SVPred on the public benchmark data for the Ashkenazi Jewish Trio son from the Genome-in-a-Bottle (GIAB) consortium, along with 7 strains of the mouse genome. SVPred has the highest F1 score measured across both mouse and human genomes. Additionally, variants predicted by SVPred were consistent with the experimentally validated truth set. Our analysis shows that SVPred provides an accurate SV calling framework and can serve as the gold standard for SV calling for the research community.
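
The idea of binning calls by variant length and keeping the preferred callers per bin can be sketched in Python as follows. This is only an illustration under stated assumptions: the bin edges, the per-bin caller choices, the tuple layout of a call, and the function names are hypothetical, and the abstract does not describe SVPred's actual filtering and merging algorithm.

```python
from bisect import bisect_right

# Hypothetical bin edges (bp) and per-bin caller choices; the abstract does not
# state the real values used by SVPred.
BIN_EDGES = [50, 500, 10_000]  # bins: <50, 50-499, 500-9999, >=10000
BEST_CALLERS = {
    0: {"Pindel"},
    1: {"Manta", "DELLY"},
    2: {"Manta", "LUMPY"},
    3: {"GRIDSS"},
}


def length_bin(length):
    """Return the index of the length bin that contains `length`."""
    return bisect_right(BIN_EDGES, length)


def merge_calls(calls_by_caller):
    """Keep each call only if its caller is preferred for the call's length bin.

    calls_by_caller maps caller name -> list of (chrom, start, end, sv_type).
    Identical calls reported by several preferred callers are collapsed.
    """
    merged = set()
    for caller, calls in calls_by_caller.items():
        for chrom, start, end, sv_type in calls:
            if caller in BEST_CALLERS[length_bin(end - start)]:
                merged.add((chrom, start, end, sv_type))
    return sorted(merged)


# Hypothetical calls, invented for illustration only.
example_calls = {
    "Pindel": [("chr1", 100, 130, "DEL"), ("chr1", 1000, 3000, "DEL")],
    "Manta": [("chr1", 1000, 3000, "DEL"), ("chr2", 5000, 5200, "DUP")],
    "LUMPY": [("chr1", 1000, 3000, "DEL")],
}
merged_calls = merge_calls(example_calls)
```

Here the 2 kb deletion reported by Pindel is dropped because Pindel is not preferred for that bin, while the same event from Manta and LUMPY is kept once. A practical merger would also need a tolerance for breakpoint coordinates that differ slightly between callers, which this sketch omits.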

High-throughput image microscopy and image-based profiling have produced massive amounts of high-dimensional cell morphology data, which has led to rapid development of computational analysis workflows. However, most experiments use bespoke analytical workflows that are difficult to repurpose in new experiments and often lack usability, scalability, data provenance, portability, and reproducibility. Here we introduce CytoSnake, open-source software for extensible, reproducible, and scalable processing of high-dimensional cell morphology readouts derived from cell microscopy images. CytoSnake is based on the snakemake workflow manager, is pip installable, and implements pycytominer functions to perform configurable image-based profiling workflows that can smoothly extend to multiple experiments. Users can edit configuration files to perform custom data processing, and CytoSnake provides detailed logs documenting the data provenance and parameters used. CytoSnake controls scalability in the backend, permitting usage of multiple threads and cores, and users can explicitly set the computational resources per dataset. We tested CytoSnake on morphology features extracted from both CellProfiler and DeepProfiler. Using CytoSnake, we fully reproduced an image-based profiling pipeline of a publicly available Cell Painting dataset, processing nine 384-well plates in less than 30 minutes. In addition to its usability, CytoSnake promotes community-based workflow development through its custom workflow implementation feature. Users can develop their own workflows by utilizing the available components that CytoSnake offers. User-developed workflows can be shared with other CytoSnake users while maintaining reproducibility. With its rigorous process of maintaining reproducibility and allowing user-based extensibility, CytoSnake will accelerate the analytical capabilities of the cell morphology community.

Lawrence Hunter, University of Colorado School of Medicine, United States
Katharina Kann, University of Colorado Boulder, United States

We introduce an algorithm for evaluating the factuality of biomedical language models. In contrast to previously proposed template-based strategies, we use naturally occurring text while still not requiring expert involvement. Our results highlight shortcomings of templates when evaluating factuality, as we find much larger differences between models that are trained from scratch and models that are finetuned on medical text. In addition and in contrast to previous evaluation methods, our approach indicates that all biomedical models outperform all vanilla models. Finally, we also find that pseudo-perplexity is not correlated with factuality and that more training data does not necessarily result in more factual output.

Han Yu, Roswell Park Comprehensive Cancer Center, United States
Rachael Hageman Blair, University at Buffalo, United States

Recent advances in single-cell sequencing technologies have accelerated discoveries and provided insights into the heterogeneous tumor microenvironment. Despite this progress, the translation to clinical endpoints and drug discovery has not kept pace. Mathematical models of cellular metabolism and regulatory networks have emerged as powerful tools in systems biology that have progressed methodologically in parallel. Although cellular metabolism and regulatory networks are intricately linked, differences in their mathematical representations have made their integration challenging. This work presents a framework for integrating Bayesian Network representations of regulatory networks into a constraint-based metabolism model. Fully integrated models of this type can be used to perform computational experiments to predict the effects of perturbations to the signaling pathway on the downstream metabolism. This framework was applied to single-cell sequencing data to develop cell-specific computational models of glioblastoma. Models were used to predict the pharmaceutical effects of 177 curated drugs published in the drug repurposing hub library, and combinations, on metabolism in the tumor microenvironment. The integrated model is used to predict the effects of pharmaceutical interventions on the system, which enables the prioritization of therapeutic targets and combination therapies for drug-related discovery. Results show that predicted drug combinations inhibiting STAT3 (e.g. Niclosamide) with other transcription factors (e.g. AR inhibitor Enzalutamide) will strongly suppress anaerobic metabolism in malignant cells, without major interference to other cell types’ metabolism, suggesting a potential combination therapy for anticancer treatment. This framework of model integration is generalizable to other applications, such as different cell types, organisms, and diseases.

Complex social behaviors are essential to survival and reproduction, but existing methods are ill-suited to give the researcher agency to conduct rigorous and controlled experiments. Inducing social behavioral responses is difficult, relying on interactions with multiple animals that behave in uncontrolled and sometimes unreliable ways. Additionally, stimuli from animals are multi-modal, including multiple visual cues that are difficult for the experimenter to control. Teasing apart the multivariate conditions of an animal behavior requires a combinatorial number of experiments that increases exponentially with respect to the complexity of the behavior and its stimuli. I will use object detection and tracking algorithms to accurately and reproducibly quantify the mating and feeding behavior of Lake Malawi cichlids.

Jessica Knobbe, University of Iowa, United States
Nicole Cady, University of Michigan, United States
Jemmie Hoang, University of Iowa, United States
Catherine Cherwin, University of Iowa, United States
Melissa Curry, University of Iowa Hospital and Clinics, United States
Rohan Garje, University of Iowa, United States
Praveen Vikas, University of Iowa, United States
Sonia Sugg, University of Iowa, United States
Sneha Phadke, University of Iowa Carver College of Medicine, United States
Edward Filardo, Cytonus Therapeutics, United States
Meeta Yadav, University of Iowa Carver College of Medicine, United States

Breast cancer is the most commonly diagnosed cancer worldwide and is responsible for nearly one-third of all cancer cases among women in the United States. There are numerous risk factors associated with the development of breast cancer, including both environmental factors (e.g., age, obesity, metabolic syndrome) and genetic factors (e.g., BRCA1 and BRCA2). However, up to 50% of breast cancer cases cannot be attributed to these known risk factors, demanding identification of alternative factors that drive cancer formation. Recently, the host microbiome has emerged as an important environmental factor linked with the pathobiology of breast cancer. As geographic location impacts the gut microbiome (GM), it is imperative to study the GM in the disease state across many regions of the world. Therefore, to investigate the shifts in the bacterial component of the GM of patients with breast cancer in the Midwest region of the United States, we profiled 24 patients with breast cancer (BC) and 23 race- and sex-matched healthy controls (HC). We utilized several bioinformatics tools in R to assess the microbial community differences between BC and HC, including univariate tests, alpha and beta diversity, and the machine learning algorithm Random Forest. We observed a significant difference in GM composition between these two groups, specifically a reduced abundance of several beneficial, short-chain fatty acid-producing gut bacteria in BC. Our study highlights a potential role of the GM in breast cancer pathobiology and warrants further investigation of the mechanisms through which the GM can predispose to or protect from breast cancer.

Abdollah Dehzangi, Rutgers University, United States

Protein lysine methylation is a particular type of post-translational modification that plays an important role in both histone and non-histone function regulation in proteins. Deregulation caused by lysine methyltransferases has been identified as the cause of several diseases, including cancer as well as both mental and developmental disorders. Identifying lysine methylation sites is a critical step in both early diagnosis and drug design. This study proposes a new in silico method called CNN-Meth for predicting lysine methylation sites using a convolutional neural network (CNN). Our model is trained using evolutionary, structural, and physicochemical-based features along with binary encoding. To the best of our knowledge, this combination of features has never been used for this problem. Our results demonstrate that CNN-Meth can significantly outperform previous studies found in the literature for predicting methylation sites. It achieves 96.0%, 85.1%, 96.4%, and 0.65 in terms of accuracy, sensitivity, specificity, and Matthews correlation coefficient (MCC), respectively. CNN-Meth and its source code are publicly available at https://github.com/MLBC-lab/CNN-Meth
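
For readers unfamiliar with the four reported metrics, the following Python sketch shows how they are computed from a binary confusion matrix using their standard definitions. The function name and the example counts are hypothetical and are not taken from the CNN-Meth evaluation.

```python
import math


def classification_metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity, specificity, and MCC from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0  # true positive rate
    specificity = tn / (tn + fp) if (tn + fp) else 0.0  # true negative rate
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when any marginal count is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "mcc": mcc,
    }


# Hypothetical counts, invented for illustration only.
metrics = classification_metrics(tp=40, fp=5, tn=45, fn=10)
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, so it remains informative when methylated and non-methylated sites are imbalanced.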

Arwen Oakley, Brigham Young University, United States
Stephen Piccolo, Brigham Young University, United States

It is estimated that 8% of men have red-green color vision deficiencies (CVD) (J A Spalding, 1999, p. 468). To achieve equity in research, researchers with CVD should be able to understand scientific figures just as well as their peers. To determine how often researchers use colorblind-unfriendly figures, we have manually classified 5000 images according to CVD friendliness. We randomly selected these images from the eLife online journal, which contains over 50,000 scientific figures related to biological sciences. We used four visual metrics to annotate and classify each image. The metrics are 1) whether the image contains confusing colors (i.e., shades of green and red), 2) whether contrast mitigates the confusing colors, 3) whether labels mitigate the confusing colors, and 4) whether distance mitigates the confusing colors. To categorize each image, we looked at the original image alongside the image simulated for someone with 80% deuteranopia. Using these metrics, we categorized each image as “Gray-scale,” “Definitely okay,” “Probably okay,” “Probably problematic,” or “Definitely problematic.” All images classified as “Probably okay” or “Probably problematic” include a written description that explains our choice. We estimate that approximately 14% of eLife figures are “Definitely problematic.” Individuals with CVD would likely struggle to understand certain aspects of these figures. Our annotated dataset will be freely available in hopes that it will prove useful to other researchers. We are now using our hand-classified dataset to train a convolutional neural network model to automatically classify scientific figures as CVD friendly or not.

Jonathan Tang, Cal Poly, San Luis Obispo, United States
Lauren Garabedian, Cal Poly, San Luis Obispo, United States
McClain Kressman, Cal Poly, San Luis Obispo, United States
Harsha Lakshmankumar, Cal Poly, San Luis Obispo, United States
Belle Aduaka, Cal Poly, San Luis Obispo, United States
Ella Thomas, Cal Poly, San Luis Obispo, United States
Rachel Koenigsberg, Cal Poly, San Luis Obispo, United States
Jean Davidson, Cal Poly, San Luis Obispo, United States
Paul Anderson, Cal Poly, San Luis Obispo, United States

Underspecification refers to a common problem in machine learning where a model does not perform as expected in real-world scenarios. This is in contrast to overfitting, which can be identified during training, validation, and testing. We systematically created shifted test sets using the METABRIC transcriptomic dataset of 2,000 patients. These datasets are used to evaluate methods that identify and potentially mitigate underspecification. Three clusters within the HER2+ breast cancer subtype were each sampled 20 times to create 60 datasets with equally sized shifted and unshifted test sets. Model underspecification was evaluated by examining the performance difference between shifted and unshifted test sets. When there is little or no underspecification, the best-performing model will perform well on both test sets. We focused on the five datasets that exhibited the biggest difference in performance. Within these, we evaluated methods to re-rank potential models using only the unshifted test set. We produced an optimal re-ranking that matches the “true” ranking of models computed by combining the performance on shifted and unshifted datasets. The models were re-ranked according to the correlation between the knowledge-derived adjacency matrix and an adjacency matrix derived from the final layer of the neural networks. The unshifted test data performance and knowledge correlation were averaged to form a final combined score. This re-ranking significantly reduced underspecification in three of the five examples. One sample still had a significant difference, and in one sample underspecification worsened. These initial results indicate that methods which incorporate established biological knowledge can mitigate underspecification.
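
The combined score used for re-ranking can be sketched in Python as below. This is a minimal illustration, assuming Pearson correlation over flattened adjacency matrices and an unweighted mean of the two terms; the abstract says only that a correlation was used and that the two quantities were averaged. How the model-side adjacency matrix is derived from the final network layer is not shown, and all names and values are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0
    return cov / (sd_x * sd_y)


def flatten(matrix):
    """Flatten a list-of-lists matrix row by row."""
    return [value for row in matrix for value in row]


def rerank(models, knowledge_adj):
    """Rank models by the mean of unshifted performance and knowledge correlation.

    models maps name -> (unshifted_score, model_adjacency_matrix).
    """
    scored = []
    for name, (unshifted_score, model_adj) in models.items():
        corr = pearson(flatten(model_adj), flatten(knowledge_adj))
        scored.append((name, (unshifted_score + corr) / 2))
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical 3-gene example, invented for illustration only.
knowledge = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
models = {
    "model_a": (0.90, [[0, 0, 1], [0, 0, 0], [1, 0, 0]]),
    "model_b": (0.88, [[0, 0.9, 0.1], [0.9, 0, 0.8], [0.1, 0.8, 0]]),
}
ranking = rerank(models, knowledge)
```

In this toy case `model_a` scores slightly higher on the unshifted test set, but `model_b` agrees far better with the knowledge-derived matrix and therefore ranks first.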

Xiaoxu Deng, University of Kansas Medical Center, United States

Gene set enrichment analysis (GSEA) helps to identify the biological functions that are enriched among up- or down-regulated genes. Survival analysis is used to study the association between risk factors and time to an event (e.g., death). Typically, in GSEA, the log-fold change in expression between treatments is used to rank genes in order to determine whether a biological function or pathway has a non-random distribution of altered gene expression. Instead, we propose a survival-based gene set enrichment analysis that helps determine which biological functions are associated with survival in a disease. We are developing an R package for this analysis and present a study of kidney renal clear cell carcinoma (KIRC) to demonstrate the approach. This approach begins by ranking all genes according to their log-hazard ratios, which quantify the association between gene expression and survival and are calculated with a Cox proportional hazards model, and then determines whether any gene sets are significantly overrepresented toward the top or bottom of that ranked list. By focusing on sets of genes having significantly larger log-hazard ratios, our results confirm that this survival-based approach can identify important biological pathways or functions associated with disease survival. For example, the top three pathways significantly enriched among KIRC genes with top-ranked log-hazard ratios are Cell Cycle Mitotic, Mitotic Metaphase, and Anaphase, whereas the top three negatively enriched pathways are RHO GTPase cycle, Pyruvate metabolism, and Citric acid cycle. This approach allows researchers to quickly identify disease variant pathways as the basis for further research.
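
The ranking-then-enrichment step can be sketched as follows. The authors are developing an R package; this Python sketch is only an illustration and assumes the per-gene log-hazard ratios have already been estimated (the Cox model fit is not shown). It uses the classic unweighted running-sum statistic as a stand-in, since the abstract does not state the exact enrichment statistic or significance test. Gene names and values are hypothetical.

```python
def enrichment_score(log_hazard_ratios, gene_set):
    """Unweighted running-sum enrichment score over genes ranked by log-hazard ratio.

    log_hazard_ratios maps gene -> log-hazard ratio (assumed precomputed).
    Returns the running-sum value with the largest absolute deviation from zero:
    positive if the set clusters at the top of the ranking, negative if at the bottom.
    """
    ranked = sorted(log_hazard_ratios, key=log_hazard_ratios.get, reverse=True)
    hits = [gene in gene_set for gene in ranked]
    n_hits = sum(hits)
    n_miss = len(ranked) - n_hits
    if n_hits == 0 or n_miss == 0:
        return 0.0
    running, best = 0.0, 0.0
    for is_hit in hits:
        # Step up on set members, down on non-members; both walks sum to zero.
        running += 1.0 / n_hits if is_hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


# Hypothetical log-hazard ratios, invented for illustration only.
lhr = {"G1": 1.2, "G2": 0.9, "G3": 0.1, "G4": -0.3, "G5": -0.8, "G6": -1.1}
es_top = enrichment_score(lhr, {"G1", "G2"})
es_bottom = enrichment_score(lhr, {"G5", "G6"})
```

A set concentrated among the highest log-hazard ratios yields a score near +1, and one concentrated among the lowest yields a score near -1. Assessing significance would additionally require a null distribution, for example from permuting gene labels, which is omitted here.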

Gregory Way, University of Colorado Anschutz Medical Campus, United States
Michelle Mattson-Hoss, Infixion Bioscience, Inc., United States
Herb Sarnoff, Infixion Bioscience, Inc., United States

Neurofibromatosis type 1 (NF1) is an autosomal dominant genetic condition that causes patients to develop benign tumors (plexiform and cutaneous neurofibromas), cognitive and learning deficits, bone dysplasias, and other symptoms. Tumors develop from NF1 haploinsufficient microenvironments driven by neurofibromin-deficient Schwann cells. Our goal is to describe a biomarker for NF1 genotype using Schwann cell morphology readouts. We performed a modified Cell Painting assay, staining for three cellular structures (actin, nuclei, and endoplasmic reticulum) in two isogenic Schwann cell lines, one NF1 wild type and the other NF1 null. We collected fluorescence microscopy images and are currently benchmarking various image analysis pipelines on their ability to derive single-cell morphology features. First, we performed illumination correction to adjust for microscopy artifacts and vignetting. We tested various methods and determined that PyBaSiC was the most effective and user-friendly. Next, we are testing two core pipelines. One method segments single cells using Cellpose 2.0 and extracts feature measurements using DeepProfiler. The other method is standard in our field: using CellProfiler for both segmentation and feature selection. Lastly, we will use CytoSnake to process single-cell morphology features. We will apply machine learning to the features derived from both methods to discover which pipeline produces the most consistent and meaningful biomarker of NF1 genotype. In the future, we will use this biomarker to identify drugs that upregulate NF1 expression in NF1 patient Schwann cells so that they structurally resemble cells with a healthy NF1 genotype.

Kathleen R. Mullen, University of Colorado, Anschutz Medical Campus, United States
Nicolas Matentzoglu, Semanticly, Greece
Joseph E. Flack IV, Johns Hopkins University, United States
Harshad Hegde, Lawrence Berkeley National Laboratory, United States
Peter N. Robinson, The Jackson Laboratory for Genomic Medicine, United States
Ada Hamosh, Johns Hopkins University, United States
Melissa Haendel, University of Colorado, Anschutz Medical Campus, United States
Christopher J. Mungall, Lawrence Berkeley National Laboratory, United States
Nicole Vasilevsky, University of Colorado, Anschutz Medical Campus, United States

The wealth of data from research and from human and veterinary health records can be leveraged to support disease diagnosis and treatment discovery. This process requires data standardization, integration, and comparison, which rely on ontologies such as the Mondo Disease Ontology (Mondo) for disease data. Here, we report recent improvements in the representation of non-human animal diseases in Mondo. Mondo represents a hierarchical classification of over 20,000 diseases in humans and across species, covering a wide range of diseases including cancers, infectious diseases, and Mendelian disorders. Diseases are subclassified under the root class ‘disease or disorder’ as ‘human disease or disorder’, ‘infectious disease or post-infectious disorder’, or ‘non-human animal disease’. Mondo leverages logical axioms to apply semantics (meaning) to the terms. We added axioms to indicate the species affected by a disease, and whether a disease affects a single species or several species. Importantly, these semantics make the connection between a non-human disease and its analogous counterpart in humans. In addition, we are improving the coverage of non-human diseases in Mondo by adding new terms represented in veterinary records and animal databases. Existing computational tools have been successful in supporting human precision medicine. However, they have not leveraged the wealth of information from veterinary records and databases because of its lack of standardization. The improved representation of non-human animal diseases in Mondo will support this standardization and will improve the current computational tools for both humans and other animals.

Ruijia Wang, Vor Biopharma, United States
Gabriella Angelini, Vor Biopharma, United States
Michelle Lin, Vor Biopharma, United States
Juliana Ferrucio, Vor Biopharma, United States

CRISPR-Cas9-based gene editing is a powerful approach to improve our ability to treat specific diseases with an unmet medical need. Developing robust cell therapies with genome engineering requires careful assessment of allelism at single cell resolution especially when multiple targets are considered. Recently, droplet-based targeted single cell DNA sequencing (scDNAseq) has been used to genotype selected loci across thousands of cells enabling high-throughput assessment of gene editing efficiency. However, several technical issues must be accounted for including low sequencing depth and PCR amplification bias due to low input DNA in each droplet. These artifacts skew allele read frequencies in the readout which can confound accurate genotyping. We addressed these issues by developing a machine learning method that learns the extent of this skew from single nucleotide polymorphisms (SNPs) across all cells and amplicons. SNPs can be initially identified through pseudo-bulk genotyping and in theory should be detectable in every cell because they occur in the germline. Using this approach, we analyzed scDNAseq data generated from Cas9-edited human hematopoietic stem and progenitor cell (HSPC) samples before and after in vivo transplantation into mouse bone marrow. The model was trained and cross-validated on observed heterozygous and homozygous SNPs with higher accuracy than GATK. When applied to Cas9 editing events, we found the predicted editing consequences and biallelic editing rate of cells to correlate with flow cytometry. These results demonstrate that our computational approach is uniquely suited for genotyping gene editing events at single cell resolution.

Mohammad Vahed, University of Southern California, United States
Nicholas Darci-Maher, University of California, Los Angeles, United States
Kerui Peng, University of Southern California, United States
Jaqueline Brito, University of Southern California, United States
JungHyun Jung, University of Southern California, United States
Anushka Rajesh, University of Southern California, United States
Andrew Smith, University of Southern California, United States
Reid F. Thompson, Oregon Health & Science University, United States
Abhinav Nellore, Oregon Health & Science University, United States
Casey Greene, University of Pennsylvania, United States
Jonathan Jacobs, QIAGEN Digital Insights, United States
Dat Duong, University of California Los Angeles, United States
Eleazar Eskin, University of California Los Angeles, United States
Serghei Mangul, University of Southern California, United States

There is growing evidence that data sharing enables important discoveries across various biomedical disciplines. When data is shared on centralized repositories in easy-to-use formats, other researchers can examine and re-analyze the data, challenge existing interpretations, and test new theories. Additionally, secondary analysis is economically sustainable and can be used in countries with limited resources. Moreover, once a research team publishes critical findings derived from an omics dataset, secondary analysis can play a crucial role in enabling and verifying the reproducibility and generalizability of published results. We have performed a data-driven examination of reuse patterns of public omics data across 2,882,007 open-access biomedical publications (published between 2001 and 2020; across 13,753 journals). Our search included two omics data repositories, NCBI Sequence Read Archive (SRA) and NCBI Gene Expression Omnibus (GEO). We tested the accuracy of this assumption using a subset of datasets for which investigators manually linked their dataset records with PubMed identifiers and found it to be accurate. Considering the data in units of datasets and the number of times they are reused, we found that, except for a few initiatives, omics data is poorly reused: over 59% of the data in GEO and over 70% of the data in SRA is not reused even once. Our study establishes the current state and trends of secondary analysis of omics data and suggests that an easy-to-use format is needed to enable omics data reusability.

George Blanck, University of South Florida College of Medicine, United States
Vayda Barker, University of South Florida College of Medicine, United States

Microtubule-associated protein Tau (MAPT) expression has long been studied in the context of tauopathies and neurodegenerative disorders, but more recently, increased MAPT gene expression has been correlated with increased overall survival in neuroblastoma. Similarly, DNA copy number variations (CNV) have been studied in the cancer setting, especially in the context of regulatory elements common to both proliferation and apoptosis genes, but with limited focus on the role of MAPT. Thus, we hypothesized that copy number (CN) losses of MAPT would be associated with reduced survival in neuroblastoma. The TARGET pediatric neuroblastoma dataset was assessed for CNV of the MAPT gene via processing of whole exome files. Our results demonstrate that case IDs in the bottom 50th percentile of MAPT CN have significantly lower overall survival than case IDs in the upper 50th percentile of MAPT CN (p = 0.0188). Further analysis with RNAseq files revealed that the cases representing the lower 50th percentile of CN ratios also show reduced expression of MAPT (p = 0.00184). Overall, elucidation of the role of MAPT CNV in neuroblastoma could lead to novel methods of establishing patient risk stratifications.

Gamze Gürsoy, Columbia University & New York Genome Center, United States

Human functional genomics data continue to be produced and shared at increasing rates. These datasets are crucial for facilitating data-driven discoveries in biomedicine. Although data are anonymized prior to being shared, there are serious privacy concerns for participants in these studies surrounding both the potential for re-identification and the leakage of sensitive genetic information. Both of these privacy risks are particularly well-documented in bulk RNA sequencing datasets. Genetic information leakage risk is less clear for single-cell and single-nucleus RNA sequencing data, which are increasingly preferred for quantifying gene expression in human tissues. At the read level, existing read sanitization methods are directly applicable to these data prior to dissemination. At the expression level, no studies have quantified information leakage or provided mitigation strategies. Moreover, single-cell RNA sequencing often does not capture the entire tissue-level transcriptomic output and contains considerable technical noise. This makes it difficult to assume that privacy risks identified in studies of bulk data generalize to the single-cell domain, and the extent to which personally identifiable genotypes can be reconstructed from single-cell data is therefore unknown. This is critical to understand, especially when establishing data sharing policies for large-scale single-cell sequencing studies that seek to openly release expression data to the research community. To address this gap, here we explore the relationship between sample-matched single-cell/nucleus RNA sequencing data and the corresponding bulk measurements. We examine the suitability of existing bulk RNA sequencing private information leakage quantification methods when applied to single-cell resolution data. We then explore novel methods for reconstructing genotypes from single-cell expression matrices.

Eliot Bush, Harvey Mudd College, United States

The nucleotide sequences of 16S ribosomal RNA (rRNA) genes have been used to inform the taxonomic placement of prokaryotes for several decades. Whole-genome approaches can better resolve evolutionary relationships of organisms, but these analyses often require computational proficiencies that are uncommon among microbiologists. PHANTASM is a new tool capable of automating these workflows. This tool was designed to work for a wide range of prokaryotes and is the first example of an automated reconciliation of NCBI’s Taxonomy database with that of the List of Prokaryotic names with Standing in Nomenclature (LPSN). In this study, we describe the workflow of PHANTASM and provide several examples of results generated by it. The source code is freely available on GitHub. To make it easier for researchers to access, PHANTASM is also available as a Docker image. While other tools exist to facilitate starting points for these analyses, PHANTASM provides users with a greater degree of control and produces outputs that can be used to make publication-quality figures.

Alyssa Nitz, Brigham Young University, United States
Perry Ridge, Brigham Young University, United States

Thousands of genome-wide association studies (GWAS) have been used to identify ~500,000 single nucleotide variants or polymorphisms (SNPs) correlated with various human traits, including many inherited diseases (i.e., GWAS-SNPs). GWAS-SNPs are common in the study population, generally have very small effect sizes, and in most cases are either not functional or have an unknown function. Efforts to determine GWAS-SNP functions have been mostly unsuccessful. We present a comprehensive analysis of known human SNPs to determine the prevalence of variants that destroy or create a ramp sequence, which could provide a functional explanation for GWAS-SNPs. Ramp sequences are consecutive slowly translated codons at the 5' end of genes that improve translational efficiency by preventing downstream ribosomal collisions. Even a single SNP, whether synonymous or common, can destroy a ramp sequence by trading an inefficient codon for an efficient one. We analyzed the effects of known human SNPs on ramp sequences. We compiled a list of SNPs occurring in coding regions of the genome by downloading dbSNP, annotating all SNPs with ANNOVAR to determine what type of region the SNPs are in (e.g., codons or intergenic), and removing SNPs not in a coding region. Next, using our published algorithm, ExtRamp, we identified which ramp sequences were affected by the SNPs. The human genome contains ~3,000 ramp sequences, and 136,243 SNPs affect these sequences.

Yang Xu, Broad Institute, United States
Rachel McCord, University of Tennessee, United States

Mammalian genomes encode genetic information in linear sequence, but the folding of chromosomes into specific three-dimensional structures is required for the appropriate expression of this genetic information (Gibcus et al., 2013). Hi-C, which measures the frequency of physical contacts between two DNA fragments, infers the average spatial genome organization from a bulk population of cells (Belton et al., 2013). However, recent single-cell Hi-C studies, through profiling genomic contacts at single-cell resolution, have shown that chromatin features in individual cells are not equivalent to the average features in a population of cells (Galitsyna et al., 2021), while some patterns are cell-type specific (Sauerwald et al., 2018; Winick-Ng et al., 2021). Recently, computational methods have been developed for cell-type identification based on various levels of genome structures derived from chromosomal interactions in single cells. Here, we benchmarked three algorithms, which utilized either contact matrices (scHicluster), gene-body associating domains (scGAD), or A/B compartments (scA/B) to infer cell types, using single-cell Hi-C datasets from previous studies. We also sought to determine the sequencing depth required to distinguish different cell types and the importance of each part of the contact map in these analyses. Through this work, we demonstrated that the clustering performance of each method can vary across datasets. We also found that 25k reads per cell was the minimum sequencing depth needed to cluster cells. Finally, our work showed that the diagonal information of the contact matrix was more important for cell type identification from single-cell Hi-C data.

Robyn Ball, The Jackson Laboratory, United States
Vivek Philip, The Jackson Laboratory, United States
Elissa Chesler, The Jackson Laboratory, United States

In differential co-expression analysis, the goal is to understand the effect of treatment or group difference on the coupling or de-coupling of gene expression patterns. We compared the application of three co-expression algorithms: Weighted Gene Co-expression Analysis (WGCNA), paraclique analysis, and Permutation-based Maximum Covariance Analysis (PMCA) using RNA-sequencing of bulk striatum from cocaine-treated (repeated 10 mg/kg i.p. cocaine HCl) and saline-treated controls from Collaborative Cross (CC) mouse populations (134 samples over 34 strains). WGCNA clusters differences of adjacencies, whereas paraclique identifies differentially co-expressed paracliques of genes post-hoc, and PMCA provides a permutation-based false positive rate for each gene association and anti-association, which can then be ranked, thresholded, or clustered to find differential co-expression. In addition to practical considerations that may impact results, such as computational scalability and required thresholding, we assessed these algorithms based on their concordance with and divergence from each other and external expertly curated studies. To assess concordance, we used Jaccard similarities among modules clustered by each algorithm and ranked genes based on their membership in co-expression modules across these algorithms. We characterized these highly concordant modules in relation to cocaine-related curated gene sets in GeneWeaver using hierarchical similarity analyses. Additionally, we examined the divergence of these algorithms’ results using the Jaccard similarity of algorithm-specific gene sets with expertly curated historical studies related to cocaine exposure. These analyses can inform the choice of ideal approaches and parameters for studying differential co-expression. Funded by NIH P50 DA039841 and R01 DA037927 to EJC.
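
The Jaccard-based concordance step described above can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names, the toy modules, and the gene symbols are all hypothetical, and the sketch only shows how a Jaccard similarity between gene modules from two algorithms might be computed and used to pair modules.

```python
def jaccard(a, b):
    """Jaccard similarity |A & B| / |A | B| of two gene sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def best_matches(modules_x, modules_y):
    """For each module in modules_x, return the most similar module in modules_y."""
    result = {}
    for name_x, genes_x in modules_x.items():
        scores = {name_y: jaccard(genes_x, genes_y)
                  for name_y, genes_y in modules_y.items()}
        best = max(scores, key=scores.get)
        result[name_x] = (best, scores[best])
    return result


# Hypothetical toy modules; gene symbols are for illustration only.
wgcna_modules = {"M1": {"Drd1", "Drd2", "Fos"}, "M2": {"Bdnf", "Creb1"}}
paraclique_modules = {"P1": {"Drd1", "Drd2", "Penk"}, "P2": {"Bdnf", "Creb1"}}

matches = best_matches(wgcna_modules, paraclique_modules)
for module, (partner, score) in matches.items():
    print(module, partner, round(score, 2))
```

A high best-match score suggests that two algorithms recovered a similar module, while low scores across the board would point to algorithm-specific gene sets.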

Christopher Mancuso, University of Colorado Anschutz Medical Campus, United States
Kayla Johnson, Michigan State University, United States
Ingo Braasch, Michigan State University, United States
Arjun Krishnan, University of Colorado Anschutz Medical Campus, United States

Most common and rare diseases exhibit staggering heterogeneity in clinical presentation, disease course, and treatment response. By subtyping the pathologic basis of diseases, precision medicine has the potential to offer personalized diagnoses and therapeutic options for individual patients. However, current precision medicine approaches cannot be applied to a broader spectrum of diseases due to the lack of 1) the ability to interpret patient-specific novel mutated genes and 2) strategies for finding the right animal model and gene targets to experimentally characterize disease etiology in individual patients. To address these challenges, we have developed a novel regression-guided coexpression approach that builds patient-specific genome-scale gene networks using hundreds of thousands of existing human transcriptomes, weighted by their relevance to transcriptome data from a single patient, and uses the network to predict patient-specific genes. To prioritize the right research organism and gene targets, we have developed an approach that uses gene homology and multi-species transcriptome data to infer a perturbed gene network in research organisms (e.g., mouse, fly, zebrafish, and worm) that can recapitulate the disease condition in an individual patient. By comparing patient and research organism networks, we can predict the functionally analogous pathogenic genes of patients in research organisms, which could be experimentally investigated further. Our method provides a new framework to discover underlying genes related to common and rare diseases at the individual level, and it helps shorten the timeframe of functional tests by simplifying the selection of a well-suited organism and gene targets to test.

PD-L1 is a critical immune checkpoint protein in tumor cells that suppresses the immune system. Recently, our collaborator Dr. Jianbo Yue found that 6J1, a triazine compound, decreased PD-L1 exosomal secretion and renewed T cell potential in 6J1-treated primary tumors, thereby improving the anticancer efficacy of 6J1 and prolonging mouse survival. Furthermore, a combination of 6J1 and an anti-PD-1 antibody (a promising new therapy in cancer) significantly improved the anticancer immune response when compared with either treatment alone. Hence, manipulating PD-L1 endosomal trafficking may provide a promising means to promote an anticancer immune response in addition to immune checkpoint-blocking antibody therapy. Here, to elucidate possible mechanisms of action, next-generation bulk RNA sequencing was performed on treated and untreated models. Specifically, our data consisted of four treated and untreated murine models of breast cancer. DESeq2 was used to normalize data and perform differential gene expression analysis. We then used REACTOME to identify enriched gene sets and to discover additional biological insights from the most differentially expressed genes. Despite these efforts, we were unable to discover a consistent pattern across all biological replicates that pointed to a single mode of action. This finding suggested that drugs affecting the exosomal secretion processes in tumor cells do not consistently alter transcription. While this approach could have helped provide testable hypotheses for Dr. Yue, further omic analysis of his samples may be required to identify the molecular mechanism driving the positive response to 6J1.

Juan Vargas, MPH Biostatistics, United States
Douglas Fritz, Medical Scientist Training Program, United States

The recent rapid development of spatial transcriptomics (ST) technologies provides new ways to characterize gene expression patterns along with spatial information. Compared to non-spatial single-cell transcriptomics, ST data offer a unique opportunity to unravel both spatial and temporal information simultaneously, which is crucial for understanding the pathogenic cell lineages contributing to disease progression. A few computational machine learning or deep learning algorithms have been developed to identify these spatiotemporal trajectories. However, it is crucial to use an appropriate statistical model to fit overdispersed ST data, a step that is usually neglected in spatiotemporal association analysis. We developed a computational approach to select the best model by benchmarking 7 statistical models for overdispersed ST count data, which provides a sensitive framework and evaluation metric for selecting the model that best fits and predicts the ST data. We also benchmarked performance in identifying spatially aggregated gene signatures that are significantly associated with the identified spatiotemporal trajectories. By applying our framework to ST datasets, we found that Negative Binomial (NB) and Zero-inflated NB outperform Poisson, Quasi-Poisson, Zero-inflated Poisson, Hurdle, and linear mixed-effects models for genes of medium and high variation. All models work equivalently well for low-count genes. Applying our framework to public ST datasets, we are able to reveal genes that are associated with a pre-defined spatiotemporal trajectory and reflect tumor-immune interaction using 10x Visium ST data, and genes that characterize the structures of the mouse hippocampus using Slide-seqV2 ST data.
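
To make the notion of overdispersion concrete, the following minimal sketch computes a variance-to-mean ratio for one gene's counts and a method-of-moments estimate of the Negative Binomial size parameter. It is a simple diagnostic under stated assumptions, not the benchmarking framework of the abstract; the function names and the toy counts are hypothetical.

```python
from statistics import mean, variance


def dispersion_index(counts):
    """Sample variance divided by the mean.

    Close to 1 for Poisson-like counts; well above 1 suggests overdispersion.
    """
    m = mean(counts)
    if m == 0:
        return float("nan")
    return variance(counts) / m


def nb_size_moments(counts):
    """Method-of-moments NB size r, from var = mu + mu**2 / r.

    Returns None when the counts are not overdispersed (var <= mu),
    in which case a Poisson model may already be adequate.
    """
    m = mean(counts)
    v = variance(counts)
    if v <= m:
        return None
    return m * m / (v - m)


# Hypothetical counts for one gene across ten spots (many zeros, a few large values).
gene_counts = [0, 0, 1, 0, 12, 3, 0, 25, 0, 9]
print("dispersion index:", dispersion_index(gene_counts))
print("NB size (moments):", nb_size_moments(gene_counts))
```

A formal comparison would fit each candidate model and compare likelihood-based or predictive criteria; this ratio only flags genes for which a Poisson assumption is likely to be poor.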

Anders Gorm Pedersen, Technical University of Denmark, Denmark
Anders Pedersen, Technical University of Denmark, Denmark

Childhood asthma is the most common reason for hospitalization in early childhood. From epidemiological studies, it is evident that the prevalence is higher in boys than in girls. After puberty, it is more prominent in women than in men. The heritability of childhood asthma is estimated to be between 60 and 90%. This suggests that the genetic components driving the development of childhood asthma have a sex-specific effect. Yet, most association studies do not consider sex in their analysis. In this project, a Bayesian logistic regression model with a variant-sex interaction term was developed in RStan to identify SNPs that have a sex-specific effect on childhood asthma. Discovery studies were conducted in a dataset of 1189 children with severe asthma (2-6 hospitalizations) from the Copenhagen Prospective Studies on Asthma in Childhood (COPSAC) and 5094 non-asthmatic controls. Twenty variants have a posterior probability of interaction higher than 76%. A subset of individuals with severe asthma (6+ hospitalizations, 372 individuals) suggests 17 variants with a posterior probability higher than 80% of having a sex interaction, of which two are located in the genes IL1R1 and CLEC16A, previously known to be associated with asthma, and 4 of the top 9 interacting variants are expressed in lung tissue. Sex-stratified analysis confirms the sex-specific effect in both data sets. The suggested variants do not replicate in UK Biobank (5581 cases, 88094 controls), which might indicate that the definitions of the asthma phenotype in the two data sets are too different. Further replication is planned in the iPSYCH data set.
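
The role of a variant-sex interaction term in a logistic model can be sketched as follows. This is only an illustration of the linear predictor, not the RStan model of the abstract: the coefficient names, the 0/1 coding of sex, the additive genotype coding (0, 1, or 2 alleles), and the numeric values are all assumptions, and no priors or posterior sampling are shown.

```python
import math


def asthma_probability(genotype, sex, b0, b_g, b_s, b_gs):
    """Logistic model with a genotype-by-sex interaction.

    logit(p) = b0 + b_g * genotype + b_s * sex + b_gs * genotype * sex
    genotype: allele count (0, 1, 2); sex: 0 or 1 (coding is an assumption).
    """
    eta = b0 + b_g * genotype + b_s * sex + b_gs * genotype * sex
    return 1.0 / (1.0 + math.exp(-eta))


def odds_ratio_per_allele(sex, b_g, b_gs):
    """Per-allele odds ratio within one sex: exp(b_g + b_gs * sex)."""
    return math.exp(b_g + b_gs * sex)


# Hypothetical coefficients, chosen only to show the effect of the interaction.
b_g, b_gs = 0.1, math.log(2)
print("OR per allele, sex=0:", odds_ratio_per_allele(0, b_g, b_gs))
print("OR per allele, sex=1:", odds_ratio_per_allele(1, b_g, b_gs))
```

When `b_gs` is zero the per-allele odds ratio is the same in both sexes; a posterior for `b_gs` concentrated away from zero is what would indicate a sex-specific effect.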

Fabio Boniolo, Technical University of Munich, Munich, Germany, Germany
Norman Roggendorf, Technical University of Munich, Germany
Bahar Tercan, Institute for Systems Biology, Seattle, WA, USA, United States
Jan Baumbach, Chair of Computational Systems Biology, University of Hamburg, Hamburg, Germany, Germany
Mauro A. A. Castro, Universidade Federal do Paraná, Curitiba, Brazil, Brazil
A. Gordon Robertson, Dxige Research Inc., Canada
Dieter Saur, Technical University of Munich, Germany
Markus List, Technical University of Munich, Germany

Cancer is one of the leading causes of death worldwide. Despite significant improvements in prevention and treatment, mortality remains high for many cancer types. Hence, innovative methods that use molecular data to stratify patients and identify biomarkers are needed. Promising biomarkers can also be inferred from competing endogenous RNA (ceRNA) networks that capture the gene-miRNA-gene regulatory landscape. Thus far, the role of these biomarkers could only be studied globally but not in a sample-specific manner. To mitigate this, we introduce spongEffects, a novel method that infers subnetworks (or modules) from ceRNA networks and calculates patient- or sample-specific scores related to their regulatory activity. We show how spongEffects can be used for downstream interpretation and machine learning tasks such as tumor classification and for identifying subtype-specific regulatory interactions. In a concrete example of breast cancer subtype classification, we prioritize modules impacting the biology of the different subtypes. In summary, spongEffects prioritizes ceRNA modules as biomarkers and offers insights into the miRNA regulatory landscape. Notably, these module scores can be inferred from gene expression data alone and can thus be applied to cohorts where miRNA expression information is lacking.

Martina Poletti, Earlham Institute, United Kingdom
Johanne Brooks-Warburton, University of Hertfordshire, United Kingdom
Balazs Bohar, Eötvös Lorand University, Hungary
Matthew Madgwick, Earlham Institute, United Kingdom
Bram Versockt, KU Leuven, Belgium
Marc Ferrante, University Hospitals Leuven, Belgium
Séverine Vermeire, University Hospitals Leuven, Belgium
Simon Carding, Quadram Institute, United Kingdom
Tamas Korcsmaros, Imperial College, United Kingdom

Systems biology approaches such as network propagation are necessary to understand the pathomechanisms of complex diseases and the concerted effects of regulatory genomic alterations. Inflammatory bowel disease (IBD) is a complex disease causing continuous painful inflammation in the gastrointestinal tract. IBD consists of two major diseases, ulcerative colitis and Crohn’s disease. We investigated IBD to highlight the effectiveness of network propagation in complex diseases. We collected patient-specific immunochip data from 941 ulcerative colitis and 1965 Crohn’s disease patients in the Leuven IBD Biobank, and then selected the IBD-associated single nucleotide polymorphisms (SNPs) from these patients. We mapped the regulatory SNPs to the directed protein-protein interaction network from OmniPath using our published iSNP pipeline. In the next step, we used a kernel for directed graphs generated by a modified HotNet2 algorithm, placing a unit of heat at each SNP-affected source protein. If the heat reached a transcription factor, it was propagated one step further using the DoRothEA transcription factor-target gene network. To exclude non-IBD-related hub proteins, we ran the heat propagation 1000 times using the same number of proteins as seeds. This resulted in a distribution of heat for each protein. We used a Z-score transformation to identify the significantly affected proteins: a protein with Z > 2 was considered affected. We found IBD-related Gene Ontology functions in both ulcerative colitis and Crohn’s disease. This study further shows that network propagation can be a useful method to uncover hidden disease-related proteins in complex diseases.
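
The Z-score filtering step can be sketched as below, assuming the heat values from the observed run and from the repeated runs are already available as dictionaries. The function names, protein labels, and heat values are hypothetical, and the heat propagation itself (the modified HotNet2 kernel) is not reproduced here.

```python
from statistics import mean, stdev


def heat_z_scores(observed, null_runs):
    """Z-score of each protein's observed heat against its null distribution.

    observed: dict mapping protein -> heat from the real SNP-affected seeds.
    null_runs: list of dicts (one per repeated run), each protein -> heat.
    Proteins with zero variance in the null runs are skipped.
    """
    z = {}
    for protein, heat in observed.items():
        null = [run.get(protein, 0.0) for run in null_runs]
        sd = stdev(null)
        if sd == 0:
            continue
        z[protein] = (heat - mean(null)) / sd
    return z


def affected_proteins(z_scores, threshold=2.0):
    """Proteins whose Z-score exceeds the threshold (Z > 2 in the abstract)."""
    return sorted(p for p, score in z_scores.items() if score > threshold)


# Hypothetical heats: "HUB" is hot in every run, "A" only with the real seeds.
observed = {"A": 0.9, "HUB": 0.8}
null_runs = [
    {"A": 0.1, "HUB": 0.7},
    {"A": 0.2, "HUB": 0.9},
    {"A": 0.1, "HUB": 0.8},
    {"A": 0.2, "HUB": 0.8},
]
z = heat_z_scores(observed, null_runs)
print(affected_proteins(z))
```

In this toy case the hub protein receives similar heat regardless of the seeds, so its Z-score stays near zero and it is filtered out, which is the purpose of the repeated runs.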

Viet Huan Le, Taipei Medical University, Taiwan, China
Truong Nguyen Khanh Hung, Cho Ray Hospital, Viet Nam
Ngan Thi Kim Nguyen, National Taiwan Normal University, Taiwan, China
Nguyen Quoc Khanh Le, Taipei Medical University, Taiwan, China

Background: Possible drug-food constituent interactions (DFCIs) could change the intended efficiency of therapeutics in medical practice, which may lead to adverse events for patients, even death. However, the importance of DFCIs remains underestimated, as the number of studies on these topics is limited. Recently, scientists have applied artificial intelligence-based models to forecast DFCIs. However, the previous studies lacked reproducibility, or their performance was not good enough to recognize DFCIs in clinical practice. Therefore, this study proposed a novel prediction model to address the limitations of the preceding work and enhance the accuracy of DFCI detection. Methodology: From 70,477 foods (FooDB_1.0) and 13,580 drugs (DrugBank_5.1.7), our benchmark dataset contained 2,263,426 DFCIs (negative, positive, and non-significant), of which 50% were used for training, 37.5% for hyper-parameter tuning, and 12.5% for testing. The external test set included 1,922 DFCIs from a previous study. We used a four-step feature selection process to retain only the eighteen most important features. eXtreme Gradient Boosting (XGBoost) was the optimal model among the five investigated algorithms (all were five-fold cross-validated and hyper-parameter tuned). Results: The XGBoost model achieved a predictive accuracy of 97.56% on the unseen data of the external test set. Finally, we applied our model to recommend whether a drug should or should not be taken with some food compounds based on their interactions. Conclusion: Our model, with its simplicity and high accuracy, can help doctors and patients avoid the adverse effects of DFCIs and improve treatment efficiency.
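
As a worked illustration of the 50% / 37.5% / 12.5% partition, the sketch below computes subset sizes for the 2,263,426 benchmark DFCIs. The helper name and the rounding rule (floor each subset, give the remainder to the last one) are assumptions; the abstract does not state how fractional counts were handled.

```python
def split_sizes(n_total, fractions):
    """Subset sizes for the given fractions.

    All but the last subset are floored; the last subset takes the remainder
    so that the sizes always sum to n_total (an assumed rounding rule).
    """
    sizes = [int(n_total * f) for f in fractions[:-1]]
    sizes.append(n_total - sum(sizes))
    return sizes


n_dfci = 2_263_426
train, tune, test = split_sizes(n_dfci, (0.5, 0.375, 0.125))
print(train, tune, test)
```

Under this rounding rule the three subsets contain 1,131,713, 848,784, and 282,929 interactions respectively.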

Mayla Boguslav, University of Colorado, Anschutz Medical Campus, United States
Lawrence Hunter, University of Colorado, Anschutz Medical Campus, United States

Although the focus of much natural language processing work has been on information extraction from text, documents also contain statements of desired, but as yet unknown, information; we call these “ignorance statements”. Recent work in identifying and categorizing such statements [1] has resulted in the creation of an “ignorance-base”, a collection of such statements gleaned from the peer-reviewed biomedical literature. We propose a novel use for the ignorance-base: presenting scientists with previously published statements of desired knowledge that might be relevant to their own novel results. The method is related to ontology-term enrichment approaches [2], but instead looks for knowledge goals that might be related to a set of results. We anticipate that such enrichment analysis could facilitate interdisciplinary cross-fertilization, as the suggested literature statements might be unknown to the researcher who produced the data, or even come from different areas of research altogether. In this work, we introduce a hybrid approach that utilizes two artificial intelligence (AI) techniques: (a) concept mapping using knowledge graphs (KGs), and (b) ignorance extraction using NLP classifiers. This hybrid system connects the input data to their associated Open Biological and Biomedical Ontologies (OBO) within KGs. Such OBOs will be used to extract the most relevant articles from the ignorance-base to identify desired knowledge that can be answered by newly discovered results. References: [1] M. Boguslav et al., Bioinformatics Advances, vol. 1, no. 1, p. vbab012, 2021, doi: 10.1093/bioadv/vbab012. [2] H. Tipney and L. Hunter, Human Genomics, vol. 4, no. 3, p. 202, 2010, doi: 10.1186/1479-7364-4-3-202.

POSTER PRESENTATIONS