Goals, Roles, and Watering holes: A playbook for the leading edge of team science

Presenting Author: Julie McMurry, University of Colorado, Anschutz Medical Campus

Sarah Gehrke, CU Anschutz
Anh Nguyen, CU Anschutz
Anita Walden, CU Anschutz
Sruthi Magesh, CU Anschutz
Julie Bletz, Sage
Anne Thessen, CU Anschutz
Shawn O’Neil, CU Anschutz
James Eddy, Sage
Monica Munoz-Torres, CU Anschutz
Melissa Haendel, CU Anschutz
Kaitlin Flynn, Sage


The most pressing biomedical challenges of our time require collaboration across disciplinary and institutional boundaries. Over the last two decades it has become clearer how to approach this successfully; however, there is often little infrastructure and few resources available for applying known team science best practices to data-intensive research. Further, methods for evaluating collaboration, and the multivariate effects of individual and team characteristics on collaboration efficacy, remain active areas of research. In our experience on dozens of such projects (and based on years of team science research literature), the most successful programs have clear governance, a shared understanding of goals, and incentives aligned to those goals. This healthy triad (Goals, Roles, and Incentives) is in turn supported by sound operational infrastructure. We share our experiences and resources in creating and supporting successful transdisciplinary collaborations, from strategies for building healthy collaborative communities to technological support for knowledge exchange and resource sharing. Our playbook includes collaborative agreements to foster sound governance, training and guidance around team science, and approaches for evaluating team health; together these support the vital work of transdisciplinary science.

Realizing the potential of secure and decentralized harmonization of clinical and genomics data for precision medicine

Presenting Author: Ahmed Elhussein, Columbia University

Gamze Gursoy, Columbia University


Precision medicine has the potential to provide more accurate diagnosis, appropriate treatment and timely prevention strategies by considering patients’ biological makeup. However, this cannot be realized without integrating clinical and omics data in a data sharing framework that achieves large sample sizes. Because of distinct data types and privacy and data-ownership concerns, integrated clinical and genetic data systems are lacking, leading to missed opportunities. Here we present a secure framework that harmonizes the storage and querying of clinical and genomic data using blockchain technology. Our platform combines clinical and genomic data under a unified framework using novel data structures. It supports combined genotype-phenotype queries, gives institutions decentralized control of their data, and provides user access logs, improving transparency into how and when health information is used. We demonstrate the value of our framework for precision medicine by creating genotype-phenotype cohorts and examining relationships within them. We show that combining data across institutions using our secure platform increases statistical power, enabling discovery of novel connections between genetics and clinical observations in Amyotrophic Lateral Sclerosis. Overall, by providing an integrated, secure and decentralized framework, we envision more communities can participate in data sharing to advance medical discoveries and enhance reproducibility.

iAtlas: an Open-Source, Interactive Portal and Platform for Immuno-Oncology Research

Presenting Author: James Eddy, Sage Bionetworks

Carolina Heimann, Institute for Systems Biology
Ilya Shmulevich, Institute for Systems Biology
David Gibbs, Institute for Systems Biology
Vesteinn Thorsson, Institute for Systems Biology
Andrew Lamb, Sage Bionetworks
Yooree Chae, Sage Bionetworks
Amy Heiser, Sage Bionetworks
Dante Bortone, University of North Carolina
Benjamin Vincent, University of North Carolina
Sarah Dexheimer, University of North Carolina
Steven Vensko, University of North Carolina


The Cancer Research Institute (CRI) iAtlas (www.cri-iatlas.org) is a platform for interactive data exploration and discovery in immuno-oncology (IO), originating in a pan-cancer working group study by The Cancer Genome Atlas. At present, iAtlas provides 17 analysis modules to explore immune-cancer interactions, immunotherapy treatment, and outcomes in over 12,000 participant samples.

To meet evolving demands of data heterogeneity and volume, we have built a scalable, cloud-hosted relational database and a GraphQL-based API layer — both of which are driven by a thoroughly documented and standards-compliant data model. We provide an R client library for bioinformaticians and analysts who wish to programmatically explore and access data. The app itself (built using the Shiny framework) leverages the API client for data query and retrieval to drive visualizations and other functionality, including a rich selection of filters and conditions for working with custom cohorts.

We continue to expand the breadth of data in iAtlas, including new immune checkpoint inhibition (ICI) trials with accompanying outcome data, as well as large-scale cancer-omics datasets such as the Pan-Cancer Analysis of Whole Genomes and the Human Tumor Atlas Network. For each, we have robust pipelines for data processing, encoded as CWL or Nextflow workflows — all of which are fully open and reusable.

We aim to make iAtlas a platform that both democratizes analysis for IO researchers and enables contributions from tool developers or data scientists. Everything we build is intended to be as FAIR as possible to advance and accelerate discovery in combating cancer.

Tissue-adjusted pathway analysis of cancer

Presenting Author: Rob Frost, Dartmouth College


We describe a novel single sample pathway analysis method for cancer transcriptomics data named tissue-adjusted pathway analysis of cancer (TPAC). The TPAC method leverages information about the normal tissue-specificity of human genes to compute a robust multivariate distance score that quantifies pathway dysregulation in each profiled tumor. Because the null distribution of the TPAC scores has an accurate gamma approximation, both population- and sample-level inference is supported. As we demonstrate through an analysis of gene expression data from The Cancer Genome Atlas (TCGA), TPAC pathway scores are more strongly associated with both patient prognosis and tumor stage than the scores generated by existing single sample pathway analysis methods.
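The idea of a multivariate distance score with a gamma-approximated null can be illustrated with a toy sketch. This is not the published TPAC implementation: the data, sizes, and the Mahalanobis-style score below are illustrative assumptions, with the gamma null fit by moment matching.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Toy data: expression of 20 pathway genes in 100 "normal" reference
# samples and 30 tumors (sizes and distributions are illustrative).
normal = rng.normal(0.0, 1.0, size=(100, 20))
tumors = rng.normal(0.6, 1.0, size=(30, 20))

mu = normal.mean(axis=0)
cov = np.cov(normal, rowvar=False) + 1e-6 * np.eye(20)
cov_inv = np.linalg.inv(cov)

def distance_score(x):
    """Mahalanobis-style pathway dysregulation score for one sample."""
    d = x - mu
    return float(d @ cov_inv @ d)

# Null scores from the reference samples, then a moment-matched gamma
# approximation of the null, mirroring TPAC's gamma-based inference.
null = np.array([distance_score(s) for s in normal])
shape = null.mean() ** 2 / null.var()
scale = null.var() / null.mean()
pvals = stats.gamma.sf([distance_score(t) for t in tumors],
                       a=shape, scale=scale)
```

In this sketch the tumors are shifted away from the reference distribution, so their scores exceed the null and yield small upper-tail p-values.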

Alignment-free phylogenetic method unveils the pan-mammalian regulatory motif adaptations underlying extended lifespan

Presenting Author: Elysia Saputra, University of Pittsburgh

Ali Tugrul Balci, University of Pittsburgh
Nathan Clark, University of Utah
Maria Chikina, University of Pittsburgh


Understanding the genomic underpinnings of longevity is crucial for preventing age-related pathologies. Mammalian lifespans vary enormously, and extended lifespan has evolved repeatedly, making longevity a convergent phenotype that can be studied with comparative methods. While morphological diversity is increasingly attributed to changes in non-coding regulatory elements (REs) rather than in protein sequences, an analytical phylogenetic framework tailored to the functional and structural properties of REs is lacking. Here, we develop AFconverge, an ‘alignment-free’ phylogenetic method that predicts the patterns of regulatory motif adaptations underlying phenotypic evolution. By modularly computing the phenotypic association of transcription factor (TF) binding motifs in REs, AFconverge introduces new and flexible paradigms for deciphering the complexity of regulatory adaptations at multiple scales. Applying AFconverge to study promoter adaptations underlying mammalian longevity, we find widespread gains and losses of motifs that are consistent with known associations of longevity with pluripotency maintenance, circadian regulation, immunity, and dietary restriction. We also characterize 28 gene families involved in pluripotency regulation and germline development that exhibit family-specific motif selection patterns. Additionally, AFconverge’s TF-centric signals enable inference on latent factors underlying the observed promoter-motif selection patterns genome-wide, revealing that promoter-motif selection underlying longevity is strongly driven by mechanisms in stem and progenitor cells of germlines, liver, adipose tissues, connective tissues, and cardiovascular and immune systems. Finally, we show that promoters implicated in mTOR signaling, insulin signaling, extracellular matrix regulation, and cancer evolve under relaxed constraint in long-lived mammals. Thus, AFconverge offers new, powerful approaches for interrogating how selection acts on regulatory machinery.

A robust computational framework to benchmark spatiotemporal trajectory association analysis in single-cell spatial transcriptomics

Presenting Author: Fan Zhang, University of Colorado School of Medicine

Juan Vargas, MPH Biostatistics
Douglas Fritz, Medical Scientist Training Program


The recent, rapid development of spatial transcriptomics (ST) technologies provides new ways to characterize gene expression patterns along with spatial information. Compared to non-spatial single-cell transcriptomics, ST data offer a unique opportunity to unravel spatial and temporal information simultaneously, which is crucial for understanding the pathogenic cell lineages contributing to disease progression. A few computational machine learning or deep learning algorithms have been developed to identify these spatiotemporal trajectories. However, it is crucial to use an appropriate statistical model to fit overdispersed ST data, a consideration usually neglected in spatiotemporal association analysis. We developed a computational approach to select the best model by benchmarking seven statistical models for overdispersed ST count data, providing a sensitive framework and evaluation metric for selecting the model that best fits and predicts the ST data. We also benchmarked performance in identifying spatially aggregated gene signatures that are significantly associated with the identified spatiotemporal trajectories. Applying our framework to ST datasets, we found that Negative Binomial (NB) and Zero-inflated NB outperform Poisson, Quasi-Poisson, Zero-inflated Poisson, hurdle, and linear mixed-effect models for genes of medium and high variation; all models work equally well for low-count genes. Applying our framework to public ST datasets, we reveal genes that are associated with a pre-defined spatiotemporal trajectory and reflect tumor-immune interaction in 10x Visium ST data, and genes that characterize the structures of the mouse hippocampus in Slide-seqV2 ST data.
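The core model-selection question — whether a Poisson or negative binomial distribution better fits overdispersed counts — can be sketched with a likelihood/AIC comparison. This is a toy illustration, not the authors' benchmarking framework: the simulated counts and the method-of-moments NB fit are assumptions.

```python
import numpy as np
from scipy import stats

# Toy overdispersed ST counts for one gene: NB with mean 8, variance 40
# (the gene and sample size are illustrative).
counts = stats.nbinom.rvs(2, 0.2, size=500, random_state=1)

# Poisson fit: the MLE of the rate is the sample mean (1 parameter).
lam = counts.mean()
aic_pois = 2 * 1 - 2 * stats.poisson.logpmf(counts, lam).sum()

# Negative binomial fit via method of moments (2 parameters);
# requires variance > mean, i.e., overdispersion.
m, v = counts.mean(), counts.var()
r = m**2 / (v - m)
p = r / (r + m)
aic_nb = 2 * 2 - 2 * stats.nbinom.logpmf(counts, r, p).sum()
```

For overdispersed data like these, the NB's lower AIC flags it as the better-fitting model despite its extra parameter, which is the pattern the abstract reports for medium- and high-variation genes.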

LochNESS: a novel statistic to quantify sample specific effects in single-cell data based on k-NN graphs

Presenting Author: Xingfan Huang, University of Washington

Jay Shendure, University of Washington


With the advent of single-cell RNA-seq data with complex experimental designs, profiling transcriptomes from various genotypes or tissue regions, detecting changes due to sample-specific conditions in large atlases becomes a computational challenge. We developed “lochNESS” (local cellular heuristic Neighborhood Enrichment Specificity Score), a statistic quantifying the relative enrichment or depletion of samples in the transcriptional neighborhood of each cell. Briefly, we build a k-NN graph and calculate the ratio of the observed vs. expected number of cells from each sample among each cell’s k nearest neighbors.

We applied lochNESS to a developmental mouse mutant atlas to identify genotype-specific biases, focusing on Gli2 knockout mutants. In Gli2 mutants, cells of the floor plate, a structure at the base of the neural tube, exhibited low lochNESS, indicating loss of the floor plate in mutants. LochNESS also revealed a mutant-enriched group of roof plate cells marked by Ttr, a marker for the choroid plexus, and a mutant-depleted group marked by Wnt signaling-related genes. These relatively subtle effects were missed by other strategies and only uncovered by the granularity of lochNESS analysis.

We also applied lochNESS to an adult macaque brain atlas to identify brain region-specific transcriptional programs. LochNESS revealed groups of region-specific cells and, in a focused analysis, identified markers for astrocytes in specific regions (TCAF2 and FRK in the occipital lobe) and combinations of regions (PGD in the brainstem, basal ganglia, and thalamus) that we would not have identified using conventional methods. Overall, lochNESS identifies nuanced sample-specific differences, enabling biological discovery in complex single-cell datasets.
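The observed-vs-expected neighborhood ratio can be sketched in a few lines. This is a conceptual toy, not the authors' implementation: the 2D "embedding," sample labels, and brute-force k-NN search below are assumptions standing in for a real transcriptional k-NN graph.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy embedding: sample "A" spread broadly, sample "B" concentrated in
# one region (mimicking a genotype-enriched neighborhood).
coords = np.vstack([rng.normal(0, 1.0, size=(100, 2)),
                    rng.normal(5, 0.5, size=(50, 2))])
sample = np.array(["A"] * 100 + ["B"] * 50)

def lochness_sketch(coords, sample, target, k=15):
    """Observed/expected fraction of `target`-sample cells among each
    cell's k nearest neighbors (a sketch of the lochNESS idea)."""
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)          # exclude each cell itself
    knn = np.argsort(d, axis=1)[:, :k]   # k nearest neighbors per cell
    observed = (sample[knn] == target).mean(axis=1)
    expected = (sample == target).mean() # global fraction of `target`
    return observed / expected

scores = lochness_sketch(coords, sample, "B")
```

Cells inside the B-enriched region score well above 1 (enrichment), while cells in the A-dominated region score near 0 (depletion), mirroring how lochNESS highlights sample-specific neighborhoods.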

C3PO: Combined polygenic protein prediction in oncology

Presenting Author: Matthew Bailey, Brigham Young University


Polygenic risk scores have recently emerged as a credible statistical strategy for aggregating the minor cumulative effects of multiple genomic markers to predict disease risk. Polygenic risk statistics require two fundamental measurements: i) a genetic mutation and ii) an estimated effect of that mutation on disease status. The cumulative weighted sum of all mutations and effects can be calculated for each sample and captures two fundamental measures: first, the score represents the risk of a person developing a disease; second, it estimates how much genomic variability contributes to that risk. Polygenic risk mathematics has many appealing features for multi-omics modeling in cancer. Here we outline a tool called Combined Polygenic Protein Predictor in Oncology (C3PO), in which we apply polygenic risk scores to build a protein prediction model using multi-omics measurements in cancer. The fundamental algorithm of our tool leverages polygenic risk score theory to quantify the somatic contribution to variable protein abundances. Understanding the genomic architecture of protein abundance in cancer will improve our understanding of the consequences of driver alterations on the proteome and allow us to generate novel hypotheses for drug response and gene-by-environment interactions. Preliminary findings and applications of our multi-omics modeling tool on CPTAC data provide strategies to handle sparse mutation tables of tumor-specific mutations and predictions for the genomic contribution to altered pathways. Preliminary results also suggest that C3PO can capture proteogenomic effects of environmental factors like smoking and identify drug-response biomarkers in patient-derived cancer models, e.g., xenografts and organoids.
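The cumulative weighted sum at the heart of polygenic-style scoring reduces to a matrix-vector product. The mutation indicators and effect sizes below are made-up illustrations, not C3PO's trained weights.

```python
import numpy as np

# Toy somatic mutation matrix (samples x variants; 0/1 indicators) and
# per-variant effect estimates on a protein's abundance (illustrative).
mutations = np.array([[1, 0, 1],
                      [0, 0, 0],
                      [1, 1, 1]])
effects = np.array([0.5, -1.2, 0.8])

# Polygenic-style score: the cumulative weighted sum of mutations and
# effects, here read as the somatic contribution to protein abundance.
scores = mutations @ effects
```

Sample 1 carries variants 1 and 3 (0.5 + 0.8 = 1.3), sample 2 carries none (0.0), and sample 3 carries all three (0.5 − 1.2 + 0.8 = 0.1), showing how opposing effects can largely cancel in the cumulative sum.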

Sex specific genetic architecture of childhood asthma

Presenting Author: Amelie Fritz, Technical University of Denmark

Anders Gorm Pedersen, Technical University of Denmark
Anders Pedersen, Technical University of Denmark


Childhood asthma is the most common reason for hospitalization in early childhood. Epidemiological studies show that the prevalence is higher in boys than girls; after puberty, it is more prominent in women than men. The heritability of childhood asthma is estimated to be between 60 and 90%. This suggests that the genetic components driving the development of childhood asthma have a sex-specific effect. Yet most association studies do not consider sex in their analysis.

In this project, a Bayesian logistic regression model with a variant-sex interaction term was developed in RStan to identify SNPs that have a sex-specific effect on childhood asthma. Discovery studies were conducted in a dataset of 1189 children with severe asthma (2-6 hospitalizations) from the Copenhagen Prospective Studies on Asthma in Childhood (COPSAC) and 5094 non-asthmatic controls. Twenty variants have a posterior probability of interaction higher than 76%.

A subset of individuals with severe asthma (6+ hospitalizations, 372 individuals) suggests 17 variants with a posterior probability higher than 80% of having a sex interaction, of which two fall within the genes IL1R1 and CLEC16A, both previously associated with asthma, and 4 of the top 9 interacting variants are expressed in lung tissue. Sex-stratified analysis confirms the sex-specific effect in both data sets.

The suggested variants do not replicate in the UK Biobank (5581 cases, 88094 controls), which might indicate that the asthma phenotype is defined too differently in the two data sets. Further replication is planned in the iPSYCH data set.
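The variant-sex interaction model can be sketched outside of Stan. The toy below simulates genotype, sex, and case status, and recovers a MAP estimate of the interaction coefficient under normal priors — a simplified stand-in for the abstract's full RStan posterior inference; all sizes and true effects are assumptions.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
n = 2000

# Simulated cohort: genotype dosage g (0/1/2), sex (0/1), and a true
# variant-sex interaction of +1.0 on the log-odds scale (illustrative).
g = rng.integers(0, 3, n).astype(float)
sex = rng.integers(0, 2, n).astype(float)
logit = -1.0 + 0.2 * g + 0.3 * sex + 1.0 * g * sex
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(float)

# Design matrix: intercept, variant, sex, variant-sex interaction.
X = np.column_stack([np.ones(n), g, sex, g * sex])

def neg_log_posterior(beta):
    z = X @ beta
    ll = np.sum(y * z - np.logaddexp(0, z))   # Bernoulli log-likelihood
    lp = -np.sum(beta**2) / (2 * 2.0**2)      # Normal(0, 2^2) priors
    return -(ll + lp)

beta_map = minimize(neg_log_posterior, np.zeros(4), method="BFGS").x
```

With 2000 simulated individuals, the MAP estimate of the interaction term (`beta_map[3]`) lands near the true value of 1.0; the full Bayesian fit would additionally yield the posterior probability of interaction reported in the abstract.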

An efficient not-only-linear correlation coefficient based on machine learning

Presenting Author: Milton Pividori, University of Pennsylvania

Marylyn Ritchie, University of Pennsylvania
Diego Milone, Universidad Nacional del Litoral
Casey Greene, University of Colorado


Correlation coefficients are widely used to identify relevant patterns in data. In transcriptomics, genes with correlated expression often share functions or are part of disease-relevant biological processes. Here we introduce the Clustermatch Correlation Coefficient (CCC), an efficient, easy-to-use and not-only-linear coefficient based on machine learning models. We show that CCC can capture biologically meaningful linear and non-linear patterns missed by standard, linear-only correlation coefficients. CCC efficiently captures general patterns in data by applying clustering algorithms and automatically adjusting the model complexity. When applied to human gene expression data, CCC identifies robust linear relationships while detecting non-linear patterns associated with sex differences that are not captured by the Pearson or Spearman correlation coefficients. Gene pairs highly ranked by CCC but not detected by linear-only coefficients showed high probabilities of interaction in tissue-specific networks built from diverse data sources including protein-protein interaction, transcription factor regulation, and chemical and genetic perturbations. CCC is much faster than state-of-the-art not-only-linear coefficients such as the Maximal Information Coefficient (MIC). CCC is a highly efficient, next-generation not-only-linear correlation coefficient that can readily be applied to genome-scale data and other domains across different data types.
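The clustering-based intuition behind a not-only-linear coefficient can be sketched as follows: partition each variable into clusters at several resolutions and score the best agreement between partitions. This is a conceptual toy inspired by CCC's description, not the published algorithm (which uses proper clustering and optimized search); quantile binning and the cluster-count range are assumptions.

```python
import numpy as np

def quantile_partition(v, k):
    """Assign each value to one of k quantile-based clusters."""
    edges = np.quantile(v, np.linspace(0, 1, k + 1)[1:-1])
    return np.searchsorted(edges, v)

def adjusted_rand(a, b):
    """Adjusted Rand index between two cluster labelings."""
    n = len(a)
    ctab = np.zeros((a.max() + 1, b.max() + 1))
    np.add.at(ctab, (a, b), 1)
    comb = lambda x: x * (x - 1) / 2
    s_ij = comb(ctab).sum()
    s_a, s_b = comb(ctab.sum(1)).sum(), comb(ctab.sum(0)).sum()
    expected = s_a * s_b / comb(n)
    return (s_ij - expected) / ((s_a + s_b) / 2 - expected)

def ccc_sketch(x, y, max_k=9):
    """Max partition agreement over cluster counts -- a toy sketch of
    clustering-based, not-only-linear correlation."""
    return max(adjusted_rand(quantile_partition(x, kx),
                             quantile_partition(y, ky))
               for kx in range(2, max_k + 1)
               for ky in range(2, max_k + 1))

x = np.linspace(-1, 1, 200)
linear, quadratic = ccc_sketch(x, 2 * x + 1), ccc_sketch(x, x**2)
```

The sketch scores a linear relationship at 1.0 (identical quantile partitions) and still detects the quadratic pattern y = x², whose Pearson correlation with x is essentially zero — the kind of non-linear signal the abstract says linear-only coefficients miss.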

Survival-based Gene Set Enrichment Analysis

Presenting Author: Jeffrey Thompson, University of Kansas Medical Center

Xiaoxu Deng, University of Kansas Medical Center


Gene set enrichment analysis (GSEA) helps to identify the biological functions that are enriched among up- or down-regulated genes. Survival analysis is used to study the association between risk factors and the time to an event (e.g., death). Typically, in GSEA, the log-fold change in expression between treatments is used to rank genes, in order to determine whether a biological function or pathway has a non-random distribution of altered gene expression. Instead, we propose a survival-based gene set enrichment analysis that helps determine which biological functions are associated with a disease’s survival. We are developing an R package for this analysis and present a study of kidney renal clear cell carcinoma (KIRC) to demonstrate the approach. The approach begins by ranking all genes according to their log-hazard ratios, estimated with a Cox proportional hazards model of the association between gene expression and survival, and then determines whether any gene sets are significantly overrepresented toward the top or bottom of that ranked list. By focusing on sets of genes having significantly larger log-hazard ratios, our results confirm that this survival-based approach can identify important biological pathways or functions associated with disease survival. For example, the top three pathways significantly enriched among KIRC genes with top-ranked log-hazard ratios are Cell Cycle Mitotic, Mitotic Metaphase, and Anaphase, whereas the top three negatively enriched pathways are RHO GTPase cycle, Pyruvate metabolism, and Citric acid cycle. This approach allows researchers to quickly identify disease-associated pathways as the basis for further research.
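The enrichment step — ranking genes by log-hazard ratio and then walking down the list with a weighted running sum — can be sketched directly. The gene names, log-HRs, and gene set below are illustrative; in the proposed method the log-HRs would come from per-gene Cox proportional hazards fits.

```python
import numpy as np

# Genes ranked by Cox log-hazard ratio, highest risk first (toy values).
genes = [f"g{i}" for i in range(20)]
log_hr = np.linspace(1.5, -1.5, 20)          # already sorted, descending
gene_set = {"g0", "g1", "g3", "g4"}          # set concentrated at the top

def enrichment_score(genes, weights, gene_set):
    """GSEA-style weighted running-sum enrichment score over a ranking:
    step up (weighted) at set members, step down at non-members, and
    report the extreme deviation of the running sum."""
    hits = np.array([g in gene_set for g in genes])
    w = np.abs(weights) * hits
    step_hit = np.where(hits, w / w.sum(), 0.0)
    step_miss = np.where(~hits, 1.0 / (~hits).sum(), 0.0)
    running = np.cumsum(step_hit - step_miss)
    return running[np.argmax(np.abs(running))]

es = enrichment_score(genes, log_hr, gene_set)
```

Because the toy set sits among the largest log-hazard ratios, the score is strongly positive; a set concentrated among protective genes (most negative log-HRs) would score negative, corresponding to the negatively enriched pathways in the abstract. Significance would then be assessed by permuting gene labels.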

How do you build a data commons? A case study for Down Syndrome

Presenting Author: Monica Munoz-Torres, University of Colorado Anschutz Medical Campus

Surya Saha, Seven Bridges
Timothy Putman, University of Colorado Anschutz Medical Campus
Pierrette Lo, Sage Bionetworks
Joaquin Espinosa, University of Colorado Anschutz Medical Campus
Adam Resnick, Children’s Hospital of Philadelphia
Brian OConnor, Sage Bionetworks
Jack DiGiovanna, Seven Bridges
Melissa Haendel, University of Colorado Anschutz Medical Campus


The push to address increasingly complex problems of ever-increasing scale has driven investment in data infrastructure designed to overcome social and technical barriers between disciplines and data types. Whereas data repositories typically specialize in a single discipline or data type, data commons aim to break down these siloes to support more innovative, multi-modal, and multi-disciplinary work. Aggregating large amounts of heterogeneous data is only part of what needs to be done – researchers need to be able to use these often large and unwieldy data sets. Data sources, whether individual datasets or mature databases, are not typically designed for interoperability and data reuse, from the perspective of both technical interoperability and legal reuse permissions.

Designing, building, and sustaining data infrastructure that delights users and serves diverse stakeholders is no easy task. In this presentation, we will share ongoing efforts towards building a data commons for the Down syndrome research community – the INCLUDE DCC Data Hub. Ultimately, the goal is for any commons to have the capability of handling complex data requests that require querying across multiple sources and data resources – and this cannot happen without some harmonization of systems, data, and governance across the ecosystem. We will provide a progress update on ongoing work, with details from establishing the need for the data commons ecosystem, to conducting a landscape and requirements assessment, to developing a harmonized model and disseminating the final product through a secure data hub.

Leveraging edge computing for workflow tracking and management to improve academic and healthcare security, efficiency, and auditability

Presenting Author: Caylin Hickey, University of Kentucky

Cody Bumgardner, University of Kentucky
Justin Miller, University of Kentucky


Academic healthcare systems and personnel must work together cohesively and quickly in an ever-evolving environment to provide prompt, precise patient care. Here, we present novel approaches using edge computing to provide a secure, resilient communication network that facilitates task and data movement in a clinical and academic setting.

To improve continuing education for faculty, we developed “check-in” stations using Raspberry Pi nodes with displays and magnetic card readers to digitally track attendance. Attendees scan their badge, which contains only a numeric identifier, and receive visual confirmation of recording. Secure, listener-based communication channels transmit the identifier to an identity lookup function, and the complete identity and swipe are forwarded to an auditable database and the original swipe station for notification.

Using the Cresco edge computing framework, we developed a management system that leverages local and cloud resources to efficiently control the bioinformatics processes in our genomics workflow as data become available. Controller agent(s) instantiate ephemeral resources and communicate task deployment. Processing is initiated from the sequencer and results are automatically downloaded to a local share for review.

This framework reduced genomics workflow processing times from 75 hours to 12 hours, with the ability to scale to cloud-provided capacity. Additionally, our education attendance can be tracked digitally from any room with a Raspberry Pi node and audited from a web-based interface.

Edge computing frameworks provide scalability, improved speed, and increased capacity for common healthcare tasks, improving the efficiency and precision of patient care in both clinical and academic settings.

Systematic study of human diseases using graph signal processing on gene expression datasets

Presenting Author: Renming Liu, Michigan State University


Understanding the genetic basis of human diseases and their associated genes is a vital task. Prior work using a differential gene expression approach on controlled studies has paved the way toward identifying genes related to a disease of interest. Despite a plethora of publicly available gene expression datasets, systematically studying human diseases using all these data is challenging because (1) gene expression is inherently noisy and (2) only a few gene expression datasets are annotated with clear disease annotations. Here, we propose a graph signal processing approach to uncover disease genes and infer disease-study correspondence simultaneously. In particular, we consider each gene expression sample as a signal over the graph defined by functional gene interactions, which allows us to filter and denoise the expression data. The filtered gene expression samples are then used to train a linear model that aims to recapitulate disease-associated genes. For each disease, we consider the gene expression studies that best predict the corresponding disease genes to be related to the disease. Our evaluation results indicate that the proposed approach can recover previously known disease-study associations. Moreover, we show that different frequency information resulting from graph signal filtering serves different purposes; while low-frequency information excels at disease gene predictions, high-frequency information helps better uncover disease-study correspondence.
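Treating an expression profile as a signal on a gene graph and filtering it by frequency can be sketched with a graph Laplacian eigendecomposition. The toy below uses a small path graph as a stand-in for a genome-scale functional interaction network; the signal, noise level, and cutoff are assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)

# Path graph over 50 "genes" as a stand-in for a functional network.
n = 50
A = np.zeros((n, n))
idx = np.arange(n - 1)
A[idx, idx + 1] = A[idx + 1, idx] = 1.0
L = np.diag(A.sum(1)) - A                 # combinatorial graph Laplacian

# Eigenvectors of L form the graph Fourier basis; eigenvalues are the
# graph frequencies (small eigenvalue = smooth/low frequency).
evals, evecs = np.linalg.eigh(L)

# A smooth "expression" signal on the graph plus measurement noise.
signal = np.sin(np.linspace(0, np.pi, n))
noisy = signal + rng.normal(0, 0.4, n)

# Low-pass filter: keep only the 10 lowest graph frequencies.
keep = 10
coeffs = evecs.T @ noisy                  # graph Fourier transform
coeffs[keep:] = 0.0
denoised = evecs @ coeffs                 # inverse transform
```

Because the underlying signal is smooth over the graph while the noise spreads across all frequencies, the low-pass output sits much closer to the true signal — the same intuition behind using low-frequency components for disease gene prediction.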

GeneplexusZoo: Utilizing network information across species to improve supervised gene classification

Presenting Author: Christopher Mancuso, University of Colorado-Denver Anschutz Medical Campus

Kayla Johnson, Michigan State University
Sneha Sundar, Michigan State University
Renming Liu, Michigan State University
Hao Yuan, Michigan State University
Arjun Krishnan, University of Colorado-Denver Anschutz Medical Campus


Network-based machine learning is a powerful approach for leveraging the cellular context of genes to computationally predict novel/under-characterized genes that are functionally similar to a set of known genes of interest. One powerful network-based gene classification method that is gaining popularity is to use supervised learning algorithms where the features for each gene are determined by that gene’s connections in a molecular network. In this work, we explore how networks from multiple species can be jointly leveraged to improve this gene classification method. We first build multi-species networks by connecting nodes (genes/proteins) in different species if they belong to the same orthologous group. Then, we create feature representations by directly considering a gene's connection to all other genes in the entire multi-species network or considering a low-dimensional embedding for the entire network. We find that adding information across species improves performance for the tasks of predicting human and model species gene annotations across a set of non-redundant gene ontology biological processes. In addition to providing better predictions, we show how this approach casts genes across species into the same “space” where they can be used to improve how knowledge is transferred from one species to another.
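The joint-network construction — connecting genes across species when they share an orthologous group, then using a gene's connectivity as its feature vector — can be sketched minimally. All nodes, edges, and the single ortholog pair below are illustrative toys, not real networks.

```python
import numpy as np

# Toy networks: 3 human genes (h0..h2) and 3 model-species genes
# (m0..m2), with one ortholog pair linking the two species.
nodes = ["h0", "h1", "h2", "m0", "m1", "m2"]
i = {g: k for k, g in enumerate(nodes)}

within_edges = [("h0", "h1"), ("h1", "h2"),   # human network
                ("m0", "m1"), ("m1", "m2")]   # model-species network
ortholog_pairs = [("h0", "m0")]               # same orthologous group

# Joint adjacency matrix over the combined node set.
A = np.zeros((len(nodes), len(nodes)))
for u, v in within_edges + ortholog_pairs:
    A[i[u], i[v]] = A[i[v], i[u]] = 1.0

# Feature vector for a gene = its row of the joint adjacency matrix,
# so human genes "see" model-species genes through ortholog edges and
# all genes live in one shared feature space.
features_h0 = A[i["h0"]]
```

In the full method these adjacency-based features (or a low-dimensional embedding of the joint network) feed a supervised classifier, which is what lets labels transfer between species.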

Privacy-preserving prediction of phenotypes from genotypes using homomorphic encryption

Presenting Author: Annie Choi, Columbia University

Seungwan Hong, New York Genome Center
Gamze Gürsoy, Columbia University
Daniel Joo, Columbia University


Finding associations between genetic variants and disease phenotypes via machine learning has been an exploding field of study in recent years. However, statistically significant inferences from these studies require a massive amount of sensitive genotype and phenotype information from thousands of patients, creating concerns related to patient privacy. These concerns are exacerbated when machine learning models themselves leak information about the patients in the training dataset. As a result, privacy concerns are constantly in conflict with the push for widespread access to patient information for research purposes. Homomorphic encryption is a potential solution, as it allows computations to be performed directly on ciphertexts. While many privacy-preserving methods with homomorphic encryption have been developed to address the privacy of input (genotype) and output (phenotype) during inference, none implemented mitigations for model privacy. This is largely due to the need for cleaning and pre-processing of large-scale genotype data, which is computationally challenging when model parameters are encrypted. Here we implemented a privacy-preserving inference model using homomorphic encryption for five different phenotype prediction tasks, where genotypes, phenotypes, and model parameters are all encrypted. Our implementation ensures no privacy leakage at any point during inference. We show that we can achieve high accuracy for all five models (≥ 94% for all phenotypes, equivalent to plaintext inference), with each inference taking less than ten seconds for ∼200 genomes. Our study shows that it is possible to achieve high-quality machine learning predictions while protecting patient confidentiality against membership inference attacks with theoretical privacy guarantees.
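Computing on ciphertexts can be illustrated with a toy additively homomorphic scheme (textbook Paillier). This is not the authors' system: it uses deliberately tiny 12-bit primes, and it only hides the genotypes while the model weights stay in plaintext, whereas the abstract's implementation also encrypts the model parameters. Real deployments use ≥ 2048-bit keys and vetted libraries.

```python
import math
import random

random.seed(0)

# Toy Paillier keypair (illustrative primes; insecure by design).
p, q = 47, 59
n, n2 = p * q, (p * q) ** 2
lam = math.lcm(p - 1, q - 1)
mu = pow(lam, -1, n)                      # lam^{-1} mod n

def encrypt(m):
    while True:
        r = random.randrange(1, n)        # random blinding factor
        if math.gcd(r, n) == 1:
            break
    return (pow(n + 1, m, n2) * pow(r, n, n2)) % n2

def decrypt(c):
    return ((pow(c, lam, n2) - 1) // n * mu) % n

# Additive homomorphism: multiplying ciphertexts adds plaintexts, and
# exponentiating scales them -- enough for an encrypted linear predictor.
genotypes = [0, 1, 2, 1]                  # patient dosages (encrypted)
weights = [3, 1, 2, 5]                    # toy integer model weights
enc_g = [encrypt(g) for g in genotypes]

acc = encrypt(0)
for c, w in zip(enc_g, weights):
    acc = (acc * pow(c, w, n2)) % n2      # acc encrypts sum(w_i * g_i)

prediction = decrypt(acc)                 # equals the plaintext dot product
```

The server computes the weighted sum without ever seeing the genotypes; only the key holder can decrypt the result. Schemes that additionally support encrypted weights (as in the abstract) require homomorphic multiplication as well, e.g., lattice-based schemes.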

Leveraging public transcriptome data with ML to infer pan-body age- and sex-specific molecular phenomena

Presenting Author: Kayla Johnson, Michigan State University


Age and sex are historically understudied factors in biomedical studies even though many complex traits and diseases vary by these factors in their incidence and presentation. Hence, there is a critical need for analytical frameworks that can aid scientists in systematically bridging gaps in understanding age- and sex-specific genetic and molecular mechanisms. Hundreds of thousands of publicly available gene expression profiles present an invaluable, yet untapped, opportunity for addressing this need. However, the bottleneck is that the vast majority of these profiles do not have age and sex labels. Therefore, we first curated ~30,000 samples with age and sex information and then trained machine learning (ML) models to predict these variables from gene expression values. Specifically, we trained one-vs-rest logistic regression classifiers with elastic-net regularization to classify transcriptome samples into age groups, separately for females and males. Overall, the classifiers are able to discriminate between age groups in a biologically meaningful way in each sex across technologies. The weights of these predictive models also serve as ‘gene signatures’ characteristic of different age groups in males and females. We also inferred genome-wide sex-biased genes within each age group. Enrichment analysis of these gene signatures helped us identify age- and sex-associated multi-tissue and pan-body molecular phenomena (e.g. general immune response, inflammation, metabolism, hormone response). Our curated dataset, gene signatures, and enrichment results will be valuable resources to aid scientists in studying age- and sex-specific health and disease processes.
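The modeling setup — one-vs-rest logistic regression with elastic-net regularization over age groups — can be sketched with a small numpy implementation trained by proximal gradient descent. The two-feature data and three "age groups" below are toy stand-ins for thousands of genes and the curated labels; the hyperparameters are assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)

# Toy "transcriptomes": 2 features, 3 well-separated age groups.
means = np.array([[-3.0, 0.0], [0.0, 3.0], [3.0, 0.0]])
X = np.vstack([rng.normal(m, 0.5, size=(50, 2)) for m in means])
y = np.repeat([0, 1, 2], 50)

def fit_ovr_elasticnet(X, y, alpha=0.01, l1_ratio=0.5, lr=0.1, iters=500):
    """One-vs-rest logistic regression with elastic-net regularization,
    trained by proximal gradient descent (a sketch of the approach)."""
    Xb = np.column_stack([np.ones(len(X)), X])
    W = np.zeros((3, Xb.shape[1]))
    for c in range(3):
        t = (y == c).astype(float)        # binary target for class c
        w = W[c]
        for _ in range(iters):
            p = 1 / (1 + np.exp(-np.clip(Xb @ w, -30, 30)))
            grad = Xb.T @ (p - t) / len(t) + alpha * (1 - l1_ratio) * w
            w = w - lr * grad
            # soft-threshold (L1 part); intercept left unpenalized
            w[1:] = np.sign(w[1:]) * np.maximum(
                np.abs(w[1:]) - lr * alpha * l1_ratio, 0.0)
        W[c] = w
    return W

W = fit_ovr_elasticnet(X, y)
Xb = np.column_stack([np.ones(len(X)), X])
accuracy = (np.argmax(Xb @ W.T, axis=1) == y).mean()
```

The fitted weight matrix `W` plays the role of the abstract's per-age-group 'gene signatures': the L1 component zeroes out uninformative features, so the surviving weights indicate which features characterize each group.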

A comprehensive knowledgebase of known and predicted genetic variants associated with COVID-19 severity

Presenting Author: Meltem Ece Kars, Icahn School of Medicine at Mount Sinai, Institute for Personalized Medicine

Meltem Ece Kars, Icahn School of Medicine at Mount Sinai
David Stein, Icahn School of Medicine at Mount Sinai
Peter Stenson, Cardiff University
David Cooper, Cardiff University


Host genetic susceptibility is a key risk factor for severe illness from COVID-19. Despite numerous studies of COVID-19 host genetics, current resources of known COVID-19 severity variants are limited, and there are no available tools to computationally predict novel COVID-19 severity variants. Therefore, we collected 820 variants reported to affect COVID-19 susceptibility through a systematic literature search. We then developed the first machine learning classifier of severe COVID-19 variants to perform a genome-wide prediction of COVID-19 severity for all missense variants. We further evaluated variant-, gene- and protein-level features of high-confidence COVID-19 variants using a feature selection algorithm, which indicated conservation level and variant impact as the most important discriminative features. Lastly, we investigated the pleiotropic effects of COVID-19-associated variants using phenome-wide association studies in ~40,000 genotyped BioMe BioBank individuals, revealing comorbidities that could increase the risk of severe COVID-19. Taken together, our work provides the most comprehensive COVID-19 host genetics knowledgebase to date for the known and predicted genetic determinants of severe COVID-19 and furthers understanding of the biological factors underlying COVID-19 susceptibility.

Abyssinian to Zebu: Classifying Animal Breeds with the Vertebrate Breed Ontology (VBO)

Presenting Author: Kathleen Mullen, University of Colorado Anschutz Medical Campus

Nicolas Matentzoglu, Semanticly
Halie Rando, University of Colorado
Nicole Vasilevsky, University of Colorado
Melissa Haendel, University of Colorado
Zhi-Liang Hu, Iowa State University
Gregoire Leroy, Food and Agricultural Organization of the United Nations
Imke Tammen, The University of Sydney
Frank Nicholas, The University of Sydney
Sabrina Toro, University of Colorado Anschutz Medical Campus


In the current era of biomedical big data, advances in diagnostics and treatments can leverage a wealth of information from research and health records. This approach requires integrating and comparing data related to genotypes, phenotypes, and diseases from disparate sources. Though species-agnostic, this process is currently optimized for diagnosis and treatment of human patients since the harmonized data mostly comes from human health records and animal model databases. Including all non-human animal data would improve translational research, and expand diagnostic support and treatment discovery for non-human animals. Here, we report the creation of the Vertebrate Breed Ontology (VBO) as a single source for data standardization and integration of all breed names. VBO was created using standard semantic engineering tools including the Ontology Development Kit. Breeds are added to VBO when they are recognized as such by international organizations, communities, and/or experts. VBO metadata can include common names and synonyms, country of existence, breed recognition status, domestication status, breed identifiers/name codes, references in other databases, and description of origin. Provenance of all VBO information is recorded. Currently, livestock and cat breeds are available in VBO, with the addition of dog breeds and animals bred for laboratory purposes underway. The adoption of VBO as the source of breed names in databases and veterinary electronic health records is one step in making information more computable and consistent. This will enhance data interoperability, support data integration and comparison, and ultimately improve diagnosis and treatment for both humans and other animals.

Binding sites prediction of proteins

Presenting Author: Zhong-Ru Xie, University of Georgia

Lei Lou, University of Georgia
Yifei Wu, University of Georgia


Molecular recognition plays a critical role in biological processes. By binding to another biological molecule such as a carbohydrate, DNA/RNA, small molecule, or protein, proteins perform many physiological functions. We developed a computational method to predict ligand binding sites that uses the physicochemical properties of triangles of protein atoms instead of relying on the identification of surface cavities or estimations of binding energy as other methods do. Any three atoms on the surface of the target protein form a triangle. Based on the chemical properties and environment of the three surface atoms, the triangles can be classified into different categories in which they either do or do not prefer to bind to particular types of ligands. By simply calculating the binding propensity scores of different atom triangles of target proteins, this method predicts the binding sites with up to 90% accuracy. We recently integrated the triangle preferences with a deep-learning method to create a machine learning-based DNA binding site prediction algorithm, DeepDISE, and are developing analogous protein-carbohydrate, protein-protein and protein-small molecule binding site prediction algorithms using deep learning technology. Our goal is to create innovative and transformative tools that will aid the research community in elucidating how proteins recognize, modify, and/or regulate their binding targets to conduct their physiological functions.
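The triangle-scoring idea above can be sketched in a few lines: enumerate all three-atom combinations on the surface, look up each triangle's category in a propensity table, and accumulate per-atom scores. This is a toy illustration, not the published algorithm; the categories (here just sorted element triples) and the scores in `PROPENSITY` are invented for demonstration:

```python
from itertools import combinations

# Hypothetical propensity table: triangle category -> binding propensity.
PROPENSITY = {("C", "N", "O"): 1.5, ("C", "C", "O"): 0.4}

def atom_scores(surface_atoms):
    """surface_atoms: list of (atom_id, element) pairs.
    Returns summed binding propensity per atom; high totals flag
    atoms likely to sit in a binding site."""
    totals = {atom_id: 0.0 for atom_id, _ in surface_atoms}
    for triangle in combinations(surface_atoms, 3):   # every 3-atom triangle
        category = tuple(sorted(elem for _, elem in triangle))
        score = PROPENSITY.get(category, 0.0)         # unknown category: neutral
        for atom_id, _ in triangle:
            totals[atom_id] += score
    return totals

atoms = [("a1", "C"), ("a2", "N"), ("a3", "O"), ("a4", "C")]
print(atom_scores(atoms))  # a3 participates in the most favorable triangles
```

In the real method the category depends on the full physicochemical environment of the three atoms, not just their elements.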

On Language Models, Medicine, and Factuality

Presenting Author: Sagi Shaier, CU Boulder

Lawrence Hunter, University of Colorado School of Medicine
Katharina Kann, University of Colorado Boulder


We introduce an algorithm for evaluating the factuality of biomedical language models. In contrast to previously proposed template-based strategies, we use naturally occurring text while still not requiring expert involvement. Our results highlight shortcomings of templates when evaluating factuality, as we find much larger differences between models that are trained from scratch and models that are finetuned on medical text. In addition, and in contrast to previous evaluation methods, our approach indicates that all biomedical models outperform all vanilla models. Finally, we also find that pseudo-perplexity is not correlated with factuality and that more training data does not necessarily result in more factual output.

Metabolites and Markers of Endothelial Dysfunction Help Understand Sepsis Mortality Risk

Presenting Author: Sarah McGarrity, Brigham and Women’s Hospital and Harvard Medical School

Hanne H Henriksen, Copenhagen University Hospital
Sigurður T Karvelsson, University of Iceland
Óttar Rolfsson, University of Iceland
Morten H Bestle, North Zealand Hospital
Pär I Johansson, Capital Region Blood Bank, Copenhagen University Hospital


Sepsis is a major cause of death worldwide, with a mortality rate that has remained stubbornly high. The current gold standard for risk-stratifying sepsis patients provides limited mechanistic insight for therapeutic targeting. An improved ability to predict sepsis mortality and to understand the underlying risk factors would allow better treatment targeting. Sepsis causes metabolic dysregulation in patients; therefore, metabolomics offers a promising tool to study sepsis. It is also known that sepsis affects endothelial cells, altering their function in blood clotting and vascular permeability.

We integrated metabolomics data from patients admitted to an ICU for sepsis with commonly collected clinical features of their cases and two markers of endothelial function relevant to blood vessel function, PECAM and thrombomodulin. First, we performed differential analysis of metabolites in surviving vs non-surviving patients using the LIMMA R package, accounting for patient age and gender. Next, we used penalized regression, enrichment analysis and pathway analysis to identify the features best able to predict 30-day survival. The features important to sepsis survival include TCA cycle metabolites and amino acids, as well as endothelial proteins and medical history. To understand how this relates to other clinical features, we used a combination of penalized regression and correlation analysis; this showed links between medical history and fatty acid metabolites, suggesting that pre-existing metabolic dysregulation may be a contributory factor in the sepsis response.

By exploring sepsis metabolomics data in conjunction with clinical features and endothelial proteins, we have gained a better understanding of sepsis risk factors.

A comprehensive analysis of the reusability of public omics data across 2.8 million research publications

Presenting Author: Mohammad Vahed, University of Southern California

Mohammad Vahed, University of Southern California
Nicholas Darci-Maher, University of California, Los Angeles
Kerui Peng, University of Southern California
Jaqueline Brito, University of Southern California
JungHyun Jung, University of Southern California
Anushka Rajesh, University of Southern California
Andrew Smith, University of Southern California
Reid F. Thompson, Oregon Health & Science University
Abhinav Nellore, Oregon Health & Science University
Casey Greene, University of Pennsylvania
Jonathan Jacobs, QIAGEN Digital Insights
Dat Duong, University of California Los Angeles
Eleazar Eskin, University of California Los Angeles
Serghei Mangul, University of Southern California


There is growing evidence that data sharing enables important discoveries across various biomedical disciplines. When data is shared on centralized repositories in easy-to-use formats, other researchers can examine and re-analyze the data, challenge existing interpretations, and test new theories. Additionally, secondary analysis is economically sustainable and can be used in countries with limited resources. Moreover, once a research team publishes critical findings derived from an omics dataset, secondary analysis can play a crucial role in enabling and verifying the reproducibility and generalizability of published results. We have performed a data-driven examination of the reusability of public omics data across 2,882,007 open-access biomedical publications (published between 2001 and 2020, across 13,753 journals). Our search covered two omics data repositories, the NCBI Sequence Read Archive (SRA) and NCBI Gene Expression Omnibus (GEO). We tested the accuracy of our dataset-to-publication linking using a subset of datasets for which investigators manually linked their dataset records with PubMed identifiers, and found it to be accurate. Considering the data in units of datasets and the number of times they are reused, we found that except for a few initiatives, omics data is poorly reused, and over 59% of the data in GEO, and over 70% of the data in SRA, is not reused even once. Our study establishes the current state and trends of secondary analysis of omics data and suggests that an easy-to-use format is needed to enable omics data reusability.

Creating an Ignorance-Base: Exploring known unknowns in the scientific literature

Presenting Author: Mayla Boguslav, University of Colorado Anschutz Medical Campus

Nourah Salem, University of Colorado Anschutz Medical Campus
Elizabeth White, University of Colorado Anschutz Medical Campus
Katherine J. Sullivan, Colorado Department of Public Health
Stephanie P. Araki, University of Colorado Anschutz Medical Campus
Michael Bada, University of Colorado Anschutz Medical Campus
Teri L Hernandez, University of Colorado Anschutz Medical Campus
Sonia M Leach, National Jewish Health
Lawrence E Hunter, University of Colorado Anschutz Medical Campus


Research progresses through the process of turning unknown unknowns into known unknowns, then into knowns. We aim to illuminate this process by helping students, researchers, funders, and publishers better understand the state of our collective scientific knowledge and ignorance (known unknowns), in order to help accelerate translational research by identifying the questions that need answers in relation to a topic or experimental results. Many knowledge-bases exist, but until now there has been no ignorance-base. We created the first ignorance-base, which captures statements of ignorance (driven by their entailed goal for scientific knowledge) from the scientific literature by combining biomedical concept recognition and ignorance classification. We used the ignorance-base to help a student not only find new avenues to explore based on a topic, but also elucidate how questions are asked and how those change over time. We also used the ignorance-base to help researchers contextualize their gene list in our collective scientific ignorance, finding questions that bear on it from another discipline. For both, we present a new method, ignorance-enrichment, to find biomedical concepts over-represented in ignorance statements, meaning that there is more unknownness in relation to the topic or gene list than expected. While these ideas and methods are generally applicable across biomedical research, we chose to focus on the prenatal nutrition field and literature because of its global public health impact. If we can identify the questions in prenatal nutrition to start, then we can look to other fields and knowledge-bases for answers.

Identify genes associated with gastrointestinal tract adaptation using Evolink

Presenting Author: Xiaofang Jiang, NIH/NLM

Yiyan Yang, NLM


Gastrointestinal (GI) tracts are colonized with abundant and complex bacterial communities that are often distinct from free-living communities of bacteria in the environment. The genetic features and the molecular mechanisms responsible for niche differentiation and host GI tract preference in these bacteria have been understudied, particularly at larger scales. In this study, we developed Evolink, a phylogeny-aware tool for the rapid identification of genotype-phenotype associations across large-scale microbial datasets. Evolink was applied to over 30,000 bacterial species from the Genome Taxonomy Database with habitat metadata annotated in publicly available microbial databases. We identified genes positively associated with GI adaptation shared by Bacteroidota and Firmicutes, including genes involved in responding to host oxidative immune responses, acid resistance, biofilm formation, and xenobiotic/endobiotic metabolism. Negatively associated genes were often related to non-host environmental stress response, such as chemotaxis proteins and superoxide dismutases. We found that genes related to quorum sensing may contribute to GI tract adaptation in Bacteroidota. Additionally, loss of flagellar-related genes in the Firmicutes phylogeny was found to coincide with the emergence of GI-associated clades, indicating that their loss may have a role in GI tract adaptation within this phylum. Overall, our study presents a systematic analytical tool for identifying genotype-phenotype associations and provides a global view of the adaptive strategies employed by different GI-associated bacterial phyla, thus providing a rich resource for further study of this topic.

Development and validation of an explainable machine learning-based prediction model for drug-food constituent interactions from chemical structures

Presenting Author: Quang Hien Kha, Taipei Medical University

Viet Huan Le, Taipei Medical University
Truong Nguyen Khanh Hung, Cho Ray Hospital
Ngan Thi Kim Nguyen, National Taiwan Normal University
Nguyen Quoc Khanh Le, Taipei Medical University


Background: Possible drug-food constituent interactions (DFCIs) could change the intended efficacy of therapeutics in medical practice, which may lead to adverse events for patients, even death. However, the importance of DFCIs remains underestimated, as the number of studies on these topics is limited. Recently, scientists have applied artificial intelligence-based models to forecast DFCIs. However, previous studies lacked reproducibility, or their performance was not good enough to recognize DFCIs in clinical practice. Therefore, this study proposes a novel prediction model to address the limitations of the preceding work and enhance the accuracy of DFCI detection.

Methodology: From 70,477 foods (FooDB_1.0) and 13,580 drugs (DrugBank_5.1.7), our benchmark dataset contained 2,263,426 negative, positive, and non-significant DFCIs, of which 50% were used for training, 37.5% for hyper-parameter tuning, and 12.5% for testing. The external test set included 1,922 DFCIs from a previous study. We used a four-step feature selection process to retain only the eighteen most important features. eXtreme Gradient Boosting (XGBoost) was the optimal model among the five investigated algorithms (all were five-fold cross-validated and hyper-parameter tuned).

Results: The XGBoost model achieved a predictive accuracy of 97.56% on the unseen data of the external test set. Finally, we applied our model to recommend whether a drug should or should not be taken with particular food compounds based on their interactions.

Conclusion: Our model, with its simplicity and high accuracy, can help doctors and patients avoid the adverse effects of DFCIs and improve treatment efficacy.

Metrics and software for assessing microbiota engraftment following Fecal Microbiota Transplant

Presenting Author: Chloe Herman, Northern Arizona University

Liz Gehret, Northern Arizona University
Evan Bolyen, Northern Arizona University
J Gregory Caporaso, Northern Arizona University


Fecal Microbiota Transplant (FMT) is an FDA-approved treatment for recurrent Clostridium difficile infections, and is being explored for other health applications ranging from alleviating digestive and neurological disorders, to priming the microbiome for cancer treatment, and helping restore the microbiome following cancer treatment.

Quantifying the success of FMT, defined as the degree to which FMT alters a recipient’s microbiome, is important in determining whether a recipient didn’t respond because the engrafted microbiome didn’t produce the desired outcomes (a successful FMT, but negative treatment outcome), or because the microbiome didn’t engraft (an unsuccessful FMT and negative treatment outcome). However, there is no consistent methodology for quantifying FMT success.

We have identified core metrics for assessing whether FMT has altered a recipient’s gut microbiome: tracking microbiome richness relative to the donors’ and pre-transplant individuals’ microbiomes; tracking changes in overall microbiome composition as distances to the donor before and after treatment; identifying differentially abundant features before and after the FMT, or between the donors’ and baseline individuals’ samples; detecting features identified in previous studies following FMT; and Proportional Engraftment of Donor Strains (PEDS), which assesses the fraction of donor microbiome strains that were transferred to the recipient.

This talk will present q2-FMT, an open-source QIIME 2 plugin that provides reproducible and well-tested methods for assessing FMT success. QIIME 2 is a widely used platform for microbiome bioinformatics. By implementing these approaches, we hope to move the field toward standardized assessments of engraftment, an important step in using FMT as a therapeutic tool and developing microbiome-based medicine.
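The PEDS metric described above can be illustrated with simple set arithmetic; this is a hedged, presence/absence simplification for illustration (q2-FMT's actual implementation works on strain-level feature tables), with invented strain labels:

```python
# Proportional Engraftment of Donor Strains (PEDS), toy presence/absence
# version: fraction of donor strains (not already present at baseline)
# that are detected in the recipient after FMT.

def peds(donor_strains, recipient_post_fmt, recipient_baseline=frozenset()):
    candidates = set(donor_strains) - set(recipient_baseline)  # donor-specific strains
    if not candidates:
        return 0.0
    engrafted = candidates & set(recipient_post_fmt)           # detected post-FMT
    return len(engrafted) / len(candidates)

donor = {"s1", "s2", "s3", "s4"}
post = {"s1", "s3", "s5"}        # s5 was acquired elsewhere and doesn't count
print(peds(donor, post))          # 2 of 4 donor strains engrafted -> 0.5
```

A PEDS near 1.0 would indicate a successful transplant regardless of clinical outcome, which is exactly the distinction the abstract draws between FMT success and treatment success.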

Exploration of 5'UTR Variation in a Clinical Workflow Evaluating Undiagnosed Rare Disease Patients

Presenting Author: Eric Klee, Mayo Clinic

Bradley Bowles, Mayo Clinic
Rory Olson, Mayo Clinic
Chris Schmitz, Mayo Clinic
Gavin Oliver, Mayo Clinic
Karl Clark, Mayo Clinic


A rare genetic disease affects fewer than 200,000 individuals in the United States. Clinical exome and genome sequencing have improved diagnosis rates for patients afflicted with these rare genetic diseases, and yet most of these patients remain undiagnosed. Innovations in multiomic approaches and DNA analytics have improved outcomes for these patients. Here, we present our initial work investigating another relatively unexplored area of the genome, upstream open reading frames (uORFs).

We collected pathogenic 5’UTR variants from the Human Gene Mutation Database and benign variants from ClinVar. We identified variant interpretation annotations (such as CADD, DANN, or population allele frequency) that best correlated with variant pathogenicity, and an additional set of annotations that correlated specifically with 5’UTR variants altering mRNA translational efficiency. We used these to screen a set of 777 undiagnosed rare disease patients for deleterious 5’UTR variation and functionally characterized the variants using luciferase assays.

Allele frequency, CADD score, and DANN are moderate predictors of 5’UTR variant pathogenicity status, while a variant’s predicted effect on 5’UTR regions and predicted change in transcript ribosome binding ability are associated with translationally disruptive 5’UTR variation. Using these insights, we identified uORF-disrupting 5’UTR variants predicted to alter gene expression in our rare disease patient cohort and experimentally assessed them using in vitro luciferase constructs.

Based on this, we establish a prototype process for identifying pathogenic 5’UTR variation and functionally assessing its effect. These findings can be used to efficiently prioritize 5’UTR variation for manual review and functional characterization.

An ML framework for precision medicine: from patient-specific gene networks to translational animal models

Presenting Author: Hao Yuan, Michigan State University

Christopher Mancuso, University of Colorado Anschutz Medical Campus
Kayla Johnson, Michigan State University
Ingo Braasch, Michigan State University
Arjun Krishnan, University of Colorado Anschutz Medical Campus


Most common and rare diseases exhibit staggering heterogeneity in clinical presentation, disease course, and treatment response. By subtyping the pathologic basis of diseases, precision medicine has the potential to offer personalized diagnoses and therapeutic options for individual patients. However, current precision medicine approaches cannot be applied to a broader spectrum of diseases due to the lack of 1) the ability to interpret patient-specific novel mutated genes and 2) strategies for finding the right animal model and gene targets to experimentally characterize disease etiology in individual patients. To address these challenges, we have developed a novel regression-guided coexpression approach to build patient-specific genome-scale gene networks using hundreds of thousands of existing human transcriptomes weighted based on their relevance to transcriptome data from a single patient, and use the network to predict patient-specific genes. To prioritize the right research organism and gene targets, we have developed an approach that uses gene homology and multi-species transcriptome data to infer a perturbed gene network in research organisms (e.g. mouse, fly, zebrafish, and worm) that can recapitulate the disease condition in an individual patient. By comparing patient and research organism networks, we can predict the functionally-analogous pathogenic genes of patients in research organisms, which could be experimentally investigated further. Our method provides a new framework to discover underlying genes related to common and rare diseases at the individual level, also helping to shorten the timeframe of functional tests by simplifying the process of selecting a well-suited organism and gene targets to test.

Identifying alterations associated with aggressiveness in prostate cancer molecular subtypes

Presenting Author: Michael Orman, University of Colorado

Varsha Sreekanth, UC Anschutz
Scott Cramer, UC Anschutz
James Costello, UC Anschutz


The lethality of prostate cancer (PCa) is driven by its transition from localized to metastatic disease. In recent years, several tumor profiling studies in PCa patients have revealed the molecular characteristics of both localized and metastatic PCa tumors. These studies have provided an abundance of molecular and clinical information; however, the molecular determinants driving aggressiveness in PCa remain unclear. To address this gap, we have performed a meta-analysis that integrates genetic, transcriptomic, and clinicopathologic data across four independent PCa cohorts. Our approach begins by determining a set of common, clinically-significant alterations observed in primary PCa. This analysis revealed MAP3K7 and USP10 loss-of-function alterations as frequently occurring alterations that are associated with progression-free survival. Next, our approach compares primary and metastatic tumors harboring either USP10 or MAP3K7 alterations. This analysis identified distinct sets of genes associated with aggressiveness in patients harboring either USP10 or MAP3K7 loss. Further inspection of these genes confirmed that some have been previously linked to cancer while other genes remain unstudied. Results from this work may guide future studies of the molecular pathways regulating aggressiveness in USP10-deleted or MAP3K7-deleted PCa. Additionally, the analysis pipeline generated in this work is flexible to accommodate user-defined molecular subtypes. Thus, this work also yields a generalizable tool for identifying novel regulators of aggressiveness within molecularly-defined PCa subtypes.

Rigorous benchmarking of T cell receptor repertoire profiling methods for cancer RNA sequencing

Presenting Author: Serghei Mangul, University of Southern California


The ability to identify and track T cell receptor sequences from patient samples is becoming central to the field of cancer research and immunotherapy. Tracking genetically engineered T cells expressing TCRs that target specific tumor antigens is important to determine the persistence of these cells and quantify tumor responses. The available high-throughput method to profile T cell receptor repertoires is generally referred to as TCR sequencing. However, the available TCR-Seq data is limited compared to RNA sequencing (RNA-Seq). In this paper, we have benchmarked the ability of RNA-Seq-based methods to profile TCR repertoires by examining 19 bulk RNA-Seq samples across four cancer cohorts including both T cell rich and poor tissue types. We have performed a comprehensive evaluation of the existing RNA-Seq-based repertoire profiling methods using targeted TCR-Seq as the gold standard. We also highlighted scenarios under which the RNA-Seq approach is suitable and can provide comparable accuracy to the TCR-Seq approach. Our results show that RNA-Seq-based methods are able to effectively capture the clonotypes and estimate the diversity of TCR repertoires, as well as provide relative frequencies of clonotypes in T cell rich tissues and low diversity repertoires. However, RNA-Seq-based TCR profiling methods have limited power in T cell poor tissues, especially in highly diverse repertoires of T cell poor tissues. The results of our benchmarking provide an additional appealing argument to incorporate RNA-Seq into the immune repertoire screening of cancer patients as it offers broader knowledge into the transcriptomic changes that exceed the limited information provided by TCR-Seq.
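Repertoire diversity, one of the quantities compared above between RNA-Seq- and TCR-Seq-derived repertoires, is commonly summarized with Shannon entropy over clonotype frequencies; a toy sketch (the clonotype labels and counts are invented, and the real benchmark evaluates several diversity measures):

```python
from math import log

# Shannon diversity of a repertoire: high when many clonotypes are evenly
# represented, low when one clone dominates. Toy counts for illustration.

def shannon_diversity(clonotype_counts):
    total = sum(clonotype_counts.values())
    freqs = [c / total for c in clonotype_counts.values() if c > 0]
    return -sum(p * log(p) for p in freqs)

rich = {"TRBV1": 25, "TRBV2": 25, "TRBV3": 25, "TRBV4": 25}   # even repertoire
skewed = {"TRBV1": 97, "TRBV2": 1, "TRBV3": 1, "TRBV4": 1}    # dominant clone
print(shannon_diversity(rich))    # ln(4), the maximum for 4 clonotypes
print(shannon_diversity(skewed))  # much lower
```

T-cell-poor tissues yield few TCR-bearing reads, so RNA-Seq-based estimates of quantities like this one become unreliable precisely in the highly diverse, low-abundance regime the abstract flags.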

How tRNA pools and ramp sequences regulate protein and transcript levels associated with disease

Presenting Author: Justin Miller, University of Kentucky


Heterogeneity and gene-environment interactions limit our ability to treat many complex diseases such as Alzheimer's disease. Recent Alzheimer's disease spatial proteomics data show substantial disruptions in proteins responsible for tRNA transfer and synthetase, which modify the availability of optimal and suboptimal tRNAs, resulting in distinct changes in gene regulation. We modeled how regulatory regions such as ramp sequences (i.e., slowly-translated codons at the 5' end of genes that evenly space ribosomes) are impacted by changes in tRNA levels. Disrupted tRNA pools change where ribosome stalling and collisions occur, decrease protein translation efficiency during protein synthesis, increase mRNA degradation via ribosome-associated protein quality control, and effectively decrease both protein and transcript levels. We show that tRNA pools alone significantly alter cell-specific gene expression without changing the genetic code by impacting codon translational efficiencies and ribosome stalling (odds=1.2072; P=2.64x10^-6), with population-specific effects that we present through web interfaces at https://ramps.byu.edu and https://cubap.byu.edu. We also found that genes associated with Alzheimer's disease are 1.248x more likely to have a ramp sequence in the cerebellum than genes not associated with Alzheimer's disease (P=0.005639). Metabolic gene dysregulation in glycolytic and ketolytic pathways associated with Alzheimer's disease further impacts ramp sequences, suggesting a potential therapeutic target. In summary, changes to tRNA pools alter gene-specific ramp sequences with broad disease implications. Additional modeling of ramp sequence interactions with other regulatory elements will further improve our ability to predict how tRNA pools impact transcript and protein levels a priori.
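A ramp sequence, as defined above, is a run of slowly translated codons at the 5' end of a gene; one simple way to flag it is to compare mean codon translational efficiency in the 5' window against the rest of the coding sequence. This is a hedged sketch, not the authors' method: the `TOY_EFFICIENCY` table stands in for tRNA-availability-derived efficiencies, and the window and ratio thresholds are invented:

```python
# Hypothetical codon -> translational efficiency table (in reality derived
# from cell-specific tRNA pools; values here are toy numbers).
TOY_EFFICIENCY = {"AAA": 0.9, "AAG": 0.4, "GGC": 0.8, "GGA": 0.3}

def mean_efficiency(codons):
    vals = [TOY_EFFICIENCY.get(c, 0.5) for c in codons]  # 0.5 = neutral default
    return sum(vals) / len(vals)

def has_ramp(codons, window=3, ratio=0.8):
    """True if the 5' window translates at < ratio of the downstream mean,
    i.e. the gene starts with slowly-translated codons that space ribosomes."""
    head, tail = codons[:window], codons[window:]
    return mean_efficiency(head) < ratio * mean_efficiency(tail)

cds = ["AAG", "GGA", "AAG", "AAA", "GGC", "AAA", "GGC"]  # slow start, fast body
print(has_ramp(cds))  # -> True
```

Because the efficiency table is tRNA-pool-specific, the same codon sequence can gain or lose a ramp when tRNA pools shift, which is the cell-specific effect the abstract models.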

Finding Long COVID: Topic modeling of post-infection trends in patient EHR profiles

Presenting Author: Shawn ONeil, University of Colorado

Charisse Madlock-Brown, University of Tennessee Health Science Center
Kenneth Wilkins, NIH National Institute of Diabetes and Digestive and Kidney Diseases
Parya Zareie, University of Tennessee Health Science Center
Brenda McGrath, OCHIN, Inc.


Long COVID, or Post-Acute Sequelae of COVID-19 (PASC), is characterized by persistent symptoms and conditions after the acute phase of a COVID-19 infection. Mounting evidence suggests these span a wide array of body systems with significant heterogeneity across patients. We apply an unsupervised document analysis method, topic modeling (Latent Dirichlet Allocation), to identify clusters of co-occurring conditions within 480 million electronic health records of 9 million patients available from the National COVID Cohort Collaborative (N3C). From these data, representing 62 contributing healthcare organizations across the US, the model generates hundreds of detailed clinical ‘topics.’ Using these as guides, we identify a number of new-onset conditions strongly associated with COVID-19 or suspected PASC patients compared to those with no known infection. Finally, we investigate a novel statistical model of patient-topic assignment pre- and post-infection, with covariates to identify PASC-associated topics in query cohorts such as adolescents or females. This method identifies a distinctive Long COVID topic, and several others significant for specific demographic groups. Overall, we demonstrate that topic modeling is especially effective for large-scale EHR datasets, and with longitudinal analyses can inform how patient groups migrate toward, or away from, clinical topics in response to significant events like COVID-19 infection.
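Applying LDA to EHRs means treating each patient as a "document" whose "words" are condition codes; a minimal sketch of that input representation (the condition terms below are invented, and real pipelines feed a matrix like this to an LDA implementation such as scikit-learn's `LatentDirichletAllocation`):

```python
from collections import Counter

# Each patient's record becomes a bag of condition codes -- the document-term
# count matrix LDA expects. Toy patients and conditions for illustration.
patients = {
    "p1": ["fatigue", "dyspnea", "cough", "fatigue"],
    "p2": ["anxiety", "insomnia", "fatigue"],
}

# Vocabulary = the set of all condition codes, in a fixed (sorted) order.
vocab = sorted({code for codes in patients.values() for code in codes})

def count_matrix(patients, vocab):
    """Rows = patients, columns = condition counts."""
    rows = []
    for codes in patients.values():
        counts = Counter(codes)
        rows.append([counts.get(term, 0) for term in vocab])
    return rows

print(vocab)
print(count_matrix(patients, vocab))
```

LDA then explains each row as a mixture of latent topics, so conditions that co-occur across many patients (e.g. a cluster of post-infection complaints) end up grouped in one topic, and each patient's topic weights can be compared pre- and post-infection as in the abstract.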

Language representation models based on metadata identify relevant datasets in Gene Expression Omnibus

Presenting Author: Grace Salmons, Brigham Young University

Aaron Fabelico, Brigham Young University
James Wengler, Brigham Young University
Stephen Piccolo, Brigham Young University


Data-sharing requirements have led to vast availability of genomic datasets in public repositories. Researchers can reuse and combine these datasets to validate findings and address novel hypotheses. However, it is challenging to identify which datasets are relevant to a particular research question due to the large quantity of available datasets and heterogeneity in the way that researchers describe their data and study designs. In this study, we focus specifically on Gene Expression Omnibus (GEO), a repository commonly used in biomedical analyses that contains genomic data from hundreds of thousands of experiments. Notable efforts have been made to manually annotate the data, but these efforts are unable to keep pace with daily dataset submissions. To address this problem, we turn to language representation models. Under the assumption that a researcher has manually identified a few datasets related to a discrete research topic, we seek to identify more datasets that are likely related, using word-embedding representations of the text. This is done by summarizing dataset descriptions using methodologies such as bag of words, skip-gram, or transformers to generate vectors, which we then compare using cosine similarity. With a systematic benchmark comparison among algorithms and model corpora, we evaluate whether it is most effective to use models pre-trained on large, generic corpora; models pre-trained on smaller biomedical corpora; or models trained on (out-of-sample) GEO titles and abstracts. Preliminary results suggest that training on discipline-specific text and using either transformers or skip-gram models yields the best results.
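The retrieval step described above, embedding dataset descriptions and ranking candidates by cosine similarity, can be sketched with the simplest of the mentioned representations, bag of words (skip-gram or transformer vectors would slot into the same `cosine` comparison); the accession labels and descriptions are invented:

```python
from collections import Counter
from math import sqrt

def bow(text):
    """Bag-of-words vector: token -> count."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    shared = set(a) & set(b)
    dot = sum(a[t] * b[t] for t in shared)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

query = bow("breast cancer rna-seq expression")          # researcher's seed dataset
candidates = {
    "GSE-A": bow("rna-seq expression profiling of breast cancer tumors"),
    "GSE-B": bow("yeast heat shock microarray"),
}
ranked = sorted(candidates, key=lambda k: cosine(query, candidates[k]), reverse=True)
print(ranked)  # GSE-A ranks above GSE-B
```

Bag of words only matches exact tokens, which is why the benchmark also evaluates skip-gram and transformer embeddings that can score "tumor" near "cancer" despite no shared words.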

Using knowledge graphs to infer gene expression in plants

Presenting Author: Anne Thessen, University of Colorado Anschutz

Pankaj Jaiswal, Oregon State University
Laurel Cooper, Oregon State University
Justin Elser, Oregon State University
Tyson Lee Swetnam, University of Arizona
Harshad Hegde, Lawrence Berkeley National Laboratory


Climate change is already affecting ecosystems around the world and forcing us to adapt to meet societal needs. The speed with which climate change is progressing necessitates a massive scaling up of the number of species with understood genotype-environment-phenotype (G×E×P) dynamics. An important part of predicting phenotype is understanding the complex gene regulatory networks present in organisms. Previous work has demonstrated that knowledge about one species can be applied to another using ontologically supported knowledge bases that exploit homologous structures and orthologous genes. Such knowledge structures have the potential to enable the needed scaling up through in silico experimentation. Our preliminary analysis uses data from gene expression studies in Arabidopsis thaliana and Populus trichocarpa plants exposed to drought conditions. A graph query (using data from Planteome and the EMBL-EBI Gene Expression Atlas) identified 16 pairs of orthologous genes in these two taxa, some of which show opposite expression during drought, either increasing or decreasing their expression. As expected, analysis of the upstream regulatory region of these genes revealed that orthologous pairs with similar expression behavior had upstream regulatory regions similar to each other, unlike orthologous pairs that changed their expression in opposite ways. This suggests that even though the orthologous pairs share common ancestry and functional roles, predicting expression and phenotype through inference requires careful integration of cis- and trans-regulatory components in the curated and inferred knowledge graph.
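The shape of such a graph query can be illustrated with a small, self-contained sketch: a dictionary stands in for the knowledge graph's gene nodes, a list of edges stands in for ortholog links, and the query selects ortholog pairs whose drought responses differ. All gene identifiers and response directions are invented, not Planteome or Expression Atlas records.

```python
# Toy knowledge-graph fragment: each gene node carries its taxon and its
# drought-response direction; edges list orthologous pairs across taxa.
# Identifiers and directions are invented for illustration.
genes = {
    "AT1G01060": {"taxon": "Arabidopsis thaliana", "drought": "up"},
    "AT5G52310": {"taxon": "Arabidopsis thaliana", "drought": "up"},
    "POTRI_001": {"taxon": "Populus trichocarpa", "drought": "up"},
    "POTRI_002": {"taxon": "Populus trichocarpa", "drought": "down"},
}
ortholog_pairs = [("AT1G01060", "POTRI_001"), ("AT5G52310", "POTRI_002")]

# Query: ortholog pairs with opposite expression under drought, the class of
# pairs whose upstream regulatory regions diverged in the analysis above.
opposite = [
    (a, b)
    for a, b in ortholog_pairs
    if genes[a]["drought"] != genes[b]["drought"]
]
print(opposite)  # [('AT5G52310', 'POTRI_002')]
```

In practice this query runs against an RDF or property-graph store rather than Python dictionaries, but the join (ortholog edge plus mismatched expression annotation) is the same.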

Fast, flexible and safe sequence assembly for RNA and beyond

Presenting Author: Brendan Mumey, Montana State University


Many important problems in bioinformatics (e.g., assembly or multi-assembly) admit multiple solutions, while the final objective is to report only one. A common approach to dealing with this uncertainty is finding safe partial solutions (e.g., contigs) that are common to all solutions. We report new work on finding provably safe RNA assemblies (even for NP-hard versions of the problem), discuss some open problems (e.g., adapting these methods for metagenomics or single-cell data), and invite collaborations.

Visualization and Exploration of Survival Prediction Model Explanations

Presenting Author: Carsten Görg, Colorado School of Public Health

Krithika Suresh, University of Michigan
Arya Amini, City of Hope
Debashis Ghosh, Colorado School of Public Health
Carsten Görg, Colorado School of Public Health


Machine learning methods that can capture complex and non-linear relationships are a useful approach to accurately predict time-to-event outcomes in biomedical research. These methods are often considered “black-box” algorithms that are not interpretable and therefore difficult to trust when making important clinical decisions. Explainable machine learning proposes the use of explainers that can be applied to predictions from any complex model to provide insight into how the model arrived at that prediction. These explainers describe how a patient’s characteristics contribute to their prediction. The application of explainers to survival prediction models can provide explanations for an individual’s overall survival curve as well as survival predictions at particular follow-up times. Here, we present a novel visualization technique, a “picket fence plot”, to display and explore individual-specific explanations. We integrated the picket fence plot into an interactive R/Shiny application that can load predicted survival curves and model explanations and supports comparison between models and patients. We demonstrate an application of our tool to explain prostate cancer-specific survival predictions from a random survival forest built using data from the Prostate, Lung, Colorectal and Ovarian (PLCO) Cancer Screening Trial.
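To make the additive structure of such explanations concrete, here is a minimal sketch with invented numbers (not the output of any particular explainer): per-feature contributions at each follow-up time combine with a baseline survival curve to give the patient-specific prediction that a visualization like the one described would display.

```python
# Hypothetical explainer output for one patient. All values are invented:
# a baseline (population-average) survival curve plus per-feature shifts
# that sum to the patient-specific predicted survival probability.
times = [12, 24, 36]                       # months of follow-up
baseline = {12: 0.90, 24: 0.80, 36: 0.70}  # population-average survival
contributions = {                          # per-feature shifts at each time
    "age": {12: -0.05, 24: -0.08, 36: -0.10},
    "psa": {12: -0.02, 24: -0.04, 36: -0.06},
}

# Patient-specific survival prediction = baseline + sum of contributions.
predicted = {
    t: baseline[t] + sum(f[t] for f in contributions.values()) for t in times
}
print(predicted)  # survival probabilities at 12, 24, and 36 months
```

Plotting each feature's contribution side by side at every time point is exactly what lets a reader see which characteristics drive a patient's curve downward and when.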

Double-strand DNA breaks: Quantitative mapping and computational modeling of mechanisms

Presenting Author: Maga Rowicka, University of Texas Medical Branch

Yingjie Zhu, UTMB
Razie Yousefi, UTMB


DNA double-strand breaks (DSBs) are the most lethal form of DNA damage and drive aging and cancer. Genome-wide methods for mapping DSBs by sequencing (e.g., BLESS [1]) are limited to measuring relative frequencies of breaks between loci. Knowing the absolute DSB frequency per cell, however, is key to understanding their physiological relevance. Therefore, we proposed quantitative DSB sequencing (qDSB-Seq [2]), a method to infer the absolute DSB frequency genome-wide. qDSB-Seq relies on inducing spike-in DSBs with a site-specific endonuclease and estimating the efficiency of the endonuclease cleavage by whole-genome sequencing or qPCR. This efficiency is then used to normalize DSB sequencing data. The qDSB-Seq approach can be applied to any DSB labeling method and allows accurate comparisons of DSB levels across samples. We present the application of qDSB-Seq to precisely quantify the impact of various DSB-causing agents and replication stress-induced DSBs. To infer the population distribution of quantified DSBs, we computationally integrate qDSB-Seq, gel electrophoresis, and fluorescence-activated cell sorting results. We analyze DSBs induced by hydroxyurea-mediated replication stress and reveal that the majority of them originate from a small cell sub-population undergoing massive DNA breakage. We also use computer simulations to reject or validate proposed mechanisms of action of the Mus81 endonuclease in the absence of Mec1. We show that computer simulations are a powerful tool to analyze and infer the genomic and population distribution of DSBs and the mechanisms of DSB creation.

References

[1] Crosetto N*, Mitra A, Silva MJ, Bienko M, Dojer N, Wang Q, Karaca E, Chiarle R, Skrzypczak M, Ginalski K, Pasero P, Rowicka M*, Dikic I*: Nucleotide-resolution DNA
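The normalization idea behind the spike-in approach described above can be sketched in a few lines (all numbers invented): the spike-in site's known per-cell break frequency, i.e., the measured cutting efficiency, calibrates how many sequencing reads correspond to one DSB per cell, which converts read counts at other loci into absolute frequencies.

```python
# Hedged sketch of spike-in normalization, assuming toy read counts and a
# toy cutting efficiency; this is the general calibration idea, not the
# published qDSB-Seq formulas.
cutting_efficiency = 0.80   # fraction of cells cut at the spike-in site
spikein_reads = 40_000      # DSB-labeled reads at the spike-in site
locus_reads = {"locus_A": 500, "locus_B": 2_000}  # hypothetical loci

# Reads observed per one DSB per cell, calibrated by the spike-in.
reads_per_dsb = spikein_reads / cutting_efficiency

# Convert relative read counts into absolute DSB frequencies per cell.
dsbs_per_cell = {
    locus: reads / reads_per_dsb for locus, reads in locus_reads.items()
}
print(dsbs_per_cell)
```

Because the spike-in anchors every sample to an absolute scale, DSB levels become directly comparable across samples and treatments.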

Novel enzyme families among artificial intelligence-based protein structure models

Presenting Author: Krzysztof Pawlowski, University of Texas Southwestern Medical Center


For decades, novel putative enzyme families have been discovered by applying protein sequence bioinformatics tools to detect distant evolutionary relationships.

However, the breakthrough in protein structure prediction achieved in 2020/21 by AlphaFold and RoseTTAFold has brought new opportunities. We will present how these artificial intelligence-based protein structure models can be explored by applying sensitive 3D structure comparisons in search of novel enzyme families. Evolutionary relationships even more distant than those previously accessible can now be identified.

As a proof of concept, we will show how the bioinformatic discovery of an obscure pseudokinase family, SelO, led to the discovery of a unique RNA-capping pathway in coronaviruses, a promising therapeutic target. We will then discuss two novel protein kinase-like families that we discovered in the AlphaFold structure models of the human proteome, as well as novel enzyme examples from pathogenic bacteria and viruses. We will highlight functional hypotheses that can be generated bioinformatically for the novel families.

The unexpected discoveries of novel enzymes underscore the versatility of enzymatic structural folds and the opportunities for novel enzyme/pseudoenzyme research that can lead to novel pharmaceutical interventions.

The Mondo Disease Ontology: A Standardized Disease Terminology of Human and Animal Diseases

Presenting Author: Sabrina Toro, University of Colorado, Anschutz Medical Campus

Kathleen R. Mullen, University of Colorado, Anschutz Medical Campus
Nicolas Matentzoglu, Semanticly
Joseph E. Flack IV, Johns Hopkins University
Harshad Hegde, Lawrence Berkeley National Laboratory
Peter N. Robinson, The Jackson Laboratory for Genomic Medicine
Ada Hamosh, Johns Hopkins University
Melissa Haendel, University of Colorado, Anschutz Medical Campus
Christopher J. Mungall, Lawrence Berkeley National Laboratory
Nicole Vasilevsky, University of Colorado, Anschutz Medical Campus


The wealth of data from research and from human and veterinary health records can be leveraged to support disease diagnosis and treatment discovery. This process requires data standardization, integration, and comparison, which rely on ontologies such as the Mondo Disease Ontology (Mondo) for disease data. Here, we report recent improvements in the representation of non-human animal diseases in Mondo.

Mondo represents a hierarchical classification of over 20,000 diseases in humans and across species, covering a wide range of diseases including cancers, infectious diseases, and Mendelian disorders. Diseases are subclassified under the root class ‘disease or disorder’ as ‘human disease or disorder’, 'infectious disease or post-infectious disorder', or 'non-human animal disease'. Mondo leverages logical axioms to apply semantics (meaning) to its terms. We added axioms to indicate the species affected by a disease and whether a disease affects a single species or several species. Importantly, these semantics connect a non-human disease to its analogous counterpart in humans. In addition, we are improving the coverage of non-human diseases in Mondo by adding new terms represented in veterinary records and animal databases.

Existing computational tools have been successful in supporting human precision medicine. However, they have not leveraged the wealth of information in veterinary records and databases because of its lack of standardization. The improved representation of non-human animal diseases in Mondo will support this standardization and will improve current computational tools for both humans and other animals.

The ability to classify patients based on gene-expression data varies by algorithm and performance metric

Presenting Author: Stephen Piccolo, Brigham Young University

Avery Mecham, Brigham Young University
Nathan Golightly, Brigham Young University
Jérémie Johnson, Brigham Young University
Dustin Miller, Brigham Young University


By classifying patients into subgroups, clinicians can provide more effective care than using a uniform approach for all patients. Such subgroups might include patients with a particular disease subtype, patients with a good (or poor) prognosis, or patients most (or least) likely to respond to a particular therapy. Transcriptomic measurements reflect the downstream effects of genomic and epigenomic variations. However, high-throughput technologies generate thousands of measurements per patient, and complex dependencies exist among genes, so it may be infeasible to classify patients using traditional statistical models. Machine-learning classification algorithms can help with this problem. However, hundreds of classification algorithms exist—and most support diverse hyperparameters—so it is difficult for researchers to know which are optimal for gene-expression biomarkers. We performed a benchmark comparison, applying 52 classification algorithms to 50 gene-expression datasets (143 class variables). We evaluated algorithms that represent diverse machine-learning methodologies and have been implemented in general-purpose, open-source, machine-learning libraries. When available, we combined clinical predictors with gene-expression data. Additionally, we evaluated the effects of performing hyperparameter optimization and feature selection using nested cross-validation. Kernel- and ensemble-based algorithms consistently outperformed other types of classification algorithms; however, even the top-performing algorithms performed poorly in some cases. Hyperparameter optimization and feature selection typically improved predictive performance, and univariate feature-selection algorithms typically outperformed more sophisticated methods. Together, our findings illustrate that algorithm performance varies considerably when other factors are held constant and thus that algorithm selection is a critical step in biomarker studies.
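The nested cross-validation scheme used for hyperparameter optimization can be sketched as follows (with a toy synthetic dataset and a deliberately tiny grid, not the study's 52-algorithm benchmark): an inner GridSearchCV selects hyperparameters, while outer folds score the tuned model on data the inner search never saw, avoiding optimistic bias.

```python
# Sketch of nested cross-validation for hyperparameter tuning, assuming a
# toy dataset in place of real gene-expression matrices.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score

# Stand-in for a gene-expression matrix: 120 patients x 50 features.
X, y = make_classification(n_samples=120, n_features=50, random_state=0)

# Inner loop: select hyperparameters by 3-fold cross-validation.
inner = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [25, 50]},
    cv=3,
)

# Outer loop: score the tuned model on held-out folds.
scores = cross_val_score(inner, X, y, cv=3, scoring="roc_auc")
print(scores.mean())  # unbiased estimate of predictive performance
```

Feature selection, when used, would be placed inside the inner pipeline for the same reason: every data-dependent choice must be made without seeing the outer test fold.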