

ROCKY 2019 | Dec 5 – 7, 2019 | Aspen/Snowmass, CO


Abstract Withdrawn

Enabling the next generation of microbiome science with QIIME 2

Presenting Author: Greg Caporaso, Northern Arizona University


In the past two decades, we have taken large strides towards understanding the role of microbiomes in human and environmental health. We are developing an understanding of how our microbiomes impact carcinogenesis, the efficacy of medical treatment, and diverse disorders such as autism, Alzheimer’s disease, and asthma. Other applications of microbiome science are on the horizon, for example in agriculture, climate science, and forensics. To enable advances in microbiome science, researchers have come to rely on the open-source QIIME microbiome bioinformatics platform.

I will present QIIME 2 (https://qiime2.org), the most recent iteration of the QIIME platform, in the context of ongoing microbiome projects in my lab. QIIME 2 supports the latest microbiome analysis methods, including methods for sequence quality control and taxonomic annotation, longitudinal data analysis, and microbiome multi-omics. A core design goal of QIIME 2 was enabling fully reproducible bioinformatics, which led to its innovative decentralized retrospective data provenance tracking system that integrates workflow details within self-contained data artifacts and interactive visualizations. We additionally aimed to make QIIME 2 accessible to researchers with varied levels of computational sophistication, and users can now access identical functionality through a Python 3 API, a command line interface, and a graphical user interface. We strive to foster a diverse and inclusive community of microbiome researchers and software developers with QIIME 2. I’ll conclude by discussing how you can learn to use QIIME 2 in your own research, or make your software accessible to QIIME 2’s large user community by developing a plugin.

Evolutionary Action is a unifying framework for assessing missense variant structures within and across phyla

Presenting Author: Nicholas Abel, Baylor College of Medicine

Harley Peters, Baylor College of Medicine
Panos Katsonis, Baylor College of Medicine


The quantification and consequences of non-synonymous polymorphism at the exome level in natural populations have yet to be fully defined, limiting our ability to correct pathogenic variants through Precision Medicine. Furthermore, a non-statistical method for defining the mutation load and dynamics of individuals, populations, and species at the exome level has yet to be fully developed. Tools exist for predicting the impact a mutation has on displacing a protein in its fitness landscape, namely Evolutionary Action (EA). To interrogate the spectrum of naturally occurring fitness effects, we applied the EA equation, and its selection constant λ, to population data from natural populations of humans and relevant model organisms. We found species-specific mutation constants at both the individual and species levels. Additionally, we applied machine learning to Drosophila melanogaster populations to identify pathways and gene groups under differential selection globally. We found groups involved in Signal Transduction, Translation, Transcription Factors, Transport and Catabolism, and the Spliceosome to be highly ranked in separating the populations. These data demonstrate the nuanced aspects of mutable areas in replication and translation machinery, as well as protein trafficking and recycling, that arose in Drosophila during adaptation to novel habitats. These findings establish EA as a crucial metric for machine learning and add to the nascent field of population exomics.

A toxicogenomics approach to identify liver and kidney injuries

Presenting Author: Patric Schyman, Biotechnology HPC Software Applications Institute (BHSAI)

Richard Printz, Vanderbilt University School of Medicine
Shanea Estes, Vanderbilt University School of Medicine
Tracy O'Brien, Vanderbilt University School of Medicine
Kelli Boyd, Vanderbilt University School of Medicine
Masakazu Shiota, Vanderbilt University School of Medicine
Anders Wallqvist, U.S. Army Medical Research and Development Command


The immense resources required for, and ethical concerns associated with, animal-based toxicological studies have driven the development of in vitro and in silico approaches. Using gene expression data from liver and kidney tissues of rats exposed to diverse chemical insults, we previously derived a set of gene modules associated with specific organ injuries (i.e., injury modules). Recently, we validated this approach in a study using thioacetamide, a known liver toxicant that promotes fibrosis. Our first aim was to test whether we could use gene expression from rat primary liver and kidney cells (in vitro) exposed to thioacetamide to predict organ injuries in rats in vivo. Second, we sought to establish interspecies concordance between the gene module responses in rat and human primary cells exposed to thioacetamide to predict organ injuries. In all cases, the most activated liver gene modules were those associated with fibrosis. Histological analyses supported these results, demonstrating the potential of gene expression data to identify organ injuries. The in vitro predictions were significantly correlated with the in vivo predictions, with an R2 value of 0.64. Finally, the top-ranked liver injuries in human primary cells correctly identified known pathological changes such as fibrosis. Our proposed approach could potentially be used with in vitro assays to screen large numbers of chemicals and predict liver and kidney injuries in vivo.
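The abstract reports an R2 of 0.64 between the in vitro and in vivo module-based predictions. A minimal sketch of that kind of comparison, using invented injury-module names and activation scores purely for illustration:

```python
# Toy sketch: correlate per-module activation scores from an in vitro assay
# with matched in vivo scores. Module names and values are hypothetical.

def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

in_vitro = {"fibrosis": 2.1, "necrosis": 0.4, "steatosis": 0.9, "cholestasis": 0.2}
in_vivo  = {"fibrosis": 2.5, "necrosis": 0.6, "steatosis": 0.7, "cholestasis": 0.1}

modules = sorted(in_vitro)
r = pearson_r([in_vitro[m] for m in modules], [in_vivo[m] for m in modules])
print(f"R^2 = {r ** 2:.2f}")
```

In this toy data, fibrosis is the most activated module in both settings, mirroring the concordance the study describes.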

Towards Automating Computational Phenotyping: Exploring the Trade-offs of Different Vocabulary Mapping Strategies

Presenting Author: Tiffany Callahan, University of Colorado Denver Anschutz Medical Campus- Computational Bioscience Program

Jordan Wyrwa, School of Medicine, University of Colorado Denver Anschutz Medical Campus
Katy E. Trinkley, University of Colorado Skaggs School of Pharmacy and Pharmaceutical Sciences, University of Colorado Denver Anschutz Medical Campus
Lawrence E. Hunter, University of Colorado Denver Anschutz Medical Campus
Michael G. Kahn, University of Colorado Denver Anschutz Medical Campus
Tellen D. Bennett, University of Colorado Denver Anschutz Medical Campus


The near-universal adoption of electronic health records (EHRs) presents an unprecedented opportunity to fuel population-scale development of research-grade computational phenotypes (CPs). However, several barriers to the development, validation, and implementation of CPs must be overcome before their potential can be fully realized. Further, while repositories of domain expert-derived CPs exist, most of these CPs cannot easily be implemented across different EHR systems because they are tailored to specific source vocabularies. Common data models (CDMs) provide one practical solution, but one could employ different strategies to align them to a CP definition. Understanding the trade-offs of these different vocabulary mapping strategies is a vital next step towards enabling CDM-driven CP automation. We provide a comprehensive examination of how different vocabulary mapping strategies affect the creation of patient cohorts. We tested the effects of using different vocabulary mapping strategies, different types of clinical data, and clinical code sets alone versus complete phenotype definitions. We performed 144 comparisons applied to 7 CPs in two independent datasets. For each comparison, false negative (FN) and false positive (FP) error rates were calculated using the CP author’s cohort as the gold standard. Using only clinical code sets, the FP and FN error rates ranged from 0-88% and 0-25%, respectively. Using full CP definitions, the FP and FN error rates ranged from 0-49% and 0-37%, respectively. Work is underway to add additional CPs, include new domain-expert-verified mapping strategies, and verify the resulting patient cohorts.
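The error-rate comparison described above can be sketched with simple set operations; the patient IDs and the convention for normalizing FP counts (over the candidate cohort) are illustrative assumptions, not the authors' exact implementation:

```python
# Compare a cohort produced by one vocabulary-mapping strategy against the
# CP author's gold-standard cohort. Patient IDs are hypothetical.

def error_rates(gold, candidate):
    fp = len(candidate - gold)  # patients wrongly included
    fn = len(gold - candidate)  # gold-standard patients missed
    fp_rate = fp / len(candidate) if candidate else 0.0
    fn_rate = fn / len(gold) if gold else 0.0
    return fp_rate, fn_rate

gold_cohort = {"p01", "p02", "p03", "p04"}
mapped_cohort = {"p01", "p02", "p05"}  # strategy found p05, missed p03 and p04

fp_rate, fn_rate = error_rates(gold_cohort, mapped_cohort)
print(f"FP rate: {fp_rate:.0%}, FN rate: {fn_rate:.0%}")
```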

Open PBTA: Collaborative analysis of the Pediatric Brain Tumor Atlas

Presenting Author: Joshua Shapiro, Childhood Cancer Data Lab (Alex's Lemonade Stand Foundation)

The OpenPBTA Contributors Consortium


Pediatric brain tumors are the leading cause of cancer-related death in children, but our ability to understand and successfully treat these devastating diseases has been challenged by a lack of large, integrated data sets. To address this limitation, The Children's Brain Tumor Tissue Consortium and the Pacific Pediatric Neuro-Oncology Consortium recently released the Pediatric Brain Tumor Atlas (PBTA) as part of the Gabriella Miller Kids First Data Resource Center. The PBTA dataset includes WGS and RNA-Seq data from nearly 1,000 tumors. Analysis of this dataset is being conducted through the OpenPBTA project, an open science initiative to comprehensively define the molecular landscape of these tumors through shared analyses and collaborative manuscript production. The current state of analyses is continuously available and visible to the public through GitHub at https://bit.ly/openPBTA, and we encourage contributions from community participants through pull requests. To ensure reproducibility, analyses are performed within a Docker container, with continuous integration applying each added analysis to test datasets. The corresponding manuscript is collaboratively written using the Manubot system, also hosted on GitHub and available to the public as it evolves. The OpenPBTA managing scientists include members of the Alex's Lemonade Stand Foundation's Childhood Cancer Data Lab and the Children's Hospital of Philadelphia's Center for Data Driven Discovery in Biomedicine. Through OpenPBTA, we will advance discovery of new mechanisms contributing to the pathogenesis of pediatric brain tumors and promote a standard for open research efforts to accelerate understanding and clinical translation on behalf of patients.

Data Discovery Engine: A web-based toolset for maximizing data discoverability and promoting reusable data-sharing best practices

Presenting Author: Marco Cano, Scripps Research

Xinghua Zhou, Scripps Research
Jiwen Xin, Scripps Research
Chunlei Wu, Scripps Research
Sebastien LeLong, Scripps Research
Matthew B. Carson, Northwestern Medicine
Kristi L. Holmes, Northwestern Medicine
Sara Gonzales, Northwestern Medicine


The biomedical research community has a wealth of data and opportunities for collaboration, yet it is challenging to identify existing datasets that can be leveraged to help power an investigation. The Data Discovery Engine (http://discovery.biothings.io) is a web application that provides a pathway and tooling for data providers to define and expose their data resources in an easy and reusable way, so that others can find them via multiple portals, including Google Dataset Search and other domain-specific portals. The application includes two components:

The Discovery Guide component helps data providers organize their dataset metadata in a structured JSON-LD format, following the schema.org/Dataset schema. This ensures the basic metadata fields can be properly indexed by major search engines like Google. Using the same mechanism, we can extend the JSON-LD metadata to include additional biomedicine-specific fields, which can subsequently be captured by domain-specific discovery portals. In addition to sharing well-formed metadata through our application, the guide also allows data providers to embed a one-liner in their existing dataset page and turn their own website into a structured metadata provider.

The Schema Playground component focuses on enabling developers to build schema.org-compatible schemas to encode their dataset metadata. By building on top of existing schemas, developers can include the additional biomedical fields they need while keeping their metadata interoperable with general-purpose search engines. The playground provides tools to extend, visualize, and host user-defined metadata schemas. These schemas can then be used in the Discovery Guide to cover additional metadata types.
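As a concrete illustration of the structured metadata the Discovery Guide produces, here is a minimal schema.org/Dataset record in JSON-LD; the example dataset itself is invented, and real records would carry more of the schema.org properties:

```python
import json

# A minimal schema.org/Dataset record in JSON-LD. The dataset described
# here is hypothetical and used only to show the shape of the metadata.
dataset = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "name": "Example 16S rRNA survey",
    "description": "Hypothetical amplicon dataset used for illustration.",
    "identifier": "https://example.org/datasets/16s-survey",
    "keywords": ["microbiome", "16S rRNA"],
    "creator": {"@type": "Person", "name": "A. Researcher"},
}

# Embedding a blob like this in a <script type="application/ld+json"> tag
# is what lets search engines such as Google Dataset Search index the page.
print(json.dumps(dataset, indent=2))
```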

Hypergraph Analytics for Computational Virology

Presenting Author: Cliff Joslyn, Pacific Northwest National Laboratory

Emilie Purvine, Pacific Northwest National Laboratory
Brett Jefferson, Pacific Northwest National Laboratory
Brenda Praggastis, Pacific Northwest National Laboratory
Song Feng, Pacific Northwest National Laboratory
Hugh Mitchell, Pacific Northwest National Laboratory
Jason McDermott, Pacific Northwest National Laboratory


Multi-omic data sets capture multiple complex and indirect interactions and networked structures. In virology, for example, measured changes in host protein levels in response to viral infection, or experimentally identified protein complexes and pathways, contain multi-way interactions among collections of proteins as evidenced across multiple experimental conditions and pathways. Graphs are a standard tool to represent such connected interactions, but both mathematically and methodologically, graphs can natively represent only pairwise interactions, for example between pairs of proteins in protein-protein interaction networks. Representing multi-way interactions in graphs requires additional coding, which is of sufficient burden that higher-order interactions above "primary effects" are commonly ignored. Hypergraphs are mathematical structures that generalize graphs precisely to represent such multi-way interactions natively. This talk will explore our recent work on how analogs of traditional network science concepts, like centrality and spectral clustering, can be used in the context of hypergraphs for discovery of central biological pathways, characterization of unknown transcription factors, and comparison of responses to viral infections with different pathogenesis.
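The contrast with pairwise graphs can be made concrete with a tiny sketch; the protein complexes below are invented, and the degree measure shown is only the simplest hypergraph analog of centrality, not the spectral methods the talk covers:

```python
# A hypergraph stored as hyperedge -> set of vertices, with hypothetical
# protein complexes as hyperedges. A simple analog of degree centrality
# counts how many hyperedges each protein participates in.

hyperedges = {
    "complex_A": {"P1", "P2", "P3"},
    "complex_B": {"P2", "P3", "P4", "P5"},
    "pathway_C": {"P1", "P5"},
}

def hyperdegree(h):
    deg = {}
    for members in h.values():
        for v in members:
            deg[v] = deg.get(v, 0) + 1
    return deg

deg = hyperdegree(hyperedges)
# The three-way interaction in complex_A is kept as a single unit, rather
# than being expanded into three pairwise edges as a graph would require.
print(sorted(deg.items()))
```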

Containerized pipeline for the identification of compound heterozygous variants in trios

Presenting Author: Dustin Miller, Brigham Young University


In most children who are diagnosed with pediatric cancer or a structural birth defect, the underlying cause is unknown. It is likely that in many cases, inherited DNA mutations cause such diseases, but researchers have found such evidence for relatively few pediatric diseases; thus there is an urgent need to identify alternative mechanisms of disease development. We hypothesize that compound heterozygous (CH) variants play an important role in disease development; little attention has been given to these variants, in part because whole-genome phasing and annotation require the integration of specialized software and annotation files. Using datasets from the Gabriella Miller Kids First Data Resource Center, we seek to improve identification of CH variants in pediatric patients. We have created a Docker-based computational pipeline that simplifies the process of CH variant identification and have validated our pipeline using idiopathic scoliosis genotype data from 16 trios. Our pipeline encapsulates the various programs used to process (GATK4), phase (Shapeit2), annotate (SnpEff), and explore variants (Gemini). We provide open-source, reproducible scripts that allow other researchers to examine our methodology in detail and apply it to their own data. Encapsulating our code within containers helps control which software versions and system libraries are used and creates a cohesive computational environment. The use of containerization technology in genome analysis is at a relatively early stage; our work helps to set a precedent for using containerized pipelines in pediatric-genome studies. In addition, our work helps to identify the impact of CH variants in pediatric disease.
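The core rule that makes phasing necessary can be sketched in a few lines; the genotype encoding is a common VCF-style convention, and the example variants are hypothetical rather than taken from the pipeline:

```python
# Toy sketch of the compound-heterozygote rule: two heterozygous variants
# in the same gene on opposite (trans) haplotypes, one inherited from each
# parent. Genotypes are phased (haplotype-1 allele, haplotype-2 allele)
# pairs, 0 = reference, 1 = alternate. Variants are hypothetical.

def is_het(gt):
    return gt in {(0, 1), (1, 0)}

def is_compound_het(gt_variant1, gt_variant2):
    # Both variants heterozygous, with the alternate alleles falling on
    # different haplotypes (trans configuration).
    return is_het(gt_variant1) and is_het(gt_variant2) and gt_variant1 != gt_variant2

print(is_compound_het((1, 0), (0, 1)))  # trans: candidate compound het
print(is_compound_het((1, 0), (1, 0)))  # cis: both variants on one haplotype
```

Without phasing (e.g. from SHAPEIT2 in the pipeline above), the trans/cis distinction cannot be made from genotypes alone, which is why CH identification requires this extra machinery.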

The design of an interactive lung map for studying premalignant lesions in the lung over time

Presenting Author: Carsten Görg, Colorado School of Public Health

Wilbur Franklin, University of Colorado School of Medicine
Daniel Merrick, University of Colorado School of Medicine


Lung squamous cell carcinoma and adenocarcinoma are the most common subtypes of lung cancer; both are associated with recognized and unique premalignant lesions, and their pre-cancers have distinct histologic appearances, tissue distribution, and molecular driver events. To facilitate a comprehensive analysis of these precancerous lesions, and ultimately the understanding of mechanisms of progression and identification of risk markers as well as targets for inhibition, we designed an interactive lung map that represents the lesions in the context of anatomic findings, genomic and microenvironmental features, and the patient’s overall clinical history. The map design supports two use cases: (1) pathologists and radiologists, usually analyzing one patient at a time, can explore the spatial and temporal context of multiple identified lesions by comparing lesions at different sites to each other as well as comparing the progression of lesions over time; (2) for secondary research purposes, users can utilize histologic diagnoses, genomic features, and data elements in the clinical history to define patient cohorts and study the heterogeneity and progression of lesions within and across cohorts. Using either histologic or radiographic images, the maps will provide a visual tool to facilitate understanding of anatomic and temporal relationships between lesions. Our lung map design will be implemented as part of the analysis component of a Data Commons for the NCI’s Lung Pre-Cancer Atlas project to facilitate the analysis of samples from both retrospective and prospective studies.

Apollo: an efficient tool to collaboratively refine and attribute genome-wide genomic annotations

Presenting Author: Nathan Dunn, University of California, Berkeley

Nomi Harris, Lawrence Berkeley National Lab
Colin Diesh, University of California, Berkeley
Robert Buels, University of California, Berkeley
Ian Holmes, University of California, Berkeley


Dissemination of the most accurate genome annotations is important to provide an understanding of biological function. An important final step in this process is the manual assessment and refinement of genome annotations. Apollo (https://github.com/GMOD/Apollo/) is a real-time collaborative web application (think Google Docs) used by hundreds of genome annotation projects around the world, ranging from single-species efforts to lineage-specific efforts supporting the annotation of dozens of genomes, as well as several endeavors focused on undergraduate and high school education.

To support efficient curation, Apollo offers drag-and-drop editing, a large suite of automated structural edit operations, the ability to pre-define curator comments and annotation statuses to maintain consistency, attribution of annotation authors, fine-grained user and group permissions, and a visual history of revertible edits. Additionally, Apollo is built atop the dynamic genome web browser JBrowse (http://jbrowse.org/), which is performant, customizable, and has a large registry of plugins (https://gmod.github.io/jbrowse-registry/).

The most recent Apollo enhancements have focused on automated upload of genomes (FASTA) and genomic evidence (GFF3, FASTA, BAM, VCF) for annotations to make them readily available for group annotation, the ability to manually test variant effects when annotating variants, and the annotation and export of gene ontology (GO) terms.

An online end-to-end pipeline for virus phylogeography that leverages Natural Language Processing for finding host locations

Presenting Author: Matthew Scotch, Arizona State University

Arjun Magge, Arizona State University
Davy Weissenbacher, University of Pennsylvania
Karen O'Connor, University of Pennsylvania
Graciela Gonzalez, University of Pennsylvania


To study virus spread and genome evolution, researchers often leverage DNA sequences from the NCBI GenBank database as well as corresponding metadata such as host type (e.g. human or Anas acuta) or location of the infected host (e.g. Denver, CO or Australia). However, as we have shown, location metadata is often missing or incomplete. This can make it difficult for researchers to create robust datasets for analysis. In our prior work, we demonstrated the value of incorporating sampling uncertainty in the geospatial assignment of taxa for virus phylogeography, a field which estimates the evolution and spread of pathogens. Here, sampling uncertainty relates to possible locations of the infected host that can be derived from the GenBank metadata or the full-text article that is linked to the record. To automate this task, we have developed a Natural Language Processing (NLP) pipeline for extracting possible locations from GenBank records as well as journal articles and assigning a probability to each location. The probabilities can then be utilized in the phylogeography software BEAST to produce models of virus spread. In this work, we describe an online portal for an end-to-end system that integrates virus DNA sequences and metadata from GenBank with probabilities from our NLP system (https://zodo.asu.edu/zoophy/ and https://zodo.asu.edu/zodo/). The portal then implements phylogeography models in BEAST and sends the results to the user in the form of trees, geospatial maps, and graphs. We make this portal available to researchers and epidemiologists studying the spread of RNA viruses.
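The step of turning extracted location candidates into probabilities can be sketched very simply; the mentions below are invented, and a frequency-based weighting is only one plausible scheme, not necessarily the one the NLP system uses:

```python
from collections import Counter

# Hypothetical location mentions extracted for one GenBank record from its
# metadata and linked full-text article.
mentions = ["Denver, CO", "Denver, CO", "Colorado", "Australia"]

counts = Counter(mentions)
total = sum(counts.values())
# Normalize mention counts into a probability per candidate location; these
# per-location probabilities are what a tool like BEAST can consume as
# sampling uncertainty for the taxon.
probabilities = {loc: n / total for loc, n in counts.items()}
print(probabilities)
```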

Evolutionary Action as a Tool for Quantifying Differentiation Across the Primate Family Tree

Presenting Author: Harley Peters, Baylor College of Medicine

Nicholas Abel, Baylor College of Medicine
Panagiotis Katsonis, Baylor College of Medicine
Olivier Lichtarge, Baylor College of Medicine


Selection is the driving force behind evolution. Evidence of selection on a given protein is often found by analyzing fixed differences between species orthologs as accumulated missense mutations. To this end, most methodologies make use of the ratio of nonsynonymous mutations to the ‘background rate’ of synonymous mutations, or the dN/dS ratio. A major limitation of this approach is that each nonsynonymous mutation is assumed to have the same selection coefficient; however, not all mutations are created equal. The impact of a missense mutation on a protein’s fitness can be estimated using the Evolutionary Action (EA) equation. Here, we investigate selection pressures acting across the primate lineage using EA to quantify the divergence between the genomes of each species relative to human. We find the distribution of EA scores in each species to be largely skewed toward variants of low impact. In each species, the sum of all EA scores strongly correlates with time of divergence, providing evidence that functional changes within exomic regions of the genome accumulate at a constant rate. We also find that EA correlates well with the current gold standard of dN/dS, adding a dimension of phenotypic impact to existing methodologies. We used EA score cutoffs to identify genes with high-impact changes specific to humans alone, or shared by both humans and Neandertals. GO term analysis of these genes finds significant enrichment for several pathways, including muscle development, the JNK cascade, and skin keratinization. This work demonstrates EA's application as a tool for selection pressure analysis.
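The dN/dS idea that EA extends can be illustrated with a toy substitution count; real dN/dS estimators also normalize by the number of possible nonsynonymous and synonymous sites, which this sketch omits, and the sequences are invented:

```python
# Count nonsynonymous vs synonymous differences between two aligned coding
# sequences. The codon table is abbreviated to cover the toy sequences.

CODON_TABLE = {
    "TTT": "F", "TTC": "F", "CTT": "L", "CTC": "L",
    "GCT": "A", "GCC": "A", "GAT": "D", "GAA": "E",
}

def count_subs(seq1, seq2):
    dn = ds = 0
    for i in range(0, len(seq1), 3):
        c1, c2 = seq1[i:i + 3], seq2[i:i + 3]
        if c1 == c2:
            continue
        if CODON_TABLE[c1] == CODON_TABLE[c2]:
            ds += 1  # synonymous: same amino acid encoded
        else:
            dn += 1  # nonsynonymous: amino acid changes
    return dn, ds

dn, ds = count_subs("TTTCTTGCTGAT", "TTCCTCGCCGAA")
print(dn, ds)
```

EA's refinement is to replace the uniform treatment of the `dn` count with a per-mutation fitness-impact estimate.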

Haplocravat: LD-based calculation built on top of a platform to annotate variants

Presenting Author: Ben Busby, Mountain Genomics/Johns Hopkins University

Kyle Moad, Insilico
Kymberleigh Pagel, Johns Hopkins University
Rachel Karchin, Johns Hopkins University


This talk will present Haplocravat, an LD-based calculation built on top of a platform to annotate variants. OpenCRAVAT is a pip-installable, standalone Python package that can leverage over 50 variant annotators on local or cloud-based datasets in a modular way. Groups in two NCBI-style codeathons, in Seattle and Colorado, decided to extend this framework by using it to annotate haplotypes and haploblocks as well as individual variants. Eventually, this approach can likely be used to annotate paths in graph genomes. Major advantages of this approach include extending chip-based marker SNPs to those deleterious in patient populations, and the potential prediction of susceptibility to phenotypic states such as inflammation and, eventually, diseases such as diabetes.

Characterizing the Regulatory Framework in an Aggressive Breast Cancer Phenotype: A Bayesian Regression-Based Enhancement

Presenting Author: George Acquaah-Mensah, Massachusetts College of Pharmacy & Health Sciences


There are a variety of molecular presentations of breast cancer, some more aggressive than others. Black/African-American (B/AA) breast cancer patients tend to have more aggressive tumor biology compared to White/Caucasian patients. The objective of this study was to identify master transcriptional regulators and other molecular features driving these differences. Breast invasive carcinoma (BrCA) data from The Cancer Genome Atlas were interrogated. Transcriptional regulatory networks were reverse-engineered using the Algorithm for the Reconstruction of Accurate Cellular Networks (ARACNe) and the Bayesian Best Subset Regression (BBSR) method within the ordinary differential equation-based Inferelator framework. Priors used for BBSR were a consensus between ChIP Enrichment Analysis (ChEA) assertions and predictions based on JASPAR database position-specific weight matrices. The ARACNe and BBSR outputs became input regulons for the Virtual Inference of Protein activity by Enriched Regulon analysis (VIPER) method. Using the regulons identified and a signature generated from the gene expression profile, VIPER was applied to identify master regulators of regulons driving the differences in breast cancer between Caucasian and B/AA patients, as well as to infer concomitant aberrant protein activity. Each of these approaches accurately identified master regulators, including TCF3, HES4, PQBP1, TERT, RB1, and ARID4A, as driving the difference in BrCA phenotype between B/AA and White patients. However, the BBSR/Inferelator-based approach identified severalfold more leading-edge genes per master regulator than the ARACNe-based approach. This suggests that VIPER with BBSR/Inferelator recovers more differentially expressed targets per master regulator identified than does VIPER with ARACNe.

Discovering Subclones in Tumors Sequenced at Standard Depths

Presenting Author: Li Liu, Arizona State University


Tumorigenesis is an evolutionary process that typically originates from a single clone and grows into a diverse population of cells (subclones) over time and space. Understanding intratumor heterogeneity is critical to designing personalized treatments and improving clinical outcomes of cancers. Such investigations require accurate delineation of the subclonal composition of a tumor, which to date can only be reliably inferred from deep-sequencing data (>300x depth). To enable accurate subclonal discovery in tumors sequenced at standard depths (30-50x), we have developed a novel computational method, named model-based adaptive grouping of subclones (MAGOS). MAGOS incorporates a unique adaptive error model in the statistical decomposition of mixed populations, which corrects the mean-variance dependency of sequencing data at the subclonal level. We tested MAGOS and two existing methods (SciClone and PyClone) with extensive computer simulations and real-world data. We show that MAGOS consistently outperforms the other two methods when the sequencing depth is lower than 300x, and can achieve a decomposition accuracy of 80% at a sequencing depth as low as 30x. MAGOS also has the highest computational efficiency and is 3-20 times faster than the other two methods. MAGOS supports subclone decomposition with single nucleotide variants and copy number variants from one or more samples of an individual tumor. Application of MAGOS to whole-exome sequencing data of 376 liver cancer samples discovered that the subclonal complexity of a tumor is an independent prognostic factor of patient overall survival.
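To make the decomposition task concrete, here is an illustrative sketch only, not the MAGOS algorithm: variant allele frequencies (VAFs) are grouped into candidate subclones by a naive gap-based merge, whereas MAGOS itself uses a model-based decomposition with an adaptive error model. The VAFs are invented:

```python
# Group sorted VAFs into clusters wherever the gap between neighboring
# values stays below a tolerance. A crude stand-in for model-based
# subclone decomposition, for intuition only.

def group_vafs(vafs, tol=0.05):
    ordered = sorted(vafs)
    clusters, current = [], [ordered[0]]
    for v in ordered[1:]:
        if v - current[-1] <= tol:
            current.append(v)
        else:
            clusters.append(current)
            current = [v]
    clusters.append(current)
    return clusters

# Hypothetical VAFs suggesting a clonal population near 0.5 and two
# subclones near 0.23 and 0.09.
vafs = [0.48, 0.51, 0.50, 0.24, 0.22, 0.23, 0.08, 0.10]
clusters = group_vafs(vafs)
print([round(sum(c) / len(c), 2) for c in clusters])
```

At standard depths, VAF estimates are noisy enough that such naive gaps blur together, which is the problem MAGOS's adaptive error model addresses.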

Pathogenic Synonymous Variants Are More Likely to Affect Codon Usage Biases than Benign Synonymous Variants

Presenting Author: Justin Miller, Brigham Young University

John Kauwe, Brigham Young University


Several codon usage biases within genes directly affect translational efficiency. Ramps of slowly translated codons at initiation, pairing codons that encode the same amino acid within a ribosomal window, and complete aversion to codons lacking cognate tRNA significantly increase translational speed or decrease resource utilization. Although many mechanisms affecting codon usage biases are well-established, the effects of synonymous codon variants on disease remain largely unknown. We identified changes in codon usage dynamics caused by each of the 65,691 synonymous variants in ClinVar, including 14,923 highly supported benign or pathogenic synonymous variants (i.e., variants with multiple submitters or reviewed by an expert panel). We found that pathogenic synonymous variants are 2.4x more likely to affect any codon usage dynamic (i.e., codon aversion, codon pairing, or ramp sequences), 8.5x more likely to affect multiple codon usage dynamics, and 69.5x more likely to affect all three codon usage dynamics than benign synonymous variants. Although significant differences exist in the number of variants affecting most codon usage dynamics, changes affecting only codon aversion or only ramp sequences were nonsignificant, indicating that disrupting only a ramp sequence or only codon aversion may not be sufficient to identify variant pathogenicity. However, a strong indicator of pathogenicity occurs when a synonymous variant affects at least two codon usage biases (p-value=5.23x10^-10) or all three codon usage biases (p-value=1.30x10^-25). We anticipate utilizing these results to improve variant annotation by prioritizing synonymous variants that are most likely to be pathogenic.
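One of the three dynamics named above, codon aversion, can be sketched as a simple check; the gene sequence and variant are hypothetical, and this is only a minimal reading of the concept, not the authors' full pipeline:

```python
# Flag a synonymous variant if it introduces a codon that the rest of the
# gene never uses (i.e., it violates the gene's codon aversion).

def codons(seq):
    return [seq[i:i + 3] for i in range(0, len(seq), 3)]

def breaks_codon_aversion(gene_seq, codon_index, new_codon):
    cods = codons(gene_seq)
    other_codons = cods[:codon_index] + cods[codon_index + 1:]
    return new_codon not in other_codons

gene = "GCTGCTGAT"  # codons: GCT, GCT, GAT (Ala, Ala, Asp)
# GAT -> GAC is synonymous (both encode Asp), but GAC never appears
# elsewhere in this gene, so the variant disrupts its codon aversion.
print(breaks_codon_aversion(gene, 2, "GAC"))
```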

Map and model — moving from observation to prediction in toxicogenomics

Presenting Author: Wibke Busch, Helmholtz Centre for Environmental Research - UFZ

Andreas Schüttler, Helmholtz-Centre for Environmental Research - UFZ


Chemicals induce compound-specific changes in the transcriptome of an organism (toxicogenomic fingerprints). These provide potential insights into the cellular or physiological responses to chemical exposure and adverse effects, which are needed in the assessment of chemical-related hazards and environmental health. In this regard, comparing or connecting different experiments becomes important when interpreting toxicogenomic experiments. Because response dynamics are often not captured, comparability is limited. We developed an experimental design and bioinformatic analysis strategy to infer time- and concentration-resolved toxicogenomic fingerprints. We projected the fingerprints onto a universal coordinate system (the toxicogenomic universe) based on a self-organizing map of toxicogenomic data retrieved from public databases. Genes clustering together in regions of the map are indicative of functional relations due to co-expression under chemical exposure. To allow for quantitative description and extrapolation of the gene expression responses, we developed a time- and concentration-dependent regression model. We applied the analysis strategy in a microarray case study exposing zebrafish embryos to three selected model compounds, including two cyclooxygenase inhibitors. After identifying key responses in the transcriptome, we could compare and characterize their association with developmental, toxicokinetic, and toxicodynamic processes using the parameter estimates for affected gene clusters. The design and analysis pipeline described here could serve as a blueprint for creating comparable toxicogenomic fingerprints of chemicals, integrating, aggregating, and modeling time- and concentration-resolved toxicogenomic data. https://doi.org/10.1093/gigascience/giz057
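The projection step can be illustrated with the core operation of a trained self-organizing map: assigning each expression profile to its best-matching unit (BMU). The 2x2 grid of prototype vectors below stands in for a trained map and is entirely invented:

```python
# Minimal BMU lookup: each profile is mapped to the grid cell whose
# prototype vector lies closest in Euclidean distance, giving every gene a
# coordinate in the shared "toxicogenomic universe".

prototypes = {
    (0, 0): [1.0, 0.0], (0, 1): [0.0, 1.0],
    (1, 0): [-1.0, 0.0], (1, 1): [0.0, -1.0],
}

def bmu(profile):
    def dist2(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))
    return min(prototypes, key=lambda k: dist2(prototypes[k], profile))

print(bmu([0.9, 0.1]))
```

Because every experiment is projected onto the same fixed grid, fingerprints from different studies land in comparable coordinates, which is what enables the cross-experiment comparisons described above.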

Combining the Evolutionary Trace Algorithm and Covariation Metrics Yields Improved Structural Predictions

Presenting Author: Daniel Konecki, Baylor College of Medicine

Benu Atri, Baylor College of Medicine
Jonathan Gallion, Baylor College of Medicine
Angela Wilkins, Baylor College of Medicine
Olivier Lichtarge, Baylor College of Medicine


Understanding protein structure and function is vital to monitoring and controlling the activities of proteins for diagnostic and therapeutic purposes. However, many protein structures remain unsolved, and even for those that are solved, the relationships between residues are often not yet known. While sequence-based covariation metrics exist to address these issues, few directly take phylogenetic information into account. Previously, we paired the Evolutionary Trace (ET) algorithm, which explicitly captures phylogenetic information, with the Mutual Information metric to identify evolutionarily coupled pairs of residues. This algorithm identified residues linked to allosteric signaling in the dopamine D2 receptor, a finding confirmed by experiments. Here we present a new implementation of the ET algorithm that scales efficiently to larger proteins and alignments. We characterize the effects of different sequence distances, phylogenetic trees, and covariation metrics on the ability of ET to predict covariation data, in addition to the information gained by traversing the phylogenetic tree. Characterization on a set of twenty-three proteins, and validation on a set of ~1,000, benchmarked all methods on their ability to predict short-range structural contacts. From this structural validation we show that by limiting computations to specific levels of a phylogenetic tree, the new algorithm often improves accuracy, even when compared with state-of-the-art non-machine-learning covariation methods. Examining highly ranked residue pairs not in close contact reveals enrichment for epistatic interactions. In the future we will apply this method to both structural and functional predictions to guide biological experiments, as well as test its usefulness as a machine learning feature set.
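As a toy illustration of the covariation component (not the authors' implementation, which additionally incorporates ET phylogenetic weighting), mutual information between two alignment columns can be computed as:

```python
from collections import Counter
import math

def column_mi(col_a, col_b):
    """Mutual information (bits) between two alignment columns."""
    n = len(col_a)
    pa, pb = Counter(col_a), Counter(col_b)
    pab = Counter(zip(col_a, col_b))
    mi = 0.0
    for (a, b), c in pab.items():
        # p_ab / (p_a * p_b) simplifies to c*n / (count_a * count_b)
        mi += (c / n) * math.log2((c * n) / (pa[a] * pb[b]))
    return mi
```

Perfectly covarying columns score high; ET-based variants restrict or weight the sequences entering these counts according to their position in the phylogenetic tree.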

Toxicological Mechanistic Inference: Generating mechanistic explanations of adverse outcomes

Presenting Author: Ignacio Tripodi, University of Colorado, Boulder

Tiffany Callahan, University of Colorado, Denver
Jessica Westfall, University of Colorado, Boulder
Nayland Meitzer, University of Colorado, Boulder
Robin Dowell, University of Colorado, Boulder
Lawrence Hunter, University of Colorado, Denver


Government regulators and others concerned about toxic chemicals in the environment hold that a mechanistic, causal explanation of toxicity is strongly preferred over a statistical or machine learning-based prediction by itself. We thus present a mechanistic inference engine that can generate hypotheses about the most likely mechanisms of toxicity, using gene expression time series on human tissue and a semantically integrated knowledge graph. We seek enrichment in our manually curated list of cellular mechanisms of toxicity (e.g. "mitochondria-mediated toxicity by inhibition of the electron transport chain"), represented as causally linked ontology concepts. Our knowledge representation integrates concepts from multiple ontologies (GO, PRO, HPO, ChEBI, PATO, DOID, CL), as well as relevant concepts from Reactome, the Comparative Toxicogenomics Database (CTD), and the AOP Wiki. The expression assays were obtained from the Open TG-GATEs and EU-funded CarcinoGenomics projects, and other relevant public datasets, consisting of human liver, lung, nasal, buccal, bronchial, and kidney cells exposed to a sizeable number of chemicals that elicit different mechanisms of toxicity. Besides predicting the most likely mechanisms at play from the transcriptomics assays, we generate mechanistic narratives that link the most significant genes at each time point to each of the steps in the top-ranked mechanisms. This provides a transparent, putative explanation of each mechanism of toxicity that can help inform a researcher's decision-making and aid further experimental design. Furthermore, we were able to experimentally validate some of our mechanistic predictions for chemicals without a well-known mechanism of toxicity.
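The enrichment step can be sketched with a one-sided hypergeometric test, a common choice for ontology-concept enrichment; the gene counts below are placeholders, not the study's data:

```python
from math import comb

def hypergeom_sf(k, M, n, N):
    """P(X >= k): among N significant genes drawn from M total,
    of which n are annotated to a candidate mechanism, observe >= k hits."""
    return sum(comb(n, i) * comb(M - n, N - i)
               for i in range(k, min(n, N) + 1)) / comb(M, N)

# toy numbers: 10 genes total, 5 annotated to the mechanism, all 5 hits annotated
p = hypergeom_sf(k=5, M=10, n=5, N=5)
```

A small p suggests the significant genes cluster in that mechanism's concepts more than chance would allow.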

Giving credit where credit is due: How to make more meaningful connections between people and their roles, work, and impact

Presenting Author: Nicole Vasilevsky, Oregon Health & Science University

Matthew Brush, Oregon Health & Science University
Anne Thessen, Oregon Health & Science University
Marijane White, Oregon Health & Science University
Karen Gutzman, Northwestern University
Lisa O’Keefe, Northwestern University
Kristi Holmes, Northwestern University
Melissa Haendel, Oregon Health & Science University


Traditional metrics for scholarship typically measure publication records and grants received. However, scholarly contributions extend well beyond these traditional outputs to include activities such as algorithm or tool development, biocuration, and data analysis. To properly give attribution, we need improved mechanisms for recognizing and quantifying these efforts. We aim to develop a computable system to better attribute scholars for the work they do.

The Contributor Role Ontology (CRO) is a structured representation of scholarly roles and contributions. The CRO can be used in combination with research object types to develop infrastructure for understanding the scholarly ecosystem, so that we can better understand, leverage, and credit a diverse workforce and community.

The Contributor Attribution Model (CAM) provides a simple and tightly scoped data model for representing information about contributions made to research-related artifacts - for example, when, why, and how a curator contributed to a gene annotation record. This core model is intended to be used as a modular component of broader data models that support data collection and curation systems, to facilitate reliable and consistent exchange of computable contribution metadata. Additional components of the CAM specification support implementation of the model, data collection, and ontology-based query and analysis of contribution metadata.

Beyond this technical approach, we need to address this challenge from a cultural perspective, and we welcome community involvement. We encourage stakeholders from various communities and roles to get involved, provide perspective, make feature requests, and help shape the future of academic credit and incentives.
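A contribution record under such a model might look like the following sketch; the field names and CURIE forms here are hypothetical illustrations, not the actual CAM specification:

```python
# Hypothetical contribution record, loosely in the spirit of CAM/CRO.
# Every identifier and field name below is invented for illustration.
contribution = {
    "artifact": "example:gene-annotation-record-42",   # invented identifier
    "contributor": {"name": "Jane Doe", "orcid": "0000-0000-0000-0000"},
    "role": "CRO:data-curation",                       # invented CURIE form
    "date": "2019-11-01",
    "description": "Curated evidence codes for the annotation.",
}
```

The point of a tightly scoped model like this is that downstream systems can query contributions by role, date, or artifact without parsing free text.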

Nearest-neighbor Projected-Distance Regression to detect network interactions and control for confounders and multiple testing

Presenting Author: Trang Le, University of Pennsylvania

Bryan Dawkins, University of Tulsa
Brett McKinney, University of Tulsa


Efficient machine learning methods are needed to detect complex interaction-network effects in high-dimensional biomedical data such as GWAS, gene expression, and neuroimaging data. Many current feature selection methods lack the ability to effectively detect interactions while providing statistical significance of features and controlling for potential confounders from demographic data or population structure. To address these challenges, we developed a new feature selection technique called Nearest-neighbor Projected-Distance Regression (NPDR), which uses the generalized linear model to perform regression in the space of pairwise distances between nearest-neighbor observations projected onto predictor dimensions. Motivated by the nearest-neighbor mechanism in Relief-based algorithms, NPDR captures the underlying interaction structure of the data, handles both dichotomous and continuous outcomes, allows combinations of various predictor data types, statistically corrects for covariates, and permits regularization. We use realistic simulations with main effects and network interactions to show that NPDR outperforms standard Relief-based methods and random forest at detecting functional variables while also enabling covariate adjustment and multiple testing correction. Using RNA-Seq data from a study of major depressive disorder (MDD), we show that NPDR with covariate adjustment effectively removes spurious associations due to confounding by sex. We apply NPDR to eQTL data to identify potentially interacting variants that regulate transcripts associated with MDD and demonstrate NPDR's utility for GWAS and continuous outcomes.
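The core idea - regressing neighbor-pair phenotype differences on per-feature projected distances - can be sketched as follows. This is a univariate toy version on simulated data; the real NPDR uses GLMs with covariate adjustment and regularization:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 60, 4
X = rng.normal(size=(n, p))
y = 3.0 * X[:, 0]                     # only feature 0 drives the phenotype

def npdr_scores(X, y, k=5):
    n, p = X.shape
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    np.fill_diagonal(D, np.inf)
    # pairs of each observation with its k nearest neighbors
    pairs = [(i, j) for i in range(n) for j in np.argsort(D[i])[:k]]
    dy = np.array([abs(y[i] - y[j]) for i, j in pairs])
    scores = []
    for f in range(p):
        dx = np.array([abs(X[i, f] - X[j, f]) for i, j in pairs])
        A = np.c_[np.ones_like(dx), dx]
        scores.append(np.linalg.lstsq(A, dy, rcond=None)[0][1])  # slope
    return np.array(scores)

scores = npdr_scores(X, y)  # feature 0 should get the largest slope
```

A feature whose projected distances predict phenotype distances among neighbors gets a large regression coefficient, which is what allows standard GLM significance tests and covariate terms to be attached.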

Deep learning enables in silico chemical-effect prediction

Presenting Author: Jana Schor, Helmholtz Centre for Environmental Research - UFZ


All living species are exposed to a plethora of chemical substances; in addition to food components and endogenous chemicals, these include drugs and pollutants. Many chemicals are associated with the risk of developing severe diseases due to their interaction with biomolecules, such as proteins or nucleic acids. Hundreds of thousands of chemicals are listed in public databases worldwide, and there are similarly many biomolecules encoded in the genomes of species. Advances in high-throughput sequencing technologies in genomics and high-throughput robotic testing in toxicology provide a rich source of complex data (for a fraction of chemicals) that must be integrated on a large scale to enable in silico prediction of disease risks, improved chemical design, and improved risk assessment. We present our deepFPlearn approach, which uses deep learning to associate the molecular structure of chemicals with target genes involved in endocrine disruption - an interference with the production, metabolism, or action of hormones in the body that is associated with the development of many severe diseases and disorders in humans. Trained on ~7,000 chemicals for which an interaction with 6 target genes of interest has been measured, the program reached 92% prediction accuracy. Its application to the 700,000 ToxCast chemicals identified a plethora of additional candidates, and we use explainable AI on our model to identify the (combinations of) substructures responsible for the chemical-gene interaction. With deepFPlearn we demonstrate that transforming the enormous quantity of data in genomics and toxicology into value using deep learning will pave the way towards predictive toxicology.
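As a hedged stand-in for the deep model, the structure-to-effect mapping can be illustrated with logistic regression on simulated binary fingerprints; all data and the choice of "active" bits below are invented:

```python
import numpy as np

rng = np.random.default_rng(0)
n, bits = 400, 64
X = rng.integers(0, 2, size=(n, bits)).astype(float)  # toy binary fingerprints
w_true = np.zeros(bits)
w_true[:4] = 3.0                                      # 4 invented "active" substructure bits
# label = 1 when at least two active bits are set (small noise at the boundary)
y = (X @ w_true - 4.5 + rng.normal(0, 0.5, n) > 0).astype(float)

# logistic regression by gradient descent (linear stand-in for the deep net)
w, b = np.zeros(bits), 0.0
for _ in range(2000):
    prob = 1 / (1 + np.exp(-(X @ w + b)))
    w -= 0.5 * (X.T @ (prob - y) / n)
    b -= 0.5 * (prob - y).mean()

acc = (((1 / (1 + np.exp(-(X @ w + b)))) > 0.5) == y).mean()
```

In deepFPlearn a neural network replaces the linear map, and explainable-AI methods attribute predictions back to fingerprint bits, i.e. to chemical substructures.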

Correlations and curses of dimensionality: optimizing k in nearest-neighbor feature selection

Presenting Author: Bryan Dawkins, University of Tulsa

Trang Le, University of Pennsylvania
Brett McKinney, University of Tulsa


We will present a systematic strategy for optimizing k in nearest-neighbor distance-based feature selection. Our approach includes a novel simulation method for generating high-dimensional bioinformatics data with mixed effects. We show how pairwise feature correlation causes distance distributions to rapidly diverge from Gaussianity, and we give a result for the expected number of nearest neighbors as a function of the average pairwise distance when the Gaussian assumption holds. Divergence from Gaussianity in distances affects neighborhood order and hence the optimal choice of k. We will present our new method for simulating interactions in both continuous and discrete data, such as gene expression and genome-wide association studies, respectively. We will demonstrate how the best choice of k changes with sample size, number of features, and interaction effect size. Our method for choosing the optimal value of k uses nested cross-validation to maximize classification accuracy, with a comparison of different model types. Our analysis provides a detailed summary of the optimal choice of fixed k in nearest-neighbor feature selection, comprehensively showing how to combat the curses of dimensionality that arise in feature selection for high-dimensional bioinformatics data.
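A minimal version of choosing k by cross-validated classification accuracy, on toy Gaussian data in plain NumPy (the authors' approach additionally uses nested cross-validation and distance-distribution theory):

```python
import numpy as np

rng = np.random.default_rng(0)
# two Gaussian classes in 10 dimensions, a toy stand-in for omics features
X = np.vstack([rng.normal(0.0, 1, (50, 10)), rng.normal(1.5, 1, (50, 10))])
y = np.array([0] * 50 + [1] * 50)

def knn_predict(Xtr, ytr, Xte, k):
    D = ((Xte[:, None, :] - Xtr[None, :, :]) ** 2).sum(-1)
    nbrs = ytr[np.argsort(D, axis=1)[:, :k]]        # labels of k nearest
    return (nbrs.mean(axis=1) > 0.5).astype(int)    # majority vote

def cv_accuracy(X, y, k, folds=5):
    idx = rng.permutation(len(y))
    accs = []
    for f in range(folds):
        te = idx[f::folds]
        tr = np.setdiff1d(idx, te)
        accs.append((knn_predict(X[tr], y[tr], X[te], k) == y[te]).mean())
    return float(np.mean(accs))

best_k = max(range(1, 30, 2), key=lambda k: cv_accuracy(X, y, k))
```

Rerunning this with different sample sizes and feature counts shows how the accuracy-maximizing k moves, which is the phenomenon the abstract characterizes theoretically.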

Deep Learning based Multi-view model for deciphering gene regulatory keywords

Presenting Author: Pramod Bharadwaj Chandrashekar, Arizona State University

Li Liu, Arizona State University


Motivation: Biological processes such as cell growth, cell differentiation, development, and aging require a series of steps characterized by gene regulation. Studies have shown that gene regulation is key to various traits and diseases. Gene regulation is affected by many factors, including transcription factors, histone modifications, gene sequences, and mutations. Gene expression profiling can be used in clinical settings to provide prognostic, diagnostic, and therapeutic markers. Deciphering and cataloging these gene regulatory codes and their effects on expression levels is one of the key challenges in precision medicine and genetic research.

Results: In this study, we propose a novel multi-view deep learning tool that uses genetic and epigenetic markers to classify and predict tissue-specific gene expression levels as high or low. We use the same model to untangle and visualize the regulatory codes that contribute to gene regulation. Our system achieved an F1-score of 0.805, outperforming existing methods. Our proposed model can not only identify highly enriched regions but also identify TF binding motifs within them. We believe that our model can help detect various mechanisms affecting gene regulation.

Identifying optimal mouse models for human asthma using a novel modeling approach

Presenting Author: Yihsuan Tsai, University of North Carolina at Chapel Hill

Lauren Donoghue, University of North Carolina
Samir Kelada, University of North Carolina
Joel Parker, University of North Carolina


Asthma is a complex disease caused by both environmental and genetic factors. Many mouse models have been developed to mimic features of human asthma, mainly through exposure to allergens such as house dust mite (HDM). To date, however, no studies have evaluated how well mouse models represent human asthma using gene expression as the criterion. We addressed this data gap using a new modeling approach. Previously, we reported consensus human asthma-associated differentially expressed genes (DEGs) in airway epithelia through a meta-analysis of eight human studies. Here, we used gene expression data from the same eight studies to build prediction models, which we evaluated by their AUC on one third of the data held out for validation. The final model with the highest AUC includes the expression of 52 genes. We then applied the final model to publicly available mouse datasets and some unpublished data. In most studies, we observed good separation between treated and control mouse lung gene expression based on the human-derived prediction model. To compare different mouse models, we used similarity scores estimated as the correlation between the human meta-analysis effect sizes and the effect sizes from each individual mouse study. More than one third of mouse DEGs changed concordantly with human asthma genes, but approximately 20% of mouse DEGs changed discordantly. In summary, we evaluated a set of mouse models of asthma and identified sets of genes that are concordantly or discordantly regulated in mice vs. humans, providing insight into how these models do and do not mimic the human disease condition.
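The similarity score described above amounts to correlating per-gene effect sizes between species; a minimal sketch with invented numbers:

```python
import numpy as np

# invented effect sizes for five shared genes
human_es = np.array([1.2, -0.8, 0.5, 2.0, -1.5])  # human meta-analysis
mouse_es = np.array([0.9, -0.6, 0.1, 1.7, -1.2])  # one mouse study

similarity = np.corrcoef(human_es, mouse_es)[0, 1]           # similarity score
concordant = (np.sign(human_es) == np.sign(mouse_es)).mean() # fraction same direction
```

A high correlation with a high concordant fraction indicates the mouse model shifts the same genes in the same direction as human asthma.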

Bridging the Bioinformatics Knowledge Gap in the Pediatric Cancer Research Community with the Childhood Cancer Data Lab workshops

Presenting Author: Chante Bethell, Childhood Cancer Data Lab (Alex's Lemonade Stand Foundation)

Candace Savonen, Childhood Cancer Data Lab (Alex's Lemonade Stand Foundation)
Deepashree Prasad, Childhood Cancer Data Lab (Alex's Lemonade Stand Foundation)
Casey Greene, Childhood Cancer Data Lab (Alex's Lemonade Stand Foundation)
Jaclyn Taroni, Childhood Cancer Data Lab (Alex's Lemonade Stand Foundation)


Biomedical researchers with limited or no bioinformatics training face hurdles when it comes to utilizing their data. As a result of this knowledge gap, many researchers rely on bioinformaticians to answer biological questions with their genomic data. This collaboration process can be protracted, as demand for bioinformatics expertise often outpaces its availability. The Childhood Cancer Data Lab (CCDL), an initiative of Alex's Lemonade Stand Foundation, has implemented hands-on, three-day bioinformatics workshops to help address the computational skills gap in the pediatric cancer research community. The goal of these workshops is to equip pediatric cancer researchers with the tools they need to independently perform basic analyses on their own experimental data and to gain confidence for continued self-directed learning. Our 2019 workshops included four modules on the tidyverse, RNA-seq, single-cell RNA-seq, and machine learning. Workshop participants run analyses on their own laptops in a versioned Docker container prepared by CCDL staff and leave with their own machines equipped for future analyses. We introduce reproducible research practices, such as literate programming via R notebooks. On the final day of the workshop, researchers apply their newly developed skills to their own data with support from the CCDL's data science team. We anticipate that our workshops will help bridge the bioinformatics knowledge gap and promote communities of practice in the pediatric cancer community.

Hetnet connectivity search provides rapid insights into how two biomedical entities are related

Presenting Author: Daniel Himmelstein, University of Pennsylvania

Michael Zietz, Columbia University
Vincent Rubinetti, University of Pennsylvania
Benjamin Heil, University of Pennsylvania
Kyle Kloster, North Carolina State University
Michael Nagle, Pfizer
Blair Sullivan, University of Utah
Casey Greene, University of Pennsylvania


Hetnets, short for "heterogeneous networks", contain multiple node and relationship types and offer a way to encode biomedical knowledge. For example, Hetionet connects 11 types of nodes - including genes, diseases, drugs, pathways, and anatomical structures - with over 2 million edges of 24 types. Previously, we trained a classifier to repurpose drugs using features extracted from Hetionet. The model identified types of paths between a drug and a disease that occurred more frequently between known treatments.

For many applications, however, a training set of known relationships does not exist, yet researchers would still like to know how two nodes are meaningfully connected. For example, users may want to know not only how metformin is related to breast cancer, but also how the GJA1 gene might be involved in insomnia. Therefore, we developed hetnet connectivity search to propose the most important paths between any two nodes.

The algorithm behind connectivity search identifies types of paths that occur more frequently than would be expected by chance (based on node degree alone). We implemented the method on Hetionet and provide an online interface at https://het.io/search. Several optimizations were required to precompute significant instances of node connectivity at scale. We provide an open-source implementation of these methods in our new Python package, hetmatpy.

To validate the method, we show that it identifies much of the same evidence for specific instances of drug repurposing as the previous supervised approach, but without requiring a training set.
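Counting the paths of a given metapath reduces to multiplying adjacency matrices, the basic operation behind this style of analysis. The toy hetnet below is invented; hetmatpy additionally degree-weights paths to discount high-degree intermediate nodes:

```python
import numpy as np

# toy hetnet: 2 compounds, 3 genes, 2 diseases (invented adjacency)
CbG = np.array([[1, 1, 0],
                [0, 1, 1]])   # Compound-binds-Gene
GaD = np.array([[1, 0],
                [1, 1],
                [0, 1]])      # Gene-associates-Disease

# number of Compound-Gene-Disease paths for every compound/disease pair
CbGaD = CbG @ GaD
```

Connectivity search then compares such observed path counts against a null expectation based on node degree alone, keeping path types that are over-represented.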

Landmark and Cancer-Relevant Gene Selection of RNA Sequencing Data for Survival Analysis

Presenting Author: Carly Clayman, Penn State University - Great Valley

Satish Srinivasan, Penn State University - Great Valley
Raghvinder Sangwan, Penn State University - Great Valley


Dimensionality reduction methods are used to select relevant features, and clustering performs well when applied to data with low effective dimensionality. This study utilized clustering to predict categorical response variables using Illumina HiSeq ribonucleic acid (RNA) sequencing (RNA-Seq) data accessible through the National Cancer Institute Genomic Data Commons. The dimensionality of the dataset was reduced using several methods. One method selected genes for analysis using a set of landmark genes, which have previously been shown to predict expression of the remaining target genes with low error. Genes were also selected by mining cancer-relevant genes from the literature using the DisGeNET package in R. Groups within the dataset were characterized using clinical data to assess whether landmark genes would improve clustering results compared to established cancer-relevant genes from the literature. Cancer-relevant genes and landmark genes with the most significant correlations with overall survival were also assessed with Kaplan-Meier survival analysis. While individual gene expression levels, including TP53, and clinical variables were significant predictors of overall survival when assessed separately, the combination of genes and clinical variables provided the most predictive power for overall survival. Important landmark genes selected by the Boruta random forest algorithm resulted in improved clustering consistent with high vs. low overall survival, compared to important disease-relevant genes. These findings indicate that dimensionality reduction techniques may allow selection of features that are predictive of clinical outcomes for cancer patients. This study has implications for assessing gene-environment interactions across multiple cancer types.
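The survival analysis step rests on the Kaplan-Meier estimator, which can be written compactly; this is a generic sketch, not the study's code:

```python
import numpy as np

def kaplan_meier(times, events):
    """Kaplan-Meier survival estimates at each observed event time.
    events: 1 = death observed, 0 = censored."""
    t = np.asarray(times, float)
    e = np.asarray(events, int)
    s, out = 1.0, []
    for ti in np.unique(t):
        d = int(((t == ti) & (e == 1)).sum())   # deaths at ti
        n = int((t >= ti).sum())                # still at risk at ti
        if d:
            s *= 1.0 - d / n                    # multiply survival fraction
            out.append((ti, s))
    return out
```

Comparing the resulting curves between gene-defined patient groups (e.g. with a log-rank test) is how gene signatures are judged against overall survival.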

The impact of undesired technical variability on large-scale data compendia

Presenting Author: Alexandra Lee, Genomics and Computational Biology Graduate Program, University of Pennsylvania, Philadelphia, PA, USA; Department of Systems Pharmacology and Translational Therapeutics, University of Pennsylvania,

YoSon Park, University of Pennsylvania
Georgia Doing, Geisel School of Medicine, Dartmouth
Deborah Hogan, Geisel School of Medicine, Dartmouth
Casey Greene, University of Pennsylvania, Philadelphia


Motivation: In the last two decades, scientists working in different labs have assayed gene expression from millions of samples. These experiments can be combined into a compendium to gain a systematic understanding of biological processes. However, combining different experiments introduces technical variance, which can distort biological signals in the data and lead to misinterpretation. As the scale of these compendia increases, it becomes crucial to evaluate how integrating multiple experiments affects our ability to detect biological patterns.

Objective: To determine the extent to which underlying biological signals are masked by technical artifacts, via simulation of a large compendium.

Method: We used a generative multi-layer neural network to simulate a compendium of P. aeruginosa gene expression experiments. We performed pairwise comparisons of a simulated compendium containing one experiment versus simulated compendia containing varying numbers of experiments, up to a maximum of 6,000, using multiple assessments.

Results: We found that it was difficult to detect the simulated signals of interest in a compendium containing 2 to 100 experiments unless we applied batch correction. Interestingly, as the number of experiments increased, it became easier to detect the simulated signals without correction. Furthermore, when we applied batch correction, we reduced our power to detect the signal of interest.

Conclusion: When combining a few experiments, it is best to perform batch correction. However, as the number of experiments increases, batch correction becomes unnecessary and indeed harms our ability to extract biological signals.
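The simulation-plus-correction experiment described can be caricatured in a few lines. The additive batch effect and mean-centering "correction" below are invented simplifications; the study uses a neural-network generative model and established correction methods:

```python
import numpy as np

rng = np.random.default_rng(0)
n, g = 40, 30
group = np.repeat([0, 1], n // 2)             # biological signal of interest
batch = np.tile([0, 1], n // 2)               # "experiment" label, orthogonal to group
X = rng.normal(size=(n, g))
X[:, 0] += 2.0 * group                        # gene 0 differs between groups
X += np.array([0.0, 5.0])[batch][:, None]     # additive per-experiment offset

# naive batch correction: center each batch at zero, per gene
Xc = X.copy()
for b in (0, 1):
    Xc[batch == b] -= Xc[batch == b].mean(axis=0)
```

Here the batch offset is removed while the group signal in gene 0 survives; when batch and biology are confounded, the same centering removes the biology too, which is the trade-off the abstract quantifies at scale.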

A pedigree-level examination of Schistosoma japonicum following schistosomiasis reemergence in rural China

Presenting Author: Laura Timm, University of Colorado - Anschutz Medical Campus

David Pollock, University of Colorado - Anschutz Medical Campus


Schistosomiasis is a poorly characterized disease caused by infection with parasitic platyhelminths of the genus Schistosoma. In rural China, recent infection reemergence has been documented despite decades of aggressive control measures. To better understand the relatedness and genomic diversity underlying schistosome persistence, we undertook a large-scale next-generation sequencing effort targeting Schistosoma japonicum, the species responsible for schistosomiasis in the region. Initial analyses of population-level genomics revealed local reservoirs of infection, prompting fine-scale, pedigree-level analyses. However, the poor-quality genome of S. japonicum precludes the use of traditional analysis methods. Our recent efforts have focused on developing a novel analysis pipeline to identify linkage groups in large genomic datasets and simulate pedigree structure. This has resulted in a fast, flexible means of interrogating genomic data from non-model organisms in a number of ways: 1) linkage groups can be constructed from reduced representation libraries (RRLs) in the absence of a quality reference genome; 2) family-level relationships can be inferred between samples without advance knowledge of parental genotypes; and 3) the total number of genotypes contributing to the sample can be calculated. In the context of schistosomiasis, our research provides important insight into the mechanisms of disease persistence that can inform control measures, contributing to eradication. More generally, our work extends the utility of RRL data, increasing its value in studies of non-model species.


Presenting Author: Nicholas Kinney, Edward Via College of Osteopathic Medicine

Parviz Shabane, Virginia Tech
Arichanah Pulenthiran, Edward Via College of Osteopathic Medicine
Robin Varghese, Edward Via College of Osteopathic Medicine
Ramu Anandakrishnan, Edward Via College of Osteopathic Medicine
Harold Garner, Gibbs Cancer Center &amp; Research Institute


Leucine repeat variants in carnosine dipeptidase 1 (CNDP1) are linked to diabetic nephropathy; in particular, individuals homozygous for the five-leucine (Mannheim) allele have a reduced risk of developing diabetic end-stage renal disease. We perform molecular dynamics (MD) simulations and genomic analysis of CNDP1 variants harboring four, five, six, and seven leucine residues, respectively. MD simulations of the protein product show that the N-terminal tail – which includes the leucine repeat – adopts a bound or unbound state. Tails with four or five leucine residues remain in the bound state, anchored to residues 200-220; tails with six or seven leucine residues remain in the unbound state, exposed to the solvent. The unbound state is maintained by a bridge of two hydrogen bonds between residues neighboring the leucine repeat; the bridge is not present in the bound state. Functionally important residues in each state are inferred using betweenness centrality; differences are observed for residues 200-220 and 420-440, which in turn affect the active site. Exome sequencing of 5,000 individuals is used to stratify CNDP1 genotypes by super-population (African, American, East Asian, South Asian, and European) and disease status (type-II diabetes and non-diabetes). Distributions of genotypes differ significantly across super-populations but not disease status.
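Betweenness centrality, used above to flag functionally important residues, can be computed for an unweighted residue-contact graph with Brandes' algorithm; a compact generic sketch (not the authors' code):

```python
from collections import deque

def betweenness(adj):
    """Brandes' betweenness centrality for an unweighted graph given as
    {node: [neighbors]}. For undirected graphs each pair is counted from
    both endpoints; divide by 2 to normalize."""
    bc = {v: 0.0 for v in adj}
    for s in adj:
        stack, pred = [], {v: [] for v in adj}
        sigma = {v: 0 for v in adj}; sigma[s] = 1   # shortest-path counts
        dist = {v: -1 for v in adj}; dist[s] = 0
        q = deque([s])
        while q:                                    # BFS from s
            v = q.popleft(); stack.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1; q.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]; pred[w].append(v)
        delta = {v: 0.0 for v in adj}
        while stack:                                # back-propagate dependencies
            w = stack.pop()
            for v in pred[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    return bc
```

Residues with high betweenness sit on many shortest communication paths through the contact network, a common proxy for allosteric or functional importance.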

Computational Analysis of Kinesin Mutations Implicated in Hereditary Spastic Paraplegias

Presenting Author: Shaolei Teng, Howard University


Hereditary spastic paraplegias (HSPs) are a genetically heterogeneous collection of neurodegenerative disorders. The complex HSP forms are characterized by various neurological features, including progressive spastic weakness, urinary sphincter dysfunction, extrapyramidal signs, and intellectual disability (ID). The kinesin superfamily proteins (KIFs) are microtubule-dependent molecular motors involved in intracellular transport. Kinesins directionally transport membrane vesicles, protein complexes, and mRNAs along neurites, thus playing important roles in neuronal development and function. Recent genetic studies have identified kinesin mutations in patients with HSPs. In this study, we used computational approaches to investigate the disease-causing mutations associated with ID and HSPs in KIF1A and KIF5A. We performed homology modeling to construct the structures of the kinesin-microtubule binding domain and the kinesin-tubulin complex. We applied structure-based energy calculation methods to determine the effects of missense mutations on protein stability and protein-protein interaction. The results revealed that E253K, associated with ID in KIF1A, could change the folding free energy and affect the stability of the kinesin motor domain. We showed that HSP mutations located at the complex interface, such as A255V in KIF1A and R280C in KIF5A, can alter the binding free energy and impact the binding affinity of the kinesin-tubulin complex. Sequence-based bioinformatics predictions suggested that many of the kinesin mutations in motor domains are involved in post-translational modifications, including phosphorylation and acetylation. This computational analysis provides useful information for understanding the roles of kinesin mutations in the development of ID and HSPs.

Enabling structure-based data-driven selection of targets for cancer immunotherapy

Presenting Author: Dinler Antunes, Rice University

Jayvee Abella, Rice University
Sarah Hall-Swan, Rice University
Kyle Jackson, UT MD Anderson Cancer Center
Gregory Lizée, UT MD Anderson Cancer Center
Lydia Kavraki, Rice University


Understanding the molecular triggers of an immune response is essential to fields such as vaccine development and cancer immunotherapy. In this context, a central step is the activation of T-cell lymphocytes by peptides displayed by Human Leukocyte Antigen (HLA) receptors. For instance, a tumor-derived peptide such as one from MAGEA3 can be used in a vaccine, triggering an immune response capable of eliminating melanoma cells. Cancer vaccines and T-cell-based therapies have been tested in several clinical trials, with remarkable results. However, in a few patients the therapeutic T-cells mistakenly recognized unrelated peptides expressed by healthy cells, causing lethal off-target reactions. Molecular mimicry was shown to be the key factor determining these side effects, making structural analyses an important component of designing safer immunotherapies. In addition, structural data on peptide-HLA complexes can be key to better methods for neoantigen discovery and immunogenicity prediction. Unfortunately, there is a stark mismatch between the diversity of HLA molecules (above 17,000 alleles) and the scarcity of experimentally determined peptide-HLA structures (about 700). To address this problem, we implemented a fast method for structural modeling of peptide-HLA complexes (APE-Gen), and we are now conducting large-scale modeling of all peptides deposited in the SysteMHC Atlas (i.e., more than 100,000 experimentally determined peptides from immunopeptidomics projects). Our database of 3D models will be used for a range of data-driven applications, including the prediction of binding affinity, complex stability, and off-target toxicity. In turn, these new methods can be directly applied to enable personalized selection of peptide targets for safer cancer immunotherapy treatments.

Biotherapeutic Protein Immunogenicity Risk Assessment with TCPro

Presenting Author: Osman Yogurtcu, FDA

Joseph McGill, FDA
Zuben Sauna, FDA
Million Tegenge, FDA
Hong Yang, FDA


Most immune responses to biotherapeutic proteins involve the development of anti-drug antibodies (ADAs). New drugs must undergo immunogenicity assessments to identify potential risks at early stages of the drug development process. This immune response is T cell-dependent, and ex vivo assays that monitor T cell proliferation are often used to assess immunogenicity risk. Such assays can be expensive and time-consuming to carry out. Furthermore, T cell proliferation requires presentation of the immunogenic epitope by major histocompatibility complex class II (MHCII) proteins on antigen-presenting cells. The MHC proteins are the most diverse in the human genome. Thus, obtaining cells from subjects that reflect the distribution of the different MHCII proteins in the human population can be challenging. The allelic frequencies of MHCII proteins differ among subpopulations, so understanding the potential immunogenicity risks would require generating datasets for specific subpopulations, involving complex subject recruitment. We developed TCPro, a computational tool that predicts the temporal dynamics of T cell counts in common ex vivo assays for drug immunogenicity. Using TCPro, we can test virtual pools of subjects based on MHCII frequencies and estimate immunogenicity risks for different populations. TCPro also provides rapid and inexpensive initial screens for new biotherapeutics and can be used to determine the potential immunogenicity risk of new sequences introduced while bioengineering proteins. We validated TCPro using an experimental immunogenicity dataset, making predictions on the population-based immunogenicity risk of 15 protein-based biotherapeutics. Immunogenicity rankings generated using TCPro are consistent with the reported clinical experience with these therapeutics.
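TCPro's output is a T cell count trajectory over the course of the assay; as a hedged illustration of that kind of dynamics, logistic proliferation is a generic stand-in here, not TCPro's actual model or parameters:

```python
def simulate_tcells(t0=1e3, r=1.0, carrying_cap=1e6, dt=0.1, days=14):
    """Euler integration of dT/dt = r*T*(1 - T/K): epitope-stimulated
    proliferation saturating at carrying capacity K (toy model)."""
    T, traj = t0, [t0]
    for _ in range(int(days / dt)):
        T += dt * r * T * (1 - T / carrying_cap)
        traj.append(T)
    return traj

traj = simulate_tcells()
```

A tool like TCPro evaluates such trajectories across virtual subject pools drawn according to population MHCII allele frequencies, turning per-subject dynamics into a population-level risk estimate.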

Large-scale phylogenetic analysis reveals different sequence divergence patterns in orthologous and paralogous proteins

Presenting Author: Joseph Ahrens, University of Colorado Denver Anschutz Medical Campus

Jessica Liberles, Florida International University
Ashley Teufel, Santa Fe Institute


Over the course of protein sequence evolution, individual amino acids at each sequence site can be replaced at markedly different rates and, moreover, site-specific replacement rates can vary over time. This shift in site-specific replacement rates over time (referred to as heterotachy) is thought to be partly governed by shifting structural and functional constraints acting on a protein sequence. Here, we present the results of a large-scale phylogenetic analysis of thousands of multiple sequence alignments, each containing genes related by either orthology (i.e., speciation events) or paralogy (gene duplications). In particular, we observe a positive correlation between overall sequence divergence (phylogenetic tree length) and the apparent variation among site rates, as captured by the alpha parameter of the gamma distribution that best fits the site rates. Furthermore, this positive correlation is more pronounced in paralogous sequence alignments than in orthologous alignments. Analysis of simulated sequence data shows that a high degree of heterotachy in the evolutionary process induces higher alpha values (i.e., lower apparent variance in site rates) under standard evolutionary models. We discuss these results in light of the ortholog conjecture: the long-standing notion that function is conserved more in orthologous proteins than in paralogous proteins.
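The relationship between alpha and rate variation can be made concrete with a method-of-moments sketch: for a gamma distribution, alpha = mean^2 / variance, so flatter rate profiles fit higher alpha. This is only an illustration; phylogenetic software typically estimates alpha by maximum likelihood jointly with the tree, and the rate vectors below are invented.

```python
import statistics

def gamma_alpha_moments(site_rates):
    """Method-of-moments estimate of the gamma shape parameter (alpha)
    from per-site replacement rates: alpha = mean^2 / variance.
    Higher alpha corresponds to lower apparent rate variation among sites."""
    mean = statistics.fmean(site_rates)
    var = statistics.variance(site_rates)
    return mean * mean / var

# Toy per-site rates: the second profile varies less among sites,
# so its fitted alpha should come out higher.
heterogeneous = [0.1, 0.2, 0.5, 1.0, 2.0, 4.0]
homogeneous = [0.8, 0.9, 1.0, 1.0, 1.1, 1.2]

alpha_het = gamma_alpha_moments(heterogeneous)
alpha_hom = gamma_alpha_moments(homogeneous)
```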

A new framework for clustering single cell RNA-seq data

Presenting Author: Ziyou Ren, Northwestern University

Martin Gerlach, Northwestern University
Luis Amaral, Northwestern University
Scott Budinger, Northwestern University


Single-cell RNA sequencing (scRNA-seq) technologies promise to enable the quantitative study of biological processes at the single-cell level. Commercial platforms such as 10x Chromium are becoming established in lab practice [Hwang, B. et al., 2018]. More than for other high-throughput technologies, however, ensuring the reproducibility and accuracy of current analysis pipelines remains extremely challenging [Kiselev, V. Y. et al., 2019]. For example, cellular classification algorithms continue to be evaluated using datasets whose cell labels were themselves generated by computational analysis of transcriptomic data [Pouyan, M. B. et al., 2018; Jiang, H., 2019]. Thus, there is a crucial need for a benchmark that provides ground-truth labels in an independent manner. Here, we develop such a benchmark using a dataset where ground-truth labels are generated from surface protein measurements. We demonstrate a substantial decrease in the estimated accuracy of the current gold standard, the Seurat algorithm [Satija, R. et al., 2015; Butler, A. et al., 2018], on data with low information content. To overcome the challenge posed by noisy, uninformative data, we implement an algorithm that optimizes information content through an information theory-based approach. Our approach yields dramatic improvements in accuracy for published and new clustering algorithms.
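To see why information content matters for clustering, consider a toy sketch (not the algorithm from the abstract) that ranks genes by the Shannon entropy of their discretized counts across cells: a gene expressed at a constant level carries no information for separating cell populations, whereas a bimodal gene does. The count matrix below is invented.

```python
import math
from collections import Counter

def shannon_entropy(values):
    """Shannon entropy (in bits) of a discrete expression profile."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def rank_genes_by_information(expression):
    """Rank genes by the entropy of their (discretized) counts across cells.
    Constant genes score zero and sort last."""
    return sorted(expression, key=lambda g: shannon_entropy(expression[g]),
                  reverse=True)

# Toy count matrix: gene -> counts across six cells.
expression = {
    "GeneA": [0, 0, 0, 0, 0, 0],   # uninformative: constant across cells
    "GeneB": [0, 0, 0, 5, 5, 5],   # informative: splits cells into two groups
    "GeneC": [1, 1, 2, 1, 1, 2],   # weakly informative
}
ranked = rank_genes_by_information(expression)
```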

How to annotate what we don’t know

Presenting Author: Mayla Boguslav, University of Colorado Anschutz Medical Campus

Lawrence Hunter, University of Colorado Anschutz Medical Campus
Sonia Leach, University of Colorado Anschutz Medical Campus


There is an extensive natural language processing literature focused on extracting information, i.e., characterizing what we know. We propose to flip this emphasis and instead focus on extracting what we don’t know (known unknowns) in order to specifically characterize the current state of scientific inquiry. These known unknowns are stated in the scientific literature in the form of hypotheses, claims, future opportunities, anomalies, and evidence statements that identify goals for knowledge and future research. We present an annotation schema that can formally represent such statements and be used to train automatic classifiers. The schema includes a taxonomy of types of statements about unknowns, a list of 835 lexical cues (words and phrases that can indicate such statements), and a preprocessing step to ease the work of annotators. Example cues include "not known," "calls into question," "complex," "remarkably," "possible," and "might." We report on our progress as well as the strengths, weaknesses, and difficulties of annotating what we don’t know.
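A first-pass cue matcher of the kind such a schema supports might look like the sketch below. The six cues are the examples from the abstract (the full list has 835), and a match only flags a candidate sentence for annotation or classification; it does not itself assign a statement type.

```python
import re

# The six example cues from the abstract; the full schema lists 835.
CUES = ["not known", "calls into question", "complex",
        "remarkably", "possible", "might"]

CUE_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(c) for c in CUES) + r")\b",
    re.IGNORECASE,
)

def flag_unknowns(sentences):
    """Return (sentence, matched cues) pairs for sentences that may state
    a known unknown. Matches are candidates for annotators or a trained
    classifier, not final labels."""
    flagged = []
    for s in sentences:
        cues = CUE_PATTERN.findall(s)
        if cues:
            flagged.append((s, [c.lower() for c in cues]))
    return flagged

sentences = [
    "The mechanism of resistance is not known.",
    "We measured expression in 20 samples.",
    "This finding calls into question the standard model.",
]
hits = flag_unknowns(sentences)
```

Single-word cues such as "complex" will over-match, which is one reason the abstract pairs lexical cues with a taxonomy and a preprocessing step rather than relying on cues alone.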

Filtering, classification, and selection of new knowledge for model assembly and extension

Presenting Author: Natasa Miskov-Zivanov, University of Pittsburgh


The amount of published material produced by experimental laboratories is increasing at an incredible rate, limiting the effectiveness of manual analysis and highlighting the need for automated methods to gather and extract the vast knowledge present in the literature. Machine reading coupled with automated assembly and analysis of computational models is expected to have a great impact on the understanding and efficient explanation of large complex systems. State-of-the-art machine reading methods extract, in hours, hundreds of thousands of events from the biomedical literature; however, many of the automatically extracted biomolecular interactions are incorrect or not relevant for computational modeling of a system of interest. Automated methods are therefore necessary to filter and select accurate and useful information from the large machine reading output. We have developed several tools to efficiently filter, classify, and select the best candidate interactions for model assembly and extension. Specifically, the tools we have built include: (1) a filtration method that uses existing databases to select, from the extracted biochemical interactions, only those with high confidence; (2) a classification method to score selected interactions with respect to an existing model; and (3) several model extension methods to automatically extend and test models against a set of desirable system properties. Our tools reduce the time required for processing machine reader output by several orders of magnitude and therefore enable very fast iterative model assembly and analysis.
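The database-backed filtration step (tool 1) and the model-relative scoring step (tool 2) can be sketched roughly as follows. The interaction tuples, the reference set, and the 0-2 overlap score are illustrative assumptions, not the representations used by the actual tools.

```python
# Curated reference interactions, e.g. drawn from a pathway database;
# the triples and gene names are invented for illustration.
reference_db = {
    ("TP53", "activates", "CDKN1A"),
    ("AKT1", "inhibits", "FOXO3"),
}

# Elements already present in the model being extended.
model_elements = {"TP53", "CDKN1A", "MDM2"}

# Raw machine-reading output: some correct, some unsupported extractions.
extracted = [
    ("TP53", "activates", "CDKN1A"),   # supported, both elements in model
    ("AKT1", "inhibits", "FOXO3"),     # supported, outside the model
    ("GENE_X", "binds", "GENE_Y"),     # no database support: filtered out
]

def filter_and_score(interactions, db, model):
    """Keep database-supported interactions, then score each by overlap
    with the existing model (2 = both participants in the model,
    1 = one participant, 0 = neither)."""
    kept = [i for i in interactions if i in db]
    return [(i, sum(e in model for e in (i[0], i[2]))) for i in kept]

scored = filter_and_score(extracted, reference_db, model_elements)
```

High-overlap interactions would be the natural first candidates for model extension; zero-overlap ones might still seed extension if they help satisfy a desired system property.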

Pathway-based Single-Cell RNA-Seq Classification, Clustering, and Construction of Gene-Gene Interactions Networks Using Random Forests

Presenting Author: Herbert Pang, University of Hong Kong

Hailun Wang, University of Hong Kong
Pak Sham, University of Hong Kong
Tiejun Tong, Hong Kong Baptist University


Single-cell RNA sequencing (scRNA-Seq), an advanced sequencing technique, enables biomedical researchers to characterize cell-specific gene expression profiles. Although studies have adapted machine learning algorithms to cluster different cell populations in scRNA-Seq data, few existing methods have used machine learning techniques to investigate functional pathways when classifying heterogeneous cell populations. As genes often work interactively at the pathway level, studying cellular heterogeneity through pathways can facilitate the interpretation of the biological functions of different cell populations. In this paper, we propose a pathway-based analytic framework using Random Forests (RF) to identify discriminative functional pathways related to cellular heterogeneity and to cluster cell populations in scRNA-Seq data. We further propose a novel method to construct gene-gene interaction (GGI) networks using RF that illustrates important GGIs in differentiating cell populations. The co-occurrence of genes in different discriminative pathways and the ‘cross-talk’ genes connecting those pathways are also illustrated in our networks. Our novel pathway-based framework clusters cell populations, prioritizes important pathways, highlights GGIs and pivotal genes bridging cross-talking pathways, and groups co-functional genes in networks. These features allow biomedical researchers to better understand the functional heterogeneity of different cell populations and to pinpoint important genes driving heterogeneous cellular functions.
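One ingredient of such networks, identifying 'cross-talk' genes shared by discriminative pathways, can be sketched with toy gene sets. The pathway names and memberships below are invented; the actual framework would take pathways prioritized by RF from a curated database such as KEGG or Reactome.

```python
from itertools import combinations

# Illustrative pathway gene sets (placeholders, not curated annotations).
pathways = {
    "Cell cycle": {"TP53", "CDK1", "CCNB1"},
    "Apoptosis": {"TP53", "CASP3", "BAX"},
    "DNA repair": {"TP53", "BRCA1", "RAD51"},
}

def cross_talk_genes(pathway_sets):
    """Map each gene shared by two or more pathways to the set of pathways
    it connects -- candidate 'cross-talk' genes bridging discriminative
    pathways in a gene-gene interaction network."""
    shared = {}
    for (name1, genes1), (name2, genes2) in combinations(
            pathway_sets.items(), 2):
        for gene in genes1 & genes2:
            shared.setdefault(gene, set()).update({name1, name2})
    return shared

bridges = cross_talk_genes(pathways)
```

In a network view, such bridging genes become the hubs connecting pathway modules, which is what makes them natural candidates for driving heterogeneous cellular functions.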