ROCKY 2018 | Dec 6 – 8, 2018 | Aspen/Snowmass, CO


Proteomics of natural bacterial isolates powered by deep learning-based de novo identification

Presenting Author: Samuel Payne, Brigham Young University

Joon-Yong Lee, Pacific Northwest National Laboratory
Hugh Mitchell, Pacific Northwest National Laboratory
Meagan Burnet, Pacific Northwest National Laboratory
Sarah Jenson, Pacific Northwest National Laboratory
Eric Merkley, Pacific Northwest National Laboratory
Anil Shukla, Pacific Northwest National Laboratory
Ernesto Nakayasu, Pacific Northwest National Laboratory


The fundamental task in proteomic mass spectrometry is identifying peptides from their observed spectra. Where protein sequences are known, standard algorithms use them to narrow the list of peptide candidates. If protein sequences are unknown, a distinct class of algorithms must interpret spectra de novo. Despite decades of effort on algorithmic constructs and machine learning methods, de novo software tools remain inaccurate when used on environmentally diverse samples. Here we train a deep neural network on 5 million spectra from 55 phylogenetically diverse bacteria. This new model outperforms current methods by 25-100%. The diversity of organisms used for training also improves the generality of the model and ensures reliable performance regardless of where the sample comes from. Significantly, it also achieves high accuracy on long peptides, which assist in identifying taxa from samples of unknown origin. With the new tool, called Kaiko, we analyze proteomics data from six natural soil isolates for which a proteome database did not exist. Without any sequence information, we correctly identify the taxonomy of these soil microbes and annotate thousands of peptide spectra.

A platform for community-scale transcriptome-wide association studies

Presenting Author: YoSon Park, Perelman School of Medicine University of Pennsylvania

Casey Greene, Perelman School of Medicine University of Pennsylvania


Transcriptome-wide association studies (TWAS) infer causal relationships between genes, phenotypes and tissues using strategies such as 2-sample Mendelian randomization (MR). Such methods largely eliminate the need to access individual-level data and allow openly sharing data and results. Nonetheless, to our knowledge, there are no public platforms automating quality assurance and continuous integration of TWAS results. Consequently, finding, replicating, and validating causal relationships among millions of similar non-causal relationships remain enormously challenging and are often time- and resource-consuming with many duplicated efforts.

To address this shortcoming, we develop a platform that uses version control software and continuous integration to construct a data resource for the components of TWAS. Community members can contribute additional association studies or methods. We use automated testing to catch formatting mistakes and use pull request functionality to review contributions. We provide a set of tools, available in a Docker container, that perform common downstream analyses using these resources.

Researchers who contribute summary-level datasets substantially increase the impact of their work by making it easy to integrate with complementary datasets. Those who contribute analytical tools will benefit by providing users with numerous off-the-shelf use cases. For this proof-of-concept, we integrate a set of eQTLs provided by the Genotype-Tissue Expression (GTEx) project and a set of curated GWAS summary statistics using 2-sample MR. Our long-term goal for this project is a public, community-driven repository where users contribute new summary-level data, download complementary data, and add new analytical methods, enabling the field to rapidly translate new studies into actionable findings.

Harmonizing and Analyzing Clinical Trials Data in the AHA Precision Medicine Platform

Presenting Author: Carsten Goerg, University of Colorado

Christophe Roeder, University of Colorado
Bethany Doran, University of Colorado
Ann Marie Navar, Duke University
Michael Hinterberg, SomaLogic
John Graybeal, Stanford University
Mark Musen, Stanford University
Jennifer Hall, American Heart Association
David Kao, University of Colorado


Clinical trials have produced many highly valuable datasets, but their potential to support discovery through meta-analysis has not been fully realized. Answering biomedical questions often requires integrating and harmonizing data from multiple trials to increase statistical power. Due to the lack of supporting computational approaches, this challenging and time-consuming integration process is currently performed manually, which leads to scalability and reproducibility issues. We present a framework and prototype implementation within the cloud-based American Heart Association Precision Medicine Platform as a first step towards addressing this problem. Our framework provides (1) a metadata-driven mapping process from study-specific variables to the OMOP common data model, (2) a metadata-driven extraction process for creating analysis matrices of harmonized variables, and (3) an interactive visual interface to define and explore cohorts in harmonized studies. To demonstrate our approach, we present a prototype use case that investigates the relationship between blood pressure and mortality in patients treated for hypertension. Using our framework, we harmonized five publicly available NIH-funded studies (ALLHAT, ACCORD, BARI-2D, AIM-HIGH, and TOPCAT), assessed distributions of blood pressure by study, and used the harmonized data to perform individual patient-data meta-analyses showing the statistical relationship between all-cause mortality and systolic blood pressure, for individual studies as well as the aggregated data. We discuss how the cloud-based implementation supports reproducibility as well as transparent co-development between collaborators over time and space. Future work will entail development of a generalized workflow for acquisition and semantic annotation of new datasets based on the CEDAR metadata management system.

anexVis: visual analytics framework for analysis of RNA expression

Presenting Author: Diem-Trang Tran, University of Utah

Tian Zhang, University of Utah
Ryan Stutsman, University of Utah
Matthew Might, University of Alabama at Birmingham
Umesh Desai, Virginia Commonwealth University
Balagurunathan Kuberan, University of Utah


Although RNA expression data are accumulating at a remarkable speed, gaining insights from them still requires laborious analyses, which hinder many biological and biomedical researchers. We introduce a visual analytics framework that applies several well-known visualization techniques to facilitate understanding of an RNA expression dataset. Our analyses of glycosaminoglycan-related genes have demonstrated the broad application of this tool, anexVis (analysis of RNA expression), to advance the understanding of tissue-specific glycosaminoglycan regulation and functions, and potentially other biological pathways.
The application is publicly accessible at https://anexvis.chpc.utah.edu/, and the source code is deposited on GitHub.

The characterization of different cell types using the Benford law

Presenting Author: Sne Morag, Ariel University

Mali Salmon-Divon, Ariel University


Abstract publication declined

Use of metadata and Bag-of-words to map measurements across observational study data

Presenting Author: Laura Stevens, University of Colorado Anschutz Medical School

Tiffany Callahan, University of Colorado Anschutz Medical School
Sonia Leach, University of Colorado Anschutz Medical School
David Kao, University of Colorado Anschutz Medical School


Data integration is an important strategy for validating research results or increasing sample size in biomedical research. Integration is made challenging by metadata and data differences between studies, and is often done manually by a clinical expert for a highly select set of measurements. Unfortunately, this process is rarely documented, and when it is, the details are not accessible, interoperable, or reusable. We explored the utility of using bag-of-words, an information retrieval model, to map measurements of medical conditions, characteristics, and lifestyle (such as diabetes, age, blood pressure, or alcohol intake) among multiple studies. We hypothesized that applying cosine similarity to features extracted as a bag-of-words model from observational study measurement annotations would yield accurate recommendations for mapping measurements within and between studies and increase scalability compared to manual mapping. Each measurement’s metadata, including descriptions, units, and value-coding, were extracted and then combined for all 105,611 measurements in four cardiovascular-health observational studies. Each measurement’s combined metadata was input to the bag-of-words model. Cosine similarity of word vectors was used to score similarity between measurement pairs. The highest scoring matches for each measurement were compared to 612 unique expert-vetted, manual mappings. Among the vetted measurement pairings, 99.8% had the correct mapping in the top-10 scored matches, 92.5% had the correct mapping in the top-5, and 55.7% had the correct mapping as the top score. This approach provides a scalable method for recommending measurement mappings in observational study data. Next steps include incorporating additional metadata such as measurement type or a synonyms dictionary for concept recognition.
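The scoring step described above can be illustrated with a minimal sketch. It assumes simple whitespace/punctuation tokenization and raw term counts; the measurement names and metadata strings are hypothetical, and the authors' actual preprocessing and weighting are not described in the abstract.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lower-case the text, split on non-alphanumerics, and count tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse token-count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_matches(query, candidates, k=10):
    """Rank candidate measurements by similarity to the query metadata."""
    query_vec = bag_of_words(query)
    scored = [
        (name, cosine_similarity(query_vec, bag_of_words(meta)))
        for name, meta in candidates.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Hypothetical combined metadata (description + units + value coding).
candidates = {
    "SBP_V1": "systolic blood pressure seated mmHg",
    "ALC_WK": "alcohol intake drinks per week",
    "AGE_BL": "age at baseline years",
}
ranked = top_matches("systolic blood pressure mmHg", candidates, k=2)
```

Returning the top k candidates rather than a single best match mirrors the evaluation above, where the correct mapping is far more often within the top 10 than at rank one.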

ExtRamp: A novel algorithm for extracting the ramp sequence based on the tRNA adaptation index or relative codon adaptiveness

Presenting Author: Justin Miller, Brigham Young University

Logan Brase, Brigham Young University
Perry Ridge, Brigham Young University


Different species, genes, and locations within genes use different codons to fine-tune gene expression. Within genes, the ramp sequence assists in ribosome spacing and decreases downstream collisions by incorporating slowly-translated codons at the beginning of a gene. Although previously reported as occurring in some species, no previous attempt at extracting the ramp sequence from specific genes has been published. We present ExtRamp, a software package that quickly extracts ramp sequences from any species using the tRNA adaptation index or relative codon adaptiveness. Different filters facilitate the analysis of codon efficiency and enable researchers to identify genes with a ramp sequence. We validate the existence of a ramp sequence in most species by running ExtRamp on 229,742,339 genes across 23,428 species. We evaluate differences in reported ramp sequences when we use different parameters. Using the strictest ramp sequence cut-off, we show that across most taxonomic groups, ramp sequences are approximately 20-40 codons long and occur in about 10% of gene sequences. We also show that as gene expression increases, more ramp sequences are identified in Drosophila melanogaster. We provide a framework for performing this analysis on other species and present our algorithm at https://github.com/ridgelab/ExtRamp.
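As an illustration of one of the two measures named above, the following sketch computes relative codon adaptiveness in its standard form (each codon's count divided by the count of the most frequent synonymous codon in a reference gene set) and then a sliding-window geometric mean along a gene. The function names, the window size, and the windowing step are assumptions made for illustration; this is not the ExtRamp implementation, which is available at the repository linked above.

```python
import math
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def split_codons(seq):
    """Split a coding sequence into complete codons, dropping any remainder."""
    seq = seq.upper()
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]


def relative_adaptiveness(reference_genes):
    """w = count(codon) / count(most used synonymous codon), per amino acid."""
    counts = Counter(
        codon
        for gene in reference_genes
        for codon in split_codons(gene)
        if codon in CODON_TO_AA
    )
    best = Counter()
    for codon, n in counts.items():
        aa = CODON_TO_AA[codon]
        best[aa] = max(best[aa], n)
    return {codon: n / best[CODON_TO_AA[codon]] for codon, n in counts.items()}


def window_means(gene, weights, window=10):
    """Geometric mean of codon weights in each sliding window (illustrative)."""
    values = [weights[c] for c in split_codons(gene) if c in weights]
    return [
        math.exp(sum(math.log(v) for v in values[i:i + window]) / window)
        for i in range(len(values) - window + 1)
    ]


# Toy reference set: CTG is used three times as often as CTA for leucine.
weights = relative_adaptiveness(["CTGCTGCTGCTA"])
profile = window_means("CTACTACTGCTG", weights, window=2)
```

In this toy profile the windows at the start of the gene score lower than those further downstream, which is the kind of pattern a ramp of slowly translated codons would produce.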

Using machine learning algorithms for classification of medulloblastoma subgroups based on gene expression data

Presenting Author: Sivan Gershanov, Ariel University

Igor Vainer, Ariel University
Helen Toledano, Schneider Children’s Medical Center of Israel
Albert Pinhasov, Ariel University
Nitza Goldenberg-Cohen, Bnai Zion Medical Center
Mali Salmon-Divon, Ariel University


Medulloblastoma (MB), the most common malignant pediatric brain tumor, is divided into four molecular subgroups: WNT, SHH, Group 3 and Group 4. Clinical practice and treatment design are becoming subgroup-specific. Clinicians currently use a 22-gene signature set to diagnose the subgroups. While the WNT and SHH subgroups are well defined, differentiating Group 3 from Group 4 is less obvious.
The aim of this study is to improve the diagnosis process in the clinic by identifying the most efficient list of biomarkers for accurate, fast and cost-effective MB subgroup classification.
We tested five machine learning based algorithms: four well-known methods and one novel method that we developed. We applied them to a public microarray expression data set and compared their performance to that of the known 22-gene set.
Both decision tree and decision rules resulted in a reduced set with similar accuracy to the 22-gene set. Random forest and SVM-SMO methods showed improved performance without applying feature selection. When we implemented our novel SARC (SVM Attributes Ranking and Combinations) classifier, which allows feature selection, the resulting accuracy was the highest and better than that obtained using the 22-gene set as input. The number of attributes in the best-performing combinations ranges from 13 to 32, including known MB-related genes such as WIF1, NPR3 and GRM8, along with LOC440173, a long non-coding RNA.
To summarize, we identified sets of attributes that have the potential to improve MB subgroup diagnosis. Broad clinical use of this classification may accelerate the design of patient-specific targeted therapies and optimize clinical decision-making.

A human disease network from gene-publication relationships on PubMed

Presenting Author: Edward Lau, Stanford University

Cody Thomas, University of Colorado AMC
Maggie Pui Yu Lam, University of Colorado AMC


Human diseases can be represented as a network connecting similar disorders based on their shared phenotypic and molecular characterizations. Network analysis of disease-disease relationships can yield insights into important biological processes and pathogenic pathways. We recently described a method to determine the semantic similarity between a gene or protein and the literature publications related to a disease, by combining PubMed web queries and curated/text-mined annotations of gene-PMID links from NCBI. We devised a weighted co-publication distance metric to score gene-disease co-occurrences in PubMed, where genes with many non-specific publications are down-ranked whereas recent and high-impact publications are given more weight. We show that this method outperforms existing bibliometric analysis in predicting benchmark gene lists of disease terms. Using this method, we have now compiled significant protein lists from over 20,000 human disease or disease phenotype terms from three standardized vocabularies, namely Disease Ontology (DO), Human Phenotype Ontology (HPO), and Pathway Ontology (PWO). We find that disease terms are associated with specific popular protein lists that inform on protein-disease relationships. The PubMed-based disease network recapitulates several known properties from previous "diseasomes" constructed from OMIM or phenotypic similarity data (e.g., Barabási 2007), including the centrality of metabolic diseases and clustering of related diseases around high-level hub terms. We discuss applications for the disease network, including (i) finding commonly associated diseases from a list of differentially expressed genes in an RNA-seq experiment, and (ii) using gene-disease relationships to predict hidden disease genes in a particular disease.

Transcriptome analysis of cancer adjacent normal tissues reveal genes co-expressed with LINE elements

Presenting Author: Mira Han, University of Nevada Las Vegas

Nicky Chung, University of Nevada, Las Vegas
G.M. Jonaid, University of Nevada, Las Vegas
Sophia Quinton, University of Nevada, Las Vegas
Austin Ross, University of Nevada, Las Vegas


Despite the long-held assumption that transposons are normally only expressed in the germ-line, recent evidence shows that transcripts of LINE sequences are frequently found in the somatic cells. However, the extent of variation in LINE transcript levels across different tissues and different individuals, and the genes and pathways that are co-expressed with LINEs are unknown. Here we report the variation in LINE transcript levels across tissues and between individuals observed in the normal tissues collected for The Cancer Genome Atlas. Mitochondrial genes and ribosomal protein genes were enriched among the genes that showed negative correlation with L1HS in transcript level. We hypothesize that oxidative stress is the factor that leads to both repressed mitochondrial transcription and LINE over-expression. KRAB zinc finger proteins (KZFPs) were enriched among the transcripts positively correlated with older LINE families. The correlation between transcripts of individual LINE loci and individual KZFPs showed highly tissue-specific patterns. There was also a significant enrichment of the corresponding KZFP’s binding motif in the sequences of the correlated LINE loci, among KZFP-LINE locus pairs that showed co-expression. These results support the KZFP-LINE interactions previously identified through ChIP-seq, and provide information on the in vivo tissue context of the interaction.

Highly accurate computational characterization of protein kinase family-specific phosphorylation sites

Presenting Author: Chen Li, ETH Zürich

Fuyi Li, Monash University
Tatiana Marquez-Lago, University of Alabama at Birmingham
Andre Leier, University of Alabama at Birmingham
Tatsuya Akutsu, Kyoto University
Anthony Purcell, Monash University
A. Ian Smith, Monash University
Trevor Lithgow, Monash University
Roger Daly, Monash University
Jiangning Song, Monash University
Kuo-Chen Chou, Gordon Life Science Institute


Kinase-regulated phosphorylation is a ubiquitous type of post-translational modification (PTM) in both eukaryotic and prokaryotic cells. Numerous experimental studies have demonstrated that phosphorylation is involved in regulation of a variety of fundamental cellular processes, such as protein-protein interaction, protein degradation, signal transduction and signaling pathways. It also has been revealed that signaling defects caused by aberrant phosphorylation are highly associated with a variety of human diseases, especially cancers. In light of this, a number of computational methods aiming to accurately predict protein kinase family-specific or kinase-specific phosphorylation sites have been established, thereby facilitating phosphoproteomic data analysis. In this work, we present Quokka, a novel bioinformatics tool that allows users to rapidly and accurately identify human kinase family-regulated phosphorylation sites. Quokka was developed by using a variety of sequence scoring functions combined with an optimized logistic regression algorithm. We evaluated Quokka based on well-prepared up-to-date benchmark and independent test datasets, curated from a variety of databases. The independent test demonstrates that Quokka improves the prediction performance compared with state-of-the-art computational tools for phosphorylation prediction. In summary, our tool provides users with high-quality predicted human phosphorylation sites for hypothesis generation and biological validation.

ORCHID: a method for detecting short-range chromatin interactions in high-resolution 5C and Hi-C datasets

Presenting Author: Fei Ji, Massachusetts General Hospital

Sharmistha Kundu, Massachusetts General Hospital
Robert Kingston, Massachusetts General Hospital
Ruslan Sadreyev, Massachusetts General Hospital


The chromatin interaction assays 5C and Hi-C are robust techniques to investigate spatial organization of the genome by capturing interaction frequencies between genomic loci. Although 5C and Hi-C resolution is theoretically restricted only by the length of digested DNA fragments (1-4 kb), intrinsic stochastic noise and high frequencies of background interactions at distances below 100 kb present a significant challenge to understanding short-distance chromatin organization. Here we present the shOrt Range Chromosomal Interaction Detection method (ORCHID) for a comprehensive high-resolution analysis of chromatin interactions in 5C and Hi-C experiments. This method includes background correction of raw interaction frequencies for individual primers or genomic bins, empirical correction for distance dependency of background noise, and detection of areas of significant interactions. When applied to publicly available datasets, ORCHID improves the identification of small (20-200 kb) interaction domains. Unlike larger classic TADs, these chromatin domains are often specific to cell type and functional state of the genomic region. In addition to the expected associations (e.g. with CTCF, cohesin, and mediator complexes), these domains show significant associations with other DNA-binding proteins. An important subtype of these small domains is fully covered and controlled by Polycomb Repressive Complex 1 (PRC1), which mediates transcriptional repression of many key developmental genes. As a separate unexpected example of a potential new mode of regulating chromatin interactions, the binding of RING1B, an essential subunit of the PRC1 complex, is also enriched near domain boundaries at the focused loci that do not necessarily correspond to repressed promoters.
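To make the idea of an empirical distance-dependency correction concrete, here is a generic observed-over-expected sketch: each contact count is divided by the mean count of all bin pairs at the same genomic separation. This is a common normalization for contact matrices and is offered only as an assumption-laden illustration; the abstract does not specify how ORCHID performs its correction, and the function name and toy matrix are hypothetical.

```python
def distance_normalize(matrix):
    """Divide each contact by the mean contact at the same bin separation.

    Assumes a square, symmetric matrix of interaction counts between
    equally sized genomic bins; means are taken over the upper triangle.
    """
    n = len(matrix)
    expected = []
    for d in range(n):
        diagonal = [matrix[i][i + d] for i in range(n - d)]
        expected.append(sum(diagonal) / len(diagonal))
    return [
        [
            matrix[i][j] / expected[abs(i - j)] if expected[abs(i - j)] > 0 else 0.0
            for j in range(n)
        ]
        for i in range(n)
    ]


# Toy 3-bin contact matrix: counts fall off with distance from the diagonal.
contacts = [
    [10, 4, 1],
    [4, 10, 2],
    [1, 2, 10],
]
normalized = distance_normalize(contacts)
```

After normalization, values above 1 mark bin pairs that interact more often than is typical for their separation, which is the signal a short-range interaction detector would then test for significance.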

Using Adversarial Deep Neural Networks to Remove Nonlinear Batch Effects from Expression Data

Presenting Author: Jonathan Dayton, Brigham Young University

Stephen Piccolo, Brigham Young University


Batch effects and other confounding effects can skew research results when working with quantitative molecular data (e.g. RNA-Seq). Most existing batch adjustment methods only take into account linear effects, but modern analysis tools such as machine learning can still identify and be influenced by nonlinear batch effects, even after linear effects have been removed. We introduce Confounded, a method that uses adversarial deep neural networks to identify and remove linear and nonlinear batch effects. Confounded is composed of 1) a discriminator designed to detect confounding effects and 2) an autoencoder designed to replicate the input data while identifying and removing confounding effects in order to fool the discriminator. Once the data have been faithfully reproduced and the confounders have been removed, the adjusted data are output for use in analysis. We have tested Confounded on image vectors with artificial nonlinear batch effects. We show that Confounded removes these batch effects more effectively than ComBat, the most commonly used batch-effect adjustment method, while still retaining most of the true signal as measured by several classification algorithms. We are also validating Confounded with molecular datasets, both with artificial and real batch effects, and publishing our software to enable other scientists to use Confounded in their bioinformatics pipelines. In addition to batch correction, Confounded may also be used for data integration between multiple databases or between different technologies (e.g. microarray and RNA-Seq) or for removing general confounding effects from data.

Med2Mech: Neural-Symbolic Representation of Molecular Mechanisms Underlying Pediatric Disease

Presenting Author: Tiffany Callahan, University of Colorado Denver Anschutz Medical Campus

Adrianne Stefanski, University of Colorado Denver Anschutz Medical Campus
Michael Kahn, University of Colorado Denver Anschutz Medical Campus
Lawrence Hunter, University of Colorado Denver Anschutz Medical Campus


Subphenotyping aims to cluster patients with a particular disease into clinically distinct groups. Genomic and related molecular signatures, such as mRNA expression, have shown great promise for subphenotyping, but such molecular data is not and will not be available for most patients. Here, we present Med2Mech, a method for linking knowledge from generalized molecular data to specific patients’ electronic patient records, and demonstrate its utility for subphenotyping. We hypothesized that integrating knowledge of molecular mechanisms with patient data would improve subphenotype classification. Med2Mech employs neural-symbolic representation learning to generate patient-level embeddings of molecular mechanisms using publicly available biomedical knowledge. Using clinical terminologies and biomedical ontologies, the mechanisms can then be mapped to patient data at scale. Med2Mech was developed and tested using clinical data from a subset of rare disease and other similarly medically complex patients from the Children’s Hospital Colorado. A one-vs-the-rest multiclass classification strategy was used to evaluate the discriminatory ability of embeddings generated using Med2Mech versus only clinical data. Clinical embeddings were built for 2,464 rare disease and 10,000 similarly complex patients using 6,382 conditions, 2,334 medications, and 272 labs. Molecular mechanism embeddings were generated from a knowledge graph (116,158 nodes and 3,593,567 edges) built with 23,776 genes, 3,744 diseases, 49,185 gene ontology concepts, 13,159 phenotypes, 11,124 pathways, and 15,019 drugs. For classification, the molecular mechanism embeddings (precision=0.95, recall=0.94) out-performed all parameterizations of clinical embeddings (precision=0.83, recall=0.82). The Med2Mech representation of patient data improves subphenotype classification relative to standard subphenotyping approaches by incorporating knowledge of molecular mechanisms.

A Machine Learning Classifier for Assigning Individual Patients with Systemic Sclerosis to Intrinsic Molecular Subsets

Presenting Author: Jennifer Franks, Geisel School of Medicine at Dartmouth

Viktor Martyanov, Geisel School of Medicine at Dartmouth
Guoshuai Cai, Arnold School of Public Health at University of South Carolina
Yue Wang, Geisel School of Medicine at Dartmouth
Tammara Wood, Geisel School of Medicine at Dartmouth
Michael Whitfield, Geisel School of Medicine at Dartmouth


High-throughput gene expression profiling of skin biopsies from patients with systemic sclerosis (SSc) has identified four “intrinsic” gene expression subsets (inflammatory, fibroproliferative, normal-like, limited) conserved across multiple cohorts and tissues. In order to characterize patients in clinical trials or for diagnostic purposes, supervised methods that can classify single samples are required.

Three gene expression cohorts were curated and merged for the training dataset. Supervised machine learning algorithms were trained using repeated three-fold cross-validation. We performed external validation using three additional datasets, including one generated by an independent laboratory on a different microarray platform. WGCNA and g:Profiler were used to identify and functionally characterize gene modules associated with the intrinsic subsets.

The final model, a multinomial elastic net, performed with average classification accuracy of 88.1%. All intrinsic subsets were classified with high sensitivity and specificity, particularly inflammatory (83.3%, 95.8%) and fibroproliferative (89.7%, 94.1%). In external validation, the classifier achieved an average accuracy of 85.4%. In a re-analysis of GSE58095, we identified subgroups of patients that represented the canonical inflammatory, fibroproliferative, and normal-like subsets. Inflammatory gene modules showed upregulated biological processes including inflammatory response, lymphocyte activation, and stress response. Similarly, fibroproliferative gene modules were enriched in cell cycle processes.

We developed an accurate, reliable classifier for SSc intrinsic subsets, trained and tested on 427 skin biopsies from 213 individuals. Our method provides a robust approach for assigning samples to intrinsic gene expression subsets and can be used to aid clinical decision-making and interpretation for SSc patients and in clinical trials.

A Data Quality Testing Tool for Cross-institutional OMOP Electronic Health Record Data Repositories

Presenting Author: Timothy Bergquist, University of Washington

Hossein Estiri, Harvard University
Justin Prosser, University of Washington
Adam Wilcox, University of Washington
Kari Stephens, University of Washington


Data quality testing is critical to cross-institutional data sharing, a key component of health innovations produced through translational research. Harmonizing electronic health record (EHR) data is a resource-intensive strategy used in many data sharing efforts, involving extraction, translation, and loading activities that can perpetuate and add to pre-existing data quality issues. Yet, we lack standards and tools for testing the quality of datasets produced through these complex harmonization processes. Given its large-scale adoption, the Observational Medical Outcomes Partnership (OMOP) Common Data Model (CDM) is a front-running candidate for establishing a standard set of executable data quality tests to support cross-institutional data sharing. We adapted a prototype tool, DQe-c, to OMOP CDM V5 with scalability across database platforms. Specifically, the tool examines completeness in all data tables and columns, calculates the percentage of patients who have key clinical variables present (e.g., blood pressure, height), detects the presence of orphan keys (i.e., foreign keys that are not present in their reference table), reports on the size of the databases, and assesses conformance to the standard. All test results are produced as data visualizations in a single HTML dashboard. This prototype is being explored for use in multiple data sharing pilot projects supported by the Clinical Translational Science Award (CTSA) Program Data to Health (CD2H) Coordinating Center, with an aim towards configuring a robust set of completeness, conformance, and plausibility tests that confirm OMOP CDM V5 datasets are fit for cross-institutional data sharing.
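Two of the checks listed above, orphan-key detection and column completeness, can be sketched against a toy database. The example below uses Python's built-in sqlite3 module and a drastically simplified two-table schema that borrows OMOP table and column names; the helper functions are hypothetical and do not represent the DQe-c implementation.

```python
import sqlite3


def find_orphan_keys(conn, child_table, fk_column, parent_table, pk_column):
    """Return distinct foreign-key values with no matching row in the parent.

    Table and column names are interpolated directly, so they must come
    from a trusted schema definition, never from user input.
    """
    query = (
        f"SELECT DISTINCT c.{fk_column} FROM {child_table} AS c "
        f"LEFT JOIN {parent_table} AS p ON c.{fk_column} = p.{pk_column} "
        f"WHERE c.{fk_column} IS NOT NULL AND p.{pk_column} IS NULL "
        f"ORDER BY c.{fk_column}"
    )
    return [row[0] for row in conn.execute(query)]


def column_completeness(conn, table, column):
    """Fraction of rows in which the column is not NULL."""
    total, present = conn.execute(
        f"SELECT COUNT(*), COUNT({column}) FROM {table}"
    ).fetchone()
    return present / total if total else 0.0


# Toy data: measurement 12 refers to a person_id that does not exist.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE person (person_id INTEGER PRIMARY KEY, year_of_birth INTEGER);
    CREATE TABLE measurement (
        measurement_id INTEGER PRIMARY KEY,
        person_id INTEGER,
        value_as_number REAL
    );
    INSERT INTO person VALUES (1, 1950), (2, NULL);
    INSERT INTO measurement VALUES (10, 1, 120.0), (11, 2, NULL), (12, 3, 135.0);
    """
)
orphans = find_orphan_keys(conn, "measurement", "person_id", "person", "person_id")
completeness = column_completeness(conn, "person", "year_of_birth")
```

Because both checks are plain SQL, the same queries could in principle be run on other database platforms, which is in the spirit of the cross-platform scalability mentioned above.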

Comparative Analysis of Germline Microsatellites in the 1,000 Genomes Project

Presenting Author: Nicholas Kinney, Virginia College of Osteopathic Medicine

Kyle Titus-Glover, Virginia Tech
Jonathan Wren, Oklahoma Medical Research Foundation
Robin Varghese, Edward Via College of Osteopathic Medicine
Pawel Michalak, Edward Via College of Osteopathic Medicine
Han Liao, Virginia Tech
Ramu Anandakrishnan, Edward Via College of Osteopathic Medicine
Arichanah Pulenthiran, Edward Via College of Osteopathic Medicine
Lin Kang, Edward Via College of Osteopathic Medicine
Harold Garner, Edward Via College of Osteopathic Medicine


Microsatellites are regions of DNA characterized by short – one to six base pair – motifs repeated in tandem to form an array. Over 600,000 unique microsatellites exist in the human genome embedded in gene introns, gene exons, and regulatory regions. Indeed they are well established as an important source of genetic variation. A number of databases provide searchable interfaces to microsatellites within the human reference genome; however, none provide data on actual polymorphism rates among and within human populations. We introduce the Comparative Analysis of Germline Microsatellites (CAGm) Database. The database is designed to assist with future studies of germline microsatellites and enhance our understanding of human genetic variation. Samples can be easily grouped by population, ethnicity, and gender. Microsatellites can be searched by gene, functional element, and location. Users can query genotypes, view multiple sequence alignments, and easily download data for further analysis. The database has a wide range of additional capabilities. Database content is fully described with examples and future directions are discussed. The database is freely available at http://www.cagmdb.org/.
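The definition in the first sentence (a motif of one to six base pairs repeated in tandem) maps directly onto a regular expression with a back-reference. The sketch below finds perfect repeats only; the function name and the minimum-repeat threshold are assumptions, and it says nothing about how the CAGm database genotypes microsatellites from sequencing data.

```python
import re


def find_microsatellites(sequence, min_repeats=3):
    """Find perfect tandem repeats of a 1-6 bp motif.

    Returns (start, motif, copies) tuples. The motif group is lazy so the
    shortest repeating unit is reported (e.g. "CA", not "CACA").
    """
    pattern = re.compile(r"([ACGT]{1,6}?)\1{%d,}" % (min_repeats - 1))
    hits = []
    for match in pattern.finditer(sequence.upper()):
        motif = match.group(1)
        copies = len(match.group(0)) // len(motif)
        hits.append((match.start(), motif, copies))
    return hits


# A (CA)5 dinucleotide array flanked by non-repetitive sequence.
hits = find_microsatellites("GGCACACACACATTG")
```

Comparing the copy number reported at the same locus across samples is, conceptually, how a length polymorphism would show up, although real genotyping must also handle imperfect repeats and sequencing error.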

Unbiased Pathway Detection Expands Cancer Pathways

Presenting Author: Chih-Hsu Lin, Baylor College of Medicine

Stephen Wilson, Baylor College of Medicine
Teng-Kuei Hsu, Baylor College of Medicine
Minh Pham, Baylor College of Medicine
Olivier Lichtarge, Baylor College of Medicine


Pathways are functional gene groups and represent how signals are transmitted/received and which genes/proteins interact. Conventionally, domain experts annotate pathways based on the literature. Thus, the unbiased detection of functional gene groups solely based on the gene-gene interaction network structure may provide novel insights. Here, we hypothesized that gene members in a functional gene group interact within the group more than outside the group. We developed a Recursive Louvain algorithm to detect communities (i.e., clustered gene groups) on a human protein-protein interaction network. Of these communities, 81.9% overlapped significantly with known pathways compared to random controls, whereas 622 communities did not and may be novel gene groups. In addition, variants of genes overlapping with communities are more likely to be pathogenic in ClinVar and to have high quantified evolutionary impact (p << 0.0001). As a case study in head and neck cancer, we found that 38 communities are under significant mutational selection (q < 0.1). When patients were integratively clustered on mutation, copy number variation, RNA and miRNA expression, one community separated patient survival (q = 0.008). Furthermore, we designed neural network (NN) model architectures based on communities to predict human papillomavirus status. The results showed that NNs based on communities outperformed NNs based on random gene groups and performed similarly to, if not better than, fully connected NNs. In conclusion, these data suggest that the communities recover known functional and disease pathways, could be used as cancer survival predictors, and could capture underlying gene compositions of biological phenotypes to make NN models more interpretable. This study will aid the understanding of cancer pathways and provide biomarkers for cancer patients.

A systems biology approach to define essential kinases in small cell lung cancer

Presenting Author: Jihye Kim, University of Colorado Denver Anschutz Medical Campus

Daniel Foster, National Jewish Health
Rangnath Mishra, National Jewish Health
James Finigan, National Jewish Health
Jeffrey Kern, National Jewish Health
Aik Choon Tan, University of Colorado Denver


Small cell lung cancer (SCLC) is a deadly cancer with a five-year survival rate of < 7%, and it will claim approximately 30,000 lives this year. Treatment of SCLC using the chemotherapy combination of cisplatin and etoposide with radiation therapy has not changed in almost 30 years. Therefore, novel therapies are needed for this disease. Building on the role of kinases and their regulation of cell growth and survival, we hypothesized that kinases regulate cell survival pathways in SCLC (essential kinases) and that they may be effective targets as novel monotherapies, or act synergistically with standard chemotherapy, and improve therapeutic outcome. To test this hypothesis, we employed a systems biology approach to identify essential kinases in SCLC. We performed in vivo kinome-wide screening using an shRNA library targeting human kinases on seven chemonaïve SCLC patient-derived xenografts (PDX). We developed a suite of bioinformatics tools to deconvolute the kinome screening data, and identified 23 essential kinases found in two or more PDX models. The top essential kinases were RET, MTOR and ATM. We connected these kinases to our drug database to identify specific inhibitors as potential therapies and performed in vitro and in vivo validation of their efficacy. Notably, monotherapy with a small molecule inhibitor targeting mTOR significantly reduced SCLC tumor growth in vivo, demonstrating mTOR's essential kinase function. In addition, mTOR inhibition synergized with standard chemotherapy to significantly augment tumor responses in SCLC PDX models. These results warrant further investigation of mTOR inhibitors combined with chemotherapy as a novel treatment for SCLC.

Clustering of Protein Conformations using Parallelized Dimensionality Reduction

Presenting Author: Arpita Joshi, University of Massachusetts, Boston

Nurit Haspel, Umass Boston


Analyzing the conformational pathways that a macromolecule undergoes is imperative to understanding its function and dynamics. We present a combination of techniques to sample the conformational landscape of proteins better and faster. Datasets representing these landscapes of protein folding and binding are complex and high-dimensional. Therefore, there is a need for dimensionality reduction methods that best preserve the variance in the data and facilitate its analysis. The crux of this work lies in the way this is done. We start with a non-linear dimensionality reduction technique, Isomap, which has been shown to produce better results than linear dimensionality reduction in approximating the complex niceties of protein folding. However, the algorithm is computationally intensive for large proteins or a large number of samples (samples here refer to the various conformations that are used to ascertain the pathway between two distinctly different structures of a protein). We present a parallel algorithm written in C, using OpenMP, with an approximately two-fold speed-up. The results obtained are consistent with those obtained using sequential Isomap. Our method uses a distance function to calculate the distance between points, which in turn measures the similarity between the conformations that these points represent. The output is a lower-dimensional projection that can be used later for visualization and analysis. Quantitative validation is provided by computing the least RMSD between the two embeddings. The algorithm also makes efficient use of the available memory.
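
For orientation, classical Isomap builds a nearest-neighbour graph, computes geodesic (shortest-path) distances over it, and then embeds those distances with multidimensional scaling. The sketch below covers only the first two steps, in Python rather than the authors' C/OpenMP code, and uses plain Euclidean distance on toy 2D points where the abstract's method would use a distance between conformations. All names are hypothetical.

```python
import math


def knn_graph(points, k):
    """Symmetric k-nearest-neighbour graph as a dense distance matrix."""
    n = len(points)
    graph = [[0.0 if i == j else math.inf for j in range(n)] for i in range(n)]
    for i in range(n):
        # Rank the other points by distance and keep the k closest.
        order = sorted((j for j in range(n) if j != i),
                       key=lambda j: math.dist(points[i], points[j]))
        for j in order[:k]:
            d = math.dist(points[i], points[j])
            graph[i][j] = graph[j][i] = d
    return graph


def geodesic_distances(graph):
    """All-pairs shortest paths (Floyd-Warshall) over the neighbour graph."""
    n = len(graph)
    dist = [row[:] for row in graph]
    for m in range(n):
        for i in range(n):
            for j in range(n):
                if dist[i][m] + dist[m][j] < dist[i][j]:
                    dist[i][j] = dist[i][m] + dist[m][j]
    return dist


# Toy points along an L-shaped path: the geodesic follows the bend.
points = [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2)]
geo = geodesic_distances(knn_graph(points, k=2))
print(geo[0][4], math.dist(points[0], points[4]))
```

The pairwise distance and shortest-path loops dominate the cost as the number of conformations grows, which is why they are natural candidates for parallelization.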

PredHPI: an integrated web-server platform for the prediction and visualization of host-pathogen interactions

Presenting Author: Rakesh Kaundal, Utah State University

Cristian Loaiza, Utah State University


Understanding the mechanisms underlying infectious diseases is fundamental to developing prevention strategies. Host-pathogen interactions, which span from the initial invasion of host cells by the pathogen through the proliferation of the pathogen in its host, have been studied to find potential genomic targets for the development of novel drugs, vaccines, and other therapeutics. A few in silico prediction methods have been developed to infer novel host-pathogen interactions; however, there is no single framework that combines those approaches to produce and visualize a comprehensive analysis of host-pathogen interactions. We present a web server platform named PredHPI, available at http://bioinfo.usu.edu/PredHPI/. PredHPI is composed of independent sequence-based tools for the prediction of host-pathogen interactions. The Interolog module, which includes some of the IMEX databases (HPIDB, MINT, DIP, BioGRID and IntAct), provides three comparison flavors using the BLAST homology results (best-match, ranked-based and generalized). The Domain module predicts domains using Pfam and HMMer, and predicts interactions using the 3DID and IDDI databases. The GO Similarity module uses some of the Bioconductor species databases and the GOSemSim R package to calculate similarities between the GO terms detected using InterProScan. PredHPI incorporates the functionality to visualize the resulting interaction networks and integrates several databases with enriched information about the proteins involved. To our knowledge, PredHPI is the first system to build and visualize interaction networks from sequence-based methods as well as curated databases. We hope that our prediction tool will be useful for researchers studying infectious diseases.
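
The interolog idea can be sketched as follows: if a host protein and a pathogen protein each have a homolog in an interaction database, and those two homologs are known to interact, the host-pathogen pair is predicted to interact. This is an illustrative sketch only, not PredHPI's code; the function name, the input dictionaries (standing in for parsed BLAST hits), and all identifiers are hypothetical.

```python
def predict_interologs(host_homologs, pathogen_homologs, known_interactions):
    """Transfer known interactions between homologs to host-pathogen pairs."""
    # Store interactions as unordered pairs so (A, B) matches (B, A).
    known = {frozenset(pair) for pair in known_interactions}
    predictions = set()
    for host, host_hits in host_homologs.items():
        for pathogen, pathogen_hits in pathogen_homologs.items():
            if any(frozenset((a, b)) in known
                   for a in host_hits for b in pathogen_hits):
                predictions.add((host, pathogen))
    return sorted(predictions)


# Hypothetical homology hits and one known database interaction.
host_homologs = {"hostP1": ["dbA"], "hostP2": ["dbC"]}
pathogen_homologs = {"pathQ1": ["dbB"]}
known_interactions = [("dbB", "dbA")]
print(predict_interologs(host_homologs, pathogen_homologs, known_interactions))
```

Restricting each protein to its single best hit or allowing several ranked hits changes only the contents of the homolog lists passed in.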

Searching for translatable alternative splice isoforms in the human proteome

Presenting Author: Maggie Pui Yu Lam, University of Colorado Anschutz Medical Campus

Edward Lau, Stanford University


The human genome contains over 100,000 alternative splice isoform transcripts, but the biological functions of most isoform transcripts remain unknown and many are not translated into mature proteins. A full appreciation of the biological significance of alternative splicing therefore requires knowledge of isoforms at the protein level, such as that obtained using mass spectrometry-based proteomics. One described approach is to perform in-silico translation of alternative transcripts, and then to use the resulting custom FASTA protein sequence databases with a database search engine for protein identification in shotgun proteomics. However, challenges remain, as custom protein databases often contain many sequences that are in fact not translated as proteins inside the cell, thus contributing to a high false discovery rate in proteomics experiments.
We describe here a computational workflow and software to generate custom protein databases of alternative isoform sequences using RNA-seq data as input. The workflow is designed with the explicit goal to minimize untranslated sequences to rein in false positives. To evaluate its performance, we processed public RNA sequencing data from ENCODE to build custom FASTA databases for 10 human tissues (adrenal gland, colon, esophagus, heart, lung, liver, ovary, pancreas, prostate, testis). We applied the databases to identify unique splice junction peptides from public mass spectrometry data of the same human tissues on ProteomeXchange. We identified 1,984 protein isoforms including 345 unique splice-specific peptides not currently documented in common proteomics databases. We suggest that the described proteotranscriptomics approach may help reveal previously unidentified alternative isoforms, and aid in the study of alternative splicing.
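
To make the in-silico translation step concrete, the sketch below translates a coding sequence with the standard genetic code and splits the product into tryptic peptides, the units a database search engine would match against spectra. It is a simplified illustration, not the described workflow: it assumes a known reading frame, the common rule of cleavage after K or R except before P, and hypothetical function names.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(cds):
    """Translate a coding sequence in frame 0, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(cds) - 2, 3):
        aa = CODON_TABLE[cds[i:i + 3].upper()]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


def tryptic_peptides(protein):
    """Cleave after K or R, except when the next residue is P."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        is_last = i == len(protein) - 1
        if is_last or (aa in "KR" and protein[i + 1] != "P"):
            peptides.append(protein[start:i + 1])
            start = i + 1
    return peptides


protein = translate("ATGGCCAAGTGGCGTCCCGAA")
print(protein, tryptic_peptides(protein))
```

A peptide that spans an exon-exon boundary unique to one isoform is what makes that isoform identifiable at the protein level.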

Text Mining Novel Disease- and Drug-Specific Pathways

Presenting Author: Minh Pham, Baylor College of Medicine

Stephen Wilson, Baylor College of Medicine
Chih-Hsu Lin, Baylor College of Medicine
Olivier Lichtarge, Baylor College of Medicine


In response to the exponential growth of scientific publications, text mining is increasingly used to extract biological pathways and processes. Though multiple tools explore individual connections between genes, diseases, and drugs, few extensively examine contextual biological pathways for specific drugs and diseases. In this study, we extracted more than 3,000 functional gene groups for specific diseases and drugs by applying a community detection algorithm to a literature network. The network aggregated co-occurrences of Medical Subject Headings (MeSH) terms for genes, diseases, and drugs in publications. The detected literature communities were groups of highly associated genes, diseases, and drugs. The communities significantly captured genetic knowledge of canonical pathways and recovered future pathways in time-stamped experiments. Furthermore, the disease- and drug-specific communities recapitulated known pathways for those given diseases and drugs. In addition, diseases in the same communities had high comorbidity with each other and drugs in the same communities shared large numbers of side effects, suggesting that they shared mechanisms. Indeed, the communities robustly recovered mutual targets for drugs (AUROC = 0.75) and shared pathogenic genes for diseases (AUROC = 0.82). These data show that the literature communities not only represented known biological processes but also suggested novel disease- and drug-specific mechanisms, facilitating disease gene discovery and drug repurposing.
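
A literature co-occurrence network of the kind described can be sketched by counting, for every pair of terms, the number of publications that mention both. This is a minimal illustration under the assumption that each publication is already reduced to a list of terms; the function name and the term labels are hypothetical, and the authors' weighting scheme may differ.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_network(publications):
    """Edge weights = number of publications in which two terms co-occur."""
    weights = Counter()
    for terms in publications:
        # De-duplicate terms within a publication and order each pair
        # so that (a, b) and (b, a) map to the same edge.
        for a, b in combinations(sorted(set(terms)), 2):
            weights[(a, b)] += 1
    return weights


# Hypothetical term lists for three publications.
publications = [
    ["GENE_A", "DISEASE_X", "DRUG_Y"],
    ["GENE_A", "DISEASE_X"],
    ["GENE_B", "DRUG_Y"],
]
network = cooccurrence_network(publications)
print(network[("DISEASE_X", "GENE_A")])
```

The resulting weighted edge list is the input a community detection algorithm would partition into groups of associated genes, diseases, and drugs.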

Optimizing nontuberculous mycobacteria (NTM) de novo genome assemblies for application in clinical case studies

Presenting Author: Sara Kammlade, National Jewish Health

Nabeeh Hasan, National Jewish Health
L. Elaine Epperson, National Jewish Health
Michael Strong, National Jewish Health
Rebecca Davidson, National Jewish Health


To enable studies related to bacterial acquisition and clinical infections of nontuberculous mycobacteria (NTM), we developed a standardized bioinformatic analysis pipeline to process sequenced bacterial isolates from paired-end Illumina reads to fully annotated genomes and a companion PostgreSQL genomic database. Our NTM Genomes Database includes 1200+ isolates from 20 different NTM species which have been processed through our automated and optimized steps for read-trimming, de novo genome assembly, species identification using the average nucleotide identity (ANI) method, contig-ordering against a reference genome, and comprehensive annotation of genomic features. To optimize genome assembly methods and explore the theoretical potential of assembling complete genomes in the context of NTM, we performed experiments testing different parameter combinations in Skewer, SPAdes, and Unicycler on sequences from Illumina MiSeq (2x300bp) and HiSeq (2x250bp) platforms as well as on synthetic reads of varying read lengths and sequencing depths derived from published complete genomes. Assemblies from Illumina data revealed a negative effect of high GC content on assembly quality as measured by NG50. SPAdes and Unicycler yielded similar quality assemblies with Unicycler yielding fewer small (<1Kbp) contigs. From the synthetic reads we found diminished returns on NG50 improvement beyond 25x coverage at 250bp, and assembled a single contig genome using 50Kbp reads at 60x coverage. Using our high quality genomes we are able to identify core and accessory genes and investigate clinically relevant genotype-phenotype relationships. As an example, we will share findings from a case study of bacterial genomic evolution during a long-term pulmonary infection.
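
NG50, the assembly quality measure used above, is the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the estimated genome size (N50 uses the total assembly length instead). A minimal sketch, with a hypothetical function name and toy numbers:

```python
def ng50(contig_lengths, genome_size):
    """Contig length at which the cumulative sum reaches half the genome size."""
    target = genome_size / 2
    total = 0
    for length in sorted(contig_lengths, reverse=True):
        total += length
        if total >= target:
            return length
    return 0  # the assembly covers less than half of the genome estimate


# Toy assembly: five contigs against a genome estimate of 200 bases.
print(ng50([50, 40, 30, 20, 10], genome_size=200))
```

Because the denominator is the genome estimate rather than the assembly size, NG50 values are comparable across assemblies of the same organism.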

REAL-neo, a comprehensive neoantigen prediction and prioritization pipeline using tumor sequencing data

Presenting Author: Yesesri Cherukuri, Mayo Clinic

Yingxue Ren, Mayo Clinic
Vivekananda Sarangi, Mayo Clinic
Yi Lin, Mayo Clinic
Keith Knutson, Mayo Clinic
Yan Asmann, Mayo Clinic


Neoantigens are immunogenic peptides from tumor-specific somatic mutations. The expressed neoantigens can be presented by class-I or class-II MHC molecules and induce robust and enduring anti-tumor T-cell responses. Recent studies have demonstrated the great potential of personalized neoantigen vaccines as a new type of immunotherapy.

In general, identification of neoantigens from tumor sequencing data includes the following steps: (1) call somatic mutations from tumor genomic sequencing data; (2) derive neo-peptide sequences containing somatic mutations; (3) predict binding affinities between neo-peptides and MHC molecules. However, current bioinformatics practices ignore transcript splicing isoforms and expressed fusion gene products, and often focus only on non-synonymous single nucleotide mutations but not frame-shifting INDELs. In addition, MHC binding affinity prediction mainly focuses on class-I but not class-II MHC molecules. Furthermore, studies have shown that substantial numbers of neo-peptides predicted to have low MHC affinities are actually immunogenic, suggesting the necessity of alternative approaches for neoantigen discovery. Finally, nominated neoantigens need to be further filtered to ensure tumor specificity.

We have improved and optimized each step of the bioinformatics workflow for neoantigen identification from tumor sequencing data to address the complexity and current limitations of the process.
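
Step (2) of the general workflow, deriving neo-peptide sequences that contain a somatic mutation, can be sketched for the simplest case of a single amino acid substitution. This is not the REAL-neo implementation and does not cover frame-shifting INDELs, fusions, or splice isoforms; the function name is hypothetical and the 9-residue default is an assumption chosen as a typical class-I peptide length.

```python
def mutant_peptides(protein, position, mutant_residue, length=9):
    """All peptides of the given length that include the substituted residue.

    `position` is the 0-based index of the substituted residue.
    """
    mutated = protein[:position] + mutant_residue + protein[position + 1:]
    first = max(0, position - length + 1)          # leftmost window start
    last = min(position, len(mutated) - length)    # rightmost window start
    return [mutated[s:s + length] for s in range(first, last + 1)]


# Toy 20-residue sequence with a substitution at index 10 (M -> W).
peptides = mutant_peptides("ACDEFGHIKLMNPQRSTVWY", 10, "W")
print(len(peptides), peptides[0], peptides[-1])
```

Each peptide in the list would then be passed to a binding affinity predictor in step (3).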

Measuring chromosome conformation

Presenting Author: Brian Ross, University of Colorado Anschutz Medical Campus

James Costello, University of Colorado Anschutz Medical Campus


The in-vivo conformation of chromosomes is an outstanding unsolved problem in structural biology. Most structural information is currently inferred indirectly from Hi-C data, as direct measurements of chromosomal positioning have not been possible for more than a handful of genetic loci. We have previously demonstrated a computational method for scaling direct positioning measurements up to the whole-chromosome scale. Here we present our latest results from simulations and experiments.

Modeling the Structure of BioGRID PPI Networks

Presenting Author: Sridevi Maharaj, University of California-Irvine

Pedro Silva, University of California-Irvine
Zarin Ohiba, University of California-Irvine
Wayne Hayes, University of California-Irvine


Protein-protein interaction (PPI) networks are being continuously updated but are still incomplete, sparse, and have false positives and negatives. Amongst the heuristics employed to describe network topology, graphlets have emerged as successful in quantifying the local structure of biological networks. Some studies analyzing graphlet degree distributions and relative graphlet frequency found Geometric (GEO) networks to be a reasonable basis for modeling PPI networks. However, all extensive studies to model PPI networks as a whole utilized older PPI network data. While there are numerous techniques through which PPI data can be curated, in this study, we re-evaluate these models on the newest PPI data available from BioGRID for the following nine species: AThaliana, CElegans, DMelanogaster, EColi, HSapiens, MMusculus, RNorvegicus, SCerevisiae, and SPombe. To the best of our knowledge, this has not yet been performed, as the data is relatively new. We compare the graphlet distributions of several models to distributions of the updated networks and analyze their fit using several measures that have been shown to be suitable for measuring network distances (or similarities): RGFD, GDDA, Graphlet Kernel, and GCD. Despite minor behavioral differences amongst the comparison measures, we find that other than the Sticky model, the Scale-Free Gene Duplication and Divergence (SFGD) and Scale-Free (SF) models unanimously outperform other traditional models (including GEO and GEOGD) in matching the structure of these nine BioGRID PPI networks. We further corroborate these results using machine learning classifiers to categorize each species as a network model and visualize these results using t-SNE plots.
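
Graphlets are small connected induced subgraphs; the two 3-node graphlets are the open path and the triangle. The sketch below counts both in an undirected network and is meant only to illustrate what a relative graphlet frequency is built from. Real analyses typically use graphlets of up to five nodes and dedicated tools; the function name and toy network are hypothetical.

```python
from itertools import combinations


def three_node_graphlets(edges):
    """Count induced 3-node paths and triangles in an undirected graph."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    closed = open_paths = 0
    for center, neighbours in adj.items():
        for u, v in combinations(sorted(neighbours), 2):
            if v in adj[u]:
                closed += 1        # triangle, seen once from each of its 3 nodes
            else:
                open_paths += 1    # induced path u - center - v
    return {"path": open_paths, "triangle": closed // 3}


# Toy network: a triangle (0, 1, 2) with a pendant node 3 attached to node 2.
counts = three_node_graphlets([(0, 1), (1, 2), (0, 2), (2, 3)])
total = sum(counts.values())
print(counts, {name: n / total for name, n in counts.items()})
```

Comparing such frequency profiles between a real network and networks generated by a model is the basis of measures like RGFD.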

Addressing the compositional data problem in sequencing with a novel, robust normalization method

Presenting Author: James St. Pierre, University of Toronto

John Parkinson, Hospital for Sick Children, Toronto


A problem that faces high-throughput sequencing datasets is that raw sequencing data is semi-quantitative due to the random sampling procedure of the sequencing process itself. The raw counts produced only give relative abundances of various genes and must be appropriately normalized to give an approximation of the absolute abundances of genes in the samples. This ‘compositional data problem’ in sequencing is especially apparent in the microbiome field. Normalization methods developed for RNA-seq data have been shown to fail when used on 16S microbiome sequencing data, leading to inflated false discovery rates when performing differential abundance analysis. Moreover, the effectiveness of these normalization techniques when used on metagenomics and metatranscriptomics data has yet to be systematically evaluated. We present a novel normalization method that shows improved performance over previous methods (DESeq2, edgeR, and metagenomeSeq) when applied to simulated sequencing data. All current normalization methods have the statistical assumption that most genes (or taxa) are not differentially abundant between experimental groups. The new technique does not have this assumption and is the only method that successfully controls false positive rates during differential abundance testing on a simulated 16S dataset where 50% of taxa were set to be differentially abundant. Even ANCOM and ALDEx2, two compositional data analysis tools previously shown to be more robust than other methods, are shown here to have inflated false positive rates. This new normalization method will be an asset to microbiome researchers, leading to more robust discoveries.
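
The compositional data problem can be seen in a small worked example: when only one taxon changes in absolute abundance, the relative abundances of all other taxa shift as well. The sketch uses simple total-sum scaling on made-up counts; it illustrates the problem and is not the normalization method proposed in the abstract.

```python
def relative_abundance(counts):
    """Total-sum scaling: divide each count by the sample total."""
    total = sum(counts.values())
    return {taxon: n / total for taxon, n in counts.items()}


# Hypothetical absolute abundances; only taxon_C differs between samples.
before = {"taxon_A": 100, "taxon_B": 100, "taxon_C": 200}
after = {"taxon_A": 100, "taxon_B": 100, "taxon_C": 800}

rel_before = relative_abundance(before)
rel_after = relative_abundance(after)
print(rel_before["taxon_A"], rel_after["taxon_A"])
```

Although taxon_A is unchanged in absolute terms, its relative abundance drops, so a test on relative values would wrongly flag it as differentially abundant.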

A Case Study on the Effects of Noisy, Long-read Correction Approaches on Assembly Contiguity

Presenting Author: Brandon Pickett, Brigham Young University

Justin Miller, Brigham Young University
Perry Ridge, Brigham Young University


Third-generation sequencing technologies are advancing our ability to sequence increasingly long DNA sequences in a high-throughput manner. Pacific Biosciences (PacBio) Single-molecule, Real-time (SMRT) sequencing and Oxford Nanopore Technologies (ONT) nanopore sequencing routinely produce raw sequencing reads averaging 20-30kbp in length. Maximum read lengths have, in some cases, exceeded 100kbp. Unfortunately, these long reads are expensive to generate and have a high error rate (10-15%) when compared with Illumina short reads (1%). The limitation on assembly from high error rates can be mitigated by (a) co-assembling high-error, long reads with low-error, short reads (e.g., MaSuRCA) or (b) correcting the errors prior to assembly. Pre-assembly error correction typically happens by either (a) self-correction or (b) hybrid correction. Self-correction requires increased sequencing depth (and thus expense) and can be done with stand-alone software (e.g., Racon) or via a module in an assembler (e.g., Canu). Hybrid correction involves alignment of low-error, short reads to the raw long reads to generate the consensus (e.g., CoLoRMap). Note that low-error, short reads can also be used to polish the assembled contigs, i.e., correct misassemblies and errors. To investigate how self-correction, hybrid correction, or both correction methods affect assembly contiguity, we tried each approach in a case study. Bonefish (Albula glossodonta) DNA was extracted and sequenced on PacBio Sequel to theoretical 70x coverage and on Illumina HiSeq 2500 to theoretical 100x coverage with paired-end (PE) 2x250 in Rapid run mode. Our assembly results demonstrate that a combination of both approaches generates the most contiguous bonefish assembly.

Integrative analysis of transcriptomics and proteomics to detect novel protein isoforms from alternatively spliced transcripts induced by SF3B1 spliceosomal mutations

Presenting Author: Kelsey Nassar, University of Colorado Anschutz

Hyunmin Kim, University of Colorado Anschutz
Jihye Kim, University of Colorado Anschutz
Maggie Lam, University of Colorado Anschutz
Aik Choon Tan, University of Colorado Anschutz


Alternative splicing (AS) contributes to transcriptional complexity and is hypothesized to alter the proteome. AS events have been found to be increased in various cancers; however, the functional consequences of AS events on tumorigenesis remain unclear. Recently, mutations in core spliceosomal proteins, such as SF3B1, have been identified at a high frequency in multiple cancers. Next generation RNA-sequencing (RNA-seq) has identified that SF3B1 mutations result in global transcriptomic alterations in AS, primarily an increase in alternative 3’ splice site recognition. We hypothesized that mutations in SF3B1 increase proteome diversity through alternative splice variants that contribute to tumorigenesis. To test this hypothesis, we performed deep RNA-seq on SF3B1-mutant and SF3B1-wild-type uveal melanoma cell lines. We developed SALSA (Systemic ALternative Splice Analysis) for RNA-seq analysis to identify novel AS events resulting from SF3B1 mutations. In addition, we conducted proteome-wide mass spectrometry (MS) to identify novel protein isoforms detected from RNA-seq. We curated a novel peptide database from our custom AS events identified by SALSA to detect novel protein isoforms. From this integrative analysis, we identified 76 novel peptides enriched in SF3B1-mutant cells detected at both the RNA-seq and MS levels. From the MS peptide list, we validated SETD5, a 3’ alternatively spliced transcript. To our knowledge, this is the first description of a novel alternatively spliced transcript that results in a novel protein in SF3B1-mutant cells. This preliminary analysis lays the groundwork for further identification of novel protein isoforms resulting from SF3B1 mutations that ultimately may contribute to tumorigenesis.

Exploring the Fabric of Breast Cancer Using Gene Sets

Presenting Author: Judith Blake, The Jackson Laboratory

Carol Bult, The Jackson Laboratory
Leigh Carmody, The Jackson Laboratory
Mary E Dolan, The Jackson Laboratory
Harold J Drabkin, The Jackson Laboratory
Akenna Harper, The Jackson Laboratory
Joan Malcolm, The Jackson Laboratory
Monica S McAndrews, The Jackson Laboratory
Peter Robinson, The Jackson Laboratory
Sara Patterson, The Jackson Laboratory
Susan Mockus, The Jackson Laboratory
Erich Baker, Baylor University
David P Hill, Baylor University


A key requirement in understanding the complexities of biological systems is being able to move from a single gene approach to understanding how genes interact to give rise to complex phenomena. One way of analyzing how multiple genes interact is to study gene sets that represent a given aspect of biology. By comparing and contrasting sets, we determine whether sets describing different aspects are related, share common members or can be refined based on their metadata. The GeneWeaver suite of analysis tools allows comparison and manipulation of gene sets to identify and understand commonalities and differences and to understand the underlying biology that genes in the sets share. We took a targeted approach to gene set analysis by curating gene sets related to breast cancer. We used our curated sets to identify and quantify mutations that are common in breast cancer, and to explore how differentially expressed genes may influence chemotherapy response and how they can be used to identify underlying biology that might contribute to breast cancer progression.

This work was supported by NIH grants: NHGRI grant R25 HG007053 Diversity Action Plan for Mouse Genome Database (C. Bult, PI); NCI grant P30 CA034196 Jackson Laboratory Cancer Center (E. Liu, PI); NHGRI U41 HG000330 Mouse Genome Informatics (C. Bult and J. Blake, PIs); and NIAAA/NIDA AA018776 Data Structures, algorithms and tools for ontological discovery (E. Chesler, PI). The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

Measuring and Protecting the Sensitive Linking Information Leakage Across Epigenetic and Transcriptomics Datasets Through Genotype and Assay Prediction

Presenting Author: Arif Harmanci, University of Texas Health Science Center


Next-generation sequencing (NGS) is used to measure cellular phenotypes across many different levels, such as in epigenetic and transcriptomic datasets. Currently there are hundreds of functional genomics assays based on NGS technologies, and more assays are being proposed for measuring a diverse set of epigenetic and transcriptomic states of cells, even at some individual levels. While the main purpose of these data is to probe and reveal important biological knowledge, such as cancer epigenetics and transcriptomics, the privacy aspect of the data is not well studied. Much of the data is being distributed. We have previously shown that gene expression matrices can be used to predict the genotypes of expression quantitative trait loci (eQTL) and that these can be used in linking attacks to link expression (such as GTEx Project) and genotype (such as The 1000 Genomes Project) datasets. In this study we extend the possible linkages that can be performed and study whether the correlations in epigenetic, transcriptomic, and genetic datasets can be exploited to perform accurate linking attacks. Specifically, we evaluate linkage of epigenetic-transcriptomic and epigenetic-transcriptomic-genetic datasets. We first propose robust rank-based measurement statistics to measure leakages originating from epigenetic datasets. We then present a practical linking attack and evaluate linking accuracies under different scenarios. We finally present a method for sanitizing the leakage of genetic and transcriptomic information from epigenetic data. We demonstrate the effectiveness of the sanitization method on histone modification ChIP-Seq signal profiles and matrices.

Education, Networking and Building Next Generation Prototypes -- Hackathons and Analyzeathons for Bioinformaticians, Biomedical Informaticians and Computational Biologists!

Presenting Author: Ben Busby, NCBI


Over the past three years, NCBI has run or been involved in 31 data science hackathons. In these hackathons, participants assemble into teams of five or six to work collaboratively for three days on pre-scoped projects of general interest to the bioinformatics community. On average, about 80% of teams produce an alpha or beta working prototype, and approximately ten percent ultimately publish a manuscript describing their work. Thus, NCBI hackathons have generated over 150 products, and about 50% of them are stable, and/or continue to be developed. Some of these can be found at http://biohackathons.github.io. In addition to the production aspect, hackathons provide an immersive learning environment and promote networking opportunities. This presentation will discuss the educational aspects of these hackathons, options for setting up hackathons at your own institution, and tricks to make them successful. NCBI and other parts of NLM and NIH are also involved in other programs pertaining to project-based data science education. These include the NIH data science mentorship program, the visiting bioinformatician program, and the microbial metagenomics discovery challenge. In spring 2019, we are scaling up to run analyzeathons aimed at indexing massive data collections and making them amenable to modern analysis techniques, including machine learning.

Alternative Splicing of Single Cells in Squamous Cell Lung Cancer Premalignancy

Presenting Author: Hyunmin Kim, University of Colorado Anschutz Medical Campus

Moumita Ghosh, National Jewish Health
Jihye Kim, University of Colorado Anschutz Medical Campus
Aik-Choon Tan, University of Colorado Anschutz Medical Campus


Single-cell RNA sequencing (scRNA-seq) is a rapidly evolving technology for studying the transcriptomic landscape at the resolution of individual cells. Unique Molecular Identifier (UMI)-based approaches such as the 10x scRNA-seq system allow researchers to reconstitute the gene expression to cell (G2C) association by extracting the UMI and cell barcode from the reads assigned to genes. One common analysis of the G2C matrix is to reduce its dimensionality and present it as a visualization plot. However, underlying biological mechanisms such as cell viability and transcription efficiency are not addressed by current approaches. We hypothesized that alternative splicing (AS) could provide a complementary approach to address these limitations. To test this hypothesis, we performed deep scRNA-seq to deconvolute the premalignancy of Squamous Cell Lung Cancer (SCC). SCC often develops from a premalignant field that includes dysplasia. Identification of genomic changes in the dysplastic epithelium could aid both early detection and improved prevention strategies. The dysplastic lung is enriched with heterogeneous cell types, complicating identification of cell types in the premalignancy of SCC. We performed scRNA-seq on two endobronchial biopsies from a high-risk patient. We performed standard scRNA-seq analysis to deconvolute the premalignancy landscape of SCC. We also developed a novel pipeline that considers AS. We will report comparison results by showing cell subgroups obtained by the standard approach and the AS pipeline. We believe that the AS pipeline for scRNA-seq provides additional information to decompose the mixture of cells in a heterogeneous population.
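
The gene-to-cell (G2C) association described above can be sketched as counting distinct UMIs per gene and cell barcode, so that reads amplified from the same molecule are counted once. The input format (one tuple per aligned read), the function name, and the barcodes are assumptions made for illustration; production pipelines also correct sequencing errors in barcodes and UMIs.

```python
from collections import defaultdict


def gene_by_cell_counts(reads):
    """Count distinct UMIs for each (gene, cell barcode) pair.

    `reads` is an iterable of (cell_barcode, umi, gene) tuples.
    """
    molecules = defaultdict(set)
    for cell, umi, gene in reads:
        molecules[(gene, cell)].add(umi)   # duplicate UMIs collapse in the set
    return {key: len(umis) for key, umis in molecules.items()}


# Hypothetical reads: the first two are PCR duplicates of one molecule.
reads = [
    ("CELL1", "AAAC", "GENE_A"),
    ("CELL1", "AAAC", "GENE_A"),
    ("CELL1", "GGTT", "GENE_A"),
    ("CELL2", "AAAC", "GENE_A"),
    ("CELL2", "CCGA", "GENE_B"),
]
print(gene_by_cell_counts(reads))
```

An AS-aware variant of this matrix could key on splice junctions or isoforms instead of genes.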

Inferring trade-offs in protein folding networks

Presenting Author: Sebastian Pechmann, Université de Montréal


How proteins fold inside the cell remains a fundamental open question. The cell employs a complex regulatory and quality control system, the protein homeostasis network, that keeps proteins in their correct shape. Failure of protein homeostasis is directly linked to so-called protein misfolding diseases such as Alzheimer’s and Parkinson’s. However, how proteins interact with their quality control mechanisms remains poorly understood. Here, we demonstrate how the integration of genomic data into the systematic analysis of protein families can decouple contributions to sequence-structure relationships within proteins that define their interactions with cellular quality control mechanisms. Our work highlights overlapping constraints and trade-offs between protein synthesis, folding, and quality control. Joint inference of these trade-offs outlines quantitative principles underlying protein folding in the cell. We conclude by discussing how our results help to understand how mutations may perturb protein homeostasis and ultimately lead to ageing and neurodegenerative diseases.

Fully Bayesian model for non-random missing data in qPCR

Presenting Author: Valeriia Sherina, University of Rochester Medical Center

Matthew McCall, University of Rochester Medical Center
Tanzy Love, University of Rochester Medical Center


We propose a new statistical approach to obtain differential gene expression of non-detects in quantitative real-time PCR (qPCR) experiments through Bayesian hierarchical modeling. We propose to treat non-detects as non-random missing data, model the missing data mechanism, and use this model to impute Ct values or obtain direct estimates of relevant model parameters. A typical laboratory does not have unlimited resources to perform experiments with a large number of replicates; therefore, we propose an approach that does not rely on large sample theory. We aim to demonstrate possibilities that exist for analyzing qPCR data in the presence of non-random missingness through the use of Bayesian estimation. Bayesian analysis typically allows for smaller data sets to be analyzed without losing power while retaining precision. The heart of Bayesian estimation is that everything that is known about a parameter before observing the data (the prior) is combined with the information from the data itself (the likelihood), resulting in updated knowledge about the parameter (the posterior). In this work we introduce and describe our hierarchical model and chosen prior distributions, assess the model sensitivity to the choice of prior, perform convergence diagnostics of the Markov Chain Monte Carlo, and present the results of a real data application.
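
The prior-times-likelihood update can be illustrated with a deliberately simplified stand-in for the hierarchical model. The sketch assumes, purely for illustration, that Ct values are normal with a known standard deviation, that non-detects are values right-censored at a cycle limit of 40, and that the prior on the mean is flat over a grid; none of these choices come from the abstract, which models the missing-data mechanism explicitly.

```python
import math


def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))


def normal_sf(x, mu, sigma):
    """P(X > x) for a normal distribution."""
    return 0.5 * math.erfc((x - mu) / (sigma * math.sqrt(2)))


def posterior_mean(observed, n_nondetects, limit=40.0, sigma=1.0):
    """Grid approximation of the posterior mean of mu under a flat prior."""
    grid = [30 + 0.01 * i for i in range(1501)]   # candidate mu values, 30 to 45
    weights = []
    for mu in grid:
        # Each non-detect contributes the probability of exceeding the limit.
        likelihood = normal_sf(limit, mu, sigma) ** n_nondetects
        for ct in observed:
            likelihood *= normal_pdf(ct, mu, sigma)
        weights.append(likelihood)                # flat prior: posterior ∝ likelihood
    total = sum(weights)
    return sum(mu * w for mu, w in zip(grid, weights)) / total


observed = [38.2, 38.9, 39.4]                      # hypothetical Ct values
ignoring = posterior_mean(observed, n_nondetects=0)
modelled = posterior_mean(observed, n_nondetects=2)
print(round(ignoring, 2), round(modelled, 2))
```

Treating the non-detects as informative pulls the estimated mean Ct upward relative to simply discarding them, which is the kind of bias the non-random missingness model is meant to address.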

Homologous Inter-Domain Segments in Protein Families

Presenting Author: Dylan Barth, University of Nevada Las Vegas


We are interested in sequences between conserved domains of multi-domain proteins. These sequences have historically been ignored in evolutionary analysis because they are not conserved between species and therefore cannot be aligned effectively. To study the evolution of the lengths of these segments, we first need to define homologous inter-domain segments across species. We gathered gene trees from the Ensembl database to provide information on homologous gene families and the evolutionary relationships of the genes. Gene trees were divided into subtrees that are less than 400 million years old. Domain data for each human protein within each gene family have been gathered from both the Superfamily and Pfam databases. Using the boundaries of human domains, we inferred the homologous domain positions across the alignment of the gene family, and defined the homologous inter-domain segments. We have found that these inter-domain segments approximately follow an exponential distribution with a mean and median length of 46 and 23 bp respectively. Based on these data, we plan to study how the lengths of these segments have evolved through insertions and deletions.
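As a rough illustration of how inter-domain segments can be derived from domain boundaries, the sketch below takes the domain coordinates of a single protein and returns the gaps between consecutive domains. It assumes 1-based inclusive coordinates and merges overlapping or adjacent domain calls (as can happen when annotations from two databases are combined); the function name and example coordinates are hypothetical, and the step of projecting boundaries across a gene family alignment is not covered here.

```python
def inter_domain_segments(domains):
    """Return (start, end, length) for each gap between consecutive domains.

    `domains` is a list of (start, end) pairs, 1-based and inclusive.
    Overlapping or adjacent domains are merged first, so they yield no gap.
    """
    merged = []
    for start, end in sorted(domains):
        if merged and start <= merged[-1][1] + 1:
            # Extend the previous domain instead of opening a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))

    segments = []
    for (_, left_end), (right_start, _) in zip(merged, merged[1:]):
        seg_start, seg_end = left_end + 1, right_start - 1
        segments.append((seg_start, seg_end, seg_end - seg_start + 1))
    return segments

# Hypothetical domain calls for one protein; the second and third overlap.
example_domains = [(10, 50), (80, 120), (115, 200), (260, 300)]
example_segments = inter_domain_segments(example_domains)
```

Regions before the first domain and after the last one are deliberately excluded, since only segments flanked by domains on both sides are inter-domain in the sense used above.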

Integrating extracted relations into existing knowledge bases

Presenting Author: Harrison Pielke-Lombardo, University of Colorado


The KaBOB knowledge base (Livingston et al., “KaBOB.”) was built using structured data sources. However, these data sources must be manually curated by consulting the existing literature, and each publication contains only a few fragments of knowledge. Information in the literature is also unstructured and distributed across many sources.
Here, we present a method for extracting such unstructured information and integrating it into an existing knowledge base. First, concept embeddings are generated using ConceptMapper (Tanenblatt et al., “The ConceptMapper Approach to Named Entity Recognition.”) covering ten OBO ontologies, followed by relation extraction using a modification to the Snowball algorithm (Agichtein and Gravano, “Snowball.”) which considers the syntax and dependency within a sentence. Relations are matched to the Relation Ontology such that the domains and ranges of the relations are satisfied. Next, relations are matched by the coreference chains of their subjects and objects to form a graph of relations which can then be integrated into the knowledge base.
Introducing statements from the literature can lead to logical inconsistencies. Reasoners like ELK and HermiT can be used to evaluate whether new statements violate the current state of knowledge. Additionally, a confidence for each relation is assigned based on the number of literature sources that make each claim.
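One simple way to turn source counts into a confidence value is sketched below. The abstract states only that confidence depends on the number of literature sources making a claim; the saturating formula, the function names, and the placeholder identifiers here are assumptions for illustration, not the scoring actually used.

```python
from collections import defaultdict

def relation_support(extractions):
    """Map each (subject, predicate, object) triple to its distinct sources.

    `extractions` is an iterable of (subject, predicate, object, source_id).
    Repeated mentions within one source are counted once.
    """
    support = defaultdict(set)
    for subject, predicate, obj, source_id in extractions:
        support[(subject, predicate, obj)].add(source_id)
    return dict(support)

def confidence(n_sources, per_source_doubt=0.5):
    """Hypothetical saturating score: each extra source halves the doubt."""
    return 1.0 - per_source_doubt ** n_sources

# Placeholder extractions; concept and document identifiers are made up.
extractions = [
    ("concept_A", "relation_R", "concept_B", "doc1"),
    ("concept_A", "relation_R", "concept_B", "doc2"),
    ("concept_A", "relation_R", "concept_B", "doc2"),  # duplicate mention
    ("concept_C", "relation_R", "concept_D", "doc3"),
]
support = relation_support(extractions)
scores = {triple: confidence(len(sources)) for triple, sources in support.items()}
```

Counting distinct sources rather than raw mentions keeps a single paper that repeats a claim from inflating its score.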
To evaluate these methods, we attempt to demonstrate that we can recreate the existing model of cholesterol clearance in the liver from Reactome by using literature sources. The concept, coreference, and relation annotations are evaluated using the CRAFT corpus (Bada et al., “Concept Annotation in the CRAFT Corpus.”).

The Use of Scientific Ignorance to Drive Literature-Based Discovery in Prenatal Nutrition Across Disciplinary Boundaries

Presenting Author: Mayla Boguslav, University of Colorado Anschutz Medical Campus

Lawrence Hunter, University of Colorado Anschutz Medical Campus
Sonia Leach, National Jewish Health


Researchers are interested in the gaps in knowledge (unknowns, speculations, hypotheses). We aim to reframe the literature in terms of such gaps: we hypothesize that knowledge gaps exist in the scientific literature, we can automatically identify and classify them, and we can use them to drive further scientific research. Knowledge gaps exist in the literature because current natural language processing (NLP) explicitly attempts to discard them and instead focuses on the literature as a knowledge source. For example, the statement “little is known about whether calcium interacts with iron” would be discarded. However, researchers include such statements of open questions, research goals, and current controversies (ignorance statements) in scientific publications because scientists and clinicians are interested in them. Following the example above, the scientific goal remains to determine whether calcium interacts with iron. Utilizing such statements, we propose that formal computational representations of them, analogous to knowledge representations, will allow new computational tools to support research, identify and integrate relevant information from articles in other disciplines, and enhance clinical research. We hypothesize that this new form of literature-based discovery, finding new relevant information from other articles, will find novel connections across disciplinary boundaries in the biomedical domain. To evaluate this approach, we apply it to the field of prenatal nutrition because of its public health relevance and clearly stated knowledge gaps. The creation of such NLP tools will allow researchers, funders, medical professionals, and the public to explore the dynamic landscape of research and discover novel insights into current knowledge gaps.

Governance Innovations for Promoting Cross-institutional Electronic Health Data Sharing

Presenting Author: Kari Stephens, University of Washington

Adam Wilcox, University of Washington
Philip Payne, Washington University
Jason Morrison, University of Washington
Jennifer Sprecher, University of Washington
Rania Mussa, University of Washington
Randi Foraker, Washington University
Sarah Biber, Oregon Health Sciences University
Sean Mooney, University of Washington


Cross-institutional electronic health data sharing is an essential requirement for health innovation research. Healthcare organizations across the country are governed separately by state and local laws and policies that complicate research-related data sharing. Electronic health record (EHR) data are not only highly protected via federal laws (e.g., HIPAA) and regional Institutional Review Boards (IRBs), but are also often protected as assets by individual organizations. No clear pathway exists for organizations to execute governance for rapid EHR data sharing, stifling research efforts ranging from simple observational studies to complex multi-institutional trials. Universal governance solutions are essential to provide pathways for data sharing to address the rapid pace of research. The Clinical and Translational Science Award (CTSA) Program Data to Health (CD2H) Coordinating Center has launched a cloud data sharing pilot project to begin addressing this complex issue. In order to configure a web-based data sharing software tool, Leaf, that can cross-query comprehensive harmonized EHR data generated by multiple healthcare organizations, we are exploring a single governance solution (i.e., embodied in a data use agreement (DUA) and IRB solution) to accommodate both general and research-specific use. While DUAs and IRBs are not streamlined governance solutions, this is an essential first step in creating broader sustainable national governance solutions (e.g., master consortium agreements, access governance policies).