Comparative Genomics and the Alliance of Genome Resources

Presenting Author: Judith Blake, Jackson Laboratory

I will report on the emergence of the Alliance of Genome Resources (AGR), a unified approach to bringing together data from the major model organisms to facilitate comparative genomics approaches to understanding human biology and disease. The objective of the Alliance initiative is to develop a common framework for acquiring, curating, and accessing genome data from different model organisms, facilitating the use of different systems for experimentation while preserving the details of the unique biology of each model organism. The Alliance, formed in the fall of 2016 under NHGRI guidance, combines work from six model organism databases (Mouse Genome Database, Saccharomyces Genome Database, WormBase, FlyBase, Zebrafish Information Network, and Rat Genome Database) and the Gene Ontology Consortium. With the initial release of the web portal in October, we demonstrate approaches to data unification. I will report on the process of integration, the development of standards for data exchange and representation, and next steps and future plans for the Alliance community.

RSEQREP: RNA-Seq Reports, an open-source cloud-enabled framework for reproducible RNA-Seq data processing, analysis, and result reporting

Presenting Author: Travis Jensen, The Emmes Corporation

We present RNA-Seq Reports (RSEQREP), a new open-source cloud-enabled framework for reproducible and scalable RNA-Seq analysis. The software allows users to execute end-to-end gene-level RNA-Seq analysis on a pre-configured RSEQREP Amazon Machine Image (AMI) hosted by AWS or on their own Ubuntu Linux machine. As input, users can specify unstranded, stranded, and paired-end sequence data stored locally, on AWS S3, or at the Sequence Read Archive (SRA). The software automates a series of customizable steps including human reference genome alignment, reference-based compression of BAM files (CRAM), reference alignment QC, data normalization, multivariate data visualization, identification of differentially expressed genes, heatmaps, co-expressed gene clusters, enriched pathways, and a series of custom visualizations. The strength of the RSEQREP software is its end-to-end open-source solution that combines operating system, bioinformatics software, reference data set download, data processing, analysis, advanced data visualizations, and automatic reporting. In addition to intermediate files, RSEQREP outputs dynamically generated PDF reports using R, knitr, and LaTeX, as well as table and figure files that can readily be integrated into manuscripts. All RSEQREP components are built using open-source software, R code, and Bioconductor packages, allowing for further customization. To highlight and exemplify RSEQREP functionality, we provide example results for a publicly available RNA-Seq dataset (GEO ID: GSE45764).
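
As a hypothetical illustration of the normalization step in such a pipeline (RSEQREP itself is R-based; this Python sketch is illustrative only, not the tool's code), counts-per-million (CPM) scaling of one sample's gene counts can be written as:

```python
def cpm(counts):
    """Counts-per-million normalization for one sample.

    counts: dict mapping gene -> raw read count.
    Returns dict mapping gene -> CPM value.
    """
    total = sum(counts.values())
    if total == 0:
        return {gene: 0.0 for gene in counts}
    return {gene: c * 1e6 / total for gene, c in counts.items()}
```

CPM scaling makes samples with different sequencing depths comparable before downstream steps such as differential expression testing.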

Confounds in biomedical natural language processing

Presenting Author: Kevin Cohen, University of Colorado School of Medicine

Typical machine learning papers, in natural language processing or otherwise, compare two algorithmic approaches and label one as better than the other on the basis of some figure of merit. The assumption is that the algorithmic approach under consideration is the causal factor for the observed difference in performance. However, a reanalysis of the literature on natural language processing is consistent with the hypothesis that “minor” differences in preprocessing are plausible explanations for the performance differences that are usually ascribed to algorithmic differences. This work examines that possibility, focusing on research in the biomedical domain. It focuses on the biomedical domain because the results may be explanatory with respect to understanding the ubiquitous phenomenon of failure of general-domain language processing approaches to generalize to the biomedical domain.
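
The kind of "minor" preprocessing difference at issue can be illustrated with a toy sketch (the two tokenizers and the sentence are invented for illustration, not taken from the reanalyzed papers); identical downstream models would see different feature sets:

```python
import re

def tokenize_simple(text):
    # Naive whitespace tokenization: punctuation stays attached.
    return text.lower().split()

def tokenize_aggressive(text):
    # Split on every non-alphanumeric character: hyphenated
    # biomedical names like "IL-2" are broken apart.
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]

sentence = "IL-2-mediated activation of p53."
```

Here `tokenize_simple` yields tokens like `il-2-mediated` and `p53.`, while `tokenize_aggressive` yields `il`, `2`, `mediated`, and `p53`, so the two pipelines disagree on vocabulary before any algorithm runs.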

Integrated molecular and clinical analysis for understanding human disease relationships

Presenting Author: Winston Haynes, Stanford University

Rohit Vashisht, Stanford University
Francesco Vallania, Stanford University
Charles Liu, Stanford University
Gregory Gaskin, Stanford University
Erika Bongen, Stanford University
Shane Lofgren, Stanford University
Timothy Sweeney, Stanford University
Paul J. Utz, Stanford University
Nigam Shah, Stanford University
Purvesh Khatri, Stanford University

A detailed understanding of relationships among diseases will enable a deeper understanding of disease causation as well as offer opportunities to reposition drugs. To identify unbiased clusters of molecularly and clinically related diseases, we performed gene expression meta-analysis of 104 diseases using 600 studies with 41,000 samples and electronic health record analysis of over two million patients. Based on molecular data, we observed autoimmune diseases clustering with their specific infectious triggers and brain disorders clustering by disease class. In contrast, clinical data clustered diseases based on clinical practice. Our integrated molecular and clinical analysis spanned vastly different scales to identify robust disease clusters. We identified diseases with under-appreciated, therapeutically actionable relationships in our analysis. We highlighted the relationship between myositis and interstitial cystitis to encourage collaboration by connecting these seemingly disparate research communities.

Continuous integration for ensuring interoperability of the Open Biomedical Ontologies

Presenting Author: William Baumgartner, University of Colorado Anschutz Medical Campus

The modern ecosystem of biomedical ontologies, coupled with Semantic Web computational technology, has the potential to transform knowledge-based interpretation of biomedical research results. However, the distributed nature of biomedical ontology development and the varying degrees to which quality control measures are used within the community may hinder this potential. This talk will present a continuous integration (CI) system to monitor development of the community-driven Open Biomedical Ontologies (OBOs), a family of ontologies guided by the principles of openness, orthogonality, and interoperability. Our recent work shows that, since the OBOs are developed in a distributed and loosely coupled way, unintentional inter-ontology conflicts and logical inconsistencies can arise and persist even though they can be detected automatically. The potential for inter-ontology conflict has become more prevalent as many OBOs now define their concepts logically, using concepts spanning other ontologies. Such logical conflicts hamper efforts to reason over this base of knowledge as a whole and should alarm the ontology developer and user communities. Further, as the OBOs are under continuous development, new conflicts are sure to arise. CI plays an integral quality assurance role when engineering software by monitoring code for changes, executing tests whenever change is detected, and providing feedback on errors induced by change. We will present data supporting the need for CI over the OBOs to ensure their global interoperability, and will introduce to the ontology developer and user communities a soon-to-be-deployed system capable of continuously monitoring the OBOs for logical consistency and adherence to ontology development best practices.
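
The CI idea can be sketched minimally as a loop that runs each quality-control check against each ontology and collects failures; the check shown, which flags cross-ontology references to undefined terms, is a hypothetical simplification and not the actual system described in the talk:

```python
def run_ci_checks(ontologies, checks):
    """Run every QC check on every ontology and collect failures.

    ontologies: dict name -> ontology representation (here, plain dicts).
    checks: list of (check_name, fn); fn(onto) returns a list of error
    strings, empty when the check passes.
    """
    report = {}
    for name, onto in ontologies.items():
        errors = []
        for check_name, fn in checks:
            errors.extend(f"{check_name}: {e}" for e in fn(onto))
        report[name] = errors
    return report

def make_dangling_check(all_terms):
    """Build a check flagging references to terms defined in no ontology."""
    def dangling(onto):
        return [f"unknown term {t}" for t in sorted(onto["references"])
                if t not in all_terms]
    return dangling
```

A real deployment would trigger `run_ci_checks` on every committed change and surface the per-ontology error report to developers.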

Classification of glioblastoma subtypes by integrating genomic and histopathological image features with probabilistic graphical models

Presenting Author: Dimitris Manatakis, University of Pittsburgh

Panayiotis Benos, University of Pittsburgh
Akif Tosun, University of Pittsburgh
Chakra Chennubhotla, University of Pittsburgh

Integrating multi-modal biomedical data types under the same analytic framework is an important step towards harvesting the existing, fragmented knowledge that is collected in a clinical setting. H&E staining has been used extensively in pathology for diagnosing various diseases or disease subtypes, but it has limitations. Computational pathology, i.e., the processing and analysis of H&E-stained images with computational methods, augmented by high-throughput data collection from the same patients, aims to improve disease diagnosis by combining multiple, complementary sources of information. Currently, methods for integrating clinical image and -omics features have mainly used second-order methods. These methods, however, cannot properly capture the complexity of the direct and indirect relations between these features. In this paper, we present an alternative method for data integration: the use of directed graphs, a form of probabilistic graphical models. Using new methods we have developed to (1) parse tissue heterogeneity information from H&E images and (2) learn directed graphs over mixed data types, we test the hypothesis that there is a relationship between gene expression and spatial information in tissue organization in glioblastoma.

MPEG-G: the emerging standard for genomic data compression

Presenting Author: Mikel Hernaez, University of Illinois

Claudio Alberti, GenomSys
Marco Mattavelli, EPFL
Idoia Ochoa, University of Illinois

The development of Next Generation Sequencing technologies might enable the use of genomic information in everyday public health practice, but the large volumes of raw data generated become a serious obstacle to its wide diffusion. The lack of an appropriate representation for compressed genomic data is widely recognized as a critical element limiting its application potential. Besides efficient compression, which is an essential element for any usage of genomic information, several other functionalities that the current data formats do not support are of paramount importance.
Two ISO committees, TC 276 and JTC 1/SC 29/WG 11 (MPEG), have combined their efforts and are jointly working to develop MPEG-G, a new compression standard for genomic sequencing data that aims to provide new effective solutions for genomic information processing applications. Efficient compression, selective data access and queries, support for data streaming, and data protection are among the new functionalities natively provided in the compressed domain.
The different elements composing the MPEG-G standard (transport, compression technology, APIs, conformance, and reference software) are intended to provide the standard framework that will enable stakeholders to exploit to its full potential the large body of genomic data that current sequencing technology is generating.
This paper provides a summary of the components, main innovations, functionality, and performance of the current version of the MPEG-G technology, which is undergoing the final steps of the ISO standardization process.

Physical aspects of the α-helix and their implications towards amyloidal tendency

Presenting Author: Simcha Srebnik, Technion - Israel Institute of Technology

α-Helices are the most abundant structures found within proteins and play an important role in determining the global structure of proteins and their function. It is common to describe protein structure using Ramachandran (φ,ψ) dihedrals, which reveal a diagonally aligned ellipsoidal region of the α-helices on the Ramachandran map. We show that an alternative orthogonal coordinate system can be used to describe the helical conformation in terms of physical parameters, i.e., the number of residues per turn (ρ) and the angle (ϑ) between backbone carbonyls relative to the helix direction, through a linear transformation between the two coordinate systems (φ,ψ and ρ,ϑ). When described in this way, we observe a direct correlation between the physical interpretation of the α-helical structure and its tendency for amyloid formation. We conclude with qualitative scenarios that may affect the helical structure and induce modifications to ρ and ϑ.

A Novel Systematic Analysis of ALDH Isozyme Specificity in Head and Neck Squamous Cell Carcinoma

Presenting Author: Brian Jackson, University of Colorado Anschutz Medical Campus

AC Tan, University of Colorado Anschutz Medical Campus
Antonio Jimeno, University of Colorado Anschutz Medical Campus

Head and Neck Squamous Cell Carcinoma (HNSCC) is among the top 10 diagnosed cancers in the United States, and is one of the few cancers that have increased in incidence over the last 10 years. It has recently been elucidated that Aldehyde Dehydrogenase (ALDH) superfamily members play important roles in cancer, and in particular ALDH1 family members are implicated as markers and effectors of stemness in HNSCC. The ALDH superfamily consists of 19 members in humans that catalyze the conversion of an aldehyde to its corresponding carboxylic acid. This enzyme superfamily is responsible for the detoxification of a large number of endogenous and exogenous substrates. Interest in inhibition of ALDH members has increased in recent years, but ALDH isozymes have overlapping yet often distinct substrates and roles in human physiology. We attempt to address these problems by 1) building a comprehensive database of ALDH substrates and inhibitors and 2) systematic computational modeling of ALDH isozymes to better predict the action and specificity of novel ALDH substrates and inhibitors. With these tools, future efforts at specific ALDH isozyme inhibition will be more efficiently realized.

Building a Systems-level model of the immune response to Salmonella infection

Presenting Author: Marta Andres-Terre, Stanford University

Adityia Rao, Stanford University
Michele Donato, Stanford University
Purvesh Khatri, Stanford University

Infectious diseases are the result of a molecular warfare between the host immune system and the pathogen. Their treatment and eradication are complicated by the heterogeneous nature of these interactions, which remains poorly understood. Here, we have identified a set of genetic and molecular determinants that characterize the host immune response to bacterial infection. First, we conducted an integrated multi-cohort analysis of publicly available gene expression data and identified a common host gene signature across different bacterial infections. This meta-bacterial signature can 1) distinguish bacterial from viral and fungal infections, and 2) predict symptom onset and disease outcome in infected individuals. We then identified a Salmonella-specific host-response gene signature, which can be used as a prognostic marker for typhoid fever, as well as for understanding the biology of Salmonella infections. Second, we applied cell mixture deconvolution to the same datasets we used to obtain the gene expression signature and estimated the cellular populations driving the response to Salmonella infection. We are currently working on building a quantitative model of bacterial infections in which we take into account both the cellular and gene expression signatures identified using heterogeneous data sources. Defining the metrics that characterize the immune response to bacterial infection will go beyond the concept of identifying biomarkers, as this model could potentially be used as a platform to identify and understand novel mechanisms underlying host-pathogen interactions.
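
Cell mixture deconvolution, as used here, models bulk expression as a weighted sum of per-cell-type expression signatures and solves for the cell fractions. A toy two-gene, two-cell-type sketch with invented numbers (the authors' analysis uses real signature matrices and many more genes) solves the linear system exactly by Cramer's rule:

```python
def deconvolve_two(sig, bulk):
    """Solve bulk = sig @ fractions for two cell types (Cramer's rule).

    sig: 2x2 list of lists; sig[g][c] = expression of gene g in cell type c.
    bulk: length-2 list of bulk expression values.
    Returns (f0, f1), the estimated cell-type fractions.
    """
    (a, b), (c, d) = sig
    det = a * d - b * c
    if det == 0:
        raise ValueError("signature matrix is singular")
    f0 = (bulk[0] * d - b * bulk[1]) / det
    f1 = (a * bulk[1] - bulk[0] * c) / det
    return f0, f1
```

With many genes the system is overdetermined and fractions are obtained by constrained least squares rather than an exact solve, but the underlying model is the same.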

Development of a statistical model of the genetics of maternal-fetal dyads in neonatal abstinence syndrome

Presenting Author: James Denvir, Marshall University

Richard Egleton, Marshall University
Don Primerano, Marshall University
Jun Fan, Marshall University
Vincent Setola, West Virginia University
Laura Lander, West Virginia University

As a consequence of the dramatic increase in opioid use in the US in the last decade, a large increase in the number of infants born with Neonatal Abstinence Syndrome (NAS) has been observed. Opioids entering the mother’s bloodstream during pregnancy are metabolized at three primary sites before having the opportunity to influence brain function in the developing fetus: the maternal liver, the placenta, and the fetal liver. We are in the process of performing whole exome sequencing on mother-infant dyads in order to identify genetic variants in either the mother or infant that may be predictors of NAS severity or prognosis, and may inform treatment plans for the neonate. Here, we present the development of a genetic model that regards the mother-infant dyad as a single, quasi-tetraploid entity that may be used for hypothesis testing in this context.
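
One simple way to encode the quasi-tetraploid view of the dyad (a hypothetical sketch of the idea, not necessarily the authors' final model) is to pool the maternal and fetal alternate-allele counts at each variant site into a single dosage on four chromosomes:

```python
def dyad_dosage(maternal_gt, fetal_gt):
    """Combine maternal and fetal genotypes at one site into a single
    quasi-tetraploid alternate-allele dosage (0-4).

    Genotypes are given as diploid alternate-allele counts (0, 1, or 2).
    """
    for gt in (maternal_gt, fetal_gt):
        if gt not in (0, 1, 2):
            raise ValueError("genotype must be 0, 1, or 2")
    return maternal_gt + fetal_gt
```

Such a pooled dosage gives a single per-site variable per dyad for association testing against NAS severity, at the cost of losing which parent carries the allele.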

Tracing the Innate Genetic Evolution and Spatial Heterogeneity in Treatment Naïve Lung Cancer Lesions

Presenting Author: Jihye Kim, University of Colorado Anschutz Medical Campus

Kenichi Suda, Kindai University Faculty of Medicine
Isao Murakami, Higashi-Hiroshima Medical Center
Leslie Rozeboom, University of Colorado Anschutz Medical Campus
Christopher J. Rivard, University of Colorado Anschutz Medical Campus
Tetsuya Mitsudomi, Kindai University Faculty of Medicine
Fred R. Hirsch, University of Colorado Anschutz Medical Campus
Aik-Choon Tan, University of Colorado Anschutz Medical Campus

Extensive intratumor heterogeneity (ITH) has been observed in individual patient tumors by large-scale sequencing analyses. ITH can contribute to drug resistance and cancer metastasis. Distinct microenvironments provide selection advantages to sub-populations of cells for cancer metastasis. We hypothesized that ITH can contribute to metastasis by disseminating different sub-populations of cancer cells to distant sites. We collected tumor specimens and non-cancer tissues from treatment-naïve autopsied patients to study the innate genetic evolution and spatial heterogeneity. Our cohort consists of four NSCLC patients (two adenocarcinoma and two squamous cell carcinoma) and one SCLC patient. Each patient had 5–9 primary and metastatic lesions. Comprehensive data analyses were performed on the RNA-seq data, including gene expression and pathway analyses, fusion detection, and somatic variant detection. Global unsupervised clustering of expression data revealed that the NSCLC patients clustered separately from the SCLC patient, and that the adenocarcinoma and squamous cell carcinoma patients formed two clusters. Within each patient, metastatic lesions clustered according to the distant metastatic sites. Pathway analysis and somatic mutation analysis in individual patients revealed that, in general, the primary lesion is distinct from metastatic lesions in NSCLCs. For the SCLC patient, distant metastases and lymph node metastases clustered according to different parts of the primary tumor. We also identified a KIF5B-RET fusion as a founder mutation in all tumor specimens obtained from a never-smoking adenocarcinoma patient. This study provides evidence that ITH contributes to distant metastasis based on the similarity and heterogeneity between primary and metastatic lesions in lung cancer patients.

Locating sites of ribonucleotide incorporation in RNase H2-deficient cells

Presenting Author: Alli Gombolay, Georgia Institute of Technology

Francesca Storici, Georgia Institute of Technology
Fredrik Vannberg, Georgia Institute of Technology

Ribonucleoside monophosphates (ribonucleotides or rNMPs) that are inadvertently incorporated into DNA can wreak havoc on genome stability. Under normal cellular conditions, the RNase H2 enzyme initiates the removal of rNMPs by efficiently cleaving these toxic nucleotides. However, when left unrepaired, rNMPs can cause breaks in the DNA strand, replication stress, and spontaneous mutagenesis. To understand the biological consequences of rNMPs and their role in the pathogenesis of disease, we must profile the distribution of rNMPs in the genome. Determining where rNMPs are differentially incorporated will allow us to identify how rNMPs cause genome instability. Recent advances in laboratory techniques and computational methods provide the unique opportunity to capture these non-standard nucleotides and map their locations in the genome. One of these techniques is ribose-seq (Koh et al. Nature Methods 2015). In contrast to other techniques, ribose-seq directly captures rNMPs embedded in DNA and may be applied to any cell type at any stage of the cell cycle. Achieving the full potential of ribose-seq is dependent upon computational methods tailored to analyzing this type of data. The Ribose-Map toolkit is a novel collection of user-friendly, open-source, and well-documented scripts developed to profile the incorporation of rNMPs captured via ribose-seq. Ribose-Map allows the user to determine the genomic coordinates of rNMPs, calculate nucleotide frequencies, locate rNMP genomic hotspots, and create publication-ready figures. By exploring the location and distribution of rNMPs in RNase H2-deficient cells, we may begin to understand the role rNMPs play in genome instability and, ultimately, disease.
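
One of the analyses Ribose-Map performs, per-site nucleotide frequencies at captured rNMP positions, can be sketched as follows (illustrative only; the function and data layout are assumptions, not Ribose-Map's actual code, which works with full genomic coordinate files):

```python
from collections import Counter

def rnmp_frequencies(genome, positions):
    """Nucleotide frequencies at rNMP incorporation sites.

    genome: reference sequence for one chromosome, as a string.
    positions: iterable of 0-based coordinates of captured rNMPs.
    Returns dict mapping nucleotide -> fraction of sites.
    """
    counts = Counter(genome[p] for p in positions)
    total = sum(counts.values())
    return {nt: n / total for nt, n in counts.items()}
```

Comparing these frequencies to the genome-wide background would reveal whether rNMPs are preferentially incorporated opposite particular nucleotides.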

Increasing flexibility of Laboratory Information Management System (LIMS) through extension of relational database model with NoSQL

Presenting Author: Marcin Domagalski, University of Virginia

Przemyslaw Porebski, University of Virginia
David Cooper, University of Virginia
Marcin Cymborowski, University of Virginia
Heping Zheng, University of Virginia
Marek Grabowski, University of Virginia
Wladek Minor, University of Virginia

LabDB is a modular, specialized super-LIMS, originally developed to track the macromolecular structure determination pipeline from cloning to structure solution (Zimmerman et al. 2014, Methods Mol. Biol. 1140). The system has modules to automatically or semi-automatically import data from a variety of different types of laboratory equipment, including chromatography and electrophoresis systems, crystallization observation robots, isothermal titration calorimetry, and others. Here we present a novel abstract data model for a new generation of LabDB LIMS, designed to address the extreme complexity of biology workflows. A purely relational database model was reduced to a minimum number of database tables capable of storing experiment-specific data structures and their schemas (definitions) in JavaScript Object Notation (JSON) form. The model is suitable for extension with programmatic plugins as well as for manual definition of samples and workflows through the user interface. Its “generic” design limits the structural requirements for stored data and provides better usability and higher performance. The data is accessible through a representational state transfer (REST) API, which provides easy interoperability with other software systems. In conclusion, LabDB's flexible schema can integrate instruments and manage various experimental samples and workflows end-to-end, ensuring traceable results and reproducibility of experiments, and, not least, improving overall lab efficiency.
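
The data model described above can be sketched with a single generic table that pairs a JSON document with the name of its schema (definition); this is an illustrative simplification in Python/SQLite, not LabDB's actual implementation, and the table and column names are invented:

```python
import json
import sqlite3

# One generic table stores any sample or experiment record as JSON,
# alongside the name of the schema it is meant to conform to.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE record (
    id INTEGER PRIMARY KEY,
    schema_name TEXT NOT NULL,
    body TEXT NOT NULL
)""")

def insert_record(conn, schema_name, data):
    """Store an arbitrary dict as a JSON document; return its row id."""
    cur = conn.execute(
        "INSERT INTO record (schema_name, body) VALUES (?, ?)",
        (schema_name, json.dumps(data)),
    )
    return cur.lastrowid

def fetch_records(conn, schema_name):
    """Return all documents stored under a given schema, decoded."""
    rows = conn.execute(
        "SELECT body FROM record WHERE schema_name = ?", (schema_name,)
    )
    return [json.loads(body) for (body,) in rows]
```

The design choice is the one the abstract describes: the relational layer guarantees identity and traceability, while the JSON body lets each experiment type define its own fields without a schema migration.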

Correcting for non-random fragmentation allows for more accurate estimation of relative abundance in NGS metagenomic data

Presenting Author: Elmar Pruesse, University of Colorado Denver

Catherine Lozupone, University of Colorado Denver

Accurately estimating the relative abundances of genes, contigs or entire genomes
is essential to metagenomic studies. Variation in contig abundance between samples is
used for de-novo in-silico organism isolation (binning); per genome or gene
abundances allow comparative microbial community studies to establish associations
of genes or microbes with disease states or environmental factors. However, current methods are surprisingly naive. Using simple averaging, most tools assume that reads are randomly sampled from the source DNA material.

This assumption, while acceptable in the past, is broken by modern
NGS sequencing methods. Tagmentation-based library preparation protocols such as
Nextera exhibit distinct biases in fragmentation point distribution. Sonication
based fragmentation protocols exhibit less, but still noticeable, bias. These biases
result in read depth variation at gene-scale windows far greater than would be expected
from a Poisson distribution. This inhibits the binning of small contigs which often comprise
the majority of an assembly. Moreover, the association of bias with DNA patterns
potentially distorts relative abundances within gene families, which can lead to false positives
or false negatives in comparative analyses.

We further present an algorithm for abundance estimation in the presence of non-random
fragmentation. By learning fragmentation point preferences of the employed library
preparation protocols from all mapped reads, the read depths can be adjusted for the
positional likelihood of observing a read. Positions at which no fragmentation was
observed are accounted for by incorporating the ratio of expected and observed zero
observation positions. Overall, our Cython implementation demonstrates far lower variance
in predicted abundance at small (1kb) window sizes than naive approaches, compensating
for the effects of non-random fragmentation nearly completely.
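
The correction described above can be sketched as follows; this simplified version learns a per-dinucleotide fragmentation preference at read-start positions and divides the observed start depth by that positional weight. It is an illustration of the idea only (hypothetical function names, tiny k), not the authors' Cython implementation:

```python
from collections import Counter

def start_weights(genome, read_starts, k=2):
    """Learn relative fragmentation preference per k-mer at read starts.

    Assumes read_starts is non-empty. Returns dict k-mer -> weight,
    where weight > 1 means that context is fragmented more often than
    its genome-wide frequency predicts.
    """
    context = Counter(genome[p:p + k] for p in read_starts
                      if len(genome[p:p + k]) == k)
    background = Counter(genome[i:i + k]
                         for i in range(len(genome) - k + 1))
    n_ctx = sum(context.values())
    n_bg = sum(background.values())
    return {kmer: (context[kmer] / n_ctx) / (background[kmer] / n_bg)
            for kmer in background}

def adjusted_depth(genome, read_starts, weights, k=2):
    """Observed start depth per position, divided by its context weight."""
    raw = Counter(read_starts)
    return {p: raw[p] / weights[genome[p:p + k]]
            for p in raw if weights.get(genome[p:p + k], 0) > 0}
```

A real implementation would additionally handle contexts with zero observed fragmentation, as the abstract notes, by incorporating the ratio of expected to observed zero-observation positions.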

Assembly-free metagenomic analyses on distributed computational platforms

Presenting Author: Anna Paola Carrieri, IBM Research UK

Philippe Gambron, Science and Technology Facilities Council
Martyn Winn, Science and Technology Facilities Council
Neil Venables, Cancer Research UK Manchester Institute
Vipin Sachdeva, Silicon Therapeutics
Will Rowe, Science and Technology Facilities Council
Kirk Jordan, IBM Research

Recent advances in high-throughput sequencing technologies have enabled the characterization and comparison of microbial communities in very diverse environments, such as the human gut microbiome. Characterizing microbial function and composition across different individuals and diseases is important to understand the role of microbiota in disease development. Consider comparing metagenomes of healthy and diseased subjects, or pre- and post-antibiotic-treatment microbial communities. Current taxonomic classification methods focus on sequencing of specific marker genes, such as 16S rRNA, and rely on existing microbial reference databases, which are often incomplete. A more informative method is whole-metagenome shotgun sequencing, which generates huge collections of short reads. The use of larger reference databases and/or the need to assemble reads makes whole-metagenome analysis both data and computation intensive. We present a high performance computing (HPC) tool which generates distributed k-mer spectra from metagenomic sequences and facilitates assembly-free metagenomic analyses. K-mer spectra are compressed representations of metagenomic reads that allow comparisons of metagenomes while including the entire data volume in the analysis. The HPC k-mer library rapidly estimates pairwise dissimilarities between metagenomes and accelerates metagenomic analyses by exploiting distributed computational platforms. The computed dissimilarity measures can be applied for cluster analysis and classification, for differentiating between phenotype groups, for inferring features of metagenomic composition, and for phylogenetic reconstruction from metagenomes. The latter problems are computationally expensive; therefore, the use of high performance computing can overcome limitations in time and space.
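
A k-mer spectrum and a pairwise dissimilarity between spectra can be sketched in a few lines; Bray-Curtis is used here as an example measure, which may differ from the dissimilarities the HPC library actually implements, and a real tool would distribute this over many nodes:

```python
from collections import Counter

def kmer_spectrum(reads, k=4):
    """Compressed representation: k-mer counts pooled over all reads."""
    spec = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            spec[read[i:i + k]] += 1
    return spec

def bray_curtis(s1, s2):
    """Bray-Curtis dissimilarity between two spectra (0 = identical)."""
    shared = sum(min(s1[kmer], s2[kmer]) for kmer in s1.keys() & s2.keys())
    total = sum(s1.values()) + sum(s2.values())
    return 1 - 2 * shared / total
```

Because spectra are fixed-size summaries regardless of read count, all pairwise dissimilarities between metagenomes can be computed without assembly or reference databases.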

Discovering the Contribution of the Gut Microbiome to the Plasma Metabolome

Presenting Author: Michael Shaffer, University of Colorado Denver Anschutz Medical Campus

Catherine Lozupone, University of Colorado Denver Anschutz Medical Campus
Nichole Reisdorph, University of Colorado Denver Anschutz Medical Campus

The human body contains roughly the same number of bacterial cells as human cells, and these bacteria encode 150 times more genes. These microbes primarily live in the gut, produce metabolites that are transported all over the human body, and have the potential to influence disease. While untargeted metabolomics can be used to investigate the influence of microbial metabolites, determining whether disease-associated metabolites come from microbes, the host, or the environment can be challenging. We have developed methods that use the KEGG and PICRUSt databases to predict the origin of metabolites and applied them to a set of 54 paired plasma metabolome and stool 16S microbiome samples. Using human KEGG genes, we predicted that 1376 unique compounds could be produced by the host, and using PICRUSt and KEGG we predicted 1321 compounds produced by the detected gut microbial community. Of 1018 KEGG-annotated compounds in the plasma metabolome, 155 were predicted human metabolites and 135 were bacterial. This cohort contains individuals with HIV and with HIV and lipodystrophy (a metabolic comorbidity); of compounds that differed with lipodystrophy, 2 were predicted to be produced only by bacteria, 11 only by humans, and 11 by both. However, the majority of metabolites in the plasma metabolome could not be assigned to KEGG IDs, limiting these techniques to only a small subset of metabolites. Our results highlight both the promise and the challenges of using metabolic networks to predict the bacterial origin of metabolites.
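
The origin-assignment logic reduces to set operations over the predicted compound lists; a minimal sketch (illustrative names and toy inputs, not the authors' code or data) is:

```python
def classify_origins(plasma, host_predicted, microbe_predicted):
    """Partition plasma compounds by predicted origin.

    Each argument is an iterable of compound identifiers (e.g. KEGG IDs).
    Returns a dict of four disjoint sets covering all plasma compounds.
    """
    plasma = set(plasma)
    host = set(host_predicted)
    microbe = set(microbe_predicted)
    return {
        "host_only": plasma & (host - microbe),
        "microbe_only": plasma & (microbe - host),
        "both": plasma & host & microbe,
        "unassigned": plasma - host - microbe,
    }
```

The "unassigned" set corresponds to the limitation noted above: plasma metabolites without a KEGG ID, or whose producers are absent from both predicted gene sets, cannot be attributed.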

Statistical Integration and Feature Selection for Candidate Biomarker Discovery

Presenting Author: Bobbie-Jo Webb-Robertson, Pacific Northwest National Laboratory

Lisa Bramer, Pacific Northwest National Laboratory
Sarah Reehl, Pacific Northwest National Laboratory
Brigette Frohnert, University of Colorado
Marian Rewers, University of Colorado

High-throughput technologies currently have the capability to capture information at both global and targeted scales for the transcriptome, proteome, metabolome, and lipidome, as well as determining functional aspects of these biomolecules. The promise of data integration is that by utilizing these disparate data streams, in combination with low-throughput clinical information, a more complete or accurate estimate of system behavior can be obtained. In the case of biomarker discovery to better diagnose and predict outcomes of disease, one goal is to identify the best subset of molecules that can separate specific phenotypes of interest. However, in a space of tens of thousands of variables (e.g., genes, proteins), feature selection approaches often yield over-trained models with poor predictive power. Moreover, feature selection algorithms are typically focused on single sources of information and do not evaluate the effect on downstream statistical integration models. We present an ensemble-based feature selection approach that optimizes the outcome of interest in the context of the integrated posterior probability. We demonstrate that this approach improves sensitivity and specificity over simple selection routines based on individual datasets and present the application of the approach on a juvenile diabetes cohort.
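
One simple form of ensemble feature selection, aggregating per-dataset feature rankings by mean rank, can be sketched as follows; this illustrates the general ensemble idea only, since the authors' method instead optimizes the integrated posterior probability, which is not reproduced here:

```python
from collections import defaultdict

def ensemble_select(rankings, top_n):
    """Aggregate per-dataset feature rankings by mean rank; keep top_n.

    rankings: list of lists, each an ordered list of feature names
    (best first) produced from one data source.
    """
    rank_sum = defaultdict(float)
    counts = defaultdict(int)
    for ranking in rankings:
        for pos, feat in enumerate(ranking):
            rank_sum[feat] += pos
            counts[feat] += 1
    mean_rank = {f: rank_sum[f] / counts[f] for f in rank_sum}
    return sorted(mean_rank, key=mean_rank.get)[:top_n]
```

Aggregating across sources tempers the over-training risk of selecting features from any single dataset, which is the motivation the abstract gives for an ensemble approach.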

DySE – Dynamic System Explanation Framework, Application in Cancer

Presenting Author: Natasa Miskov-Zivanov, University of Pittsburgh

Biomedical research results are being published at a high rate, and with existing search engines, the vast amount of published work is usually easily accessible. To accurately reuse this voluminous knowledge, which is fragmented and sometimes inconsistent, one can extract and assemble published information into models. However, the creation of models often relies on intense human effort: model developers have to read hundreds of published papers and conduct discussions with domain experts. This laborious process results in slow development of models, as it includes steps such as model validation with experimental data and model extension with newly available information. To automate the process of explaining biological observations and answering biological questions, we have built the Dynamic System Explanation (DySE) framework. DySE automatically assembles models, extends them with interactions extracted by automated reading engines, analyzes the models under various conditions and scenarios, and iteratively improves assembled models. The framework includes techniques such as stochastic simulation, statistical model checking, static and dynamic sensitivity analysis, and hypothesis generation. With the automated process of reading, assembly, and reasoning, our framework allows for rapidly designing and conducting thousands of in silico experiments, and thus can speed up discoveries from the order of decades to the order of hours or days. We have applied DySE to studying the cancer microenvironment, as well as to explaining the mechanisms of several melanoma drugs. The techniques and the modeling approach incorporated within the framework are not disease specific, and therefore DySE can be used for explaining systems in many other domains.

A systems biology approach using cell-to-cell heterogeneity to map cell fate decisions in developmental processes

Presenting Author: Jens Preussner, Max Planck Institute for Heart and Lung Research

Guangshuai Jia, Max Planck Institute for Heart and Lung Research
Stefan Guenther, Max Planck Institute for Heart and Lung Research
Xuejun Yuan, Max Planck Institute for Heart and Lung Research
Michael Yekelchyk, Max Planck Institute for Heart and Lung Research
Carsten Kuenne, Max Planck Institute for Heart and Lung Research
Mario Looso, Max Planck Institute for Heart and Lung Research
Yonggang Zhou, Max Planck Institute for Heart and Lung Research
Thomas Braun, Max Planck Institute for Heart and Lung Research

Key challenges in understanding developmental processes include the characterization of cell fate transitions, specification, and the hierarchical lineage descendants of progenitor cells. Differentiating cell populations typically manifest profound cell-to-cell heterogeneity, driven by gene expression changes and gradual remodeling of the underlying gene regulatory networks (GRNs). RNA sequencing at the single-cell level allows the comprehensive characterization and analysis of cell population heterogeneity, especially if large ensembles of identical systems are profiled. Here we present computational approaches that harness statistical properties of heterogeneity for the in silico reconstruction of lineage trajectories and cell fate decision mapping in ~1,500 differentiating cardiac progenitor cells. We define suitable measures of cell-to-cell heterogeneity and employ self-organizing maps and density-based hierarchical clustering to identify heterogeneous sub-populations. Reconstruction of the developmental trajectory revealed a bifurcation event at which cells segregated into their terminal fates. Next, we show how a correlation-based analysis of cells in a transiently unstable state can be used to identify key regulatory elements in the underlying gene regulatory network that drive differentiation. We additionally show that these determinants progressively overcome epigenetic barriers to achieve open chromatin states associated with elevated expression in differentiated cell populations. Our approach not only comprehensively exploits heterogeneity and emerging correlations among large numbers of cells and genes to study early cardiogenesis at single-cell resolution, but also establishes a general systems biology framework of transcriptional and epigenetic regulation in cell fate decisions.

Enabling Deep Learning on Structural Biological Data

Presenting Author: Tom Corcoran, Lawrence Berkeley National Laboratory

Rafael Zamora, Lawrence Berkeley National Laboratory

Convolutional Neural Network (CNN)-based machine learning has made notable breakthroughs in feature extraction tasks, but its applications in protein research are constrained by limited training data availability, as well as by the disparity between the 2D-oriented functionality of mainstream CNN technology and the inherently 3D nature of protein structures. We present a mapping algorithm that converts 3D structures to 2D data grids by first traversing the 3D space with a space-filling curve, encoding the 3D structural information into a 1D vector, and then projecting that vector into 2D via a reverse process with a complementary curve. For comparison against state-of-the-art CNN-based classification methods, we explore the performance of 2D CNNs trained on data encoded with our method versus comparable volumetric CNNs operating upon raw 3D data. Our results indicate that our mapping process preserves sufficient locality information across the transformation to be useful for training 2D CNNs on classification tasks between proteins and other structural models. We show that 2D CNNs trained on data generated using our method perform equivalently to state-of-the-art volumetric methods in terms of accuracy and generalizability while offering the potential for decreased training time compared to their 3D counterparts. We discuss several experiments that show the effectiveness of our approach, including classifying between everyday 3D object models from the ModelNet10 benchmarking dataset and between protein models sourced from the Protein Data Bank, such as KRas and HRas. An implementation of our encoding process and neural network architectures is available for download on GitHub.
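The abstract does not name the specific curves used; as an illustration of the encoding idea under that assumption, the sketch below serializes a 4x4x4 grid to a 1D vector along a 3D Z-order (Morton) curve and lays the vector back out on an 8x8 grid along a complementary 2D Z-order curve. Because both curves are bijections, no voxel information is lost.

```python
def interleave(coords, bits):
    """Morton (Z-order) index: interleave the bits of each coordinate."""
    idx = 0
    for b in range(bits):
        for d, c in enumerate(coords):
            idx |= ((c >> b) & 1) << (b * len(coords) + d)
    return idx

def deinterleave(idx, dims, bits):
    """Inverse of interleave: recover the coordinate tuple from the index."""
    coords = [0] * dims
    for b in range(bits):
        for d in range(dims):
            coords[d] |= ((idx >> (b * dims + d)) & 1) << b
    return tuple(coords)

def map_3d_to_2d(grid3d):
    """Serialize a 4x4x4 grid to a 64-vector along a 3D Z-order curve,
    then lay that vector out on an 8x8 grid along a 2D Z-order curve."""
    flat = [0] * 64
    for x in range(4):
        for y in range(4):
            for z in range(4):
                flat[interleave((x, y, z), 2)] = grid3d[x][y][z]
    grid2d = [[0] * 8 for _ in range(8)]
    for i, v in enumerate(flat):
        u, w = deinterleave(i, 2, 3)
        grid2d[u][w] = v
    return grid2d

# Voxels with distinct labels: the mapping is a bijection, so every voxel
# appears exactly once in the 2D image.
g3 = [[[16 * x + 4 * y + z for z in range(4)] for y in range(4)] for x in range(4)]
g2 = map_3d_to_2d(g3)
print(sorted(v for row in g2 for v in row) == list(range(64)))
```

Space-filling curves are chosen for exactly the locality property the abstract describes: points close along the curve are close in the original space, so local 2D convolutions still see mostly-local 3D neighborhoods.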

Unsupervised discovery of phenotype specific heterogeneous gene networks

Presenting Author: Jenny Shi, University of Colorado AMC

Pamela Russell, University of Colorado AMC
Pratyaydipta Rudra, University of Colorado AMC
Brian Vestal, National Jewish Health
Brian Hobbs, Brigham and Women's Hospital
Craig Hersh, Brigham and Women's Hospital
Laura Saba, University of Colorado AMC
Katerina Kechris, University of Colorado AMC

Complex diseases often have a wide spectrum of symptoms. A better understanding of the biological mechanism behind each symptom (i.e., phenotype) promotes targeted and effective treatment plans. We propose to utilize a machine learning technique, sparse canonical correlation analysis (SCCA), to integrate messenger RNA (mRNA) and microRNA (miRNA) expression data, taking a phenotype of interest into account. With the canonical weights, mRNA-miRNA (i.e., heterogeneous) subnetworks that are specific to the phenotype can be constructed. Unlike traditional pairwise target prediction, the SCCA approach allows identification of associations that can be missed based on marginal correlations. We applied the method to a recombinant inbred mouse panel with endophenotypes that are associated with alcohol use disorders in humans, and constructed heterogeneous subnetworks that are specific to drinking behavior. The leading subnetwork identified included three mRNAs and four miRNAs, including two miRNAs from the same family. Most of the SCCA-detected network connections were not predicted using pairwise methods, yielding novel associations. The strong associations discovered will be validated through biological experiments, such as knock-outs. We also applied the method to a chronic obstructive pulmonary disease pilot study. The preliminary results revealed three subnetworks, which contain candidate features for more focused studies. The proposed SCCA method is not limited to expression data: it can easily be generalized to other data types, such as copy number variation, and it can be applied to integrating more than three data types. The versatility of the approach will be useful in many other applications.
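A minimal sketch of the sparse CCA step on synthetic data, in the spirit of penalized matrix decomposition (Witten et al.) rather than the authors' exact algorithm: alternating soft-thresholded power iterations on the cross-correlation matrix yield sparse canonical weight vectors, and their nonzero entries define the heterogeneous subnetwork members. The relative threshold and all data below are illustrative assumptions.

```python
import numpy as np

def soft(w, frac):
    """Soft-threshold at a fraction of the largest loading (keeps the
    support nonempty; a simplification of a tuned l1 penalty)."""
    t = frac * np.abs(w).max()
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)

def scca(X, Y, frac=0.5, n_iter=200):
    """Sparse CCA by alternating soft-thresholded power iterations."""
    X = (X - X.mean(0)) / X.std(0)
    Y = (Y - Y.mean(0)) / Y.std(0)
    C = X.T @ Y / len(X)                    # cross-correlation matrix
    v = np.ones(Y.shape[1]) / np.sqrt(Y.shape[1])
    for _ in range(n_iter):
        u = soft(C @ v, frac)
        u /= np.linalg.norm(u)
        v = soft(C.T @ u, frac)
        v /= np.linalg.norm(v)
    return u, v

rng = np.random.default_rng(1)
z = rng.normal(size=60)                                 # shared latent signal
X = rng.normal(size=(60, 30)); X[:, :3] += z[:, None]   # 3 driver "mRNAs"
Y = rng.normal(size=(60, 20)); Y[:, :2] += z[:, None]   # 2 driver "miRNAs"
u, v = scca(X, Y)
# Nonzero canonical weights define the phenotype-specific subnetwork members.
print(np.nonzero(u)[0], np.nonzero(v)[0])
```

Because the weights are driven by the joint cross-correlation structure, features with weak marginal mRNA-miRNA correlations can still enter the subnetwork — the property the abstract contrasts with pairwise target prediction.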

Dissimilarity Matrix Based Clustering for Phylogenetic Analysis

Presenting Author: Tara Eicher, The Ohio State University

Piyali Das, The Ohio State University
Juan Barajas, The Ohio State University
Ewy Mathe, The Ohio State University
Andrew Patt, The Ohio State University
Kevin Coombes, The Ohio State University


Clustering is a useful tool for evaluating the relative evolutionary relationships among organisms in a population. It forms the basis of assessing genetic similarity between individuals for phylogenetic analyses. Toward this end, software has been developed that uses Bayesian inference algorithms to obtain a set of clusters that represents the population distribution with respect to genetic similarity. However, these algorithms are computationally intensive, and the software typically requires domain-specific knowledge before it can be successfully used. The process also involves running the analysis for several iterations to derive the desired number of clusters. Because this can be cumbersome for researchers, it is advantageous to provide a comprehensive method that allows for quick and simple cluster generation.

In this work, we show that standard clustering methods can be applied to distance matrices of sequences between individuals across multiple loci, obtaining results similar to those of Bayesian algorithms and traditional phylogenetic classification. Our analyses included subspecies of Giraffa, species of Ursus, and orders of Aves. We first used the Damerau-Levenshtein string distance metric to compute distances between the genetic sequences of individuals at specified loci, then used the Manhattan distance to compute the total distance between individuals across all loci of interest and obtain our final distance matrix. By applying conventional clustering algorithms to these distance matrices, we obtained results that resembled those of previous studies in all three data sets.

Automated Biomedical Text Classification with Research Domain Criteria

Presenting Author: Mohammad Anani, Montana State University

Indika Kahanda, Montana State University

Research Domain Criteria (RDoC) is a recently introduced framework for accurate diagnosis of mental illness. This framework contains five domains of interest, where each domain contains a number of constructs that define a specific behavior. Developing a method to automate the process of labeling biomedical articles with RDoC constructs would be highly useful for advancing research efforts in the area of mental illness. Therefore, this study focuses on exploring the feasibility of developing a tool for this purpose. Using a gold-standard dataset of about 40,000 Medline abstracts tagged with 26 RDoC constructs, we model this as both a binary and a multilabel classification problem to perform document classification using several supervised learning algorithms. We use a simple Bag-of-Words model and apply standard preprocessing steps such as stemming and stop-word removal. According to AUROC (Area Under the Receiver Operating Characteristic curve) values obtained through 5-fold cross-validation, we observe that, overall, Artificial Neural Networks and Support Vector Machines perform best on the multilabel problem, providing 96% average AUROC across all the constructs. Interestingly, all the binary classifiers showed the same level of performance. However, the cohort of binary classifiers took significantly longer to train than their multilabel counterparts, showing the utility of modeling this as a multilabel problem. We also note that articles labeled with more specific constructs were predicted better than the rest. To the best of our knowledge, this is the first study on automated prediction of RDoC constructs for the biomedical literature.
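The multilabel Bag-of-Words setup can be sketched with scikit-learn on toy data (the documents and the invented RDoC-like construct names below are illustrative; the real study uses ~40,000 Medline abstracts, 26 constructs, and additional preprocessing such as stemming):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

# Hypothetical toy abstracts tagged with invented RDoC-like constructs.
docs = ["fear conditioning amygdala response",
        "reward prediction dopamine signal",
        "fear extinction threat response",
        "reward learning dopamine striatum",
        "threat response fear circuit",
        "dopamine reward anticipation"]
labels = [{"acute_threat"}, {"reward"}, {"acute_threat"},
          {"reward"}, {"acute_threat"}, {"reward"}]

vec = CountVectorizer()                      # simple Bag-of-Words features
mlb = MultiLabelBinarizer()                  # one indicator column per construct
X, Y = vec.fit_transform(docs), mlb.fit_transform(labels)
clf = OneVsRestClassifier(LinearSVC()).fit(X, Y)   # one linear SVM per construct
pred = clf.predict(vec.transform(["amygdala fear threat"]))
print(dict(zip(mlb.classes_, pred[0])))
```

`OneVsRestClassifier` trains all per-construct classifiers from one label matrix, which is the efficiency advantage the abstract reports over training a separate binary pipeline per construct.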

Unsupervised methods for lexical ambiguity in concept normalization

Presenting Author: Negacy Hailu, University of Colorado

Mike Bada, University of Colorado
William A. Baumgartner Jr, University of Colorado
Kevin Bretonnel Cohen, University of Colorado
Lawrence E. Hunter, University of Colorado

Previous studies have argued that ConceptMapper, a dictionary-based look-up concept normalization tool, is the best performing system. A key limitation of dictionary-based look-up systems is that they do not correctly map lexically ambiguous terms. In this work, we investigate various methods that tackle lexical ambiguity in the task of concept normalization. We specifically apply part-of-speech (POS) tag information, Word Sense Disambiguation (WSD), and word embeddings to remove false positive annotations made by ConceptMapper. For evaluation, we used the CRAFT corpus, which is manually annotated with eight biomedical ontologies: the Cell Ontology (CL), the Gene Ontology (biological process (GOBP), cellular component (GOCC), and molecular function (GOMF)), the Sequence Ontology (SO), Chemical Entities of Biological Interest (CHEBI), the NCBI Taxonomy, and the Protein Ontology (PR).

Word embeddings improved F1 performance for GOBP, SO, CHEBI, and PR, and POS information improved F1 performance for SO, CHEBI, and the NCBI Taxonomy; WSD, however, did not yield any improvement in F1. All three methods improved precision for all the ontologies. Our evaluations use five-fold cross-validation, and all of our improvements are statistically significant.

We propose three unsupervised methods to address lexical ambiguity in concept normalization. The performance of the proposed methods varies depending on the ontology, and the approach can be tuned depending on the need (high precision versus high F1). In general, word embeddings achieved the best results.

A Comparison of Viral Identification Techniques in Longitudinal Metagenomic Datasets

Presenting Author: Cody Glickman, University of Colorado Anschutz Medical Campus

Viruses that influence bacterial community compositions, also known as bacteriophages, typically have small genomes that lack a universally conserved genetic marker. As a result, bacteriophages make up a small proportion of reads in traditional DNA shotgun metagenomic experiments. Biological methods to enrich for virus-like particles (VLPs) suffer from the biases of filtering for free-floating elements, thus limiting the recovery of endogenous viral elements. The importance of endogenous viral elements in the life cycles and adaptation of pathogenic bacteria is well known (Schuch et al. 2009; van et al. 2017). To capture novel endogenous viral elements in DNA shotgun metagenomic data, I propose a novel methodology to extract and pool viral reads across longitudinal patient samples. I propose that this methodology will produce longer viral read assemblies and more accurate taxonomic assignments than assembly of individual samples or full cross-sample assembly.

To test the performance of viral read assembly across time, I performed a simulation study using bacteria with endogenous viral elements. The study tests the lengths of viral read assemblies and the accuracy of the methodologies in recapitulating the elements in the synthetic dataset. In addition, I performed a synthetic spike-in on a real longitudinal metagenomic dataset with rare viral species to measure the sensitivity of the methodologies within noisy data. These simulations serve as a benchmark for mining viral elements from longitudinal datasets in publicly available databases. Mining viral reads from metagenomic experiments allows researchers to study endogenous viral elements not typically found in viral enrichment studies.

A Pilot Systematic Genomic Comparison of Recurrence Risks of Hepatitis B Virus-associated Hepatocellular Carcinoma with low and high degree of liver fibrosis

Presenting Author: Seungyeul Yoo, Icahn School of Medicine at Mount Sinai

Wenhui Wang, Icahn School of Medicine at Mount Sinai
Qin Wang, Icahn School of Medicine at Mount Sinai
Maria Fiel, Icahn School of Medicine at Mount Sinai
Eunjee Lee, Icahn School of Medicine at Mount Sinai
Spiros Hiotis, Icahn School of Medicine at Mount Sinai
Jun Zhu, Icahn School of Medicine at Mount Sinai

Chronic Hepatitis B virus (HBV) infection leads to liver fibrosis, a major risk factor for hepatocellular carcinoma (HCC). The HBV genome can be inserted into the human genome, and chronic inflammation may trigger somatic mutations. However, how HBV integration and other genomic changes contribute to the risk of tumor recurrence with regard to different degrees of liver fibrosis is not clearly understood. We performed comprehensive genomic analyses of our RNA-seq data from 21 HBV-HCC patients treated at Mount Sinai Medical Center, together with publicly available HBV-HCC sequencing data. Using a robust pipeline we developed, consistently more HBV integrations were identified in non-neoplastic liver than in tumor tissues. HBV host genes identified in non-neoplastic liver tissues significantly overlapped with known tumor suppressor genes. More significant enrichment of tumor suppressor genes was observed among HBV host genes identified in patients with tumor recurrence, indicating a potential risk of tumor recurrence driven by HBV integration in non-neoplastic liver tissues. Pathogenic SNP loads in non-neoplastic liver were consistently higher than those in normal liver tissues, and HBV host genes identified in non-neoplastic liver tissues significantly overlapped with pathogenic somatic mutations, suggesting that HBV integration and somatic mutations target the same set of genes important to tumorigenesis. HBV integrations and pathogenic mutations showed distinct patterns between low and high liver fibrosis patients with regard to tumor recurrence. The results suggest that HBV integrations and pathogenic SNPs in non-neoplastic tissues are important for tumorigenesis and that different recurrence risk models are needed for patients with low and high liver fibrosis.

Identifying and Annotating Uninvestigated Preeclampsia-Related Genes Using Linked Open Biomedical Data

Presenting Author: Tiffany Callahan, University of Colorado Denver Anschutz Medical Campus

Adrianne Stefanski, University of Colorado Denver Anschutz Medical Campus
William Baumgartner Jr., University of Colorado Denver Anschutz Medical Campus
Jin-Dong Kim, Database Center for Life Science
Toyofumi Fujiwara, Database Center for Life Science
Ann Cirincione, University of Maryland
Maricel Kann, University of Maryland
Lawrence Hunter, University of Colorado Denver Anschutz Medical Campus

Preeclampsia is a leading cause of maternal and fetal morbidity and mortality. Currently, there is no cure for preeclampsia except delivery of the placenta, which is central to preeclampsia pathogenesis. Transcriptional profiling of human placenta from pregnancies complicated by preeclampsia and from controls has been extensively performed to identify differentially expressed genes (DEGs). DEGs are identified using unbiased assays; however, the decisions to investigate DEGs experimentally are biased by many factors (e.g., investigator interests, available reagents, knowledge of gene function), causing many DEGs to remain uninvestigated. To address this shortcoming, we utilized existing linked open biomedical resources and publicly available high-throughput transcriptional profiling data to identify and annotate the function of currently uninvestigated preeclampsia-associated DEGs. Using the keyword “preeclampsia”, we identified and reviewed 68 publicly available human gene expression experiments deposited in the Gene Expression Omnibus. Meta-analysis of the 13 experiments meeting our inclusion criteria generated a list of 273 DEGs. We annotated these genes using a knowledge graph constructed with Semantic Web technologies that contained several Open Biomedical Ontologies and publicly available datasets. The relative complement of the annotation-derived and meta-analysis-derived genes was identified as the set of uninvestigated preeclampsia-associated genes. Experimentally investigated DEGs were then identified from the published literature based on semantic and syntactic annotations of PubMed abstracts by PubTator and PubAnnotation. Finally, novel biological relationships between experimentally investigated and uninvestigated preeclampsia-associated genes were identified by learning neuro-symbolic logic embeddings to predict missing edges in the knowledge graph. Detailed documentation and source code can be found on GitHub (https://github.com/callahantiff/ignorenet/wiki).

Novel Approaches for Accuracy Assessment of 3D Chromatin Configuration Reconstructions

Presenting Author: Mark Segal, University of California San Francisco

Assessing the accuracy of 3D chromatin configurations inferred from Hi-C data is crucial. Such evaluation is challenging given the absence of gold standards and the many competing algorithms. Using recent multiplexed FISH and genome architecture mapping assays, we devise novel approaches to accuracy assessment, with rigorous attendant inferential methods.

Codon Aversion: An alignment-free method to recover phylogenies

Presenting Author: Justin Miller, Brigham Young University

Perry Ridge, Brigham Young University
Michael Whiting, Brigham Young University
Lauren McKinnon, Brigham Young University

Codon bias refers to the non-random usage of synonymous codons and differs between organisms, between genes, and even within a gene. We previously identified a strong phylogenetic signal based on codon usage preferences in 72 tetrapod species, focusing on stop codon usage preferences. Here we report the expansion of our previous work to >20,000 species across all kingdoms of life, and the development of tools to streamline phylogenetic inference based on codon usage preferences, here specifically codon non-usage (or codon aversion). For each organism, we constructed a set of tuples, where each tuple contains the list of unused codons for a given gene. We define the pairwise distance between two species, A and B, as the ratio of total possible overlaps to direct overlaps, where total possible overlaps is the number of tuples in the smaller of the two sets and direct overlaps is the size of the intersection of the two sets of tuples. This approach allows us to calculate pairwise distances even though there are substantial differences in the number of genes for each species. Finally, we use neighbor-joining to recover phylogenies. Using the Open Tree of Life and the NCBI Taxonomy Database as expected phylogenies, our approach compares well, recovering phylogenies that largely match the expected trees.
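The distance as defined can be sketched directly (the mini-genomes below are hypothetical): each gene contributes one tuple of its unused codons, a species is the set of those tuples, and the distance is the smaller set size divided by the intersection size, so identical aversion sets give the minimum distance of 1 and larger values mean greater dissimilarity.

```python
def aversion_sets(genes):
    """One tuple of unused (averted) codons per gene; a species is the set
    of these tuples over all of its genes."""
    all_codons = {a + b + c for a in "ACGT" for b in "ACGT" for c in "ACGT"}
    out = set()
    for seq in genes:
        used = {seq[i:i + 3] for i in range(0, len(seq) - 2, 3)}
        out.add(tuple(sorted(all_codons - used)))
    return out

def aversion_distance(A, B):
    """Distance as described: total possible overlaps (size of the smaller
    set) divided by direct overlaps (size of the intersection)."""
    direct = len(A & B)
    return float("inf") if direct == 0 else min(len(A), len(B)) / direct

# Hypothetical mini-genomes: sp_a and sp_b share one gene's aversion tuple.
sp_a = aversion_sets(["ATGAAATTT", "ATGCCC"])
sp_b = aversion_sets(["ATGAAATTT", "ATGGGG"])
d_ab = aversion_distance(sp_a, sp_b)
print(aversion_distance(sp_a, sp_a), d_ab)
```

Because the distance depends only on the two sets' sizes and their intersection, species with very different gene counts remain directly comparable, which is the property the abstract highlights.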

Isoform-Level Differential Network Analysis: A Proffered Alternative for Identifying Cancer Markers

Presenting Author: Manoj Kandpal, Northwestern University

Ramana Davuluri, Northwestern University

Given that the majority of multi-exon genes generate diverse functional products, it is important to study expression at the isoform level. Furthermore, the interaction networks may have distinct connections among participating entities. In the proposed two-level analysis, our first objective is to assess the capability of differential network analysis (DifNA) between normal and diseased states to detect important markers for cancer. We then, through a comparative study, propose isoform-based differential networks as a better alternative to their gene-level counterparts for identifying cancer biomarkers. To test the proposed methodology, two comparative studies were performed using data from the Gene Expression Omnibus (GEO) and The Cancer Genome Atlas (TCGA). Raw data were collected from the publicly available GEO repository in the form of 160 Affymetrix exon-array mammalian cell line samples, comprising 87 normal and 73 tumor samples of varied origin. We used a Partial Least Squares-based differential network analysis algorithm on the top variable genes/isoforms. Differential network analysis provides each gene/isoform in the network with a connectivity score. Permutation tests based on these scores determined whether the connectivity of a gene differs between the two networks. Top differentially connected genes/isoforms with p-values less than 0.05 were selected, and Ingenuity Pathway Analysis (IPA) was performed to determine the involvement of the selected genes/isoforms in various metabolic pathways and diseases. The analysis was repeated for 105 GBM samples from TCGA. DifNA performed well in finding marker genes and isoforms, and isoform-level analysis showed better performance than gene-level analysis.

Non-B DNA Predictor: A Web Application for Identifying Potential Intramolecular Non-B DNAs

Presenting Author: Hannah Ajoge, Western University

Hinissan Kohio, Western University
Henry He, Western University
Stephen Barr, Western University

Genome biology is not limited to the confines of the canonical right-handed double-helix DNA (a.k.a. B DNA), but includes other secondary structures that are collectively termed non-B DNA (NBD) structures. NBDs play an important role in biological processes such as DNA replication, telomere maintenance, gene expression, viral replication, immune evasion, neurologic disorders and cancers.

We developed a tool called ‘Non-B DNA Predictor’ (NBDP) to enable researchers with little computational skill to identify potential intramolecular NBDs from nucleic acid sequences. NBDP is an interactive web application developed in R with Shiny, with the back-end based on the R package GQUAD, which is the only standalone tool available for identifying multiple NBDs. NBDP implements knowledge-based algorithms that are highly sensitive and accurate, and it outperforms all other tools on human genome-wide data containing known NBDs. It accurately identifies millions of NBD motifs in the human genome at a sensitivity level of >96%.

NBDP is presently the only NBD prediction application capable of: (1) identifying multiple NBDs, (2) accepting single or multiple sequence inputs, (3) accepting data in three possible forms (raw sequences, FASTA, and accession numbers), (4) handling sequences irrespective of source (e.g., species), and (5) being available both online and offline. Motifs that are identified (with or without overlaps) are A-phased repeats, G-quadruplexes, H-DNA, slipped structures, short tandem repeats, triplex-forming oligonucleotides, and Z-DNA. NBDP has been used to show the difference in NBD density between the human female and male genomes and that NBDs are highly enriched in oncogenes. NBDP is hosted at http://ext0574.mni.fmd.uwo.ca/Non-BDNAP/.
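NBDP's actual detection lives in the GQUAD R package; purely as a language-agnostic illustration of motif-based NBD identification (not the NBDP implementation), one of the listed motif classes — the G-quadruplex — can be scanned with the commonly used consensus of four runs of at least three guanines separated by loops of 1-7 bases:

```python
import re

# Widely used G-quadruplex consensus: four G-runs (>=3 Gs) with 1-7 base
# loops. Illustrative only; NBDP/GQUAD uses its own knowledge-based rules.
G4 = re.compile(r"G{3,}[ACGT]{1,7}G{3,}[ACGT]{1,7}G{3,}[ACGT]{1,7}G{3,}")

def find_g4(seq):
    """Return (start, motif) pairs for putative G-quadruplexes in seq."""
    return [(m.start(), m.group()) for m in G4.finditer(seq.upper())]

# Human telomeric repeat, a classic G-quadruplex-forming sequence.
hits = find_g4("ttAGGGTTAGGGTTAGGGTTAGGGcc")
print(hits)
```

Each non-B class (Z-DNA, H-DNA, slipped structures, etc.) has its own sequence rules, which is why a multi-motif tool such as NBDP bundles several detectors behind one interface.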

IMPACT Web Portal: An Oncology Database Integrating Molecular Profiles with Actionable Therapeutics from Next Generation Sequencing Data

Presenting Author: Jennifer Hintzsche, University of Colorado Anschutz Medical Campus

Minjae Yoo, University of Colorado
Jihye Kim, University of Colorado
Carol Amato, University of Colorado
William Robinson, University of Colorado
Aik Choon Tan, University of Colorado

Next-generation sequencing (NGS) technology allows researchers to identify important variants and structural changes in DNA and RNA in cancer patient samples. Using this information, we can now correlate specific variants and/or structural changes with known inhibitory actionable therapeutics. We introduce the IMPACT Web Portal, a new online resource that links molecular profiles from NGS of tumors to approved drugs, investigational therapeutics, and pharmacogenetics-associated drugs. The IMPACT Web Portal contains a total of 776 drugs connected to 1,326 target genes and 435 target variants, fusions, and copy number alterations. The online IMPACT Web Portal allows users to search for genetic alterations and connects them to three levels of actionable therapeutics. Level 1 contains approved drugs, separated into two groups: Level 1A contains approved drugs with variant-specific information, while Level 1B contains approved drugs with gene-level information. Level 2 contains drugs currently in oncology clinical trials. Level 3 provides pharmacogenetic associations between approved drugs and genes. The results for each level are ranked by a p-value calculated from a hypergeometric test of all overlapping gene targets, which allows users to understand the specificity of each actionable therapeutic. Each drug also links to a web page containing external information and additional gene targets, allowing further investigation of each actionable therapeutic. This resource is a valuable database for personalized medicine and drug repurposing in translational oncology studies. The IMPACT Web Portal is freely available for non-commercial use at http://tanlab.ucdenver.edu/IMPACT/.
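The hypergeometric ranking step can be sketched with SciPy (the gene names, genome size, and counts below are hypothetical, not IMPACT's data): the p-value is the upper-tail probability of observing at least the given overlap between a drug's target genes and a patient's altered genes by chance.

```python
from scipy.stats import hypergeom

def drug_pvalue(n_genome, drug_targets, altered_genes):
    """Hypergeometric tail: P(overlap >= observed) between a drug's targets
    and the patient's altered genes, drawn from n_genome genes."""
    k = len(drug_targets & altered_genes)
    return hypergeom.sf(k - 1, n_genome, len(drug_targets), len(altered_genes))

# Hypothetical example: 20,000 genes, a 5-target drug, 100 altered genes,
# 2 of which are drug targets.
targets = {"BRAF", "KRAS", "EGFR", "NRAS", "MAP2K1"}
altered = {"BRAF", "KRAS"} | {f"G{i}" for i in range(98)}
p = drug_pvalue(20000, targets, altered)
print(p)
```

A small p-value flags drugs whose target overlap is unlikely by chance, which is how ranking by this statistic conveys the specificity of each candidate therapeutic.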

Exploring Frequented Regions in Pan-Genomic Graphs

Presenting Author: Brendan Mumey, Montana State University

Alan Cleary, Montana State University
Thiruvarangan Ramaraj, National Center for Genome Resources
Indika Kahanda, Montana State University
Joann Mudge, National Center for Genome Resources
Shubhang Kulkarni, Purdue University

We consider the problem of identifying regions within a pan-genome de Bruijn graph that are traversed by many sequence paths. Such regions, and the subpaths that traverse them, are denoted frequented regions (FRs). We have recently developed efficient algorithms for finding FRs. FRs have proved useful in pan-genomic applications such as providing distinguishing features for classifiers and visualizing the pan-genomic graph. In initial work, we corroborated the biological relevance of FRs by identifying introgressions in yeast that aid alcohol tolerance. This talk will concentrate on (1) recent results on scaling the approach to larger pan-genomic data sets such as the plant Medicago truncatula, with a 500 Mb genome, and (2) developing visualization approaches that provide a simplified representation of pan-genomic space.

TOBIAS - Transcription factor Occupancy prediction By Investigation of ATAC-seq Signals

Presenting Author: Mario Looso, Max Planck Institute for Heart and Lung Research

Mette Bentsen, Max Planck Institute for Heart and Lung Research
Annika Fust, Max Planck Institute for Heart and Lung Research
Jens Preussner, Max Planck Institute for Heart and Lung Research
Carsten Kuenne, Max Planck Institute for Heart and Lung Research

Chromatin accessibility and transcription factor (TF) binding are important aspects of gene regulation. Sequencing data from high-throughput chromatin accessibility assays such as ATAC-seq make it possible to identify open chromatin regions as well as the binding locations of a wide range of TFs within a single experiment. Because the enzymes used by these assays act only on accessible DNA, DNA-bound TFs generate a drop in coverage over the occupied region at single-base resolution, referred to as a footprint. However, an advanced method for footprint detection and quantification in ATAC-seq data is critically missing.
Here we introduce our open-source bioinformatics tool TOBIAS. It supports normalization of enzyme cutting bias and applies a density-based clustering algorithm and a footprint model to ATAC-seq data. In a first step, TOBIAS identifies potential TF binding sites on a global, genome-wide scale. Next, it combines these in silico predicted footprints with PWM-scored motifs from hundreds of TFs. It provides individual binding locations, scores, and probabilities for each TF, as well as summarizing profile plots of bound and unbound TF binding sites. Ongoing algorithm development targets the quantification of footprints between biological conditions and downstream analysis of TF-specific gene sets.
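The footprint idea — a local coverage drop flanked by accessible signal — can be illustrated with a toy score (flank mean minus center mean) scanned along a corrected signal; this is an assumption-laden simplification for illustration, not the TOBIAS footprint model.

```python
import numpy as np

def footprint_score(coverage, center=10, flank=10):
    """Toy footprint score: mean signal in the flanks minus the mean in a
    central window; high where the signal drops locally (a protected site)."""
    scores = np.full(len(coverage), -np.inf)
    half = center // 2
    for i in range(flank + half, len(coverage) - flank - half):
        c = coverage[i - half:i + half].mean()
        f = np.r_[coverage[i - half - flank:i - half],
                  coverage[i + half:i + half + flank]].mean()
        scores[i] = f - c
    return scores

sig = np.ones(100) * 5.0      # open, accessible region with cut-site signal
sig[40:50] = 0.5              # protected (TF-bound) stretch: the footprint
pos = int(np.argmax(footprint_score(sig)))
print(pos)
```

In a full pipeline such scores are then intersected with PWM-scored motif hits so each candidate site gets both a sequence score and an occupancy score, mirroring the bound/unbound classification described above.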

Integrating Protein Localization Information in Signaling Pathway Reconstructions

Presenting Author: Anna Ritz, Reed College

Ibrahim Youssef, Reed College

Understanding the components of intracellular signaling pathways is an important task in systems biology. Computational methods have been developed to automatically reconstruct signaling pathways from large networks of protein-protein interactions (interactomes). These pathway reconstruction methods can accelerate the time-consuming manual curation of pathway databases and generate hypotheses that aid in the discovery of novel signaling components. PathLinker is a pathway reconstruction method that computes many short paths within an interactome that connect known receptors to known transcriptional regulators specific to a particular pathway (e.g., Wnt or EGFR). PathLinker has previously been shown to outperform other state-of-the-art methods (e.g., Steiner trees, network flow, and random walk approaches), and RNAi experiments have confirmed a PathLinker prediction of CFTR's involvement in canonical Wnt signaling. Despite the improved performance over other methods, PathLinker reconstructions still contain erroneous signaling interactions. Here, we utilize information about protein localization within the cell to improve pathway reconstructions. By adding information about cellular compartments, we preserve the spatial hierarchy of signaling flow by explicitly finding paths that begin at the membrane, terminate at the nucleus, and pass through intermediate compartments such as the cytosol. The new pathway reconstructions contain fewer false positives than the original PathLinker reconstructions, based on a benchmark dataset of pathways associated with cancer and immune signaling from the NetPath database. This work illustrates the utility of using additional biological information about protein interactions to improve predictions about signaling pathway interactions.
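The localization constraint can be sketched on a toy interactome (protein names and compartment assignments below are hypothetical, and the exhaustive DFS stands in for PathLinker's k-shortest-paths machinery): candidate receptor-to-regulator paths are kept only if their compartment sequence never moves backwards along membrane → cytosol → nucleus.

```python
# Toy interactome with per-protein compartments (hypothetical names).
ORDER = {"membrane": 0, "cytosol": 1, "nucleus": 2}
compartment = {"RCP": "membrane", "A": "cytosol", "B": "cytosol",
               "X": "nucleus", "TF": "nucleus"}
edges = {"RCP": ["A", "X"], "A": ["B", "TF"], "B": ["TF"], "X": ["A"], "TF": []}

def spatial_paths(src, dst, path=None):
    """Enumerate src->dst paths whose compartments never move backwards
    (membrane -> cytosol -> nucleus), mimicking localization-aware
    filtering on top of a PathLinker-style path search."""
    path = path or [src]
    if src == dst:
        yield list(path)
        return
    for nxt in edges[src]:
        if nxt not in path and ORDER[compartment[nxt]] >= ORDER[compartment[src]]:
            yield from spatial_paths(nxt, dst, path + [nxt])

paths = sorted(spatial_paths("RCP", "TF"))
print(paths)
```

Here the path through the nuclear protein X back out to the cytosolic A is rejected as spatially incoherent, which is exactly the class of false-positive path the compartment constraint removes.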

Reducing the footprint of mass spectrometry data

Presenting Author: Idoia Ochoa, University of Illinois at Urbana-Champaign

Ruochen Yang, Tsinghua University

High-resolution mass spectrometry (MS) is a powerful technique used to identify and quantify molecules in simple and complex mixtures by separating molecular ions on the basis of their mass and charge. MS has become invaluable across a broad range of fields and applications, including biology, chemistry, and physics, but also clinical medicine and even space exploration. As a result, there has been rapid growth in the volume of MS data being generated. It is therefore crucial to design better compression algorithms to reduce the burden of data storage and transmission.

MS data consist mainly of mass-to-charge ratio (m/z)-intensity pairs, which can take several GB of space per experiment. We present a lossless compressor for these data that utilizes an adaptive compression scheme. Specifically, the proposed algorithm compresses the floating-point pairs efficiently by calculating the hexadecimal difference of consecutive m/z values and by searching for parts of each intensity value that match previous ones. The proposed algorithm delivers an average compression-ratio improvement of 40% over the lossless compressors GZIP, the current de facto compressor for MS data, and FPC, a state-of-the-art compressor for floating-point data. In particular, the proposed algorithm reduces MS file sizes by 56% on average.
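The intuition behind differencing consecutive m/z values can be sketched as follows. Because peaks are sorted by m/z, consecutive doubles share most of their high-order bytes, so a delta over their raw 64-bit representations produces mostly-zero words that entropy-code well. This sketch substitutes an XOR of the bit patterns for the paper's hexadecimal difference (an assumption, not the authors' exact encoding), and omits the intensity-matching stage entirely.

```python
import struct

def mz_deltas(mz_values):
    """Losslessly encode sorted m/z doubles as XORs of their raw
    64-bit representations; close consecutive values share leading
    bytes, so each delta has long runs of zero bits."""
    prev = 0
    deltas = []
    for mz in mz_values:
        bits = struct.unpack("<Q", struct.pack("<d", mz))[0]
        deltas.append(bits ^ prev)  # small when consecutive m/z are close
        prev = bits
    return deltas

def mz_decode(deltas):
    """Invert mz_deltas exactly: XOR is its own inverse, so the
    round trip is bit-for-bit lossless."""
    prev = 0
    values = []
    for d in deltas:
        prev ^= d
        values.append(struct.unpack("<d", struct.pack("<Q", prev))[0])
    return values

mz = [500.1001, 500.1003, 500.1008, 500.2001]
recovered = mz_decode(mz_deltas(mz))
```

A general-purpose entropy coder applied to the delta stream then exploits the zero runs; the domain-specific transform, not the back-end coder, is where the gain over GZIP and FPC comes from.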

Meeting today's challenging computational biology demands by integrating a focused IT team into the research process

Presenting Author: Dan Timmons, University of Colorado Boulder/BioFrontiers Institute

Matt Hynes-Grace, University of Colorado Boulder
Jon Demasi, University of Colorado Boulder
Robin Dowell, University of Colorado Boulder
Mary Allen, University of Colorado Boulder
Cassidy Thompson, University of Colorado Boulder

The BioFrontiers IT (BIT) team is taking steps toward meeting today’s rapidly changing HPC needs to support life science and health informatics research. This initiative includes team members immersing themselves in researchers’ work through one-on-one support and consultation, workshops introducing tools that aid reproducibility and data integrity, and the facilitation of hackathons that bring together researchers and programmers working on innovative, interdisciplinary projects. Through these personal interactions, BioFrontiers IT members gain practical knowledge from researchers that helps them engineer large-scale data- and project-management tools to support the community at large. To further this goal, and recognizing that computational research in academia benefits from collaboration with industry partners, BIT strives to create environments that foster mutually beneficial connections. These efforts have led to increased integration between BIT and supported researchers, allowing more complete involvement in experimental design and execution, and thus allowing BIT to step beyond a traditional troubleshooting role and provide more value through direct partnership.

Predicting Adverse Events Associated with Kinase Inhibitors from Clinical Drug Trials

Presenting Author: Ilyssa Summer, University of Colorado Anschutz Medical Campus

Minjae Yoo, University of Colorado Anschutz Medical Campus
Jimin Shin, University of Colorado Anschutz Medical Campus
Aik Choon Tan, University of Colorado Anschutz Medical Campus

Drug adverse events (AEs) are a major health threat to patients seeking medical treatment and a significant barrier in drug discovery and development. We performed a systematic analysis of kinase inhibitors and their associated adverse events by querying clinical trial results reported in ClinicalTrials.gov. We obtained 368 kinase inhibitors from the Drug Repurposing Hub and queried them against our Kinase Inhibition Experiments Omnibus (KIEO) database to obtain kinase inhibition profiles. In total, we collected kinase inhibition and clinical trials data for 83 kinase inhibitors (35 approved and 48 investigational therapeutics) for this study. We extracted >1,500 serious adverse events from 758 clinical trials covering 224,200 patients tested with kinase inhibitors. We also collected >300 kinase inhibition data points for these 83 kinase inhibitors. To determine kinase inhibition-adverse event relationships, we performed various computational and statistical analyses. In particular, we developed a computational method that integrates the proportional reporting frequency of adverse events with kinase inhibition data to identify kinase inhibition-adverse event (KI-AE) relationships. To validate the KI-AE co-occurrences detected in the clinical drug trials data, we queried the co-occurrence of KI-AE pairs in PubMed. We computed the co-occurrence of the KI-AE correlation pairs using Fisher’s exact test, corrected for multiple comparisons. We also compared the KI-AE associations against the SIDER web resource. Finally, we developed an interaction network for predicting new associations between kinase inhibition and adverse events. We envision that the KI-AE network will assist future drug discovery and development in reducing drug adverse events.
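The PubMed validation step can be illustrated with a small self-contained sketch of a one-sided Fisher's exact test on a 2x2 co-occurrence table (articles mentioning both the kinase and the AE, only one, or neither). The counts below are invented for illustration, and the abstract does not specify the exact table construction or correction procedure; Bonferroni is used here merely as one standard multiple-comparison correction.

```python
from math import comb

def fisher_right_tail(a, b, c, d):
    """One-sided (enrichment) Fisher's exact test p-value for the
    2x2 table [[a, b], [c, d]]: probability of observing a or more
    co-mentions under the hypergeometric null."""
    n = a + b + c + d
    row1, col1 = a + b, a + c
    denom = comb(n, col1)
    p = 0.0
    for k in range(a, min(row1, col1) + 1):
        p += comb(row1, k) * comb(n - row1, col1 - k) / denom
    return p

# Hypothetical counts: 30 articles mention both a kinase and an AE,
# 70 only the kinase, 40 only the AE, 860 neither.
p = fisher_right_tail(30, 70, 40, 860)
n_tests = 1500          # hypothetical number of KI-AE pairs tested
p_bonferroni = min(1.0, p * n_tests)  # simple Bonferroni correction
```

Pairs whose corrected p-value falls below a chosen threshold would then count as literature-supported KI-AE associations.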

Characterization of the mechanism of drug-drug interactions from PubMed

Presenting Author: Feng Cheng, University of South Florida

Identifying drug-drug interactions (DDIs) is an important topic for the development of safe pharmaceutical drugs and for the optimization of multidrug regimens for complex diseases such as cancer and HIV. There have been about 150,000 publications on DDIs in PubMed, which is a great resource for DDI studies. In this abstract, we introduce an automatic computational method for the systematic analysis of the mechanisms of DDIs using MeSH (Medical Subject Headings) terms from the PubMed literature. MeSH is a controlled-vocabulary thesaurus developed by the National Library of Medicine for indexing and annotating articles. Our method can effectively identify DDI-relevant MeSH terms such as drugs, proteins, and phenomena with high accuracy. The connections among these MeSH terms were investigated using co-occurrence heatmaps and social network analysis. Our approach can be used to visualize relationships among DDI terms, which has the potential to help users better understand DDIs. As the volume of PubMed records increases, our method for automatic analysis of DDIs from the PubMed database will become more accurate.
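The co-occurrence analysis underlying such heatmaps can be sketched as counting how often pairs of MeSH terms are assigned to the same article. The three toy "articles" and their term sets below are invented for illustration; real input would be the MeSH annotations of DDI-relevant PubMed records.

```python
from itertools import combinations
from collections import Counter

# Hypothetical MeSH annotations for three PubMed articles.
articles = [
    {"Warfarin", "Cytochrome P-450 CYP2C9", "Drug Interactions"},
    {"Warfarin", "Fluconazole", "Drug Interactions"},
    {"Warfarin", "Cytochrome P-450 CYP2C9", "Fluconazole"},
]

def cooccurrence_counts(annotated_articles):
    """Count how often each unordered pair of MeSH terms is assigned
    to the same article; the resulting pair counts are the cell
    values of a co-occurrence heatmap or the edge weights of a
    term network."""
    counts = Counter()
    for terms in annotated_articles:
        for pair in combinations(sorted(terms), 2):
            counts[pair] += 1
    return counts

counts = cooccurrence_counts(articles)
```

Frequently co-occurring drug-protein-phenomenon triples (here, warfarin, CYP2C9, and fluconazole) are exactly the kind of mechanistic signal the heatmap and network views are meant to surface.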