Leading Professional Society for Computational Biology and Bioinformatics
Connecting, Training, Empowering, Worldwide


ROCKY 2018 | Dec 6 – 8, 2018 | Aspen/Snowmass, CO

POSTER PRESENTATIONS



P01
Simplifying biological data access for data science

Subject: Data management methods and systems

Presenting Author: David Adams, Brigham Young University, United States

Co-Author(s):
Sean Beecroft, Brigham Young University, United States
Amanda Oliphant, Brigham Young University, United States
Emily Hoskins, Brigham Young University, United States

Abstract:

Data collection methods are improving at an increasingly rapid rate, leading to larger and larger datasets. For molecular omics data, many national and international consortia have begun to collect population-scale datasets for the characterization of DNA, RNA, protein, and metabolites of numerous diseases. An explicit goal of these consortia is the dissemination and re-use of their data. Given the recent explosion of data science enthusiasts, a natural target for data re-use is non-biologist computational scientists. Unfortunately, practical use of these data by those outside of the narrow sub-domain is hindered by the steep learning curve necessary to understand data formats, experimental assumptions, raw data processing techniques, etc. Here we present a data dissemination mechanism designed to interface more fluidly with tools and expectations of the data science community. We have packaged the proteo-genomic data from a large uterine cancer cohort in a Python package, accessible natively as dataframes. In addition to the ready-for-use dataframes, the package’s API provides a variety of utilities for multi-omics comparison including the metaclinical information. With an understanding of dataframes, any data scientist can follow our Jupyter-based tutorials to explore the deep molecular profiling of cancer data and participate in scientific discovery.



P02
Dynamical comparison between Hemoglobin and Myoglobin reveals the effects of quaternary structure of Hemoglobin on the intrinsic dynamics of its subunits

Subject: other

Presenting Author: Rotem Aharoni, Ariel University, Israel

Co-Author(s):
Dror Tobi, Ariel University, Israel

Abstract:

Myoglobin and hemoglobin are globular hemeproteins; the former is a monomer and the latter a heterotetramer. Despite the structural similarity of myoglobin to the α and β subunits of hemoglobin, there is a functional difference between the two proteins, owing to the quaternary structure of hemoglobin. The effect of the quaternary structure of hemoglobin on the intrinsic dynamics of its subunits is explored by dynamical comparison of the two proteins. Anisotropic Network Model modes of motion were calculated for hemoglobin and myoglobin. Dynamical comparison between the proteins was performed using global and local Anisotropic Network Model mode alignment algorithms based on the algorithms of Smith-Waterman and Needleman–Wunsch for sequence comparison. The results indicate that the quaternary structure of hemoglobin substantially alters the intrinsic dynamics of its subunits, an effect that may contribute to the functional difference between the two proteins. Local dynamical similarity between the proteins is still observed at the major exit route of the ligand.
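
As an illustration only, the global (Needleman-Wunsch style) alignment idea mentioned above can be sketched in Python over a precomputed similarity matrix. The similarity values, the linear gap penalty, and the function name are assumptions made for this sketch; the abstract does not specify how similarity or gaps are scored, and this is not the authors' implementation.

```python
from typing import List


def needleman_wunsch_score(sim: List[List[float]], gap: float) -> float:
    """Best global alignment score for two ordered series of items.

    sim[i][j] is the similarity between item i of the first series and
    item j of the second; gap is the (negative) linear gap penalty.
    """
    n = len(sim)
    m = len(sim[0]) if sim else 0
    # score[i][j] = best score aligning the first i items with the first j items
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sim[i - 1][j - 1],  # align item i with item j
                score[i - 1][j] + gap,                    # gap in the second series
                score[i][j - 1] + gap,                    # gap in the first series
            )
    return score[n][m]


# Hypothetical similarity matrix (not data from the study).
example_sim = [
    [0.9, 0.1, 0.0],
    [0.2, 0.8, 0.1],
    [0.0, 0.3, 0.7],
]
global_score = needleman_wunsch_score(example_sim, gap=-0.5)
```

A local (Smith-Waterman style) variant would additionally clamp each cell at zero and report the maximum cell rather than the bottom-right corner.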



P03
Convolutional Neural Networks In Classifying Cancer Through DNA Methylation

Subject: Machine learning

Presenting Author: Satya Avva, Saama, United States

Co-Author(s):
Soham Chatterjee, Saama, India
Archana Iyer, Saama, India
Abhai Kollara, Saama, India
Malaikannan Sankarasubbu, Saama, United States

Abstract:

DNA methylation is the most extensively studied epigenetic mark. Usually, a change in the genotype (the DNA sequence) leads to a change in the phenotype (the observable characteristics of the individual). DNA methylation, however, which occurs in the context of CpG dinucleotides (cytosine and guanine bases linked by a phosphate backbone), does not change the original DNA sequence but has the potential to change the phenotype. DNA methylation is implicated in various biological processes and diseases, including cancer. Hence there is strong interest in understanding DNA methylation patterns across various epigenetics-related ailments in order to distinguish and diagnose the type of disease in its early stages. In this work, the relationship between methylated versus unmethylated CpG regions and cancer types is explored using Convolutional Neural Networks (CNNs). A CNN-based deep learning model that can classify the cancer type of a new DNA methylation profile, based on learning from publicly available DNA methylation datasets, is then proposed.



P04
Homologous Inter-Domain Segments in Protein Families

Subject: inference and pattern discovery

Presenting Author: Dylan Barth, University of Nevada Las Vegas, United States


Abstract:

We are interested in sequences between conserved domains of multi-domain proteins. These sequences have historically been ignored in evolutionary analysis because they are not conserved between species and therefore cannot be aligned effectively. To study the evolution of the lengths of these segments, we first need to define homologous inter-domain segments across species. We gathered gene trees from the Ensembl database to provide information on homologous gene families and the evolutionary relationships of the genes. Gene trees were divided into subtrees that are less than 400 million years old. Domain data for each human protein within each gene family have been gathered from both the Superfamily and Pfam databases. Using the boundaries of human domains, we inferred the homologous domain positions across the alignment of the gene family, and defined the homologous inter-domain segments. We have found that these inter-domain segments approximately follow an exponential distribution with a mean and median length of 46 and 23 bp respectively. Based on these data, we plan to study how the lengths of these segments have evolved through insertions and deletions.
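
The step of defining inter-domain segments from domain boundaries can be sketched as follows. This is a minimal Python illustration for a single protein, assuming 1-based inclusive domain coordinates; the function name and the example coordinates are hypothetical, and the projection of boundaries across a gene-family alignment described above is not shown.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end), 1-based and inclusive


def inter_domain_segments(domains: List[Interval]) -> List[Interval]:
    """Return the gaps that lie between consecutive annotated domains."""
    ordered = sorted(domains)
    segments: List[Interval] = []
    if not ordered:
        return segments
    prev_end = ordered[0][1]
    for start, end in ordered[1:]:
        # A segment exists only if at least one residue separates the domains.
        if start > prev_end + 1:
            segments.append((prev_end + 1, start - 1))
        # max() handles overlapping or nested domain annotations.
        prev_end = max(prev_end, end)
    return segments


# Hypothetical domain coordinates (not data from the study).
example_domains = [(10, 60), (85, 140), (141, 200), (230, 300)]
segments = inter_domain_segments(example_domains)
segment_lengths = [end - start + 1 for start, end in segments]
```

Adjacent or overlapping domains produce no segment, so only true gaps contribute to the length distribution.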



P05
GPCR-PEnDB: A database of protein sequences and derived features to facilitate prediction and classification of G protein-coupled receptors

Subject: Data management methods and systems

Presenting Author: Khodeza Begum, University of Texas at El Paso, United States

Co-Author(s):
Jonathon Mohl, University of Texas at El Paso, United States
Ming-Ying Leung, University of Texas at El Paso, United States

Abstract:

G protein-coupled receptors (GPCRs) constitute the largest group of membrane receptor proteins in eukaryotes. Due to their significant roles in physiological processes such as vision, smell, and inflammation, GPCRs are targets of many prescription drugs. However, the functional and sequence diversity of GPCRs has kept their prediction and classification from amino acid sequence data a challenging bioinformatics problem. Existing computational approaches, mainly using machine learning and statistical methods, predict and classify GPCRs based on amino acid sequence and sequence-derived features. In this project, we have constructed a searchable MySQL database and web application, named GPCR-PEnDB, of confirmed GPCRs and non-GPCRs for users to compile and download reliable training and testing datasets for different combinations of computational tools. This database contains over 2800 GPCR and 3500 non-GPCR sequences (including transmembrane proteins) collected from the UniProtKB/Swiss-Prot protein database, covering more than 1100 species. Each protein is assigned a unique identification number and linked to information about its source organism, sequence length, and other features including amino acid and dipeptide compositions. For the GPCRs, family classifications according to the GRAFS and IUPHAR systems and the lengths of characteristic structural regions are also included. The web-based user interface allows researchers to compile and customize datasets with adjustable sequence diversity using the clustering tool CD-HIT, and output them as FASTA files. The current database provides a framework for future expansion to include predicted but unconfirmed GPCRs, which would help the development and assessment of GPCR prediction and classification tools.
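
For readers unfamiliar with the sequence-derived features mentioned above, amino acid and dipeptide compositions can be computed roughly as follows. This is a minimal Python sketch assuming the 20 standard residues and simple relative frequencies; the function names and the example sequence are hypothetical, and the database may define or normalize these features differently.

```python
from collections import Counter
from typing import Dict

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def amino_acid_composition(seq: str) -> Dict[str, float]:
    """Fraction of the sequence made up by each standard amino acid."""
    if not seq:
        raise ValueError("sequence must not be empty")
    counts = Counter(seq)
    return {aa: counts[aa] / len(seq) for aa in AMINO_ACIDS}


def dipeptide_composition(seq: str) -> Dict[str, float]:
    """Fraction of overlapping residue pairs matching each of the 400 dipeptides."""
    if len(seq) < 2:
        raise ValueError("sequence must contain at least two residues")
    pairs = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    total = len(seq) - 1
    return {a + b: pairs[a + b] / total for a in AMINO_ACIDS for b in AMINO_ACIDS}


# Hypothetical short peptide (not a record from the database).
example_seq = "MKTAYIAKQR"
aac = amino_acid_composition(example_seq)
dpc = dipeptide_composition(example_seq)
```

The two dictionaries together give a fixed-length vector of 420 values per protein, which is one common way to feed sequences of different lengths to a classifier.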



P06
A Data Quality Testing Tool for Cross-institutional OMOP Electronic Health Record Data Repositories

Subject: Data management methods and systems

Presenting Author: Timothy Bergquist, University of Washington, United States

Co-Author(s):
Hossein Estiri, Harvard University, United States
Justin Prosser, University of Washington, United States
Adam Wilcox, University of Washington, United States
Kari Stephens, University of Washington, United States

Abstract:

Data quality testing is critical to cross-institutional data sharing, a key component of health innovations produced through translational research. Harmonizing electronic health record (EHR) data is a resource-intensive strategy used in many data sharing efforts, involving extraction, translation, and loading activities that can perpetuate and add to pre-existing data quality issues. Yet, we lack standards and tools for testing the quality of datasets produced through these complex harmonization processes. Given its large-scale adoption, the Observational Medical Outcomes Partnership (OMOP) Common Data Model (CDM) is a front-running candidate for establishing a standard set of executable data quality tests to support cross-institutional data sharing. We adapted a prototype tool, DQe-c, to OMOP CDM V5 with scalability across database platforms. Specifically, the tool examines completeness in all data tables and columns, calculates the percentage of patients who have key clinical variables present (e.g., blood pressure, height), detects the presence of orphan keys (i.e., foreign keys that are not present in their reference table), reports on the size of the databases, and assesses conformance to the standard. All test results are produced as data visualizations in a single HTML dashboard. This prototype is being explored for use in multiple data sharing pilot projects supported by the Clinical Translational Science Award (CTSA) Program Data to Health (CD2H) Coordinating Center, with an aim towards configuring a robust set of completeness, conformance, and plausibility tests that confirm OMOP CDM V5 datasets are fit for cross-institutional data sharing.
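
Two of the checks described above, column completeness and orphan-key detection, can be illustrated with a small Python sketch. It uses in-memory rows rather than a database connection, and the function names and sample rows are hypothetical; only the table and column names (person.person_id, measurement.person_id, measurement.value_as_number) are taken from the OMOP CDM. It is not the DQe-c implementation.

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[object]]


def column_completeness(rows: List[Row], column: str) -> float:
    """Fraction of rows in which the column holds a non-missing value."""
    if not rows:
        return 0.0
    present = sum(1 for row in rows if row.get(column) not in (None, ""))
    return present / len(rows)


def orphan_keys(child_rows: List[Row], fk_column: str,
                parent_rows: List[Row], pk_column: str) -> List[object]:
    """Foreign-key values in the child table that are absent from the parent table."""
    parent_ids = {row[pk_column] for row in parent_rows}
    orphans = {
        row[fk_column]
        for row in child_rows
        if row.get(fk_column) is not None and row[fk_column] not in parent_ids
    }
    return sorted(orphans)


# Hypothetical sample rows (not real patient data).
person = [{"person_id": 1}, {"person_id": 2}, {"person_id": 3}]
measurement = [
    {"measurement_id": 10, "person_id": 1, "value_as_number": 120.0},
    {"measurement_id": 11, "person_id": 2, "value_as_number": None},
    {"measurement_id": 12, "person_id": 4, "value_as_number": 118.0},
    {"measurement_id": 13, "person_id": 1, "value_as_number": 80.0},
]

value_completeness = column_completeness(measurement, "value_as_number")
orphan_person_ids = orphan_keys(measurement, "person_id", person, "person_id")
```

In this sample, one measurement row lacks a value and one row points to a person_id that does not exist in the person table.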



P07
Med2Mech: Neural-Symbolic Representation of Molecular Mechanisms Underlying Pediatric Disease

Subject: Machine learning

Presenting Author: Tiffany Callahan, University of Colorado Denver Anschutz Medical Campus, United States

Co-Author(s):
Adrianne Stefanski, University of Colorado Denver Anschutz Medical Campus, United States
Michael Kahn, University of Colorado Denver Anschutz Medical Campus, United States
Lawrence Hunter, University of Colorado Denver Anschutz Medical Campus, United States

Abstract:

Subphenotyping aims to cluster patients with a particular disease into clinically distinct groups. Genomic and related molecular signatures, such as mRNA expression, have shown great promise for subphenotyping, but such molecular data is not and will not be available for most patients. Here, we present Med2Mech, a method for linking knowledge from generalized molecular data to specific patients’ electronic patient records, and demonstrate its utility for subphenotyping. We hypothesized that integrating knowledge of molecular mechanisms with patient data would improve subphenotype classification. Med2Mech employs neural-symbolic representation learning to generate patient-level embeddings of molecular mechanisms using publicly available biomedical knowledge. Using clinical terminologies and biomedical ontologies, the mechanisms can then be mapped to patient data at scale. Med2Mech was developed and tested using clinical data from a subset of rare disease and other similarly medically complex patients from the Children’s Hospital Colorado. A one-vs-the-rest multiclass classification strategy was used to evaluate the discriminatory ability of embeddings generated using Med2Mech versus only clinical data. Clinical embeddings were built for 2,464 rare disease and 10,000 similarly complex patients using 6,382 conditions, 2,334 medications, and 272 labs. Molecular mechanism embeddings were generated from a knowledge graph (116,158 nodes and 3,593,567 edges) built with 23,776 genes, 3,744 diseases, 49,185 gene ontology concepts, 13,159 phenotypes, 11,124 pathways, and 15,019 drugs. For classification, the molecular mechanism embeddings (precision=0.95, recall=0.94) out-performed all parameterizations of clinical embeddings (precision=0.83, recall=0.82). The Med2Mech representation of patient data improves subphenotype classification relative to standard subphenotyping approaches by incorporating knowledge of molecular mechanisms.



P08
Cell4D: a Spatial Stochastic Simulator for Biological Modeling

Subject: Qualitative modeling and simulation

Presenting Author: Donny Chan, University of Toronto, Canada

Co-Author(s):
Graham Cromar, The Hospital for Sick Children, Canada
Billy Taj, The Hospital for Sick Children, Canada
John Parkinson, The Hospital for Sick Children, Canada

Abstract:

With high-throughput datasets revealing complexity in many biological pathways, computational models can allow biologists to study a large variety of biochemical processes within complicated systems, a task that would otherwise be time-consuming if investigated in vitro or in vivo. In silico simulations of metabolic or signaling pathways can be used to identify novel interactions that can then be experimentally validated. We are developing a graphical cell simulator, Cell4D, that can capture how spatial effects change the behavior of biological systems. The program can be used to simulate complex biological systems of interest, such as the role of CEACAM-related proteins in immune surveillance and evasion. Cell4D is an improved version of an older simulator from the lab by Sanford et al., with expanded infrastructure. The new features include support for rule-based modeling, robust compartment rules, and more efficient neighbor-searching algorithms. These improvements will also allow us to model signaling mediated by CEACAMs, a family of membrane receptors that mediate intercellular adhesion and have roles in cellular growth, differentiation, and inflammation. I will demonstrate Cell4D functionality by modeling a simple biological pathway, along with some preliminary results from a simulated CEACAM1 signaling pathway. The goal of this project is to develop a robust and extendable cell simulator that is biologically accurate and easily applicable to the modeling of diverse types of cell-based processes and biological systems.



P09
REAL-neo, a comprehensive neoantigen prediction and prioritization pipeline using tumor sequencing data

Subject: other

Presenting Author: Yesesri Cherukuri, Mayo Clinic, United States

Co-Author(s):
Yingxue Ren, Mayo Clinic, United States
Vivekananda Sarangi, Mayo Clinic, United States
Yi Lin, Mayo Clinic, United States
Keith Knutson, Mayo Clinic, United States
Yan Asmann, Mayo Clinic, United States

Abstract:

Neoantigens are immunogenic peptides from tumor-specific somatic mutations. The expressed neoantigens can be presented to class-I or class-II MHC molecules and induce robust and enduring anti-tumor T-cell responses. Recent studies have demonstrated the great potential of personalized neoantigen vaccines as a new type of immunotherapy.

In general, identification of neoantigens from tumor sequencing data includes the following steps: (1) call somatic mutations from tumor genomic sequencing data; (2) derive neo-peptide sequences containing somatic mutations; (3) predict binding affinities between neo-peptides and MHC molecules. However, current bioinformatics practices ignore transcript splicing isoforms and expressed fusion gene products, and often focus only on non-synonymous single nucleotide mutations but not frame-shifting INDELs. In addition, MHC binding affinity prediction mainly focuses on class-I but not class-II MHC molecules. Furthermore, studies have shown that substantial numbers of neo-peptides predicted to have low MHC affinities are actually immunogenic, suggesting the necessity of alternative approaches for neoantigen discovery. Finally, nominated neoantigens need to be further filtered to ensure tumor specificity.

We have improved and optimized each step of the bioinformatics workflow for neoantigen identification from tumor sequencing data to address the complexity and current limitations of the process.



P10
NaVARgator: A bioinformatics program to cluster phylogenetic trees and identify representative variants

Subject: Optimization and search

Presenting Author: David Curran, Hospital for Sick Children, Canada

Co-Author(s):
Jamie Fegan, University of Toronto, Canada
John Parkinson, Hospital for Sick Children, Canada

Abstract:

Phylogenetic trees are representations of the relatedness of a group of variants. No matter what the variants represent – genes, proteins, genomes, species, etc – the task of identifying clusters of similar variants arises in many different fields. NaVARgator performs clustering by identifying k variants as cluster centers such that the total phylogenetic distance from all variants to their nearest cluster center is minimized. The software can be run on any phylogenetic tree and allows the clustering procedure to be customized by classifying variants. If the tree contains an outgroup, or other variants to be removed from the clustering procedure, they can be assigned as “ignored”. If there are variants that should be selected as cluster centers – perhaps because they are biologically important or already well studied – they can be assigned as “chosen”. The variants that the remaining cluster centers will be chosen from should be assigned as “available”. Unassigned variants will still impact the clustering calculations but cannot be selected as cluster centers.
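
The clustering objective described above can be sketched in Python as a small exhaustive search. This is an illustration under stated assumptions, not NaVARgator's algorithm: the function name, the distance table, and the brute-force strategy are all hypothetical, and exhaustive search is only practical for small trees.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple

Distances = Dict[str, Dict[str, float]]


def choose_centers(dist: Distances, available: Set[str], chosen: Set[str],
                   ignored: Set[str], k: int) -> Tuple[List[str], float]:
    """Pick k centers minimizing the summed distance to the nearest center.

    "chosen" variants are always centers, "available" variants may become
    centers, "ignored" variants are left out entirely, and every other
    variant counts toward the total but cannot be a center.
    """
    clustered = [v for v in dist if v not in ignored]
    n_free = k - len(chosen)
    best_centers: List[str] = []
    best_total = float("inf")
    for extra in combinations(sorted(available), n_free):
        centers = sorted(chosen) + list(extra)
        total = sum(min(dist[v][c] for c in centers) for v in clustered)
        if total < best_total:
            best_centers, best_total = sorted(centers), total
    return best_centers, best_total


# Hypothetical pairwise phylogenetic distances for variants A-E plus outgroup O.
pairs = {("A", "B"): 2, ("A", "C"): 6, ("A", "D"): 6, ("A", "E"): 6,
         ("B", "C"): 6, ("B", "D"): 6, ("B", "E"): 6,
         ("C", "D"): 4, ("C", "E"): 4, ("D", "E"): 2}
names = "ABCDEO"
dist: Distances = {u: {v: 0.0 for v in names} for u in names}
for (u, v), d in pairs.items():
    dist[u][v] = dist[v][u] = d
for v in "ABCDE":
    dist["O"][v] = dist[v]["O"] = 10

centers, total_distance = choose_centers(
    dist, available={"C", "D", "E"}, chosen={"A"}, ignored={"O"}, k=2)
```

Here B is unassigned, so it contributes to the total distance without being eligible as a center. D and E tie as the second center, and the sketch keeps the first candidate found in sorted order.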

NaVARgator provides a rich graphical user interface designed to aid the user in evaluating a cluster configuration, as well as comparing different configurations or numbers of clusters. Clustering data can be exported in a number of ways: as a customizable image of the tree, a list of variant names in the clusters or other subsets, a list of the distance between each variant and its cluster center, or a histogram of those distances. The software is available for installation or as a web tool.



P11
Identifying candidate druggable targets in canine cancer cell lines using whole exome sequencing

Subject: inference and pattern discovery

Presenting Author: Sunetra Das, Colorado State University, United States

Co-Author(s):
Rupa Idate, Colorado State University, United States
Dawn Duval, Colorado State University, United States

Abstract:

The FACC canine cell line panel is a valuable resource for studying genome variations that drive cancer in dogs and for assessing pharmacogenomic correlations through in vitro testing of new targeted therapies. The goal of this study is to create a database of somatic mutations in canine cancer cell lines using whole exome sequencing (WES). WES data from 33 cell lines representing ten cancer types were mapped against the canine genome using BWA. Variant calling and annotation were conducted with Freebayes and SnpEff, respectively. Following removal of germline variants and known polymorphisms, a total of 66,344 somatic variants were identified. Mutational load across the FACC panel ranged from 15.79 to 129.37 per Mb, and 13.2% of all variants were located in the protein-coding regions of 5,085 genes. Using the Cancer Gene Census (COSMIC), 232 curated genes that play a role in cancer were identified in this dataset. Upon cross-checking with human driver mutations, 62 variants were collated as candidate cancer drivers across 30 cell lines. To identify other protein-coding variants that may play a role in cancer progression, genes were functionally annotated and a two-pronged approach was used to select functional terms: (A) terms associated with activating and maintaining cancer (GO, PFAM, KEGG); (B) terms with at least one cancer-causing gene. This yielded 502 genes that are not currently in the COSMIC database, with an enrichment of the MAPK and PI3K-Akt pathways. This functionally annotated database will be useful in conducting hypothesis-driven research based on the cell line mutational landscape.



P12
Poster Withdrawn


P13
SumSec: Accurate Prediction of Sumoylation Sites using Predicted Secondary Structure

Subject: Machine learning

Presenting Author: Abdollah Dehzangi, Morgan State University, United States

Co-Author(s):
Yosvany Lopez, Genesis Healthcare Co., Japan
Ghazaleh Taherzadeh, Griffith University, Australia
Tatsuhiko Tsunoda, RIKEN Center for Integrative Medical Sciences, Japan
Alok Sharma, RIKEN Center for Integrative Medical Sciences, Japan

Abstract:

Post-translational modification (PTM) is defined as the interaction of amino acids along the protein sequence with different macromolecules after the translation process. These interactions significantly impact the functioning of proteins and can range from strongly deleterious to strongly advantageous. Therefore, understanding the underlying mechanisms of PTMs can play a critical role in understanding the functioning of proteins. Among the wide range of PTMs, sumoylation is one of the most important because of its roles in transcriptional regulation, protein stability, and protein subcellular localization. Despite its importance, determining sumoylation sites using experimental methods is time-consuming and costly. Therefore, there is a crucial demand for fast computational methods that can accurately determine the sumoylation sites in proteins. In this study, we develop a new machine learning-based method, called SumSec, to predict sumoylation sites. To do this, we employ the predicted secondary structure of amino acids to extract two types of structural features from neighboring amino acids along the protein sequence, which have never been used for this task. We also employ the concept of profile-bigram to extract local information about the interaction of the amino acids based on structural information. As a result, our proposed method predicts sumoylation sites more accurately than previously proposed methods found in the literature. SumSec demonstrates high sensitivity (0.91), accuracy (0.94) and MCC (0.88). The prediction accuracy achieved in this study is 21% better than that of previous studies found in the literature.
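
For reference, the three reported metrics are standard functions of the confusion-matrix counts. The Python sketch below shows how they are computed; the counts are made up for illustration and are not the SumSec results.

```python
import math


def sensitivity(tp: int, fn: int) -> float:
    """True positive rate: TP / (TP + FN)."""
    return tp / (tp + fn)


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient; returns 0.0 when undefined."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical counts (not taken from the study).
TP, FN, TN, FP = 90, 10, 95, 5
example_sensitivity = sensitivity(TP, FN)
example_accuracy = accuracy(TP, TN, FP, FN)
example_mcc = mcc(TP, TN, FP, FN)
```

MCC uses all four cells of the confusion matrix, which makes it more informative than accuracy alone when modified and unmodified sites are imbalanced.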



P14
Leaf: A self service cohort discovery and extraction browser for mining clinical enterprise data warehouses for research and quality improvement

Subject: Data management methods and systems

Presenting Author: Nicholas Dobbins, University of Washington, United States

Co-Author(s):
Anthony Black, University of Washington, United States
Cliffard Spital, University of Washington, United States
Robert Harrington, University of Washington, United States
Bas de Veer, University of Washington, United States
Xiyao Yang, University of Washington, United States
Robert Meizlik, University of Washington, United States
Beth Britt, University of Washington, United States
Jason Morrison, University of Washington, United States
Kari Stephens, University of Washington, United States
Adam Wilcox, University of Washington, United States
Peter Tarczy-Hornoch, University of Washington, United States

Abstract:

Academic medical centers and health systems are increasingly challenged with supporting appropriate secondary uses of data from a multitude of sources. To that end, the UW Medicine Enterprise Data Warehouse (EDW) has emerged as a central port for all data, which can include clinical, research, administrative, financial and other data types. Although EDWs have been popular and successful in providing a single stop for data, they are often not self-service and require an informatician or clinical informatics expert to access. To address this challenge, we have developed Leaf, an easy-to-use, self-service, web-based tool for querying, browsing and extracting clinical cohorts from the UW Medicine EDW. Leaf enables querying by data dictionaries or ontologies, allows both de-identified and identified access to patient data, and grants access to these datasets in a compliant manner. Leaf is an interface that is being built upon multiple data models and is independent of any specific data model. While Leaf provides only basic visualizations, it contains robust tools for exporting directly to REDCap projects. Leaf differs from existing query tools (e.g. i2b2, SlicerDicer) because it does not require a specific data model and is intended only to be a reusable, lightweight, modern web interface. The users of Leaf include both quality improvement and research investigators, and the tool has been developed using an Agile development process with a soft production rollout to identify and address software, support and data quality concerns.



P15
The GA4GH/DREAM Workflow Execution Challenge

Subject: System integration

Presenting Author: James Eddy, Sage Bionetworks, United States

Co-Author(s):
Brian O'Connor, University of California, Santa Cruz, United States
Denis Yuen, Ontario Institute for Cancer Research, Canada
Justin Guinney, Sage Bionetworks, United States

Abstract:

Software and platforms for workflow sharing and execution are increasingly utilized in massive data generation efforts. In turn, groups are developing standards, APIs, and best practices for running portable and reproducible pipelines. In order to ensure that the promises of reproducibility are being met by these standards and projects, we must critically assess workflows and workflow management systems.

With the GA4GH/DREAM Infrastructure Challenges, we aim to bring groups together to test and demonstrate tool portability while continuing to develop common standards. We also aim to bring together workflow authors and execution platform engineers to increase communication and accelerate the resolution of compatibility issues. In the GA4GH/DREAM Workflow Execution Challenge (synapse.org/WorkflowChallenge), participants downloaded a Dockerized, CWL/WDL-described workflow—along with any required input, reference, or parameter files—from Synapse. Participants ran the workflow in their environment and uploaded results to Synapse along with a description of their methods.

Through this challenge, we began to formalize methods for evaluating workflow portability. We not only piloted the use of centralized and systematic validation through Synapse, but defined standard procedures for authoring, registering, and onboarding workflows. The processes and frameworks used in this challenge resulted in a collection of stress-tested workflows, a rich body of examples and documentation, and a well annotated record of workflow/platform compatibility. This work has also informed efforts of the GA4GH Cloud Work Stream (ga4gh.cloud) to establish a “testbed” framework for centralized benchmarking of workflows and platforms.



P16
A Machine Learning Classifier for Assigning Individual Patients with Systemic Sclerosis to Intrinsic Molecular Subsets

Subject: Machine learning

Presenting Author: Jennifer Franks, Geisel School of Medicine at Dartmouth, United States

Co-Author(s):
Viktor Martyanov, Geisel School of Medicine at Dartmouth, United States
Guoshuai Cai, Arnold School of Public Health at University of South Carolina, United States
Yue Wang, Geisel School of Medicine at Dartmouth, United States
Tammara Wood, Geisel School of Medicine at Dartmouth, United States
Michael Whitfield, Geisel School of Medicine at Dartmouth, United States

Abstract:

High-throughput gene expression profiling of skin biopsies from patients with systemic sclerosis (SSc) has identified four “intrinsic” gene expression subsets (inflammatory, fibroproliferative, normal-like, limited) conserved across multiple cohorts and tissues. In order to characterize patients in clinical trials or for diagnostic purposes, supervised methods that can classify single samples are required.

Three gene expression cohorts were curated and merged for the training dataset. Supervised machine learning algorithms were trained using repeated three-fold cross-validation. We performed external validation using three additional datasets, including one generated by an independent laboratory on a different microarray platform. WGCNA and g:Profiler were used to identify and functionally characterize gene modules associated with the intrinsic subsets.

The final model, a multinomial elastic net, achieved an average classification accuracy of 88.1%. All intrinsic subsets were classified with high sensitivity and specificity, particularly inflammatory (83.3%, 95.8%) and fibroproliferative (89.7%, 94.1%). In external validation, the classifier achieved an average accuracy of 85.4%. In a re-analysis of GSE58095, we identified subgroups of patients that represented the canonical inflammatory, fibroproliferative, and normal-like subsets. Inflammatory gene modules showed upregulated biological processes including inflammatory response, lymphocyte activation, and stress response. Similarly, fibroproliferative gene modules were enriched in cell cycle processes.

We developed an accurate, reliable classifier for SSc intrinsic subsets, trained and tested on 427 skin biopsies from 213 individuals. Our method provides a robust approach for assigning samples to intrinsic gene expression subsets and can be used to aid clinical decision-making and interpretation for SSc patients and in clinical trials.



P17
ShapeShifter: Making it Easy to Transform Genomic and Transcriptomic Data from One File Format to Another

Subject: Data management methods and systems

Presenting Author: Brandon Fry, Brigham Young University, United States

Co-Author(s):
Stephen Piccolo, Brigham Young University, United States

Abstract:

Despite bioinformatics’ emphasis on handling and interpreting large data files, there is a distinct lack of uniformity in file formats for sharing this data among researchers. Because biomolecular data is stored in many different and frequently incompatible formats, researchers spend a frustrating amount of time transforming data from one format to another, which impedes solving the more interesting biological problems that researchers wish to address. To ease and simplify this process, we have developed ShapeShifter, a Python module and accompanying command-line tool that enable researchers to quickly transform preprocessed, tabular data from one format to another. Additionally, researchers can perform queries on the data, select specific columns, merge multiple files into one, and gzip the resulting data using simple commands from a terminal. ShapeShifter currently supports transforming 15 different file types, ranging from formats used across many disciplines—like CSV, Excel, and SQLite—to those used specifically in bioinformatics applications for processing genomic and transcriptomic data, including Kallisto, Salmon, and GenePattern. Support for additional file formats is ongoing, and we encourage requests for such to be made to our open-source repository at https://github.com/srp33/ShapeShifter. To demonstrate ShapeShifter's utility, we performed a benchmark evaluation, comparing the time taken to import and export small, medium, and large data files when various filters are applied. Our evaluation shows that ShapeShifter excels at transforming and filtering small and medium sized files; currently, we are developing a solution to efficiently process files that are too large to store in memory.



P18
Using machine learning algorithms for classification of medulloblastoma subgroups based on gene expression data

Subject: Machine learning

Presenting Author: Sivan Gershanov, Ariel University, Israel

Co-Author(s):
Igor Vainer, Ariel University, Israel
Helen Toledano, Schneider Children’s Medical Center of Israel, Israel
Albert Pinhasov, Ariel University, Israel
Nitza Goldenberg-Cohen, Bnai Zion Medical Center, Israel
Mali Salmon-Divon, Ariel University, Israel

Abstract:

Medulloblastoma (MB), the most common malignant pediatric brain tumor, is divided into four molecular subgroups: WNT, SHH, Group 3 and Group 4. Clinical practice and treatment design are becoming subgroup-specific. Clinicians currently use a 22-gene signature set to diagnose the subgroups. While the WNT and SHH subgroups are well defined, differentiating Group 3 from Group 4 is less obvious.
The aim of this study is to improve the diagnostic process in the clinic by identifying the most efficient list of biomarkers for accurate, fast and cost-effective MB subgroup classification.
We tested five machine-learning algorithms: four are well known and one is a novel method we developed. We applied them to a public microarray expression data set and compared their performance to that of the known 22-gene set.
Both decision tree and decision rules resulted in a reduced set with accuracy similar to that of the 22-gene set. Random forest and SVM-SMO methods showed improved performance without applying feature selection. Our novel SARC (SVM Attributes Ranking and Combinations) classifier, which allows feature selection, achieved the highest accuracy, exceeding that obtained with the 22-gene set as input. The number of attributes in the best-performing combinations ranges from 13 to 32 and includes known MB-related genes such as WIF1, NPR3 and GRM8, along with LOC440173, a long non-coding RNA.
To summarize, we identified sets of attributes that have the potential to improve MB subgroup diagnosis. Broad clinical use of this classification may accelerate the design of patient-specific targeted therapies and optimize clinical decision-making.



P19
Computational Identification and Analysis of Bacterial Virulence Factors Embedded Into Bacteriophage Genomes

Subject: Metagenomics

Presenting Author: Cody Glickman, University of Colorado Anschutz, United States

Co-Author(s):
Michael Strong, National Jewish Health, United States
Josephina Hendrix, University of Colorado Anschutz, United States

Abstract:

Pathogenic bacteria utilize gene products called virulence factors to circumvent host immunity and promote colonization. Bacteria are adept at acquiring genetic information capable of producing virulence factors from their environment through horizontal gene transfer. One understudied mechanism of horizontal gene transfer is the integration of viral elements called bacteriophages into a host bacterial genome. These integrated bacteriophages engage in a passive lifestyle called lysogeny, replicating in parallel with the host bacteria. Propagative success of these viral elements is dependent upon the ability of the host bacteria to thrive in a niche. Thus, there is a selective advantage for bacteriophages that carry genetic elements capable of increasing the fitness and propagative success of the host.
In this study, we utilized sequence similarity techniques to establish a baseline distribution of virulence factor genes embedded within viral genomes. The bacterial taxonomy listed in the names of viral genomes is used to discretize the bacteriophages into genus-level categories. Comparisons between the categories suggest that viral elements known to infect pathogenic bacteria contain different percentages of virulence factor genes. In addition, we use network-based methods to explore the functional potential of virulence factors between the categories. Finally, we compare our baseline distribution against the percentage of virulence factor genes embedded in bacteriophages isolated from clinical nontuberculous mycobacterial samples.
This study expands our understanding of horizontal gene transfer by bacteriophages and provides a resource for further research. In addition, the study provides insight into the abundance of virulence factors within lysogenic bacteriophages of clinical mycobacteria.



P20
Harmonizing and Analyzing Clinical Trials Data in the AHA Precision Medicine Platform

Subject: other

Presenting Author: Carsten Goerg, University of Colorado, United States

Co-Author(s):
Christophe Roeder, University of Colorado, United States
Bethany Doran, University of Colorado, United States
Ann Marie Navar, Duke University, United States
Michael Hinterberg, SomaLogic, United States
John Graybeal, Stanford University, United States
Mark Musen, Stanford University, United States
Jennifer Hall, American Heart Association, United States
David Kao, University of Colorado, United States

Abstract:

Clinical trials have produced many highly valuable datasets, but their potential to support discovery through meta-analysis has not been fully realized. Answering biomedical questions often requires integrating and harmonizing data from multiple trials to increase statistical power. Due to the lack of supporting computational approaches, this challenging and time-consuming integration process is currently performed manually, which leads to scalability and reproducibility issues. We present a framework and prototype implementation within the cloud-based American Heart Association Precision Medicine Platform as a first step towards addressing this problem. Our framework provides (1) a metadata-driven mapping process from study-specific variables to the OMOP common data model, (2) a metadata-driven extraction process for creating analysis matrices of harmonized variables, and (3) an interactive visual interface to define and explore cohorts in harmonized studies. To demonstrate our approach, we present a prototype use case that investigates the relationship between blood pressure and mortality in patients treated for hypertension. Using our framework, we harmonized five publicly available NIH-funded studies (ALLHAT, ACCORD, BARI-2D, AIM-HIGH, and TOPCAT), assessed distributions of blood pressure by study, and, using the harmonized data, performed individual patient-data meta-analyses to show the statistical relationship between all-cause mortality and systolic blood pressure, for individual studies as well as the aggregated data. We discuss how the cloud-based implementation supports reproducibility as well as transparent co-development between collaborators over time and space. Future work will entail development of a generalized workflow for acquisition and semantic annotation of new datasets based on the CEDAR metadata management system.
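
The first two framework components (metadata-driven mapping and extraction of an analysis matrix) can be pictured with a small sketch. Everything in it is an assumption made for illustration: the study names, the source variable names, the harmonized variable names, and the toy records are hypothetical, and the targets are plain labels rather than real OMOP concepts. It is not the platform's implementation.

```python
# Hypothetical mapping metadata:
# study -> source variable -> (harmonized variable, unit conversion factor).
MAPPINGS = {
    "study_a": {"SBP": ("systolic_bp_mmhg", 1.0),
                "DEATH": ("all_cause_death", 1.0)},
    # 1 kPa = 7.50062 mmHg
    "study_b": {"sys_bp_kpa": ("systolic_bp_mmhg", 7.50062),
                "died": ("all_cause_death", 1.0)},
}


def harmonize(study, record):
    """Rename and rescale one study-specific record using the mapping metadata."""
    harmonized = {"study": study}
    for source_name, value in record.items():
        if source_name in MAPPINGS[study]:  # unmapped variables are dropped
            target_name, factor = MAPPINGS[study][source_name]
            harmonized[target_name] = value * factor
    return harmonized


def build_analysis_matrix(records_by_study, variables):
    """Stack harmonized records from all studies into rows of the chosen variables."""
    rows = []
    for study, records in records_by_study.items():
        for record in records:
            h = harmonize(study, record)
            rows.append([h["study"]] + [h.get(v) for v in variables])
    return rows


# Toy records (made-up values).
raw = {
    "study_a": [{"SBP": 140, "DEATH": 0, "SITE": 3}],
    "study_b": [{"sys_bp_kpa": 16.0, "died": 1}],
}
matrix = build_analysis_matrix(raw, ["systolic_bp_mmhg", "all_cause_death"])
```

Keeping the mapping in a data structure rather than in code is what makes the process metadata-driven: adding a study means adding an entry, not writing a new script.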



P21
Integrating pathway databases with Gene Ontology Causal Activity Models

Subject: inference and pattern discovery

Presenting Author: Benjamin Good, Berkeley Labs, United States

Co-Author(s):
Paul Thomas, USC, United States
David Hill, The Jackson Laboratory, United States
Huaiyu Mi, USC, United States
Kimberly van Auken, Caltech, United States
Seth Carbon, Berkeley Labs, United States
Laurent-Philippe Albou, Berkeley Labs, United States
Nomi Harris, Berkeley Labs, United States
Suzanna Lewis, Berkeley Labs, United States
Chris Mungall, Berkeley Labs, United States
James Balhoff, RENCI, United States
Peter Deustachio, NYU, United States

Abstract:

The Gene Ontology (GO) Consortium (GOC) is developing a new knowledge representation approach called ‘causal activity models’ (GO-CAM). A GO-CAM describes how one or several gene products contribute to the execution of a biological process. In these models (implemented as OWL instance graphs anchored in Open Biological Ontology (OBO) classes and relations), gene products are linked to molecular activities via semantic relationships like ‘enables’, molecular activities are linked to each other via causal relationships such as ‘positively regulates’, and sets of molecular activities are defined as ‘parts’ of larger biological processes. This approach provides the GOC with a more complete and extensible structure for capturing knowledge of gene function. It also allows for the representation of knowledge typically seen in pathway databases.

Here, we present details and results of a rule-based transformation of pathways represented using the BioPAX exchange format into GO-CAMs. We have automatically converted all Reactome pathways into GO-CAMs and are currently working on the conversion of additional resources available through Pathway Commons. By converting pathways into GO-CAMs, we can leverage OWL description logic reasoning over OBO ontologies to infer new biological relationships and detect logical inconsistencies. Further, the conversion helps to increase standardization for the representation of biological entities and processes. The products of this work can be used to improve source databases, for example by inferring new GO annotations for pathways and reactions, and can help with the formation of meta-knowledge bases that integrate content from multiple sources.



P22
Human Skin Biopsy Culture Model Maintains Psoriasis Disease Function and Demonstrates Pathway Engagement by Dexamethasone

Subject: other

Presenting Author: Shaun Grosskurth, AbbVie, United States

Co-Author(s):
Susan Huang, AbbVie, United States
Loan Miller, AbbVie, United States
Hetal Patel, AbbVie, United States
Lauren Olson, AbbVie, United States
Joseph Wetter, AbbVie, United States
Mark Reppell, AbbVie, United States
Marc Domanus, AbbVie, United States
Christopher Miller, AbbVie, United States
Marie Honore, AbbVie, United States
Victoria Scott, AbbVie, United States

Abstract:

Early indication of efficacy or pathway engagement in relevant disease models is valuable during drug discovery. While animal models of skin disease are useful, not all features of human skin are recapitulated. In collaboration with AbbVie Clinical Pharmacology Research Unit, we obtained skin biopsies from psoriasis patients to develop and evaluate an ex vivo human skin biopsy culture model. Full thickness skin biopsies from psoriasis patients and healthy donors were bisected and cultured for 3, 6, or 24 hours with and without dexamethasone treatment. Conditioned media and skin biopsies were harvested to characterize cytokine levels and assess transcriptomics. Confirming elevated inflammation, higher cytokine levels were seen in the media from psoriasis lesional versus control skin samples after 24 hours of culturing. For transcriptomic analysis, skin biopsy gene expression profiling was performed on Affymetrix Human Gene ST 1.0 arrays and differential expression was performed with linear modeling. The transcriptomic psoriatic lesional phenotypic status of the cultured biopsies was confirmed with gene lists identified from a meta-analysis incorporating 9 skin biopsy public datasets from psoriasis patients. Also as expected, biopsy samples treated with dexamethasone exhibited features consistent with decreased inflammation and activation of the glucocorticoid receptor NR3C1. Here we show that the human psoriasis skin biopsy culture model maintains many clinical phenotypes of fresh psoriasis skin biopsies. More importantly, we show that the skin biopsy model is a valuable tool for interrogation of pharmacodynamic pathway engagement and potential efficacy for future candidate therapeutics.



P23
How open is open? The (Re)usable Data Project assesses data licensing

Subject: Data management methods and systems

Presenting Author: Melissa Haendel, Oregon Health & Science University, United States

Co-Author(s):
Seth Carbon, LBNL, United States
Robin Champieux, OHSU, United States
Lilly Winfree, OHSU, United States
Letisha Wyatt, OHSU, United States
Julie Mcmurry, Oregon State University, United States

Abstract:

Complex licensing and data reuse restrictions hinder most publicly-funded, seemingly “open,” biomedical and biological data from being used, modified, and redistributed to its full potential. Such issues include missing or non-standard licenses, restrictive provisions that do not allow for resources to be redistributed after modification, and terms that limit synthesis with other resources. Further, navigating legal compliance with data licensing and use agreements is complicated, as data is often manipulated, shared, and redistributed by many types of research groups and users in various and subtle ways. The community is plagued by complex licensing and legal terms of reuse when integrating data from a broad array of publicly funded resources. This struggle spurred the creation of the (Re)usable Data Project (http://reusabledata.org), an open source and open data project in which we created a five-part rubric to evaluate data resources’ licensing information. Here we present our rubric and evaluations of sources based on the findability and clarity of the terms of use, how accessible the data is, and the degree to which unnegotiated and unrestricted reuse and redistribution can occur. We have tested the (Re)usable Data Project’s rubric against over 50 biological data sources. Approximately 40% of the resources rank poorly, and more than half of the sources lack a clear, easily findable license. We hope that this systematic review of the data licensing landscape will build awareness, engage the community, and ultimately improve licensing practices that impact data reuse.



P24
Custom database for identifying coral symbionts

Subject: Metagenomics

Presenting Author: Graham Hamilton, University of Glasgow, United Kingdom

Co-Author(s):
Nick Kamenos, University of Glasgow, United Kingdom

Abstract:

The aim of this work is to determine which species of symbiotic photosynthetic algae are present in coral samples and to assess population changes due to rising ocean temperatures.



P25
Transcriptome analysis of cancer adjacent normal tissues reveals genes co-expressed with LINE elements

Subject: inference and pattern discovery

Presenting Author: Mira Han, University of Nevada Las Vegas, United States

Co-Author(s):
Nicky Chung, University of Nevada, Las Vegas, United States
G.M. Jonaid, University of Nevada, Las Vegas, United States
Sophia Quinton, University of Nevada, Las Vegas, United States
Austin Ross, University of Nevada, Las Vegas, United States

Abstract:

Despite the long-held assumption that transposons are normally only expressed in the germ-line, recent evidence shows that transcripts of LINE sequences are frequently found in somatic cells. However, the extent of variation in LINE transcript levels across different tissues and different individuals, and the genes and pathways that are co-expressed with LINEs, are unknown. Here we report the variation in LINE transcript levels across tissues and between individuals observed in the normal tissues collected for The Cancer Genome Atlas. Mitochondrial genes and ribosomal protein genes were enriched among the genes that showed negative correlation with L1HS in transcript level. We hypothesize that oxidative stress is the factor that leads to both repressed mitochondrial transcription and LINE over-expression. KRAB zinc finger proteins (KZFPs) were enriched among the transcripts positively correlated with older LINE families. The correlation between transcripts of individual LINE loci and individual KZFPs showed highly tissue-specific patterns. There was also a significant enrichment of the corresponding KZFP’s binding motif in the sequences of the correlated LINE loci, among KZFP-LINE locus pairs that showed co-expression. These results support the KZFP-LINE interactions previously identified through ChIP-seq, and provide information on the in vivo tissue context of the interaction.



P26
Refining orthology determination in RNA-seq phylogenetics

Subject: Optimization and search

Presenting Author: Madison Hansen, American Museum of Natural History, United States

Co-Author(s):
Ward Wheeler, American Museum of Natural History, United States

Abstract:

Orthologous genes are genes in different species which have evolved from the same gene in the common ancestor of those species. Determining groups of orthologous genes is important to evolutionary biology research, including phylogenetics and gene function studies. While orthology determination has historically relied on alignment to reference genome sequences, now next-generation sequencing techniques can generate large quantities of genome-wide sequences from organisms that do not yet have a reference genome. When no reference genome is available, orthology determination for multiple species is computationally expensive. Assorted methods for orthology determination have been developed; each uses particular sequence similarity measurements and clustering algorithms to organize the sequences into potential orthologous groups. However, the groups sometimes conflict with the inferred evolution of the species, indicating that the groups may not comprise true orthologs. Here, we refine orthology determinations using phylogenetic inference methods and heuristics, in order to produce more consistent and precise orthologous groups.



P27
Optimized hybrid assembly of mycobacterial genomes using MinION and Illumina next generation sequence reads

Subject: Optimization and search

Presenting Author: Jo Hendrix, University of Colorado Anschutz, United States

Co-Author(s):
Elaine Epperson, National Jewish Health, United States
Nabeeh Hasan, National Jewish Health, United States
Cody Glickman, National Jewish Health, United States
Michael Strong, National Jewish Health, United States

Abstract:

Next generation sequencing (NGS) technology allows researchers to sequence and compare bacterial genomes at an increasingly rapid pace, but in order to assemble complete bacterial genomes, and to identify plasmids, a hybrid assembly method often proves more effective. Illumina NGS produces highly accurate reads that are less than 300 bases in length and have difficulty covering low-complexity regions of the genome such as repeats. Such regions result in gaps between assembled contigs, computationally assembled segments of the sequence. In contrast to the short reads produced by Illumina, MinION technology is capable of sequencing segments that are tens of thousands of bases in length. These reads can span entire repetitive regions; however, these long reads are more expensive per base, are lower-throughput, and have a higher error rate.
In this study, we used and tested a method of hybrid genome assembly of Illumina and MinION sequence data in order to assemble complete bacterial genomes. We used the higher-throughput and more accurate Illumina reads to make a reliable scaffold of fragmented contigs which were stitched together utilizing MinION long reads. The result is an assembly of the bacterial genome and its plasmids. We demonstrate this method on two nontuberculous mycobacterial genomes, M. kubicae and M. gordonae.
With completed reference genomes from hybrid assembly, we can better annotate the complete genomes, identifying genes involved in virulence and drug resistance. Further, these annotations can be used to infer drug susceptibility and the drug combination that may be the most effective against a bacterial infection.



P28
Predicting cancer outcomes based on gene-expression profiles more accurately with deep neural networks

Subject: Machine learning

Presenting Author: Kimball Hill, Brigham Young University, United States

Co-Author(s):
Stephen Piccolo, Brigham Young University, United States

Abstract:

The development of next-generation sequencing technologies has led to the creation of massive "omic" datasets, the analysis of which has posed a challenging task for bioinformatics researchers. Although "shallow" machine-learning algorithms and statistical models have contributed greatly to the successful analysis of such data, these models frequently are unable to deal with the complexity of omic data and are difficult to optimize due to the need for feature engineering. Recent developments in the field of deep learning have shown that deep neural networks (DNNs) can outperform shallow algorithms in a variety of applications. These improvements are mostly demonstrated by DNNs’ performance in computer vision; however, their usefulness is being demonstrated in an increasing number of other applications such as natural language processing and diverse classification problems. DNNs are highly customizable, overcoming many limitations imposed by shallow algorithms. By utilizing the latest techniques in DNN architecture design, including the use of transfer learning, self-normalizing networks, and stacked auto-encoders, we sought to generate more accurate predictions of patient diagnoses, outcomes, and treatment responses with transcriptomic data, which could more reliably inform oncologists’ patient-care decisions. In our comparison across more than 30 different classification algorithms, DNNs performed best for the majority of 40+ different classification problems. In addition, we further improved the models by transferring network layers trained on other, similar models. These results encourage future research such as the unpacking of high-performing and transferable DNNs to uncover relationships between genes and to understand how they influence the models' classification decisions.



P29
Computational and cultural aspects of improved attribution

Subject: other

Presenting Author: Kristi Holmes, Northwestern University, United States

Co-Author(s):
Melissa Haendel, Oregon State University, United States
David Eichmann, University of Iowa, United States
Patty Smith, Northwestern University, United States
Nicole Vasilevsky, Oregon Health & Science University, United States
Marijane White, Oregon Health & Science University, United States
Sara Gonzales, Northwestern University, United States
Karen Gutzman, Northwestern University, United States

Abstract:

Open science practices, collaborative team science, and a drive to understand meaningful outcomes and impacts have transformed research at all levels. It is not sufficient to consider scholarship simply from the perspective of the number of papers written, citations garnered, and grant dollars awarded. We must enable a more nuanced characterization and contextualization of the wide array of contributions of varying types and intensities that are necessary to move science forward. Unfortunately, little infrastructure exists to identify, aggregate, present, and (ultimately) assess the impact of these contributions. Moreover, these challenges are technical as well as social and require an approach that assimilates cultural perspectives for investigators and organizations alike.

Here we will present ongoing work through the National Center for Data to Health (CD2H) to address these challenges, with a special emphasis on the unique needs and opportunities for trainees and early stage investigators (ESI) in translational science, especially in data science and informatics. We will discuss contributor roles, research products, and scholarly workflows that can be leveraged for ESI to put their best foot forward to more effectively communicate their science, get credit for their work, and ultimately drive knowledge to impact. We will also examine this topic from an institutional perspective to identify new opportunities for institutions to integrate workflows that will enable them to recognize and credit a diverse complement of work.



P30
A new computational pipeline for PAR-CLIP characterizes a key immune regulatory mechanism

Subject: inference and pattern discovery

Presenting Author: Rachel Hovde, Chimera Bioengineering, United States

Co-Author(s):
Gus Zeiner, Chimera Bioengineering, United States
Melissa Fardy, Chimera Bioengineering, United States
Krista McNally, Chimera Bioengineering, United States
Jay Danao, Chimera Bioengineering, United States
Charlotte Davis, Chimera Bioengineering, United States
Joe Solvason, Chimera Bioengineering, United States

Abstract:

RNA-binding proteins are key effectors of post-transcriptional gene regulation, but for most RNA-binding proteins, the scope and specifics of the RNA-binding landscape are unknown. In 2010, Hafner et al. described the PAR-CLIP assay, an approach that uses crosslinking, immunoprecipitation and next-generation sequencing to yield a transcriptome-wide RNA:protein interaction map at single-nucleotide resolution. To facilitate data analysis, Corcoran et al. (2011) developed PARalyzer, an algorithm for PAR-CLIP data analysis that predicts which groups of short reads are derived from true protein binding sites.

Although PAR-CLIP/PARalyzer is a powerful workflow, it was designed to produce short (20-30nt) sequence reads that often map ambiguously to the genome, and it is not optimized for deep-sequenced datasets that span multiple timepoints. We have developed a modified version of PAR-CLIP that produces longer sequence reads and GOLDMINE, a tailored informatics pipeline that efficiently processes multiple terabytes of HiSeq data. GOLDMINE uses a characteristic read distribution pattern to separate true binding sites from random noise. Following identification of binding sites, it models the secondary structure of overlapping k-mer segments of these sites to identify conserved structures predictive of the presence of the binding protein. By using this workflow to find protein binding sites in activated and unactivated T cells, we are characterizing the regulatory activity of an RNA-binding protein that is critical to immune cell function.



P31
Poster Withdrawn


P32
ORCHID: a method for detecting short-range chromatin interactions in high-resolution 5C and Hi-C datasets

Subject: Machine learning

Presenting Author: Fei Ji, Massachusetts General Hospital, United States

Co-Author(s):
Sharmistha Kundu, Massachusetts General Hospital, United States
Robert Kingston, Massachusetts General Hospital, United States
Ruslan Sadreyev, Massachusetts General Hospital, United States

Abstract:

The chromatin interaction assays 5C and Hi-C are robust techniques to investigate spatial organization of the genome by capturing interaction frequencies between genomic loci. Although 5C and Hi-C resolution is theoretically restricted only by the length of digested DNA fragments (1Kb-4Kb), intrinsic stochastic noise and high frequencies of background interactions at the distances below 100 Kbp present a significant challenge to understanding short-distance chromatin organization. Here we present the shOrt Range Chromosomal Interaction Detection method (ORCHID) for a comprehensive high-resolution analysis of chromatin interactions in 5C and Hi-C experiments. This method includes background correction of raw interaction frequencies for individual primers or genomic bins, empirical correction for distance dependency of background noise, and detection of areas of significant interactions. When applied to publicly available datasets, ORCHID improves the identification of small (20-200Kb) interaction domains. Unlike larger classic TADs, these chromatin domains are often specific to cell type and functional state of the genomic region. In addition to the expected associations (e.g. with CTCF, cohesin, and mediator complexes), these domains show significant associations with other DNA-binding proteins. An important subtype of these small domains is fully covered and controlled by Polycomb Repressive Complex 1 (PRC1), which mediates transcriptional repression of many key developmental genes. As a separate unexpected example of a potential new mode of regulating chromatin interactions, the binding of RING1B, an essential subunit of the PRC1 complex, is also enriched near domain boundaries at the focused loci that do not necessarily correspond to repressed promoters.
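
The idea of correcting for the distance dependency of background interactions can be illustrated with a generic observed/expected normalization, in which each contact count is divided by the mean count seen at the same bin separation. This is a common, simple approach shown here under stated assumptions; the abstract does not describe ORCHID's actual correction, and the function name `distance_normalize` and the toy matrix are hypothetical.

```python
from collections import defaultdict


def distance_normalize(contacts):
    """Divide each contact count by the mean count at the same bin separation.

    `contacts` is a symmetric square matrix (list of lists) of interaction
    counts between equally sized genomic bins. Returns an observed/expected
    matrix. Generic sketch; not ORCHID's implementation.
    """
    n = len(contacts)
    totals = defaultdict(float)
    counts = defaultdict(int)
    # Accumulate counts per separation using the upper triangle only.
    for i in range(n):
        for j in range(i, n):
            totals[j - i] += contacts[i][j]
            counts[j - i] += 1
    expected = {d: totals[d] / counts[d] for d in totals}
    result = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            e = expected[abs(i - j)]
            result[i][j] = contacts[i][j] / e if e > 0 else 0.0
    return result


# Toy 3-bin matrix (made-up counts): bins 1 and 2 interact more than
# expected for adjacent bins, bins 0 and 1 less.
toy = [
    [10, 4, 1],
    [4, 10, 8],
    [1, 8, 10],
]
oe = distance_normalize(toy)
```

Values above 1 in the result mark pairs that interact more often than is typical for their separation, which is the kind of signal a subsequent significance test would examine.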



P33
Clustering of Protein Conformations using Parallelized Dimensionality Reduction

Subject: Optimization and search

Presenting Author: Arpita Joshi, University of Massachusetts, Boston, United States

Co-Author(s):
Nurit Haspel, UMass Boston, United States

Abstract:

Analyzing the conformational pathways that a macromolecule undergoes is imperative to understanding its function and dynamics. We present a combination of techniques to sample the conformational landscape of proteins better and faster. Datasets representing these landscapes of protein folding and binding are complex and high dimensional. Therefore, there is a need for dimensionality reduction methods that best preserve the variance in the data and facilitate its analysis. The crux of this work lies in the way this is done. We start with a non-linear dimensionality reduction technique, Isomap, which has been shown to produce better results than linear dimensionality reduction in approximating the complex intricacies of protein folding. However, the algorithm is computationally intensive for large proteins or a large number of samples (samples here refer to the various conformations that are used to ascertain the pathway between two distinctly different structures of a protein). We present a parallel algorithm written in C, using OpenMP, that achieves an approximately twofold speed-up. The results obtained are consistent with those obtained using sequential Isomap. Our method uses a distance function to calculate the distance between points, which in turn measures the similarity between the conformations that these points represent. The output is a lower-dimensional projection that can be used later for visualization and analysis. Quantitative validation is provided by a least-RMSD computation between the two embeddings. The algorithm also makes efficient use of the available memory.
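
For readers unfamiliar with Isomap, the sketch below shows its first two steps in sequential Python: building a k-nearest-neighbor graph and computing geodesic (shortest-path) distances with the Floyd-Warshall algorithm. The final embedding step (classical multidimensional scaling) is omitted. Euclidean distance stands in for the distance function the abstract leaves unspecified, and the function name and toy points are hypothetical; this is not the authors' C/OpenMP code.

```python
import math


def geodesic_distances(points, k):
    """Approximate geodesic distances as in the first two steps of Isomap.

    1. Connect each point to its k nearest neighbors (Euclidean distance).
    2. Run Floyd-Warshall to get shortest-path lengths through that graph.
    """
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    inf = float("inf")
    graph = [[inf] * n for _ in range(n)]
    for i in range(n):
        graph[i][i] = 0.0
        others = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        for j in others[:k]:
            # Keep the neighborhood graph symmetric (undirected edges).
            graph[i][j] = dist[i][j]
            graph[j][i] = dist[i][j]
    for m in range(n):  # intermediate vertex
        for i in range(n):
            for j in range(n):
                via = graph[i][m] + graph[m][j]
                if via < graph[i][j]:
                    graph[i][j] = via
    return graph


# Toy L-shaped path: the straight-line distance between the end points is 5,
# but the distance along the neighborhood graph is 3 + 4 = 7.
toy_points = [(0, 0), (3, 0), (3, 4)]
geo = geodesic_distances(toy_points, k=1)
```

The triple loop is cubic in the number of conformations, which is consistent with the abstract's remark that the algorithm becomes costly for many samples and motivates parallelization.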



P34
Optimizing nontuberculous mycobacteria (NTM) de novo genome assemblies for application in clinical case studies

Subject: Optimization and search

Presenting Author: Sara Kammlade, National Jewish Health, United States

Co-Author(s):
Nabeeh Hasan, National Jewish Health, United States
L. Elaine Epperson, National Jewish Health, United States
Michael Strong, National Jewish Health, United States
Rebecca Davidson, National Jewish Health, United States

Abstract:

To enable studies related to bacterial acquisition and clinical infections of nontuberculous mycobacteria (NTM), we developed a standardized bioinformatic analysis pipeline to process sequenced bacterial isolates from paired-end Illumina reads to fully annotated genomes and a companion PostgreSQL genomic database. Our NTM Genomes Database includes 1200+ isolates from 20 different NTM species which have been processed through our automated and optimized steps for read-trimming, de novo genome assembly, species identification using the average nucleotide identity (ANI) method, contig-ordering against a reference genome, and comprehensive annotation of genomic features. To optimize genome assembly methods and explore the theoretical potential of assembling complete genomes in the context of NTM, we performed experiments testing different parameter combinations in Skewer, SPAdes, and Unicycler on sequences from Illumina MiSeq (2x300bp) and HiSeq (2x250bp) platforms as well as on synthetic reads of varying read lengths and sequencing depths derived from published complete genomes. Assemblies from Illumina data revealed a negative effect of high GC content on assembly quality as measured by NG50. SPAdes and Unicycler yielded similar quality assemblies with Unicycler yielding fewer small (<1Kbp) contigs. From the synthetic reads we found diminishing returns on NG50 improvement beyond 25x coverage at 250bp, and assembled a single contig genome using 50Kbp reads at 60x coverage. Using our high quality genomes we are able to identify core and accessory genes and investigate clinically relevant genotype-phenotype relationships. As an example, we will share findings from a case study of bacterial genomic evolution during a long-term pulmonary infection.
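
Since assembly quality above is measured by NG50, a short sketch of that statistic may help: NG50 is the length of the contig at which the cumulative length of contigs, taken from longest to shortest, first reaches half of the reference (or estimated) genome size. The contig lengths and genome sizes below are made-up toy values, and the function is an illustration rather than the pipeline's code.

```python
def ng50(contig_lengths, genome_size):
    """Return NG50 for an assembly, or 0 if it covers less than half the genome.

    Contigs are visited from longest to shortest; the result is the length of
    the contig at which the running total first reaches genome_size / 2.
    """
    half = genome_size / 2
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running >= half:
            return length
    return 0


# Toy assembly (made-up lengths, in base pairs).
contigs = [400, 300, 200, 100]

# Against an assumed genome size of 1600 bp, half is 800: 400 + 300 + 200 reaches it.
ng50_value = ng50(contigs, genome_size=1600)

# Using the assembly's own total length instead gives the ordinary N50.
n50_value = ng50(contigs, genome_size=sum(contigs))
```

Unlike N50, NG50 is anchored to the genome size rather than the assembly size, so it stays comparable across assemblies of the same organism that differ in total length.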



P35
PredHPI: an integrated web-server platform for the prediction and visualization of host-pathogen interactions

Subject: web services

Presenting Author: Rakesh Kaundal, Utah State University, United States

Co-Author(s):
Cristian Loaiza, Utah State University, United States

Abstract:

Understanding the mechanisms underlying infectious diseases is fundamental to developing prevention strategies. Host-pathogen interactions, which range from the initial invasion of host cells by the pathogen through the proliferation of the pathogen in its host, have been studied to find potential genomic targets for the development of novel drugs, vaccines, and other therapeutics. A few in silico prediction methods have been developed to infer novel host-pathogen interactions; however, there is no single framework that combines those approaches to produce and visualize a comprehensive analysis of host-pathogen interactions. We present a web server platform named PredHPI, available at http://bioinfo.usu.edu/PredHPI/. PredHPI is composed of independent sequence-based tools for the prediction of host-pathogen interactions. The Interolog module, which includes some of the IMEX databases (HPIDB, MINT, DIP, BioGRID and IntAct), provides three comparison flavors using the BLAST homology results (best-match, rank-based and generalized). The Domain module predicts domains using Pfam and HMMer and predicts interactions using the 3DID and IDDI databases. The GO Similarity module uses some of the Bioconductor species databases and the GOSemSim R package to calculate similarities between the GO terms detected using InterProScan. PredHPI incorporates functionality to visualize the resulting interaction networks and integrates several databases with enriched information about the proteins involved. To our knowledge, PredHPI is the first system to build and visualize interaction networks from sequence-based methods as well as curated databases. We hope that our prediction tool will be useful for researchers studying infectious diseases.



P36
An Automated Case Notes System for Psychiatrists Using Text Mining

Subject: Machine learning

Presenting Author: Nazmul Kazi, Montana State University, United States

Co-Author(s):
Indika Kahanda, Montana State University, United States

Abstract:

Current health care systems require clinicians to spend a substantial amount of time digitally documenting their interactions with their patients through the use of electronic health records (EHRs), limiting the time spent on face-to-face patient care. Moreover, the use of EHRs is known to be highly inefficient due to the additional time it takes to complete them, which also leads to clinician burnout. In this project, we explore the feasibility of developing an automated case notes system for psychiatrists using text mining techniques that will listen to doctor-patient conversations, generate digital transcripts using speech-to-text conversion, classify information from the transcripts by identifying important keywords, and automatically generate structured case notes.

In our preliminary work, we develop a human-powered doctor-patient transcript annotator and obtain a gold standard dataset through the National Alliance on Mental Illness (NAMI) Montana. We model the task of classifying parts of conversations into six broad categories, such as medical and family history, as a supervised classification problem and apply several popular machine learning algorithms. According to our preliminary experimental results obtained through 5-fold cross validation, Support Vector Machines are able to classify an unseen transcript with an average AUROC (area under the receiver operating characteristic curve) score of 89%. Currently, we are working on developing information extraction techniques to generate structured case notes from this classified information. At the same time, we are investigating the recording environment most effective for automatically transcribing a doctor-patient conversation using existing speech-to-text tools with built-in multi-speaker detection capability.
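
To make the evaluation metric concrete, the following sketch computes AUROC from scratch and builds 5 held-out folds. It is only an illustration: the function names, the interleaved fold assignment, and the toy labels and scores are assumptions, and the classifier itself (an SVM in the abstract) is not reproduced here.

```python
def auroc(labels, scores):
    """Probability that a randomly chosen positive example receives a
    higher score than a randomly chosen negative one (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def k_fold_indices(n_items, k=5):
    """Split range(n_items) into k interleaved held-out folds."""
    return [list(range(i, n_items, k)) for i in range(k)]


# Toy labels and classifier scores (made up for illustration).
toy_labels = [1, 0, 1, 0]
toy_scores = [0.9, 0.8, 0.3, 0.1]
print(auroc(toy_labels, toy_scores))
```

In a cross-validated setting, the scores for each held-out fold would come from a model trained on the remaining folds, and the per-fold AUROC values would then be averaged.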



P37
A systems biology approach to define essential kinases in small cell lung cancer

Subject: other

Presenting Author: Jihye Kim, University of Colorado Denver Anschutz Medical Campus, United States

Co-Author(s):
Daniel Foster, National Jewish Health, United States
Rangnath Mishra, National Jewish Health, United States
James Finigan, National Jewish Health, United States
Jeffrey Kern, National Jewish Health, United States
Aik Choon Tan, University of Colorado Denver, United States

Abstract:

Small cell lung cancer (SCLC) is a deadly cancer: its five-year survival rate is < 7%, and it kills approximately 30,000 people this year. Treatment of SCLC using the chemotherapy combination of cisplatin and etoposide with radiation therapy has not changed in almost 30 years. Therefore, novel therapies are needed for this disease. Building on the role of kinases in regulating cell growth and survival, we hypothesized that kinases regulate cell survival pathways in SCLC (essential kinases) and that they may be effective targets as novel monotherapy, or act synergistically with standard chemotherapy, to improve therapeutic outcome. To test this hypothesis, we employed a systems biology approach to identify essential kinases in SCLC. We performed in vivo kinome-wide screening using an shRNA library targeting human kinases on seven chemonaïve SCLC patient-derived xenografts (PDX). We developed a suite of bioinformatics tools to deconvolute the kinome screening data, and identified 23 essential kinases found in two or more PDX models. The top essential kinases were RET, MTOR and ATM. We connected these kinases to our drug database to identify specific inhibitors as potential therapy and performed in vitro and in vivo validation of their efficacy. Notably, monotherapy with a small molecule inhibitor targeting mTOR significantly reduced SCLC tumor growth in vivo, proving mTOR’s essential kinase function. In addition, mTOR inhibition synergized with standard chemotherapy to significantly augment tumor responses in SCLC PDX models. These results warrant further investigation of mTOR inhibitors combined with chemotherapy as a novel treatment for SCLC.



P38
Comparative Analysis of Germline Microsatellites in the 1,000 Genomes Project

Subject: Metagenomics

Presenting Author: Nicholas Kinney, Virginia College of Osteopathic Medicine, United States

Co-Author(s):
Kyle Titus-Glover, Virginia Tech, United States
Jonathan Wren, Oklahoma Medical Research Foundation, United States
Robin Varghese, Edward Via College of Osteopathic Medicine, United States
Pawel Michalak, Edward Via College of Osteopathic Medicine, United States
Han Liao, Virginia Tech, United States
Ramu Anandakrishnan, Edward Via College of Osteopathic Medicine, United States
Arichanah Pulenthiran, Edward Via College of Osteopathic Medicine, United States
Lin Kang, Edward Via College of Osteopathic Medicine, United States
Harold Garner, Edward Via College of Osteopathic Medicine, United States

Abstract:

Abstract publication declined



P39
Omic profiling in healthy volunteers taking celecoxib reveals novel biomarkers regulated by cyclooxygenase-2

Subject: System integration

Presenting Author: Nicholas Kirkby, Imperial College London, United Kingdom

Co-Author(s):
Sarah Mazi, Imperial College London, United Kingdom
Timothy Warner, Queen Mary University of London, United Kingdom
Jane Mitchell, Imperial College London, United Kingdom

Abstract:

Introduction: Nonsteroidal anti-inflammatory drugs (NSAIDs) work by blocking cyclooxygenase (COX)-2 and are amongst the most commonly taken drugs worldwide, but they also cause cardiovascular toxicity. Because of the drugs' widespread use, this toxicity is a major concern, but no biomarkers or detailed mechanistic pathways are known. To begin to address this, we have performed a first-of-its-kind transcriptomic and proteomic analysis of blood samples from healthy volunteers taking an NSAID.

Method: Blood was collected from n=8 healthy male volunteers pre/post 7 days treatment with celecoxib (200mg b.i.d.). The transcriptome was measured with Illumina HumanHT-12v4 arrays and proteome using label-free UPLC-MS/MS with fragments identified using Mascot software. Data were analysed using Genespring and R/Limma by moderated t-test and interpreted using a p<0.05 threshold for transcriptomics and a p<0.1 discovery threshold for proteomics. Pathway analysis was performed using g:Profiler.

Results: Transcription of 104 mapped genes was altered by celecoxib treatment. Pathway analysis revealed enrichment of genes associated with type I interferon responses, cholesterol metabolism and vasoconstriction. Levels of 26 plasma proteins (of ≈460 identifiable proteins) were also altered. In agreement with the interferon signature seen in the transcriptome, pathway analysis of the proteome data revealed altered proteins mapping to changes in acute inflammatory and acute-phase response networks.

Conclusion: This study is the first to apply unbiased ‘omic’ profiling to the question of NSAID cardiovascular toxicity. This proof-of-concept study has provided viable novel targets for the generation of mechanistic hypotheses as well as potential biomarkers to identify those most at risk of cardiovascular side effects.



P40
Poster Withdrawn


P41
Searching for translatable alternative splice isoforms in the human proteome

Subject: other

Presenting Author: Maggie Pui Yu Lam, University of Colorado Anschutz Medical Campus, United States

Co-Author(s):
Edward Lau, Stanford University, United States

Abstract:

The human genome contains over 100,000 alternative splice isoform transcripts, but the biological functions of most isoform transcripts remain unknown and many are not translated into mature proteins. A full appreciation of the biological significance of alternative splicing therefore requires knowledge of isoforms at the protein level, such as through mass spectrometry-based proteomics. One described approach is to perform in-silico translation of alternative transcripts, and then to use the resulting custom FASTA protein sequence databases with a database search engine for protein identification in shotgun proteomics. However, challenges remain, as custom protein databases often contain many sequences that are in fact not translated as proteins inside the cell, thus contributing to a high false discovery rate in proteomics experiments.

We describe here a computational workflow and software to generate custom protein databases of alternative isoform sequences using RNA-seq data as input. The workflow is designed with the explicit goal of minimizing untranslated sequences to rein in false positives. To evaluate its performance, we processed public RNA sequencing data from ENCODE to build custom FASTA databases for 10 human tissues (adrenal gland, colon, esophagus, heart, lung, liver, ovary, pancreas, prostate, testis). We applied the databases to identify unique splice junction peptides from public mass spectrometry data of the same human tissues on ProteomeXchange. We identified 1,984 protein isoforms, including 345 unique splice-specific peptides not currently documented in common proteomics databases. We suggest that the described proteotranscriptomics approach may help reveal previously unidentified alternative isoforms, and aid in the study of alternative splicing.
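
The generic idea of in-silico translation into a custom FASTA database can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' workflow: it translates from the first ATG to the first in-frame stop codon using the standard genetic code, applies none of the filtering the abstract describes, and the function and record names are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(transcript):
    """Translate from the first ATG to the first in-frame stop codon.
    Unrecognized codons (e.g. containing N) become 'X'."""
    seq = transcript.upper().replace("U", "T")
    start = seq.find("ATG")
    if start == -1:
        return ""
    protein = []
    for i in range(start, len(seq) - 2, 3):
        aa = CODON_TABLE.get(seq[i:i + 3], "X")
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


def to_fasta(proteins, width=60):
    """Format a {name: protein sequence} mapping as FASTA text,
    skipping empty translations."""
    lines = []
    for name, seq in proteins.items():
        if not seq:
            continue
        lines.append(f">{name}")
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n" if lines else ""


# Toy transcript (made up); "isoform_1" is a hypothetical identifier.
print(to_fasta({"isoform_1": translate("GGATGGCCTTTTAAGG")}))
```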



P42
A human disease network from gene-publication relationships on PubMed

Subject: inference and pattern discovery

Presenting Author: Edward Lau, Stanford University, United States

Co-Author(s):
Cody Thomas, University of Colorado AMC, United States
Maggie Pui Yu Lam, University of Colorado AMC, United States

Abstract:

Human diseases can be represented as a network connecting similar disorders based on their shared phenotypic and molecular characterizations. Network analysis of disease-disease relationships can yield insights into important biological processes and pathogenic pathways. We recently described a method to determine the semantic similarity between a gene or protein and the literature publications related to a disease, by combining PubMed web queries and curated/text-mined annotations of gene-PMID links from NCBI. We devised a weighted co-publication distance metric to score gene-disease co-occurrences in PubMed, where genes with many non-specific publications are down-ranked whereas recent and high-impact publications are given more weight. We show that this method outperforms existing bibliometric analysis in predicting benchmark gene lists of disease terms. Using this method, we have now compiled significant protein lists from over 20,000 human disease or disease phenotype terms from three standardized vocabularies, namely Disease Ontology (DO), Human Phenotype Ontology (HPO), and Pathway Ontology (PWO). We find that disease terms are associated with specific popular protein lists that inform on protein-disease relationships. The PubMed-based disease network recapitulates several known properties from previous "diseasomes" constructed from OMIM or phenotypic similarity data (e.g., Barabási 2007), including the centrality of metabolic diseases and clustering of related diseases around high-level hub terms. We discuss applications for the disease network, including (i) finding commonly associated diseases from a list of differentially expressed genes in an RNA-seq experiment, and (ii) using gene-disease relationships to predict hidden disease genes in a particular disease.



P43
Unbiased Pathway Detection Expands Cancer Pathways

Subject: Networking

Presenting Author: Chih-Hsu Lin, Baylor College of Medicine, United States

Co-Author(s):
Stephen Wilson, Baylor College of Medicine, United States
Teng-Kuei Hsu, Baylor College of Medicine, United States
Minh Pham, Baylor College of Medicine, United States
Olivier Lichtarge, Baylor College of Medicine, United States

Abstract:

Pathways are functional gene groups that represent how signals are transmitted and received and which genes and proteins interact. Conventionally, domain experts annotate pathways based on the literature. Thus, the unbiased detection of functional gene groups based solely on the gene-gene interaction network structure may provide novel insights. Here, we hypothesized that gene members in a functional gene group interact within the group more than outside the group. We developed a Recursive Louvain algorithm to detect communities (i.e., clustered gene groups) on a human protein-protein interaction network. 81.9% of the communities overlapped significantly with known pathways compared to random controls, whereas 622 communities did not and may be novel gene groups. In addition, variants of genes overlapping with communities are more likely to be pathogenic in ClinVar and to have high quantified evolutionary impact (p << 0.0001). As a case study in head and neck cancer, we found that 38 communities are under significant mutational selection (q<0.1). When patients were integratively clustered on mutation, copy number variation, RNA and miRNA expression data, one community separated patient survival (q=0.008). Furthermore, we designed neural network (NN) model architectures based on communities to predict human papillomavirus status. The results showed that NNs based on communities outperformed NNs based on random gene groups and performed similarly to, if not better than, fully connected NNs. In conclusion, these data suggest that the communities recover known functional and disease pathways, could be used as cancer survival predictors, and could capture underlying gene compositions of biological phenotypes to make NN models more interpretable. This study will aid the understanding of cancer pathways and provide biomarkers for cancer patients.
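
The hypothesis that members of a functional gene group interact more within the group than outside it can be checked on a toy network by counting internal versus boundary edges. This sketch is not the Recursive Louvain algorithm itself; the function name and the toy edge list are assumptions for illustration only.

```python
def edge_balance(edges, group):
    """Count edges with both endpoints inside `group` (internal) and
    edges with exactly one endpoint inside it (boundary)."""
    members = set(group)
    internal = sum(1 for u, v in edges if u in members and v in members)
    boundary = sum(1 for u, v in edges if (u in members) != (v in members))
    return internal, boundary


# Toy undirected interaction network (made-up gene labels).
toy_edges = [("A", "B"), ("B", "C"), ("A", "C"), ("C", "D"), ("D", "E")]
internal, boundary = edge_balance(toy_edges, {"A", "B", "C"})
print(internal, boundary)
```

Community detection methods in the Louvain family search for groupings that maximize modularity, a score that rewards this kind of excess of internal edges relative to a random expectation.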



P44
Identifying HCV-Host Interactions from Amino Acids Sequences Using SVM

Subject: Machine learning

Presenting Author: Xin Liu, Xuzhou Medical University, China

Co-Author(s):
Wei Geng, Xuzhou Medical University, China
Dan Wang, Xuzhou Medical University, China
Xue Piao, Xuzhou Medical University, China
Ting Yang, Xuzhou Medical University, China

Abstract:

Detecting the interactions between the hepatitis C virus (HCV) and human proteins will facilitate our understanding of HCV pathogenesis and help in the search for new drug targets. Many researchers have studied protein–protein interactions (PPIs) from a computational perspective, but most of their methods were designed for PPIs within the same species and are not suited to interactions between different species. In this paper, we developed a novel computational model to predict interactions between HCV and human proteins. As the position specific scoring matrix (PSSM) not only preserves the positional information of the sequence but also retains the chemical information of the protein, we used the local directional texture pattern (LDTP) to further extract information from the PSSM. Then, a support vector machine (SVM) was used to perform the classification. On the HCV dataset, the proposed model achieved an accuracy of 86.7%, which was superior to most of the previous methods. On an independent dataset, the accuracy reached 73.9%. We also compared our method with some state-of-the-art algorithms, and the results showed that the proposed method is simple, effective, and can be used for future proteomics research.



P45
Modeling the Structure of BioGRID PPI Networks

Subject: Qualitative modeling and simulation

Presenting Author: Sridevi Maharaj, University of California-Irvine, United States

Co-Author(s):
Pedro Silva, University of California-Irvine, United States
Zarin Ohiba, University of California-Irvine, United States
Wayne Hayes, University of California-Irvine, United States

Abstract:

Protein-protein interaction (PPI) networks are being continuously updated but are still incomplete, sparse, and have false positives and negatives. Amongst the heuristics employed to describe network topology, graphlets have proven successful in quantifying the local structure of biological networks. Some studies analyzing the graphlet degree distributions and relative graphlet frequency found Geometric (GEO) networks to be a reasonable basis for modeling PPI networks. However, all extensive studies to model PPI networks as a whole utilized older PPI network data. While there are numerous techniques through which PPI data can be curated, in this study we re-evaluate these models on the newest PPI data available from BioGRID for the following nine species: AThaliana, CElegans, DMelanogaster, EColi, HSapiens, MMusculus, RNorvegicus, SCerevisiae, and SPombe. To the best of our knowledge, this has not yet been performed, as the data is relatively new. We compare the graphlet distributions of several models to distributions of the updated networks and analyze their fit using several measures that have been shown to be suitable for measuring network distances (or similarities): RGFD, GDDA, Graphlet Kernel, and GCD. Despite minor behavioral differences amongst the comparison measures, we find that, other than the Sticky model, the Scale-Free Gene Duplication and Divergence (SFGD) and Scale-Free (SF) models unanimously outperform other traditional models (including GEO and GEOGD) in matching the structure of these 9 BioGRID PPI networks. We further corroborate these results using machine learning classifiers to categorize each species as a network model and visualize these results using t-SNE plots.



P46
SelecT: A SQL-based, high-resolution selection scanning tool to identify genomic selection signals using next-generation sequencing data

Subject: inference and pattern discovery

Presenting Author: Hannah Maltba, Brigham Young University, United States

Co-Author(s):
Sean Beecroft, Brigham Young University, United States
Spencer Smith, Brigham Young University, United States

Abstract:

Next-generation sequencing (NGS) enables high-resolution, genomic-based evolution studies, but an exhaustive, genome-wide selection scan on NGS data is computationally intensive. We developed a comprehensive software tool, Selection Test (SelecT), to identify specific loci under selection using NGS data at base pair resolution. SelecT calculates five statistics to identify evolutionary patterns in allele frequency (Fst, ∆DAF) and haplotype homozygosity (EHH, iHS, XPEHH) that locate regions with multiple, strong evolutionary signals. Data and results are stored in a database, making it easy to visualize results and run additional queries on the data. Multiple tests can be run on already imported populations without reloading any files, and users may choose to run statistics one at a time or all at once. SelecT has an intuitive user interface, runs in only minutes, and can be used to identify selection in any organism.
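
The two allele-frequency statistics named above can be illustrated for a single biallelic site. This sketch assumes two equally weighted populations and uses the textbook heterozygosity-based definition of Fst; the estimators actually implemented in SelecT may differ, and the function names are hypothetical.

```python
def fst(p1, p2):
    """Wright's Fst at one biallelic site for two equally weighted
    populations with allele frequencies p1 and p2."""
    p_bar = (p1 + p2) / 2
    h_total = 2 * p_bar * (1 - p_bar)          # expected heterozygosity, pooled
    if h_total == 0:
        return 0.0                             # site is fixed in both populations
    h_within = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2
    return (h_total - h_within) / h_total


def delta_daf(daf_target, daf_reference):
    """Difference in derived allele frequency between two populations."""
    return daf_target - daf_reference


# Toy frequencies (made up for illustration).
print(fst(0.2, 0.8), delta_daf(0.8, 0.2))
```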



P47
Codon Pairs are Phylogenetically Conserved: Codon pairing as a novel phylogenetic character state for parsimony and alignment-free methods

Subject: inference and pattern discovery

Presenting Author: Lauren McKinnon, Brigham Young University, United States

Co-Author(s):
Justin Miller, Brigham Young University, United States

Abstract:

Identical codon pairing and co-tRNA codon pairing increase translational efficiency within genes when two codons that encode the same amino acid are located within a ribosomal window. By examining identical and co-tRNA codon pairing independently and combined across 23,423 species, we determined that both pairing techniques are phylogenetically informative using either an alignment-free or parsimony framework in all domains of life. We also determined that the minimum optimal window size for conserved codon pairs is typically smaller than the length of a ribosome. We thoroughly analyzed codon pairing across various taxonomic groups, determining which codons are more likely to pair and comparing the frequencies of codon pairings between species. The alignment-free method does not require orthologous gene annotations and recovers species relationships that are more congruent with established phylogenies than other alignment-free techniques in all instances. Parsimony recovers trees that are more congruent with the established phylogenies than the alignment-free method in four out of six taxonomic groups. Four taxonomic groups do not have sufficient ortholog annotations and are excluded from the parsimony and/or maximum likelihood analyses. Using only codon pairing, the alignment-free or parsimony-based approaches recover the most congruent trees compared with the established phylogenies in six out of ten taxonomic groups. Since the recovered phylogenies using only codon pairing largely match established phylogenies, we propose that codon pairing biases are phylogenetically conserved and should be considered in conjunction with current techniques in future phylogenomic studies.

Availability: https://github.com/ridgelab/codon_pairing
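
As an illustration of identical codon pairing only (co-tRNA pairing is not covered), the sketch below reports codons that recur within a fixed window of codons in a coding sequence. It is not taken from the repository above: the function name, the default window size, and the toy sequence are assumptions.

```python
def identical_codon_pairs(cds, window=5):
    """Return the set of codons that reappear within `window` codons
    downstream of a previous occurrence of the same codon."""
    usable = len(cds) - len(cds) % 3           # ignore a trailing partial codon
    codons = [cds[i:i + 3].upper() for i in range(0, usable, 3)]
    last_seen = {}
    paired = set()
    for position, codon in enumerate(codons):
        if codon in last_seen and position - last_seen[codon] <= window:
            paired.add(codon)
        last_seen[codon] = position
    return paired


# Toy coding sequence (made up): GCT AAA GCT TTT GGG CCC GGG
toy_cds = "GCTAAAGCTTTTGGGCCCGGG"
print(sorted(identical_codon_pairs(toy_cds, window=2)))
```

The presence or absence of each paired codon in a gene or species could then serve as a binary character state, which is one way such data might feed a parsimony or alignment-free comparison.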



P48
ExtRamp: A novel algorithm for extracting the ramp sequence based on the tRNA adaptation index or relative codon adaptiveness

Subject: inference and pattern discovery

Presenting Author: Justin Miller, Brigham Young University, United States

Co-Author(s):
Logan Brase, Brigham Young University, United States
Perry Ridge, Brigham Young University, United States

Abstract:

Abstract publication declined



P49
Metabolic profiling using a UHPLC-MS/MS-based platform to quantify amines, amino acids and methylarginines in plasma from cyclooxygenase-2 knockout mice

Subject: other

Presenting Author: Jane Mitchell, Imperial College London, United Kingdom

Co-Author(s):
Elizabeth Want, Imperial College London, United Kingdom
Blerina Ahmetaj-Shala, Imperial College London, United Kingdom
Michael Olanipekun, Imperial College London, United Kingdom
Abel Tesfai, Imperial College London, United Kingdom
Yu He, Imperial College London, United Kingdom
Rolf Nüsing, Imperial College London, United Kingdom
Nicholas Kirkby, Imperial College London, United Kingdom

Abstract:

Amine quantification is an important area in biomedical research and in patient stratification for personalised medicine. One important pathway critically reliant on amine levels is nitric oxide synthase (NOS). NOS forms NO, which mediates processes in the cardiovascular, immune and nervous systems. NO release is regulated by levels of (i) the substrate, arginine, (ii) amino acids which cycle with arginine and (iii) methylarginine inhibitors of NOS. However, measurement of a wide range of amines, including methylarginines, on a common platform has been challenging. To address this, we have recently reported on an analytical method where a wide range of amines, including amino acids and methylarginines, can be measured in a common plasma sample. Using high-throughput ultra-high-performance liquid chromatography-tandem mass spectrometry (UHPLC-MS/MS), ≈40 amine analytes, including arginine and methylarginines, were detected and quantified on a molar basis (Ahmetaj-Shala et al., 2018; Scientific Reports, 13987). Our previous work using transcriptomic analysis revealed a relationship between the enzyme cyclooxygenase-2 in the kidney, renal function and genes that regulate arginine and methylarginine levels (Ahmetaj-Shala et al., 2015, Circulation, 132, 633-642). In the current study we applied the UHPLC-MS/MS platform to the measurement of amines and methylarginines in plasma from cyclooxygenase-2 knockout mice. Principal component analysis showed separate clustering of the two groups, and quantification of analytes confirmed increases in methylarginine levels in plasma from cyclooxygenase-2 knockout mice. These results illustrate the useful application of this platform and indicate that genetic deletion of cyclooxygenase-2 increases methylarginine levels and disrupts the amine metabolome.



P50
The characterization of different cell types using the Benford law

Subject: inference and pattern discovery

Presenting Author: Sne Morag, Ariel University, Israel

Co-Author(s):
Mali Salmon-Divon, Ariel University, Israel

Abstract:

Classification of cells, and specifically of their cell type and origin, is essential for science and medicine. Approaches for cell-type identification were previously limited to the detection of known cell-surface markers. With the development of high throughput sequencing technologies, and single-cell sequencing in particular, novel methods for cell-type classification emerged that do not require specific markers. In this study we propose to develop a novel algorithm for cell-type classification based on the Benford law. The Benford law states that within a large numerical dataset, the leading digit’s occurrence probability drops as its value increases. Previously in our lab, it was shown that a cell's complete digital gene expression data has a Benford-like distribution and that genes' obedience to the Benford law can indicate their housekeeping or tissue-specific characteristics. In this study, the correlation of single-cell RNA-seq (scRNA-seq) data with the Benford distribution is established and examined for its cell-type classification ability. The adherence of scRNA-seq data to Benford is measured using mean absolute error (MAE), and machine-learning approaches are used to assess the ability of the Benford-based algorithm to classify single cells into their corresponding cell types. Our results show that scRNA-seq data follow the Benford distribution and that Benford-based calculations yield a better cell-type separation in comparison to expression-based analysis. To summarize, this study may yield a robust in-silico cell characterization tool that could be used in various medical applications, including identifying the origin of metastases, evaluating a cell’s potency level and identifying rare cells within a cell population. These could contribute to cancer diagnosis, treatment and regenerative medicine.
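
Adherence to the Benford law can be quantified as sketched below: the expected frequency of leading digit d is log10(1 + 1/d), and the mean absolute error (MAE) is taken over the nine digits. The function names and toy values are assumptions for illustration; how the authors preprocess expression values is not stated in the abstract.

```python
import math

# Expected Benford frequency for each leading digit 1..9.
BENFORD = {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def leading_digit(value):
    """First significant digit of a positive number, read from its
    scientific-notation representation."""
    return int(f"{value:.15e}"[0])


def benford_mae(values):
    """Mean absolute error between the observed leading-digit frequencies
    of the positive entries in `values` and the Benford expectation."""
    digits = [leading_digit(v) for v in values if v > 0]
    if not digits:
        raise ValueError("no positive values to evaluate")
    observed = {d: digits.count(d) / len(digits) for d in range(1, 10)}
    return sum(abs(observed[d] - BENFORD[d]) for d in range(1, 10)) / 9


# Toy inputs (made up): powers of two follow Benford closely,
# a constant vector does not.
print(benford_mae([2 ** k for k in range(100)]))
print(benford_mae([5] * 10))
```

A lower MAE indicates closer agreement with the Benford distribution; zero-valued entries, which are common in scRNA-seq count matrices, have no leading digit and are skipped here.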



P51
Poster Withdrawn


P52
Mining Heterogeneous Relationships from PubMed Abstracts Using Distant Supervision

Subject: Text Mining

Presenting Author: David Nicholson, University of Pennsylvania, United States

Co-Author(s):
Daniel Himmelstein, University of Pennsylvania, United States
Casey Greene, University of Pennsylvania, United States

Abstract:

Identifying mechanisms underlying disease and finding drugs that intervene in or prevent such mechanisms is an important task in biomedical sciences. One approach to identifying these drug targets is to combine evidence from multiple sources, including scientific publications, to model potential relationships between drugs, genes and diseases and interpolate novel relationships. Previously, a heterogeneous network (hetnet) called Hetionet pioneered such efforts by integrating various relationships from multiple data sources; however, building such a network requires hours upon hours of manual curation, which is not feasible at a larger scale. We aim to remedy this bottleneck by extracting multiple relationships from PubMed abstracts via a distant supervision approach. This approach circumvents the time-consuming task of obtaining “ground-truth” training labels via the data programming paradigm, which consists of using a simple set of programs, also called label functions, to probabilistically label large training datasets. Using these datasets, we then train machine learning classifiers to classify whether or not a sentence mentions a relationship. We evaluated this approach by assessing label function accuracy, determining if label functions can transfer between different relationship types and measuring the value of these estimated labels via downstream analyses. These analyses consisted of comparing bag of words, logistic regression and neural network-based methods. Overall performance remains modest, suggesting that label functions may need to be improved or that abstracts may not be sufficient for high-accuracy relationship extraction; however, our results also suggest that label functions can transfer across various relationships, suggesting that hetnet construction through this approach may be viable.
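
The idea of label functions can be sketched with two toy heuristics for a hypothetical "compound treats disease" relationship. Everything here is an assumption for illustration: the heuristics, the example sentences, and especially the unweighted vote, which stands in for the learned weighting of label functions used in the data programming paradigm.

```python
ABSTAIN, NEGATIVE, POSITIVE = 0, -1, 1


def lf_treat_keyword(sentence):
    """Vote POSITIVE when the sentence contains a 'treat' keyword."""
    return POSITIVE if "treat" in sentence.lower() else ABSTAIN


def lf_negation(sentence):
    """Vote NEGATIVE when the sentence contains a simple negation cue."""
    text = f" {sentence.lower()} "
    return NEGATIVE if " not " in text or "no effect" in text else ABSTAIN


LABEL_FUNCTIONS = [lf_treat_keyword, lf_negation]


def probabilistic_label(sentence, label_functions=LABEL_FUNCTIONS):
    """Fraction of non-abstaining label functions voting POSITIVE;
    returns 0.5 (uninformative) when every function abstains."""
    votes = [lf(sentence) for lf in label_functions]
    cast = [v for v in votes if v != ABSTAIN]
    if not cast:
        return 0.5
    return sum(v == POSITIVE for v in cast) / len(cast)


# Made-up example sentences.
for text in ["Drug X treats disease Y.", "Drug X did not treat disease Y."]:
    print(text, probabilistic_label(text))
```

The resulting soft labels would then serve as training targets for a downstream sentence classifier.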



P53
Mutational impact on protein structure and function in endometrial cancer

Subject: other

Presenting Author: Amanda Oliphant, Brigham Young University, United States

Co-Author(s):
Emily Hoskins, Brigham Young University, United States
Daniel Cui Zhou, Washington University in St. Louis, United States
David Adams, Brigham Young University, United States
Sean Beecroft, Brigham Young University, United States
Li Ding, Washington University in St. Louis, United States
Samuel Payne, Brigham Young University, United States

Abstract:

DNA mutation is a well-known driver for cancers, including endometrial and uterine cancer. Although many mutation sites have been discovered in large population studies like TCGA and CPTAC, the functional impact of these mutations often remains unclear. The sites of mutation in endometrial cancer are often not shared between individuals in a cohort, making it difficult to interpret genomic data. Here we show how integrating protein three-dimensional structure information can provide insight into the effects of mutations on protein function. We use bioinformatics tools to locate clustered mutations in a cohort of individuals with endometrial cancer. These mutational hotspots point to functional areas of a protein that are frequently disrupted in cancerous cells. Identifying these hotspots allows us to see how mutations that are not located near each other on a genomic scale may have the same effect on protein function. For example, we found a cluster of mutations in an F-box/WD repeat-containing protein and an E3 ubiquitin ligase which may interfere with its ability to regulate target proteins such as Cyclin-E. Patients exhibiting any of the mutations in a functionally significant cluster may be treated in a similar manner. We anticipate that our findings will aid in classifying the specific type of endometrial cancer present in an individual, allowing for more personalized treatment strategies. Because generalized treatment is often ineffective, these results have the potential to help us understand hard-to-treat forms of endometrial cancer, leading to better end results.



P54
A platform for community-scale transcriptome-wide association studies

Subject: Data management methods and systems

Presenting Author: YoSon Park, Perelman School of Medicine University of Pennsylvania, United States

Co-Author(s):
Casey Greene, Perelman School of Medicine University of Pennsylvania, United States

Abstract:

Transcriptome-wide association studies (TWAS) infer causal relationships between genes, phenotypes and tissues using strategies such as 2-sample Mendelian randomization (MR). Such methods largely eliminate the need to access individual-level data and allow openly sharing data and results. Nonetheless, to our knowledge, there are no public platforms automating quality assurance and continuous integration of TWAS results. Consequently, finding, replicating, and validating causal relationships among millions of similar non-causal relationships remain enormously challenging and are often time- and resource-consuming with many duplicated efforts.

To address this shortcoming, we develop a platform that uses version control software and continuous integration to construct a data resource for the components of TWAS. Community members can contribute additional association studies or methods. We use automated testing to catch formatting mistakes and use pull request functionality to review contributions. We provide a set of tools, available in a Docker container, that perform common downstream analyses using these resources.

Researchers who contribute summary-level datasets substantially increase the impact of their work by making it easy to integrate with complementary datasets. Those who contribute analytical tools will benefit by providing users with numerous off-the-shelf use cases. For this proof-of-concept, we integrate a set of eQTLs provided by the Genotype-Tissue Expression (GTEx) project and a set of curated GWAS summary statistics using 2-sample MR. Our long-term goal for this project is a public community-driven repository where users contribute new summary-level data, download complementary data, and add new analytical methods that enable the field to rapidly translate new studies into actionable findings.
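
As an illustration of why summary-level data suffice for 2-sample MR, the following minimal sketch computes a single-variant Wald ratio. The abstract does not state which MR estimator the platform uses, so the choice of the Wald ratio, the function name, and the numbers are assumptions made only for this example.

```python
import math


def wald_ratio(beta_exposure, beta_outcome, se_outcome):
    """Single-variant Wald ratio from summary statistics (illustrative only).

    beta_exposure: variant effect on the exposure (e.g. an eQTL effect on expression)
    beta_outcome:  variant effect on the outcome (e.g. a GWAS effect on a trait)
    se_outcome:    standard error of beta_outcome
    """
    if beta_exposure == 0:
        raise ValueError("variant has no effect on the exposure; the ratio is undefined")
    estimate = beta_outcome / beta_exposure
    # First-order approximation: ignores uncertainty in the exposure effect.
    std_error = se_outcome / abs(beta_exposure)
    return estimate, std_error


# Hypothetical summary statistics for one variant.
est, se = wald_ratio(beta_exposure=0.50, beta_outcome=0.10, se_outcome=0.02)
print(f"effect={est:.3f}, se={se:.3f}, z={est / se:.2f}")
```

Only per-variant effect sizes and standard errors enter the calculation, which is why no individual-level data are needed.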



P55
Good Nomen: An Interactive Web Application for Cleaning Clinical Data Using Standardized Terminologies

Subject: Graphics and user interfaces

Presenting Author: Alyssa Parker, Brigham Young University, United States

Co-Author(s):
Stephen Piccolo, Brigham Young University, United States

Abstract:

Terms used to describe medical conditions, treatments, tests, and outcomes vary widely within and across datasets. This creates difficulties when analyzing such data. For example, The Cancer Genome Atlas uses 13 different terms to describe cyclophosphamide, a drug commonly prescribed to breast-cancer patients. To address this problem, researchers have produced terminologies and thesauri that define standardized terms and synonyms for biomedical concepts. While these resources contain valuable information, restructuring clinical data to conform to these standards can be time-consuming, and doing so systematically requires computational expertise. We have developed Good Nomen, a Web application that allows users to standardize data interactively in a high-throughput manner. Good Nomen accepts data files in CSV, TSV, or Excel format. It then asks the user to select a terminology to use in the standardization process. Currently, we support the National Cancer Institute Thesaurus, ICD-10-CM, and the HGNC database. The user selects a column containing categorical data to standardize. Good Nomen then examines the data and uses regular expressions to suggest standardized terms and synonyms from the terminology. If the user accepts these matches, the data values are modified to match the specified terms. The user can also manually standardize the data to ensure that misspellings and additional synonyms are not overlooked. It is our hope that harnessing the power of these terminologies in an intuitive, user-friendly Web application will enable researchers to more easily standardize their data and expedite the process of clinical data analysis.
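
The regular-expression suggestion step described above might look roughly like the sketch below. This is not Good Nomen's code: the tiny terminology, the function names, and the column values are hypothetical and stand in for a real resource such as the NCI Thesaurus.

```python
import re

# Hypothetical mini-terminology: preferred term -> known synonyms.
TERMINOLOGY = {
    "Cyclophosphamide": ["cytoxan", "cyclophosphamid", "cyclophosphamidum"],
    "Doxorubicin": ["adriamycin", "doxorubicin hcl"],
}


def build_patterns(terminology):
    """Compile one case-insensitive pattern per preferred term."""
    patterns = {}
    for preferred, synonyms in terminology.items():
        # Longest alternatives first so the most specific spelling is tried first.
        alternatives = sorted([preferred] + synonyms, key=len, reverse=True)
        joined = "|".join(re.escape(a) for a in alternatives)
        patterns[preferred] = re.compile(rf"\b(?:{joined})\b", re.IGNORECASE)
    return patterns


def suggest(values, terminology):
    """Map each raw value to a suggested preferred term; unmatched values are omitted."""
    patterns = build_patterns(terminology)
    suggestions = {}
    for value in values:
        for preferred, pattern in patterns.items():
            if pattern.search(value):
                suggestions[value] = preferred
                break
    return suggestions


# Hypothetical column of categorical values to standardize.
column = ["CYTOXAN", "cyclophosphamide ", "Adriamycin", "tamoxifen"]
print(suggest(column, TERMINOLOGY))
```

Values with no suggestion (here "tamoxifen") are left for manual review, which mirrors the manual standardization step in the abstract.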



P56
Proteomics of natural bacterial isolates powered by deep learning-based de novo identification

Subject: Machine learning

Presenting Author: Samuel Payne, Brigham Young University, United States

Co-Author(s):
Joon-Yong Lee, Pacific Northwest National Laboratory, United States
Hugh Mitchell, Pacific Northwest National Laboratory, United States
Meagan Burnet, Pacific Northwest National Laboratory, United States
Sarah Jenson, Pacific Northwest National Laboratory, United States
Eric Merkley, Pacific Northwest National Laboratory, United States
Anil Shukla, Pacific Northwest National Laboratory, United States
Ernesto Nakayasu, Pacific Northwest National Laboratory, United States

Abstract:

The fundamental task in proteomic mass spectrometry is identifying peptides from their observed spectra. Where protein sequences are known, standard algorithms utilize these to narrow the list of peptide candidates. If protein sequences are unknown, a distinct class of algorithms must interpret spectra de novo. Despite decades of effort on algorithmic constructs and machine learning methods, de novo software tools remain inaccurate when used on environmentally diverse samples. Here we train a deep neural network on 5 million spectra from 55 phylogenetically diverse bacteria. This new model outperforms current methods by 25-100%. The diversity of organisms used for training also improves the generality of the model, and ensures reliable performance regardless of where the sample comes from. Significantly, it also achieves high accuracy on long peptides, which assist in identifying taxa from samples of unknown origin. With the new tool, called Kaiko, we analyze proteomics data from six natural soil isolates for which a proteome database did not exist. Without any sequence information, we correctly identify the taxonomy of these soil microbes as well as annotate thousands of peptide spectra.



P57
Text Mining Novel Disease- and Drug-Specific Pathways

Subject: Text Mining

Presenting Author: Minh Pham, Baylor College of Medicine, United States

Co-Author(s):
Stephen Wilson, Baylor College of Medicine, United States
Chih-Hsu Lin, Baylor College of Medicine, United States
Olivier Lichtarge, Baylor College of Medicine, United States

Abstract:

In response to the exponential growth of scientific publications, text mining is increasingly used to extract biological pathways and processes. Though multiple tools explore individual connections between genes, diseases, and drugs, not many extensively examine contextual biological pathways for specific drugs and diseases. In this study, we extracted more than 3,000 functional gene groups for specific diseases and drugs by applying a community detection algorithm to a literature network. The network aggregated co-occurrences of Medical Subject Headings (MeSH) terms for genes, diseases, and drugs in publications. The detected literature communities were groups of highly associated genes, diseases, and drugs. The communities significantly captured genetic knowledge of canonical pathways and recovered future pathways in time-stamped experiments. Furthermore, the disease- and drug-specific communities recapitulated known pathways for those given diseases and drugs. In addition, diseases in the same communities had high comorbidity with each other and drugs in the same communities shared large numbers of side effects, suggesting that they shared mechanisms. Indeed, the communities robustly recovered mutual targets for drugs (AUROC = 0.75) and shared pathogenic genes for diseases (AUROC = 0.82). These data show that the literature communities not only represented known biological processes but also suggested novel disease- and drug-specific mechanisms, facilitating disease gene discovery and drug repurposing.



P58
A Case Study on the Effects of Noisy, Long-read Correction Approaches on Assembly Contiguity

Subject: other

Presenting Author: Brandon Pickett, Brigham Young University, United States

Co-Author(s):
Justin Miller, Brigham Young University, United States
Perry Ridge, Brigham Young University, United States

Abstract:

Third-generation sequencing technologies are advancing our ability to sequence increasingly long DNA sequences in a high-throughput manner. Pacific Biosciences (PacBio) Single-molecule, Real-time (SMRT) sequencing and Oxford Nanopore Technologies (ONT) nanopore sequencing routinely produce raw sequencing reads averaging 20-30kbp in length. Maximum read lengths have, in some cases, exceeded 100kbp. Unfortunately, these long reads are expensive to generate and have a high error rate (10-15%) when compared with Illumina short reads (1%). The limitation on assembly from high error rates can be mitigated by (a) co-assembling high-error, long reads with low-error, short reads (e.g., MaSuRCA) or (b) correcting the errors prior to assembly. Pre-assembly error correction typically happens by either (a) self-correction or (b) hybrid correction. Self-correction requires increased sequencing depth (and thus expense) and can be done with stand-alone software (e.g., Racon) or via a module in an assembler (e.g., Canu). Hybrid correction involves alignment of low-error, short reads to the raw long reads to generate the consensus (e.g., CoLoRMap). Note that low-error, short reads can also be used to polish the assembled contigs, i.e., correct misassemblies and errors. To investigate how self-correction, hybrid correction, or both correction methods affect assembly contiguity, we tried each approach in a case study. Bonefish (Albula glossodonta) DNA was extracted and sequenced on PacBio Sequel to theoretical 70x coverage and on Illumina HiSeq 2500 to theoretical 100x coverage with paired-end (PE) 2x250 in Rapid run mode. Our assembly results demonstrate that a combination of both approaches generates the most contiguous bonefish assembly.
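
The abstract does not say which statistic was used to measure contiguity. N50 is a common choice, so the sketch below assumes it purely for illustration; the contig lengths are hypothetical and are not results from the bonefish assemblies.

```python
def n50(contig_lengths):
    """Return the N50 of an assembly.

    N50 is the length of the shortest contig in the smallest set of longest
    contigs that together cover at least half of the total assembly length.
    """
    if not contig_lengths:
        raise ValueError("at least one contig length is required")
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Hypothetical contig lengths (bp) for two assemblies of the same genome.
fragmented = [80, 70, 50, 40, 30, 20]
contiguous = [200, 60, 30]
print(n50(fragmented), n50(contiguous))
```

A higher N50 at a similar total length indicates a more contiguous assembly, which is the sense in which correction strategies would be compared.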



P59
Measuring chromosome conformation

Subject: Simulation and numeric computing

Presenting Author: Brian Ross, University of Colorado Anschutz Medical Campus, United States

Co-Author(s):
James Costello, University of Colorado Anschutz Medical Campus, United States

Abstract:

The in-vivo conformation of chromosomes is an outstanding unsolved problem in structural biology. Most structural information is currently inferred indirectly from Hi-C data, as direct measurements of chromosomal positioning have not been possible for more than a handful of genetic loci. We have previously demonstrated a computational method for scaling direct positioning measurements up to the whole-chromosome scale. Here we present our latest results from simulations and experiments.



P60
Challenges Using Electronic Medical Record for Pharmacokinetic Analysis

Subject: Simulation and numeric computing

Presenting Author: Matthew Shotwell, Vanderbilt University Medical Center, United States

Co-Author(s):
Hannah Weeks, Vanderbilt University, United States

Abstract:

Hospitalized patients may benefit from individualized drug dosing that is informed by real-time blood sampling and pharmacokinetic analysis. The additional necessary dosing history information and other clinical and demographic factors can be extracted from the electronic medical record (EMR). However, these data are prone to errors. We consider the impact of incorrect entry of dose administration times and blood sampling times. We further show that, for the intravenously infused antibiotic piperacillin, estimation of a clinically informative measure of drug exposure - the fraction of the dosing cycle in which the blood concentration of drug is above a given efficacy threshold - is robust to many types of error that occur in the EMR. In addition we demonstrate that certain drug administration techniques, including long infusion duration, ensure greater robustness.
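
The exposure measure named above, the fraction of the dosing cycle spent above a threshold, can be sketched for a one-compartment model with repeated intravenous infusions. The model choice, the function names, and every parameter value below are assumptions for illustration only; they are not the authors' model and are not clinical guidance.

```python
import math


def single_dose_conc(t, dose, t_inf, cl, v):
    """Concentration at time t (h) after the start of one zero-order infusion."""
    if t < 0:
        return 0.0
    k = cl / v                    # elimination rate constant (1/h)
    plateau = (dose / t_inf) / cl  # concentration an endless infusion would approach
    if t <= t_inf:
        return plateau * (1.0 - math.exp(-k * t))
    return plateau * (1.0 - math.exp(-k * t_inf)) * math.exp(-k * (t - t_inf))


def steady_state_conc(t, dose, t_inf, tau, cl, v, n_prior_doses=50):
    """Superpose the current dose and earlier doses given every tau hours."""
    return sum(
        single_dose_conc(t + n * tau, dose, t_inf, cl, v)
        for n in range(n_prior_doses + 1)
    )


def fraction_above(threshold, dose, t_inf, tau, cl, v, n_grid=2000):
    """Fraction of one dosing interval with concentration above the threshold."""
    above = 0
    for i in range(n_grid):
        t = (i + 0.5) * tau / n_grid  # midpoint of each grid cell
        if steady_state_conc(t, dose, t_inf, tau, cl, v) > threshold:
            above += 1
    return above / n_grid


# Hypothetical values: 4000 mg every 8 h, CL = 10 L/h, V = 15 L, threshold 16 mg/L.
short = fraction_above(16.0, dose=4000.0, t_inf=0.5, tau=8.0, cl=10.0, v=15.0)
extended = fraction_above(16.0, dose=4000.0, t_inf=4.0, tau=8.0, cl=10.0, v=15.0)
print(f"0.5 h infusion: {short:.2f}, 4 h infusion: {extended:.2f}")
```

To probe sensitivity to record errors in a setting like this, one could shift the recorded dose or sampling times by a plausible error and recompute the fraction.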



P61
Measuring Transcription Factor Activity with Nascent RNA Sequencing

Subject: inference and pattern discovery

Presenting Author: Rutendo Sigauke, University of Colorado Anschutz Medical Campus, United States

Co-Author(s):
Jonathan Rubin, University of Colorado Boulder, United States
Jacob Stanley, University of Colorado Boulder, United States
Robin Dowell, University of Colorado Boulder, United States

Abstract:

Transcription factor (TF) proteins control cellular states and functions by regulating the transcription of genes. In order to measure the activity of TFs most studies have taken advantage of Chromatin Immunoprecipitation (ChIP) Assays (Gade and Kalvakolanu, 2012). However, not all TF ChIP-bound sites result in TF activity. In this study we present Transcription Factor Enrichment Analysis (TFEA), a method to identify TF activity using nascent RNA sequencing. Nascent RNA sequencing has allowed for the sequencing of short-lived enhancer RNAs (eRNAs). Previous studies have shown that transcription of eRNAs is a direct measure of TF activity (Hah et al. 2013, Allen et al. 2014). TFEA extends the Motif Displacement Score (MDS) method which uses the colocalization of eRNAs with TF motifs as a measure of TF activity (Azofeifa et al. 2018). TFEA takes advantage of differential transcription of eRNAs, in addition to colocalization of eRNAs with TF motifs, to assess TF activity. TFEA gives a summary report of TFs predicted to be enriched in a given experiment. TFEA was able to identify TFs known to be enriched in several case studies.



P62
Addressing the compositional data problem in sequencing with a novel, robust normalization method

Subject: Simulation and numeric computing

Presenting Author: James St. Pierre, University of Toronto, Canada

Co-Author(s):
John Parkinson, Hospital for Sick Children, Toronto, Canada

Abstract:

A problem that faces high-throughput sequencing datasets is that raw sequencing data is semi-quantitative due to the random sampling procedure of the sequencing process itself. The raw counts produced only give relative abundances of various genes and must be appropriately normalized to give an approximation of the absolute abundances of genes in the samples. This ‘compositional data problem’ in sequencing is especially apparent in the microbiome field. Normalization methods developed for RNA-seq data have been shown to fail when used on 16S microbiome sequencing data, leading to inflated false discovery rates when performing differential abundance analysis. Moreover, the effectiveness of these normalization techniques when used on metagenomics and metatranscriptomics data has yet to be systematically evaluated. We present a novel normalization method that shows improved performance over previous methods (DESeq2, edgeR, and metagenomeSeq) when applied to simulated sequencing data. All current normalization methods have the statistical assumption that most genes (or taxa) are not differentially abundant between experimental groups. The new technique does not have this assumption and is the only method that successfully controls false positive rates during differential abundance testing on a simulated 16S dataset where 50% of taxa were set to be differentially abundant. Even ANCOM and ALDEx2, two compositional data analysis tools previously shown to be more robust than other methods, are shown here to have inflated false positive rates. This new normalization method will be an asset to microbiome researchers, leading to more robust discoveries.
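
The compositional data problem itself (not the new normalization method, which the abstract does not specify) can be seen in a toy example: when one taxon increases in absolute abundance, the relative abundances of all unchanged taxa fall. The taxon names and counts below are hypothetical.

```python
def relative_abundance(counts):
    """Convert absolute counts to proportions that sum to 1."""
    total = sum(counts.values())
    return {taxon: count / total for taxon, count in counts.items()}


# Hypothetical absolute abundances: only taxon_A truly changes between groups.
control = {"taxon_A": 100, "taxon_B": 100, "taxon_C": 100, "taxon_D": 100}
treated = {"taxon_A": 400, "taxon_B": 100, "taxon_C": 100, "taxon_D": 100}

rel_control = relative_abundance(control)
rel_treated = relative_abundance(treated)

for taxon in control:
    print(f"{taxon}: control={rel_control[taxon]:.3f}, treated={rel_treated[taxon]:.3f}")
```

Taxa B, C, and D are unchanged in absolute terms, yet their proportions drop from one quarter to one seventh, which a naive test on proportions could report as differential abundance.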



P63
Governance Innovations for Promoting Cross-institutional Electronic Health Data Sharing

Subject: Data management methods and systems

Presenting Author: Kari Stephens, University of Washington, United States

Co-Author(s):
Adam Wilcox, University of Washington, United States
Philip Payne, Washington University, United States
Jason Morrison, University of Washington, United States
Jennifer Sprecher, University of Washington, United States
Rania Mussa, University of Washington, United States
Randi Foraker, Washington University, United States
Sarah Biber, Oregon Health Sciences University, United States
Sean Mooney, University of Washington, United States

Abstract:

Cross-institutional electronic health data sharing is an essential requirement for health innovation research. Healthcare organizations across the country are governed separately by state and local laws and policies that complicate research-related data sharing. Electronic health record (EHR) data are not only highly protected via federal laws (i.e., HIPAA) and regional Institutional Review Boards (IRBs), but are also often protected as assets by individual organizations. No clear pathway exists for organizations to execute governance for rapid EHR data sharing, stifling research efforts ranging from simple observational studies to complex multi-institutional trials. Universal governance solutions are essential to provide pathways for data sharing to address the rapid pace of research. The Clinical Translational Science Award (CTSA) Program Data to Health (CD2H) Coordinating Center has launched a cloud data sharing pilot project to begin addressing this complex issue. In order to configure a web-based data sharing software tool, Leaf, that can cross-query comprehensive harmonized EHR data generated by multiple healthcare organizations, we are exploring a singular governance solution (i.e., embodied in a data use agreement (DUA) and Institutional Review Board (IRB) solution) to accommodate both general and research-specific use. While DUAs and IRBs are not streamlined governance solutions, this is an essential first step in creating broader sustainable national governance solutions (i.e., master consortium agreements, access governance policies).



P64
Use of metadata and Bag-of-words to map measurements across observational study data

Subject: inference and pattern discovery

Presenting Author: Laura Stevens, University of Colorado Anschutz Medical School, United States

Co-Author(s):
Tiffany Callahan, University of Colorado Anschutz Medical School, United States
Sonia Leach, University of Colorado Anschutz Medical School, United States
David Kao, University of Colorado Anschutz Medical School, United States

Abstract:

Data integration is an important strategy for validating research results or increasing sample size in biomedical research. Integration is made challenging by metadata and data differences between studies, and is often done manually by a clinical expert for a highly select set of measurements. Unfortunately, this process is rarely documented, and when it is, the details are not accessible, interoperable, or reusable. We explored the utility of bag-of-words, an information retrieval model, for mapping measurements of medical conditions, characteristics, and lifestyle (such as diabetes, age, blood pressure, or alcohol intake) among multiple studies. We hypothesized that applying cosine similarity to features extracted as a bag-of-words model from observational study measurement annotations would yield accurate recommendations for mapping measurements within and between studies and increase scalability compared to manual mapping. Each measurement’s metadata, including descriptions, units, and value-coding, were extracted and then combined for all 105,611 measurements in four cardiovascular-health observational studies. Each measurement’s combined metadata was input to the bag-of-words model. Cosine similarity of word vectors was used to score similarity between measurement pairs. The highest scoring matches for each measurement were compared to 612 unique expert-vetted, manual mappings. Among the vetted measurement pairings, 99.8% had the correct mapping in the top-10 scored matches, 92.5% had the correct mapping in the top-5, and 55.7% had the correct mapping as the top score. This approach provides a scalable method for recommending measurement mappings in observational study data. Next steps include incorporating additional metadata such as measurement type or a synonyms dictionary for concept recognition.
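
The scoring step can be sketched with raw word counts and cosine similarity. The tokenizer, the function names, and the measurement identifiers and metadata strings are hypothetical; the abstract does not describe the authors' preprocessing or term weighting.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_matches(query_id, metadata, top_k=5):
    """Rank every other measurement by similarity to the query measurement."""
    query = bag_of_words(metadata[query_id])
    scored = [
        (cosine_similarity(query, bag_of_words(text)), other_id)
        for other_id, text in metadata.items()
        if other_id != query_id
    ]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(other_id, round(score, 3)) for score, other_id in scored[:top_k]]


# Hypothetical combined metadata (description, units, coding) per measurement.
metadata = {
    "studyA_sbp": "systolic blood pressure mmHg",
    "studyB_bp_sys": "blood pressure systolic mmHg sitting",
    "studyB_age": "age at exam years",
    "studyB_alcohol": "alcohol intake drinks per week",
}
print(rank_matches("studyA_sbp", metadata, top_k=3))
```

Returning a ranked top-k list rather than a single best match fits the reported results, where the correct mapping was far more often in the top 5 or top 10 than at rank 1.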



P65
Visualization Tool for interactive deciphering complex genetic regulation from multi-omic data

Subject: Graphics and user interfaces

Presenting Author: LIN TING-WEI, Linkou Chang Gung Memorial Hospital, Taiwan


Abstract:

To decipher the complex regulatory relationships underlying a given biological intervention, an integrative analysis of the data with multiple annotation steps has to be conducted. One challenge of such a multi-step approach is that, at various steps, it requires hypotheses generated by biologists to refine the analysis and focus on specific details of the results. We provide a framework that first visualizes the tree relationship between pathways, using signaling pathway impact analysis to indicate direction. A multilayer network layout then presents the master regulator analysis derived from the prior results. Finally, biologists can interact with the visualization to decipher the possible key regulatory relationships in their experimental intervention or to generate new hypotheses for further experimental validation.



P66
anexVis: visual analytics framework for analysis of RNA expression

Subject: Graphics and user interfaces

Presenting Author: Diem-Trang Tran, University of Utah, United States

Co-Author(s):
Tian Zhang, University of Utah, United States
Ryan Stutsman, University of Utah, United States
Matthew Might, University of Alabama at Birmingham, United States
Umesh Desai, Virginia Commonwealth University, United States
Balagurunathan Kuberan, University of Utah, United States

Abstract:

Although RNA expression data are accumulating at a remarkable speed, gaining insights from them still requires laborious analyses, which hinder many biological and biomedical researchers. We introduce a visual analytics framework that applies several well-known visualization techniques to leverage understanding of an RNA expression dataset. Our analyses on glycosaminoglycan-related genes have demonstrated the broad application of this tool, anexVis (analysis of RNA expression), to advance the understanding of tissue-specific glycosaminoglycan regulation and functions, and potentially other biological pathways.
The application is publicly accessible at https://anexvis.chpc.utah.edu/, and the source code is deposited on GitHub.



P67
Toxicant-protein relation extraction

Subject: Text Mining

Presenting Author: Ignacio Tripodi, University of Colorado, Boulder, United States

Co-Author(s):
Lawrence Hunter, University of Colorado, Denver, United States

Abstract:

The interaction between chemicals and proteins provides essential information regarding how exposure to certain chemicals affects cell functions. In particular, knowing how chemicals that result in toxicity are associated with the up- or down-regulation of transcription factors can help elucidate the mechanistic details of such adverse outcomes. Some of this information can be inferred indirectly from chemical-to-gene interactions present in public databases. These resources are, however, updated at varying frequencies and generally incomplete, as is our knowledge of which of the many transcription factors regulate which genes. We propose a text-mining approach in which we explore an open-access body of literature to determine, using machine learning and a set of heuristics, which chemicals from a list of known toxicants are associated with an increase or decrease in the activity of specific transcription factors.



P68
LOINC2HPO: Improving translational informatics by standardizing EHR phenotypic data using the Human Phenotype Ontology

Subject: Text Mining

Presenting Author: Nicole Vasilevsky, Oregon Health & Science University, United States

Co-Author(s):
Aaron Zhang, The Jackson Laboratory, United States
Jean-Philippe Gourdine, Oregon Health & Science University, United States
Amy Yates, Oregon Health & Science University, United States
Melissa Haendel, Oregon Health & Science University, United States
Peter Robinson, The Jackson Laboratory, United States

Abstract:

Electronic Health Record (EHR) data are often encoded using Logical Observation Identifier Names and Codes (LOINC), which is a universal standard for coding clinical laboratory tests. LOINC codes encode clinical tests and not the phenotypic outcomes, and multiple codes can be used to describe laboratory findings that may correspond to one phenotype. However, LOINC encoded data is an untapped resource in the context of deep phenotyping with the Human Phenotype Ontology (HPO). The HPO describes phenotypic abnormalities encountered in human diseases, and is primarily used for research and diagnostic purposes. As part of the Center for Data to Health (CD2H)’s effort to make EHR data more translationally interoperable, our group developed a curation tool that is used to convert EHR observations into HPO terms for use in clinical research. To date, over 1,000 LOINC codes have been mapped to HPO terms. To demonstrate the utility of these mapped codes, we performed a pilot study with de-identified data from asthma patients. We were able to convert 70% of real-world laboratory tests into HPO-encoded phenotypes. Analysis of the LOINC2HPO-encoded data showed that the HPO term eosinophilia was enriched in patients with severe asthma and prednisone use. This preliminary evidence suggests that LOINC data converted to HPO can be used for machine learning approaches to support genomic phenotype-driven diagnostics for rare disease patients, and to perform EHR based mechanistic research.



P69
Exploratory Analysis of Diseased Male and Female Gene Expression Levels

Subject: inference and pattern discovery

Presenting Author: Clarissa White, Brigham Young University, United States


Abstract:

Background
Most genetic diseases have complex genetic effects that are not yet fully understood. Amyloidosis is a rare and fatal disease in which an abnormal protein is produced. Transthyretin Amyloidosis (ATTR) is a type of this disease caused by inheriting a gene mutation. This study looks at the role that gender plays in gene expression levels of subjects with and without ATTR. First, we compared gene expression level differences between males and females with the disease. Second, we used gene expression and gender to predict membership in the asymptomatic and control groups.

Methods and Findings
We performed exploratory analyses on a publicly-available dataset of 309 patients. Each patient was either asymptomatic, symptomatic, treated, or a control. The gender and gene expression levels for over 21,000 genes were included. We performed a t-test between males and females to find differences in gene expression levels. We then used Random Forests to predict between asymptomatic patients and controls. Finally, we looked at whether the Random Forest predictions changed when including gender.

Conclusions
We found that gene expression levels were not significantly different for different genders. We also found that the Random Forest model correctly predicted disease approximately 60% of the time, evidence of slightly better predictions than a random guess. Future research should conduct a study looking for a specific molecular signature, as our results suggested there might be one, but we did not do any analysis to look at what it might be as this research was exploratory.



P70
BioThings API: Building a FAIR API Ecosystem for Biomedical Knowledge

Subject: web services

Presenting Author: Chunlei Wu, The Scripps Research Institute, United States

Co-Author(s):
Jiwen Xin, The Scripps Research Institute, United States
Cyrus Afrasiabi, The Scripps Research Institute, United States
Sebastien Lelong, The Scripps Research Institute, United States
Marco Alvarado Cano, The Scripps Research Institute, United States
Ginger Tseung, The Scripps Research Institute, United States
Trish Whetzel, EMBL-EBI, United States
Shima Dastgheib, NuMedii Inc, United States
Amrapali Zaveri, Maastricht University, Netherlands
Michel Dumontier, Maastricht University, Netherlands
Andrew I. Su, The Scripps Research Institute, United States
Chunlei Wu, The Scripps Research Institute, United States

Abstract:

Building a web-based API (Application Programming Interface) has been rapidly adopted in the bioinformatics field as a new way of disseminating the underlying biomedical knowledge. While researchers benefit from the simplicity and the high accessibility (A) of available APIs, the findability (F), interoperability (I) and reusability (R) across APIs are largely not well-handled by the community. The BioThings API project (http://biothings.io) is tasked with building a FAIR API ecosystem to better serve the underlying inter-connected biomedical knowledge. BioThings API provides three components in its API development ecosystem. First, it provides a family of high-performance APIs for accessing up-to-date annotations for genes, genetic variants, chemicals and drugs. Second, BioThings API packages its API-development best practices into a reusable SDK (Software Development Kit) to help other bioinformaticians build APIs of the same high quality to distribute their own specific knowledge. Third, BioThings API provides a platform to foster findability and interoperability across community-developed biomedical APIs. Through the SmartAPI application (http://smart-api.info), it provides tools for authoring API metadata following the community-supported OpenAPI standard and hosts standardized interactive API documentation. It also defines a set of OpenAPI extensions to provide biomedical-specific semantic annotations, such as what specific biomedical identifiers an API parameter accepts and what specific biomedical entity types an API response contains. Powered by these semantic annotations, a web application called BioThings Explorer (http://biothings.io/explorer) was developed to allow researchers to navigate the scope of the distributed biomedical API landscape and build the desired knowledge extraction workflows by identifying and combining required APIs.