ROCKY 2019 | Dec 5 – 7, 2019 | Aspen/Snowmass, CO

POSTER PRESENTATIONS



P01
Evolutionary Action is a unifying framework for assessing missense variant structures within and across phyla

Subject: Metagenomics

Presenting Author: Nicholas Abel, Baylor College of Medicine, United States

Co-Author(s):
Harley Peters, Baylor College of Medicine, United States
Panos Katsonis, Baylor College of Medicine, United States

Abstract:

The quantification and consequences of non-synonymous polymorphism at the exome level in natural populations have yet to be fully defined in a way that enables correcting pathogenic variants through Precision Medicine. Furthermore, a non-statistical method for defining the mutation load and dynamics of individuals, populations, and species at the exome level has yet to be fully developed. Tools exist for predicting the impact a mutation has on displacing a protein in its fitness landscape, namely Evolutionary Action (EA). To interrogate the spectrum of naturally occurring fitness effects, we applied the EA equation, and its selection constant λ, to population data from natural human populations and relevant model organisms. We found species-specific mutation constants at both the individual and species level. Additionally, we utilized machine learning on Drosophila melanogaster populations to identify pathways and gene groups under differential selection globally. We found groups involved in Signal Transduction, Translation, Transcription Factors, Transport and Catabolism, and the Spliceosome to be highly ranked in separating the populations. These data demonstrate the nuanced mutability of replication and translation machinery, as well as of protein trafficking and recycling, during the adaptation of Drosophila to novel habitats. These findings establish EA as a crucial metric for machine learning and add to the nascent field of population exomics.



P02
Discovering Subclones in Tumors Sequenced at Standard Depths

Subject: other

Presenting Author: Navid Ahmadinejad, Arizona State University, United States


Abstract:

Understanding intratumor heterogeneity is critical to designing personalized treatments and improving clinical outcomes of cancers. Such investigations require accurate delineation of the subclonal composition of a tumor, which to date can only be reliably inferred from deep-sequencing data (>300x depth). To enable accurate subclonal discovery in tumors sequenced at standard depths (30-50x), we develop a novel computational method that incorporates an adaptive error model into statistical decomposition of mixed populations, which corrects the mean-variance dependency of sequencing data at the subclonal level. Tested on extensive computer simulations and real-world data, this new method, named model-based adaptive grouping of subclones (MAGOS), consistently outperforms existing methods on minimum sequencing depth, decomposition accuracy and computation efficiency. MAGOS supports subclone analysis using single nucleotide variants and copy number variants from one or more samples of an individual tumor. Applications of MAGOS to whole-exome sequencing data of 376 liver cancer samples discovered a significant association between subclonal diversity and patient overall survival.
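
The mean-variance dependency that MAGOS corrects can be made concrete with the standard binomial read-sampling model (a generic sketch; the abstract does not spell out MAGOS's exact error model). If a variant with true allele fraction p is covered by d reads, the observed fraction is

$\hat{p} = k/d,\; k \sim \mathrm{Binomial}(d,\,p) \;\Rightarrow\; \mathbb{E}[\hat{p}] = p, \quad \mathrm{Var}(\hat{p}) = \frac{p(1-p)}{d}.$

At standard 30-50x depths, the standard deviation of an observed allele fraction is roughly 2.5-3 times larger than at 300x, which is why naive clustering of allele fractions is unreliable at standard depths and an adaptive error model is needed.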



P03
Meta-analysis and Machine Learning Classification for Dilated Cardiomyopathy Using Cardiac Transcriptomics Data

Subject: Machine learning

Presenting Author: Ahmad Alimadadi, Program in Bioinformatics and Department of Physiology and Pharmacology, University of Toledo College of Medicine and Life Sciences, United States

Co-Author(s):
Bina Joe, University of Toledo College of Medicine and Life Sciences, United States
Xi Cheng, University of Toledo College of Medicine and Life Sciences, United States

Abstract:

One of the most common causes of heart failure is dilated cardiomyopathy (DCM). DCM might not cause symptoms but can be life-threatening. Several studies have used the RNA-seq approach to profile differentially expressed genes (DEGs) associated with DCM. In this study, we aimed to profile gene expression signatures and identify novel genes associated with DCM through a quantitative meta-analysis of four independent RNA-seq studies using human left ventricle tissues. A total of 46 DCM and 26 non-DCM samples were used for the meta-analysis. Our analysis identified 533 DEGs, including 399 downregulated and 134 upregulated genes. Several DCM-related genes, including MYH6, DES, NKX2-5, PTN, and ATP2A2, were among the top 50 DEGs. Our meta-analysis also identified 23 DEGs that were not detected as DEGs in any of the individual RNA-seq datasets. Some of those genes, such as PTH1R, PDGFD, ATP1A1, S100A4, and IRAK1, confirm previous reports of associations with DCM. Ingenuity Pathway Analysis identified seven activated toxicity pathways, with "failure of heart" as the most significant pathway. In addition, we evaluated the performance of various supervised machine learning methods, such as Support Vector Machines (SVM) and Neural Networks (NN), in classifying DCM samples from non-DCM controls using the RNA-seq read count data. Our training results showed that SVM could achieve over 90% classification accuracy, followed by NN (>80% accuracy). Overall, our meta-analysis successfully identified a core set of genes associated with DCM, and the machine learning models can be efficiently trained to classify clinical DCM patients.
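
As a rough sketch of the classification setup described above (not the authors' code: the counts below are simulated placeholders, and the preprocessing and hyperparameters are assumptions), an SVM can be trained on log-transformed read counts with scikit-learn:

```python
# Sketch: classify DCM vs. non-DCM samples from RNA-seq read counts.
# Sample counts match the abstract (46 DCM, 26 non-DCM, 533 DEGs); the
# Poisson-simulated counts are placeholders, not the study's data.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.poisson(lam=50, size=(72, 533)).astype(float)  # 72 samples x 533 genes
y = np.array([1] * 46 + [0] * 26)                      # 46 DCM, 26 non-DCM

# Log-transform counts, standardize, then fit an RBF-kernel SVM.
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
scores = cross_val_score(clf, np.log1p(X), y, cv=5)
print(f"cross-validated accuracy: {scores.mean():.2f}")
```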



P04
A web application for annotating tabular data with terms from biomedical ontologies

Subject: Instrumentation interfaces and data acquisition

Presenting Author: Elizabeth Anderson, Brigham Young University, United States


Abstract:

Good Nomen is a web application designed for data curators who wish to quickly and efficiently map tabular data to standardized terms found in ontologies. Researchers often use inconsistent terms to describe biomedical phenomena, including diagnoses, symptoms, medications, and treatments. For example, Trastuzumab, an antibody used to treat breast cancer, may be described using its generic name, trade names, target molecule, etc. Although much work is underway to develop methods for annotating terms in electronic health records, biomedical data in summarized form are increasingly being released as tabular data files (delimited as rows and columns). However, before statistical and visualization methods can be applied to such data, terms must be standardized, a tedious and laborious effort when performed manually. To address this problem, we created Good Nomen, an interactive web application that accepts comma-separated value, tab-separated value, and Excel formatted files. After uploading a file, users select from among hundreds of ontologies that have been published in BioPortal. Using the selected ontology, a user can match one or more terms (or column names) in the dataset to a standardized term. This matching can occur through manual direction from the user or via auto-matching based on regular-expression patterns. Good Nomen interfaces with BioPortal via its Application Programming Interface. Good Nomen supports reproducibility by generating R scripts that enable users to repeat annotation steps. This interactive application promises to save valuable time for researchers and to make data more standardized and thus easier to integrate with other datasets.
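
The regular-expression auto-matching step can be pictured with a minimal sketch (our illustration; Good Nomen's real matching works against ontology terms fetched through the BioPortal API, and the synonym table below is invented):

```python
# Minimal sketch of regex-based term standardization. SYNONYM_PATTERNS is
# invented for illustration; in Good Nomen, candidate labels come from the
# ontology the user selected in BioPortal.
import re

SYNONYM_PATTERNS = {
    "Trastuzumab": re.compile(r"\b(trastuzumab|herceptin)\b", re.IGNORECASE),
}

def standardize(cell_value: str) -> str:
    """Return the standardized label for the first matching pattern, else the input."""
    for label, pattern in SYNONYM_PATTERNS.items():
        if pattern.search(cell_value):
            return label
    return cell_value

print(standardize("Herceptin 150 mg"))  # -> "Trastuzumab" (Herceptin is a trade name)
```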



P05
Enabling structure-based data-driven selection of targets for cancer immunotherapy

Subject: Qualitative modeling and simulation

Presenting Author: Dinler Antunes, Rice University, United States

Co-Author(s):
Jayvee Abella, Rice University, United States
Sarah Hall-Swan, Rice University, United States
Kyle Jackson, UT MD Anderson Cancer Center, United States
Gregory Lizée, UT MD Anderson Cancer Center, United States
Lydia Kavraki, Rice University, United States

Abstract:

Understanding the molecular triggers of an immune response is essential to fields such as vaccine development and cancer immunotherapy. In this context, a central step is the activation of T-cell lymphocytes by peptides displayed by Human Leukocyte Antigen (HLA) receptors. For instance, a tumor-derived peptide such as MAGEA3 can be used in a vaccine, triggering an immune response capable of eliminating melanoma cells. Cancer vaccines and T-cell-based therapies have been tested in several clinical trials, with remarkable results. However, in a few patients the therapeutic T-cells mistakenly recognized unrelated peptides expressed by healthy cells, causing lethal off-target reactions. Molecular mimicry was shown to be the key factor determining these side-effects, making structural analyses an important component of designing safer immunotherapies. In addition, structural data on peptide-HLA complexes can be the key to better methods for neoantigen discovery and immunogenicity prediction. Unfortunately, there is an irreconcilable mismatch between the diversity of HLA molecules (above 17,000 alleles) and the scarcity of experimentally determined peptide-HLA structures (about 700). To address this problem, we implemented a fast method for structural modeling of peptide-HLA complexes (APE-Gen), and we are now conducting large-scale modeling of all peptides deposited in the SysteMHC Atlas (i.e., more than 100,000 experimentally determined peptides from immunopeptidomics projects). Our database of 3D models will be used for a range of data-driven applications, including the prediction of binding affinity, complex stability, and off-target toxicity. In turn, these new methods can be directly applied to enable personalized selection of peptide targets for safer cancer immunotherapy treatments.



P06
Poster Withdrawn


P07
Poster Withdrawn


P08
Pathway-Regularized Matrix Factorization

Subject: inference and pattern discovery

Presenting Author: Aaron Baker, University of Wisconsin-Madison, United States


Abstract:

Non-negative matrix factorization is a popular tool for decomposing high-dimensional data into its constituent parts. Recent research has incorporated manifold regularization to select parts that are consistent on a manifold, a mathematical structure that describes how features in the data are related to one another. One source of hidden structure in genomic data relates to the effect of interactions among groups of genes in the cell. Applications of manifold-regularized matrix factorization to these datasets have revealed cancer subtypes with different biomarkers and effective treatments. We propose a more focused version of this method which reformulates the global gene interaction network manifold as a set of sub-manifolds, each associated with a biological pathway. Biological pathways are important sets of interactions because they define how pairs of interacting genes influence broader physiological processes. These processes also capture tissue- and context-specific relationships among the genes under investigation. By constraining matrix factorization techniques to respect these underlying structures and emphasizing pathway edges instead of nodes, we gain biological insight when examining the factors.
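
In our notation (a sketch of the construction as we read it, not necessarily the authors' exact formulation), with gene-by-sample matrix $X \approx WH$ and $L_p$ the graph Laplacian of the interaction sub-network for pathway $p$, the pathway-regularized objective takes the form

$\min_{W,H \ge 0} \; \lVert X - WH \rVert_F^2 \;+\; \lambda \sum_{p \in \mathcal{P}} \operatorname{tr}\!\left(W^{\top} L_p W\right),$

so that the gene factors $W$ are encouraged to vary smoothly along each pathway's edges rather than along one global network, with $\lambda$ controlling regularization strength.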



P09
The Use of Machine Learning for Modeling a Clinical Decision Support for Predicting Postpartum Depression

Subject: Machine learning

Presenting Author: Houda Benlhabib, University of Washington, Biomedical Informatics and Medical Education, United States

Co-Author(s):
Sean Mooney, University of Washington, United States
Ian Bennett, University of Washington, United States
Peter Tarczy-Hornoch, University of Washington, United States

Abstract:

Postpartum depression (PPD) is depression that occurs after childbirth. Importantly, if left untreated, PPD can have severe outcomes for the mother and offspring. Current statistics show that most maternal deaths in the US occur in the postpartum period. Toxicology testing has revealed that women in the postpartum period are at risk of suicide, accidental drug overdose, and homicide. PPD can also lead to infanticide and to decreased maternal sensitivity and attachment to infants, which leads to poor child development. Importantly, the clinical diagnosis of PPD remains challenging, due in part to the high percentage of women with the disorder who fail to report it and do not seek the appropriate interventions. One of the main contributors to the latter is the lack of a routine protocol to screen women for depression during and after pregnancy. This indicates the need for better tools to screen women for PPD. The objective of the current work is to develop computational tools using machine learning and natural language processing to predict which women will develop PPD using electronic health records (EHR). Here we report the use of the de-identified aggregate reporting query tool Leaf to identify a population of pregnant women who delivered at UW Medicine and suffer from PPD, and to assess and characterize attributes associated with it using EHR data. These attributes can serve to develop data science approaches for decision support. In this presentation, we will describe the study population and our approach toward developing new methodology using structured EHR data and clinical text.



P10
Bridging the Bioinformatics Knowledge Gap in the Pediatric Cancer Research Community with the Childhood Cancer Data Lab workshops

Subject: Networking

Presenting Author: Chante Bethell, Childhood Cancer Data Lab (Alex's Lemonade Stand Foundation), United States

Co-Author(s):
Candace Savonen, Childhood Cancer Data Lab (Alex's Lemonade Stand Foundation), United States
Deepashree Prasad, Childhood Cancer Data Lab (Alex's Lemonade Stand Foundation), United States
Casey Greene, Childhood Cancer Data Lab (Alex's Lemonade Stand Foundation), United States
Jaclyn Taroni, Childhood Cancer Data Lab (Alex's Lemonade Stand Foundation), United States

Abstract:

Biomedical researchers with limited to no bioinformatics training face hurdles when it comes to utilizing their data. As a result of this knowledge gap, many researchers rely on bioinformaticians to answer biological questions with their genomic data. This collaboration process can be protracted, as demand for bioinformatics expertise often outpaces supply. The Childhood Cancer Data Lab (CCDL), an initiative of Alex's Lemonade Stand Foundation, has implemented hands-on, three-day bioinformatics workshops to help address the computational skills gap in the pediatric cancer research community. The goal of these workshops is to equip pediatric cancer researchers with the necessary tools to independently perform basic analyses on their own experimental data and to gain confidence for continued self-directed learning. Our 2019 workshops included four modules on the tidyverse, RNA-seq, single-cell RNA-seq, and machine learning. Workshop participants run analyses on their own laptops in a versioned Docker container prepared by CCDL staff and leave with their own machines equipped for future analyses. We introduce reproducible research practices, such as literate programming via R notebooks. On the final day of the workshop, researchers apply their newly developed skills to their own data with support from CCDL's data science team members. We anticipate that our workshops will help bridge the bioinformatics knowledge gap and promote communities of practice in the pediatric cancer community.



P11
Map and model — moving from observation to prediction in toxicogenomics

Subject: inference and pattern discovery

Presenting Author: Wibke Busch, Helmholtz Centre for Environmental Research - UFZ, Germany

Co-Author(s):
Andreas Schüttler, Helmholtz-Centre for Environmental Research - UFZ, Germany

Abstract:

Chemicals induce compound-specific changes in the transcriptome of an organism (toxicogenomic fingerprints). These provide potential insights into the cellular and physiological responses to chemical exposure and adverse effects, which are needed for the assessment of chemical-related hazards and environmental health. In this regard, comparing or connecting different experiments becomes important when interpreting toxicogenomic data. However, because response dynamics are often not captured, comparability is limited. We developed an experimental design and bioinformatic analysis strategy to infer time- and concentration-resolved toxicogenomic fingerprints. We projected the fingerprints onto a universal coordinate system (the toxicogenomic universe) based on a self-organizing map of toxicogenomic data retrieved from public databases. Genes clustering together in regions of the map are indicative of functional relationships arising from co-expression under chemical exposure. To allow for quantitative description and extrapolation of the gene expression responses, we developed a time- and concentration-dependent regression model. We applied the analysis strategy in a microarray case study exposing zebrafish embryos to three selected model compounds, including two cyclooxygenase inhibitors. After identifying key responses in the transcriptome, we could compare and characterize their association with developmental, toxicokinetic, and toxicodynamic processes using the parameter estimates for affected gene clusters. The design and analysis pipeline described here could serve as a blueprint for creating comparable toxicogenomic fingerprints of chemicals, as it integrates, aggregates, and models time- and concentration-resolved toxicogenomic data. https://doi.org/10.1093/gigascience/giz057



P12
Data Discovery Engine: A web-based toolset for maximizing data discoverability and promoting reusable data-sharing best practices

Subject: Data management methods and systems

Presenting Author: Marco Cano, Scripps Research, United States

Co-Author(s):
Xinghua Zhou, Scripps Research, United States
Jiwen Xin, Scripps Research, United States
Chunlei Wu, Scripps Research, United States
Sebastien LeLong, Scripps Research, United States
Matthew B. Carson, Northwestern Medicine, United States
Kristi L. Holmes, Northwestern Medicine, United States
Sara Gonzales, Northwestern Medicine, United States

Abstract:

The biomedical research community has a wealth of data and opportunities for collaboration, yet it is challenging to identify existing datasets that can be leveraged to help power investigations. The Data Discovery Engine (http://discovery.biothings.io) is a web application providing a pathway and tooling for data providers to define and expose their data resources in an easy and reusable way, so that others can find them via multiple portals, including Google Dataset Search and other domain-specific portals. The application includes two components.

The Discovery Guide component helps data providers organize their dataset metadata in a structured JSON-LD format, following the schema.org/Dataset schema. This ensures that the basic metadata fields can be properly indexed by major search engines like Google. Using the same mechanism, we can extend the JSON-LD metadata to include additional biomedicine-specific fields, which can subsequently be captured by domain-specific discovery portals. In addition to sharing well-formed metadata through our application, the guide also allows data providers to embed a one-liner in their existing dataset page and turn their own website into a structured metadata provider.

The Schema Playground component focuses on enabling developers to build schema.org-compatible schemas to encode their dataset metadata. By building on top of existing schemas, developers can include the additional biomedical fields they need while keeping their metadata interoperable with general-purpose search engines. The playground provides tools to extend, visualize, and host user-defined metadata schemas. These schemas can then be used in the Discovery Guide to cover additional metadata types.
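
A minimal example of the kind of schema.org/Dataset JSON-LD record the Discovery Guide produces (field values here are invented placeholders; the engine's actual output includes additional biomedical fields):

```python
# Sketch of a schema.org/Dataset metadata record in JSON-LD. Embedding the
# serialized output in a <script type="application/ld+json"> block on a
# dataset's web page is what lets Google Dataset Search index it.
import json

dataset = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "name": "Example single-cell RNA-seq dataset",       # placeholder values
    "description": "Placeholder description of the dataset.",
    "identifier": "https://example.org/datasets/1234",
    "keywords": ["single-cell", "RNA-seq"],
    "license": "https://creativecommons.org/licenses/by/4.0/",
}

print(json.dumps(dataset, indent=2))
```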



P13
A Machine Learning tool can assign function to phage proteins

Subject: Machine learning

Presenting Author: Vito Cantu, San Diego State University, United States

Co-Author(s):
Robert Edwards, San Diego State University, United States
Anca Segall, San Diego State University, United States
Peter Salamon, San Diego State University, United States

Abstract:

Phages, or viruses that infect bacteria, are the most common biological entities on Earth. Yet we are unable to assign function to 50-90% of their genes. This is mainly because most methods for elucidating gene function are based on homology, but phage sequences are highly divergent as a consequence of the natural selection pressures imposed by the parasite-host/predator-prey relationship between phages and their obligatory cellular hosts. At the same time, viruses are highly effective agents of horizontal gene transfer, known to carry toxins and other virulence factors in both eukaryotes and prokaryotes. Nevertheless, phages across distinct groups encode analogous structural proteins that perform the same function.

Artificial Neural Networks (ANNs) are proven universal approximators of continuous functions, including the function that maps features extracted from a phage protein sequence to its structural class. In this work, we construct a manually curated database of phage structural proteins and use it to train a feed-forward ANN to assign any phage protein to one of eleven classes (ten structural plus "others"). Furthermore, we developed a web server where protein sequences can be uploaded for classification.
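
A sketch of the overall approach, with simple dipeptide counts standing in for the feature set (the published classifier's actual features, architecture, and training configuration are not described above, and the choices below are assumptions):

```python
# Sketch: featurize protein sequences as dipeptide frequencies and train a
# feed-forward network to predict one of eleven structural classes.
from itertools import product

from sklearn.neural_network import MLPClassifier

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 features

def featurize(seq: str) -> list[float]:
    """Dipeptide-frequency vector for one protein sequence."""
    counts = dict.fromkeys(DIPEPTIDES, 0)
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:          # skips ambiguous residues such as X
            counts[pair] += 1
    total = max(len(seq) - 1, 1)
    return [counts[k] / total for k in DIPEPTIDES]

# Placeholder training call: X would hold featurized curated proteins and y
# integer labels for the eleven classes (ten structural plus "others").
# clf = MLPClassifier(hidden_layer_sizes=(128,), max_iter=500).fit(X, y)
print(sum(featurize("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")))  # frequencies sum to 1
```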



P14
Identification of a lead compound for selective inhibition of Nav1.7 to treat chronic pain

Subject: other

Presenting Author: Sharat Chandra, Duke University, United States

Co-Author(s):
Andrey Bortsov, Duke University, United States

Abstract:

Chronic pain (CP) therapeutic approaches have limited efficacy; therefore, the development of effective and safe CP drugs remains an unmet medical need. Voltage-gated sodium (Nav) channels are molecular targets for cardiovascular and neurological disorders. Selective Nav channel inhibitors are hard to design because there are nine closely related isoforms (Nav1.1-1.9) that share protein sequence segments. We are targeting Nav1.7, which is found in the peripheral nervous system and engaged in the perception of pain. In this study, we designed a protocol for the identification of isoform-selective inhibitors of Nav1.7. First, a similarity search was performed; then the identified hits were docked into a binding site on the fourth voltage-sensor domain (VSD4) of Nav1.7. We used the FTrees tool for similarity searching and library generation; the generated library was docked into the VSD4 domain binding site using FlexX, and compounds were shortlisted using the FlexX score and SeeSAR HYDE scoring. Finally, the top 25 compounds were tested with molecular dynamics simulation (MDS). We reduced our list to 9 compounds based on the MDS root mean square deviation plot and obtained them from a vendor for in vitro and in vivo validation. Whole-cell patch-clamp recordings in HEK-293 cells and dorsal root ganglion neurons were conducted. One of the compounds reduced the peak sodium currents in a Nav1.7-HEK-293 stable cell line in a dose-dependent manner, with an IC50 value of 0.74 µM. In summary, our computer-aided analgesic discovery approach allowed us to develop a pre-clinical analgesic candidate with a significant reduction in time and cost.



P15
Deep Learning based Multi-view model for deciphering gene regulatory keywords

Subject: Machine learning

Presenting Author: Pramod Bharadwaj Chandrashekar, Arizona State University, United States

Co-Author(s):
Li Liu, Arizona State University, United States

Abstract:

Motivation: Biological processes such as cell growth, cell differentiation, development, and aging require a series of steps that are characterized by gene regulation. Studies have shown that gene regulation is key to various traits and diseases. Gene regulation is affected by various factors, including transcription factors, histone modifications, gene sequences, and mutations. Gene expression profiling can be used in clinical settings as prognostic, diagnostic, and therapeutic markers. Deciphering and cataloging these gene regulatory codes and their effects on expression levels is one of the key challenges in precision medicine and genetic research.

Results: In this study, we propose a novel multi-view deep learning tool that uses genetic and epigenetic markers to classify tissue-specific gene expression levels as high or low. We use the same model to untangle and visualize the regulatory codes that contribute to gene regulation. Our system achieved an F1-score of 0.805, outperforming existing methods. Our proposed model can not only identify highly enriched regions but also identify TF binding motifs within these regions. We believe that our model can help in detecting various mechanisms affecting gene regulation.



P16
Landmark and Cancer-Relevant Gene Selection of RNA Sequencing Data for Survival Analysis

Subject: other

Presenting Author: Carly Clayman, Penn State University - Great Valley, United States

Co-Author(s):
Satish Srinivasan, Penn State University - Great Valley, United States
Raghvinder Sangwan, Penn State University - Great Valley, United States

Abstract:

Dimensionality reduction methods are used to select relevant features, and clustering performs well when applied to data with low effective dimensionality. This study utilized clustering to predict categorical response variables using Illumina HiSeq RNA sequencing (RNA-Seq) data accessible through the National Cancer Institute Genomic Data Commons. The dimensionality of the dataset was reduced using several methods. One method selected genes for analysis using a set of landmark genes, which have previously been shown to predict expression of the remaining target genes with low error. Genes were also selected by mining cancer-relevant genes from the literature using the DisGeNET package in R. Groups within the dataset were characterized using clinical data to assess whether landmark genes would improve clustering results compared to established cancer-relevant genes from the literature. Cancer-relevant genes and landmark genes with the most significant correlations with the clinical outcome of overall survival were also assessed in Kaplan-Meier survival analysis. While individual gene expression levels, including TP53, and clinical variables were significant predictors of overall survival when assessed separately, the combination of genes along with clinical variables provided the most predictive power for overall survival. Important landmark genes selected by the Boruta random forest algorithm resulted in improved clustering consistent with high vs. low overall survival, compared to important disease-relevant genes. These findings indicate that dimensionality reduction techniques may allow for selection of features that are predictive of clinical outcomes for cancer patients. This study has implications for assessing gene-environment interactions for multiple cancer types.



P17
Polygenic Risk Score Knowledge Base: A Web-based Application for Calculating Polygenic Risk Scores

Subject: Data management methods and systems

Presenting Author: Matthew Cloward, Brigham Young University, United States

Co-Author(s):
Elizabeth Ward, Brigham Young University, United States
Louisa Dayton, Brigham Young University, United States
Joseph Peterson, Brigham Young University, United States
Justin Miller, Brigham Young University, United States
John Kauwe, Brigham Young University, United States

Abstract:

Large genetic cohorts have established phenotype-specific databases consisting of whole-genome sequencing data from thousands of individuals. However, confounding effects of elevated genetic risk for other diseases in these datasets often go unnoticed. Therefore, there exists a critical need to evaluate the overall genetic risk of these individuals to ensure the integrity of these large cohorts. A polygenic risk score is used to evaluate an individual's overall genetic risk of developing a disease. Individuals can be subtyped based on polygenic risk scores across multiple diseases, removing confounding effects for drug trials and genome-wide association (GWA) studies. We developed a web server and application programming interface (API) to calculate polygenic risk scores for various diseases. Our server connects to a database of odds ratios from a variety of large genome-wide association studies that were manually curated to provide the most comprehensive and complete analysis of genetic risk. We allow users to calculate multiple polygenic risk scores from a single input file, which enables a fast, comprehensive analysis of genome-wide variants. We also provide an option for users to upload their own GWA studies to be included in the knowledge base after manual validation. We provide a variety of options on the web server, including disease specification, study selection, p-value cutoff, reference genome conversion, and download options with various formats. By simplifying the process of calculating polygenic risk scores, we anticipate that researchers will be able to identify confounding variables, cohort subtypes, and individuals with elevated genetic risk for developing disease.
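
For readers unfamiliar with the computation, the score itself is a weighted allele count. In standard notation (our sketch; the server's exact handling of clumping, strand flips, and missing genotypes is not described above):

$\mathrm{PRS}_i = \sum_{j=1}^{m} \beta_j \, x_{ij}, \qquad \beta_j = \ln(\mathrm{OR}_j), \quad x_{ij} \in \{0, 1, 2\},$

where $x_{ij}$ is the number of risk alleles individual $i$ carries at variant $j$, $\beta_j$ is taken from the curated GWA study, and the sum runs over the $m$ variants passing the chosen p-value cutoff.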



P18
A comprehensive analysis of orthologous genes across all domains

Subject: inference and pattern discovery

Presenting Author: Lauren Cutler, Brigham Young University, United States

Co-Author(s):
Justin Hunt, Brigham Young University, United States
Justin Miller, Brigham Young University, United States
Lauren McKinnon, Brigham Young University, United States
John Kauwe, Brigham Young University, United States
Perry Ridge, Brigham Young University, United States

Abstract:

Orthologous genes are central to recovering phylogenetic trees and inferring protein function, yet identifying most orthologous relationships relies heavily on multiple sequence alignments. However, we have previously shown that other systematic biases in orthologs exist, including biases in coding sequence length and dinucleotide percentages. Here, we present a comprehensive analysis of genetic sequence features in all orthologs in the NCBI database in order to quantify the extent of variation that exists. Specifically, we conducted analyses of sequence lengths, numbers of coding regions, lengths of coding regions, nucleotide percent composition, k-mer composition, nucleotide and amino acid percent identity, codon aversion, codon pairing, ramp sequences, and phylogenetic tree construction in each orthologous group. We calculated statistics for each of these analyses for 356,994 orthologs in 23,428 species across all domains of life. These analyses show that the number of species per orthologous group is highly variable for archaea, bacteria, fungi, and viruses. Archaeal, bacterial, and viral orthologs usually contain only one coding sequence, whereas vertebrate orthologs average more than nine coding sequences. Coding sequence lengths are generally greatest in archaea and bacteria, while nucleotide and amino acid percent identity were highest among vertebrates. We anticipate that these analyses will aid future ortholog annotation. We also expect that these data will provide a useful reference for researchers to compare divergence rates in ortholog evolution and to identify potentially mislabeled orthologs.



P19
Using Mendelian Randomization to Assess Disease Causality

Subject: inference and pattern discovery

Presenting Author: Louisa Dayton, Brigham Young University, United States

Co-Author(s):
Justin Miller, Brigham Young University, United States
Matthew Cloward, Brigham Young University, United States
Robert Seymour, Brigham Young University, United States
Monica Mackay, Brigham Young University, United States
Keoni Kauwe, Brigham Young University, United States
Elizabeth Ward, Brigham Young University, United States

Abstract:

Mendelian randomization identifies causal relationships between genetic variation and disease by using genetic variants as instruments for a modifiable exposure. Although many genetic variants are associated with Alzheimer's disease, the extent to which other diseases contribute to cognitive decline remains elusive. Therefore, we developed a technique to assess the contributions of other diseases to Alzheimer's disease by using polygenic risk scores and Mendelian randomization. Our model first assesses the genetic risk of all individuals in the Alzheimer's Disease Genetics Consortium (ADGC) and bins individuals into high-risk groups (i.e., the top 10% of all individuals in ADGC) based on their polygenic risk scores. Bins are created for the overall genetic risk of Alzheimer's disease, amyotrophic lateral sclerosis, coronary heart disease, attention-deficit/hyperactivity disorder, and depression. Patients belonging to multiple high-risk bins are then evaluated using Mendelian randomization to determine causal relationships between the diseases. We built a model for analysis in R using the standard errors of the log odds ratios included in the genome-wide association studies used to calculate the risk scores. MR-Egger is then used to evaluate causality between diseases. We anticipate that this method will identify causal relationships between diseases.
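
A minimal sketch of the MR-Egger step (the study's model is implemented in R; this Python version with placeholder numbers only shows the regression at its core, assuming per-variant effect estimates have already been harmonized):

```python
# MR-Egger sketch: regress variant-outcome effects on variant-exposure
# effects with a free intercept, weighting by inverse variance. All numbers
# are placeholders, not estimates from the study.
import numpy as np
import statsmodels.api as sm

beta_exposure = np.array([0.08, 0.12, 0.05, 0.20, 0.15])  # variant -> exposure disease
beta_outcome = np.array([0.03, 0.05, 0.01, 0.09, 0.06])   # variant -> Alzheimer's disease
se_outcome = np.array([0.01, 0.02, 0.01, 0.03, 0.02])     # SEs of the outcome effects

# The slope estimates the causal effect; an intercept significantly
# different from zero flags directional pleiotropy.
fit = sm.WLS(beta_outcome, sm.add_constant(beta_exposure),
             weights=1.0 / se_outcome**2).fit()
print(fit.params)   # [intercept, slope]
print(fit.pvalues)
```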



P20
Poster Withdrawn


P21
A Novel One-Class classification Approach to Accurately Predict Disease-Gene Association

Subject: Machine learning

Presenting Author: Abdollah Dehzangi, Morgan State University, United States

Co-Author(s):
Akram Vasighizaker, Tarbiat Modares University, Iran
Alok Sharma, RIKEN Center for Integrative Medical Sciences, Japan

Abstract:

Disease-causing gene identification is considered an important step toward drug design and drug discovery. In disease gene identification and classification, the main aim is to identify disease genes, while identifying non-disease genes is of little or no significance. Hence, this task can be defined as a one-class classification problem. Existing machine learning methods typically take known disease genes as the positive training set and unknown genes as negative samples to build a binary-class classification model. Here we propose a new One-Class Support Vector Machine (OCSVM) method to precisely classify candidate disease genes. Our aim is to build a model that concentrates on detecting known disease-causing genes to increase sensitivity and precision. We investigate the impact of our proposed model using a benchmark consisting of a gene expression dataset for acute myeloid leukemia (AML). Compared with state-of-the-art methods, our experimental results show the superiority of our proposed method in terms of precision, recall, and F-measure for detecting disease-causing genes for AML. The OCSVM code and our extracted AML benchmark are publicly available at: https://github.com/imandehzangi/OCSVM.
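
A minimal sketch of the one-class setup with scikit-learn (features and hyperparameters here are illustrative assumptions; see the authors' repository for their actual code):

```python
# One-class sketch: train only on known disease genes (the positive class)
# and rank all candidate genes by decision-function score. The feature
# matrices below are random placeholders, not AML expression data.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_known_disease = rng.normal(0.0, 1.0, size=(100, 20))  # expression features
X_candidates = rng.normal(0.5, 1.0, size=(500, 20))

ocsvm = OneClassSVM(kernel="rbf", nu=0.1, gamma="scale").fit(X_known_disease)

# Higher scores = more similar to known disease genes.
scores = ocsvm.decision_function(X_candidates)
ranked = np.argsort(scores)[::-1]   # candidate genes ranked by similarity
print(ranked[:10])
```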



P22
Pipelines, Workflows and Virtualization to Build Institutional Informatics Capacity

Subject: Data management methods and systems

Presenting Author: Aaron Dickey, United States Department of Agriculture, United States

Co-Author(s):
Danny Nonneman, United States Department of Agriculture, United States
Harvey Freetly, United States Department of Agriculture, United States
Aspen Workman, United States Department of Agriculture, United States
Larry Kuehn, United States Department of Agriculture, United States

Abstract:

Volmers et al. 2017 defined the "bioinformatics middle class" as comprising competent and informed users rather than tool developers. Increasingly, these middle-class bioinformaticians are being employed in supporting roles where they collaborate to advance the research programs of multiple principal investigators across an institution by helping them access, manipulate, and analyze large datasets. The daily routine of the middle-class bioinformatician may vary with individual strengths as well as client and institutional needs, but will often comprise a variety of activities, such as delivering training, scripting, developing pipelines, and curating databases. Middle-class bioinformaticians occupy a similar role to departmental statisticians and may face some of the same professional challenges; among them, maintaining an active publication record.

A data analysis pipeline is an end-to-end, multi-step data management solution in which the data are not manually inspected between steps. In contrast, a user interacts with the data at each step of a data analysis workflow. Either methodological class can take advantage of computer platform virtualization. The purpose of this presentation is to summarize multiple research projects where the bioinformatic support took the form of a pipeline or workflow, with the goal of highlighting differences between the two approaches. The use of virtualization in different projects is also highlighted. Workflows have greater real-time flexibility for integrating custom statistics and outputs, whereas pipelines offer greater speed.



P23
Transcriptional changes observed in a mouse contained infection model of TB identify human LTB+ individuals at low risk of progression to active disease

Subject: other

Presenting Author: Fergal Duffy, Seattle Children's Research Institute, United States

Co-Author(s):
Johannes Nemeth, University Hospital Zurich, Switzerland
Courtney Plumlee, Seattle Children's Research Institute, United States
Alan Diercks, Seattle Children's Research Institute, United States
Alan Aderem, Seattle Children's Research Institute, United States
Kevin Urdahl, Seattle Children's Research Institute, United States
John Aitchison, Seattle Children's Research Institute, United States

Abstract:

Whole-blood transcriptional signatures of TB risk have been described previously, comprising interferon response genes upregulated in disease. Applying these signatures to animal TB challenge models revealed that they primarily correlate with bacterial burden rather than any protective phenotype. Recently, we developed a contained-infection TB mouse model that is protective against TB re-challenge. By dissecting transcriptional responses in these mice and contrasting them with transcriptional profiles of human active and latent TB (LTB), we aimed to discover correlates of natural immune protection against TB.

We obtained previously published blood transcriptional profiles from individuals with active TB, individuals with LTB, and healthy individuals, along with longitudinal profiles from LTB+ individuals, some of whom later progressed to active TB. Contained mouse infection was established by injection of TB into the mouse ear, and whole-blood RNAseq was performed 10, 28, and 42 days post infection.

Hierarchical clustering of differentially expressed genes revealed that contained-infection mice represented a transcriptional phenotype intermediate between human active TB and healthy individuals. Protection-associated genes were identified as genes with significantly altered expression patterns in both human LTB and mouse contained infection relative to human active TB. These protective genes significantly discriminated LTB+ control individuals from TB progressors in the independent adolescent cohort study dataset.

This signature is the first to explicitly harness a natural protective phenotype discovered in an animal model of TB and subsequently validated in a large human cohort. Applying mechanistic insights from animal models directly to human TB cohorts represents a promising future direction for the TB field.



P24
Apollo: an efficient tool to collaboratively refine and attribute genome-wide genomic annotations

Subject: Graphics and user interfaces

Presenting Author: Nathan Dunn, University of California, Berkeley, United States

Co-Author(s):
Nomi Harris, Lawrence Berkeley National Lab, United States
Colin Diesh, University of California, Berkeley, United States
Robert Buels, University of California, Berkeley, United States
Ian Holmes, University of California, Berkeley, United States

Abstract:

Dissemination of the most accurate genome annotations is important for providing an understanding of biological function. An important final step in this process is the manual assessment and refinement of genome annotations. Apollo (https://github.com/GMOD/Apollo/) is a real-time collaborative web application (think Google Docs) used by hundreds of genome annotation projects around the world, ranging from single-species projects to lineage-specific efforts supporting the annotation of dozens of genomes, as well as several endeavors focused on undergraduate and high school education.

To support efficient curation, Apollo offers drag-and-drop editing, a large suite of automated structural edit operations, the ability to pre-define curator comments and annotation statuses to maintain consistency, attribution of annotation authors, fine-grained user and group permissions, and a visual history of revertible edits. Additionally, Apollo is built atop the dynamic genome web browser JBrowse (http://jbrowse.org/), which is performant, customizable, and has a large registry of plugins (https://gmod.github.io/jbrowse-registry/).

The most recent Apollo enhancements have focused on automated upload of genomes (FASTA) and genomic evidence (GFF3, FASTA, BAM, VCF) to make annotations readily available for group annotation, the ability to manually test variant effects when annotating variants, and the annotation and export of gene ontology (GO) terms.



P25
Global phylogeography and ancient evolution of the widespread human gut virus crAssphage

Subject: Metagenomics

Presenting Author: Robert Edwards, San Diego State University, United States

Co-Author(s):
Bas Dutilh, Universiteit Utrecht, Netherlands

Abstract:

Microbiomes are vast communities of microbes and viruses that populate all natural ecosystems. Viruses have been considered the most variable component of microbiomes, supported by virome surveys and examples of high genomic mosaicism. However, recent evidence suggests that the human gut virome is remarkably stable compared to other environments. Here we investigate the origin, evolution, and epidemiology of crAssphage, a widespread human gut virus. Through a global collaboratory, we obtained DNA sequences of crAssphage from over one-third of the world's countries, showing that its phylogeography is locally clustered within countries, cities, and individuals. We also found colinear crAssphage-like genomes in wild Old World and New World primates, challenging rampant viral genomic mosaicism and suggesting that the association of crAssphage with hominids may be millions of years old. We conclude that crAssphage is a benign, globetrotting virus that has co-evolved with the human lineage and is an integral part of the healthy human gut virome.



P26
Urine pellet heterogeneity requires meticulous balancing of RNAseq libraries

Subject: Instrumentation interfaces and data acquisition

Presenting Author: Felix Eichinger, University of Michigan, United States

Co-Author(s):
Bradley Godfrey, University of Michigan, United States
Celine C. Berthier, University of Michigan, United States

Abstract:

RNAseq of urine has broad applications across a variety of diseases. As a "liquid biopsy," it promises insight into internal processes as well as noninvasive access to biomarkers in cancer, cardiovascular disease, or kidney disease. However, due to both technical variability and effects of the disease process, the amount of material in the urine is highly variable, leading to up to 100-fold differences in read numbers when using a standard processing pipeline. This greatly reduces efficiency and reproducibility, and ultimately lowers confidence in the validity of results. An additional quality control step used to balance the libraries is key to generating consistent readouts. With a standard protocol that included measuring and balancing the libraries based on the Agilent TapeStation and qPCR, a dataset of 45 urine pellets showed high read quality but highly variable results: a median of 46 million (mio) reads, ranging from 1.5 to 154 mio reads. Inclusion of a MiSeq run, which offers a more complete representation of the anticipated sequencing results, allows for improved re-balancing prior to the full sequencing run. A trial dataset with 6 pellets from urine samples showed greatly decreased heterogeneity: with a median of 14.6 mio reads, the range was reduced to 11.3-35.5 mio reads. This represents a reduction from a 100-fold to a 3-fold difference in read numbers, enabling meaningful sample comparisons.



P27
Using Mutual Information to Validate Functional Interactions Between Clusterin and Amyloid Precursor Protein

Subject: inference and pattern discovery

Presenting Author: Austin Gale, Brigham Young University, United States

Co-Author(s):
Keoni Kauwe, BYU, United States
Justin Miller, BYU, United States
Katrisa Ward, BYU, United States

Abstract:

Functional and physical interactions between protein residues facilitate almost all biological processes. Mutual information is often used to identify these functional interactions through protein coevolution. We anticipate that disease-causing variants alter these functional interactions and may provide an additional method for analyzing disease. We recognize that mutual information might not identify the top interacting residues, because residues must co-evolve, and unchanging residues do not produce high mutual information. However, residues in close proximity (e.g., within five residues) are likely to be co-inherited, so coevolving networks can provide researchers with a general location of functional interactions. The Clusterin (CLU) protein (length=501 amino acids) is known to directly interact with the amyloid precursor protein (APP; length=660 amino acids) and to be directly implicated in Alzheimer's disease. We calculated the mutual information for each pair of residues in CLU and APP to determine the extent to which these proteins coevolve. Specifically, we analyzed the residue affected by rs7982 in CLU, which has previously been significantly associated with increased Aβ deposition from the sequential cleavage of APP. We found that the highest predicted functional interaction between CLU and APP was within two amino acids of rs7982 and located on the same exon as rs7982. Although our prediction was a post hoc validation of a previously described interaction, it shows the utility of mutual information in predicting regions of functional interactions that may be associated with disease. This technique can be applied to future disease research by using amino acid coevolution to identify protein functional interactions disrupted by disease-associated variants.
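
The column-wise mutual information underlying this analysis can be sketched compactly (a toy illustration; the real computation runs over all CLU x APP residue-column pairs from a paired multiple sequence alignment):

```python
# Mutual information between two alignment columns (one row per species).
# High MI indicates the residues covary across species, i.e. coevolve.
from collections import Counter
from math import log2

def mutual_information(col_a: str, col_b: str) -> float:
    """MI (in bits) between two equal-length alignment columns."""
    n = len(col_a)
    pa = Counter(col_a)
    pb = Counter(col_b)
    pab = Counter(zip(col_a, col_b))
    return sum(
        (c / n) * log2((c / n) / ((pa[a] / n) * (pb[b] / n)))
        for (a, b), c in pab.items()
    )

print(mutual_information("LLIV", "FFYW"))  # perfectly covarying toy columns -> 1.5 bits
```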



P28
Computational Discovery of Novel Phages in the Human Gut Metagenome

Subject: Metogenomics

Presenting Author: Melissa Giluso, San Diego State University, United States


Abstract:

Viruses are the most abundant biological entities on the planet, but the high variability between viral sequences and the lack of known hosts have left a large majority of these sequences undocumented using traditional laboratory techniques. Through metagenomic sampling and computational analysis, it is now possible for DNA sequences to be isolated and identified as phages. This was proven in the 2014 discovery of crAssphage, a highly abundant bacteriophage present in the majority of human guts. The amount of publicly available sequencing data has since grown exponentially, and there are now over 4,000 human fecal metagenomes available in the Sequence Read Archive (SRA). We developed a computational pipeline to extract data from the SRA, perform a cross-assembly, identify abundant, co-occurring contigs, and sort these contigs into genome bins. The contigs contained in these genome bins can then be compared to a database of all phage genes documented to date. All of the metagenomes in the SRA can then be scanned for this collection of contigs, and runs containing the contigs most similar to phage genes are repeatedly put through the pipeline. This method exploits the fact that contigs belonging to the same phage-like entity will occur together, and cross-assembly of runs containing the largest abundance of such contigs should reveal the complete genome. These methods are promising for the discovery of novel phages in the human gut, which can improve our knowledge of human health.



P29
Benchmarking Viral Identification Tools in Complex Simulated Metagenomes

Subject: Machine learning

Presenting Author: Cody Glickman, University of Colorado Anschutz, United States

Co-Author(s):
Michael Strong, National Jewish Health, United States

Abstract:

Bacteriophages are environmental viruses that infect bacteria and can integrate into bacterial genomes. Bacteriophages are capable of transferring bacterial genes between hosts in a process termed transduction. While transduction is a rare event, given the sheer number of interactions between bacteriophages and bacteria, this phenomenon is estimated to occur frequently in the environment. To examine bacteriophages and the associated hitchhiking bacterial genes, we can interrogate shotgun metagenomic sequencing data, since bacteriophages lack a universal marker gene for targeted amplification. Identification of bacteriophages in shotgun metagenomic sequencing data has traditionally relied on homology-based methods. Recently, a machine learning approach using sequence-based k-mers as features was shown to perform well at identifying short viral sequences, including novel viral sequences. Our group has developed a hybrid viral identification methodology that uses an initial homology-based filtration step followed by machine learning, using a combination of sequence-based k-mers and protein features. We have tested our method against other methods in the field at identifying bacteriophages in simulated metagenomes with varying complexity and distributions of reads between species. Our study benchmarks leading viral identification tools on simulated metagenomes to determine how such tools behave in a variety of circumstances. Benchmarking in silico viral identification tools has the potential to facilitate the discovery of a large number of new viruses previously hidden within mixed metagenomes. These discoveries will advance our understanding of transduction in disease states and further our understanding of the relationships between bacteriophages and their bacterial hosts.
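
The k-mer featurization common to these machine learning identifiers can be sketched as follows (k=4 and the normalization are illustrative assumptions; the benchmarked tools differ in their exact feature sets):

```python
# Sketch: turn a nucleotide contig into a normalized k-mer count vector,
# which then feeds a downstream classifier.
from itertools import product

K = 4
KMERS = ["".join(p) for p in product("ACGT", repeat=K)]
INDEX = {kmer: i for i, kmer in enumerate(KMERS)}

def kmer_vector(contig: str) -> list[float]:
    """Normalized 4-mer frequency vector for one contig."""
    counts = [0] * len(KMERS)
    for i in range(len(contig) - K + 1):
        idx = INDEX.get(contig[i:i + K])
        if idx is not None:      # skips k-mers with N or other ambiguity codes
            counts[idx] += 1
    total = sum(counts) or 1
    return [c / total for c in counts]

print(len(kmer_vector("ACGTACGTAGCTAGCTAACGT")))  # 256 features
```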



P30
The design of an interactive lung map for studying premalignant lesions in the lung over time

Subject: Graphics and user interfaces

Presenting Author: Carsten Görg, Colorado School of Public Health, United States

Co-Author(s):
Wilbur Franklin, University of Colorado School of Medicine, United States
Daniel Merrick, University of Colorado School of Medicine, United States

Abstract:

Lung squamous cell carcinoma and adenocarcinoma are the most common subtypes of lung cancer; both are associated with recognized and unique premalignant lesions, and their pre-cancers have distinct histologic appearances, tissue distribution, and molecular driver events. To facilitate a comprehensive analysis of these precancerous lesions, and ultimately the understanding of mechanisms of progression and the identification of risk markers as well as targets for inhibition, we designed an interactive lung map that represents the lesions in the context of anatomic findings, genomic and microenvironmental features, and the patient's overall clinical history. The map design supports two use cases: (1) pathologists and radiologists, usually analyzing one patient at a time, can explore the spatial and temporal context of multiple identified lesions by comparing lesions at different sites to each other as well as comparing the progression of lesions over time; (2) for secondary research purposes, users can utilize histologic diagnoses, genomic features, and data elements in the clinical history to define patient cohorts and study the heterogeneity and progression of lesions within and across cohorts. Using either histologic or radiographic images, the maps will provide a visual tool to facilitate understanding of anatomic and temporal relationships between lesions. Our lung map design will be implemented as part of the analysis component of a Data Commons for the NCI's Lung Pre-Cancer Atlas project to facilitate the analysis of samples from both retrospective and prospective studies.



P31
pipelineTools: RStudio based NGS workflow, reporting and teaching system

Subject: Data management methods and systems

Presenting Author: Graham Hamilton, University of Glasgow, United Kingdom


Abstract:

The R package pipelineTools has been designed to combine open-source command line software and R software in RStudio, creating an environment that streamlines the building and automation of workflows and the generation of reports for NGS data analysis. pipelineTools has a modular design, with each R function designed to run a command line tool through a common interface shared by all the data analysis software. Each function provides a core set of parameters, which can readily be extended to include more rarely used or pipeline-specific software settings. Currently, pipelineTools has functions for sequence quality control, read mapping, and alignment reporting and visualisation using ggplot2.

Analysis templates are provided for several workflows, initially RNASeq, microRNASeq, and ribosomal profiling. All workflows have been designed to perform quality control checks on the raw sequence data (FastQC and fastq-screen), sequence quality and adapter trimming (Cutadapt and fastp), and alignment quality checks using MultiQC and Picard tools. The commands used for each step in the workflow are written to a separate per-step file, for inclusion in publications. This approach has proven to save a considerable amount of time in the creation of analysis reports. The package has been designed to be used in RStudio on either a local computer or a server, depending on the amount of data to be analysed.



P32
Hetnet connectivity search provides rapid insights into how two biomedical entities are related

Subject: Graph Theory

Presenting Author: Daniel Himmelstein, University of Pennsylvania, United States

Co-Author(s):
Michael Zietz, Columbia University, United States
Vincent Rubinetti, University of Pennsylvania, United States
Benjamin Heil, University of Pennsylvania, United States
Kyle Kloster, North Carolina State University, United States
Michael Nagle, Pfizer, United States
Blair Sullivan, University of Utah, United States
Casey Greene, University of Pennsylvania, United States

Abstract:

Hetnets, short for "heterogeneous networks", contain multiple node and relationship types and offer a way to encode biomedical knowledge. For example, Hetionet connects 11 types of nodes — including genes, diseases, drugs, pathways, and anatomical structures — with over 2 million edges of 24 types. Previously, we trained a classifier to repurpose drugs using features extracted from Hetionet. The model identified types of paths between a drug and a disease that occurred more frequently between known treatments.

For many applications, however, a training set of known relationships does not exist, yet researchers would still like to know how two nodes are meaningfully connected. For example, users may want to know not only how metformin is related to breast cancer, but also how the GJA1 gene might be involved in insomnia. Therefore, we developed hetnet connectivity search to propose the most important paths between any two nodes.

The algorithm behind connectivity search identifies types of paths that occur more frequently than would be expected by chance (based on node degree alone). We implemented the method on Hetionet and provide an online interface at https://het.io/search. Several optimizations were required to precompute significant instances of node connectivity at scale. We provide an open-source implementation of these methods in our new Python package, hetmatpy.

To validate the method, we show that it identifies much of the same evidence for specific instances of drug repurposing as the previous supervised approach, but without requiring a training set.
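
The path-counting core of the approach can be illustrated with plain adjacency matrices (a toy sketch; as we understand it, hetmatpy additionally applies degree-based damping to compute degree-weighted path counts and derives null distributions, which this omits):

```python
# Toy sketch: the number of paths following the metapath
# Compound-Gene-Disease between each compound and disease is one matrix
# product of the per-edge-type adjacency matrices.
import numpy as np

compound_gene = np.array([[1, 1, 0],      # 2 compounds x 3 genes
                          [0, 1, 1]])
gene_disease = np.array([[1, 0],          # 3 genes x 2 diseases
                         [1, 1],
                         [0, 1]])

path_counts = compound_gene @ gene_disease  # 2 compounds x 2 diseases
print(path_counts)  # entry [i, j] = number of C-G-D paths from compound i to disease j
```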



P33
CUBAP: An Interactive Web Portal for Analyzing Codon Usage Bias Across Populations

Subject: Data management methods and systems

Presenting Author: Matthew Hodgman, Brigham Young University, United States

Co-Author(s):
Justin Miller, Brigham Young University, United States
Taylor Meurs, Brigham Young University, United States
John Kauwe, Brigham Young University, United States

Abstract:

Although synonymous codon usage does not change the protein primary structure, codon choice has significant implications for overall translational efficiency, gene expression, protein folding, and RNA secondary structure. Despite recent advances in analyzing synonymous codon usage, population-specific differences in codon usage remain largely unexplored. We present a web server, Codon Usage Bias Across Populations (CUBAP), that facilitates analyses of codon usage biases across different human population groups. Using data from the 1000 Genomes Project, we calculated the frequencies of codon usage and codon pairing in 17,635 genes and their isoforms. We present comprehensive comparisons of codon usage across 2,504 individuals spanning 26 subpopulations in five superpopulations. We allow users to perform adaptive gene and population queries online at www.bit.ly/cubap. Users can interact with graphics on the website to view any human gene and analyze its codon usage, codon aversion, identical codon pairing, and co-tRNA codon pairing. The specific frequencies can be compared across different populations to identify global trends. These analyses enable researchers to quickly identify potential implications of synonymous variants identified in genome-wide association studies, the effects of codon usage biases on genetic diseases and disorders, codon usage dynamics within specific genes, and biases associated with ancestry and population stratification. This database will enable more in-depth analyses of the selective pressures that act on codon usage biases, why they exist, and how those pressures vary between populations.
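
For readers unfamiliar with the underlying computation, a minimal Python sketch of per-gene codon usage frequencies follows; CUBAP computes these (plus codon pairing and aversion) per isoform and population, which this sketch does not attempt:

from collections import Counter

def codon_frequencies(cds):
    """Relative codon usage from one coding sequence (toy sketch)."""
    cds = cds.upper().replace("U", "T")
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    counts = Counter(codons)
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()}

# Hypothetical example sequence
print(codon_frequencies("ATGGCTGCAGCTTAA"))
# {'ATG': 0.2, 'GCT': 0.4, 'GCA': 0.2, 'TAA': 0.2}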



P34
Food Preservatives Induce Proteobacteria Dysbiosis in Human-Microbiota Associated Nod2-deficient Mice

Subject: Metagenomics

Presenting Author: Tomas Hrncir, The Czech Academy of Sciences, Czech Republic

Co-Author(s):
Lucia Hrncirova, The Czech Academy of Sciences, Czech Republic
Vladimira Machova, The Czech Academy of Sciences, Czech Republic
Eva Trckova, The Czech Academy of Sciences, Czech Republic

Abstract:

The worldwide incidence of many immune-mediated and metabolic diseases, initially affecting only the wealthy Western countries, is increasing rapidly. Many of these diseases are associated with compositional and functional alterations of the gut microbiota, i.e., dysbiosis. The most consistent markers of dysbiosis are a decrease in microbiota diversity and an expansion of Proteobacteria. The role of food preservatives as potential triggers of gut microbiota dysbiosis has long been overlooked. Using a human microbiota-associated mouse model, we demonstrate that a mixture of common antimicrobial food additives induces dysbiosis characterised by an overgrowth of the Proteobacteria phylum and a decrease in the Clostridiales order. Remarkably, human gut microbiota in a Nod2-deficient genetic background is even more susceptible to the induction of Proteobacteria dysbiosis by additives than the microbiota in a wild-type background. To conclude, our data demonstrate that antimicrobial food additives trigger gut microbiota dysbiosis in both wild-type and Nod2-deficient backgrounds, and at the exposure levels reached in European populations. Whether this additive-modified gut microbiota plays a significant role in the pathogenesis of immune-mediated and metabolic diseases remains to be elucidated.



P35
Predicting Clinical Dementia Rating Using Blood RNA Levels

Subject: Machine learning

Presenting Author: Erik Huckvale, Brigham Young University, United States

Co-Author(s):
Justin Miller, Brigham Young University, United States
John Kauwe, Brigham Young University, United States

Abstract:

INTRODUCTION: The Clinical Dementia Rating (CDR) is commonly used to assess cognitive decline in Alzheimer's disease patients.

METHODS: We divided 741 participants with blood microarray data in the Alzheimer's Disease Neuroimaging Initiative (ADNI) into three groups based on their most recent CDR assessment: cognitively normal (CDR=0), mild dementia (CDR=0.5), and probable AD (CDR≥1.0). We then used machine learning to predict cognitive status using only blood RNA levels.

RESULTS: One chloride intracellular channel 1 (CLIC1) probe was significant. By combining nonsignificant probes with p-values less than 0.1, we achieved an average predictive accuracy of 87.87% (s = 1.02) in classifying the three groups, compared to a 55.46% baseline for this study.

DISCUSSION: We identified one significant probe in CLIC1. However, CLIC1 levels alone were not sufficient to determine dementia status. We propose that combining individually suggestive, but nonsignificant, blood RNA levels can significantly improve diagnostic accuracy.
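
A minimal scikit-learn sketch of the probe-pooling strategy described above, on synthetic data standing in for the ADNI microarray probes (the group labels and classifier choice are assumptions, not the study's exact setup):

import numpy as np
from scipy.stats import f_oneway
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(741, 500))    # synthetic stand-in for microarray probes
y = rng.integers(0, 3, size=741)   # 0: CDR=0, 1: CDR=0.5, 2: CDR>=1.0

# Pool probes whose across-group ANOVA p-value falls below 0.1, mirroring
# the strategy of combining individually suggestive, nonsignificant probes.
# (In a real analysis this filter must be nested inside cross-validation
# to avoid selection leakage; it is applied globally here for brevity.)
pvals = np.array([f_oneway(*(X[y == g, j] for g in range(3))).pvalue
                  for j in range(X.shape[1])])
X_sel = X[:, pvals < 0.1]

clf = RandomForestClassifier(n_estimators=200, random_state=0)
print("mean CV accuracy:", cross_val_score(clf, X_sel, y, cv=5).mean())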



P36
Polly Discover: Integration of public omics data and metadata using machine learning to enable asset discovery

Subject: Machine learning

Presenting Author: Brian Dranka, Elucidata, United States

Co-Author(s):
Saksham Malhotra, Elucidata, India
Shashank Jatav, Elucidata, India
Swetabh Pathak, Elucidata, India
Soumya Luthra, Elucidata, India

Abstract:

INTRODUCTION
There is a wealth of omics data available in the public domain, but extracting information from it to generate hypotheses is an unsolved problem. Extracting insights from datasets takes a long time, and many important insights are missed because of the huge number of studies available.

METHODS and RESULTS
Here we present a technology, Polly Discover, which integrates multiple omics datasets on a large scale to generate actionable insights in a biological context. As part of Polly Discover, we have built and curated a context-specific data lake with more than 40,000 transcriptomics datasets from a variety of sources, including GEO, and we mine it for insights using machine learning models.

We have used the WGCNA algorithm and NLP to build a knowledge graph that can be used to find known and novel connections between biological entities. This, in turn, can be used as a hypothesis-generation tool to improve our understanding of complex diseases and enable asset discovery.

As a validation, Polly Discover has recovered known and novel insights about the hypoxia signature and the LMNA-knockdown signature in muscular dystrophy.



P37
Hypergraph Analytics for Computational Virology

Subject: Graph Theory

Presenting Author: Cliff Joslyn, Pacific Northwest National Laboratory, United States

Co-Author(s):
Emilie Purvine, Pacific Northwest National Laboratory, United States
Brett Jefferson, Pacific Northwest National Laboratory, United States
Brenda Praggastis, Pacific Northwest National Laboratory, United States
Song Feng, Pacific Northwest National Laboratory, United States
Hugh Mitchell, Pacific Northwest National Laboratory, United States
Jason McDermott, Pacific Northwest National Laboratory, United States

Abstract:

Multi-omic data sets capture multiple complex and indirect interactions and networked structures. In virology, for example, measured changes in host protein levels in response to viral infection, or experimentally identified protein complexes and pathways, contain multi-way interactions among collections of proteins as evidenced across multiple experimental conditions and pathways. Graphs are a standard tool to represent such connected interactions, but both mathematically and methodologically, graphs natively represent only pairwise interactions, for example between pairs of proteins in protein-protein interaction networks. Representing multi-way interactions in graphs requires additional coding, which is of sufficient burden that higher-order interactions above "primary effects" are commonly ignored. Hypergraphs are mathematical structures that generalize graphs precisely to represent such multi-way interactions natively. This talk will explore our recent work on how analogs of traditional network-science concepts, like centrality and spectral clustering, can be used in the context of hypergraphs for discovery of central biological pathways, characterization of unknown transcription factors, and comparison of responses to viral infections with different pathogenesis.
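
A minimal sketch of the native hypergraph encoding the abstract argues for, in plain Python (the protein and edge names are invented; the authors' work uses far richer centrality and spectral notions than the degree and s-adjacency shown here):

# Hyperedges as named sets of proteins: each multi-way interaction is kept
# intact instead of being expanded into pairwise edges
hyperedges = {
    "complex_A": {"P1", "P2", "P3"},
    "pathway_B": {"P2", "P4", "P5", "P6"},
    "response_C": {"P3", "P4"},
}

def node_degree(h):
    """Number of hyperedges each node participates in."""
    deg = {}
    for members in h.values():
        for v in members:
            deg[v] = deg.get(v, 0) + 1
    return deg

def s_adjacent(h, e1, e2, s=1):
    """Hyperedges are s-adjacent if they share at least s nodes; s-walks
    built on this relation underlie hypergraph analogs of connectivity."""
    return len(h[e1] & h[e2]) >= s

print(node_degree(hyperedges))
print(s_adjacent(hyperedges, "complex_A", "pathway_B"))   # True: share P2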



P38
Analyzing the Signature of GPCR Conformational Changes

Subject: Simulation and numeric computing

Presenting Author: Rafeed Khleif, California State University Northridge, United States

Co-Author(s):
Ravinder Abrol, California State University Northridge, United States
Erik Serrano, California State University Northridge, United States

Abstract:

G protein-coupled receptors (GPCRs) are integral membrane proteins that allow a cell to convert extracellular stimuli like light, small molecules, peptides, and proteins into intracellular signals. These receptors are conformationally very malleable, which is necessary for their different functional states. Recent developments in the structural biology of membrane proteins and the utilization of other biophysical techniques for GPCRs are generating rich evidence of the existence of multiple receptor conformations. Structures are being generated in the inactive, active, and intermediate states for many GPCRs; however, there is no unbiased way to characterize the functional states of GPCRs. We are developing general higher-order GPCR topology parameters (TMHTOP2) that can find functional-state-specific structural signatures and can be applied to any alpha-helical transmembrane protein. The newly developed TMHTOP2 topology parameters were able to identify unique conformational signatures of the pre-active state of the A2A receptor compared to its inactive and active states, which can now be used to identify the pre-active states of other receptors as well. This method can be used to identify and classify structures from experiments or simulations as active, inactive, or intermediate.



P39
HIGH RESOLUTION PROTEOMICS AND GENOMICS OF CNDP1 REPEAT VARIANTS LINKED TO DIABETIC NEPHROPATHY

Subject: Qualitative modeling and simulation

Presenting Author: Nicholas Kinney, Edward Via College of Osteopathic Medicine, United States

Co-Author(s):
Parviz Shabane, Virginia Tech , United States
Arichanah Pulenthiran, Edward Via College of Osteopathic Medicine, United States
Robin Varghese, Edward Via College of Osteopathic Medicine, United States
Ramu Anandakrishnan, Edward Via College of Osteopathic Medicine, United States
Harold Garner, Gibbs Cancer Center & Research Institute, United States

Abstract:

Leucine repeat variants in carnosine dipeptidase 1 (CNDP1) are linked to diabetic nephropathy; in particular, individuals homozygous for the five-leucine (Mannheim) allele have a reduced risk of developing diabetic end-stage renal disease. We perform molecular dynamics (MD) simulations and genomic analyses of CNDP1 variants harboring four, five, six, and seven leucine residues, respectively. MD simulations of the protein product show that the N-terminal tail – which includes the leucine repeat – adopts a bound or unbound state. Tails with four or five leucine residues remain in the bound state, anchored to residues 200-220; tails with six or seven leucine residues remain in the unbound state, exposed to the solvent. The unbound state is maintained by a bridge of two hydrogen bonds between residues neighboring the leucine repeat; the bridge is not present in the bound state. Functionally important residues in each state are inferred using betweenness centrality; differences are observed for residues 200-220 and 420-440, which in turn affect the active site. Exome sequencing of 5,000 individuals is used to stratify CNDP1 genotypes by super-population (African, American, East Asian, South Asian, and European) and disease status (type-II diabetes and non-diabetes). Distributions of genotypes differ significantly across super-populations but not disease status.
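
As an illustration of the betweenness-centrality step, a short networkx sketch on an invented residue-contact graph (the residue numbers echo the regions mentioned above, but the contacts are not from the actual simulations):

import networkx as nx

# Invented residue-contact graph: nodes are residue indices, edges are
# spatial contacts that might be observed across an MD ensemble
G = nx.Graph()
G.add_edges_from([
    (25, 30), (30, 210), (25, 210),      # N-terminal tail near its anchor
    (210, 215), (215, 220),              # anchor region, residues 200-220
    (215, 430), (430, 435), (435, 440),  # route toward residues 420-440
])

# Betweenness centrality flags residues that mediate many shortest paths,
# the criterion used above to infer functionally important residues
bc = nx.betweenness_centrality(G)
for res, score in sorted(bc.items(), key=lambda kv: -kv[1])[:3]:
    print(res, round(score, 3))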



P40
Combining the Evolutionary Trace Algorithm and Covariation Metrics Yields Improved Structural Predictions

Subject: inference and pattern discovery

Presenting Author: Daniel Konecki, Baylor College of Medicine, United States

Co-Author(s):
Benu Atri, Baylor College of Medicine, United States
Jonathan Gallion, Baylor College of Medicine, United States
Angela Wilkins, Baylor College of Medicine, United States
Olivier Lichtarge, Baylor College of Medicine, United States

Abstract:

Understanding protein structure and function is vital to monitoring and controlling the activities of proteins for diagnostic and therapeutic purposes. However, many protein structures remain unsolved, and for those that are solved, the relationships between residues are often not yet known. While sequence-based covariation metrics exist to address these issues, few directly take phylogenetic information into account. Previously, we paired the Evolutionary Trace (ET) algorithm, which explicitly captures phylogenetic information, with the Mutual Information metric in order to identify evolutionarily coupled pairs of residues. This algorithm identified residues linked to allosteric signaling, confirmed by experiments, in the dopamine D2 receptor. Here we present a new implementation of the ET algorithm that scales efficiently to larger proteins and alignments. We characterize the effects of different sequence distances, phylogenetic trees, and covariation metrics on the ability of ET to predict covariation data, in addition to the information gained by traversing the phylogenetic tree. Characterization on a set of twenty-three proteins, and validation on a set of ~1,000, benchmarked all methods on the ability to predict short-range structural contacts. From this structural validation we show that, by limiting computations to specific levels of a phylogenetic tree, the new algorithm often improves accuracy, even when compared with state-of-the-art non-machine-learning covariation methods. Examining highly ranked residue pairs not in close contact reveals enrichment for epistatic interactions. In the future we will apply this method to both structural and functional predictions to guide biological experiments, as well as test its usefulness as a machine learning feature set.
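
A minimal sketch of the Mutual Information covariation metric on two alignment columns (pure Python; the ET weighting and tree traversal described above are not shown):

import math
from collections import Counter

def column_mi(col_i, col_j):
    """Mutual information between two alignment columns (lists of residues),
    the covariation metric paired with ET in the work described above."""
    n = len(col_i)
    pi, pj = Counter(col_i), Counter(col_j)
    pij = Counter(zip(col_i, col_j))
    mi = 0.0
    for (a, b), c in pij.items():
        p_ab = c / n
        mi += p_ab * math.log2(p_ab / ((pi[a] / n) * (pj[b] / n)))
    return mi

# Hypothetical alignment columns for four homologous sequences
print(column_mi(list("AAGG"), list("TTCC")))  # perfectly covarying -> 1.0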



P41
Poster Withdrawn


P42
Expanding polygenic risk scores to include automatic genotype encodings and gene-gene interactions

Subject: Machine learning

Presenting Author: Trang Le, University of Pennsylvania, United States

Co-Author(s):
Hoyt Gong, University of Pennsylvania, United States
Patryk Orzechowski, University of Pennsylvania, United States
Elisabetta Manduchi, University of Pennsylvania, United States
Jason Moore, University of Pennsylvania, United States

Abstract:

Polygenic Risk Scores (PRS) are aggregations of genetic risk factors for specific diseases and have been successfully used to identify subgroups of individuals who are more susceptible to those diseases. Most existing statistical models focus on the marginal effect of the variants on the phenotypic outcome but do not account for the effect of gene-gene interactions. Here, we propose a novel calculation of the risk score that expands beyond the marginal effect of individual variants on the phenotypic outcome, addressing important challenges in high-dimensional risk profiling. The Multilocus Risk Score (MRS) method is the first to afford automatic genotype encodings while incorporating epistatic interactions among SNPs. MRS infers alternative genotype encodings, flexibly accommodating different effect types, such as additive or co-dominant, within the same model. Moreover, by utilizing the advantages of model-based multifactor dimensionality reduction, MRS efficiently detects gene-gene interactions. On a diverse collection of datasets, MRS outperforms the standard PRS in the majority of cases, especially when interactions between genes are present. Our open-source implementation is computationally economical and can be extended to high-dimensional bioinformatics datasets; we believe MRS will open the door to more accurate and effective risk-profiling approaches that expand beyond static models of marginal effects.
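
To make the encoding idea concrete, a small NumPy sketch follows; the genotypes, weights, and the specific alternative encoding are illustrative assumptions, and the epistatic terms that MRS fits are omitted:

import numpy as np

# Minor-allele counts (0/1/2) for 5 people x 3 SNPs; values are made up
G = np.array([[0, 1, 2],
              [1, 1, 0],
              [2, 0, 1],
              [0, 2, 2],
              [1, 0, 0]])

# Two of the per-SNP encodings a method like MRS can choose between:
additive = G                          # effect grows with each allele copy
heterozygous = (G == 1).astype(int)   # one alternative, non-additive encoding

# A risk score is then a weighted sum of encoded genotypes; here the weights
# are random placeholders standing in for effects fit on training data
w = np.random.default_rng(0).normal(size=3)
print("additive scores:    ", additive @ w)
print("heterozygous scores:", heterozygous @ w)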



P43
The impact of undesired technical variability on large-scale data compendia

Subject: Simulation and numeric computing

Presenting Author: Alexandra Lee, Genomics and Computational Biology Graduate Program, University of Pennsylvania, Philadelphia, PA, USA; Department of Systems Pharmacology and Translational Therapeutics, University of Pennsylvania, United States

Co-Author(s):
YoSon Park, University of Pennsylvania, United States
Georgia Doing, Geisel School of Medicine, Dartmouth, United States
Deborah Hogan, Geisel School of Medicine, Dartmouth, United States
Casey Greene, University of Pennsylvania, Philadelphia, United States

Abstract:

Motivation: In the last two decades, scientists working in different labs have assayed gene expression from millions of samples. These experiments can be combined into a compendium in order to gain a systematic understanding of biological processes. However, combining different experiments introduces technical variance, which could distort biological signals in the data, leading to misinterpretation. As the scale of these compendia increases, it becomes crucial to evaluate how integrating multiple experiments affects our ability to detect biological patterns.

Objective: To determine the extent to which underlying biological signals are masked by technical artifacts, via simulation of a large compendium.

Method: We used a generative multi-layer neural network to simulate a compendium of P. aeruginosa gene expression experiments. We performed pairwise comparisons of a simulated compendium containing one experiment versus simulated compendia containing varying numbers of experiments, up to a maximum of 6,000 experiments, using multiple assessments.

Results: We found that it was difficult to detect the simulated signals of interest in a compendium containing 2 to 100 experiments unless we applied batch correction. Interestingly, as the number of experiments increased, it became easier to detect the simulated signals without correction. Furthermore, when we applied batch correction, we reduced our power to detect the signal of interest.

Conclusion: When combining a few experiments, it is best to perform batch correction. However, as the number of experiments increases, batch correction becomes unnecessary and indeed harms our ability to extract biological signals.



P44
Poster Withdrawn


P45
PyFBA: from Genomics to Metabolomics

Subject: Simulation and numeric computing

Presenting Author: Shane Levi, San Diego State University, United States


Abstract:

Flux-Balance Analysis (FBA) is a mathematical approach for modeling metabolic networks and measuring the flow of metabolites through them. Using sequenced and annotated genomic data, it is possible to reconstruct an organism’s complete metabolic map and simulate its phenotypic response to imposed environmental constraints. PyFBA is an open-source Python package designed to generate, gap-fill, and test these metabolic models from functionally annotated genomes.
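
PyFBA's own API is not shown here; as a minimal illustration of the optimization FBA performs, the following SciPy sketch maximizes biomass flux on an invented three-reaction network subject to steady-state and bound constraints:

import numpy as np
from scipy.optimize import linprog

# Toy network: A -> B -> biomass. Rows are metabolites A and B; columns are
# reactions [uptake of A, A -> B, B -> biomass]
S = np.array([[ 1, -1,  0],
              [ 0,  1, -1]])
bounds = [(0, 10), (0, 1000), (0, 1000)]   # uptake of A capped at 10 units

# Maximize flux through the biomass reaction subject to steady state
# (S v = 0); linprog minimizes, so the objective is negated
c = np.array([0, 0, -1])
res = linprog(c, A_eq=S, b_eq=np.zeros(2), bounds=bounds, method="highs")
print("biomass flux:", res.x[2])   # 10.0: limited by the uptake bound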



P46
Poster Withdrawn


P47
Effective Targeted Drug Prediction for Cancer Based on Genetic Mutations

Subject: Machine learning

Presenting Author: Darsh Mandera, Jesuit High School, United States

Co-Author(s):
Anna Ritz, Reed College, United States

Abstract:

Predicting the response of a specific cancer to a particular drug, even when its genetic mutations are known, remains a huge challenge in modern oncology and precision medicine. Today, prescription of a drug for a cancer patient is based on a doctor’s analysis of various articles and previous clinical trials; it is an extremely time-consuming process. A novel general-purpose machine learning classifier has been designed and implemented to overcome this challenge for a carcinogenic gene mutation in any cancer. The Breast Invasive Carcinoma dataset from The Cancer Genome Atlas (TCGA) was used in the machine learning model. Feature selection was undertaken using the ExtraTreeClassifier algorithm. Each patient’s data was wrangled to create a feature vector consisting of HUGO symbol and variant type. Labels were created consisting of patient identity and drug response. K-Fold, Decision Tree, Random Forest and Ensemble Learning classifiers were used to predict the best drugs. Results show that Ensemble Learning yielded a prediction accuracy of 66% on the test set in predicting the best drug. To validate that the model is general-purpose, Lung Adenocarcinoma (LUAD) and Colorectal Adenocarcinoma (COADREAD) data from TCGA were trained and tested, yielding prediction accuracies of 55% and 50%, and 66% and 66%, respectively. The resulting accuracy indicates a direct correlation between prediction accuracy and cancer data size. More importantly, the results for LUAD and COADREAD show that the implemented model works on any cancer type. This novel method will offer oncologists significant time savings compared to their current approach of extensive background research, and offers personalized care for cancer patients.



P48
Identifying Protein-Metabolite Networks associated with COPD Phenotypes

Subject: Networking

Presenting Author: Emily Mastej, University of Colorado Anschutz Medical Campus, United States

Co-Author(s):
Lucas Gillenwater, National Jewish Health, United States
Yonghua Zhuang, University of Colorado Anschutz, United States
Katherine Pratte, National Jewish Health, United States
Russell Bowler, National Jewish Health, United States
Katerina Kechris, University of Colorado Anschutz, United States

Abstract:

Chronic obstructive pulmonary disease (COPD), a progressive lung disease, is associated with airflow obstruction in the lungs, making it difficult for patients to breathe. Although COPD occurs predominantly in smokers, 75% of smokers do not develop COPD. Therefore, there are still deficits in our understanding of how the disease develops. To gain a deeper understanding of COPD progression, we identified protein–metabolite networks associated with lung function and emphysema. While efforts have been made to use separate omics data to construct networks in parallel and then combine the networks, the best method for combining the individual analyses is unclear. To overcome this issue, we used SmCCNet, a recently developed tool based on sparse multiple canonical correlation analysis, to integrate proteomic and metabolomic data simultaneously. 1,317 protein biomarkers (SOMAscan, SomaLogic) and 995 metabolite features (Metabolon Global Metabolomics Panel) were obtained from blood samples of 1,010 participants in the COPDGene study. We found a network consisting of 13 proteins and 7 metabolites that had a -0.34 correlation (p-value < 0.001) with forced expiratory volume in one second (FEV1). Troponin T and phosphocholine were hubs. Another network, consisting of 13 proteins and 10 metabolites, had a -0.27 correlation (p-value < 0.001) with emphysema. Adiponectin, troponin T, and neurotrophin-3 were hubs.

Canonical correlation-based analysis can be used for network discovery by integrating multiple omics data with a quantitative disease phenotype. Different networks had different phenotypic correlations, which could assist in a deeper understanding of COPD development and further subclassification of COPD phenotypes.
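
A minimal sketch of the canonical-correlation idea, using scikit-learn's standard (non-sparse) CCA as a stand-in for the sparse multiple CCA in SmCCNet; the data are synthetic:

import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
n = 200
latent = rng.normal(size=n)   # shared signal (e.g., an FEV1-linked axis)
proteins = np.outer(latent, rng.normal(size=10)) + rng.normal(size=(n, 10))
metabolites = np.outer(latent, rng.normal(size=6)) + rng.normal(size=(n, 6))

# CCA finds weighted combinations of proteins and metabolites that are
# maximally correlated; the weights define the candidate protein-metabolite
# subnetwork (SmCCNet adds sparsity and a phenotype to this idea)
cca = CCA(n_components=1)
p_scores, m_scores = cca.fit_transform(proteins, metabolites)
print("canonical correlation:",
      np.corrcoef(p_scores[:, 0], m_scores[:, 0])[0, 1])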



P49
Identification of cell signaling pathways based on biochemical reaction kinetics repositories

Subject: inference and pattern discovery

Presenting Author: Gustavo Matos, University of Sao Paulo, Brazil

Co-Author(s):
Hugo Armelin, Butantan Institute, Brazil
Marcelo Reis, Butantan Institute, Brazil

Abstract:

Cell signaling pathways are important mechanisms for the orchestration of many cell functions. It is possible to create computational dynamic models of these signaling pathways using experimental data obtained by probing one or more chemical species of a given pathway. To design such models, it is necessary to determine the set of chemical reactions that are relevant in the inferred pathway. Recently, a method was introduced to systematically propose models by adding reactions to a kernel model. In this approach, with the definition of a score metric for models, it was possible to frame the identification of signaling pathways as a feature selection problem, in which the set of features is the set of all possible chemical reactions. However, this method presented some shortcomings that impair the selected model; among them were the use of naive search heuristics and unsatisfactory penalization of overfitted models. Therefore, we propose a new methodology that also tackles the identification of signaling pathways as a feature selection problem, relying on a more general search algorithm and on a score metric that uses Bayesian inference to estimate the marginal likelihood of the experimental data given a candidate model. Initial results with toy models show that a Python-based implementation of the chosen score metric indeed penalizes overly complex models, thus constraining the search space. The next step in this research is to use this novel methodology in the identification of mitogenic pathways for our case study, the mouse Y1 tumor cell line.



P50
Using Ramp Sequences to Identify Causes of Disease Association

Subject: inference and pattern discovery

Presenting Author: Lauren McKinnon, Brigham Young University, United States

Co-Author(s):
Justin Miller, Brigham Young University, United States
Josue Gonzalez, Brigham Young University, United States
Gage Black, Brigham Young University, United States
Elizabeth Vance, Brigham Young University, United States
Jace Webster, Brigham Young University, United States
John Kauwe, Brigham Young University, United States
Perry Ridge, Brigham Young University, United States

Abstract:

Ramp sequences are slowly translated regions at the beginnings of gene sequences that regulate translation and affect overall resource utilization. As genome-wide association studies identify an ever-increasing number of genetic variants lacking known functional implications, we considered how some of these variants might affect ramp sequences. Using 349 variants that are either associated with Alzheimer’s disease or in linkage disequilibrium with variants associated with Alzheimer’s disease, we identified 13 variants within 12 genes that affected the coding sequences. We used ExtRamp to determine whether each mutation disrupted a ramp sequence, compared with the human reference genome, GRCh38. One of these variants, rs2405442:T>C, destroyed a ramp sequence in the Paired Immunoglobulin-Like Type 2 Receptor Alpha (PILRA) gene. A further analysis using random mutations in 1,000 randomly selected genes showed that PILRA is more susceptible to changes in ramp sequences than most genes. Despite the change from a suboptimal to an optimal codon (rs2405442:T>C), mutant expression of PILRA is significantly lower than wildtype (p-value=0.0449). These results support our prediction, indicating that the ramp sequence in PILRA increases translational efficiency. We propose that rs2405442:T>C directly affects Alzheimer's disease by destroying the ramp sequence in PILRA, which decreases overall protein levels. This type of analysis can be applied to future disease research and may aid in annotating functional implications of genome-wide associated variants.
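
A toy sketch of a ramp-sequence check in Python; real tools such as ExtRamp derive per-codon translation efficiencies from tRNA availability, whereas the efficiencies and threshold below are invented:

# Does translation efficiency at the 5' end sit below the gene's average?
efficiency = {"ATG": 0.8, "GCT": 0.9, "GCA": 0.3, "CTG": 1.0, "TTA": 0.2}

def has_ramp(cds, window=3, margin=0.8):
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    effs = [efficiency.get(c, 0.5) for c in codons]
    gene_avg = sum(effs) / len(effs)
    head_avg = sum(effs[:window]) / window
    return head_avg < margin * gene_avg   # slow start => candidate ramp

print(has_ramp("ATGGCATTAGCTCTGCTG"))   # slow codons up front -> True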



P51
Predicting progression-free interval for cancer patients based on heterogeneous combinations of high-throughput molecular data

Subject: Machine learning

Presenting Author: Nathan Mella, Brigham Young University, United States


Abstract:

Being able to accurately estimate a cancer patient's prognosis has potential to aid treatment- and disease-monitoring protocols. Each patient's tumor likely holds clues about that patient's prognosis; however, the standard approach of humans evaluating pathology slides has so far not provided sufficiently accurate prognoses, leading researchers to explore whether molecular data—combined with machine learning (ML) techniques—are key to improving prediction accuracy. Generating high-throughput molecular data is still expensive, so researchers must prioritize which types of data to collect. This is problematic because little evidence is available about which types and combinations of molecular data are most predictive. Furthermore, little is known about which ML algorithms work best with different types of molecular data.

We performed a benchmark analysis using clinical and molecular data from 1489 cancer patients. We applied 8 ML algorithms to 6 molecular data types across 10 cancer types to predict the progression-free interval (PFI) of cancer patients. We also tested combinations of up to 6 molecular data types and clinical features.

The algorithms predicted PFI with an AUROC as high as 0.80. However, these results differed considerably by molecular data type, cancer type, and ML algorithm. Half of the ML algorithms improved prediction accuracy as molecular data were added (the optimal number of data types was 3), while the remainder performed best with clinical features only. The “glmnet” algorithm was the most accurate algorithm when applied to heterogeneous data combinations. Researchers can use these benchmark results to prioritize resource allocation, potentially moving closer to translational relevance.



P52
Codon Usage Biases Have Significant Implications in Population Stratification

Subject: inference and pattern discovery

Presenting Author: Taylor Meurs, Brigham Young University, United States

Co-Author(s):
Justin Miller, Brigham Young University, United States
Matthew Hodgman, Brigham Young University, United States
John S.K. Kauwe, Brigham Young University, United States

Abstract:

Despite population-specific genetic differences having physical and disease implications, the extent to which these phenotypes relate to population stratification remains unknown. Codon usage biases play a major role in protein expression and translational efficiency. We assessed codon pairing (i.e., two codons encoding the same amino acid are located within a ribosomal window) and codon aversion (i.e., codons not present in a gene) for five superpopulations in the 1000 Genomes Project. We adapted phylogenetic algorithms to predict the population of each person in the 1000 Genomes Project using only codon aversion or codon pairing. Codon pairing alone was sufficient to evaluate population stratification in the East Asian population (n=504) with 100% accuracy and the African population (n=661) with 98.8% accuracy, with an analysis of variance of the average number of codon pairings across the five superpopulations (n=2,504) being highly significant (p-value = 2.56 x 10^-189). We propose that these phylogenetic algorithms were able to assess population stratification because of the homogeneity, possibly due to the founder effect, in the East Asian population. Although the African population is the most genetically diverse, Africans vary significantly from other populations, which enables their clustering. A broader group containing American and European populations was identified with 91.6% accuracy. Population stratification is vital to uncovering ancestry, personalizing medicine, and predicting protein expression. Since systematic differences in codon usage dynamics are not pathogenic in East Asian and African populations, they may be used for additional screening of genetic variants with unknown effects.



P53
Containerized pipeline for the identification of compound heterozygous variants in trios

Subject: System integration

Presenting Author: Dustin Miller, Brigham Young University, United States


Abstract:

In most children who are diagnosed with pediatric cancer or a structural birth defect, the underlying cause is unknown. It is likely that in many cases, inherited DNA mutations cause such diseases, but researchers have found such evidence for relatively few pediatric diseases; thus there is an urgent need to identify alternative mechanisms of disease development. We hypothesize that compound heterozygous (CH) variants play an important role in disease development; little attention has been given to these variants, in part because whole-genome phasing and annotation require the integration of specialized software and annotation files. Using datasets from the Gabriella Miller Kids First Data Resource Center, we seek to improve identification of CH variants in pediatric patients. We have created a Docker-based computational pipeline that simplifies the process of CH variant identification and have validated our pipeline using idiopathic scoliosis genotype data from 16 trios. Our pipeline encapsulates the programs used to process (GATK4), phase (Shapeit2), annotate (SnpEff), and explore variants (Gemini). We provide open-source, reproducible scripts that allow other researchers to examine our methodology in detail and apply it to their own data. Encapsulating our code within containers controls which software versions and system libraries are used, and creates a cohesive computational environment. The use of containerization technology in genome analysis is at a relatively early stage; our work helps to set a precedent for using containerized pipelines in pediatric-genome studies. In addition, our work helps to identify the impact of CH variants in pediatric disease.



P54
Pathogenic Synonymous Variants Are More Likely to Affect Codon Usage Biases than Benign Synonymous Variants

Subject: inference and pattern discovery

Presenting Author: Justin Miller, Brigham Young University, United States

Co-Author(s):
John Kauwe, Brigham Young University, United States

Abstract:

Several codon usage biases within genes directly affect translational efficiency. Ramps of slowly translated codons at initiation, pairing codons that encode the same amino acid within a ribosomal window, and complete aversion to codons lacking cognate tRNA significantly increase translational speed or decrease resource utilization. Although many mechanisms affecting codon usage biases are well-established, the effects of synonymous codon variants on disease remain largely unknown. We identified changes in codon usage dynamics caused by each of the 65,691 synonymous variants in ClinVar, including 14,923 highly supported benign or pathogenic synonymous variants (i.e., variants with multiple submitters or reviewed by an expert panel). We found that pathogenic synonymous variants are 2.4x more likely to affect any codon usage dynamic (i.e., codon aversion, codon pairing, or ramp sequences), 8.5x more likely to affect multiple codon usage dynamics, and 69.5x more likely to affect all three codon usage dynamics than benign synonymous variants. Although significant differences exist between the number of variants affecting most codon usage dynamics, changing only codon aversion or only ramp sequences was nonsignificant, indicating that disrupting only a ramp sequence or only codon aversion may not be sufficient to identify variant pathogenicity. However, a strong indicator of pathogenicity occurs when a synonymous variant affects at least two codon usage biases (p-value=5.23x10^-10) or all three codon usage biases (p-value=1.30x10^-25). We anticipate utilizing these results to improve variant annotation by prioritizing synonymous variants that are most likely to be pathogenic.



P55
Filtering, classification, and selection of new knowledge for model assembly and extension

Subject: Text Mining

Presenting Author: Natasa Miskov-Zivanov, University of Pittsburgh, United States


Abstract:

The amount of published material produced by experimental laboratories is increasing at an incredible rate, limiting the effectiveness of manually analyzing all available information, and highlighting the need for automated methods to gather and extract the vast knowledge present in the literature. Machine reading coupled with automated assembly and analysis of computational models is expected to have a great impact on understanding and efficient explanation of large complex systems. State-of-the-art machine reading methods extract, in hours, hundreds of thousands of events from the biomedical literature; however, many of the automatically extracted biomolecular interactions are incorrect or not relevant for computational modeling of a system of interest. Therefore, automated methods are necessary to filter and select accurate and useful information from the large machine reading output. We have developed several tools to efficiently filter, classify, and select the best candidate interactions for model assembly and extension. Specifically, the tools that we have built include: (1) a filtration method that utilizes existing databases to select from the extracted biochemical interactions only those with high confidence; (2) a classification method to score selected interactions with respect to an existing model; (3) several model extension methods to automatically extend and test models with respect to a set of desirable system properties. Our tools help reduce the time required for processing machine reader output by several orders of magnitude, and therefore, enable very fast iterative model assembly and analysis.



P56
HiC-Pipeline: a Kepler and Spark-based scalable workflow for normalized contact map creation

Subject: Metagenomics

Presenting Author: Sanjay Nagaraj, Baylor College of Medicine, United States


Abstract:

Hi-C and similar genome-wide assays map pairwise interactions between fragments of chromosomes that are close in three-dimensional space. Because of the nature of mapping pairwise interactions, the read depth of Hi-C datasets scales with the square of the genome size, and therefore a computational bottleneck of Hi-C analysis is the mapping of small fragment reads to generate normalized contact maps. Additionally, many methods of performing this step require frequent user input and expertise at each step of the process, which can often be confusing and laborious for the user, and require an intensive training process. Here we provide an efficient and accurate workflow model, HiC-Pipeline, that is designed with scalability in mind to process large volumes of raw Hi-C data. HiC-Pipeline is implemented using Apache Spark, a distributed data processing framework, and the Kepler framework for parallel processing and ease of access. It can additionally be used alongside widely used visualization software such as Juicebox to produce high-quality, manuscript-ready maps. We find that HiC-Pipeline outperforms state-of-the-art workflows in terms of speed while maintaining comparable accuracy, and is scalable with regard to both the size of a dataset and the number of compute nodes available.



P57
Poster Withdrawn


P58
Robust discovery of causal gene networks via measurement error estimation and correction

Subject: inference and pattern discovery

Presenting Author: Manikandan Narayanan, Indian Institute of Technology Madras, India

Co-Author(s):
Rahul Biswas, Indian Institute of Technology, IIT Madras, India
Brintha V P, Indian Institute of Technology, IIT Madras, India
Amol Dumrewal, Indian Institute of Technology, IIT Madras, India

Abstract:

Discovering causal relations among genes from observational data is a fundamental problem in systems biology, especially in humans where direct gene perturbations are unethical/infeasible. “Mediation or Mendelian-Randomization” based methods can infer causality from gene expression and matched genotype data, but expression measurement errors are prevalent (e.g., RNA-seq counts of low/moderate expressed genes) and can mislead most such methods into making wrong causal inferences (Hemani et al., PLoS Genetics 2017). We propose a two-step framework to discover causal gene networks under measurement noise. The first step predicts the magnitude (variance) of measurement errors in RNA-seq read counts of all genes when no technical replicates are available, with a machine learning model trained on gene/sample-specific features (like average expression, gene length, and other published correlates of RNA-seq technical noise) and using technical replicates from the RNA-seq Quality Control Consortium. Our framework’s second step incorporates the estimated measurement errors to correct/extend mediation-based causality methods like CIT. For instance, our newly proposed RobustCIT method conducts four regression-based statistical tests verifying a chain of conditions of causality as the original CIT, but with regression coefficients, residuals and associated P-values appropriately corrected using measurement error variances. In simulation studies, RobustCIT clearly outperforms original CIT with accuracy approaching that on error-free data, thereby holding promise for extending our framework to other mediation-based causal network learning algorithms such as Trigger and PC, and applying them to discover reliable causal gene networks underlying complex cellular behaviors. (This work is supported by MN's Wellcome Trust/DBT India Alliance Intermediate Fellowship IA/I/17/2/503323.)



P59
Reusing label functions to extract multiple types of relationships from biomedical abstracts at scale

Subject: Text Mining

Presenting Author: David Nicholson, University of Pennsylvania, United States

Co-Author(s):
Daniel Himmelstein, University of Pennsylvania, United States
Casey Greene, University of Pennsylvania, United States

Abstract:

Knowledge bases are repositories designed to store and share information. They support multiple research efforts, such as providing contextual information for biomedical entities, constructing networks, and supporting the interpretation of high-throughput analyses. Some knowledge bases are automatically constructed, but most are populated via some form of manual curation. Manual curation is time consuming and difficult to scale in the context of an increasing publication rate. The “data programming” paradigm seeks to circumvent this process by combining distant supervision approaches with simple rules and heuristics, written as labeling functions, that can be automatically applied to inputs. Unfortunately, writing useful label functions requires several nontrivial tasks, including substantial error analysis. Empirically, we found that creating an informative label function takes a few days. At that rate, populating a biomedical knowledge base with 47,031 nodes and 2,250,197 edges would require hundreds or possibly thousands of label functions. Hetionet (v1; a heterogeneous network for the biomedical sciences) was constructed using curated knowledge about nodes and edge types for diseases, disease-treating compounds, and disease-associated genes. With a subset of Hetionet as the positive control and baseline, we evaluated the extent to which label functions could be reused across edge types. We compared a distant supervision model adjusted by edge-type-specific label functions, edge-type-mismatched label functions, and all label functions. We confirmed that adding additional edge-type-specific label functions improves performance. Overall, our analysis suggests that label functions can be reused to improve the “data programming” paradigm and rapidly update knowledge bases without an extensive need for manual curation.
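
Two toy labeling functions in the style described above, for a hypothetical compound-treats-disease edge type; the patterns and label conventions are illustrative, not the study's actual functions:

import re

ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1

# Each labeling function votes on one sentence; a generative model would
# then combine the (noisy, overlapping) votes into training labels
def lf_treats_keyword(sentence):
    return POSITIVE if re.search(r"\btreat(s|ed|ment of)?\b", sentence) else ABSTAIN

def lf_negation(sentence):
    return NEGATIVE if re.search(r"\bno (effect|benefit)\b", sentence) else ABSTAIN

sentence = "Metformin is a first-line treatment of type 2 diabetes."
print([lf(sentence) for lf in (lf_treats_keyword, lf_negation)])
# [1, -1]: the keyword function fires positive, the negation function abstains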



P60
Model selection for clinical metabolomics: comparing the power of different optimization approaches for coronary artery disease diagnosis prediction.

Subject: Machine learning

Presenting Author: Alena Orlenko, University of Pennsylvania, United States

Co-Author(s):
Jason Moore, University of Pennsylvania, United States
Daniel Kofink, University Medical Center Utrecht, Netherlands
Folkert Asselberg, University Medical Center Utrecht, Netherlands

Abstract:

Automated machine learning (AutoML) is emerging as a powerful tool for enabling the automatic selection of ML methods and parameter settings for the prediction of clinical endpoints. AutoML approaches are important because it is not always clear which ML methods are optimal for a given dataset. Our AutoML tool, the Tree-based Pipeline Optimization Tool (TPOT), employs evolutionary computation to select the most optimized machine learning model from a pool of possible combinations of data preprocessors and supervised ML algorithms. Here we apply TPOT to predict angiographic diagnoses of coronary artery disease (CAD) in the Angiography and Genes Study, with the goal of identifying the role of non-obstructive CAD patients in CAD diagnostics. In addition, we provide a guideline for TPOT-based model pipeline selection based on various clinical phenotypes and high-throughput metabolic profiles. We performed a comparative analysis of TPOT-generated classification pipelines against a grid search approach, applied to two CAD phenotypic profiles: no CAD vs. non-obstructive and obstructive CAD (P1); no CAD and non-obstructive CAD vs. obstructive CAD (P2). TPOT automatically produced classification model pipelines that outperformed grid-search-optimized pipelines across multiple performance metrics, including balanced accuracy and area under the precision-recall curve, for both phenotypic profiles (balanced accuracy 0.77 for P1 and 0.78 for P2). We used the selected models to show that the phenotypic profile that distinguishes non-obstructive CAD patients from no-CAD patients is associated with higher precision and, subsequently, has a different subset of predictive features.
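
A minimal usage sketch, assuming the standard TPOT API and synthetic data in place of the metabolomics profiles (the settings shown are illustrative, not those used in the study):

from tpot import TPOTClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the metabolomics profiles
X, y = make_classification(n_samples=300, n_features=40, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Evolutionary search over preprocessor + model pipelines; small settings
# so the example runs quickly
tpot = TPOTClassifier(generations=5, population_size=20,
                      scoring="balanced_accuracy", random_state=0)
tpot.fit(X_tr, y_tr)
print("held-out score:", tpot.score(X_te, y_te))
tpot.export("best_pipeline.py")   # export the winning sklearn pipeline as code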



P61
The Utility of Polygenic Risk Scores in High-risk Pedigrees

Subject: inference and pattern discovery

Presenting Author: Madeline Page, Brigham Young University, United States

Co-Author(s):
Elizabeth Ward, Brigham Young University, United States
Justin Miller, Brigham Young University, United States
John S.K. Kauwe, Brigham Young University, United States

Abstract:

Background: Polygenic risk scores predict the overall genetic risk of developing a disease by combining the individual risk of various genome-wide associated loci. These scores are often applied independently of pedigrees, meaning that the extent of polygenic risk score heritability is unknown. Family-based studies lower the cost of identifying disease-associated variants by capitalizing on heritability to increase the power of the analyses.

Methods: We used the Utah Population Database (UPDB) to identify 19 first- or second-cousin pairs with a statistically higher incidence rate of Alzheimer's disease in their pedigrees. We calculated their individual polygenic risk scores using risk estimates from the largest genome-wide association study of Alzheimer's disease (Lambert et al., 2013). We compared these risk scores to the risk scores of cases and controls in the Alzheimer's Disease Genetics Consortium (ADGC).

Results: We failed to detect significant differences between the average polygenic risk scores for the high-risk cousin pairs and the ADGC cases and controls.

Conclusions: Although the high-risk cousin pairs contain a statistical excess of Alzheimer's disease in their pedigrees, their overall risk for Alzheimer's disease attributable to common variants is no different from that of the control group. Therefore, most of their genetic risk is likely caused by rare variants that are difficult to assess using a polygenic risk score. We propose that family-based studies should be used to identify rare variants that may be overlooked in a polygenic risk score.
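
For concreteness, a toy polygenic risk score computation follows; the SNP identifiers are real Alzheimer's-associated variants, but the weights and dosages are invented:

# Effect sizes (log odds ratios) for risk alleles; these weights are
# placeholders, not values from the GWAS cited above
effect_sizes = {"rs429358": 1.20, "rs7412": -0.47, "rs75932628": 0.66}

def polygenic_risk_score(dosages, weights):
    """dosages: SNP -> risk-allele count (0, 1, or 2) for one individual."""
    return sum(weights[snp] * dosages.get(snp, 0) for snp in weights)

person = {"rs429358": 1, "rs7412": 0, "rs75932628": 2}
print(polygenic_risk_score(person, effect_sizes))   # 1.20*1 + 0.66*2 = 2.52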



P62
Poster Withdrawn


P63
Medication Profiling through Tensor Factorization: A case study on commercial pharmacy claims

Subject: Machine learning

Presenting Author: Yubin Park, WithMe Health, Inc., United States

Co-Author(s):
Joyce Ho, Emory University, United States
Ash Damle, WithMe Health, Inc., United States
Meghan Sapia, WithMe Health, Inc., United States

Abstract:

Medication profiling is critical for 1) correctly identifying medication needs, 2) strategizing suitable medication policies and guidelines, and 3) assessing the business impact of various disease management programs. In practice, the Generic Product Identifier (GPI) or the Anatomical Therapeutic Chemical (ATC) classification is used to group different medications and summarize the numbers of claims and members and the total costs. However, these analyses are not capable of highlighting the relationships between members, medications, and costs, and thus they often fail to address population-level pharmacy strategies. We propose a novel approach to medication profiling using a non-negative tensor factorization technique. Our approach starts by constructing a 3-dimensional tensor, where the axes of the tensor represent members, medications (with GPI-4 or ATC-4), and medication cost buckets. Using a regularized non-negative tensor factorization method, we obtain 1) medication profiles described by the types and costs of medications, and 2) members' soft memberships in the estimated medication profiles. Our experimental results on commercial pharmacy claims show that our approach can provide valuable member-level insights and clues for population-level strategies compared to standard clustering techniques.



P64
Evolutionary Action as a Tool for Quantifying Differentiation Across the Primate Family Tree

Subject: inference and pattern discovery

Presenting Author: Harley Peters, Baylor College of Medicine, United States

Co-Author(s):
Nicholas Abel, Baylor College of Medicine, United States
Panagiotis Katsonis, Baylor College of Medicine, United States
Olivier Lichtarge, Baylor College of Medicine, United States

Abstract:

Selection is the driving force behind evolution. Evidence of selection on a given protein is often found by analyzing fixed differences between species orthologs as accumulated missense mutations. To this end, most methodologies make use of the ratio of nonsynonymous mutations to the ‘background rate’ of synonymous mutations, or the dN/dS ratio. A major limitation of this approach is that each nonsynonymous mutation is assumed to have the same selection coefficient; however, not all mutations are created equal. The impact of a missense mutation on a protein’s fitness can be estimated using the Evolutionary Action (EA) equation. Here, we investigate selection pressures acting across the primate lineage, using EA to quantify the divergence between the genomes of each species relative to human. We find the distribution of EA scores in each species to be largely skewed toward variants of low impact. In each species the sum of all EA scores strongly correlates with time of divergence, providing evidence that functional changes within exomic regions of the genome accumulate at a constant rate. We also find that EA correlates well with the current gold standard of dN/dS, adding a dimension of phenotypic impact to existing methodologies. We used EA score cutoffs to identify genes with high-impact changes specific to humans alone, or shared by both humans and Neandertals. GO term analysis of these genes finds significant enrichment for several pathways, including muscle development, the JNK cascade, and skin keratinization. This work demonstrates EA's application as a tool for selection pressure analysis.



P65
Assessing long-read correction and polishing strategies for genome assembly

Subject: other

Presenting Author: Brandon Pickett, Brigham Young University, United States

Co-Author(s):
Hannah Maltba, Brigham Young University, United States
John Kauwe, Brigham Young University, United States
Perry Ridge, Brigham Young University, United States

Abstract:

Modern de novo genome assembly strategies typically leverage long reads from third-generation DNA sequencing technologies. Pacific Biosciences (PacBio) Single-molecule, Real-time (SMRT) sequencing can generate two types of long reads: (a) continuous long reads (CLR) and (b) high-fidelity long reads (HiFi) using circular consensus sequencing (CCS). PacBio CLR and Oxford Nanopore Technologies (ONT) nanopore sequencing reads (NP) are routinely produced with average lengths of 20-30Kbp. In a given sequencing run, maximum read lengths can exceed 100Kbp, and in some cases, with the right library preparation, NP maximum lengths have exceeded 1Mbp. The error rate for these reads is generally 10-15%, but early reports with newer chemistries suggest error rates can drop towards 5%. PacBio CCS HiFi averages 10-15Kbp reads with an error rate of 1%. With the probable exception of HiFi reads, the high error rates from long-read technologies require a correction step to reduce the error. This is typically done before (correction) or after (polishing) assembly. Either approach can rely on read depth to correct errors and/or additional information from Illumina short reads (typically paired-end with lengths 125-250bp and error rates <1%). To what extent measures of assembly quality are affected by choice of (a) long read type, (b) short read inclusion, and (c) error correction strategy (i.e., correction, polishing, or both) remains uncertain. We explore these effects in a case study with a human genome (Ashkenazim Son HG002, Genome in a Bottle (GIAB) Consortium) with permutations of these data types and strategies.



P66
Poster Withdrawn


P67
Different Metabolic RNA Levels Exist in Alzheimer’s Disease Brains

Subject: inference and pattern discovery

Presenting Author: Karl Ringger, Brigham Young University, United States

Co-Author(s):
Justin Miller, Brigham Young University, United States
Erin Saito, Brigham Young University, United States
Benjamin Bikman, Brigham Young University, United States
Carlos Cruchaga, Washington University School of Medicine, United States
Oscar Harari, Washington University School of Medicine in St. Louis, United States
Kathie Mihindu, Washington University School of Medicine in St. Louis, United States
John Kauwe, Brigham Young University, United States

Abstract:

Alzheimer’s disease (AD) is a devastating form of dementia characterized by progressive neurodegeneration that impairs memory and behavior. Recent studies indicate that impaired insulin signaling and impaired glucose metabolism occur early in disease progression, and that AD shares much of the same etiology as diabetes. We analyzed RNA-seq data from different brain regions obtained from the Mayo Clinic, the Mount Sinai Brain Bank, and the Knight Alzheimer's Disease Research Center (Knight-ADRC). We filtered for gender-related expression differences in both AD cases and controls. We performed an analysis of variance (ANOVA) on each gene in the glycolysis (20 genes) and ketolysis (six genes) pathways and calculated both an empirical and a Bonferroni-corrected p-value for each gene. We found that glucose metabolism is impaired in brain regions relevant to memory formation in AD brains. We also show that ketolytic gene expression is impaired in AD patients. These impairments were not found in patients with progressive supranuclear palsy. These results indicate that both glycolytic and ketolytic pathways are disrupted in AD brains, and that metabolic strategies, including diets and other interventions, may be effective in treating AD.
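
A minimal SciPy sketch of the per-gene ANOVA with Bonferroni correction described above, on synthetic expression values:

import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(0)
# Synthetic expression values for one glycolysis gene across three cohorts
ad, control, psp = (rng.normal(loc=m, size=50) for m in (4.5, 5.0, 5.0))

f, p = f_oneway(ad, control, psp)
n_tests = 26   # 20 glycolysis genes + 6 ketolysis genes
print(f"p = {p:.3g}, Bonferroni-corrected p = {min(p * n_tests, 1.0):.3g}")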



P68
EFFECT OF AN AXIALLY ORIENTED ELECTRIC DIPOLE MOMENT ON THE AMYLOID-ß(25-35) AGGREGATION AND CYTOTOXICITY

Subject: Qualitative modeling and simulation

Presenting Author: Eduardo Romero, university of central florida, United States

Co-Author(s):
Florencio Hernandez, University of central florida, United States

Abstract:

Alzheimer’s disease (AD) is one of the most common neurodegenerative disorders worldwide. AD is characterized by the formation of neurofibrillary tangles and plaques. The plaques are composed of Amyloid-ß peptides, mainly fragments of 40 to 42 residues. However, the peptide fragment Aß25-35 has recently caught researchers’ attention due to its high cytotoxicity and fast aggregation. To understand the fast aggregation and high cytotoxicity of Aß25-35, we performed a theoretical study of the Aß25-35 monomer, stacked oligomers, and protofibrillar structures at the HF/STO-3G level of theory in the gas phase. Our findings show that the monomer adopts a hairpin conformation and presents an intense, fully aligned electric dipole moment perpendicular to the peptide plane. We have found that the dipole moment can promote the aggregation of Aß25-35 in two ways (stacking and protofibrillar aggregation) through a cascading dipole interaction, which increases the dipole moment magnitude as the oligomer grows. This growth acts as a self-catalytic driving force for the aggregation process. The protofibril shows a pore-like structure, with its dipole moment located at the center of the pore and aligned with the fibril axis. The dipole is more intense than in the case of the Aß1-42 protofibril, which we uncovered recently. The stabilization energies, as well as the electric dipole orientation, suggest that the fibril formation mechanism does not require reaching nucleation to achieve fast aggregation, as in the case of Aß1-42. Additionally, the protofibrillar pore-like structure and its strong dipole moment explain why Aß25-35 can form ionic channels that are more cytotoxic than those of Aß1-42.



P69
Deep learning enables in silico chemical-effect prediction

Subject: Machine learning

Presenting Author: Jana Schor, Helmholtz Centre for Environmental Research - UFZ, Germany


Abstract:

All living species are exposed to a plethora of chemical substances. In addition to food and endogenous chemicals, there are drugs and pollutants. Many chemicals are associated with the risk of developing severe diseases due to their interaction with biomolecules such as proteins or nucleic acids. Hundreds of thousands of chemicals are listed in public databases worldwide, and similarly many biomolecules are encoded in the genomes of species. Advances in high-throughput sequencing technologies in genomics and high-throughput robotic testing in toxicology provide a great source of complex data (for a fraction of chemicals) that must be integrated on a large scale toward in silico prediction of disease risks, improved chemical design, and improved risk assessment. We present our deepFPlearn approach, which uses deep learning to associate the molecular structure of chemicals with target genes involved in endocrine disruption: an interference with the production, metabolism, or action of hormones in the body that is associated with the development of many severe diseases and disorders in humans. Trained on ~7,000 chemicals for which an interaction with six target genes of interest has been measured, the program reached 92% prediction accuracy. Its application to the 700,000 ToxCast chemicals identified a plethora of additional candidates, and we use explainable AI on our model to identify the responsible (combinations of) substructures in chemical-gene interactions. With deepFPlearn we demonstrate that transforming the enormous quantity of data in genomics and toxicology into value using deep learning will pave the way toward predictive toxicology.
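
A minimal sketch of the general approach, assuming binary molecular fingerprints and a small feed-forward network in scikit-learn (the fingerprints are simulated here, and all names and dimensions are assumptions, not the authors' deepFPlearn code):

```python
# Illustrative sketch: fingerprint-based chemical-effect classification.
# Fingerprints are simulated; in practice they would be derived from
# chemical structures (e.g., SMILES strings). Labels are random stand-ins.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
n_chemicals, n_bits = 7000, 2048                     # assumed fingerprint length

X = rng.integers(0, 2, size=(n_chemicals, n_bits))   # binary fingerprints
y = rng.integers(0, 2, size=n_chemicals)             # measured interaction (0/1)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Small feed-forward network mapping structure to predicted activity.
clf = MLPClassifier(hidden_layer_sizes=(256, 64), max_iter=50)
clf.fit(X_tr, y_tr)
print("held-out accuracy:", clf.score(X_te, y_te))
```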



P70
An online end-to-end pipeline for virus phylogeography that leverages Natural Language Processing for finding host locations

Subject: Graphics and user interfaces

Presenting Author: Matthew Scotch, Arizona State University, United States

Co-Author(s):
Arjun Magge, Arizona State University, United States
Davy Weissenbacher, University of Pennsylvania, United States
Karen O'Connor, University of Pennsylvania, United States
Graciela Gonzalez, University of Pennsylvania, United States

Abstract:

To study virus spread and genome evolution, researchers often leverage DNA sequences from the NCBI GenBank database as well as corresponding metadata such as host type (e.g., human or Anas acuta) or location of the infected host (e.g., Denver, CO or Australia). However, as we have shown, location metadata is often missing or incomplete. This can make it difficult for researchers to create robust datasets for analysis. In our prior work, we demonstrated the value of incorporating sampling uncertainty into the geospatial assignment of taxa for virus phylogeography, a field which estimates the evolution and spread of pathogens. Here, sampling uncertainty relates to possible locations of the infected host that can be derived from the GenBank metadata or the full-text article linked to the record. To automate this task, we have developed a Natural Language Processing (NLP) pipeline that extracts possible locations from GenBank as well as journal articles and assigns probabilities to each location. The probabilities can then be utilized in the phylogeography software BEAST to produce models of virus spread. In this work, we describe an online portal of an end-to-end system that integrates virus DNA sequences and metadata from GenBank with probabilities from our NLP system (https://zodo.asu.edu/zoophy/ and https://zodo.asu.edu/zodo/). The portal then implements phylogeography models in BEAST and sends the results to the user in the form of trees, geospatial maps, and graphs. We make this portal available to researchers and epidemiologists studying the spread of RNA viruses.



P71
Open PBTA: Collaborative analysis of the Pediatric Brain Tumor Atlas

Subject: Data management methods and systems

Presenting Author: Joshua Shapiro, Childhood Cancer Data Lab (Alex's Lemonade Stand Foundation), United States

Co-Author(s):
The OpenPBTA Contributors Consortium, _, United States

Abstract:

Pediatric brain tumors are the leading cause of cancer-related death in children, but our ability to understand and successfully treat these devastating diseases has been challenged by a lack of large, integrated datasets. To address this limitation, the Children's Brain Tumor Tissue Consortium and the Pacific Pediatric Neuro-Oncology Consortium recently released the Pediatric Brain Tumor Atlas (PBTA) as part of the Gabriella Miller Kids First Data Resource Center. The PBTA dataset includes WGS and RNA-Seq data from nearly 1,000 tumors. Analysis of this dataset is being conducted through the OpenPBTA project, an open science initiative to comprehensively define the molecular landscape of these tumors through shared analyses and collaborative manuscript production. The current state of the analyses is continuously available to the public through GitHub at https://bit.ly/openPBTA, and we encourage contributions from community participants through pull requests. To ensure reproducibility, analyses are performed within a Docker container, with continuous integration applying each added analysis to test datasets. The corresponding manuscript is collaboratively written using the Manubot system, also hosted on GitHub and available to the public as it evolves. The OpenPBTA managing scientists include members of Alex's Lemonade Stand Foundation's Childhood Cancer Data Lab and the Children's Hospital of Philadelphia's Center for Data-Driven Discovery in Biomedicine. Through OpenPBTA, we will advance the discovery of new mechanisms contributing to the pathogenesis of pediatric brain tumors and promote a standard for open research efforts to accelerate understanding and clinical translation on behalf of patients.



P72
Technical Bias Correction of Sequencing Libraries Using Wavelet Transform Analysis and Clustering

Subject: inference and pattern discovery

Presenting Author: Rutendo Sigauke, University of Colorado, United States

Co-Author(s):
Robin Dowell, University of Colorado, United States
Jacob Stanley, University of Colorado, United States

Abstract:

Advancements in sequencing technologies have allowed researchers to investigate various biological processes. However, there are known batch effects in these datasets. Any sequencing dataset contains information from both the biological source of interest and any technical processing. These technical contributions include the choice of assay, library amplification, sequencing platform, etc., as well as which lab prepared the library. In principle, the signal associated with the technical processing confounds interpretation of the biological signal. Various methods of batch correction for genomic data exist; however, they require knowledge of possible confounders. Such sources of variation are difficult to test exhaustively, since one cannot know all possible technical differences. Here we employ a new approach, using the discrete wavelet transform (DWT), which separates the various length scales of oscillation present in the data, as an agnostic alternative for batch correction. We present wavelet transformations of two common nascent transcription library protocols (GRO-seq and PRO-seq) performed across various laboratories and cell lines as a possible solution for removing technical noise. Clusters based on the DWT suggest that there are signature noise profiles specific to library protocol and research group. Our preliminary results indicate that the DWT approach may provide a strategy for removing systematic noise in nascent RNA-sequencing data.
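
A minimal sketch of the core operation, assuming the PyWavelets package and a toy coverage signal (illustrative only; which scale to suppress is an assumption, not the authors' correction rule):

```python
# Illustrative sketch: multilevel discrete wavelet transform of a
# toy read-coverage signal, using the PyWavelets package.
import numpy as np
import pywt

rng = np.random.default_rng(0)
signal = np.sin(np.linspace(0, 8 * np.pi, 1024)) + 0.3 * rng.normal(size=1024)

# Decompose into one approximation array plus detail arrays, one per scale.
coeffs = pywt.wavedec(signal, "db4", level=5)

# Hypothetical correction: zero the finest-scale detail coefficients,
# where high-frequency, protocol-specific noise is assumed to concentrate.
coeffs[-1] = np.zeros_like(coeffs[-1])
denoised = pywt.waverec(coeffs, "db4")
print(len(coeffs), "coefficient arrays; reconstructed length:", len(denoised))
```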



P73
geneHarmony: An Interactive Web Application that Automates the Manual Process of Bacteriophage Genome Annotation

Subject: Instrumentation interfaces and data acquisition

Presenting Author: Erica Suh, Brigham Young University, United States


Abstract:

Bacteriophages are diverse, actively evolving entities that represent the majority of organisms in the biosphere. Every year, students and faculty members at universities across the nation participate in the Science Education Alliance-Phage Hunters Advancing Genomics and Evolutionary Science (SEA-PHAGES) program to identify and annotate novel bacteriophage genomes for GenBank submission. However, current annotation tools do not optimize gene prediction and therefore require hours of manual reexamination of the entire genome. Researchers must painstakingly parse through each open reading frame using visual comparisons while making sure that all regions with coding potential are accounted for. This process is often prone to error. We are creating an interactive web application called geneHarmony, which reduces these manual challenges by combining functionalities of current tools and outputting submission-ready GenBank files. geneHarmony will efficiently automate and dramatically accelerate the pace of bacteriophage genome annotation. We hope that phage hunters will then be able to spend less time annotating and more time contributing to our knowledge of the genetic diversity of bacteriophages and their biological and technological implications.



P74
Natural Language Processing (NLP) in FDA Adverse Event Reporting System (FAERS) to Improve Extraction of Rare, Severe Adverse Drug Event-Product Pairs from Unstructured Data: Stevens-Johnson Syndrome and Toxic Epidermal Necrolysis (SJS/TEN) as an Example

Subject: Text Mining

Presenting Author: Katherine Sullivan, University of Colorado, Anschutz Medical Campus, United States


Abstract:

Question: Can NLP improve detection of product-event pairs for the rare, severe adverse drug event SJS/TEN in FAERS?

Proposal: Natural Language Processing (NLP) can improve extraction of rare, severe adverse drug event-product pairs from unstructured data in the FDA Adverse Event Reporting System (FAERS), using Stevens-Johnson syndrome and toxic epidermal necrolysis (SJS/TEN) as an example.

Background: SJS/TEN are rare, severe adverse drug events. Given their rarity, it is difficult to identify the medications that trigger these reactions. In FAERS, disproportionality methods can be used to detect signals for adverse events that occur disproportionately more often than would be expected by chance; however, this method requires that data be structured, and it can miss important information available in the large volume of unstructured, narrative sections of reports.

Methods: NLP can be used to identify product-event pair signals from unstructured data in FAERS and to build upon the signal detection methods that the FDA already uses. I propose using text mining methods to evaluate unstructured data in FAERS case reports to predict rare product-event pairs for SJS/TEN.

Discussion: The FDA receives more than 2 million adverse event reports per year; case reviewers are becoming overwhelmed by the manual process of assessing reports for potential causal relationships between drug products and adverse events. It is of great interest to the FDA and other stakeholders to review reports more quickly and efficiently. NLP would address the growing need to efficiently assess unstructured data in event reports and identify medications that trigger rare, severe adverse drug events such as SJS/TEN.



P75
MetaPro: A scalable and reproducible data processing and analysis pipeline for metatranscriptomic investigation of microbial communities

Subject: Metagenomics

Presenting Author: Billy Taj, Hospital for Sick Children, Canada

Co-Author(s):
John Parkinson, Hospital for Sick Children, Canada
Xuejian Xiong, Hospital for Sick Children, Canada
Nirvana Nursimulu, University of Toronto, Canada
Jordan Ang, University of Toronto, Canada
Mobolaji Adeolu, Hospital for Sick Children, Canada

Abstract:

Background: Whole microbiome RNA-Seq (metatranscriptomics) has emerged as a powerful technology to functionally interrogate microbial communities. A key challenge is how best to process, analyze, and interpret these complex datasets. In a typical application, a single metatranscriptomic dataset may comprise tens to hundreds of millions of sequence reads. These reads must first be processed and filtered for low quality and potential contaminants before being annotated with taxonomic and functional labels and subsequently collated to generate global bacterial gene expression profiles.

Results: Here we present MetaPro, a flexible, massively scalable metatranscriptomic data analysis pipeline that is cross-platform compatible through its implementation within a Docker framework. MetaPro starts with raw sequence read input (single-end or paired-end reads) and processes it through a tiered series of filtering, assembly, and annotation steps. In addition to yielding a final list of bacterial genes and their relative expression, MetaPro delivers a taxonomic breakdown based on the consensus of complementary prediction algorithms, together with a focused breakdown of enzymes, readily visualized through the Cytoscape network visualization tool. We benchmark the performance of MetaPro against two current state-of-the-art pipelines and demonstrate improved performance and functionality.

Conclusion: MetaPro represents an effective, integrated solution for the processing and analysis of metatranscriptomic datasets. Its modular architecture allows new algorithms to be deployed as they are developed, ensuring its longevity. To aid user uptake of the pipeline, MetaPro, together with an established tutorial developed for educational purposes, is made freely available at https://github.com/ParkinsonLab/MetaPro



P76
A review on multiple sequence alignment algorithms

Subject: Optimization and search

Presenting Author: Alice Tan, White Oaks Secondary School, Canada


Abstract:

Multiple Sequence Alignment (MSA) is a technique used to compare related DNA or amino acid sequences. Assembling a multiple sequence alignment has become one of the most common tasks in sequence analysis. It can be used to determine ancestral relationships between subject sequences, and alignment programs can also portray their molecular alignments visually. In bioinformatics, multiple sequence alignment algorithms automate a process that would be time-consuming to perform manually. MSA can be conducted using different software methods, each of which provides certain benefits.

Although the protein multiple sequence alignment problem has been studied for several decades, many recent algorithms have been proposed to improve accuracy or scalability. We briefly describe existing techniques and web-based tools, and expose the potential strengths and weaknesses of the most widely used methods. The methods discussed are the dynamic programming (DP) approach, the Needleman-Wunsch method, and the Smith-Waterman method. The DP approach is advantageous when measuring smaller, specific sequences, but it is limited at reading longer sequences. In contrast, the Needleman-Wunsch method is best used for optimal global alignments, where the overall sequence must be considered. Further, the Smith-Waterman method allows for the most accurate sequence alignment but also requires sufficient hardware to run such a program.
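
To make the dynamic-programming idea concrete, here is a textbook sketch of the Needleman-Wunsch global alignment score (illustrative Python, not tied to any particular tool discussed above): each cell of the score matrix is filled from its three neighbors, and the optimal global score appears in the final cell.

```python
# Textbook Needleman-Wunsch global alignment score (illustrative).
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    n, m = len(a), len(b)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):          # aligning a prefix against gaps
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,   # align a[i] with b[j]
                          F[i - 1][j] + gap,     # gap in b
                          F[i][j - 1] + gap)     # gap in a
    return F[n][m]

print(needleman_wunsch("GATTACA", "GCATGCU"))
```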



P77
Computational Analysis of Kinesin Mutations Implicated in Hereditary Spastic Paraplegias

Subject: Qualitative modeling and simulation

Presenting Author: Shaolei Teng, Howard University, United States


Abstract:

Hereditary spastic paraplegias (HSPs) are a genetically heterogeneous collection of neurodegenerative disorders. The complex HSP forms are characterized by various neurological features, including progressive spastic weakness, urinary sphincter dysfunction, extrapyramidal signs, and intellectual disability (ID). The kinesin superfamily proteins (KIFs) are microtubule-dependent molecular motors involved in intracellular transport. Kinesins directionally transport membrane vesicles, protein complexes, and mRNAs along neurites, thus playing important roles in neuronal development and function. Recent genetic studies have identified kinesin mutations in patients with HSPs. In this study, we used computational approaches to investigate the disease-causing mutations associated with ID and HSPs in KIF1A and KIF5A. We performed homology modeling to construct structures of the kinesin microtubule-binding domain and the kinesin-tubulin complex. We applied structure-based energy calculation methods to determine the effects of missense mutations on protein stability and protein-protein interaction. The results revealed that E253K, associated with ID in KIF1A, could change the folding free energy and affect the protein stability of kinesin motor domains. We showed that HSP mutations located at the complex interface, such as A255V in KIF1A and R280C in KIF5A, can alter the binding free energy and impact the binding affinity of the kinesin-tubulin complex. Sequence-based bioinformatics predictions suggested that many of the kinesin mutations in motor domains are involved in post-translational modifications, including phosphorylation and acetylation. This computational analysis provides useful information for understanding the roles of kinesin mutations in the development of ID and HSPs.
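
In the usual convention (shown here as a generic formula, not the authors' specific scoring function), the stability and binding effects described above are expressed as changes in free energy upon mutation, with positive values indicating destabilized folding or weakened binding:

\[
\Delta\Delta G_{\mathrm{fold}} = \Delta G_{\mathrm{fold}}^{\mathrm{mutant}} - \Delta G_{\mathrm{fold}}^{\mathrm{wild\ type}},
\qquad
\Delta\Delta G_{\mathrm{bind}} = \Delta G_{\mathrm{bind}}^{\mathrm{mutant}} - \Delta G_{\mathrm{bind}}^{\mathrm{wild\ type}}
\]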



P78
A pan-cancer 3-gene signature to predict dormancy

Subject: Machine learning

Presenting Author: Ivy Tran, Rutgers University - Camden, United States

Co-Author(s):
Anchal Sharma, Rutgers Cancer Institute of New Jersey, United States
Subhajyoti De, Rutgers Cancer Institute of New Jersey, United States

Abstract:

Tumor dormancy is characterized by the dissemination of hibernating tumor cells that do not proliferate until years after apparently successful removal of a patient’s primary cancer, resulting in late relapse of the cancer. Distinguishing between the risk of early (< 8 months) and late (> 5 years) relapse in cancer patients is important for targeted treatment of the tumor. In this study, we identified 53 genes that were significantly up-regulated or down-regulated in dormant cells, from which three genes (CD300LG, OCIAD2, and VSIG4) were determined by recursive feature elimination to be the most important features for predicting tumor dormancy. Using this three-gene signature, we trained a Random Forest algorithm on a cross-validated (10-fold, repeated 3 times) dataset (n=422) randomly split into training data (75%) and test data (25%), consisting of seven different tumor types: testicular cancer, breast cancer, glioblastoma multiforme, lung cancer, colorectal cancer, kidney cancer, and melanoma. The tuned prediction model yielded 80.19% prediction accuracy using confusion matrix analysis and 82.74% prediction accuracy when using the AUC of a ROC curve as the accuracy metric. When independently testing the model on a validation set (n=44) of liver cancer samples downloaded from ICGC, confusion matrix analysis yielded 67.44% accuracy and the AUC of a ROC curve yielded 60.48% accuracy. This 3-gene signature can be useful for predicting early or late relapse of cancer in patients in clinical practice.
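
A minimal sketch of this workflow, assuming scikit-learn and synthetic data standing in for the expression matrix (hyperparameters and data handling are assumptions, not the authors' exact settings):

```python
# Illustrative sketch: recursive feature elimination down to a 3-gene
# signature, then a repeated-CV random forest. Data are simulated.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.model_selection import (RepeatedKFold, cross_val_score,
                                     train_test_split)

rng = np.random.default_rng(0)
X = rng.normal(size=(422, 53))       # expression of 53 candidate genes
y = rng.integers(0, 2, size=422)     # dormant vs. non-dormant label

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Reduce the 53 candidates to the 3 most informative features.
rfe = RFE(RandomForestClassifier(n_estimators=200, random_state=0),
          n_features_to_select=3).fit(X_tr, y_tr)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=0)
scores = cross_val_score(clf, rfe.transform(X_tr), y_tr, cv=cv)
print("CV accuracy: %.2f +/- %.2f" % (scores.mean(), scores.std()))
```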



P79
Identifying optimal mouse models for human asthma using a novel modeling approach

Subject: Machine learning

Presenting Author: Yihsuan Tsai, University of North Carolina at Chapel Hill, United States

Co-Author(s):
Lauren Donoghue, University of North Carolina, United States
Samir Kelada, University of North Carolina, United States
Joel Parker, University of North Carolina, United States

Abstract:

Asthma is a complex disease caused by both environmental and genetic factors. Many mouse models have been developed to mimic features of human asthma, mainly by exposure to allergens such as house dust mite (HDM). To date, however, no studies have evaluated how well mouse models represent human asthma using gene expression as the criterion. We addressed this data gap using a new modeling approach. Previously, we reported human consensus asthma-associated differentially expressed genes (DEGs) in airway epithelia through a meta-analysis of eight human studies. Here, we used gene expression data from the same eight studies to build prediction models, which we evaluated by AUC on one third of the data held out for validation. The final model with the highest AUC includes the expression of 52 genes. We then applied the final model to publicly available mouse datasets and some unpublished data. In most studies, we observed good separation between treated and control mouse lung gene expression based on application of the human-based prediction model. To compare different mouse models, we used similarity scores estimated by the correlation between the human meta-analysis effect sizes and the effect sizes of each individual mouse study. More than one third of mouse DEGs changed concordantly with human asthma genes, but approximately 20% of mouse DEGs changed discordantly. In summary, we evaluated a set of mouse models of asthma and identified sets of genes that are concordantly or discordantly regulated in mice vs. humans, providing insight into how these models do and do not mimic the human disease condition.
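
A minimal sketch of the similarity score, assuming per-gene effect sizes given as arrays (all values are hypothetical):

```python
# Illustrative sketch: similarity between a mouse model and human asthma
# as the correlation of per-gene effect sizes. Values are hypothetical.
import numpy as np
from scipy.stats import pearsonr

human_effect = np.array([1.2, -0.8, 0.5, 2.1, -1.4])  # human meta-analysis
mouse_effect = np.array([0.9, -0.5, 0.7, 1.6, -0.2])  # one mouse study, same genes

r, p = pearsonr(human_effect, mouse_effect)
print("similarity score r = %.2f (p = %.3f)" % (r, p))
```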



P80
Mitochondrial DNA Deletions and Copy Number in Whole Genome Sequencing (WGS) Data: Analyses of Aging and Parkinson’s Disease using Brain and Blood

Subject: inference and pattern discovery

Presenting Author: David Tyrpak, University of Southern California, United States

Co-Author(s):
Michelle Webb, University of Southern California, United States
Ivo Violich, University of Southern California, United States
Raphael Gibbs, NIA, United States
Sonja Scholz, NINDS, United States
Dena Hernandez, NIA, United States
Mark Cookson, NIA, United States
Andrew Singleton, NIA, United States
David Craig, University of Southern California, United States
Brooke Hjelm, University of Southern California, United States

Abstract:

Mitochondria have their own DNA, a 16.6 kb circular genome that is responsible for providing every nucleated cell with energy. Mitochondrial DNA exists as a polyploid feature of eukaryotic cells, and the absolute ratio of mitochondria to nuclei (mitochondrial copy number) can change in response to energy demands. Mitochondrial DNA molecules can also be affected by deletions, in which large pieces of DNA spanning several genes are missing. Our group previously reported a workflow (Splice-Break) that uses long-range PCR, next-generation sequencing, and existing bioinformatic tools to identify the breakpoints of mitochondrial deletions and quantify their abundance. We are now extending this bioinformatics approach to catalogue mitochondrial deletions from whole genome sequencing (WGS) data, which represents an advancement because it can be paired with tools that quantify mitochondrial copy number. We have used Splice-Break to identify and quantify mitochondrial deletions in both cerebellum and frontal cortex using WGS data from the North American Brain Expression Consortium (NABEC). We report that deletions are in greater abundance in the frontal cortex and that deletions significantly increase with age in both tissues. We are now extending this analysis to WGS data from Parkinson’s Disease (PD) cerebellum to evaluate whether this disease has an enrichment of mitochondrial deletions. In addition, we are utilizing fastMitoCalc on these datasets to estimate mitochondrial copy number and determine the effects of brain aging and disease; this copy number analysis will also be extended to WGS data from blood, obtained from the Parkinson’s Progression Markers Initiative (PPMI) study.



P82
Mondo Disease Ontology: harmonizing disease concepts around the world

Subject: other

Presenting Author: Nicole Vasilevsky, Oregon Health & Science University, United States

Co-Author(s):
Julie McMurry, Oregon State University, United States
Shahim Essaid, Oregon Health & Science University, United States
Nico Matentzoglu, European Bioinformatics Institute, United Kingdom
Nomi Harris, Lawrence Berkeley National Laboratory, United States
Peter Robinson, Jackson Laboratory, United States
Chris Mungall, Lawrence Berkeley National Laboratory, United States
Melissa Haendel, Oregon State University, United States

Abstract:

Standards exist for describing gene variants (e.g., HGVS), but there is no definitive standard for encoding diseases for information exchange. Existing terminologies include the NCI Thesaurus (NCIt), Disease Ontology, OMIM, SNOMED-CT, ICD, MedGen, and numerous others. However, these standards partially overlap and often conflict, making it difficult to align knowledge sources. The need to integrate information has resulted in a proliferation of mappings between disease entries in different resources, which often lack completeness, precision, and provenance. Information integration is further complicated by the inconsistent representation of such mappings.

To computationally utilize our collective knowledge sources for diagnostics and to reveal underlying mechanisms of diseases, we need to understand which terms are equivalent across different resources. This will allow the integration of associated information, such as treatments, genetics, and phenotypes. To that end, we created the Mondo Disease Ontology (https://monarch-initiative.github.io/mondo/), which provides a logic-based structure that unifies multiple disease resources. Mondo is created through a combination of a Bayesian approach to ontology merging (k-BOOM) and expert curation. Mondo provides equivalence mappings to other disease resources and annotates each mapping with strict semantics, so that precisely equivalent diseases can be distinguished from those that are merely closely related, allowing improved computational integration of associated data.

A number of projects, such as the Monarch Initiative, ClinGen, and the Gabriella Miller Kids First Data Resource, use Mondo for standard encodings of disease descriptions; such real-world applications guide continuous refinement of this ontology, which benefits the community at large.



P83
Giving credit where credit is due: How to make more meaningful connections between people and their roles, work, and impact

Subject: other

Presenting Author: Nicole Vasilevsky, Oregon Health & Science University, United States

Co-Author(s):
Matthew Brush, Oregon Health & Science University, United States
Anne Thessen, Oregon Health & Science University, United States
Marijane White, Oregon Health & Science University, United States
Karen Gutzman, Northwestern University, United States
Lisa O’Keefe, Northwestern University, United States
Kristi Holmes, Northwestern University, United States
Melissa Haendel, Oregon Health & Science University, United States

Abstract:

Traditional metrics for scholarship typically measure publication records and grants received. However, scholarly contributions can extend well beyond these traditional contributions to include activities such as algorithm or tool development, biocuration, and data analysis. In order to properly give attribution, we need improved mechanisms for recognizing and quantifying these efforts. We aim to develop a computable system to better attribute scholars for the work they do.

The Contributor Role Ontology (CRO) is a structured representation of scholarly roles and contributions. The CRO can be used in combination with research object types to develop infrastructure for understanding the scholarly ecosystem, so that we can better understand, leverage, and credit a diverse workforce and community.

The Contributor Attribution Model (CAM) provides a simple and tightly scoped data model for representing information about contributions made to research-related artifacts, for example when, why, and how a curator contributed to a gene annotation record. This core model is intended to be used as a modular component of broader data models that support data collection and curation systems, to facilitate reliable and consistent exchange of computable contribution metadata. Additional components of the CAM specification support implementation of the model, data collection, and ontology-based query and analysis of contribution metadata.

Beyond this technical approach, we need to address this challenge from a cultural perspective, and we welcome community involvement. We encourage stakeholders from various communities and roles to get involved, provide perspective, make feature requests, and help shape the future of academic credit and incentives.



P84
Leveraging Familial-based Relationships for Rare Variant Discovery

Subject: inference and pattern discovery

Presenting Author: Liz Ward, Brigham Young University, United States

Co-Author(s):
Justin Miller, Brigham Young University, United States
Lyndsay Staley, Brigham Young University, United States
John Kauwe, Brigham Young University, United States

Abstract:

Although genetic factors account for about half of Alzheimer’s disease (AD) risk, known genetic variants account for only a quarter of that heritability. In many cases, common population-based genome-wide association (GWA) studies lack sufficient power to identify rare disease-associated variants. Since the remaining heritability is likely caused by rare genetic variants, family-based approaches may effectively identify additional risk factors that GWA studies miss. We developed a novel approach to identify rare AD variants from whole exome sequencing (WES) data. Using the Utah Population Database (UPDB), we identified 19 families with a statistical excess of AD. Next, we collected WES data from a first- or second-cousin pair within each family, where both individuals were previously diagnosed with AD. We then identified the concordant variants with a minor allele frequency (MAF) below 0.1% in the 19 cousin pairs. We input this information into Ingenuity Variant Analysis and filtered using Common Variant, Predicted Deleterious, Genetic Analysis, and Biological Context filters modified to fit our specific study design. To further aid in variant prioritization, we analyzed each variant for associations with known AD risk factors. Using this pipeline, we identified 12 rare variants in eight genes that we propose should be prioritized for additional analyses. This analysis pipeline has allowed us to extract rare variants associated with AD from high-risk pedigrees. We anticipate that future rare variant analyses can utilize our pipeline to identify disease-specific genetic risk factors.
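
A minimal sketch of the concordance-and-frequency filter, assuming per-cousin variant tables with hypothetical column names (not the authors' actual pipeline):

```python
# Illustrative sketch: keep variants shared by both cousins in a pair
# and rarer than 0.1% minor allele frequency. Column names and values
# are hypothetical.
import pandas as pd

cousin_a = pd.DataFrame({"variant_id": ["v1", "v2", "v3"],
                         "maf": [0.0005, 0.02, 0.0002]})
cousin_b = pd.DataFrame({"variant_id": ["v1", "v3", "v4"],
                         "maf": [0.0005, 0.0002, 0.01]})

# Concordant variants: present in both affected cousins.
shared = cousin_a.merge(cousin_b[["variant_id"]], on="variant_id")

# Rare variants: minor allele frequency below 0.1%.
rare_shared = shared[shared["maf"] < 0.001]
print(rare_shared)
```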



P85
Identifying Functional Relationships Using Protein Coevolution

Subject: inference and pattern discovery

Presenting Author: Katrisa Ward, Brigham Young University, United States

Co-Author(s):
John Kauwe, Brigham Young University, United States
Justin Miller, Brigham Young University, United States
Brandon Pickett, Brigham Young University, United States

Abstract:

Mutual information measures the dependence of two random variables, that is, the degree to which the value of one variable can be predicted by knowing the value of the other. Our algorithm calculates mutual information between two translated proteins by comparing each residue on one protein to each residue on the other protein. First, two multiple sequence alignments containing orthologs from at least 100 species are read into memory. We then calculate the Hamming distance between each pair of species, and species that are not sufficiently dissimilar from every other species are removed to prevent phylogenetic bias. Non-relevant positions (i.e., positions that are absent in the species of interest) are removed from the alignment. Mutual information is then calculated between each pair of amino acid residues based on the frequencies of amino acids at each position across all species. We compared the mutual information calculated between proteins using our algorithm and another, more established algorithm. The relative differences in mutual information were not statistically significant, indicating that both algorithms identified the same functional interactions. However, our algorithm was significantly faster and facilitates protein-protein comparisons. High mutual information scores between genes may indicate coevolution and identify functional relationships. We performed 35 million pairwise protein comparisons and developed a web interface to display the top comparisons per gene (118,024 total comparisons). On this website, users can also upload their own multiple sequence alignments and identify the top coevolving residues. We anticipate that this web server will facilitate the identification of coevolving residues with direct implications in disease, evolution, and protein function.
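
A minimal sketch of the per-position calculation, assuming two aligned columns of residues given as strings, one character per species (illustrative, not the authors' implementation):

```python
# Illustrative sketch: mutual information between two alignment columns,
# I(X;Y) = sum over (x, y) of p(x,y) * log2( p(x,y) / (p(x) * p(y)) ).
from collections import Counter
from math import log2

def column_mutual_information(col_x, col_y):
    n = len(col_x)
    joint = Counter(zip(col_x, col_y))      # joint residue-pair counts
    px, py = Counter(col_x), Counter(col_y) # marginal residue counts
    mi = 0.0
    for (x, y), c in joint.items():
        pxy = c / n
        mi += pxy * log2(pxy / ((px[x] / n) * (py[y] / n)))
    return mi

# Residues observed at one position in each protein, across six species.
print(column_mutual_information("AAALLV", "TTTKKI"))
```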



P86
Poster Withdrawn


P87
Optimizing storage and querying of massive biological datasets of a tabular nature

Subject: Data management methods and systems

Presenting Author: James Wengler, Brigham Young University, United States


Abstract:

As acquiring biomedical data becomes faster and cheaper, biologists increasingly confront tabular datasets that have as many as millions of rows or columns. The task of subsetting such large datasets (distinguishing relevant from irrelevant data) can be frustrating and time-consuming. Existing methods for querying tabular data might perform well on files with many rows or many columns but are often inefficient on datasets that are large in both dimensions. We performed a systematic comparison of 10 techniques for storing and querying tabular datasets. We simulated data that resemble datasets that biologists encounter now or might encounter in future studies, especially meta-analyses that combine datasets. We evaluated a variety of techniques for storing the data, including text- and binary-based formats. We evaluated existing methodologies for querying the data, including delimiter splitting, Python and R libraries, regular expressions, the awk and cut UNIX tools, and string-based indexing. As an alternative, we developed a fixed-width storage-and-query approach that outperformed all other approaches. This method allows us to calculate the exact positions of data values within a file without iterating through each row; furthermore, the files are text-based and thus human-readable. Subsequently, we improved the performance of our approach using memory mapping and compiled C++. We also evaluated the trade-off between storage space and execution speed and developed a compression scheme that mitigates the effects of whitespace characters while supporting speeds that match or exceed alternative approaches. This methodology promises to help biologists query massive tabular datasets in a fast, scalable manner.
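
A minimal sketch of the positional-arithmetic idea, assuming a hypothetical fixed-width file layout (not the authors' implementation): because every row and column has a known byte width, a cell's offset can be computed directly and read with a single seek, with no row iteration.

```python
# Illustrative sketch: random access into a fixed-width text table.
# Column widths and the demo file are hypothetical.
COL_WIDTHS = [8, 12, 6]          # fixed width of each column, in bytes
ROW_WIDTH = sum(COL_WIDTHS) + 1  # +1 for the newline terminator

def read_cell(path, row, col):
    # Offset of the cell = full rows above it + columns to its left.
    offset = row * ROW_WIDTH + sum(COL_WIDTHS[:col])
    with open(path, "rb") as f:
        f.seek(offset)           # jump directly to the cell
        return f.read(COL_WIDTHS[col]).decode().strip()

# Example: write two space-padded rows, then fetch row 1, column 2.
with open("demo.fwf", "w", newline="\n") as f:
    f.write("gene_a".ljust(8) + "sample_01".ljust(12) + "3.14".ljust(6) + "\n")
    f.write("gene_b".ljust(8) + "sample_02".ljust(12) + "2.72".ljust(6) + "\n")
print(read_cell("demo.fwf", 1, 2))   # -> "2.72"
```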



P88
Biotherapeutic Protein Immunogenicity Risk Assessment with TCPro

Subject: Qualitative modeling and simulation

Presenting Author: Osman Yogurtcu, FDA, United States

Co-Author(s):
Joseph McGill, FDA, United States
Zuben Sauna, FDA, United States
Million Tegenge, FDA, United States
Hong Yang, FDA, United States

Abstract:

Most immune responses to biotherapeutic proteins involve the development of anti-drug antibodies (ADAs). New drugs must undergo immunogenicity assessments to identify potential risks at early stages of the drug development process. This immune response is T cell-dependent. Ex vivo assays that monitor T cell proliferation are often used to assess immunogenicity risk, but such assays can be expensive and time-consuming to carry out. Furthermore, T cell proliferation requires presentation of the immunogenic epitope by major histocompatibility complex class II (MHCII) proteins on antigen-presenting cells. The MHC proteins are the most diverse in the human genome; thus, obtaining cells from subjects that reflect the distribution of the different MHCII proteins in the human population can be challenging. The allelic frequencies of MHCII proteins differ among subpopulations, and understanding the potential immunogenicity risks would thus require generating datasets for specific subpopulations, involving complex subject recruitment. We developed TCPro, a computational tool that predicts the temporal dynamics of T cell counts in common ex vivo assays for drug immunogenicity. Using TCPro, we can test virtual pools of subjects based on MHCII frequencies and estimate immunogenicity risks for different populations. It also provides rapid and inexpensive initial screens for new biotherapeutics and can be used to determine the potential immunogenicity risk of new sequences introduced while bioengineering proteins. We validated TCPro using an experimental immunogenicity dataset, predicting the population-based immunogenicity risk of 15 protein-based biotherapeutics. Immunogenicity rankings generated using TCPro are consistent with the reported clinical experience with these therapeutics.



P89
Poster Withdrawn


P90
Structuring and crawling distributed biomedical metadata using schema.org standard

Subject: Data management methods and systems

Presenting Author: Xinghua Zhou, The Scripps Research Institute, United States

Co-Author(s):
Marco Cano, The Scripps Research Institute, United States
Jiwen Xin, The Scripps Research Institute, United States
Sebastien Lelong, The Scripps Research Institute, United States
Laura Hughes, The Scripps Research Institute, United States
Andrew Su, The Scripps Research Institute, United States
Chunlei Wu, The Scripps Research Institute, United States

Abstract:

Embedding structured metadata to provide explicit clues about the meaning of a web page has been increasingly embraced to facilitate the discovery of information. While a search engine can use structured data to enable special search result features and enhancements, researchers can navigate knowledge databases without learning their distinctive access interfaces.

When websites already have structured metadata embedded, extracting all the metadata commonly requires only crawling: systematically browsing the websites to index information of interest. We can find all pages to browse in a website through URL patterns, by following links from a starting page, or by iterating through a sitemap, a list of pages provided by the website.

There are also many websites whose content is not described by structured metadata. For these sites, web scraping, which typically harvests data from HTML elements, lets us handle the pages and build the structured metadata. Furthermore, we can even inject the metadata and serve the sites as if structured metadata were present from the view of other users and search engines.

Once we have structured metadata, previously exclusively formatted data is standardized to be interoperable for data discovery and analysis. Data can be aggregated according to certain criteria or indexed for full-text searching across databases.

We crawled over 60,000 datasets hosted on Zenodo, Harvard Dataverse, NCBI GEO, and OmicsDI, and extracted their metadata in the schema.org standard. What we can conclude from the existing collection provides valuable insights for our effort to design future schemas to describe biological discoveries.
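
A minimal sketch of the extraction step, assuming the requests and BeautifulSoup libraries and a page that embeds schema.org metadata as JSON-LD (the URL is a placeholder, not one of the repositories named above):

```python
# Illustrative sketch: pull schema.org JSON-LD metadata out of a page.
import json
import requests
from bs4 import BeautifulSoup

def extract_jsonld(url):
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    records = []
    # schema.org metadata is commonly embedded in JSON-LD script tags.
    for tag in soup.find_all("script", type="application/ld+json"):
        try:
            records.append(json.loads(tag.string or ""))
        except json.JSONDecodeError:
            continue  # skip malformed blocks
    return records

for record in extract_jsonld("https://example.org/dataset/123"):
    print(record.get("@type"), "-", record.get("name"))
```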



P91
Cellular life from the three domains and viruses are transcriptionally active in a high-salt desert community

Subject: Metagenomics

Presenting Author: Gherman Uritskiy, Johns Hopkins, United States

Co-Author(s):
James Taylor, Johns Hopkins, United States
Jocelyne DiRuggiero, Johns Hopkins, United States

Abstract:

Microbial communities play essential roles in the biosphere, and understanding the mechanisms underlying their functional adaptations to environmental conditions is critical for predicting their behavior. Compared to metagenomic approaches, which offer a limited view of a microbial community’s functional potential, metatranscriptomic analyses allow interrogation of a microbial community's adaptations to specific conditions. This aspect of microbiome function has not been well characterized in extreme environments. To address this knowledge gap, and to build a general framework relating the genomic and transcriptomic components of a model extremophile microbiome, we performed a meta-omic survey of extremophile communities of the Atacama Desert. To analyze this dataset, we deployed our newly developed software tools metaWRAP, for draft genome discovery, and SnapT, for metatranscriptomic sRNA annotation. We found that the major phyla of this halophilic community have very different levels of total transcriptional activity and activate different metabolic pathways in their transcriptomes. Most surprisingly, we report that an alga, the only eukaryote found in this system, was by far the most active community member, producing the vast majority of the community’s photosynthetic transcripts despite being outnumbered by members of the phylum Cyanobacteria. The divergence in the transcriptional landscapes of these segregated communities, compared to their relatively stable metagenomic functional potential, suggests that the microbiomes in each nodule undergo unique transcriptional adjustments to adapt to local conditions. We also report that sRNAs actively repress target transcript levels, and we find several previously unknown halophilic viruses, many of which exhibit transcriptional activity indicative of host infection.