13th Annual Rocky Mountain Bioinformatics Conference

POSTER PRESENTATIONS

P01

How to Estimate the Rate of Evolutionary Transpositions?

Subject: Graph Theory

Presenting Author: Max Alekseyev, George Washington University

Author(s):

Nikita Alexeev, George Washington University, United States
Rustem Aidagulov, Moscow State University, Russian Federation

Abstract:
Genome rearrangements are evolutionary events that shuffle genomic architectures. The most frequent genome rearrangements are reversals, translocations, fusions, and fissions.
While more complex rearrangements such as transpositions also exist, they are rarely observed and are believed to constitute only a small fraction of all rearrangements happening in the course of evolution.
The analysis of transpositions is further obfuscated by the intractability of the underlying computational problems. We propose a computational method for estimating the transposition rate in evolutionary scenarios between given genomes. For different pairs of mammalian genomes, our method produces consistent estimates of the transposition rate, all of which are about 0.26.

P02

A pilot exploration of the gut microbiome of Egyptian HCV patients

Subject: Metagenomics

Presenting Author: Abdelrahman Aly, Faculty of Post Graduate Studies for Advanced Sciences, Beni-Suef University, Beni-Suef, Egypt.

Author(s):

Abdelraheem Adel, Faculty of Post Graduate Studies for Advanced Sciences, Beni-Suef University, Beni-Suef, Egypt., Egypt
Ahmed Osama El-Gendy, Faculty of Pharmacy, Beni-Suef University, Beni-Suef, Egypt., Egypt
Tamer Essam, Faculty of Pharmacy, Cairo University, Cairo, Egypt, Egypt
Ramy Karam Aziz, Faculty of Pharmacy, Cairo University, Egypt

Abstract:
Hepatitis C virus (HCV) causes debilitating liver diseases, including cirrhosis and cancer, and claims 350,000 lives annually worldwide. While HCV epidemiology and infection are being deeply studied, little attention is given to interactions between HCV-induced chronic liver diseases and the human gut microbiome. As Egypt has the world’s highest incidence and prevalence of HCV infections, we launched this pilot study to monitor differences in the gut microbial community composition of Egyptian HCV patients that may affect, or result from, the patients’ liver state. To this end, we collected stool samples from six HCV patients and eight age-matched healthy individuals, and analyzed their microbiomes by high-throughput 16S rRNA gene sequencing using Illumina MiSeq. Overall, the alpha-diversity of the healthy persons’ gut microbiomes was higher than that of the HCV patients. Whereas members of phylum Bacteroidetes were more abundant in HCV patients, healthy individuals had higher abundance of Firmicutes, Actinobacteria and Proteobacteria. Phylum Tenericutes could only be identified in the healthy group. Genus-level analysis showed differential abundance of Prevotella and Faecalibacterium (higher in HCV patients) vs. Bacteroides and Dialister (healthy group), indicating that the higher abundance of Bacteroidetes in HCV patients is most likely due to Prevotella overabundance. The popular probiotic genus, Bifidobacterium, was only observed in the microbiotas of healthy individuals. This study provided a first overview of major phyla and genera differentiating HCV patients from healthy individuals. Future studies will investigate the microbiome composition and functional capabilities in more patients while tracing some potential biomarker taxa (e.g., Prevotella vs. Bifidobacterium).

P03

Using graph kernels on protein-protein interaction networks for multi-loci prioritization

Subject: Graph Theory

Presenting Author: Kelsey Anderson, University of Colorado, Denver

Author(s):

Sonia Leach, National Jewish Health, United States
Yves Moreau, University of Leuven, Belgium

Abstract:
Associating genotypes with phenotypes is a major goal of modern biology. Many complex phenotypes arise from potentially rare variants in many genes, each contributing only small effects to the disease. In these cases, the sample size required by traditional statistical approaches becomes cost-prohibitive, as they are designed to find single causal genes with large effects. Multi-loci prioritization is a systems biology approach that presupposes the importance of pathways in explaining complex genotype-phenotype associations. While different individuals may harbor private mutations in disjoint sets of genes, multi-loci prioritization attempts to identify those genes that operate in the same pathway.

Here we investigate the use of prior knowledge of gene relationships encoded in protein-protein interaction (PPI) networks for the multi-loci prioritization task. The approach relies on the assumption that genes in a given pathway will be close to one another in the PPI network. We first explore multiple definitions of the notion of network proximity between any two genes using many different graph kernels on the PPI network. We then explore methods for combining pairwise proximity scores in order to simultaneously identify a cluster of genes from diverse loci, suggestive of a shared disrupted pathway. We evaluate competing methods for both subproblems, comparing prioritization performance and computation efficiency, using known disease genes from the Online Mendelian Inheritance in Man (OMIM) database.
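The two subproblems described above (a kernel-based proximity between genes, then a way to combine pairwise scores across loci) can be illustrated with a minimal Python sketch. The abstract does not say which graph kernels or combination rules were evaluated, so everything here is an assumption made for illustration: the kernel is a truncated von Neumann diffusion kernel, the combination rule is a brute-force search for one gene per locus with the highest summed pairwise proximity, and the toy network, gene indices, and parameter values are hypothetical.

```python
from itertools import product
from math import sqrt


def diffusion_kernel(n, edges, beta=0.5, steps=4):
    """Truncated von Neumann kernel K = sum_{k=0..steps} beta^k * S^k,
    where S is the symmetrically normalized adjacency matrix.
    (An illustrative choice, not necessarily a kernel used by the authors.)"""
    adj = [[0.0] * n for _ in range(n)]
    for u, v in edges:
        adj[u][v] = adj[v][u] = 1.0
    deg = [sum(row) for row in adj]
    s = [[adj[i][j] / sqrt(deg[i] * deg[j]) if adj[i][j] else 0.0
          for j in range(n)] for i in range(n)]
    # k = 0 term: the identity matrix.
    kernel = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    power = [row[:] for row in kernel]
    for k in range(1, steps + 1):
        # power becomes S^k.
        power = [[sum(power[i][m] * s[m][j] for m in range(n))
                  for j in range(n)] for i in range(n)]
        for i in range(n):
            for j in range(n):
                kernel[i][j] += (beta ** k) * power[i][j]
    return kernel


def best_gene_set(kernel, loci):
    """Pick one candidate gene per locus so that the summed pairwise
    kernel proximity of the chosen genes is maximal (exhaustive search)."""
    def score(choice):
        return sum(kernel[a][b]
                   for i, a in enumerate(choice) for b in choice[i + 1:])
    return max(product(*loci), key=score)


# Hypothetical toy PPI network: genes 0-2 form a triangle, 3-5 a chain.
edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (4, 5)]
kernel = diffusion_kernel(6, edges)
# Three hypothetical loci, each with two candidate genes.
loci = [[0, 4], [1, 5], [2, 3]]
print(best_gene_set(kernel, loci))
```

Exhaustive search grows exponentially with the number of loci, so it only serves to make the objective concrete; a realistic implementation would need a heuristic or a relaxed formulation.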

P04

The Virome of Red Sea Brine Pool Sediments

Subject: Metagenomics

Presenting Author: Sherry Aziz, American University in Cairo

Author(s):

Mustafa Adel, American University in Cairo, Egypt
Ramy K. Aziz, Faculty of Pharmacy, Cairo University, Egypt
Rania Siam, American University in Cairo, Egypt

Abstract:
Egypt’s Red Sea brine pools are unique environments owing to their high temperature, salinity and heavy metal levels. Although the microbiomes of the brine pool sediments have been well characterized on the bacterial level, their viromes remain unexplored. Previous viral metagenomic analyses revealed tremendous diversity that needs more sampling on a global scale. Thus, we sought to determine the Red Sea brine sediment viromes and compare them to studied marine and sediment viromes. Since different viral analysis approaches lead to different results, and each has its strengths and limitations, we implemented three different bioinformatic tools: MEGAN, MG-RAST and GAAS. The combination of the PhAnToMe database with GAAS analysis reduced the number of unassigned sequences, increased the number of assigned phages from 200-250 to 850-2000, and controlled sampling bias via genome length normalization. Analysis of the viromes of 14 Red Sea brine pools and two non-brine control sediments showed a universal marine signature reflected by a dominance of different phages of Prochlorococcus, Synechococcus and several Mediterranean phages. The deepest two layers of the Red Sea (ATII-1 and ATII-2) were the most divergent, with the lowest alpha-diversity (low richness and evenness). Yet, these two metaviromes have their own signature characterized by higher abundance of gokushoviruses (e.g., Gokushovirus isolate GOM and Gokushovirinae Fen672_3) than other sections. Further comparative analysis will be performed to investigate how the extreme conditions of these brine pools have shaped their viromes, as previous studies showed that different ecosystems’ stressors result in different genome lengths.
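As a reminder of what "richness and evenness" measure, the following minimal Python sketch computes three common alpha-diversity summaries from a vector of taxon counts. The abstract does not state which indices were used, so the Shannon index and Pielou's evenness are assumptions chosen for illustration, and the counts are made up.

```python
from math import log


def alpha_diversity(counts):
    """Return (richness, Shannon index H, Pielou evenness J) for taxon counts."""
    present = [c for c in counts if c > 0]
    total = sum(present)
    richness = len(present)                      # number of observed taxa
    shannon = -sum((c / total) * log(c / total) for c in present)
    # Pielou's J rescales H by its maximum, ln(richness).
    evenness = shannon / log(richness) if richness > 1 else 0.0
    return richness, shannon, evenness


# Hypothetical counts: an even community vs. one dominated by a single taxon.
print(alpha_diversity([10, 10, 10, 10]))
print(alpha_diversity([97, 1, 1, 1, 0]))
```

A community dominated by one taxon has the same richness as an even one but a much lower Shannon index and evenness.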

P05

A genetic analysis of a complex trait in a “genetically intractable” gut microbe

Subject: Other

Presenting Author: Sena Bae, Duke University

Abstract:
Microbes mediate immune and nutrient homeostasis in the vertebrate gastrointestinal tract. The molecular basis for these host-microbe interactions is poorly understood, as many gut microbes are not amenable to molecular genetic manipulation. We combined phenotypic selection after chemical mutagenesis with population-based whole genome sequencing to identify genes that are required for motility in the firmicute Exiguobacterium, a component of the vertebrate gut microbiota that contributes to lipid uptake. We derived strong associations between the loss of motility and mutations in predicted Exiguobacterium motility genes and genes of unknown function. We confirmed the genetic linkage between the predicted causative mutations and loss of motility by identifying suppressor mutations that restored motility. These results indicate that a genetic dissection of complex traits in microbes can be readily accomplished without the need to develop molecular genetic tools.

P06

CNVScan: gene copy numbers evaluation in haploid genomes

Subject: Other

Presenting Author: Johann Beghain, Institut Pasteur

Author(s):

Anne-Claire Langlois, Pasteur Institute, France
Eric Legrand, Institut Pasteur, France
Laura Grange, Institut Pasteur, France
Nimol Khim, Institut Pasteur du Cambodge, Cambodia
Benoit Witkowski, Institut Pasteur du Cambodge, Cambodia
Valentine Duru, Institut Pasteur du Cambodge, Cambodia
Laurence Ma, Institut Pasteur, France
Christiane Bouchier, Institut Pasteur, France
Didier Menard, Institut Pasteur du Cambodge, Cambodia
Richard Paul, Institut Pasteur, France
Frederic Ariey, Institut Cochin, France

Abstract:
In eukaryotic genomes, deletion or amplification rates have been estimated to be a thousand times more frequent than single nucleotide variation [6,7]. In P. falciparum, relatively few transcription factors have been identified [8,9], and the regulation of transcription is seemingly largely influenced by gene amplification events. Thus, copy number variation (CNV) is a major mechanism enabling parasite genomes to adapt to new environmental changes.

Currently, the detection of CNVs is based on qPCR, which is significantly limited by the relatively small number of genes that can be analyzed at once. Technological advances that facilitate whole-genome sequencing such as Next Generation Sequencing (NGS) enable us to perform deeper analyses of the genomic variation. Because the characteristics of Plasmodium CNVs need special consideration in algorithms and strategies for which classical CNV detection programs are not suited, we developed a dedicated algorithm to detect CNVs across the entire exome of P. falciparum based on a custom read depth strategy through NGS data. We call this algorithm CNVScan.

We present here an analysis of CNV identification on three genes known to have different levels of amplification and which are located either in the nuclear, apicoplast or mitochondrial genomes. We show that our results correlate with the qPCR experiments usually used for identification of locus-specific amplification/deletion. We then discuss the use of such a tool for the exploration of adaptive phenomena based on whole genome data.
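The general idea behind a read-depth strategy in a haploid genome can be sketched in a few lines of Python. This is not the CNVScan algorithm, whose details are not given in the abstract; it is a generic illustration that assumes the copy number of a gene can be approximated by its mean read depth divided by the median of per-gene mean depths. The gene names and depth values are hypothetical, and a real analysis would also need to treat nuclear, apicoplast and mitochondrial genomes separately because their baseline coverage differs.

```python
from statistics import mean, median


def estimate_copy_numbers(gene_depths):
    """Map each gene to mean depth / median of per-gene mean depths.
    In a haploid genome a ratio near 1 suggests a single copy."""
    gene_means = {gene: mean(depths) for gene, depths in gene_depths.items()}
    baseline = median(gene_means.values())
    return {gene: m / baseline for gene, m in gene_means.items()}


# Hypothetical per-base read depths for five genes.
depths = {
    "geneA": [50, 52, 48],
    "geneB": [100, 98, 102],   # roughly twice the typical depth
    "geneC": [49, 51, 50],
    "geneD": [50, 50, 50],
    "geneE": [51, 49, 50],
}
for gene, ratio in estimate_copy_numbers(depths).items():
    print(gene, round(ratio, 2))
```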

P07

GraDe-SVM: Graph-Diffused Classification for the Analysis of Somatic Mutations in Cancer

Subject: Machine learning, inference and pattern discovery

Presenting Author: Morteza Chalabi, Southern University of Denmark (SDU)

Author(s):

Fabio Vandin, University of Padova, Italy

Abstract:
Recent advances in next generation sequencing have allowed the collection of somatic mutations from a large number of patients from several cancer types. One of the main challenges in analyzing such large datasets is the identification of the few driver genes having mutations that are related to the disease. This task is complicated by the fact that genes and mutations do not act in isolation, but are related by a complex interaction network. A related but unexplored challenge is the classification of the cancer type using somatic mutations, which may be relevant for early cancer detection from circulating tumor DNA or circulating tumor cells.

We propose a graph-diffused SVM (GraDe-SVM) approach for cancer type classification using somatic mutations. Our approach effectively integrates somatic mutations and information from a large-scale interaction network using a graph diffusion process. We tested our method on a cohort of 3424 cancer samples from 11 cancer types from The Cancer Genome Atlas (TCGA) project, using both single nucleotide variants (SNVs) and copy number variants (CNVs). Our results show that our method improves the classification of the cancer type using somatic mutations compared to approaches that ignore the interaction network or consider the network but do not use the diffusion process. Moreover, our approach highlights a number of known driver genes and genes with mutations that distinguish different cancer types.
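To make the notion of diffusing mutations over an interaction network concrete, here is a minimal Python sketch. The abstract does not specify the diffusion process used in GraDe-SVM, so a random walk with restart is assumed here purely for illustration; the network, the restart probability, and the iteration count are hypothetical. The resulting per-gene scores would replace the binary mutation profile as the feature vector passed to a standard SVM implementation, which is not shown.

```python
def diffuse_mutations(mutated, edges, n_genes, restart=0.5, iterations=50):
    """Spread a patient's binary mutation profile over a gene network
    with a random walk with restart; returns one score per gene.
    Assumes at least one mutated gene."""
    neighbors = [[] for _ in range(n_genes)]
    for u, v in edges:
        neighbors[u].append(v)
        neighbors[v].append(u)
    start = [1.0 if g in mutated else 0.0 for g in range(n_genes)]
    total = sum(start)
    start = [x / total for x in start]          # normalize to a distribution
    scores = start[:]
    for _ in range(iterations):
        spread = [0.0] * n_genes
        for g, nbrs in enumerate(neighbors):
            if not nbrs:                        # isolated gene keeps its mass
                spread[g] += scores[g]
                continue
            share = scores[g] / len(nbrs)
            for nb in nbrs:
                spread[nb] += share
        # Mix the propagated signal with the original mutation profile.
        scores = [(1 - restart) * s + restart * s0
                  for s, s0 in zip(spread, start)]
    return scores


# Hypothetical chain network 0-1-2-3 with a single mutation in gene 0.
print(diffuse_mutations({0}, [(0, 1), (1, 2), (2, 3)], 4))
```

Genes that are not mutated but sit close to mutated genes receive nonzero scores, which is how network information enters the classifier's input.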

P08

Charting the human genome’s regulatory landscape with transcription factor binding site predictions

Subject: Machine learning, inference and pattern discovery

Presenting Author: Xi Chen, New York University

Author(s):

Richard Bonneau, New York University/Simons Foundation, United States

Abstract:
Transcription factor (TF) binding is an essential step in the regulation of gene expression. Differential binding of multiple TFs at key cis-regulatory loci allows the specification of progenitor cells into various cell types, tissues and organs. ChIP-Seq is a technique that can reveal genome-wide patterns of TF binding. However, it lacks the scalability to cover the range of factors, cell types and dynamic conditions a multicellular eukaryotic organism sees. So charting the regulatory landscape spanning multi-lineage differentiation requires computational methods to predict TF binding sites (TFBS) in an efficient and scalable manner.

We develop a method to predict binding sites for over 800 human TFs using a rich collection of DNA binding motifs. We integrate genomic features, including chromatin accessibility, motif scores, TF footprints, CpG/GC content, evolutionary conservation and the proximity of TF motifs to transcription start sites in sparse logistic regression classifiers. We label candidate motif sites with ChIP-Seq data and apply correlation-based filter and L1 regularization to select relevant features for each trained TF. Our models perform favorably in comparison to the current best TFBS prediction methods. Further, we map TFs based on feature distance to a nearest trained TF neighbor. This allows us to scale and expand the repertoire of putative TFBS to any TFs where motif data is available and to any cell types where accessibility data is obtainable. Our method has the potential to be applied in previously intractable domains to reveal the regulatory complexity of multicellular higher eukaryotes.

P09

Identification of chromatin accessibility from nucleosome occupancy and methylome sequencing

Subject: Machine learning, inference and pattern discovery

Presenting Author: Yongjun Piao, Chungbuk National University

Author(s):

Seongkeon Lee, Sungshin Women's University, Korea, Rep
Keith D. Robertson, Center for Individualized Medicine, Mayo Clinic, United States
Huidong Shi, Georgia Regents University, United States
Keun Ho Ryu, Chungbuk National University, Korea, Rep
Jeong-Hyeon Choi, Georgia Regents University, United States

Abstract:
Chromatin is a fundamental structure for compactly packaging a genome and reducing its volume in eukaryotic cells. The nucleosome is the basic repeating unit of chromatin and is composed of ~145bp of DNA wrapped around histone proteins. Positioning of nucleosomes throughout the genome, also known as nucleosome occupancy, plays a crucial role in epigenetic regulation of gene activation and silencing. It is well known that nucleosome positioning influences DNA methylation and histone modifications such as methylation, acetylation, and phosphorylation. Recently, nucleosome occupancy and methylome sequencing (NOMe-seq) has been developed to allow simultaneous profiling of chromatin accessibility and DNA methylation on single molecules. However, to the best of our knowledge, there is no standard method for de novo identification of nucleosome occupancy from NOMe-seq data. Here, we present a novel algorithm for identifying nucleosome-occupied regions (NORs) from NOMe-seq data based on a seed-extension approach. The proposed algorithm first identifies seeds that are very likely GCHs in NORs, next extends seeds as long as the average of GCH methylation scores is smaller than a threshold, and finally decides the end point of the extended seeds using the predicted mean and standard deviation of methylation scores based on a Gaussian mixture model. It also conducts statistical tests to assess the significance of identified NORs. The efficiency and effectiveness of the proposed algorithm were tested on simulated datasets, and the experimental results showed that the proposed method outperformed the existing methods and achieved sensitivity > 0.97 and specificity > 0.99.
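The seed-and-extend steps described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the input is a list of GCH methylation scores in genomic order, the thresholds seed_max and mean_max are hypothetical, the model-based end-point decision is replaced by a simple trimming rule (boundary sites whose own score reaches the threshold are dropped), and the statistical significance test is omitted.

```python
def find_nors(scores, seed_max=0.1, mean_max=0.3):
    """Return (start, end) index pairs, inclusive, of candidate
    nucleosome-occupied regions in a list of GCH methylation scores."""
    regions = []
    n = len(scores)
    i = 0
    while i < n:
        if scores[i] > seed_max:          # not a seed: methylation too high
            i += 1
            continue
        start = end = i
        total = scores[i]
        # Extend right while the region mean stays below mean_max.
        while end + 1 < n and (total + scores[end + 1]) / (end - start + 2) < mean_max:
            end += 1
            total += scores[end]
        # Extend left, without overlapping the previous region.
        floor = regions[-1][1] + 1 if regions else 0
        while start > floor and (total + scores[start - 1]) / (end - start + 2) < mean_max:
            start -= 1
            total += scores[start]
        # Stand-in for the end-point decision: trim high-scoring boundary sites.
        while start < i and scores[start] >= mean_max:
            start += 1
        while end > i and scores[end] >= mean_max:
            end -= 1
        regions.append((start, end))
        i = end + 1
    return regions


# Hypothetical methylation scores for eleven consecutive GCH sites.
scores = [0.9, 0.8, 0.2, 0.05, 0.0, 0.1, 0.4, 0.9, 0.95, 0.05, 0.9]
print(find_nors(scores))  # [(2, 5), (9, 9)]
```

In the example, averaging lets the first region briefly run into two highly methylated sites before trimming pulls the boundary back, which shows why a separate end-point decision step is needed. The single-site region at index 9 is the kind of call a significance test would be expected to filter out.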

P10

Machine learning and genomic analysis to predict drug resistance in Mycobacterium tuberculosis

Subject: Machine learning, inference and pattern discovery

Presenting Author: Gargi Datta, University of Colorado School of Medicine, National Jewish Health

Author(s):

Rebecca Davidson, National Jewish Health, United States
Sonia Leach, University of Colorado School of Medicine, National Jewish Health, United States
Michael Strong, University of Colorado School of Medicine, National Jewish Health, United States

Abstract:
Tuberculosis, caused by Mycobacterium tuberculosis, is the second leading cause of death due to an infectious disease. While the incidence of TB cases is declining, an upsurge of drug-resistant strains of M. tuberculosis is a global cause for concern. Understanding the mechanisms associated with TB drug resistance development and quick recognition of resistant strains are critical to limiting the spread of drug-resistant disease. We hypothesize that a combination of genotyping and machine learning provides an accurate and efficient way to identify drug resistance. We have created a fully automated sequence analysis and mutation identification pipeline to identify mechanisms associated with the development of drug resistance. Our training set includes 3502 M. tuberculosis genome sequences with phenotypic susceptibility information from publicly available sources. To characterize existing mutations, we feed these sequences through our mutation analysis pipeline. We have created a diverse feature set that includes different feature types and data types. To combine these different feature and data types into a non-redundant and informative feature set, we are working on a novel way to handle feature selection with mixed data, using an existing simultaneous feature selection and classification algorithm for sparse and imbalanced genomic data that uses a combination of model-based and instance-based methods for classification. Finally, we aim to create a publicly available web and mobile application to facilitate fast delivery of drug-resistance profiles to researchers and clinicians.

P11

Alignment free phylogenies of giant viruses

Subject: Other

Presenting Author: Shane Dorden, University of Tampa

Author(s):

Padmanabhan Mahadevan, University of Tampa, United States

Abstract:
The availability of whole genome sequences of giant viruses provides a tantalizing opportunity to look more closely at the phylogeny and evolution of these viruses. Since the genomes of these viruses are relatively large, the speed of alignment free methods to determine their phylogeny is particularly appealing. Alignment free phylogenies of several of the largest giant viruses including Pandoravirus, Mimivirus, Moumouvirus, and Pithovirus were constructed using distance measures such as Euclidean and Jensen-Shannon. Differences were found in the phylogenies constructed using different distance measures. These phylogenies were also compared to phylogenies using the traditional alignment approach and differences were found between these methods as well. The phylogenies were also used to determine the correlation of evolutionary distance to conserved gene order between these viruses. The results provide greater insight into giant virus phylogeny and their gene order conservation, as well as the differences between using alignment free versus traditional approaches to phylogeny construction.
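A minimal Python sketch of how alignment-free distances of this kind can be computed is given below. The abstract does not describe the feature representation, so k-mer frequency profiles with k = 3 are an assumption, and the sequences are toy strings rather than viral genomes. The resulting pairwise distances would typically be passed to a distance-based tree method such as neighbor joining or UPGMA, which is not shown.

```python
from collections import Counter
from itertools import product
from math import log2, sqrt


def kmer_profile(seq, k=3):
    """Relative frequencies of all 4^k DNA k-mers, in a fixed order.
    k-mers containing other characters (e.g. N) are ignored."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[m] for m in kmers)
    return [counts[m] / total for m in kmers]


def euclidean(p, q):
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def jensen_shannon(p, q):
    """Jensen-Shannon divergence in bits (0 = identical, 1 = disjoint)."""
    def kl(x, m):
        return sum(a * log2(a / b) for a, b in zip(x, m) if a > 0)
    mid = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * kl(p, mid) + 0.5 * kl(q, mid)


# Toy sequences standing in for genomes.
a = kmer_profile("ACGTACGTACGTAAAC")
b = kmer_profile("ACGTACGTTCGTAAAC")
c = kmer_profile("GGGGCCCCGGGGCCCC")
print(euclidean(a, b), euclidean(a, c))
print(jensen_shannon(a, b), jensen_shannon(a, c))
```

Because the two measures weight rare and common k-mers differently, they can rank the same pairs of genomes differently, which is one way the resulting trees may come to disagree.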

P12

Abstract Withdrawn

P13

Genomic Big Data: scalability challenges and solutions

Subject: Machine learning, inference and pattern discovery

Presenting Author: Faraz Faghri, University of Illinois at Urbana-Champaign

Author(s):

Roy Campbell, University of Illinois at Urbana-Champaign, United States
Sayed Hadi Hashemi, University of Illinois at Urbana-Champaign, United States
Mohammad Babaeizadeh, University of Illinois at Urbana-Champaign, United States

Abstract:
Genomics plays a role in nine of the 10 leading causes of death in the United States. For people who are at increased risk for hereditary breast and ovarian cancer, or hereditary colorectal cancer, genetic testing may reduce illness risks by guiding evidence-based interventions. Such interventions involve the emergent practice of precision medicine that uses an individual’s genetic profile to guide decisions made in regard to the prevention, diagnosis, and treatment of disease. At the nexus of precision medicine and computer science (cloud computing and machine learning) lie many research challenges for adapting and optimizing data-driven analytics to change the medical care delivered to patients in the US and beyond. Focused on high-speed data analytics on large clusters for genomic data, our research applies scalable algorithms and new storage and computation designs, and aims to achieve the possibilities of precision medicine with significant improvements in performance.

In this work we visit four major challenges facing big data genomics: data acquisition, data storage, data distribution, and data analysis. We present our solutions for privacy-preserving data distribution and scalable data analytics.

P14

Virpack, a Viral Metagenomic Functional Annotation Package

Subject: Metagenomics

Presenting Author: Cody Glickman, University of Colorado - Anschutz

Abstract:
Characterizing interactions of the host mucosal epithelial layer with microbial communities is important to understanding chronic and complex diseases. Bacterial 16S and metagenomics are frequently utilized tools for characterizing bacterial microbiome communities and metabolites; however, bacteria are only a portion of the available genetic material in the microbial community composition. The virome is a grossly understudied piece of the microbial community, and abundant viral particles outnumber bacterial cells forty-to-one in host mucosal layers. Currently, metagenomic shotgun sequencing is the only available methodology to study the composition of viral particles that requires no a priori knowledge of the viruses in the samples. Understanding the triad of bacterial, viral, and host interactions is dependent on further refinement of viral metagenomic methods. However, these methods are reliant on extant databases for read annotation, and the databases have heavy biases towards human-associated pathogens, leaving many viral reads unannotated. Proper binning and assembly of viral reads increases the efficiency of viral annotations with these limited databases. To assist researchers interested in performing viral metagenomic analyses, we are currently developing Virpack, an R package to streamline the process of assembly, annotation, and abundance estimation in viral metagenomic datasets. The package also implements a set of methods for functional annotation of metagenomic reads and abundance estimation of enzymatic genes. The eventual goal of the Virpack package is to work in tandem with bacterial metagenomics software to provide researchers with a more accurate assessment of the microbial functional gene reservoir in human disease niches.

P15

Abstract Withdrawn

P16

Gene expression profiling to predict response to EGFR inhibition

Subject: Machine learning, inference and pattern discovery

Presenting Author: Andrew Goodspeed, University of Colorado Anschutz Medical Campus

Abstract:
Cetuximab, an antibody targeting the epidermal growth factor receptor (EGFR), is approved for treatment of colorectal and head and neck cancers. Although KRAS mutant tumors have been shown to be resistant to cetuximab, other single biomarkers are only marginally effective in predicting which patients will be sensitive to treatment. Here we evaluate whether gene expression profiling can be used to identify and predict responders to cetuximab in colorectal cancer, similarly to how gene signatures can predict response to other targeted therapies. Using prediction analysis of microarrays (PAM), we identified an expression signature of 23 Affymetrix probesets that was effective in stratifying responders and non-responders to cetuximab treatment. To validate this signature in vitro, we confirmed that cancer cell lines predicted to be sensitive to cetuximab are indeed more responsive to EGFR inhibition (P<0.05). We also found the cetuximab gene signature was effective in predicting overall survival in pediatric glioblastoma patients treated with cetuximab (P=0.01). Therefore, a predictive gene signature derived from colorectal cancer is capable of predicting sensitivity to similar treatments across cancer types in vitro and in patients. This identifies a novel method using gene expression profiling to predict which individual cancer patients may benefit from cetuximab, regardless of cancer type.

P17

In-silico identification of prognostically inversely correlated miRNA - mRNA pairs in multiple cancers

Subject:

Presenting Author: Chirayu Goswami, Thomas Jefferson University

Author(s):

Zi-Xuan Wang, Thomas Jefferson University, United States

Abstract:
Despite the numerous methods available to identify potential mRNA targets for miRNAs, the prognostic relationship of these molecules in diseases like cancers, where deregulation of gene expression is a major pathogenic factor, has not yet been emphasized. We performed in-silico identification of prognostically inversely correlated miRNA - mRNA pairs (PICs) in multiple cancers using expression data from The Cancer Genome Atlas. Partners in a PIC show inverse correlation of expression and opposite hazard implication. Using a three-step approach, we identified a total of 1,201,301 PICs from 23 cancer types, several of which have previously been shown to have a predicted or experimentally validated relationship. A maximum of 375,621 PICs were identified in Lower Grade Gliomas, while a minimum of 300 PICs were identified in Prostate adenocarcinoma. Four miRNA-mRNA pairs were identified as PICs in 7 different cancer types. Two miRNA-mRNA pairs were identified as PICs in 5 different cancer types where the mRNA is also a validated target of the miRNA. Organ-specific analysis was performed to identify PICs common to cancers from the same or related tissue of origin. We have also developed a database, PROGTar, for hosting our analysis results. PROGTar is available freely for non-commercial use at www.xvm145.jefferson.edu/progtar. We believe our method and analysis results will provide a novel, prognostically relevant, pan-cancer perspective on the study of miRNA-mRNA interactions and miRNA target validation.

P18

PROGTar: A database of prognostically inversely correlated miRNA and Genes in multiple cancers.

Subject:

Presenting Author: Chirayu Goswami, Thomas Jefferson University

Abstract:
PROGTar is a database of Prognostically Inversely Correlated miRNA-mRNA pairs (PICs) in 23 cancer types. Partner miRNA and mRNA in a PIC show inverse correlation of expression and opposite hazard implications. We analyzed miRNA and mRNA expression data downloaded from The Cancer Genome Atlas (TCGA) in a three-step approach to identify PICs in different cancer types. In the first step, we performed correlation analysis between miRNAs and mRNAs for each cancer type. This was followed by hazard analysis performed separately for miRNAs and mRNAs using expression data and survival-related clinical variables. In the third step, we merged the correlation and hazard result sets. The resultant miRNA and mRNA pairs were filtered to retain only pairs that had negative correlation between miRNA and mRNA expression and opposite hazards for miRNA and mRNA, at a statistically significant level (p <= 0.05).
Results from our pan-cancer analysis are available on the web-based application PROGTar. Users can search for a miRNA/mRNA of interest in the database to find inversely correlated partners. Users can also create prognostic plots for the PICs of interest. Prognostic plots created with PROGTar show arms for high and low expression of the target molecule and its corresponding partner in the PIC, bifurcated at the median of expression. The plots also show arms for a combined prognostic signature calculated using expression levels of both partners in the PIC. The application is available freely for non-commercial use at www.xvm145.jefferson.edu/progtar.
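The merge-and-filter logic of the third step can be illustrated with a short Python sketch. It assumes that the correlation and hazard analyses have already been run and that their results are available as plain dictionaries; the data layout, the molecule names, and the reading of the significance criterion (all three p-values at or below 0.05) are assumptions made for illustration and are not taken from the PROGTar implementation.

```python
def find_pics(correlations, mirna_hazards, mrna_hazards, alpha=0.05):
    """Keep miRNA-mRNA pairs with negative correlation, hazard ratios on
    opposite sides of 1, and all p-values <= alpha.

    correlations:  {(mirna, mrna): (correlation r, p-value)}
    *_hazards:     {molecule: (hazard ratio, p-value)}
    """
    pics = []
    for (mirna, mrna), (r, p_corr) in correlations.items():
        hr_mirna, p_mirna = mirna_hazards[mirna]
        hr_mrna, p_mrna = mrna_hazards[mrna]
        negatively_correlated = r < 0
        opposite_hazards = (hr_mirna - 1) * (hr_mrna - 1) < 0
        significant = max(p_corr, p_mirna, p_mrna) <= alpha
        if negatively_correlated and opposite_hazards and significant:
            pics.append((mirna, mrna))
    return pics


# Hypothetical results for one cancer type.
correlations = {
    ("miR-A", "GENE1"): (-0.62, 0.001),  # passes every criterion
    ("miR-A", "GENE2"): (0.40, 0.010),   # positive correlation
    ("miR-B", "GENE1"): (-0.30, 0.200),  # correlation not significant
    ("miR-B", "GENE3"): (-0.55, 0.003),  # hazards point the same way
}
mirna_hazards = {"miR-A": (1.8, 0.01), "miR-B": (0.6, 0.02)}
mrna_hazards = {"GENE1": (0.5, 0.03), "GENE2": (0.7, 0.04), "GENE3": (0.8, 0.01)}
print(find_pics(correlations, mirna_hazards, mrna_hazards))  # [('miR-A', 'GENE1')]
```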

P19

Unsupervised Extraction of Biological Features from Complete Gene Expression Compendia with ADAGE

Subject: Machine learning, inference and pattern discovery

Presenting Author: Casey Greene, University of Pennsylvania

Author(s):

Jie Tan, Geisel School of Medicine at Dartmouth, United States
John Hammond, Geisel School of Medicine at Dartmouth, United States
Deborah Hogan, Geisel School of Medicine at Dartmouth, United States

Abstract:
The growth in genome-scale data for different species in publicly available databases provides the opportunity for hypothesis generation by the application of new analytical methods for biological interpretation of these data. Here, we present an unsupervised machine-learning approach, ADAGE (Analysis using Denoising Autoencoders of Gene Expression), and apply it to the interpretation of all of the publicly available gene expression data for Pseudomonas aeruginosa, an important opportunistic bacterial pathogen. Without any prior knowledge of any genome structure or gene function, the P. aeruginosa ADAGE model found that co-operonic genes often participated in similar processes and accurately predicted which genes had similar functions. Using newly generated data, we were able to use the model to identify gene expression differences between strains, to identify the cellular response to low oxygen, to predict the involvement of biological processes in previously published data despite low-level expression differences in directly involved genes, and to identify processes that are most highly responsive to different environmental perturbations. These properties allow ADAGE to be readily applied to newly generated data to reveal differentially altered pathways or processes and identify broad mechanisms underlying observed phenotypic effects.

P20

Abstract Withdrawn

P21

Application of predicted relative solvent accessibility to the development of surface area energy function

Subject: Optimization and search

Presenting Author: Seungryong Heo, Korea Institute for Advanced Study

Abstract:
Solvent accessible surface area (SASA) of a protein is one of its most important structural features. SASA is often used as an analytical tool for describing protein-water interactions or calculating the transfer free energy of proteins. More accurate predictions of protein surface area therefore provide insights that are useful for predicting a protein’s structure as well as its function.
In 2012, we presented a method to predict the solvent accessibility of proteins, called SANN, which is based on a nearest-neighbor method applied to the sequence profile. The overall performance of SANN was shown to be superior to the currently available methods (e.g. FKNN, SABLE, PROF, ACCpro, NETASA).
In order to utilize solvent accessibility for protein structure prediction, a restraining energy term E(SA) was designed and then implemented within the TINKER molecular modeling package. The performance of E(SA) was measured using seven benchmark sets and five different energy terms. We found this new energy term promising because it showed a better correlation coefficient with the structural quality in most targets tested, as measured by energy versus TM-score.

P22

Visualizing Robustness of Complex Phenotype and Biomarker Associations

Subject: Machine learning, inference and pattern discovery

Presenting Author: Michael Hinterberg, University of Colorado-Denver

Author(s):

David Kao, University of Colorado-Denver, United States
Larry Hunter, University of Colorado-Denver, United States
Carsten Görg, University of Colorado-Denver, United States

Abstract:
Biomarker discovery in clinical medicine is important for distinguishing populations, as well as understanding the etiology of disease. Increasingly finer-grained complex disease phenotypes and patient subpopulations are being defined through combinations of multiple clinical variables. Any set of biomarkers associated with complex phenotypes should not only be biologically plausible, but also sufficiently robust so as to be replicated in further study in additional populations. These requirements create a challenge in simultaneously optimizing statistical association, model complexity, and robustness. In previous work, we developed PEAX (Phenotype-Expression Association eXplorer), an open-source tool that allows rapid visual exploration and analysis of complex clinical phenotype and gene expression association. Through use-case studies and feedback, we identified visualizations and algorithms to guide the user toward additional insight of meaningful biological associations. Notably, when a user defines a complex, multivariate phenotype, the search space surrounding individual clinical features may be interesting, and is often manually explored. We have extended PEAX to calculate and display additional association models; we use small multiples to represent incremental refinements to the user-defined model, and thus guide the analyst through the process of refining models and understanding the overall robustness of models. We demonstrate the utility of our approach for modeling robustness through examples of visualizations using real and simulated data.

P23

Rhesus θ-Defensin-1 (RTD-1) as a Potential Anti-Inflammatory Therapy for Cystic Fibrosis Lung Disease

Subject: Other

Presenting Author: Jordanna Jayne, University of Southern California

Abstract:
Cystic Fibrosis (CF) is characterized by recurring chronic lung infections resulting in lung fibrosis, remodeling of the airways and reduced pulmonary function. Unfortunately, few effective anti-inflammatory therapies are available for CF lung disease. Here we have shown that in a chronic P. aeruginosa infection C57BL/6 mouse model rhesus θ-defensin-1 (RTD-1) significantly reduces lung inflammation. RTD-1 is an antimicrobial peptide with immunomodulating activity. In vivo RTD-1 nebulization demonstrated a significant reduction in neutrophil lung infiltration compared to untreated controls (Day 3 mean -55% p=0.0003; Day 7 mean -32% p=0.0097). Furthermore, in vivo RTD-1 treatment significantly reduced inflammatory cytokines IL-6, IL-7 and IL-1β (Day 3 p<0.05). Illumina microarray analysis revealed 109 differentially expressed genes (DEGs) post-treatment in the bronchoalveolar lavage fluid (BALF) cells and 58 DEGs in the lung homogenate (Fold-change>2; FDR=0.05). Twenty-six genes were commonly downregulated between both tissue groups. Ingenuity Pathway Analysis (IPA) associated these DEGs into biological functions and diseases. The DEGs were predominantly associated with reductions in immunological response and cell migration within the lung tissue and BALF. IPA Network Analysis revealed common changes in ‘Cell-To-Cell Signaling and Interaction’, ‘Hematological System Development and Function’ and ‘Inflammatory Response’ in both tissue groups. Our data provide not only networks between the genes for understanding immune modulating properties of RTD-1, but also pathway maps for clarifying cellular targets. The downregulation of inflammatory genes is consistent with the observed treatment effect of RTD-1 on reducing airway neutrophils. These data are promising and suggest RTD-1 has therapeutic potential for CF.

P24

Stochastic Models for Studying Lymphocyte Responses Using Flow Cytometry Experiments

Subject: Machine learning, inference and pattern discovery

Presenting Author: Andrey Kan, The Walter and Eliza Hall Institute

Author(s):

Damian Pavlyshyn, The University of Melbourne, Australia
John Markham, Peter MacCallum Cancer Centre, Australia
Mark Dowling, Immunology, Australia
Susanne Heinzel, Immunology, Australia
Jie Zhou, Immunology, Australia
Julia Marchingo, Immunology, Australia
Philip Hodgkin, Immunology, Australia

Abstract:
Lymphocyte responses are complex dynamic processes whereby B and T cells undergo division and differentiation triggered by pathogenic stimuli. A quantitative understanding of lymphocyte response regulation poses a significant theoretical and experimental challenge for immunology. Flow cytometry coupled with a number of fluorescent probes is a common method for measuring progression of lymphocyte responses. Mathematical modelling of flow cytometry data plays an increasingly important role in lymphocyte studies.
Fitting a model to data is a key step in mathematical modelling, but the challenge here is the considerable variability in measurements that originates from biological and measurement noise. This noise is difficult to characterise analytically, and current model fitting methods adopt simplifying assumptions to describe the distribution of measurements. The problem with these studies is that different and conflicting sets of assumptions have been proposed for fitting to the same type of data. Here we aim to establish a consensus as to which method should be considered the standard way of model fitting in lymphocyte studies. To this end, we test the assumptions used in previous methods against a wide range of experimental data, and we find that these assumptions are strongly violated. Therefore, we propose a new measurement model that is consistent with the collected data. Our evaluation suggests that the new model can be reliably used for model fitting across a variety of conditions. Our work provides a foundation for modelling measurements in flow cytometry experiments thus facilitating progress in quantitative studies of lymphocyte responses.

P25

Spatial modeling of drug delivery routes for treatment of disseminated ovarian cancer

Subject: Simulation and numeric computing

Presenting Author: Kimberly Kanigel Winner, University of Colorado

Author(s):

Mara Steinkamp, University of New Mexico, United States
Rebecca Lee, University of New Mexico, United States
Maciej Swat, Indiana University, United States
Carolyn Muller, University of New Mexico, United States
Melanie Moses, University of New Mexico, United States
Yi Jiang, Georgia State University, United States
Bridget Wilson, University of New Mexico, United States

Abstract:
In ovarian cancer, metastasis is typically confined to the peritoneum, and requires surgical removal of the primary tumor and macroscopic secondary tumors. More effective strategies are needed to target microscopic spheroids that persist after debulking surgery. To treat this residual disease, therapeutic agents can be administered by either intravenous (IV) or intraperitoneal (IP) infusion. We use a cellular Potts model to compare tumor penetration of two classes of drugs (small- and large-molecule) when delivered by these two alternative routes. Experimental measures included fluorescence recovery after photobleaching (FRAP) to measure penetration of non-specific antibodies into cultured human ovarian cancer (SKOV3.ip1) spheroids, as well as two-photon imaging of explanted tumors from orthotopic xenografts in nude mice. Stereology analysis was used to estimate the range of vascular densities in disseminated tumors from patients. The model considers the primary route when drug is administered either IV or IP, as well as the subsequent exchange into the other delivery volume as a secondary route. By accounting for these dynamics, the model shows that IP infusion is the markedly better route for delivery of both small molecule and antibody therapies for microscopic, avascular tumors typical of patients with ascites. Small tumors attached to peritoneal organs, ranging from vascularity of 2-10%, also show superior drug delivery via the IP route. Use of both delivery routes may provide the best total coverage of tumors, particularly when there is a significant burden of avascular spheroids suspended in the peritoneal fluid as well as larger, neo-vascularized secondary tumors.

P26

Alignment-free approach for reads classification within a single metagenomic dataset

Subject: Metagenomics

Presenting Author: Lusine Khachatryan, Leiden University Medical Centre

Author(s):

Seyed Yahya Anvar, Leiden University Medical Centre, Netherlands
Peter de Knijff, Leiden University Medical Centre, Netherlands
Johan den Dunnen, Leiden University Medical Centre, Netherlands
Jeroen Laros, Leiden University Medical Centre, Netherlands

Abstract:
Due to advances in sequencing technologies, the human microbiome is becoming an increasingly informative source for scientific research. A proper comparison by deep analysis of skin bacterial communities would make a valuable contribution to many fields, notably forensics. The analysis of metagenomic samples usually involves mapping reads to known genes or pathways and comparing the obtained profiles. However, microbial communities are usually complex and contain mixtures of hundreds to thousands of unknown bacteria, which affects the accuracy and completeness of alignment-based approaches.
We developed an alignment-free approach (kPal) that is based on k-mer frequencies to assess the level of similarity between and within metagenomic datasets. Previously, our method was successfully applied to the comparison of different metagenomic samples. Recently, we have shown that a k-mer-based approach can also be used to classify reads within a single metagenomic dataset. To illustrate this, we used simulated and real data from a single-molecule real-time sequencer (PacBio), which provides long reads (500–20,000 bp). We have shown that k-mer frequencies can reveal the relationships between reads within a single metagenome, leading to a clustering per bacterium. This approach may potentially be used to estimate the metagenome complexity and to simplify the subsequent assembly procedure.
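The idea of comparing reads by their k-mer frequencies can be illustrated with a minimal Python sketch. This is not kPal's implementation: the function names, the choice of k=3, and the Euclidean distance are assumptions made here for illustration only, and a real analysis of long reads would likely use larger k and account for reverse complements.

```python
from collections import Counter
from math import sqrt


def kmer_profile(seq, k=3):
    """Return normalized k-mer frequencies of a DNA sequence as a dict."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Keep only k-mers made of unambiguous bases (drops k-mers containing N, etc.).
    counts = {kmer: n for kmer, n in counts.items() if set(kmer) <= set("ACGT")}
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()} if total else {}


def profile_distance(p, q):
    """Euclidean distance between two k-mer frequency profiles.

    Illustrative choice only; other distance measures could be substituted.
    """
    keys = set(p) | set(q)
    return sqrt(sum((p.get(key, 0.0) - q.get(key, 0.0)) ** 2 for key in keys))
```

Reads with similar composition give small distances, so a matrix of pairwise distances between reads could be passed to any standard clustering method to group reads by likely source organism.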

P27

Abstract Withdrawn

P28

NGSCheckMate: Software for ensuring sample identity in next-generation sequencing studies, with or without alignment

Subject: Other

Presenting Author: Soohyun Lee, Harvard Medical School

Author(s):

Sejoon Lee, Samsung Medical Center, Korea, Rep
Woong-Yang Park, Samsung Medical Center, Korea, Rep
Peter Park, Harvard Medical School, United States
Eunjung Lee, Harvard Medical School, United States

Abstract:
Next generation sequencing (NGS) has been widely adopted in biology and medicine. We often need to confirm that different batches of NGS data were derived from the same individual for accurate downstream analyses such as identification of somatic mutations specifically present in a certain tissue (e.g. cancer). Among all different steps of quality control, comparison of sequencing reads is the most direct and effective way to ensure the sample identity. Here, we developed NGSCheckMate, a freely available, easy-to-use software tool, to identify NGS data from the same individual. It extracts, from sequencing reads before or after reference alignment, genetic profiles based on known population polymorphic single nucleotide variants (SNVs) from each sample. It distinguishes samples from the same individual by comparing correlations of genetic profiles to its pre-built classification model. Our performance evaluation demonstrates that NGSCheckMate is generally applicable to diverse contexts of NGS studies and sequencing scope (whole genome (WGS), whole exome, RNA-seq, targeted sequencing and ChIP-seq). The method is applicable to low sequencing depth (> 0.2X). We also provide a module that can be run on fastq files without alignment. Given that alignment may take days for large sequencing data, our alignment-free module can be useful for a quick initial check. It requires minimal memory and takes less than a minute for a standard RNA-seq dataset using a single core. We recommend using NGSCheckMate as a first step in any NGS study that requires sample identity quality control.
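The principle of comparing genetic profiles by correlation can be sketched in a few lines of Python. This is an illustration only, not NGSCheckMate's code: the function names and the fixed cutoff of 0.8 are hypothetical, whereas the tool itself compares correlations against a pre-built classification model. Profiles are assumed here to be dictionaries mapping an SNV identifier to the variant allele fraction observed in a sample.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists (0.0 if undefined)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def profile_correlation(vaf_a, vaf_b):
    """Correlate allele fractions over the SNVs covered in both samples."""
    shared = sorted(set(vaf_a) & set(vaf_b))
    if len(shared) < 2:
        return 0.0
    return pearson([vaf_a[s] for s in shared], [vaf_b[s] for s in shared])


def looks_like_same_individual(vaf_a, vaf_b, threshold=0.8):
    """Hypothetical fixed cutoff standing in for a trained classification model."""
    return profile_correlation(vaf_a, vaf_b) >= threshold
```

A fixed threshold is the simplest possible decision rule; in practice the expected correlation depends on sequencing depth, which is one reason a trained model is preferable.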

P29

EMSAR: estimation of transcript abundance from RNA-seq data by mappability-based segmentation and reclustering

Subject: Optimization and search

Presenting Author: Soohyun Lee, Harvard Medical School

Author(s):

Chae Hwa Seo, DNA link, Korea, Rep
Burak Alver, Harvard Medical School, United States
Sanghyuk Lee, Ewha Womans University, Korea, Rep
Peter Park, Harvard Medical School, United States

Abstract:
RNA-seq has been widely used for genome-wide expression profiling. RNA-seq data typically consists of tens of millions of short sequenced reads from different transcripts. However, due to sequence similarity among genes and among isoforms, the source of a given read is often ambiguous. Existing approaches for estimating expression levels from RNA-seq reads tend to compromise between accuracy and computational cost.
We introduce a new approach for quantifying transcript abundance from RNA-seq data. EMSAR (Estimation by Mappability-based Segmentation And Reclustering) groups reads according to the set of transcripts to which they are mapped and finds maximum likelihood estimates using a joint Poisson model for each optimal set of segments of transcripts. The method uses nearly all mapped reads, including those mapped to multiple genes. With an efficient transcriptome indexing based on modified suffix arrays, EMSAR minimizes the use of CPU time and memory while achieving accuracy comparable to the best existing methods.
EMSAR is a method for quantifying transcripts from RNA-seq data with high accuracy and low computational cost. EMSAR is available at https://github.com/parklab/emsar
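The first step described above, grouping reads by the set of transcripts to which they map, can be sketched as follows. This is a simplified illustration and not EMSAR's implementation: the function name and input layout are assumptions, and the segmentation, reclustering and joint Poisson maximum likelihood steps are not shown.

```python
from collections import Counter


def group_reads_by_transcript_set(alignments):
    """Count reads per distinct set of transcripts they map to.

    `alignments` maps a read ID to an iterable of transcript IDs
    (hypothetical input layout). Unmapped reads are skipped.
    """
    groups = Counter()
    for transcripts in alignments.values():
        key = frozenset(transcripts)
        if key:
            groups[key] += 1
    return groups
```

Because reads sharing the same transcript set are interchangeable for abundance estimation, working with these counts rather than individual reads reduces the amount of data a downstream likelihood model has to handle.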

P30

A comparison of genetically matched cell lines reveals the equivalence of human iPSCs and ESCs

Subject: Other

Presenting Author: Soohyun Lee, Harvard Medical School

Author(s):

Jiho Choi, Cancer Center and Center for Regenerative Medicine, Massachusetts General Hospital, United States
Kendell Clement, Broad Institute, United States
William Mallard, Broad Institute, United States
Guidantonio Malagoli Tagliazucchi, University of Modena and Reggio Emilia, Italy
Hotae Lim, Johns Hopkins University School of Medicine, United States
In Young Choi, Johns Hopkins University School of Medicine, United States
Francesco Ferrari, Harvard Medical School, United States
Alex Tsankov, Broad Institute, United States
Romona Pop, Harvard Stem Cell Institute, United States
Gabsang Lee, Johns Hopkins University School of Medicine, United States
John Rinn, Broad Institute, United States
Alexander Meissner, Harvard Stem Cell Institute, United States
Peter Park, Harvard Medical School, United States
Konrad Hochedlinger, Cancer Center and Center for Regenerative Medicine, Massachusetts General Hospital, United States

Abstract:
The equivalence of human induced pluripotent stem cells (hiPSCs) and human embryonic stem cells (hESCs) remains controversial. Here we use genetically matched hESC and hiPSC lines to assess the contribution of cellular origin (hESC vs. hiPSC), the Sendai virus (SeV) reprogramming method and genetic background to transcriptional patterns while controlling for cell line clonality and sex. We find that transcriptional variation originating from genetic background dominates over variation due to cellular origin or SeV infection. Moreover, the 49 differentially expressed genes we detect between genetically matched hESCs and hiPSCs neither predict functional outcome nor distinguish an independently derived, larger set of unmatched hESC and hiPSC lines. We conclude that hESCs and hiPSCs are molecularly and functionally equivalent and cannot be distinguished by a consistent gene expression signature. Our data further imply that genetic background variation is a major confounding factor for transcriptional comparisons of pluripotent cell lines, explaining some of the previously observed expression differences between genetically unmatched hESCs and hiPSCs.

P31

Examination of risk factors for nontuberculous mycobacterial infections among National Jewish Health hospital patients in the United States

Subject: Other

Presenting Author: Ettie Lipner, University of Colorado Denver

Author(s):

David Knox, University of Colorado, Denver, United States
Joshua French, University of Colorado, Denver, United States
Jordan Rudman, Colorado College, United States
Michael Strong, National Jewish Health, United States

Abstract:
Nontuberculous mycobacterial (NTM) disease has emerged as an increasingly prevalent infectious disease across the United States, particularly over the last two decades. Prevalence of NTM varies by geographic region, but the geospatial factors influencing this variation remain unknown. The objective of our study is to identify spatial clusters of NTM disease at the zip code level and to identify associated variables of interest. NTM case data were obtained from the National Jewish Health’s (NJH) Electronic Medical Records (EMR). Zip code level NTM case counts and age-adjusted population data were uploaded to SaTScan to identify high-risk spatial clusters of NTM disease across the US. Poisson regression models are used to estimate associations of climatic, environmental, and socio-economic factors with NTM infection risk.

P32

Association analysis of driver genes of hepatocellular carcinoma with cancer hallmarks

Subject: System integration

Presenting Author: Liangqun Lu, University of Hawaii at Manoa

Author(s):

Travers Ching, University of Hawaii at Manoa, United States

Abstract:
Primary liver cancer is the second leading cause of cancer death globally, and hepatocellular carcinoma (HCC) accounts for approximately 75% of liver cancer cases. It is widely believed that driver genes contribute to cancer hallmarks such as cell proliferation, angiogenesis and metastasis. However, the driver genes found in HCC are not specific to HCC and the majority are discovered in other cancers. Moreover, downstream effects of driver genes on tumor development in HCC are not clear.

Using TCGA omic data, we obtained 13 driver genes (q<0.1) in HCC through MutSigCV, which detects significantly mutated genes relative to the background mutation rate. For each driver gene, we tested the expression change of genes related to cancer hallmarks, as well as miRNAs. The 13 driver genes have high mutation frequency and include CTNNB1, TP53, RB1 and the HCC-specific gene ALB. Racial differences exist in TP53, RB1 and ALB. The gene expression changes associated with driver gene mutations show that Ccnd2 and Cdk6 in the cell cycle, Igf1r, Tgfb1 and Vegfb in signaling, and Mmp11 and Mmp2 in metastasis are the most likely altered genes in HCC. Additionally, we found that the driver genes are correlated with the expression of 58 miRNAs in tumor samples. The collaborative regulation of hsa-mir-200c, hsa-mir-335 and hsa-mir-375 makes them potential regulators in therapeutics.

In summary, our study of the association of driver genes with genes involved in cell cycle regulation, signaling and tumor metastasis helps identify functional gene subsets and may help guide therapeutics in HCC.

P33

Detection and interpretation of extrachromosomal microDNAs from next-generation sequencing data

Subject: Other

Presenting Author: Mark Maienschein-Cline, University of Illinois at Chicago

Author(s):

Pinal Kanabar, University of Illinois at Chicago, United States
Stefan Green, University of Illinois at Chicago, United States
Chunxiang Zhang, Rush University, United States

Abstract:
Extrachromosomal microDNAs are short, circular DNA molecules derived from genomic DNA. They are typically hundreds of nucleotides long, and appear to be omnipresent in mammalian cells. However, their mechanism of formation and function in cells is only beginning to be understood. A major roadblock in the development of microDNA studies is the lack of a robust computational methodology for detecting them from next-generation sequencing (NGS) data, and a clear path to interpreting their presence in cells. Confounding these problems is the extremely low molecular reproducibility observed for microDNAs, where biologically replicated experiments turn up very few identical microDNAs. I will present solutions to both challenges: first, I will describe a systematic, flexible bioinformatics pipeline for detecting microDNAs in NGS data. Second, I will address the low molecular reproducibility by showing how a systems-based interpretation of microDNAs can both substantially increase the concordance between biological replicates, as well as distinguish different conditions from each other.

P34

Abstract Withdrawn

P35

Abstract Withdrawn

P36

Hypothesis independent test development using mass spectrometry data from patient clinical groups reveals underlying biological pathways

Subject: Spectral analysis

Presenting Author: Krista Meyer, Biodesix, Inc

Author(s):

Heinrich Roder, Biodesix, Inc, United States
Julia Grigorieva, Biodesix, Inc, United States
Kevin Sayers, Biodesix, Inc, United States
Senait Asmellash, Biodesix, Inc, United States

Abstract:
Identification of biomarkers for stratifying patient subgroups based on diagnosis, treatment selection, measuring response, etc., in many if not most diseases is an unmet need. Seemingly limitless publications report potential biomarkers that are based on our current understanding of biological pathways involved in the disease, while few are ever validated and translated into clinical practice. Our method does not rely on a molecular understanding of the disease state or a particular hypothesis for biomarkers that participate in the disease. We use mass spectrometry data collected from patient serum samples in combination with robust analytical methods, the so-called Diagnostic Cortex™ data analytics platform, to enable the discovery and validation of multivariate tests without any preliminary hypothesis. This approach has many benefits including the streamlined and rapid process from discovery to test commercialization. Indeed, it could be argued that high clinical utility is the most important factor required. However, this argument does not satisfy the desire to understand how and why a test works.
Through several approaches, we have discovered that mass spectral data and classification labels assigned by a test can enrich our understanding of the proteins and biological pathways that allow a test to distinguish patient groups. We will present examples of these methods and the results. While test development is hypothesis-independent, the same data can be useful for the generation of new hypotheses that lead to a greater understanding of the disease states.

P37

Creating molecular diagnostic tests with supervised learning using time-to-event data

Subject: Machine learning, inference and pattern discovery

Presenting Author: Carlos Oliveira, Biodesix

Author(s):

Heinrich Roder, Biodesix, United States
Julia Grigorieva, Biodesix, United States
Krista Meyer, Biodesix, United States
Arni Steingrimsson, Biodesix, United States
Maxim Tsypin, Biodesix, United States
Joanna Roder, Biodesix, United States

Abstract:
Creating new molecular diagnostic tests for prognosis or treatment benefit using sample sets for which class labels for supervised learning are not clearly defined is a challenging task. Tests based on time-to-event outcome endpoints fit within this category, because it is not a priori obvious which patients should be assigned to the better/worse prognosis or benefit/no benefit groups for classifier training. This is particularly important, as time-to-event endpoints such as overall survival are normally the gold standard for assessing treatment benefit.
The Diagnostic Cortex™ data analytics platform can create molecular tests using small training sets with associated time-to-event data. The class labels are defined iteratively at the same time as the classifier is trained. The platform draws on ideas from Deep Learning and, in addition, incorporates important elements focused on dealing with the “more features than instances” problem so that overfitting is minimized. The method has been shown to produce tests that generalize well to independent sample sets and so can provide an accurate performance estimate even when no validation set is available. The method also allows for the tuning of the classifier to meet clinically relevant performance criteria. Although initially developed to work with MALDI-TOF spectral data, the Diagnostic Cortex can also be used to analyze other kinds of data, for example mRNA expression.
We will present some examples of clinical tests in oncology developed with the Diagnostic Cortex from both serum proteomic MALDI-TOF MS and tissue mRNA expression data, including a classifier with clinical utility in immunotherapies.

P38

A framework for reproducible computational research

Subject: Data management methods and systems

Presenting Author: Apua Paquola, Salk Institute for Biological Studies

Author(s):

Jennifer Erwin, Salk Institute for Biological Studies, United States
Fred Gage, Salk Institute for Biological Studies, United States

Abstract:
Analysis of high-throughput biological data often involves the use of many software packages and in-house written code. For each analysis step there are multiple options of software tools available, each with its own capabilities, limitations and assumptions on the input and output data. The development of bioinformatics pipelines involves a great deal of experimentation with different tools and parameters, considering how each would fit into the big picture and the practical implications of their use. Without structure and good practices, combining experimentation with reproducibility could prove challenging. In this work we present a set of methods and tools that enable the user to experiment extensively, while keeping analyses reproducible and organized. The framework centers on a container structure designed to organize analysis steps with the goal of reproducibility, explicitly separating user-generated data from computer-generated data. It also provides version control of code, documentation and data, dependency tracking, logging and automated execution of multiple analysis steps.

P39

Integrating genome-scale network models of transcriptional regulation and metabolism to characterize the biochemical response to genetic perturbations

Subject: Machine learning, inference and pattern discovery

Presenting Author: Rani Powers, University of Colorado Anschutz Medical Campus

Author(s):

James Costello, University of Colorado Anschutz Medical Campus, United States

Abstract:
The generation of –omics data is widespread in modern biomedical research; however, current approaches for modeling and interpreting these data are not keeping pace with the rate of data production. Because each type of –omics data only captures a single layer in the behavior of a complex system, there is a tremendous need for methods that integrate heterogeneous data sources in order to glean mechanistic insights. Two of the cellular processes commonly affected in cancer are gene expression and metabolism. Changes in gene expression are elicited by transcription factor (TF) binding. This response is quantified genome-wide with microarray or RNA-seq experiments, and can be used to construct a transcriptional regulatory network (TRN) of TFs and their targets. Many of these targets are enzymes, which in turn catalyze metabolic reactions and thereby regulate the abundance of metabolic compounds in a cell. Understanding both transcriptional and metabolic regulation is crucial to advancing our understanding of cancer. In prostate cancer, for example, there is relatively little known about how gene expression alterations influence cell metabolism. Therefore, to investigate the interplay between genetic and metabolic phenotypes, we have integrated a human TRN with a global reconstruction of human metabolism (Recon2). With this model, we simulated gene expression changes observed in aggressive prostate cancer and predicted results that were consistent with metabolomics experiments. Furthermore, the results suggest a mechanism behind the aberrant proliferation of this aggressive subtype.

P40

Abstract Withdrawn

P41

Exploring the role of gene interaction in idiopathic pulmonary fibrosis with exome sequencing

Subject:

Presenting Author: Adam Richards, University of Colorado Denver

Author(s):

Tasha Fingerlin, Colorado School of Public Health, United States
Ivana Yang, University of Colorado Denver, United States
David Schwartz, University of Colorado Denver, United States

Abstract:
A number of genetic loci have been associated with idiopathic pulmonary fibrosis (IPF) and a major focus has been the discovery of novel markers associated with disease outcome. Despite a number of promising findings, like the MUC5B promoter variant, the known genetic underpinnings do not adequately explain disease risk. In this work we continue this search using exome sequencing data, but we further investigate the role of gene-gene interaction to infer epistasis among specific genetic variants.

The cohort consisted of 286 cases and disease associated variants were inferred from the 1000 Genomes data. A random forest coupled with a support vector machine approach was employed to perform feature selection, account for covariates and evaluate classification potential. We further complement these analyses by applying methods from gene set analysis to associate biochemical pathways and test specific functional hypotheses. Variants from the HLA region were highly associated with disease and a number of additional Mucin (MUC) genes (MUC2, MUC6, and MUC16) were revealed as part of the analyses.

P42

Venom Peptides as Therapeutic Agents: Can we use Phylogenetics to Inform Drug Discovery?

Subject: Machine learning, inference and pattern discovery

Presenting Author: Joseph Romano, Columbia University

Author(s):

Nicholas Tatonetti, Columbia University, United States

Abstract:
INTRODUCTION
Animal venoms have been used for therapeutic purposes since the dawn of recorded history, and present-day researchers view them as a largely untapped resource for drug discovery. Techniques for discovery and prediction of biological therapeutic agents utilize phylogenetic methods in various ways, but there are conflicting reports as to whether venom peptide phylogenies actually reflect speciation (and, consequently, whether venom peptide phylogenies are informative). In this study, we use phylogenetic distance metrics and modern informatics techniques to demonstrate that venom peptide sequences do not recapitulate speciation.

METHODS
Our methods include various high-throughput computational techniques that are known collectively as “phylogenetic simultaneous analysis”. We constructed phylogenetic trees for families of orthologous peptide sequences, and computed the aforementioned simultaneous analysis metrics between each protein phylogeny and a “reference tree” from mitochondrial proteins, grouping venom peptide and non-venom peptide families together. Finally, we performed both parametric and empirical hypothesis tests to determine whether the computed values were substantially different between the two peptide classes (venom and non-venom).

RESULTS
The distributions of simultaneous analysis values across all venom peptide families are substantially different (p-value < 0.05 in all cases) from those generated using non-venom peptide families. Therefore, we can infer that venom phylogenies do not recapitulate accepted patterns of speciation.

CONCLUSIONS
Our results strongly suggest that venom-based drug discovery and repurposing should not rely on evolutionary lineages of venomous species until the causes of this observation are better understood. We also offer a number of potential explanations for this peculiar behavior.

P43

Reconstructing chromosome conformation by fluorescence microscopy

Subject: Other

Presenting Author: Brian Ross, University of Colorado

Author(s):

Fabio Anaclerio, University of Bari, Italy
Andrew Laszlo, University of Washington, United States

Abstract:
The in-vivo conformation of chromosomes is increasingly recognized as an important regulator of gene expression. For example, enhancers contact promoters to increase gene expression; genes may physically move to transcription factories to produce mRNA; and structures in the nucleus such as the lamina and nuclear porins have been variously reported to down/upregulate gene expression. Unfortunately, chromosome conformation has been difficult to measure directly; most present-day information comes indirectly through cell-averaged DNA-DNA interaction assays (3C and derivatives).

In principle, a chromosome could be straightforwardly reconstructed if a large number of genetic loci could be identified and localized using a microscope, simply by 'connecting the dots'. The problem is that whereas thousands of loci are required to capture the complex structure of a chromosome, microscopes can uniquely identify only a handful of different fluorescent labels by color, so there will be many indistinguishable loci having the same labeling color. Our solution to this problem is to use the known genetic separations of the labeled loci, along with a calibration between genetic distance and spatial distance, to help identify each spot in the microscope image. The conformation is encoded in pairwise mapping probabilities from imaged spots to genetic loci. Crucially, certain statistics of the mapping probabilities can blindly assess the quality of the reconstruction in comparison to color-scrambled control mappings. Aside from experimentally validating the reconstruction procedure and quality metrics, we demonstrate a methodology which will be needed in larger experiments involving hundreds or thousands of loci.

P44

iSeGWalker: an easy-to-handle de novo genome reconstruction tool dedicated to small sequences

Subject: Other

Presenting Author: Benjamin Saintpierre, Institut Pasteur

Author(s):

Johann Beghain, Institut Pasteur, France
Eric Legrand, Institut Pasteur, France
Anncharlott Berglar, Institut Pasteur, France
Deshmukh Gopaul, Institut Pasteur, France
Frédéric Ariey, Institut Cochin, France

Abstract:
Most de novo assembly programs are global assemblers, meaning that they assemble all reads from a sequencing file. They are not well suited to recovering a short sequence such as the non-nuclear DNA of a pathogen. Here we present a Perl program, iSeGWalker (in silico Seeded Genome Walker), developed to perform a quick de novo reconstruction of a region by “genome walking” on Next Generation Sequencing (NGS) data.
The first step of the process is to determine an initial seed, which must be unique and specific to the targeted region. The second step is a cycle comprising the seed search (selection of reads that match the seed exactly), the alignment of all selected reads, and the generation of a consensus sequence. Once the consensus is obtained, a new seed, composed of its last 30 consecutive nucleotides, is derived and a new cycle is performed.
We tested our software using an apicoplast seed on a FASTQ file obtained from Illumina sequencing of the Plasmodium falciparum 3D7 reference strain. We were able to reconstruct the complete apicoplast genome, including a previously unpublished region harboring a balanced polymorphism that may have a function in the regulation and/or division of the falciparum apicoplast genome.
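The seed-and-extend cycle described in this abstract can be illustrated with a minimal Python sketch. This is not the iSeGWalker implementation (which is written in Perl); the function names `consensus` and `walk`, the `max_cycles` cap, and the simplification of "alignment" to anchoring each read at its exact seed match are all assumptions made for illustration. Reads are taken to be error-free strings on a single strand.

```python
from collections import Counter


def consensus(extensions):
    """Majority-vote consensus over seed-anchored extensions, column by column."""
    result = []
    for column in range(max(len(e) for e in extensions)):
        bases = Counter(e[column] for e in extensions if len(e) > column)
        result.append(bases.most_common(1)[0][0])
    return "".join(result)


def walk(reads, seed, seed_len=30, max_cycles=1000):
    """Extend `seed` to the right by repeated exact-match read selection."""
    contig = seed
    for _ in range(max_cycles):
        # The new seed is the last `seed_len` nucleotides of the contig so far.
        current = contig[-seed_len:]
        # Select reads containing the seed exactly; keep what lies to its right.
        extensions = []
        for read in reads:
            pos = read.find(current)
            if pos != -1 and len(read) > pos + len(current):
                extensions.append(read[pos + len(current):])
        if not extensions:
            break  # no read extends the current seed: stop walking
        contig += consensus(extensions)
    return contig
```

In this sketch the walk stops when no read extends the current seed or when the hypothetical `max_cycles` cap is reached; the cap matters for circular targets, where the walk would otherwise never run out of reads.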

P45

Flowr: Robust and efficient workflows using a simple language agnostic approach

Subject: Data management methods and systems

Presenting Author: Sahil Seth, MD Anderson Cancer Center

Author(s):

Jianhua Zhang, MD Anderson, United States
Xingzhi Song, MD Anderson, United States
Xizeng Mao, MD Anderson, United States
Huandong Sun, MD Anderson, United States
Andrew Futreal, MD Anderson, United States
Samir Amin, MD Anderson, United States

Abstract:
Motivation: Bioinformatics analyses have become increasingly compute-intensive as costs fall and the number of samples grows. Each lab spends time creating and maintaining a set of pipelines, which may not be robust, scalable or efficient. Further, differing compute environments across institutions hinder collaboration and the portability of analysis pipelines.
Results: Flowr is a robust and scalable framework for designing and deploying computing pipelines in an easy-to-use fashion. It implements a scatter-gather approach using computing clusters, reducing the concept to the use of five simple terms (in submission and dependency types). Most importantly, it is flexible, so customizing existing pipelines is easy, and because it works across several computing environments (LSF, SGE, Torque and SLURM), it is portable. Using flowr, analyses from raw reads to somatic variants can be completed in about 2 hours (down from about 24).

P46

Reducing dimensionality of sparse, compositional ‘omics data through modular decomposition to increase power in correlational analyses.

Subject: Metagenomics

Presenting Author: Michael Shaffer, University of Colorado Denver - Anschutz Medical Campus

Author(s):

Catherine Lozupone, University of Colorado Denver - Anschutz Medical Campus, United States

Abstract:
Integration of 'omics data is commonly hamstrung by a lack of statistical power. Because studies typically have small sample sizes, as a result of cost or availability, and larger numbers of observations, ranging from hundreds to hundreds of thousands, finding significant correlations between data types is difficult. By finding modules of autocorrelated features and summarizing them, the number of features can be reduced and statistical power increased. WGCNA has become the standard tool for co-occurrence analysis, module formation and feature reduction, but it is designed around parametric tests and assumes a scale-free network topology. We find that 16S sequencing data and metabolomics data do not meet these assumptions. To remedy this, we have developed a package for correlational analysis with sparse, compositional data. To find modules of co-occurring observations, SparCC is used to avoid the pitfalls of Pearson and Spearman correlations with sparse, compositional data, and the clique percolation method is used to find modules. Modules are summarized and a new table is output. Methods for finding correlations between tables that include sparse, compositional data are provided to take advantage of the increased power following module summarization. When the package was applied to 16S sequencing data, we found a 10% reduction in features, yielding increased power for further statistical analysis.

P47

Using Dirichlet Mixture Model to Detect Concomitant Changes in Allele Frequencies

Subject: Machine learning, inference and pattern discovery

Presenting Author: Wen Shi, University of Colorado, Anschutz Medical Campus

Author(s):

Jan Hannig, University of North Carolina at Chapel Hill, United States
Corbin Jones, University of North Carolina at Chapel Hill, United States

Abstract:
RNA viruses are challenging for protein and nucleotide sequence-based methods of molecular evolutionary analysis because of their high mutation rates and complex secondary structures. To identify evolved nucleotides and/or amino acids in a viral genome without relying on sequence annotation or the nature of the change, we have developed a novel statistical approach that models allelic variance under a Bayesian Dirichlet mixture distribution. We propose an efficient multi-stage clustering procedure that distinguishes treatment-induced changes from variation within viral populations. Our method was applied to a longitudinal time-sampled influenza A H1N1 virus strain in either the absence or presence of oseltamivir. Along with the most common H1N1 oseltamivir resistance mutation H274Y on segment 6, we found another genomic location on segment 8 with strong evidence of treatment effect and a list of sites with high allelic variation in the untreated environment. We believe that our approach can be broadly applied and is particularly useful for data from genomes that are recalcitrant to traditional sequence analysis.

P48

Estimating parameters of the Hodgkin-Huxley cardiac cell model by integrating raw data from multiple types of voltage-clamp experiments.

Subject: Qualitative modeling and simulation

Presenting Author: Matthew Shotwell, Vanderbilt University

Author(s):

Richard Gray, Food and Drug Administration, United States

Abstract:
The Hodgkin-Huxley cardiac cell model is used to model the behavior of ion channels during the cardiac action potential. On a larger scale, the model is used, for example, to model cardiac arrhythmias and to assess the effects of defibrillation protocols. Historically, the model parameters have been estimated in a piecewise fashion using summaries of raw data from voltage-clamp experiments, and by fitting the summarized data to model sub-components. This process is repeated for each of the model sub-components and corresponding summaries of voltage-clamp data until all of the model parameters are estimated. However, we demonstrate that by summarizing the raw data, some information about the model parameters is ignored. We show that the piecewise estimation procedure can be biased, and can yield estimates that are not unique. Finally, we show that the model parameters can be estimated simultaneously by integrating data sources across multiple types of voltage-clamp experiments, and that this technique is more efficient than the piecewise approach.

P49

Abstract Withdrawn

P50

SomVarIUS: Somatic variant identification from unpaired tissue samples

Subject: Other

Presenting Author: Kyle Smith, University of Colorado Denver - Anschutz Medical Campus

Author(s):

Shanshan Pei, University of Colorado, United States
Subhajyoti De, University of Colorado, United States
Vinod Yadav, University of Colorado, United States
Daniel Pollyea, University of Colorado, United States
Craig Jordan, University of Colorado, United States

Abstract:
Motivation: Somatic variant calling typically requires paired tumor-normal tissue samples. Yet, paired normal tissues are not always available in clinical settings or for archival samples.
Results: We present SomVarIUS, a computational method for detecting somatic variants using high throughput sequencing data from unpaired tissue samples. We evaluate the performance of the method using genomic data from synthetic and real tumor samples. SomVarIUS identifies somatic variants in exome-seq data of ~150X coverage with at least 67.7% precision and 64.6% recall rates, when compared with paired-tissue somatic variant calls in real tumor samples. We demonstrate utility of SomVarIUS by identifying somatic mutations in formalin-fixed samples, and tracking clonal dynamics of oncogenic mutations in targeted deep sequencing data from pre- and post-treatment leukemia samples.

P51

Regulatory network inference: use of whole brain- vs brain region-specific gene expression data in the mouse

Subject:

Presenting Author: Ronald Taylor, Pacific Northwest National Laboratory

Abstract:
The incidence of Alzheimer’s disease (AD) is on the ascendancy. However, its etiology remains only partially resolved. Transcriptional Regulatory Networks (TRNs) have tremendous potential to offer insights. There is a view that brain region specificities of gene expression regulation [1,2] limit the usefulness of TRNs derived from whole brain gene expression data. Here, we test that assumption. We determine what fraction of the regulatory connections seen for a selected functionally related group of genes, inferred from expression data from a small subset of brain regions, are also inferred when expression data is used from across the entire brain. We report which brain region-specific transcription factor – target relationships are so strong as to be discernible when whole brain data are used for network inference. TRNs derived from the hippocampus, a relevant brain region because of its association with learning and memory, are of particular interest. We generate, explore, and compare TRNs of genes relevant to neurodegeneration generated using in situ hybridization data from Allen Brain Atlas hippocampus regions (“hippocampal_region”, “hippocampal_formation”, “dentate_gyrus”, “ammon's_horn” and “subiculum”) on the one hand, and all brain regions together on the other. TRNs were learned using a high-performing network inference algorithm: Context Likelihood of Relatedness (CLR).

References
[1] W. Sun, et al., Transcriptome atlases of mouse brain reveals differential expression across brain regions and genetic backgrounds, G3 (Bethesda). 2 (2012) 203-211.
[2] M. Caracausi, et al., A quantitative transcriptome reference map of the normal human hippocampus, Hippocampus. (2015).

P52

A Computational Method for Drug Repurposing in Amyotrophic Lateral Sclerosis

Subject: Graph Theory

Presenting Author: Phyllis Thangaraj, Columbia University

Author(s):

Nicholas Tatonetti, Columbia University, United States

Abstract:
ALS is a devastating disease that involves spontaneous regression and degeneration of motor neurons. Currently, the life expectancy for individuals diagnosed with ALS is, on average, three to five years. The only proven treatment available is Riluzole, which has shown only a modest effect in improving symptoms. There are several distinct pathogenic pathways for ALS, and animal models do not capture the genetic heterogeneity of the disease, making experimental pharmacologic therapies difficult to produce. A network biology approach that can accommodate the genetic complexity of the disease may succeed where a traditional drug design strategy fails.

Recently, systems pharmacology has been used to map drugs to disease mechanisms, in particular adverse drug events (ADEs). These algorithms can predict drugs causing ADEs more accurately than current FDA reporting alone. In this study, we apply a similar methodology to predict drugs that can be repurposed for treatment of ALS. We identified genes from the experimental literature and used them as seeds to create protein-protein interaction, transcription factor, and tissue expression networks. We then mapped the networks to the known targets of FDA-approved drugs to identify potential medications to treat ALS. In addition, using the networks available, we re-identified rare disease variants. Our general approach may be applied to other diseases with complex genetics to repurpose drugs through computational targeting of disease and drug pathways.

P53

Comparison of off-target scoring algorithms with experimentally detected off-target sites following CRISPR-Cas9 mediated gene editing

Subject: Optimization and search

Presenting Author: Tongyao Wang, Editas Medicine

Author(s):

Hari Jayaram, Editas Medicine, United States
Eugenio Marco, Editas Medicine, United States

Abstract:
CRISPR-associated RNA-guided endonuclease Cas9 can be used to make targeted changes to the genome of cells. To achieve targeting specificity, a guide-RNA (gRNA) is used to direct Cas9 to a specific genomic locus. Cas9 nuclease activity results in indels at the gRNA target locus (the on-target site) and sometimes at off-target sites that differ from the on-target sites by a few base mismatches. Algorithms have been developed to score the specificity of gRNA and select gRNA for use in genome editing.
Next-generation sequencing based assays can be used to quantify the level of genome editing at on-target and off-target sites. One of these methods, GUIDE-seq (Genome-wide, Unbiased Identification of Double stranded breaks Enabled by sequencing), reveals off-target sites with up to six mismatches and single nucleotide “bulges” between the guide and target. Another method called BLESS (direct in situ Breaks Labeling, Enrichment on Streptavidin and next-generation Sequencing) may be used to capture a snapshot of Cas9-induced DSBs. A comparison between these experimentally determined and in silico predicted off-target sites reveals a small overlap between predicted and observed sites suggesting that more data is required to better understand off-target activity. Newer algorithms that incorporate chromatin context and genome-wide Cas9 binding information are important steps towards better algorithms for understanding Cas9 on and off-target activity.
In this work we compare several gRNA scoring algorithms and examine the degree of overlap between their predicted sites and those observed by multiple published unbiased off-target methods.
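As a rough illustration of the kind of comparison described here, the Python sketch below enumerates candidate off-target sites by mismatch count and measures how many experimentally observed sites a prediction recovers. It is a deliberately simplified assumption-laden sketch, not any of the scoring algorithms compared in this work: it ignores PAM requirements, bulges, strand, and position-dependent mismatch weights, and the function names `mismatches`, `candidate_sites`, and `overlap_fraction` are hypothetical.

```python
def mismatches(guide, site):
    """Number of positions at which a guide and an equal-length site differ."""
    if len(guide) != len(site):
        raise ValueError("guide and site must have the same length")
    return sum(a != b for a, b in zip(guide, site))


def candidate_sites(guide, sequence, max_mismatches=3):
    """Return (start, mismatch_count) for every window within the mismatch limit."""
    hits = []
    n = len(guide)
    for start in range(len(sequence) - n + 1):
        count = mismatches(guide, sequence[start:start + n])
        if count <= max_mismatches:
            hits.append((start, count))
    return hits


def overlap_fraction(predicted, observed):
    """Fraction of observed sites that also appear among the predicted sites."""
    predicted, observed = set(predicted), set(observed)
    if not observed:
        return 0.0
    return len(predicted & observed) / len(observed)
```

Here sites are identified by start position only; a real comparison would key sites by chromosome, coordinate, and strand.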

P54

High-grade serous ovarian cancer subtypes are similar across populations

Subject: Machine learning, inference and pattern discovery

Presenting Author: Gregory Way, University of Pennsylvania

Author(s):

James Rudd, Geisel School of Medicine at Dartmouth College, United States
Chen Wang, Mayo Clinic, United States
Brooke Fridley, University of Kansas Medical Center, United States
Gottfried Konecny, David Geffen School of Medicine at UCLA, United States
Ellen Goode, Mayo Clinic, United States
Casey Greene, University of Pennsylvania, United States
Jennifer Doherty, Geisel School of Medicine at Dartmouth College, United States

Abstract:
Three to four gene expression-based subtypes of high-grade serous ovarian cancer (HGSC) have been previously reported. We sought to determine the similarity of HGSC subtypes between populations by performing k-means clustering (k = 3 and k = 4) in five publicly available HGSC mRNA expression datasets with >130 tumors. We applied a unified bioinformatics analysis pipeline to the five distinct datasets and summarized differential expression patterns for each cluster as moderated t statistic vectors using Significance Analysis of Microarrays. We calculated Pearson’s correlations of these cluster-specific vectors to determine similarities and differences in expression patterns. Defining strongly correlated clusters across datasets as “syn-clusters”, we associated their expression patterns with biological pathways using gene set overrepresentation analyses. Across populations, for k = 3, correlations for clusters 1, 2 and 3, respectively, ranged between 0.77-0.85, 0.80-0.90, and 0.65-0.77. For k = 4, correlations for clusters 1-4, respectively, ranged between 0.77-0.85, 0.83-0.89, 0.51-0.76, and 0.61-0.75. Results are similar using non-negative matrix factorization. Syn-cluster 1 corresponds to previously reported mesenchymal-like subtypes, syn-cluster 2 to proliferative, syn-cluster 3 to immunoreactive, and syn-cluster 4 to differentiated. The ability to robustly identify correlated clusters across numbers of centroids, populations, and clustering methods provides strong evidence that at least three different HGSC subtypes exist. The mesenchymal-like and proliferative-like subtypes are the most consistent and could be uniquely targeted for treatment.
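The cross-dataset matching step described in this abstract, correlating cluster-specific statistic vectors between populations, can be sketched in plain Python. This is only an illustration under stated assumptions: the function names `pearson` and `best_matches` are hypothetical, each cluster is assumed to be represented by a vector of per-gene statistics over the same ordered gene list, and the authors' actual pipeline is not reproduced.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("vectors must have equal length of at least 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / sqrt(var_x * var_y)


def best_matches(clusters_a, clusters_b):
    """For each cluster in dataset A, find the most correlated cluster in B."""
    matches = {}
    for name_a, vec_a in clusters_a.items():
        scores = {name_b: pearson(vec_a, vec_b)
                  for name_b, vec_b in clusters_b.items()}
        best = max(scores, key=scores.get)
        matches[name_a] = (best, scores[best])
    return matches
```

Given two dictionaries mapping cluster labels to statistic vectors, `best_matches` returns, for each cluster in the first dataset, the label and correlation of its closest counterpart in the second.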

P55

Developing robust drug sensitivity classifiers using multiple kernel learning for feature selection

Subject: Machine learning, inference and pattern discovery

Presenting Author: Nicolle Witte, University of Colorado Denver - Anschutz

Abstract:
A fundamental challenge in precision medicine is to identify molecular features that are predictive of patient response to drug treatment with the goal of offering optimal, informed clinical care. Therapies targeting specific molecular mechanisms have been effective; however, many patients still show no response or eventually develop resistance to the treatment. Many machine learning techniques have been used to identify the molecular features that are predictive of response to therapy, commonly referred to as biomarkers. Since many types of molecular representations exist for a patient (e.g., gene expression, mutations, copy number, methylation), multiple kernel learning (MKL) methods have been used to leverage independent predictive information that exists between these biological measurements. However, since the feature space is transformed into similarity kernels, feature selection is often not used prior to MKL. This also hinders biological interpretability of the learned kernels because the features are abstracted in the learning algorithm. Here we incorporate an MKL feature selection technique into a Multitask Multiple Kernel Learning (MT-MKL) algorithm to improve performance in predicting the pharmacological response to different therapies based on molecular profiles. To build the classifier, molecular data for cell lines and pharmacological responses to an array of drugs were gathered from the Genomics of Drug Sensitivity database. Our results show an improvement over baseline MT-MKL predictions when incorporating the MKL feature selection step, thus emphasizing the utility of kernel-based feature selection and demonstrating a way to interpret kernel-based prediction results as biomarkers by evaluating the features selected in the MKL approach.

P56

REPdenovo: Inferring de novo repeat motifs from short sequence reads

Subject: Optimization and search

Presenting Author: Yufeng Wu, University of Connecticut

Author(s):

Chong Chu, University of Connecticut, United States
Rasmus Nielsen, University of California, Berkeley, United States

Abstract:
Repeat elements are important components of eukaryotic genomes. Sequencing data from many species are now available, providing opportunities for finding and comparing genomic repeat activity among species. One limitation in our understanding of repeat elements is that most analyses rely on reference genomes that are incomplete and often are missing data in highly repetitive regions that are difficult to assemble. To overcome this problem we develop a new method, REPdenovo, which assembles repeat sequences directly from raw shotgun sequencing data. We show that REPdenovo is substantially better than existing methods both in terms of the number and the completeness of the repeat sequences that it recovers. We apply the method to human data and discover a number of new repeat sequences that have been missed by previous repeat annotations. By aligning these discovered repeats to PacBio long reads, we confirm the existence of these novel repeats. REPdenovo is a powerful new computational tool for annotating genomes and for addressing questions regarding the evolution of repeat families.

P57

Abstract Withdrawn

P58

Discovery and Evaluation of a Proteomic Signature Predicting Liver Fibrosis Risk in Obese Adults

Subject: Machine learning, inference and pattern discovery

Presenting Author: Stu Field, SomaLogic, Inc.

Author(s):

Malti Nikrad, SomaLogic, Inc., United States
Steve Williams, SomaLogic, Inc., United States
Craig Wood, Geisinger Institute, United States
Glenn Gerhard, Geisinger Institute, United States
Christopher Still, Geisinger Institute, United States
Joel Lavine, Columbia University, United States

Abstract:
Introduction: Non-alcoholic fatty liver disease (NAFLD) encompasses a wide range of liver disease from hepatic steatosis (fat) to inflammation and fibrosis (non-alcoholic steatohepatitis; NASH). The current gold standard to diagnose NASH is liver biopsy, which is expensive and invasive. The machine learning objective was to develop a proteomics-based statistical model to predict “at-risk” patients, i.e. individuals either currently with NASH or in the pre-stages of the disease, from a standard serum blood sample.

Methods: Paired serum samples and liver biopsies were collected from 535 obese patients at the Geisinger clinic. Of these, 159 had normal biopsies (no disease), 169 showed varying degrees of steatosis, and 207 showed various grades of NASH fibrosis. Of the 535 samples, 412 were used in model training and 123 distribution-matched samples were held out for model performance evaluation. Biomarker discovery was performed using SomaLogic’s SOMAscan™ proteomic assay, which simultaneously measures 1129 proteins. Candidate features were selected using stability selection with an L1-regularized logistic regression kernel. A naïve Bayes model was robustly fit following feature selection via 10 runs of 5-fold internal cross-validation. Performance was evaluated from predictions on the hold-out set.

Results: The “risk” classifier (7 features) had a test ROC area under the curve (AUC) of 0.82 (sensitivity=0.66; specificity=0.81; accuracy=0.73) and the posterior Bayes probability increased with severity across the liver disease spectrum. Several features are known to be involved in liver injury, apoptosis, inflammation, collagen deposition, amino acid and insulin metabolism, and the extracellular matrix.

P59

A pilot task on thematic role assignment

Subject: Text Mining

Presenting Author: Prabha Yadav, University of Colorado, School of Medicine, Denver

Author(s):

Kevin Cohen, University of Colorado School of Medicine, United States

Abstract:
Understanding a sentence depends on assigning its constituents to their proper thematic roles. Thus, in interpreting the phrase “kinetochore-dependent microtubule rescue ensures efficient and sustained kinetochore-microtubule interactions in early mitosis”, we must determine whether “in early mitosis” is a temporal expression or a location.
In the pilot experiment reported here, the goal was to determine the extent to which people agree with each other when assigning thematic roles using a flow chart. The flow chart used here represents a hierarchy of thematic roles that can facilitate mapping across resources of differing granularity, as well as different types of NLP applications. The list of chosen predicates represents the intersection of the top-20 over-represented predicates in CRAFT (i.e., omitting words like "gene" and "mouse") with the top-20 most frequent predicates. The possible arguments of the predicates were assigned based on the sentences in the CRAFT corpus. The different senses of the predicates were listed separately. There were 2 raters.

The results show medium agreement, with an F-measure of 0.64. The agreement is good, given that no training was provided. Various conclusions regarding the flaws in the flow chart were drawn.

P60

MS1Probe – Implementation of a Statistical Tool for MS1-based Quantitation in Skyline for High Throughput Quantitative Analysis.

Subject: Data management methods and systems

Presenting Author: Alexandria Sahu, The Buck Institute for Research on Aging

Abstract:
Recently, we described a new proteomic quantitation method in Skyline, MS1 Filtering, which allows one to integrate ion intensity data in MS1 spectra across multiple runs in a platform-independent manner. MS1 Filtering uses the full array of existing graphic and software tools in Skyline, and can be readily augmented with additional data analysis tools if desired. Here, we describe a new statistical tool, MS1Probe, for MS1-based quantitation. MS1Probe is unique in that it can further process MS1 full scan data originally analyzed in Skyline, and was designed to be capable of high throughput statistical quantitation of Skyline MS1 Filtering datasets. Features of MS1Probe include calculating peak area means, variability measures, ratios between different sample conditions and corresponding P-values.

P61

Knowledge Engine Cloud-Based Scalable Analytics Suite

Subject: Other

Presenting Author: Omar Sobh, University Of Illinois at Urbana-Champaign

Author(s):

Saurabh Sinha, Associate Professor, United States
Faraz Faghri, Ph.D. Candidate, United States
Emad Amin, Post Doctoral Research Associate, United States
Charles Blatti, Post Doctoral Research Associate, United States
Milt Epstein, Software Engineer, United States

Abstract:
Many research efforts have shown how the analysis of experimentally generated genomic datasets can be aided by incorporating collections of public, biological knowledge, often represented as networks that capture the relationships between genes, proteins, functional roles, disease phenotypes, etc. Because of the rapidly increasing size and diversity of community data sets, there is a deficit of algorithmic tools and resources that are capable of analyzing and finding patterns from this data, and those that exist are often inaccessible to most biological and medical researchers. To address this problem, our team has first attempted to identify optimal systems for distributing the data to minimize times for query response and algorithmic execution. Our overall strategy also focuses on building a library of scalable analysis algorithms designed to be deployed on heterogeneous, commercial cloud services and on creating tools to flexibly integrate these computational components into sophisticated and robust analysis workflows. In particular, we adopt the paradigm of each algorithmic component running as a lightweight, secure Docker container that is managed and scheduled on the appropriate cloud resource by Apache Mesos. We demonstrate the utility of our approach and the KnowEnG Analytics Suite by recreating a computationally intensive network-based stratification analysis of cancer tumor subtypes developed by Hofree, et al., 2013, and making it accessible to users to easily deploy on scalable cloud resources.

P62

Novel pathways to explain the cardiovascular toxicity of cyclo-oxygenase-2 inhibitor drugs revealed by transcriptomic analysis

Subject: Other

Presenting Author: Nicholas Kirkby, Imperial College London

Author(s):

Blerina Ahmetaj-Shala, Imperial College London,
Sarah Mazi, Imperial College London,
Mark Paul-Clark, Imperial College London,
Jane Mitchell, Imperial College London,

Abstract:
Cyclo-oxygenase (COX)-2 inhibitor drugs such as ibuprofen and Celebrex™ produce analgesic and anti-inflammatory benefit but increase the risk of heart attacks. This was thought to reflect inhibition of COX-2 in blood vessels; however, we have demonstrated that COX-2 is absent in arteries and veins yet abundant elsewhere, e.g. kidney and brain. Here we performed transcriptomic profiling of COX-2-deficient mouse tissues to generate novel hypotheses regarding COX-2 and cardiovascular risk.
Methods: Aorta, heart, kidney and brain RNA from wildtype and COX-2-deficient mice was analysed using Illumina MouseRef8 microarrays. Data were analysed by modified t-test using Ingenuity iReport (p<0.05, 1.5-fold cut-off, FDR correction) with pathway analysis using Cytoscape/ClueGO. Plasma ADMA levels were measured by HPLC.
Results: Deletion of COX-2 had little effect on the transcriptome of the aorta, heart, blood and brain, in each case altering only 1 gene. In the kidney, however, 626 transcripts were altered by COX-2 deletion. Pathway analysis implicated several pathways relevant to cardiovascular homeostasis including the renin-angiotensin system and nitric oxide (NO) signalling. In particular, we found alterations in genes regulating the endogenous NO synthase inhibitor ADMA, including PRMT1, AGXT2 and DDAH2. To determine the significance of these, we measured plasma ADMA levels and found a 15-fold increase in COX-2-deficient mice.
Conclusion: These data show how a systems biology approach was successfully applied to generate novel explanations for how COX-2 regulates cardiovascular health. Our data implicate the kidney rather than blood vessels and suggest dysregulation of circulating ADMA levels as a possible mechanism for COX-2-inhibitor associated cardiovascular toxicity.

P63

Detection and disambiguation of geospatial locations for phylogeography

Subject: Text Mining

Presenting Author: Davy Weissenbacher, Arizona State University

Author(s):

Tasnia Tahsin, Arizona State University, United States
Rachel Beard, Arizona State University, United States
Matthew Scotch, Arizona State University, United States
Graciela Gonzalez, Arizona State University, United States

Abstract:
Summary: Diseases caused by zoonotic viruses (viruses transmittable between humans and animals) are a major threat to public health throughout the world. By studying virus migration and mutation patterns, the field of phylogeography provides a valuable tool for improving their surveillance. A key component in phylogeographic analysis of zoonotic viruses involves identifying the specific locations of relevant viral sequences. This is usually accomplished by querying public databases such as GenBank and examining the geospatial metadata in the record. When sufficient detail is not available, a logical next step is for the researcher to conduct a manual survey of the corresponding published articles.

Motivation: In this article, we present a system for detection and disambiguation of locations (toponym resolution) in full-text articles to automate the retrieval of sufficient metadata. Our system has been tested on a manually annotated corpus of journal articles related to phylogeography using integrated heuristics for location disambiguation including a distance heuristic, a population heuristic and a novel heuristic utilizing knowledge obtained from GenBank metadata (i.e. a ‘metadata heuristic’).

Results: For detecting and disambiguating locations, our system performed best using the metadata heuristic (0.54 Precision, 0.89 Recall and 0.68 F-score). Precision reaches 0.88 when examining only the disambiguation of location names. Our error analysis showed that a noticeable increase in the accuracy of toponym resolution is possible by improving geospatial location detection. By improving these fundamental automated tasks, our system can be a useful resource for phylogeographers who rely on the geospatial metadata of GenBank sequences.
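
Illustrative sketch: the abstract names a population heuristic and a distance heuristic but does not define them. The Python below shows one common reading of each, as an assumption rather than the authors' implementation: the population heuristic picks the most populous gazetteer candidate, and the distance heuristic picks the candidate closest to other locations already resolved in the same article. The `Candidate` type, the function names, and the approximate coordinates and populations are hypothetical illustration values, not data from the corpus.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Candidate:
    """One gazetteer entry that a place name could refer to (hypothetical type)."""
    name: str
    country: str
    lat: float
    lon: float
    population: int


def haversine_km(a: Candidate, b: Candidate) -> float:
    """Great-circle distance in kilometres between two candidates."""
    earth_radius_km = 6371.0
    phi_a, phi_b = math.radians(a.lat), math.radians(b.lat)
    d_phi = phi_b - phi_a
    d_lambda = math.radians(b.lon - a.lon)
    h = math.sin(d_phi / 2) ** 2 + math.cos(phi_a) * math.cos(phi_b) * math.sin(d_lambda / 2) ** 2
    return 2 * earth_radius_km * math.asin(math.sqrt(h))


def by_population(candidates: List[Candidate]) -> Candidate:
    """Assumed population heuristic: prefer the most populous candidate."""
    return max(candidates, key=lambda c: c.population)


def by_distance(candidates: List[Candidate], resolved: List[Candidate]) -> Candidate:
    """Assumed distance heuristic: prefer the candidate with the smallest
    total distance to locations already resolved in the same document."""
    return min(candidates, key=lambda c: sum(haversine_km(c, r) for r in resolved))


# Approximate, illustrative values for the ambiguous place name "Paris".
paris = [
    Candidate("Paris", "FR", 48.86, 2.35, 2_100_000),
    Candidate("Paris", "US", 33.66, -95.56, 25_000),
]
dallas = Candidate("Dallas", "US", 32.78, -96.80, 1_300_000)

population_choice = by_population(paris)
distance_choice = by_distance(paris, [dallas])
```

In this toy case the two heuristics disagree: the population heuristic selects the French city, while the distance heuristic selects the Texas town because the article context already contains Dallas. Disagreements of this kind are where an additional signal, such as the metadata heuristic described above, can help.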