12th Annual Rocky Mountain Bioinformatics Conference

POSTER PRESENTATIONS

P01

Dietary Influences on the Incidence of Alzheimer Disease: Converging Inferences by Disparate Algorithms

Subject: Machine learning, inference and pattern discovery

Presenting Author: George Acquaah-Mensah, MCPHS University

Author(s):

Ronald Taylor, Pacific Northwest National Laboratory, United States

Abstract:
Unlike microarray data, in situ hybridization (ISH) data have rarely been exploited as sources of high-throughput gene expression data for transcriptional regulatory relationship network inference. Dementia is a hallmark of Alzheimer Disease (AD), and the hippocampus is a brain region associated with learning and memory. In this study, the Allen Brain Atlas mouse hippocampus and hippocampal formation ISH data were used as a source of gene expression measurements. Three disparate, high-performing network inference algorithms were used to explore these data for insights into AD etiology: the ordinary differential equations-based Inferelator, the mutual information-based Context Likelihood of Relatedness (CLR), and the tree-based GEne Network Inference with Ensemble of trees (GENIE3).

Unique ISH data in the hippocampal fields were extracted. Focusing on 275 genes relevant to neurodegeneration, transcriptional regulatory networks were learned using Inferelator, CLR, and GENIE3. Each gene can have up to 159,326 expression values, one per voxel. A human post-mortem hippocampal microarray dataset from AD and control subjects was superposed on the orthologous versions of the networks.

There was consensus among the three inference algorithms regarding regulatory roles for USF2, SP1, CDKN1A, EGR1, and NFE2L1 in the overall network. Some network hubs are affected by glucose or by high dietary fats, both of which are associated with AD. The similarities between diabetes mellitus and AD have been documented. Furthermore, high-fat diet enhances the hallmarks of AD. These findings shed light on possible environment-gene interactions in AD. Further, they demonstrate the utility of ISH in transcriptional regulatory network inference.

P02

Prediction and validation of genes regulating breast cancer metastasis.

Subject: Other

Presenting Author: Eran Andrechek, Michigan State University

Abstract:
Metastasis is responsible for mortality associated with breast cancer. Despite this, the molecular underpinnings of cancer progression are poorly understood. By integrating bioinformatics with genetics we are examining the events mediating metastasis using mouse model systems. We assembled a gene expression database of over 1200 mouse breast cancers from 23 different models. Using this database, analysis of transcription factor binding motifs predicted that E2F transcription factors were overrepresented in metastatic models. Gene expression signatures for specific E2Fs were applied and predicted strong activity in metastatic models. Further bioinformatic analysis predicted that E2F transcription factors were active in a model of HER2+ breast cancer, a subtype that makes up 25% of human breast cancers. To test the hypothesis that E2Fs regulate HER2+ breast cancer, transgenic mice overexpressing an ortholog of HER2 were interbred with E2F transcription factor knockout mice. Strikingly, there was a significant reduction in metastasis with loss of either E2F1 or E2F2. To determine the mechanism by which E2Fs regulated metastatic progression, we compared gene expression profiles of tumors from control and E2F knockout mice. Importantly, genes identified as being lost in the non-metastatic model system are co-amplified in human HER2+ breast cancer in a large amplicon. This amplicon is present in over 30% of HER2+ human breast cancers and we show a strong association with metastasis. Taken together, by integrating bioinformatics with mouse models we demonstrated that E2F transcription factors play a pivotal role in tumor metastasis.

P03

Interpolating Genetic Characteristics of Zoonotic Viruses for Cluster modeling

Subject: Machine learning, inference and pattern discovery

Presenting Author: Rachel Beard, Arizona State University

Author(s):

Matthew Scotch, Arizona State University, United States

Abstract:
Tracking the spread of emerging and reemerging zoonotic diseases is crucial to curbing outbreaks and reducing morbidity and mortality. Previous approaches to modeling viral spread have focused primarily on genetic or environmental data, though current approaches have shifted to the integration of these types of attributes across several fields of study such as molecular epidemiology, phylogeography and landscape genetics. However, these efforts are limited by the sparseness of the available genetic data and of the associated location information. For instance, GenBank records pertinent to zoonotic viruses often lack metadata regarding the location of isolation. Here we describe approaches to interpolating the genetic characteristics of a virus for a given region in which no data are available, in order to inform a predictive model of disease clustering. Specifically, we study methods of improving predictive modeling of areas at risk for clustered disease by incorporating both the genetic attributes of a given virus and the environmental characteristics of the affected region. Our approach focuses on the identification of regions within the genome of West Nile virus that demonstrate variations in sequence conservation as a means to produce a series of genetic attributes. These attributes will then be used for predictive modeling of disease clusters identified from previous years. To evaluate the contribution of genetic attributes to the identification of highly clustered disease areas, we develop multiple models inclusive and exclusive of these genetic attributes, in addition to environmental factors previously implicated in viral spread.

P04

What is in control of the 3D topology of the genome?

Subject: Qualitative modeling and simulation

Presenting Author: Sven Bilke, National Cancer Institute

Abstract:
The linear sequence of DNA is only the most basic description of the
genome. It is increasingly understood that the three dimensional (3D)
nuclear topology is functionally important [1]. It
seems plausible that a change in DNA topology during cell
differentiation or disease progression is a key component of cell fate
determination as it alters the proximity of trans-regulatory elements.
In a recent study, Dekker and co-workers introduced
a novel method, HiC, allowing for an unbiased genome-wide study of 3D
conformations producing a "probability map" of DNA-DNA contacts in an
ensemble of cells [2].

Here we aim to identify genomic parameters correlating with the
3D structure measured in [2]. We have developed an initial model
based exclusively on DNA sequence related observables and a set of
mixing parameters. Using Monte Carlo optimization techniques, we
identify two major sequence features contributing to the contact
matrix. The resulting model reproduces the empirical consensus contact
probability map described in [2] with Pearson's correlation r >
0.71. For comparison, intra-experiment correlation of the data in [2]
ranges from 0.55 to 0.89.

[1] G. Cavalli and T. Misteli. Functional implications of genome topology. Nature Structural & Molecular Biology, 20(3):290-9, Mar. 2013.

[2] Lieberman-Aiden, E., Dekker, J. et al. Comprehensive mapping of
long-range interactions reveals folding principles of the human
genome. Science, 326(5950) 289-293 (2009).

P05

Detecting Fossils of Horizontal Gene Transfer between Symbiotic Species

Subject: Optimization and search

Presenting Author: Jonathon Brenner, Loyola University Chicago

Author(s):

Catherine Putonti, Loyola University Chicago, United States
George Thiruvathukal, Loyola University Chicago, United States

Abstract:
Horizontal gene transfer (HGT) serves as a source of novel genetic information distinct from the typical mode of vertical (parent to offspring) inheritance. The ubiquity of symbiotic relationships among prokaryotes and eukaryotes throughout evolutionary history suggests that HGT should occur with greater regularity than is currently detected. Moreover, many of the detected events were found incidentally rather than by studies designed to look for them. This discrepancy can largely be attributed to current comparative genomic tools being ill-adapted to the constraints of this unique problem. Although BLAST is a powerful resource, its inherent design makes it insufficient for conducting large-scale investigations of HGT events. To address these limitations, we have developed new functionality to identify instances in which a gene or part of a gene has been transferred. The challenge in algorithm development is speed. New data structures and algorithms were therefore developed; genomes are decomposed into k-mers (subsequences of size k) which, through efficient storage in hash tables, can be compared in O(1) time per lookup. Recognizing sequence similarity even after a sequence has undergone significant evolution presents a far greater challenge. Modifications to the hash tables themselves as well as novel hash functions facilitate a fuzzy search in nearly linear time. Herein we present the software developed as well as preliminary analyses of publicly available genomic data.
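
The k-mer decomposition and hash-table lookup described above can be illustrated with a minimal Python sketch. It is not the authors' software: the function names and the toy sequences are hypothetical, and only exact k-mer matches are found (the fuzzy search with modified hash functions mentioned in the abstract is not reproduced here).

```python
from collections import defaultdict

def build_kmer_index(genome, k):
    """Map every k-mer in the genome to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(genome) - k + 1):
        index[genome[i:i + k]].append(i)
    return index

def shared_kmers(index, query, k):
    """Return (query_pos, genome_pos) pairs for k-mers found in both sequences.

    Each lookup is a single hash-table probe, i.e. O(1) expected time.
    """
    hits = []
    for j in range(len(query) - k + 1):
        for i in index.get(query[j:j + k], ()):
            hits.append((j, i))
    return hits

# Toy sequences (hypothetical, for illustration only).
host = "ATGGCGTACGTTAGC"
symbiont = "TTTACGTTAGG"
index = build_kmer_index(host, 5)
hits = shared_kmers(index, symbiont, 5)
```

Runs of hits along one diagonal (constant difference between the two positions) mark a shared segment, which is the kind of signal a candidate transfer event would leave under exact matching.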

P06

Modeling Binding Affinity of the Multiple Zinc-Finger Protein Prdm9

Subject: Qualitative modeling and simulation

Presenting Author: Greg Carter, The Jackson Laboratory

Author(s):

Michael Walker, The Jackson Laboratory, United States
Timothy Billings, The Jackson Laboratory, United States
Alexander Fine, The Jackson Laboratory, United States
Kenneth Paigen, The Jackson Laboratory, United States
Petko Petkov, The Jackson Laboratory, United States

Abstract:
Mammalian genomes encode hundreds of zinc finger proteins (ZFPs) that bind DNA, but most have unknown functions. One well-characterized example, Prdm9, recognizes and binds with multiple tandem zinc fingers to regulate genomic locations of meiotic recombination. Over 20 alleles of Prdm9 are known in mice, each containing a unique array of between 7 and 17 zinc fingers. Each allele is expected to bind a different set of loci, providing a natural experimental system for investigating how ZFPs recognize DNA sequence. To study Prdm9 site selection, we used a novel, in vitro sequencing strategy called Affinity-Seq to assess binding of two mouse Prdm9 alleles to genomic DNA. We found over 30,000 significant binding sites for each allele and quantified the frequency of binding at each sequence. This enabled estimation of binding affinity at each site in addition to standard nucleotide frequencies. The vast majority (95%) of sites contained an allele-specific binding motif, suggesting a single binding sequence for each Prdm9 allele. We identified a few core nucleotides required for binding. However, analysis of F1 hybrid mice suggested variability at all bases affects binding frequency. To assess the importance of each nucleotide, we performed linear regression to model effects on binding affinity. Quantitative data for thousands of sites allowed us to test additive and interactive effects for all bases, revealing multiple nucleotide-nucleotide interactions that drive binding. Our work provides a detailed view of DNA binding by Prdm9 and unprecedented power to assess the complex rules of zinc finger binding specificity.

P07

Calculating prior and posterior probabilities of concepts in conceptually annotated corpora

Subject: Text Mining

Presenting Author: Negacy Hailu, University of Colorado

Abstract:
In a natural language, predicates have restrictions on their arguments. For example, the direct object of the verb phosphorylate is likely to be a protein, and the object of the verb methylate is likely to be DNA. This concept is called selectional preference in linguistics. Selectional preference is a very important concept for understanding the semantics of specialized domains, such as molecular biology. We are interested in determining if there are predicates, including verbs and nominalizations, in the CRAFT corpus that have restrictions on the semantic classes of the arguments that they take. The calculation of selectional preferences is not mathematically complicated. Conceptual annotations, e.g. of concepts from ontologies, should make it easier to determine selectional preferences, since the semantic classes are a given. In practice, however, determining what should be the denominator in calculating both prior and posterior distributions is quite complicated when dealing with conceptually annotated corpora. A number of competing approaches are presented, and their consequences are discussed.
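
As a rough illustration of the prior and posterior distributions discussed above, the following Python sketch estimates P(class) and P(class | predicate) from counts and combines them into a preference strength. The abstract does not say which formulation is used; the relative-entropy measure below follows Resnik's well-known definition and is an assumption, as are the toy observations and the choice of denominator (all annotated arguments), which is exactly the choice the abstract identifies as non-trivial.

```python
import math
from collections import Counter

# Hypothetical (predicate, semantic class of its object) observations.
observations = [
    ("phosphorylate", "protein"), ("phosphorylate", "protein"),
    ("phosphorylate", "protein"), ("methylate", "DNA"),
    ("methylate", "DNA"), ("methylate", "protein"),
    ("bind", "protein"), ("bind", "DNA"),
]

def prior(observations):
    """P(class): relative frequency of each class over all arguments."""
    counts = Counter(cls for _, cls in observations)
    total = sum(counts.values())
    return {cls: n / total for cls, n in counts.items()}

def posterior(observations, predicate):
    """P(class | predicate): relative frequency among that predicate's arguments."""
    counts = Counter(cls for pred, cls in observations if pred == predicate)
    total = sum(counts.values())
    return {cls: n / total for cls, n in counts.items()}

def preference_strength(observations, predicate):
    """Relative entropy (in bits) between posterior and prior (assumed measure)."""
    p = prior(observations)
    q = posterior(observations, predicate)
    return sum(q_c * math.log2(q_c / p[c]) for c, q_c in q.items())
```

In this toy data "phosphorylate" takes only protein objects, so its strength is higher than that of "bind", whose objects are split evenly between the two classes.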

P08

Computational Analysis of Mycobacterium Tuberculosis Drug Resistance

Subject: Machine learning, inference and pattern discovery

Presenting Author: Gargi Datta, National Jewish Health, University of Colorado School of Medicine

Author(s):

Michael Strong, National Jewish Health, University of Colorado School of Medicine, United States
Rebecca Davidson, National Jewish Health, United States
Nabeeh Hasan, National Jewish Health, University of Colorado School of Medicine, United States
Benjamin Garcia, National Jewish Health, University of Colorado School of Medicine, United States

Abstract:
Mycobacterial diseases, caused by Mycobacterium tuberculosis (M. tb) and nontuberculous mycobacteria (NTM), are global clinical threats that result in over 1.4 million deaths each year. Disease typically affects the pulmonary system, although mycobacterial infections can disseminate to other parts of the body, including the skin. Mycobacterial strains, including those that cause tuberculosis (TB) and NTM infections, exhibit varying degrees of drug resistance, often resulting from specific DNA mutations or variations affecting drug targets or drug-activating enzymes. Although some non-mutational drug tolerance mechanisms occur, including the up-regulation of drug efflux pumps and the role of mycolic acids and the cell wall, it is thought that the majority of drug resistance among mycobacteria results from mutations at the DNA level, many of which result in amino acid changes. Although drug resistance mutations have been well characterized for M. tb, new mutations continue to be identified. NTM mutations have yet to be characterized as comprehensively. In order to rapidly identify mycobacterial mutations that may lead to drug resistance, and to present this information to investigators in an efficient manner, we aim to implement a fast and robust genome sequence analysis pipeline for pathogen mutation identification, inference of resistance using machine learning approaches, and information delivery via a mobile and web app.

P09

A phylogenomic approach to assess global genetic diversity and intrapatient evolution of clinical Mycobacterium abscessus strains

Subject: Other

Presenting Author: Rebecca Davidson, National Jewish Health

Author(s):

Nabeeh Hasan, University of Colorado Denver, United States
Paul Reynolds, National Jewish Health, United States
Sarah Totten, National Jewish Health, United States
Benjamin Garcia, University of Colorado Denver, United States
Adrah Levin, National Jewish Health, United States
Charles Daley, National Jewish Health, United States
Michael Strong, National Jewish Health, United States

Abstract:
Nontuberculous mycobacterial (NTM) infections caused by Mycobacterium abscessus are responsible for a range of disease manifestations from pulmonary to skin infections and are notoriously difficult to treat due to innate resistance to antibiotics. Previous population studies of clinical M. abscessus utilized multi locus sequence typing or pulsed field gel electrophoresis, but genetic diversity at the whole-genome level has not been well characterized at high resolution, particularly among clinical isolates derived in the United States (US). We performed whole genome sequencing of eleven clinical M. abscessus isolates derived from eight US patients with pulmonary NTM infections, compared them to 30 globally diverse clinical isolates and investigated intra-patient genomic diversity and evolution. Phylogenomic analyses revealed a cluster of closely related US and Western European-derived M. abscessus ssp. abscessus isolates that are genetically distinct from other European and all Asian isolates. Large-scale variation analyses suggested genome content differences of 0.3 – 8.3% relative to the reference strain, ATCC 19977T. Longitudinally sampled isolates showed very few single nucleotide polymorphisms and correlated genomic deletion patterns, suggesting homogeneous infection populations. Our study explores the genomic diversity of clinical M. abscessus strains from multiple continents, and provides insight into the genome plasticity of an opportunistic pathogen.

P10

A Quality-Control and Data Analysis Pipeline for Comparative Reduced Representation Bisulphite Sequencing

Subject: Data management methods and systems

Presenting Author: James Denvir, Marshall University

Author(s):

Wei-Ping Zeng, Marshall University, United States
Don Primerano, Marshall University, United States
Jun Fan, Marshall University, United States

Abstract:
Methods: We developed a novel pipeline for quality control and analysis of data generated by reduced representation bisulphite sequencing (RRBS) that compares methylation states between multiple samples. The pipeline is built on FastQC (for sequence quality assessment) and Bismark (for alignment and methylation status calling) and incorporates novel in-silico prediction of locations on the reference genome where reads are expected to align following cleavage at MspI sites. These predicted mapping sites are used to verify capture of the MspI fragments in the sequencing library. The predicted sites were further used to generate 200-nucleotide windows, and methylation status in these windows was computed for each sample for comparative purposes. We executed the pipeline on RRBS data generated from five different murine cell lines.

Results: 97% of all reads mapped to the in-silico predicted genomic locations. The windows generated from the predicted locations were able to distinguish the methylation status of the different cell lines and could further be used to classify differentially methylated regions by their locations relative to known gene annotation.

Conclusions: This pipeline builds on existing tools and provides a useful mechanism for simultaneous validation of genomic location and analysis of differential methylation.
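
The in-silico prediction step described in the Methods can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' pipeline: MspI is taken to recognize CCGG and cut after the first C, and the rule used here to build 200-nucleotide windows (one window starting at each predicted cut, clipped at the end of the reference) is a guess, since the abstract does not specify how the windows are derived. All function names are hypothetical.

```python
import re

MSPI_SITE = "CCGG"  # MspI recognition sequence; cleavage occurs after the first C
WINDOW = 200        # window length stated in the abstract

def predict_cut_positions(reference):
    """Return 0-based positions of the base just downstream of each cut (C^CGG)."""
    return [m.start() + 1 for m in re.finditer(MSPI_SITE, reference.upper())]

def windows_from_cuts(cuts, reference_length, window=WINDOW):
    """Assumed rule: one half-open window per cut, clipped to the reference end."""
    return [(c, min(c + window, reference_length)) for c in cuts]

# Toy reference with two MspI sites (hypothetical sequence).
reference = "AACCGGTT" + "A" * 300 + "CCGG" + "TT"
cuts = predict_cut_positions(reference)
windows = windows_from_cuts(cuts, len(reference))
```

Aligned read start positions could then be compared against `cuts` to estimate the fraction of reads that fall on predicted locations, analogous to the 97% figure reported in the Results.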

P11

A Comparative Bioinformatic Investigation of the Protein-Protein Interaction Networks of the Three Domains of Life

Subject: Graph Theory

Presenting Author: Catherine Derow, Oxford Brookes University

Author(s):

David Fell, Oxford Brookes University

Abstract:
Protein-protein interactions lie at the heart of biological processes. Protein-protein interaction networks (PPINs), so far, have been found to be scale-free, small-world, with similar network diameters, and to display similar levels of hierarchical organization. Highly connected proteins, ‘hubs’, are thought to be more likely to be necessary for viability and to have higher levels of disorder in their structures. The necessity for viability is thought to relate to the greater perturbation to the network caused by their elimination, including a greater increase in the diameter of the network.
The Archaea, while prokaryotic, are in many ways more closely related to eukaryotes than to bacteria. One unifying feature of the domain seems to be an ability to survive energetic stress. Understanding archaeal PPINs may aid understanding of this property.
The properties of the PPINs of the three domains of life will be compared. The findings could help elucidate the nature of evolution and which network properties of PPINs are universal and essential for life, as well as increase understanding of each domain.
This research is relevant to exobiology as many Archaea are extremophiles, able to exist under harsh conditions, thought to prevail on many other planets. Thus extremophiles may provide insight into the types of life which might be found on other planets. In addition, Archaea may contaminate spacecraft. This research could help in finding ways to reduce or eliminate such contamination to support planetary protection. Archaea may also have biotechnological

P12

Intellectual Disability in Down Syndrome: Deciphering Molecular Mechanisms using Knowledge Discovery in Databases

Subject: Machine learning, inference and pattern discovery

Presenting Author: Arockia Dhanasekaran, University of Colorado Denver

Author(s):

Katheleen Gardiner, University of Colorado Denver, United States

Abstract:
Down syndrome (DS) is caused by an extra copy of human chromosome 21 (HSA21) and is the most common genetic cause of intellectual disability (ID). Individuals with DS have an average IQ of 40 (versus 100 in unaffected individuals), which affects their quality of life and that of their families. Despite the high incidence of DS (one in 700 live births), pharmacotherapies to ameliorate ID are lacking. To address this issue, a systematic approach to unravel the molecular cascades that bring about ID in DS is required, followed by the identification of drug targets. This study employs Knowledge Discovery in Databases (KDD) to discover molecular mechanisms underlying ID in DS and connect them to potential drugs to alleviate ID. The process of knowledge discovery includes 1) extracting ID, HSA21, and mouse learning and memory (LM) relevant data from expert curated databases, 2) integrating, preprocessing and transforming the disparate datasets, 3) discovering and summarizing useful knowledge from the integrated database and 4) visualizing the discovered knowledge using Cytoscape. The initial datasets include 523 ID genes, 688 LM genes and 163 HSA21 protein coding genes. The overlap between LM and ID genes is only 77, and that between HSA21 genes and LM and/or ID is only 13. The 1134 proteins encoded by the combined ID and LM gene sets (IDLM; 523 + 688 - 77 = 1134) interact with 7252 proteins, of which 65 are HSA21 proteins. Knowledge discovered from the integrated database will resolve the complex molecular cascades that trigger ID, serve as a guide to experimental analyses and lead to development of potential drugs to alleviate ID.

P13

Co-expression Network Hubs in Hypoxia and Acclimatization to High Altitude

Subject: Machine learning, inference and pattern discovery

Presenting Author: Daniel Dvorkin, University of Colorado

Author(s):

Robert Roach, University of Colorado, United States

Abstract:
The AltitudeOmics project studied the transcriptomic and epigenomic mechanisms that humans employ to counteract the physiological challenge posed by high-altitude hypoxia. We exposed 21 healthy subjects recruited at sea level to high altitude while obtaining a variety of physiological and genomic measurements to follow the process of acclimatization.

During the two-week exposure period, almost half the genes in the genome show significant changes in expression. Weighted gene co-expression network analysis (WGCNA) classifies these genes into "modules" of genes that are highly connected (co-expressed) across the course of the experiment. As in most networks, some "hub" genes are connected to large numbers of other genes, and within the module framework, certain genes are especially strongly connected to other genes within the module. Previous work in module-based network analysis has largely focused on these module "intrahubs." Here we discuss a subset of these that may also be classified as "interhubs," that is, within-module hub genes that are also highly connected to other hub genes outside the module.

We hypothesize that these genes are crucial to information flow throughout the network, particularly between pairs of modules in which one module shows a pattern of early expression change that is then mirrored by late change in the other module. We show examples of this phenomenon, discuss and contrast the characteristics of intrahubs and interhubs relative to other genes, and propose experiments to test the hypothesis that interhubs are of particular importance in the response to hypoxia.

P14

Effects of transcript aggregation on RNAseq data

Subject: Other

Presenting Author: Felix Eichinger, University of Michigan

Abstract:
In RNAseq, the observation and functional unit of measurement is the transcript. However, gene level data is often required for further analysis, e.g. pathway mapping or functional enrichment. Consequently, we face the problem of choosing an appropriate aggregation of transcripts to gene level. One common method is to define the expression of a gene as the sum of its transcripts. While intuitively reasonable, as a gene is the set of all its transcripts, this poses the following challenges for analysis of expression data.
Summarizing transcripts leads to very high expression values for genes with many expressed transcripts. These transcripts are weighted higher and the summarized expression value does not reflect the individual transcript expression. It also assumes concordant function of all encoded proteins.
To evaluate alternatives, we analyzed an RNAseq experiment of patients with Systemic Lupus Erythematosus (SLE) compared to healthy controls, sequenced on an Illumina HiSeq2000 and analyzed with tophat/cuffdiff. We compared isoform-level results to gene-level data aggregated by summation (cuffdiff), averaging and median polish. Resulting fold changes and q-values were compared using Lin's concordance correlation coefficient. Concordances between the fold changes of isoform data and those of sum, median polish and average data are 0.681, 0.765 and 0.803, respectively. Concordances between the q-values of isoform data and those of sum, median polish and average data are 0.58, 0.69 and 0.71, respectively. These preliminary results suggest that summation deviates furthest from isoform data. To test whether the differences are likely to induce different interpretation, we will perform functional mapping of the significant genes and test the concordance of the enriched entities.
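
For readers unfamiliar with the comparison metric, the Python sketch below shows two of the aggregation schemes (sum and mean) and Lin's concordance correlation coefficient in its standard form, using population (1/n) moments. The function names and toy inputs are hypothetical, median polish is omitted for brevity, and nothing here reproduces the tophat/cuffdiff analysis itself.

```python
from statistics import fmean

def aggregate(transcripts_by_gene, how):
    """Collapse per-transcript expression values to one value per gene."""
    reducer = {"sum": sum, "mean": fmean}[how]
    return {gene: reducer(values) for gene, values in transcripts_by_gene.items()}

def lins_ccc(x, y):
    """Lin's concordance correlation coefficient between paired measurements.

    ccc = 2 * cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y)) ** 2)
    """
    n = len(x)
    mx, my = fmean(x), fmean(y)
    vx = sum((a - mx) ** 2 for a in x) / n
    vy = sum((b - my) ** 2 for b in y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    return 2 * cov / (vx + vy + (mx - my) ** 2)
```

Unlike Pearson's correlation, the coefficient penalizes a systematic shift: `lins_ccc([1, 2, 3], [2, 3, 4])` is 4/7 even though the two series are perfectly linearly related, which is why it suits a comparison where agreement in absolute value matters.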

P15

Linking patient outcome to high throughput protein expression data reveals novel drivers of colorectal adenocarcinoma

Subject: Machine learning, inference and pattern discovery

Presenting Author: Christi French, Vanderbilt University

Author(s):

Alissa Weaver, Vanderbilt University, United States

Abstract:
Cancer patients with undetectable micrometastases at the time of diagnosis are in danger of metastatic growth that worsens their prognosis. Identification of these patients with more aggressive tumor phenotypes could reduce morbidity and mortality by helping physicians choose whether to treat with an adjuvant therapy. A key question in cancer systems biology is how to use molecular data to predict the biological behavior of tumors from individual patients. While genomics has been heavily studied, protein signaling data is more directly connected to biological phenotype and might predict cancer cell and tumor phenotypes such as invasion, metastasis, and patient survival. In this study, we mined publicly available data for colorectal adenocarcinoma (CRC) from the Cancer Genome Atlas (TCGA) and identified protein expression and signaling changes that are statistically associated with patient outcome. Our analysis identified a number of known and potentially new regulators of CRC. High levels of IGFBP2 were associated with both recurrence and death, and this was validated by immunohistochemical staining of a tissue microarray from a secondary patient dataset. Interestingly, GATA3 was the protein most frequently associated with death in our TCGA analysis, and GATA3 expression was significantly decreased in TCGA tumor samples from stage I-II deceased patients compared to living patients. Experimental studies show that GATA3 expression decreases colony growth and invasiveness, suggesting that loss of GATA3 drives aggressive CRC behavior. This bioinformatics approach appears promising as a discovery method for identification of prognostic biomarkers and novel drivers of cancer.

P16

Compositional rules for Gene Ontology synonym generation

Subject: Text Mining

Presenting Author: Christopher Funk, University of Colorado School of Medicine

Author(s):

K. Bretonnel Cohen, University of Colorado School of Medicine, United States
Lawrence E. Hunter, University of Colorado School of Medicine, United States
Karin Verspoor, University of Melbourne, Australia

Abstract:
Gene Ontology (GO) terms represent the standard for annotation and representation of molecular functions, biological processes and cellular compartments, but a large gap exists between the way concepts are represented in the ontology and how they are expressed in natural language. In this work, we present two different types of rules to help capture the multitude of ways that GO terms can appear in natural language. The first set of rules takes into account the compositional nature of GO and recursively separates the terms into their smallest composite parts. The second set generates derivational, inflectional, and syntactic variations of these smaller terms; the synonyms from all composite terms are then combined through recursion. By applying both types of rules, new synonyms are generated for two-thirds of all GO terms, and an increase in F-measure performance for recognition of GO on the CRAFT corpus from 0.464 to 0.607 is observed. Additionally, we evaluated both types of rules over 600k full text documents from PMCOA; over 1.5 million more apparently correct GO terms were annotated, based on random sampling.

P17

Analyzing Disease-Gene Networks Using Hypergraph Techniques

Subject: Graph Theory

Presenting Author: Suzanne Gallagher, University of Colorado

Author(s):

Micah Dombrower, University of Colorado, United States
Debra Goldberg, University of Colorado, United States

Abstract:
Disease-gene networks are bipartite networks that connect diseases to the genes with which they are associated. Due to the absence of tools to analyze bipartite networks, however, they are usually analyzed by “projecting” the bipartite network into a unipartite one by taking one side of the bipartition as the nodes and connecting two nodes if they share a common neighbor in the bipartite network. This projection, however, loses information, such as exactly which groups of diseases share a common gene. An alternative is to translate these networks into hypergraphs. Hypergraphs are an extension of graphs where each hyperedge can contain an arbitrary number of nodes, unlike the edges of graphs that have exactly two endpoints. In a disease-gene hypergraph, we can represent diseases as nodes, genes as hyperedges, and an association between a disease and gene by having the node contained in the hyperedge. This model preserves all of the information from the bipartite network and can be analyzed using hypergraph techniques.
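
The information loss caused by projection can be made concrete with a small Python sketch. The data structure (a dictionary mapping each gene, as a hyperedge, to the set of diseases it contains) and the toy networks are hypothetical choices for illustration, not the authors' implementation.

```python
from itertools import combinations

def project_diseases(hypergraph):
    """Unipartite projection: link two diseases if at least one gene contains both."""
    edges = set()
    for diseases in hypergraph.values():
        edges.update(frozenset(pair) for pair in combinations(sorted(diseases), 2))
    return edges

# Two different disease-gene hypergraphs (gene -> set of associated diseases).
one_shared_gene = {"g1": {"A", "B", "C"}}
three_pairwise_genes = {"g1": {"A", "B"}, "g2": {"B", "C"}, "g3": {"A", "C"}}

same_projection = project_diseases(one_shared_gene) == project_diseases(three_pairwise_genes)
```

Both networks project to the same triangle A-B-C, so the projection cannot tell whether the three diseases share one common gene or three different genes pairwise; the hypergraph form keeps that distinction.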

We analyzed disease-gene networks using hypergraph and bipartite graph techniques. We used previous hypergraph metrics, two-node clustering coefficients, to find related diseases in disease-gene networks. We also introduce a new metric, the cross-partite clustering coefficient, that measures the relationship between two nodes on opposite sides of the partition in a bipartite network. We demonstrate that this measure is much higher between diseases and genes that have a known association and hypothesize that it could be used to predict undiscovered disease-gene associations.

P18

Computational Methods to Study Differential Transcription

Subject: Other

Presenting Author: Debra Goldberg, University of Colorado Boulder

Author(s):

Suzanne Gallagher, University of Colorado Boulder, United States
May Alhazzani, University of Colorado Boulder, United States
Leslie Seitz, Fairview High School, United States

Abstract:
While the general process of gene transcription is well understood, the mechanisms by which different genes are activated in different conditions or different cell types are not. Transcription must be precisely controlled for proper development and response to differing conditions. The Mediator protein complex is essential for most transcription in eukaryotes, and seems to have a role in differential transcription. CDK8 and CDK19 are homologous proteins that function similarly, alternatively occupying the same position in the CDK module of Mediator. We wish to identify the functional differences between CDK8 and CDK19 by considering how the presence of each one impacts the transcriptional program that Mediator helps regulate. Towards this end, we have studied differences in gene expression associated with the presence of CDK8 and CDK19 under various stress conditions. We have developed methods to predict the transcription factors (TFs) that play a role in this differential gene expression both directly and indirectly. Many TFs are implicated with both CDK8 and CDK19, so we have also identified TFs that are significantly more associated with one than the other.

P19

Integrative Genomics Approaches for Predicting Drug Synergies

Subject: Other

Presenting Author: Andrew Goodspeed, University of Colorado-Anschutz Medical Campus

Author(s):

James Costello, UCD-AMC, United States
Heide Ford, UCD-AMC, United States
Andrew Thorburn, UCD-AMC, United States
Annie Jean, UCD-AMC, United States

Abstract:
High-throughput, genome-scale technologies offer the opportunity to study disease and response to therapeutics at an unprecedented resolution; however, extracting relevant biology from these data remains a challenge. Here, we present a novel approach to integrate gene expression and genome-wide synthetic lethal data to predict synergistic drug combinations along with experimental validation.
To demonstrate our methodology, we integrated gene expression data sampled from TNF-related apoptosis-inducing ligand (TRAIL)-treated B lymphoma cells with a genome-wide shRNA screen in TRAIL-treated cells. We identified and then rank-ordered genes that, upon treatment, both increased in expression and were synthetic lethal. Using the Drug Gene Interaction database, we compiled sets of genes targeted by a drug and tested these sets for enrichment in the ranked gene lists using the Kolmogorov-Smirnov test. We predicted that antagonists of the α-1 adrenergic receptor would be synergistic with TRAIL and validated this finding by showing that Prazosin sensitized cells to TRAIL treatment.
While our method and results are encouraging, the approach still relies on previous annotation. To identify novel relationships, we present two additional approaches that rely less on previously annotated genes, including one approach that is completely data driven. This approach uses a gene-gene interaction network constructed from a gene expression compendium. Gene expression and synthetic lethal interactions are mapped onto the network. Novel predictions of drug combinations are based on network clustering to identify druggable gene hotspots. The approaches we present can be applied to any set of gene expression and synthetic lethal screening data.
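
The enrichment step described in this abstract (testing a drug's target gene set against a ranked gene list with the Kolmogorov-Smirnov test) can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy gene identifiers, and the choice of an unweighted two-sample KS statistic on ranks are all assumptions made for illustration, and no p-value is computed.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    d = 0.0
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d


def drug_set_enrichment(ranked_genes, drug_targets):
    """Compare the ranks of a drug's targets with the ranks of all other genes.

    ranked_genes: genes ordered from most to least interesting (hypothetical input).
    drug_targets: set of genes targeted by one drug (hypothetical input).
    """
    rank = {gene: i for i, gene in enumerate(ranked_genes, start=1)}
    in_set = [rank[g] for g in drug_targets if g in rank]
    out_set = [r for g, r in rank.items() if g not in drug_targets]
    if not in_set or not out_set:
        return 0.0  # nothing to compare
    return ks_statistic(in_set, out_set)


# Toy example with made-up gene names G1..G10.
ranked = [f"G{i}" for i in range(1, 11)]
top_heavy_drug = {"G1", "G2", "G3"}   # targets concentrated at the top
scattered_drug = {"G2", "G6", "G10"}  # targets spread across the list
print(drug_set_enrichment(ranked, top_heavy_drug))
print(drug_set_enrichment(ranked, scattered_drug))
```

A drug whose targets cluster at the top of the ranked list yields a larger statistic than one whose targets are scattered; a real analysis would also need a significance estimate and multiple-testing correction.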

P20

Towards a Modern Curriculum for Teaching Computational Bioscience

Subject: Other

Presenting Author: Carsten Goerg, University of Colorado

Author(s):

Elizabeth Wethington, University of Colorado, United States
Kevin Cohen, University of Colorado, United States

Abstract:
The last decade brought tremendous progress in the life sciences, life science technology, as well as in computational approaches and computing hardware. Developing a comprehensive, cross-disciplinary curriculum for teaching computational bioscience that covers fundamentals in biology and computer science and also provides enough depth in selected areas is therefore a challenging endeavor. Over the last year, we re-designed the curriculum in our CU Computational Bioscience graduate program, taking into account the background of our prospective students, the specialty of our faculty, our research-oriented teaching mission, and the research setting of our program in a Medical School. We will present the challenges we faced, the solutions we found, the lessons we learned, and hopefully stimulate an engaging discussion on modern approaches for educating our students in computational bioscience.

P21

Using ontology annotations and knowledge base of biomedicine in retrospective study on spinal cord injury

Subject: Text Mining

Presenting Author: Irina Grichtchenko, University of Colorado Denver Anschutz Medical Campus

Author(s):

William Baumgartner, University of Colorado Denver Anschutz Medical Campus, United States
Ivo Georgiev, University of Colorado Denver Anschutz Medical Campus, United States
Barbara Grimpe, University Medical Center Düsseldorf, Germany
Lawrence Lawrence, University of Colorado Denver Anschutz Medical Campus, United States
Kevin Cohen, University of Colorado Denver Anschutz Medical Campus, United States

Abstract:
Here we aimed to identify the potential biomarkers in the spinal cord injury (SCI) rat model by re-analyzing the genomic data in the Gene Expression Omnibus (GEO) repository.

We downloaded the GSE464 gene expression profile of the SCI rat model with sham controls, comprising 544 samples at different time points after injury, from the GEO repository. We normalized and preprocessed the expression data using the AFFY and LIMMA packages in R (Version 3.1.1) and Bioconductor (Version 2.14). Using the GENE-E visualization and analysis platform (www.broadinstitute.org), we found the previously reported differentially expressed genes (DEG), recapitulating the results of the original study (PMID: 12666113). With current bioinformatics tools, we also identified new DEG that were not previously reported.

Using the GATHER site (http://gather.genome.duke.edu), we performed gene set enrichment analysis and annotated the new DEG with functional descriptors from Gene Ontology (GO) and Cell Ontology (CL). We hypothesize that the new DEG are associated with these ontology terms in the context of SCI.

Knowledge-based analysis shows promise for improving our ability to do hypothesis generation from the results of high-throughput assays, but automation of such approaches remains a challenge. We propose a workflow for implementing a pipeline to carry out such analyses using gene expression microarrays and a large integrated Knowledge Base of Biomedicine (KaBOB).

We anticipate that our hypothesis-generation approach will stimulate efforts to decipher the molecular mechanisms that cause disability after spinal cord injury.

P22

Phage Phisher: Isolating Viral Sequences From Complex Genomic Datasets

Subject: Metagenomics

Presenting Author: Thomas Hatzopoulos, Loyola University Chicago

Author(s):

Siobhan Watkins, Loyola University Chicago, United States
Catherine Putonti, Loyola University Chicago, United States

Abstract:
The comparatively recent advent of research into environmental viruses, particularly bacterial viruses (bacteriophages), has continued to flourish under the scope of high throughput sequencing. However, previous work conducted on bacteriophages on a genomic scale is sparse, and as a result there is a lack of information for phages available on online databases. Furthermore, separating viral information from host can be difficult: an issue further compounded by the ubiquitous presence of mobile genetic elements in such datasets. In this work we introduce a computational pipeline called Phage Phisher. This tool allows for the extraction and isolation of specifically viral sequences from sequence data, which, with consecutive reassembly, enables more resolved insight into viral genomes. Phage Phisher includes new functionality developed in our lab in conjunction with repurposing existing algorithms. Our results suggest that this pipeline, a three step process, can be used to examine single phage genomes, metaviromic datasets generated through whole genome sequencing, and temperate phages embedded in bacterial genomes. Herein, we present our pipeline as well as the analyses of Illumina sequencing of several complex, heterogeneous viral data sets.

P23

PEAX: Knowledge-Assisted Visual Analytics for Complex Phenotype-Expression Association Discovery

Subject: Machine learning, inference and pattern discovery

Presenting Author: Michael Hinterberg, University of Colorado-Denver

Author(s):

David Kao, University of Colorado-Denver, United States
Lawrence Hunter, University of Colorado-Denver, United States
Carsten Goerg, University of Colorado-Denver, United States

Abstract:
Discovery and interpretation of novel disease phenotypes requires expert clinical knowledge, while the search for associated mRNA and miRNA expression biomarkers requires additional knowledge of genes and their interactions. We have developed the tool PEAX (Phenotype-Expression Association eXplorer), which integrates clinical decision-tree based visualization with statistical association of expression data. PEAX is designed using a combination of open-source components including R for statistical analysis, D3 for visualization, and Shiny for a dynamic and responsive interface, while not being tied to any specific disease domain. We discuss current and ongoing use-case observations of PEAX, as well as new algorithmic features, such as allowing the user to define expression up-regulation/down-regulation search patterns to find similarly responsive genes based on a hypothesized phenotype. Additionally, we augment exploration with database knowledge of known and predicted miRNA-mRNA interactions. Taken together, PEAX is an open-source tool designed to integrate and abstract molecular interaction knowledge and statistical analysis in order to guide a clinical researcher toward phenotype discovery.

P24

Identification of Biomarkers Specific to Pulmonary Tuberculosis Using a Highly Multiplexed Proteomic Assay

Subject: Qualitative modeling and simulation

Presenting Author: Thomas Hraha, SomaLogic, Inc.

Author(s):

Louis Green, SomaLogic, Inc., United States
Mary Ann De Groote, SomaLogic, Inc., United States
Nebojsa Janjic, SomaLogic, Inc., United States
David Sterling, SomaLogic, Inc., United States
Urs Ochsner, SomaLogic, Inc., United States

Abstract:
The accurate diagnosis of active infection with Mycobacterium tuberculosis (M.tb) remains a global challenge, especially in the case of HIV co-infection. Using a highly multiplexed proteomic technology (SOMAscan™), serum measurements of >1100 protein targets were generated in an international population of M.tb infected cases (n=160) and non infected controls (n=218) presenting M.tb-like symptoms. Some of the cases were negative for M.tb by routine smear microscopy but all were positive by culture. One third of the patients were HIV-positive.
Stability selection with an L1-regularized logistic regression kernel was used to select host biomarkers specific to M.tb infection independent of HIV and smear status. Ten-fold stratified cross-validation was used to optimize complexity (protein number) and performance in a Naïve Bayes classifier model. A 9-marker model performed equally well in HIV-positive and HIV-negative subjects. In a blinded verification set (n=250) the model had an 80% sensitivity and 84% specificity, despite the presence of 30 HIV-positive, smear-negative samples, a group that was not seen in the training set. When applied to a separate longitudinal sample set tracking the response of M.tb infected subjects to treatment, the model was able to distinguish subjects responding properly to treatment, with a decreased number of bacilli, from those who were not. The discovery of robust, quantitative, non-culture based biomarkers of active pulmonary M.tb has great potential to facilitate rapid and accurate diagnosis of M.tb, leading to reduced transmission of disease.

P25

RNABindRPlus Parallelization: Faster Prediction of RNA-Binding Residues in Proteins

Subject: Optimization and search

Presenting Author: John Hsieh, Iowa State University

Author(s):

Rasna Walia, Iowa State University, United States
Drena Dobbs, Iowa State University, United States

Abstract:
RNABindRPlus (http://einstein.cs.iastate.edu/RNABindRPlus/) uses a sequence-based machine learning method to predict potential RNA-binding residues in proteins. It is free software hosted on a modest webserver at Iowa State University. The rate-limiting step in the prediction process is the calculation of position-specific scoring matrices (PSSMs), which can require up to 10 minutes per sequence (using PSI-BLAST). To alleviate this bottleneck, we recently implemented a Fortran script that invokes MPICH to distribute the PSSM calculation to multiple cores on the RNABindRPlus server. To evaluate the parallelized version of RNABindRPlus, we executed test jobs in which predictions were made for sets of protein sequences (ranging from 1 to 16 different protein sequences) and compared the runtimes to those of similar jobs run serially on the webserver. We obtained a dramatic decrease in runtimes using the new parallelized version of RNABindRPlus compared to the original serial version. Other performance metrics indicated that we had achieved even data distribution and load balancing, despite using a simple random assignment of PSSM calculations to the multiple cores. Thus, this study not only resulted in improved performance of RNABindRPlus, but has also demonstrated a potential open source solution for relatively small servers that do not have MPI installed. In future work, we plan to use DELTA-BLAST instead of PSI-BLAST to generate PSSMs, which should provide an additional boost in performance, both in terms of runtime and prediction sensitivity. The new parallelized version of RNABindRPlus can be freely accessed at: http://einstein.cs.iastate.edu/RNABindRPlus/index_parallel.html
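
The random assignment of PSSM calculations to cores mentioned in this abstract, and the idea of checking how evenly the work ends up distributed, can be sketched as follows. This is an illustrative Python sketch only; the actual system is described as a Fortran script using MPICH. The function names, the per-sequence cost values, and the imbalance measure (busiest worker's load divided by the mean load) are assumptions, not details from the abstract.

```python
import random
from collections import defaultdict


def assign_randomly(job_ids, n_workers, seed=0):
    """Assign each job to a uniformly random worker (simple random assignment)."""
    rng = random.Random(seed)
    assignment = defaultdict(list)
    for job in job_ids:
        assignment[rng.randrange(n_workers)].append(job)
    # Include workers that received no jobs so every worker is represented.
    return {w: assignment.get(w, []) for w in range(n_workers)}


def load_imbalance(assignment, cost):
    """Busiest worker's load divided by the mean load (1.0 means perfectly even)."""
    loads = [sum(cost[j] for j in jobs) for jobs in assignment.values()]
    mean = sum(loads) / len(loads)
    return max(loads) / mean if mean else 1.0


# Hypothetical job: 16 sequences on 4 cores, each PSSM taking up to 10 minutes.
jobs = [f"seq{i}" for i in range(16)]
cost_rng = random.Random(42)
minutes = {j: cost_rng.uniform(1.0, 10.0) for j in jobs}  # made-up costs
plan = assign_randomly(jobs, n_workers=4, seed=1)
print({w: len(js) for w, js in plan.items()})
print(round(load_imbalance(plan, minutes), 2))
```

Because the parallel runtime is set by the busiest core, an imbalance ratio close to 1.0 is what "even data distribution and load balancing" would look like under this measure.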

P26

Comprehensive survey of prophage genomes for the characterization of prophage composition and insertion behavior

Subject: Machine learning, inference and pattern discovery

Presenting Author: Han Kang, San Diego State University

Author(s):

Robert Edwards, San Diego State University, United States
Katelyn McNair, San Diego State University, United States

Abstract:
With recent advancements in DNA sequencing technology, there has been a rapid growth in the number of available sequenced bacterial genomes. Embedded in these genomes is a multitude of foreign DNA, including transposons, insertion elements, as well as bacteriophage DNA known as prophage. While there are fewer fully sequenced phage genomes, these newly available bacterial genomes highlight phage genomics in the context of temperate phages. We present a comprehensive survey of 11,941 bacterial genomes, which were scanned for prophage regions using PhiSpy, a weighted algorithm used to identify prophage regions using a variety of characteristics of phage. A total of 67,022 prophages were identified, with a mean length of 23,351 bp. The data is being used to develop novel methods of identifying prophage insertions into tRNA and bacterial genes to reveal that phages do not preferentially insert into tRNA sites as previously believed. A phage gene heatmap demonstrates the favored position of hallmark phage genes in relation to the position of the integrase gene. Altogether these findings provide a new perspective of the mysterious nature of phage behavior – which has significant implications in understanding trends in microbial ecology as well as bacterial gains of function such as antimicrobial resistance.

P27

A computational and experimental method for combinatorial drug discovery.

Subject: System integration

Presenting Author: Muhammad Kashif, Uppsala University

Author(s):

Claes Andersson, Uppsala University, Sweden
Sadia Hassan, Uppsala University, Sweden
Henning Karlsson, Uppsala University, Sweden
Rolf Larsson, Uppsala University, Sweden
Mats Gustafsson, Uppsala University, Sweden

Abstract:
The potential of multi-compound therapies to achieve effective anti-cancer treatments is well established but quite unexplored, in particular regarding combinations involving more than two compounds. There is a lack of robust automatic methods practical enough to effectively search in large combination spaces for combinations of arbitrary size.
To address these problems, we have developed a semi-automated robotic platform for in vitro combination screening that uses a pipeline consisting of (1) combination generation according to a design of experiments, (2) cell seeding, (3) drug combination transfer using newly developed Biomek-2000 robotic functionality, and (4) end point analysis.
The main limiting factor in this pipeline was the drug combination transfers, whose complexity grows exponentially with the number of drugs. A “cherry picking” programme was developed and integrated into the Biomek-2000 robot to perform automated large-scale drug combination transfers. One only needs to specify the desired drug combinations in a spreadsheet, and the Biomek-2000 robot does the rest. Validation experiments showed that the new functionality is less error prone than manual procedures and can precisely transfer drug combinations of arbitrary size.
Another robot, the Precision-2000, was programmed for cell seeding, and a Beckman Coulter ORCA robot was used for automated end point readouts. Additionally, R scripts were developed for high-throughput analysis of end point data and for generating sets of combinations based on that analysis and on user-defined criteria.
Experimental results showed that data from the whole platform are reproducible and reliable, and that combination studies of arbitrary size can be performed semi-automatically under different experimental settings.

P28

Analysis and characterization of high-resolution RNA structural probing data

Subject: Other

Presenting Author: Katrina Kutchko, University of North Carolina at Chapel Hill

Author(s):

Amanda Solem, University of North Carolina at Chapel Hill, United States
Nate Siegfried, University of North Carolina at Chapel Hill, United States
Wes Sanders, University of North Carolina at Chapel Hill, United States
Kenneth Plante, University of North Carolina at Chapel Hill, United States
Mark Heise, University of North Carolina at Chapel Hill, United States
Nathaniel Moorman, University of North Carolina at Chapel Hill, United States
Alain Laederach, University of North Carolina at Chapel Hill, United States

Abstract:
Prediction of RNA secondary structure is a computationally difficult problem, but the addition of experimental data greatly improves the accuracy of these predictions. For the purposes of characterizing RNA structure, we used a new high-resolution structural probing technique, SHAPE-MaP, to perform probing of RNA secondary structure on a variety of RNA transcripts, including the Chikungunya viral genome and a significant subset of the in vivo human RNA transcriptome. With SHAPE-MaP, chemical reagents preferentially modify accessible or unpaired nucleotides, which are then fixed as mutations during reverse transcription. Thus, from mutational profiling of next-generation sequencing data, a “reactivity profile” describing nucleotide accessibility can be created.

The SHAPE-MaP reactivity profile is calculated using the modified RNA along with a structural and a mutational control. Determining the reactivity profile for an RNA, particularly in the context of a eukaryotic transcriptome, requires tailored analysis of RNA sequencing data. Here we used SHAPE-MaP to find and compare reactivity profiles of both human transcripts and the Chikungunya genome. From this data, we explored strategies to best prepare and characterize SHAPE-MaP data for both viral and eukaryotic RNA.

P29

Network analysis to identify human genes influencing susceptibility to Mycobacterium tuberculosis and Nontuberculous mycobacteria infection.

Subject: Networking, web services, remote applications

Presenting Author: Ettie Lipner, University of Colorado Denver

Author(s):

Benjamin Garcia, UCD, United States
Michael Strong, National Jewish Health, United States

Abstract:
Tuberculosis (TB) and Nontuberculous mycobacterium (NTM) disease together constitute a high burden of pulmonary disease in humans. As multiple genetic factors influence the onset of complex human disease, a systems biology approach captures the relationships of gene-gene and gene-chemical interactions in a broad biological context. This approach allows for the identification of susceptibility genes, pathogenic mechanisms, and pathways associated with disease. We performed pathway and network analysis on genes associated with Mycobacterium tuberculosis (MTB) and NTM infection to identify relevant and enriched human genes and pathways involved in mycobacterial infection. Our findings suggest that the main signaling pathways for mycobacterial infection involve communication between innate and adaptive immune cells as well as differentiation of T helper cells. Signaling pathways involving autoimmune disease were also enriched. Highly interconnected genes were then placed in a gene-chemical network to identify drugs and nutrients with immunomodulatory effects. This preliminary analysis explores the connectivity between TB and NTM-associated human genes, susceptibility to infection, and possible therapeutics.

P30

Automatic Discovery of Regulatory Networks from Morphological Experimental Data

Subject: Machine learning, inference and pattern discovery

Presenting Author: Daniel Lobo, Tufts University

Author(s):

Michael Levin, Tufts University, United States

Abstract:
Bioinformatics tools for analysis of regulatory mechanisms are generally limited to genomic or time-series concentration data. However, many crucial experiments in developmental and regenerative biology are based on manipulations and perturbations resulting in morphological outcomes. For example, planarian worms can regenerate a complete organism from almost any amputated piece, but knocking down certain genes can result in the regeneration of double-head worms. The inherent complexity and non-linearity of biological regulatory networks prevent us from manually discerning testable comprehensive models. No automated tool exists to mine the huge database of functional results, and despite hundreds of years of experiments, no model has been found that can explain more than one or two morphological results. To bridge the gulf separating morphological data from an understanding of pattern formation, we developed a software method to automate the discovery of regulatory networks from phenotypic experimental data. The method uses mathematical ontologies to unambiguously formalize surgical, genetic, and pharmacological experiments and the resultant morphological phenotypes to be explained, together with a whole-organism simulator capable of performing the same experiments in silico. We demonstrate this approach by automatically discovering the first comprehensive model of planarian regeneration, which not only explains at once all the key experiments available in the literature (including surgical amputations, knock-down of specific genes, and pharmacological treatments), but also predicts testable novel outcomes. Our approach is the first step to a bioinformatics of shape, which will pave the way for understanding complex pattern formation in developmental and regenerative biology.

P31

Comparison of Phylogeographic Node Flux with Local Disease Trends

Subject: Simulation and numeric computing

Presenting Author: Daniel Magee, Arizona State University

Author(s):

Matthew Scotch, Arizona State University, United States

Abstract:
Discrete phylogeography is a widely used approach for inferring ancestral history of genetic lineages. For virus sequences, this can be used to estimate the spread of an outbreak. However, by default, analyses of discrete phylogeography do not consider the specific characteristics of each sampled location as a means for validating the ancestral reconstruction. For example, for a communicable virus like influenza the discrete locations, or nodes, that are abundant within a tree should in theory contain a high population density, large travel network, or other defining trait to support the node’s virus flux.

In this talk, we will discuss our approach to validating discrete phylogeography estimates of virus spread by linking ancestral models with local data indicative of disease trends. We will consider five recent phylogeographic studies on both a communicable virus, influenza, and a vector-borne virus, West Nile Virus (WNV). For influenza, we will analyze human population density, airports, and incidence of influenza at each node. For WNV, we will analyze human population density, humidity, and incidence of WNV at each node. Here, we will consider both the influx and outflux of the ancestral nodes for both viruses. The influx represents the number of branches leading to the location and outflux represents the number of branches diverging from the location. We will use our work to formulate a hypothesis on the efficacy of discrete phylogeographic studies and to determine if our descriptive approach could be an effective metric of or quality control for phylogeographic inference.
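
The influx and outflux definitions above can be made concrete with a short sketch. It assumes the phylogeographic tree has already been reduced to a list of branches, each labelled with the inferred location at its parent node and at its child node; that input format, the function name, the place names, and the decision to ignore branches that stay in one location are assumptions for illustration, not details from the abstract.

```python
from collections import Counter


def node_flux(branches):
    """Count branches leading to (influx) and diverging from (outflux) each location.

    branches: iterable of (parent_location, child_location) pairs, one per tree branch.
    """
    influx, outflux = Counter(), Counter()
    for parent_loc, child_loc in branches:
        if parent_loc == child_loc:
            continue  # assumption: a branch with no location change carries no flux
        outflux[parent_loc] += 1
        influx[child_loc] += 1
    return influx, outflux


# Toy tree with made-up locations.
toy_branches = [
    ("Phoenix", "Denver"),
    ("Phoenix", "Tucson"),
    ("Denver", "Tucson"),
    ("Tucson", "Tucson"),
]
influx, outflux = node_flux(toy_branches)
for location in sorted(set(influx) | set(outflux)):
    print(location, "influx:", influx[location], "outflux:", outflux[location])
```

The resulting per-location counts are what would then be set against local covariates such as population density, airports, humidity, or disease incidence.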

P32

Gene order and core gene analysis of giant virus genomes

Subject: Other

Presenting Author: Shane Dorden, University of Tampa

Abstract:
In recent years, the genomes of several giant viruses have been determined. These genomes provide a rich source of data to delve deeper into the comparative genomics of these viruses. The genomes of Mimivirus, Megavirus, Moumouvirus, and Pandoravirus were analyzed using two bioinformatics software tools: GeneOrder4.0, which determines gene order, and CoreGenes3.5, which determines the core and dispensable genes of these viruses. GeneOrder4.0 analysis showed differences in gene order between some of these viruses. In addition, CoreGenes3.5 analysis showed numerous shared genes between these viruses, as well as many unique genes not found in the core genome. Interestingly, many of these shared and unique genes encode unknown or hypothetical proteins. We demonstrate the utility of GeneOrder4.0 and CoreGenes3.5 in annotating these hypothetical proteins to gain greater insight into their function.

P33

RNABindRPlus Predicts RNA-Protein Interface Residues in Multiple Protein Conformations

Subject: Machine learning, inference and pattern discovery

Presenting Author: Carla Mann, Iowa State University

Author(s):

Rasna Walia, Iowa State University, United States
Drena Dobbs, Iowa State University, United States

Abstract:
RNABindRPlus (http://einstein.cs.iastate.edu/RNABindRPlus/) is a sequence-based machine learning program that predicts RNA-interacting residues in protein sequences. RNABindRPlus combines an optimized Support Vector Machine (SVM) classifier with a sequence homology-based predictor (HomPRIP) to predict which residues of a protein directly participate in the RNA-protein interface (Walia et al., 2014 PLoS ONE). On several benchmark datasets, RNABindRPlus outperforms other available sequence- and structure-based methods for predicting interfacial residues in RNA binding proteins (0.72 % Specificity; 0.86 AUC of ROC on RB44 test set).
To further improve the performance of RNABindRPlus, we investigated the source of false positive predictions. We hypothesized that certain “false positive” interfacial residue predictions might correspond to actual interfacial residues in a different structural conformation of the RNA-protein complex under consideration. Here we report that many apparently “false positive” predictions made by RNABindRPlus are, in fact, interfacial residues. Thus, using only sequence information as input, RNABindRPlus can recognize amino acids that are involved in binding RNA when the protein is in alternate conformations. Because highly similar sequences were eliminated in constructing the RB198 dataset on which RNABindRPlus was trained (to avoid biasing the SVM classifier), structures with identical or nearly identical sequences but differing conformations were removed. Thus, instances in which a specific protein has been observed to make alternative contacts with an RNA partner were not included in cross-validation experiments. We conclude that RNABindRPlus has a lower false positive rate than previously reported. An updated version of the RNABindRPlus webserver that incorporates these results is under development.

P34

Transcriptomic Profiling of Human Blood Reveals Interactions between Cyclo-Oxygenase-2 and Viral-Type Inflammatory Responses

Subject:

Presenting Author: Sarah Mazi, Imperial College London/King Saud University

Author(s):

Nicholas Kirkby, Imperial College London,
Mark Paul Clark, Imperial College London,
Jane Mitchell, Imperial College London,

Abstract:
Prostaglandins are lipid mediators produced by cyclo-oxygenase (COX)-1 and COX-2 enzymes with roles in health and disease. Non-steroidal anti-inflammatory drugs (NSAIDs), e.g. naproxen (Aleve), inhibit COX-2 and/or COX-1 and have been used for decades to treat inflammation. Nonetheless, there is much about the actions of these drugs in inflammation, cancer and the cardiovascular system that we do not understand. To generate new insight into their pharmacology, we administered standard doses of a COX-2-selective inhibitor (celecoxib) or a COX-1/COX-2 inhibitor (naproxen) to healthy human volunteers for 7 days and measured the transcriptome of whole blood. Analysis of Illumina microarray data using GeneSpring and a modified t-test showed that celecoxib and naproxen treatment altered transcription of 104 and 114 mapped genes respectively, with one gene, FMNL1, common between the treatments. Pathway analysis of these gene lists using g:Profiler showed no significantly enriched pathways in the naproxen group, but in the celecoxib group there was an association with increased type I interferon response pathways (p=0.038), which are associated with anti-viral immunity. In parallel, we measured the transcriptome of blood from COX-1 or COX-2-deficient mice and found 26 and 60 differentially expressed genes, respectively. In COX-1-deficient mice these did not map to biologically relevant pathways, but in COX-2-deficient mice there was an association with up-regulation of antigen processing genes that are related to interferon-type immune responses. Taken together, these analyses reveal a novel association between COX-2 and augmented interferon responses in human blood that may have implications for COX-2 inhibitor drug (NSAID) use in inflammatory disease and patients with viral infections.

P35

Generating Novel Insight into COX-2 Inhibitor Toxicity Using Online Microarray Data Mining

Subject:

Presenting Author: Sarah Mazi, Imperial College London/King Saud University

Author(s):

Nicholas Kirkby, Imperial College London,
Mark Paul Clark, Imperial College London,
Jane Mitchell, Imperial College London,

Abstract:
Cyclo-oxygenase (COX)-2 inhibitor drugs such as meloxicam (Mobic) and Ibuprofen (Alleve) are widely used to treat pain and inflammation. While effective, these drugs produce substantial side effects including cardiovascular, renal and gastrointestinal toxicity. In addition, certain COX-2 inhibitor drugs are hepatotoxic; for example, up to 7% of patients receiving meloxicam exhibit aminotransferase elevations. Mechanisms driving COX-2 inhibitor-associated toxicity are poorly understood. Here we have set out to find novel pathways by which COX-2 inhibitors produce toxicity by mining the Open TG-GATEs database (http://toxico.nibio.go.jp) for data concerning COX-2 inhibitor drugs. This publicly available toxicogenomics dataset contains liver microarray data from rats treated with 168 pharmaceutical agents. We found data for 10 COX-2 inhibitor drugs but focused analysis on meloxicam, due to its known effects on the liver. Analysis using the LIMMA modified t-test (FC>2, p<0.05, FDR correction) revealed that meloxicam produced a time-dependent change in the transcriptome, with 14, 40 and 587 differentially expressed genes present at 3, 9 and 24 hours, respectively. At 3 hours, no specific pathways were altered. At 9 hours, organo-nitrogen compound response pathways were enriched, as were response pathways for the inflammatory transcription factor nuclear factor kappaB (NFkB). At 24 hours, analysis of genes altered >5-fold indicated activation of neutrophils and an endotoxin-like inflammatory response as well as metabolic pathways to organic molecules. Overall, these data suggest that acute meloxicam administration produces an inflammatory response in the liver, possibly associated with early NFkB activation, and that this could contribute to liver and systemic toxicity.

P36

Comparative genomics between frogs Xenopus laevis and Xenopus tropicalis.

Subject: Other

Presenting Author: Gonzalo Riadi, Pontificia Universidad Católica de Chile

Author(s):

Juan Larraín, Center for Aging and Regeneration and Millennium Nucleus in Regenerative Biology, Chile
Francisco Melo, Molecular Bioinformatics Laboratory, Chile

Abstract:
The African clawed frog Xenopus laevis has been one of the main animal models for genetic studies in developmental biology. However, for molecular studies, Xenopus tropicalis has been the experimental model of choice because its simpler genome, unlike that of X. laevis, does not result from genome duplication. Today, over 80% of the X. tropicalis genome and 90% of the X. laevis genome have been sequenced, although both remain split across a large number of scaffolds. There is a growing expectation for a comparative physical map that can be used as a Rosetta Stone between X. laevis genetic studies and X. tropicalis genomic research. In this work, we have mapped, through coarse-grained alignment, the 3,169 largest scaffolds of X. laevis onto the 10 reference scaffolds representing the haploid genome of X. tropicalis. After validating the map with empirical and theoretical data and establishing an average of 44.57% identity between the two species, we analyzed the synteny of 3,475 orthologous genes. Interestingly, although we found that 99.4% of genes are in the same order, we estimate that up to 10% of the genes may have undergone some type of rearrangement. Taken together, our results establish the correspondence between half of both genomes, providing a new and more comprehensive tool for comparative analysis of these species.
Acknowledgements: Fondecyt project 3130441; Iniciativa Cientifica Milenio No. P09-016-F and No. P07/011-F.

P37

Meta-analysis leads to deeper understanding of cellular senescence

Subject: Machine learning, inference and pattern discovery

Presenting Author: Chris Morrissey, Buck Institute

Author(s):

Marco Demaria, Buck Institute, United States
Dave Slate, Buck Institute, United States
Judy Campisi, Buck Institute, United States
Sean Mooney, Buck Institute, United States

Abstract:
Cellular senescence (CS) is a state of terminal mitotic arrest. It plays a role in aging, as senescent cells accumulate in older tissues and secrete a set of extracellular signaling proteins, the senescence-associated secretory phenotype (SASP), which contribute to chronic inflammation and age-related diseases. Senescence is also an anti-cancer mechanism that can be triggered by oncogenes and directly blocks tumor formation. This anti-cancer role is complicated by the SASP being conducive to tumor growth in the surrounding tissue. This unique set of beneficial and harmful effects leads us to believe that CS is an example of antagonistic pleiotropy, wherein the evolutionary advantage senescence provides as a check against cancer outweighs its deleterious effects on older, post-reproductive organisms. As such, it provides an attractive target for therapeutic regulation.
We aim to better understand these aspects of CS by conducting a meta-analysis of various models of senescence in mouse and human tissues. We have detected characteristic changes in gene expression that are cell-type independent and conserved between mouse and human. We are integrating transcriptomic, proteomic, and epigenetic data to characterize these signatures of senescence and to clarify its short-term benefits and long-term detriments. We have detected transcriptional regulation of the genes involved with SASP, mitotic replication, and apoptosis, and are using the Connectivity Map to identify drugs that target these networks.

P38

Prediction of bacteriocin associated operons

Subject: Optimization and search

Presenting Author: James Morton, University of Colorado Boulder

Author(s):

Iddo Friedberg, Miami University, United States
Stefan Freed, University of Notre Dame, United States
Shaun Lee, University of Notre Dame, United States

Abstract:
Bacteriocins are peptide-derived molecules produced by bacteria, which function as virulence factors, antibiotics, and signaling molecules. To date, close to five hundred bacteriocins have been identified and classified. Recent discoveries have shown that bacteriocins are very diverse and suggest that they are widely distributed among bacterial species. However, many tools struggle to identify bacteriocins because of their large sequence and structural diversity. Bacteriocins are derived from their precursor via a pathway comprising several genes known as context genes. Although bacteriocins themselves are structurally diverse, context genes have been shown to be similar across unrelated species. Our goals are: (1) to identify new candidates for context genes, which may clarify how bacteriocins are synthesized, and (2) to identify new candidates for bacteriocins that bear no sequence similarity to known toxins. To achieve these goals, we have developed a software tool, Bacteriocin Operon Associator (BOA), that can identify homologous bacteriocin-associated gene clusters and predict novel bacteriocin-associated gene clusters.

P39

EMIRGE 2: Improved resolution of microbial community structure

Subject: Metagenomics

Presenting Author: Adrienne Narrowe, University of Colorado Denver

Author(s):

Casey Bleeker, University of Colorado Denver, United States
Cuining Liu, University of Colorado Denver, United States
Christopher S. Miller, University of Colorado Denver, United States

Abstract:
Sequencing of the 16S rRNA gene is a popular method to characterize bacterial community membership, abundance, and spatial and temporal dynamics. However, current high-throughput sequencing read lengths limit most studies to sequencing only 30% of the length of the full 16S rRNA gene. This reduction in information content can confound community profiling and abundance estimates by hindering our ability to distinguish closely related organisms.

EMIRGE is an approach designed to exploit the depth of high-throughput sequencing without the loss of information content experienced in a typical short-read amplicon sequencing experiment. By iterative mappings against and refinements to a set of candidate sequences, EMIRGE reconstructs near full-length 16S rRNA gene sequences, including novel sequences not represented in the candidate set, and provides accurate sequence abundance estimates.

We improved the EMIRGE algorithm and systematically characterized its parameter space in the face of rapidly changing sequencing technologies. Using realistic simulated datasets, we find increased sensitivity and specificity with the updated algorithm. This improved ability to reconstruct the structure of complex communities results from changes to: 1) the composition of the candidate sequence database; 2) a sequence-merging heuristic; 3) the read mapper, replaced by a new, indel-aware one; and 4) read preprocessing and the postprocessing of output sequences.

We applied EMIRGE 2 to the characterization of bacterial and archaeal diversity in methanogenic freshwater wetland sediments across 87 samples spanning hydrological and depth gradients. The complex community shifts substantially across these geographic and geochemical gradients, implicating specific organisms in carbon cycling and methane dynamics.

P40

A repository of semantic types in the MIMIC II database clinical notes

Subject: Text Mining

Presenting Author: Richard Osborne, University of Colorado

Author(s):

Kevin Cohen, University of Colorado, United States
Alan Aronson, National Library of Medicine, United States

Abstract:
The MIMIC II database contains over 1.2 million clinical documents from various medical sources in an intensive care unit. These notes contain a trove of valuable information that can help inform medical personnel in their decision-making process. A common task for researchers working in BioNLP is to run MetaMap, which uses the UMLS Metathesaurus, on various kinds of documents to identify specific semantic types of entities contained therein. Doing so on the MIMIC II notes can help to find valuable information that can be used by practitioners and researchers alike. However, running MetaMap on this many documents is very time-consuming and computationally demanding, and it can produce different results from run to run. Given that other researchers will continue to use these data, providing a repository that contains MetaMap output could be of great value. Research in many groups could be accelerated if there were a community-accessible set of outputs from running MetaMap on this document collection, cached and available on the MIMIC II website. In addition, research that builds on such a repository benefits from a common data set that avoids issues associated with reproducibility. This paper provides performance data on running MetaMap and describes a repository of all MetaMap output from the MIMIC II database, publicly available, assuming compliance with usage agreements required by UMLS and MIMIC II. Additionally, software that allows for easy manipulation of MetaMap output, available on SourceForge with a liberal Open Source license, is described.

P41

Taking an omics approach to phenotype responses to cigarette smoke in humans in vivo and in vitro: comparison with healthy smokers and those with COPD

Subject: Data management methods and systems

Presenting Author: Mark Pau-Clark, Imperial College London

Author(s):

Neil Galloway-Phillipps, Imperial College London
Paul Armstrong, Queen Mary's University London
Clare Ross, Imperial College London
Jane Mitchell, Imperial College London
Christopher Brearley, Imperial College London
Trevor Hansel, Imperial College London
Katsuhito Ikeda, Imperial College London

Abstract:
According to the World Health Organisation, approximately 5 million people die each year as a result of smoking cigarettes. Smoking represents the second largest global disease burden, and is accountable for 90% of lung cancers. It is a leading risk factor for chronic obstructive pulmonary disease (COPD) and cardiovascular disease. Despite the link between cigarette smoking and disease, the biological mechanisms by which cigarette smoke is associated with inflammation and pathology are not fully understood. We have, over the past 5 years, used transcriptomic analysis on in vitro and ex vivo systems to unravel predicted and novel pathways altered by cigarette smoke. To progress this work, we have now undertaken a systems biology approach using a range of ‘omics’ endpoints to address this question in human subjects. Specifically, we have used transcriptomics and metabolomics, together with traditional clinical and lab-based assays, to analyze and profile responses to smoking in healthy smokers and well-defined phenotypes of patients with COPD. Metabolomic, transcriptomic and ex vivo blood assays suggest that smokers with COPD have different baseline characteristics and responses to smoking cigarettes when compared to healthy smokers. Principal component analysis of the transcriptome revealed clustering by severity of COPD. At the metabolic level, disease-distinguishing pathways included (i) 1,5-anhydroglucitol, which implicates a disturbance in renal glucose homeostasis; (ii) arginine, which implicates the nitric oxide synthase pathway; and (iii) lysolipids, which implicate cytosolic phospholipase A2 activity. A comprehensive analysis of all endpoints may reveal biomarkers and therapeutic targets for smoking-related disease.

P42

Statistical Modeling RNA Seq Simulations Comparing Two Isoforms

Subject: Simulation and numeric computing

Presenting Author: Daniela Perry, Johns Hopkins School of Medicine

Author(s):

Stefan Canzar, Johns Hopkins School of Medicine, United States
Liliana Florea, Johns Hopkins School of Medicine, United States

Abstract:
Alternative splicing is the inherent property of eukaryotic genes to produce multiple mRNA and protein isoforms by using different combinations of the gene's exons. Alternative splicing plays important roles in evolution and disease: differential isoform use and transcript splicing ratios have been identified as biomarkers for an increasing number of diseases, including cancers. We use a control experiment in which we simulate RNA-seq data in support of exon skipping events with different splicing ratios and read coverage for the control and test samples. We tested several statistical methods, including nonparametric tests, for their ability to detect differential alternative splicing events between control and test samples. We present results describing the accuracy of each method as a function of the two parameters: the difference between the two splice ratios best explains the accuracy of the statistical tests, while, on a smaller scale, the ratio itself also affects the outcome.
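
The kind of simulation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: reads supporting exon inclusion are drawn as binomial counts from a splicing ratio and a coverage, and a two-proportion z-test stands in for the unspecified statistical methods. All function names and parameter values are hypothetical.

```python
import math
import random


def simulate_inclusion_reads(psi, coverage, rng):
    """Draw how many of `coverage` reads support exon inclusion (binomial)."""
    return sum(1 for _ in range(coverage) if rng.random() < psi)


def two_proportion_z_test(inc_a, n_a, inc_b, n_b):
    """Two-sided p-value for equal inclusion ratios (normal approximation)."""
    pooled = (inc_a + inc_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # no variation at all, so no evidence of a difference
    z = (inc_a / n_a - inc_b / n_b) / se
    return math.erfc(abs(z) / math.sqrt(2))


def detection_rate(psi_control, psi_test, coverage, trials=200, alpha=0.05, seed=0):
    """Fraction of simulated events called differential at level alpha."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        a = simulate_inclusion_reads(psi_control, coverage, rng)
        b = simulate_inclusion_reads(psi_test, coverage, rng)
        if two_proportion_z_test(a, coverage, b, coverage) < alpha:
            hits += 1
    return hits / trials


# Large ratio difference versus no difference, at the same coverage.
rate_large_gap = detection_rate(0.5, 0.9, coverage=100)
rate_no_gap = detection_rate(0.5, 0.5, coverage=100)
```

Varying `psi_control`, `psi_test` and `coverage` over a grid gives detection rates as a function of the ratio difference and read depth, which is the shape of comparison the abstract reports.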

P43

Integrative analysis through bivariate Gaussian mixture models of DNA methylation and gene expression across different cancer types

Subject: Simulation and numeric computing

Presenting Author: Ivan Imaz-Rosshandler, National Institute of Genomic Medicine

Author(s):

Said Munoz-Montero, National Institute of Genomic Medicine, Mexico
Francisco Barajas-Olmos, National Institute of Genomic Medicine, Mexico
Federico Centeno-Cruz, National Institute of Genomic Medicine, Mexico
Lorena Orozco, National Institute of Genomic Medicine, Mexico
Claudia Rangel-Escareno, National Institute of Genomic Medicine, Mexico

Abstract:
Research in systems biology focuses on complex interactions between and among biological systems operating primarily at the cellular level. It has been considered a holistic approach to cell biology; however, the integration of different levels of information is, by definition, an implicit challenge. A well-known example is cancer genomics, where such complexity requires a fine but global understanding of the biological processes that underlie the disease. To accomplish this task, and despite the availability of appropriate data, mathematical models have to deal with different types of information that grow in size and complexity as new dimensions are included in the models. Moreover, the propagation of error is not a trivial problem. For the regulation of gene expression by DNA methylation, classical statistical approaches as well as experiments guided by biological knowledge have proven useful but not sufficient for large-scale data analysis. In this work, we present an alternative strategy for integrative analysis of DNA methylation and gene expression based on multivariate clustering analysis through Gaussian Mixture Models, using the Ovarian Cancer data set from TCGA. The resulting clusters can then be explored by pathway enrichment analysis with an integrative perspective. In addition, a network of gene expression interactions was inferred using mutual information, from which the betweenness centrality measure was computed and then mapped onto the clustering scheme.

P44

Discovering Identities through COmpositional properties, A Computational Approach to Metagenomics

Subject: Metagenomics

Presenting Author: Steven Reisman, Loyola University Chicago

Author(s):

Zachary Romer, Loyola University Chicago, United States
Michael Shaffer, University of Denver, United States
Catherine Putonti, Loyola University Chicago, United States

Abstract:
Investigating complex environmental samples via metagenomic whole genome sequencing has provided a glimpse into the true range of microbial life, from the human gut to the polar ice caps. The diversity and fragmentation of metagenomic samples create a large computational challenge when attempting to identify and classify constituent genomic pieces. Contemporary sequencing technologies, which are churning out more data, faster, and at a reduced cost, have uncovered a wealth of never-before-seen sequences that end up described only as “unclassified”. In an attempt to provide some insight into the taxonomic source and putative function of sequences generated from complex samples, we have developed the software tool DISCO (Discovering Identities through COmpositional properties). DISCO works on a pre-constructed framework of categorized genomes from the NCBI Genomic database. Metagenomic reads are classified based upon a variety of compositional properties. In contrast to other composition-based classification tools, DISCO integrates several different metrics to refine its determination of species, taking into account the variation that is likely to occur in nature. It was developed with flexibility and speed in mind, such that as new sequence data become available, they can easily be integrated into DISCO libraries to further inform our classifier.
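
As a rough illustration of composition-based classification in general, the sketch below assigns a read to the reference whose k-mer frequency profile is nearest in Euclidean distance, and also computes GC content as a second compositional property. It is not DISCO's algorithm: the abstract does not name its metrics, and DISCO is said to combine several of them. The function names, the reference sequences and the choice of k are all assumptions made for the example.

```python
from collections import Counter
from itertools import product
import math


def kmer_profile(seq, k=4):
    """Normalized k-mer frequency vector over the ACGT alphabet."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[km] for km in kmers)  # ignores k-mers containing N etc.
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[km] / total for km in kmers]


def gc_content(seq):
    """Fraction of G and C bases in the sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def classify(read, references, k=4):
    """Return the label of the reference with the closest k-mer profile."""
    profile = kmer_profile(read, k)

    def distance(label):
        return math.dist(profile, kmer_profile(references[label], k))

    return min(references, key=distance)


# Hypothetical toy references; real libraries would be built from genomes.
references = {
    "at_rich": "ATATATTTAAATATTA" * 5,
    "gc_rich": "GCGCGGCCGCGGGCCC" * 5,
}
label = classify("ATTTATAAATAT", references, k=2)
```

A practical classifier would precompute the reference profiles once and would need a rule for reads that match no reference well; both are omitted here for brevity.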

P45

The Inverse Relationship Between Sample Size and Differentially Expressed Genes in the selected Microarray Experiments

Subject: Other

Presenting Author: Akram Samarikhalaj, Ryerson University

Author(s):

Asli Uyar, Okan (asliuyar@gmail.com), Turkey
Ayse Basar Bener, Ryerson University (ayse.bener@ryerson.ca), Canada

Abstract:
In microarray studies, sample size determination is a concern for biology and life science researchers: results obtained from small samples are not accurate, while large samples are expensive and time-consuming to collect. In this study, three experiments with 8 or more samples per condition were collected from the EMBL-EBI public database to investigate the relationship between sample size and the number of differentially expressed genes. An unpaired t-test with a p-value threshold of 0.05 was used to identify the number of genes that are differentially expressed in all arrays. The results obtained after Robust Multi-Chip Average (RMA) normalization and removal of outlier samples using PCA plots indicate that the mean total number of differentially expressed genes decreases as the sample size increases. The standard deviation about the mean also decreases as the sample size increases. This result is reliable if the samples are of high quality and the outliers have been removed. The total number of common genes served as an indicator for identifying outliers: it increases unusually when outliers are not removed from the samples.
Conclusion: There is an inverse relationship between the sample size and the number of differentially expressed genes in these three selected experiments. This conclusion can be generalized to all microarray experiments if the selected samples are of high quality.

P46

A pipeline for virus phylogeography that accounts for geospatial observation error

Subject: System integration

Presenting Author: Matthew Scotch, Arizona State University

Author(s):

Robert Rivera, Arizona State University, United States
Tasnia Tahsin, Arizona State University, United States
Rachel Beard, Arizona State University, United States
Mari Firago, Arizona State University, United States
Davy Weissenbacher, Arizona State University, United States
Garrick Wallstrom, Arizona State University, United States
Graciela Gonzalez, Arizona State University, United States

Abstract:
Tracking evolutionary changes in viral genomes and their spread often requires the use of data deposited in public databases such as GenBank or the Influenza Research Database. Sequences and their metadata can be downloaded and imported into software applications that generate phylogeographic models for surveillance. These models require the geospatial assignment of taxa, which is often obtained from GenBank metadata. Unfortunately, geospatial metadata such as host location is uncertain in GenBank, with a median of only 30% containing a precise location such as a county or town. For example, a record may indicate China or USA instead of Beijing or Seattle. While the town or county might be included in the corresponding journal article, this valuable information is not available for immediate use unless it is extracted and then linked back to the appropriate sequence.

This work focuses on developing and applying information extraction and statistical phylogeography approaches to enhance models that track evolutionary changes in viral genomes and their spread. We will discuss the design and initial results of a framework that uses natural language processing to automatically extract relevant geospatial data from the literature and assigns a confidence to the link between each geospatial mention and the GenBank record. Our system will use these locations, with the confidence estimates serving as observation error, in the creation of phylogeographic models of zoonotic virus spread.

P47

Phylogeny-wide Discovery of Bacterial Transcription Factor Binding Motifs by Protein Family-based Approach

Subject: Machine learning, inference and pattern discovery

Presenting Author: Maxim Shatsky, Lawrence Berkeley National Lab

Author(s):

Alexey Kazakov, LBNL, United States
Kanchana Padmanabhan, LBNL, United States
John-Marc Chandonia, LBNL, United States
Pavel Novichkov, LBNL, United States

Abstract:
Reconstruction of high-quality genome-scale gene regulatory networks remains challenging even when gene expression data are available. We developed an automated method for phylogeny-wide discovery of transcription factor (TF) binding motifs across all bacterial genomes to produce starting points for regulon identification. Our approach, SEFMA (Simultaneous Entire-Family Motif Analysis), is based on the simultaneous analysis of all regulators from a given TF family. In contrast to existing approaches, which require a collection of sequenced genomes neighboring the genome of interest, we are able to identify TF binding sites (BSs) across many microbial genomes with high accuracy. Here, we target cis-acting regulators, as in bacterial species the vast majority (~80%) of TFs have BSs in regions local to their operons. The detected local motifs can then be used to search the genome to identify regulons.

We demonstrate an improvement in the discovery of local TF BSs over current state-of-the-art approaches. First, we compare SEFMA results to the manually curated motifs from the RegPrecise database and experimentally verified binding sites from RegulonDB. Second, we compare SEFMA results and the available motifs for Shewanella oneidensis MR-1, predicted by one of the standard approaches in the field, against the RegPrecise data. Third, we test SEFMA on more than 2 million TetR BSs that were experimentally synthesized and assayed for binding. We also applied SEFMA to identify binding sites of metal-sensing regulators in Pseudomonas stutzeri and to discover a novel fatty acid metabolism regulon in Gammaproteobacteria.

P48

Global Survey of Protein Complexes from the Sulfate Reducer Desulfovibrio vulgaris: evidence for lower connectivity within stable bacterial interactomes

Subject: Machine learning, inference and pattern discovery

Presenting Author: Maxim Shatsky, Lawrence Berkeley National Lab

Author(s):

Simon Allen, UCSF, United States
Barbara Gold, LBNL, United States
Nancy Liu, LBNL, United States
Thomas Juba, University of Missouri, United States
Sonia Reveco, LBNL, United States
Dwayne Elias, ORNL, United States
Ramadevi Prathapam, LBNL, United States
Jennifer He, LBNL, United States
Wenhong Yang, LBNL, United States
Evelin Szakal, UCSF, United States
Haichuan Liu, UCSF, United States
Mary Singer, LBNL, United States
Jil Geller, LBNL, United States
Bonita Lam, LBNL, United States
Avneesh Saini, LBNL, United States
Valentine Trotter, LBNL, United States
Steven Hall, UCSF, United States
Susan Fisher, UCSF, United States
Steven Brenner, UC Berkeley, United States
Mark Biggin, LBNL, United States
Swapnil Chhabra, LBNL, United States
Terry Hazen, University of Tennessee, United States
Judy Wall, University of Missouri, United States
Ewa Witkowska, UCSF, United States
John-Marc Chandonia, LBNL, United States
Gareth Butland, LBNL, United States

Abstract:
Sulfate-reducing microorganisms (SRMs) derive energy from the dissimilatory reduction of sulfate. Desulfovibrio vulgaris Hildenborough is the most intensively studied SRM and serves as a model. Here we report a global survey of protein-protein interactions in D. vulgaris using tandem affinity purification and mass spectrometry (MS). Building on our previous work, which enabled rapid locus-specific modification of the D. vulgaris chromosome, we generated and successfully tested 947 gene affinity-tagged D. vulgaris strains. We developed a novel computational pipeline that significantly reduced the false discovery rate of identified interactions, allowing 459 high-confidence protein-protein interactions to be detected with a 17% false discovery rate. Our high-confidence interactome contains many novel interactions associated with existing or new complexes and includes 145 previously unannotated or hypothetical proteins. Our binary protein-protein interactions include 21% same-operon pairs, significantly more than reported for any previous bacterial affinity purification MS study. This observation, as well as additional analysis, implies that the networks proposed in these earlier studies are composed largely of false positives and that bacterial stable interactomes are less connected than these studies implied.

P49

What types of data do I need to fit my complex biomathematical model?

Subject: Simulation and numeric computing

Presenting Author: Matthew Shotwell, Vanderbilt University

Author(s):

Richard Gray, Food and Drug Administration, United States

Abstract:
In the development of biomathematical models, it is important to consider the types of empirical data that, when considered simultaneously (i.e., integrated), are sufficient to estimate the model parameters. When no such data are available, or they are impossible to collect, the model may be ill-posed. We present a general computational approach to parameter estimability assessment for complex biomathematical models, where data are integrated from multiple disparate (i.e., multi-scale) sources. The methods are demonstrated using a simplified Hodgkin-Huxley model for transmembrane voltage in cardiac excitable cells. We show that integrated single- and multi-cell action potential propagation data are sufficient to identify a larger subset of model parameters than either type of data alone.

P50

Discovering Disease Associated Molecular Interactions Using Discordant Correlation

Subject: Machine learning, inference and pattern discovery

Presenting Author: Charlotte Siska, University of Colorado Anschutz Medical Campus

Abstract:
A common approach for identifying molecular features (such as transcripts or proteins) associated with disease is testing for differential expression or abundance in –omics data. However, this approach is limited for studying interactions between molecular features, which would give a deeper knowledge of the relevant molecular systems and pathways. We have developed a method for this purpose that we call the Discordant method. The Discordant method measures the posterior probability that a pair of features has discordant correlation between phenotypic groups using mixture models and the EM algorithm. We compare our method to two existing approaches: one that uses Fisher’s transformation in a classical frequentist framework and another that uses an Empirical Bayes joint probability model. We show with simulations and miRNA-mRNA glioblastoma data from the Cancer Genome Atlas that the Discordant method performs better in predicting related feature pairs. In simulations, we demonstrate that while all of the methods have similar specificity, the Discordant method has better sensitivity and is better at identifying pairs that have a correlation coefficient close to 0 in one group and a largely positive or negative correlation coefficient in the other group. Using the glioblastoma data, which has matched samples between miRNA and mRNA, we find that the Discordant method finds relatively more glioblastoma-related miRNAs compared to other methods. We conclude from the results in both simulations and glioblastoma data that the Discordant method is more appropriate for identifying molecular feature interactions unique to disease.
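
For reference, the classical comparison based on Fisher's transformation, named above as one of the baselines, can be sketched as follows. This is the standard textbook test for equality of two independent Pearson correlations, not the Discordant method itself; the function name and example numbers are hypothetical, and the exact variant the authors used is not stated in the abstract.

```python
import math


def fisher_z_test(r1, n1, r2, n2):
    """Two-sided test that two independent Pearson correlations are equal.

    r1, r2: correlation of a feature pair in each phenotypic group (|r| < 1).
    n1, n2: number of samples in each group (each must exceed 3).
    Returns the z statistic and its two-sided p-value.
    """
    z1, z2 = math.atanh(r1), math.atanh(r2)  # Fisher's transformation
    se = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z = (z1 - z2) / se
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p


# A pair strongly correlated in one group and uncorrelated in the other.
z_stat, p_value = fisher_z_test(0.8, 100, 0.0, 100)
```

A small p-value flags a pair whose correlation differs between groups, which is the pattern the abstract describes as hardest for the baselines when one group's correlation is near 0.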

P51

SASE-hunter – a method for detecting signatures of accelerated somatic evolution in cancer genomes

Subject: Simulation and numeric computing

Presenting Author: Kyle Smith, University of Colorado

Author(s):

Subhajyoti De, University of Colorado, United States
Brent Pedersen, University of Colorado, United States

Abstract:
Detection of novel cancer-associated genes and pathways has yielded novel therapeutics and advanced detection, diagnosis, and treatment of cancer. So far, the emphasis has primarily been on protein-coding regions, but non-coding regions, which cover 98% of the genome and harbor major regulatory elements, have been largely under-studied. To this end, we have developed a novel computational framework called SASE-hunter (Signatures of Accelerated Somatic Evolution – hunter) to identify genomic regions that have a significant excess of somatic mutations compared to a locally constructed null model. This method takes into account regional variation in mutation rate, evolutionary conservation and genomic context while making the inference. By applying this methodology to the promoter regions of protein-coding genes in 724 samples in 10 different cancer types, we have identified the promoters that were deemed significant in multiple samples. Integrating gene expression, methylation, and survival data, we assess the functional impact of the selected cases in lymphoma and melanoma. Our findings facilitate biomarker discovery and have implications for cancer diagnosis and treatment. Application of the SASE-hunter algorithm has identified mutated promoters in lymphoma that are associated with alterations in gene expression for multiple cancer-related genes.

P52

Using the Business Model Canvas to develop, communicate and enable scientific innovation

Subject: Other

Presenting Author: Simon Twigger, BioTeam

Abstract:
This is a poster about a poster. In recent years the entrepreneurial community have developed some unique visual tools to describe and develop business models and to define how their new business ideas provide value to their customers. Attend any startup bootcamp, tech incubator or business school entrepreneur class and you will see the Business Model Canvas on every wall, a framework to help the founders outline their innovation, who will benefit from it, how they are going to achieve their goals, why they are uniquely qualified to deliver this solution and, importantly, how the finances will add up. It is a living document that is tested and validated as the team explore the key elements of their idea and its viability. It is fast replacing the 'Business Plan' (a static document that hardly anyone ever reads that optimistically describes a future that hardly ever comes true as planned) as the way in which new business ideas are developed and described.
Much of bioinformatics and BioIT entails bringing innovation to our users to solve their data-related problems. I will show how these canvases and similar tools can be used in practical ways to help teams collaborate on these types of projects. On a wider level, the grants we are all very familiar with bear a remarkable similarity to the Business Plan - is it possible that something like the Business Model Canvas could benefit the scientific community in the same way as it has revolutionized the entrepreneurial community?

P53

A Bayesian framework for Signature-driven Protein Quantification

Subject: Machine learning, inference and pattern discovery

Presenting Author: Bobbie-Jo Webb-Robertson, Pacific Northwest National Laboratory

Author(s):

Melissa Matzke, Unilever, United States
Susmita Datta, University of Louisville, United States
Samuel Payne, Pacific Northwest National Laboratory, United States
Jiyun Kang, Pacific Northwest National Laboratory, United States
Lisa Bramer, Pacific Northwest National Laboratory, United States
Carrie Nicora, Pacific Northwest National Laboratory, United States
Anil Shukla, Pacific Northwest National Laboratory, United States
Thomas Metz, Pacific Northwest National Laboratory, United States
Karin Rodland, Pacific Northwest National Laboratory, United States
Richard Smith, Pacific Northwest National Laboratory, United States
Mark Tardiff, Pacific Northwest National Laboratory, United States
Jason McDermott, Pacific Northwest National Laboratory, United States
Joel Pounds, Pacific Northwest National Laboratory, United States
Katrina Waters, Pacific Northwest National Laboratory, United States

Abstract:
Mass spectrometry-based proteomics has the capability to measure tens of thousands of peptides simultaneously across complex biological samples. With samples from higher organisms, translating these peptides into protein-level estimates is a major challenge for computational proteomics. A limitation of existing computationally driven protein quantification methods is that most ignore protein variation, such as alternative splicing of the RNA transcript, post-translational modifications, or other possible proteoforms, which will affect a significant fraction of the proteome. The consequence of this assumption is that statistical inference at the protein level, and consequently downstream analyses such as network and pathway modeling, have only limited power for biomarker discovery. A new approach to this problem is to utilize peptide-level signatures in a Bayesian framework (BP-Quant) to identify peptides associated with over-expressed patterns to improve relative protein abundance estimates. BP-Quant is a research-driven approach that utilizes the objectives of the experiment, defined in the context of a standard statistical hypothesis, to identify a set of peptides exhibiting similar statistical behavior relating to a protein. This approach infers that changes in relative protein abundance can be used as a surrogate for changes in function, without necessarily taking into account the effect of differential post-translational modifications, processing, or splicing in altering protein function. We verify the approach using a dilution study from mouse plasma samples and demonstrate that BP-Quant achieves accuracy similar to that of the current state-of-the-art methods at proteoform identification, with significantly better specificity.

P54

Predicting survival for diverse patient cohorts using large-scale cancer genomics data

Subject: Machine learning, inference and pattern discovery

Presenting Author: Nicolle Witte, University of Colorado - Anschutz Medical Campus

Author(s):

James Costello, University of Colorado - Anschutz Medical Campus, United States

Abstract:
A fundamental challenge in precision medicine is to identify the genomic features that are predictive of patient response to drug treatment and overall survival. Large, nation-wide efforts, such as The Cancer Genome Atlas (TCGA), have characterized the genomes, transcriptomes and proteomes from tens of thousands of patients across many tumor types. These data present a rich source to characterize predictive genomic features, but caution must be taken when associating patient features with outcome. TCGA represents a heterogeneous sampling of the population with inconsistent patient demographics and patients being treated with an array of therapies. Here, we apply machine learning approaches (lasso regression and random forests) to predict patient survival across several -omics data sources (gene expression, copy number, methylation and protein quantification). Our predictions are based on stratifying patients into cohorts defined by clinical variates, drug treatments and demographics. We explore four cancers: glioblastoma, ovarian, kidney, and lung adenocarcinoma. We compare our results to published results that do not consider clinical or demographic information when predicting survival and show that stratifying the population is critical to identifying features that are predictive of survival. Our results demonstrate differences in predictive ability when accounting for clinical variates across data sources. For example, we see significant differences in performance when comparing temozolomide-treated vs. untreated glioblastoma patients using methylation data. Our results represent a benchmarking dataset for survival prediction using genomics data. We also identify genomic signatures associated with different patient cohorts.
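The cohort stratification step described above can be illustrated with a minimal sketch. This is not the authors' pipeline: the function name `stratify_patients`, the record fields (`cancer`, `treated`), and the `min_size` filter are all hypothetical, and the toy records are invented purely for illustration. A survival model (for example lasso regression or a random forest) would then be fitted within each cohort.

```python
from collections import defaultdict

def stratify_patients(patients, keys, min_size=1):
    """Group patient records into cohorts keyed by the given clinical fields.

    Cohorts with fewer than min_size members are dropped (hypothetical filter).
    """
    cohorts = defaultdict(list)
    for patient in patients:
        cohorts[tuple(patient[k] for k in keys)].append(patient)
    return {label: members for label, members in cohorts.items()
            if len(members) >= min_size}

# Toy records; field names and values are invented for illustration only.
patients = [
    {"id": "p1", "cancer": "glioblastoma", "treated": True},
    {"id": "p2", "cancer": "glioblastoma", "treated": False},
    {"id": "p3", "cancer": "glioblastoma", "treated": True},
]

cohorts = stratify_patients(patients, keys=("cancer", "treated"))
for label, members in cohorts.items():
    # Each cohort would be modeled separately downstream.
    print(label, [m["id"] for m in members])
```

Keeping the stratification keys as a parameter makes it straightforward to compare stratified and unstratified predictions by passing an empty tuple, which places every patient in a single cohort.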

P55

Network-based analysis for large environment microbial genomics data

Subject: Machine learning, inference and pattern discovery

Presenting Author: Erliang Zeng, University of South Dakota

Author(s):

Wei Zhang, WalmartLabs, United States
Scott Emrich, University of Notre Dame, United States
Stuart Jones, University of Notre Dame, United States
Abdelali Barakat, University of South Dakota, United States
Joshua Livermore, University of Notre Dame, United States
Dan Liu, University of Notre Dame, United States

Abstract:
Early environmental microbe studies characterized a limited snapshot of microbial diversity. The availability of huge amounts of genome sequence data from natural microbial consortia enables integrated analysis to resolve the genetic and metabolic potential of microbial communities, to establish how functions are partitioned in and among populations, and to reveal how microbial communities evolve and adapt across multiple environments. In this study, we propose to analyze comparative microbial genomes from three environments (human gut, soil, and marine) using network preservation analysis. The developed computational framework was able to identify functional modules and evaluate the functional roles of those modules in microbial communities in response to environmental change. We found modularized environmental adaptation among microbial communities: it is the gene module, rather than the individual gene, that serves as the unit of evolution in microbial communities. Overall, the gene network from soil samples shows a more coherent structure, and the network from marine samples shows the largest divergence. We observed that the wet environments (gut and marine) tend to be more complex than the dry environment (soil), which we attribute to the dynamics of wet environments. We demonstrated that modules with strong evidence of preservation contain ubiquitous pathways that contribute to the metabolism of major nutrient substances, while non-preserved modules include many environmentally dependent pathways that are expected to be responsive to environmental change.

P56

A Novel Visualization Technique for the Discovery of Inter-strain Recombination in Apicomplexans

Subject: Graphics and user interfaces

Presenting Author: Javi Zhang, University of Toronto

Author(s):

Asis Khan, NIH, United States
Michael E. Grigg, NIH, United States
Andrea Kennard, NIH, United States
John Parkinson, Hospital for Sick Children, Canada

Abstract:
Apicomplexan parasites, which include Toxoplasma gondii and Plasmodium falciparum, the causative agent of malaria, cause more than a million deaths annually. The population structures of Apicomplexan parasites are a key factor for their evolutionary and virulence potential. However, they are poorly described by existing visualization techniques, such as phylogenetic trees and neighbor-nets, because recombination between strains means that genetic distance is no longer an accurate metric of genetic relationship. To overcome this challenge, we have developed a novel visualization pipeline that shows inter-strain genetic relationships on both the global and the local level. Here, we apply this pipeline to a geographically diverse set of T. gondii strains to determine the scale and frequency of inter-strain recombination.
Whole genome sequences of the T. gondii strains are aligned and clustered using the Markov clustering algorithm (MCL) on the basis of genome-wide single nucleotide polymorphisms (SNPs). The genomes are then divided into 10 kb segments, and each segment is clustered individually using MCL. The longitudinal clustering patterns along the genomes are used to detect recombination events and regions of recent common ancestry between strains. By combining the two analyses into a single network view, we achieve an accurate, high-resolution representation of the inter-strain genetic relationships. Our results show regions of recombination between both genetically and geographically diverse strains, highlighting the crucial role recombination plays in shaping T. gondii population structure.
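The segmentation step described above can be sketched as follows. This is an illustrative sketch only, not the authors' pipeline: the input layout, the function names `bin_snps_by_window` and `pairwise_similarity`, and the allele-sharing similarity measure are assumptions, since the abstract does not state how the graph passed to MCL is weighted. The clustering itself is not implemented here.

```python
from collections import defaultdict

WINDOW_BP = 10_000  # 10 kb segments, as stated in the abstract

def bin_snps_by_window(snps, window_bp=WINDOW_BP):
    """Group SNP sites into fixed-size genomic windows.

    snps: iterable of (chrom, position, {strain: allele}), 1-based positions.
    Returns {(chrom, window_index): [{strain: allele}, ...]}.
    """
    windows = defaultdict(list)
    for chrom, pos, alleles in snps:
        windows[(chrom, (pos - 1) // window_bp)].append(alleles)
    return dict(windows)

def pairwise_similarity(window_sites, strains):
    """Fraction of SNP sites in one window at which two strains share an allele.

    Assumed similarity measure; the resulting weights could serve as the
    edge weights of a per-window graph handed to a clustering tool.
    """
    sims = {}
    for i, a in enumerate(strains):
        for b in strains[i + 1:]:
            shared = sum(1 for site in window_sites if site[a] == site[b])
            sims[(a, b)] = shared / len(window_sites)
    return sims

# Toy data invented for illustration: three strains, three SNP sites.
snps = [
    ("chr1", 1,      {"A": "G", "B": "G", "C": "T"}),
    ("chr1", 10_000, {"A": "G", "B": "T", "C": "T"}),
    ("chr1", 10_001, {"A": "C", "B": "C", "C": "C"}),
]
windows = bin_snps_by_window(snps)
first_window = pairwise_similarity(windows[("chr1", 0)], ["A", "B", "C"])
```

Comparing the per-window similarities (or the resulting cluster memberships) along a chromosome is what exposes a change in a strain's nearest relatives, the pattern the abstract uses to flag candidate recombination events.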

P57

Visualizing Microbiome Data for Scientific Analysis & Communication

Subject: Graphics and user interfaces

Presenting Author: Megan Pirrung, University of Colorado Anschutz Medical Campus

Abstract:
Sequencing technologies are getting cheaper and producing vast amounts of data, especially in the field of microbial ecology. Proper visualization of biological data is key to informative analysis and insight. Data analysis by users through informative, useful, and responsive visualizations is key to harnessing the true potential of big data. We propose an experiment that will help to determine which types of visualization techniques are most informative for microbial ecology data in both public and expert scientific audiences. To perform this experiment on a large number of subjects in a systematic way, we have created a modular system with easily substituted visualization methods. We have created dynamic visualizations that parallel the visualizations commonly used in microbial ecology publications using the D3 (Data-Driven Documents) JavaScript library, and a visualization-testing framework. A visualization technique is systematically selected from the set of visualizations appropriate for the data selected and shown to the user. The user is also presented with a questionnaire that will let us determine which visualizations allow users to answer the most questions correctly. The survey is available at http://tinyurl.com/vissurvey-rocky. We expect that the scores will indicate that certain visualization techniques are more appropriate for certain types of data, and that certain visualizations may be more informative for audiences with certain demographics than for others, among both public and scientific audiences.

P58

GIST - A novel pipeline for accurate taxonomic assignment

Subject: Metagenomics

Presenting Author: John Parkinson, Hospital for Sick Children

Author(s):

Samantha Halliday, University of Toronto, Canada

Abstract:
Metatranscriptomics, unbiased mRNA shotgun sequencing, is emerging as a powerful technology to functionally interrogate microbiomes. By characterising gene expression across a large diversity of species simultaneously, metatranscriptomics offers the potential to identify specific functional contributions associated with each taxon within a microbiome. A significant challenge in these studies, however, is assigning accurate functional and taxonomic information to each read. While current approaches typically rely on sequence-similarity searches, accuracy is compromised by the high degree of bacterial diversity associated with environmental samples. Here we introduce GIST (Generative Inference of Sequence Taxonomy), an ensemble method that integrates several statistical and machine learning methods for compositional analysis of both nucleotide and amino acid content with the output from the Burrows-Wheeler Aligner to produce high-quality taxonomic assignments for metatranscriptomic RNA read datasets. Based on simulated and real datasets, the GIST pipeline is found to significantly outperform existing taxonomic assignment methods. We are currently developing GIST as a standalone, open source package that may be readily integrated into new and existing metagenomic analytical pipelines.
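As a rough illustration of what compositional analysis of nucleotide content can mean, the sketch below computes a normalized k-mer frequency profile for a read. It is a generic feature of the kind a compositional classifier might consume, not GIST's actual feature set, which the abstract does not describe; the function name `kmer_profile` and the choice k = 3 are assumptions.

```python
from collections import Counter
from itertools import product

def kmer_profile(seq, k=3):
    """Return normalized k-mer frequencies over the ACGT alphabet.

    k-mers containing other characters (e.g. N) are ignored.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[m] for m in kmers)
    if total == 0:
        # Read shorter than k, or no unambiguous k-mers: return an all-zero profile.
        return {m: 0.0 for m in kmers}
    return {m: counts[m] / total for m in kmers}

# Toy read invented for illustration.
profile = kmer_profile("ACGTACGT", k=2)
```

Profiles like this have a fixed length (4^k entries) regardless of read length, which is what lets reads of different sizes be compared or passed to the same statistical model.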