10th Annual Rocky Mountain Bioinformatics Conference

Poster List by Number with Abstracts


P01:
Transcriptional Regulatory Networks of the Brain: the Whole versus the Part

Presenting Author: George Acquaah-Mensah, Massachusetts College of Pharmacy and Health Sciences

Author(s):
Ronald Taylor, Pacific Northwest National Laboratory, United States

Abstract:
Brain region-specific microarray gene expression data are less freely available than data derived from the whole brain. The paucity of region-specific data that can provide insights into specific functions has often necessitated the use of whole-brain mouse data. Specific localized events tend to be diluted out when data derived from disparate brain regions are examined as an aggregate. Gene expression measurements generated by high-throughput in situ hybridization and stored in the mouse brain atlas (http://mouse.brain-map.org/welcome.do) at the Allen Institute for Brain Science form a unique brain region-specific resource. For our work, we have employed the data in this repository to infer mRNA transcriptional regulatory (sub)networks, using reconstruction algorithms that employ correlations in gene state to infer gene-to-gene regulation. We present results using such methods on Allen data for the determination of high-confidence transcriptional regulatory subnetworks in the mouse brain. We used several state-of-the-art algorithms for such network inference: the Context Likelihood of Relatedness (CLR) algorithm (mutual information based), the GENIE3 algorithm (random forest based), and the Supervised Inference of Regulatory Networks (SIRENE) algorithm (Support Vector Machine based). The inferred networks were compared with those similarly generated using whole mouse brain microarray data in the PhenoGen database. We compared their overlap and performed further topological analyses using the Cytoscape MCODE plug-in.
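
As a hedged illustration of the mutual-information-based approach named in this abstract, the minimal Python sketch below applies a CLR-style background correction to a precomputed mutual information (MI) matrix. The function name `clr_scores` and the toy matrix are assumptions made for illustration; this is not the authors' pipeline, and it does not show how the MI values themselves are estimated.

```python
import math

def clr_scores(mi):
    """Apply a CLR-style background correction to a symmetric MI matrix.

    mi[i][j] is the mutual information between genes i and j.
    Each MI value is turned into a z-score against the other MI values
    of gene i and of gene j (negative z-scores are clipped to 0), and
    the two z-scores are combined as sqrt(z_i^2 + z_j^2).
    """
    n = len(mi)
    means, stds = [], []
    for i in range(n):
        row = [mi[i][j] for j in range(n) if j != i]  # ignore self-MI
        mu = sum(row) / len(row)
        var = sum((v - mu) ** 2 for v in row) / len(row)
        means.append(mu)
        stds.append(math.sqrt(var))

    def z(value, k):
        # z-score of an MI value against gene k's background distribution
        if stds[k] == 0:
            return 0.0
        return max(0.0, (value - means[k]) / stds[k])

    scores = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                scores[i][j] = math.sqrt(z(mi[i][j], i) ** 2 + z(mi[i][j], j) ** 2)
    return scores

# Toy MI matrix (hypothetical values): genes 0 and 1 share unusually high MI.
mi_toy = [
    [0.0, 2.0, 0.5, 0.5],
    [2.0, 0.0, 0.5, 1.0],
    [0.5, 0.5, 0.0, 0.5],
    [0.5, 1.0, 0.5, 0.0],
]
scores = clr_scores(mi_toy)
```

In this toy case only the gene 0 / gene 1 pair stands out from both genes' backgrounds, so it receives the highest score, while pairs whose MI is at or below background score zero.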


P02:
Expression genome-wide association study (eGWAS) to identify therapeutic targets and functional relationships in Huntington's disease

Presenting Author: Alexander Alleavitch, The Buck Institute for Research on Aging

Author(s):
Biao Li, The Buck Institute for Research on Aging, United States
Cendrine Tourette, The Buck Institute for Research on Aging, United States
Robert Hughes, The Buck Institute for Research on Aging, United States
Lisa Ellerby, The Buck Institute for Research on Aging, United States
Sean Mooney, The Buck Institute for Research on Aging, United States

Abstract:
Huntington’s disease (HD) is an autosomal dominant neurodegenerative disease that leads to serious psychological and motor symptoms that progressively worsen over an individual's lifespan. HD is caused by expanded CAG repeats in an N-terminal polyglutamine tract in the Huntingtin (HTT) protein. HTT is involved in a large number of interactions with different proteins at the synapse, at various membranes, and within the nucleus, but little is known of its definitive function. The NCBI's Gene Expression Omnibus (GEO) is a public repository of transcriptome data from published microarray experiments. GEO is a valuable tool for meta-analyses, but integrating its data across a diverse range of experimental methods and conditions is a difficult problem. In order to elucidate functional relationships and identify potential treatment targets for HD, we are conducting a gene expression-based genome-wide association study (eGWAS) to identify commonly dysregulated genes in a large number of GEO microarray experiments featuring HD models. We are examining 58 independent case-control matched microarray experiments (from 950 unique samples) across a variety of different models and conditions. We then intend to perform functional enrichment analyses on the results and validate the relevance of our candidate genes in an in vitro HD cell model. High-profile candidates will be prioritized for validation in a murine model and in cell lines. Furthermore, we will attempt to integrate these results into a larger network model along with pre-existing HD protein-protein interaction network data.


P03:
A comparison of co-temporal signals from real biological and simulated data

Presenting Author: Edward Allen, Wake Forest University

Author(s):
William Turkett, Wake Forest University, United States
David John, Wake Forest University, United States
James Norris, Wake Forest University, United States
Stan Thomas, Wake Forest University, United States
Larry Daniel, Wake Forest University, United States
Richard Loeser, Wake Forest University, United States
Kimberly Nelson, Wake Forest University, United States
Leslie Poole, Wake Forest University, United States
Elizabeth Hiltbold, Auburn University, United States
Jacquelyn Fetrow, Wake Forest University, United States

Abstract:
The construction of a model of the underlying biological network of a cell from time series data is a difficult problem in modern biology. Typically, network models are constructed using either next state or co-temporal paradigms. Next-state models are often described as first order Markov, dynamic or causative. Co-temporal models include those described as using correlative, static or invariant methods. The two types of models should extract different types of information from the data. From a graph-theoretic point of view, next-state models should find parental or ancestor relationships while co-temporal models should find sibling relationships. The quality of the underlying algorithmic methods for the construction of the networks is often compared using publicly available simulated data. Simulated data is most often constructed using next-state methods as the data is generated from time point to time point. In this work, we show how co-temporal signals are found in real biological data but not in the simulated data sets studied. This work emphasizes the fact that real biological networks are complex and algorithms for the construction of simulated biological data need to take into account multiple types of dependencies.
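
The distinction between the two paradigms can be sketched with a simple correlation measure. The Python example below is a minimal illustration, assuming Pearson correlation as the dependency measure; the function names and the toy series are hypothetical and are not taken from the authors' methods. A co-temporal score compares two series at the same time points, while a next-state score compares one series at time t with the other at time t+1.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

def co_temporal_score(x, y):
    """Dependency between x and y measured at the same time points."""
    return pearson(x, y)

def next_state_score(x, y):
    """Dependency between x at time t and y at time t+1."""
    return pearson(x[:-1], y[1:])

# Toy data (hypothetical): the child copies the parent's previous state.
parent = [0, 1, 0, 1, 1, 0, 0, 1]
child = [0] + parent[:-1]

lagged = next_state_score(parent, child)
same_time = co_temporal_score(parent, child)
```

Here the lagged score is essentially 1 while the same-time score is weak, which is the pattern expected of data generated strictly from time point to time point.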


P04:
Rediscovery of the p53 transcriptome

Presenting Author: Mary Allen, University of Colorado

Author(s):
Joaquin Espinosa, University of Colorado, United States
Robin Dowell, University of Colorado, United States

Abstract:
The transcription factor p53, known as the guardian of the genome, can activate transcription of target genes in response to cellular stress. Thus, p53 plays a central role in the regulation of cellular processes and the suppression of cancer. Yet, despite the vast amount of research on p53, the precise transcriptional response to p53 activation is not understood. The traditional method of studying transcription examines steady-state levels of stable RNA (by sequencing or microarray) and is therefore unable to detect unstable transcripts. Furthermore, the steady-state nature of the assay makes it impossible to distinguish a genuine transcriptional response from changes in message stability. To overcome these limitations, I have used Global Nuclear Run-On sequencing (GRO-seq) to directly measure changes in transcription genome-wide at early time points after p53 activation. My preliminary data contain several exciting results. Surprisingly, only one hour after p53 activation, transcription of several well-known target genes is up-regulated, a novel finding that is undetectable by traditional methods. In addition, many previously unannotated transcripts are regulated by p53, including an intriguing percentage of potential p53 binding sites that are transcribed. These preliminary data demonstrate the potential of this approach to yield new insights into p53 and p53-mediated cancer regulation, and they open the field to new questions about the effect of transcription over a p53 binding site.


P05:
Molecular Prediction by Regression (MPR): a New Platform for Predicting Molecular Targets and Multiple Drug Classes for Tumor Subtypes

Presenting Author: David Astling, University of Colorado School of Medicine

Author(s):
Tiffany Chan, University of Colorado School of Medicine, United States
Aik-Choon Tan, University of Colorado School of Medicine, United States

Abstract:
Numerous studies have demonstrated that some cancers depend on oncogene-driven signals for survival and maintenance. Targeted cancer therapies have exploited this “oncogene addiction” concept, leading to several successful genotype-directed clinical applications of targeted therapies. The clinical paradigm until now has been based on a simple binary correlation between a mutated cancer gene and response to a given therapy. Since this simple binary correlation has only been validated in a few cancer subtypes, new biomarkers are needed to improve patient selection for other cancer subtypes. We have proposed to develop a robust computational platform, Molecular Prediction by Regression (MPR), that integrates statistical and machine learning approaches, applied to global gene expression data, to aid in predicting drug classes and identifying molecular targets in cancer subtypes. Here, we first applied multivariate regression modeling as a dimension reduction step to find a set of gene pairs that serve as the best predictors of drug class. We then employed the k-TSP algorithm as the feature extraction method to identify gene pairs whose relative expression levels reverse between the drug-sensitive and drug-resistant groups. The resultant classifier uses this reduced set of genes for improved identification of the molecular targets that are most suitable for various classes of drugs. As a proof of concept, we used this approach to derive an MPR classifier for the MEK1/2 class of inhibitors in colorectal cancer. We also identified the common resistance pathways for these MEK1/2 inhibitors, revealing potential resistance mechanisms for rational combination studies in this disease.
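
The pair-scoring idea behind top-scoring-pair methods such as k-TSP can be sketched as follows. This is a simplified illustration only: it scores each gene pair by how strongly the ordering of the two expression values differs between the two groups and keeps the k best pairs, without the additional constraints a full k-TSP implementation applies (for example, requiring selected pairs to be disjoint). The function names and the toy expression values are hypothetical.

```python
from itertools import combinations

def tsp_scores(sensitive, resistant):
    """Score every gene pair by how strongly its expression ordering flips.

    sensitive, resistant: lists of samples; each sample is a list of
    expression values indexed by gene.
    Returns {(i, j): |P(x_i < x_j | sensitive) - P(x_i < x_j | resistant)|}.
    """
    n_genes = len(sensitive[0])

    def frac_less(samples, i, j):
        # Fraction of samples in which gene i is expressed below gene j
        return sum(1 for s in samples if s[i] < s[j]) / len(samples)

    return {
        (i, j): abs(frac_less(sensitive, i, j) - frac_less(resistant, i, j))
        for i, j in combinations(range(n_genes), 2)
    }

def top_k_pairs(scores, k):
    """Return the k highest-scoring gene pairs (simplified selection)."""
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Toy data (hypothetical): genes 0 and 1 swap order between the groups.
sensitive = [[1, 5, 3], [2, 6, 9], [1, 4, 0]]
resistant = [[5, 1, 3], [6, 2, 9], [4, 1, 0]]

pair_scores = tsp_scores(sensitive, resistant)
best = top_k_pairs(pair_scores, 1)
```

A pair with score 1.0 has an ordering that is fully reversed between the groups, so a classifier can assign a new sample simply by checking which of the two genes is more highly expressed.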


P06:
A model-guided workflow for prediction and validation of synthetic lethal gene pairs as antibiotic targets in the human pathogen Escherichia coli O157:H7

Presenting Author: Ramy Aziz, University of California, San Diego

Author(s):
Nicole Fong, University of California, United States
Kathleen Andrews, University of California, United States
Jonathan Jonathan, University of California, United States
Joshua Lerman, University of California, United States
Bernhard Palsson, University of California, United States
Pep Charusanti, University of California, United States

Abstract:
With the alarming increase of infections caused by antibiotic-resistant pathogens and the innovation gap in antibiotic discovery, notably against Gram-negative bacteria, pressure mounts on researchers to develop alternative, high-throughput methods for the discovery of antibiotics and their target genes. One such approach is the use of random mutagenesis or metabolic modeling to identify essential genes as antibiotic targets within microbial genomes; however, this approach has not been fruitful because fewer single essential bacterial genes were identified than expected and many of them have human homologs. Alternatively, products of synthetic lethal gene pairs can be tested as targets for combinatorial antimicrobial therapy, a strategy already successfully implemented in cancer chemotherapy.
Here, we present a systems biology-based, in silico model-guided workflow for identifying synthetic lethal gene pairs common across pathogenic enterobacteria for possible development into broad-spectrum targets for combinatorial drug therapy. We deployed genome-scale metabolic models for preliminary prediction of pairs of genes, each of which is not singly essential, but whose combined deletion leaves the bacterium unable to grow. This preliminary prediction led to a list of 75 candidate gene pairs, which we further narrowed via literature mining to 52 gene pairs, 21 of which were already validated in the literature. We are currently validating in silico predictions by deleting candidate gene pairs in the highly pathogenic enterohemorrhagic Escherichia coli O157:H7 and Salmonella Typhimurium. Future directions include in silico screening of those candidate genes against small molecule libraries followed by testing potential molecules in vitro for bacteriostatic or bactericidal activity.
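
The definition of a synthetic lethal pair used here (neither gene singly essential, combined deletion prevents growth) can be sketched as an exhaustive screen. In the Python example below, the `FUNCTIONS` table and the `grows` predicate are a hypothetical toy stand-in for a genome-scale metabolic model; a real workflow would simulate growth with the metabolic model instead, and the gene names are placeholders.

```python
from itertools import combinations

# Hypothetical toy "model": each essential function can be carried out by
# any one of the listed genes. This replaces the growth simulation that a
# genome-scale metabolic model would provide.
FUNCTIONS = {
    "f1": {"geneA", "geneB"},           # two redundant genes
    "f2": {"geneC"},                    # a single essential gene
    "f3": {"geneD", "geneE", "geneF"},  # three redundant genes
}

def grows(deleted):
    """Return True if every essential function keeps at least one gene."""
    return all(genes - deleted for genes in FUNCTIONS.values())

def synthetic_lethal_pairs(genes):
    """List pairs of non-essential genes whose double deletion stops growth."""
    single_essential = {g for g in genes if not grows({g})}
    candidates = sorted(genes - single_essential)
    return [
        (a, b)
        for a, b in combinations(candidates, 2)
        if not grows({a, b})
    ]

all_genes = set().union(*FUNCTIONS.values())
pairs = synthetic_lethal_pairs(all_genes)
```

In the toy model only geneA and geneB form a synthetic lethal pair: geneC is excluded because it is essential on its own, and the third function tolerates any double deletion.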


P07:
Neural networks as a methodology for upgrading the selection of eucalyptus clones

Presenting Author: Leonardo Bhering, University of Viçosa

Author(s):
Bruno Galveas, EMBRAPA, Brazil
Cosme Damião, University of Viçosa, Brazil
Caio Césio, University of Viçosa, Brazil

Abstract:
Forestry measurement is an important element of forestry management since it provides precise information on the forest. The two variables most used in forestry measurement are height and diameter at breast height (DBH), which are used to calculate the wood volume existing in a forest. Because data must be collected from a large number of trees, many errors may be made at the field level, harming forestry measurement and the subsequent selection of superior genotypes for plant breeding programs. An alternative is to use volume prediction methodologies based only on DBH, reducing the labor of field measurements and resulting in more precise measurements. The objective of this work was to evaluate neural networks as a methodology for predicting wood volume in eucalyptus breeding programs and its impact on the selection of superior genetic material. The data used in this work were obtained from 140 genotypes measured at three and six years of age. The variables used were the volume estimated from DBH and height and the volume obtained by the neural network based only on DBH. The neural network methodology was efficient in estimating volume based only on DBH. It also allowed estimation of parameters of the original stand, such as means and genetic variation, in addition to enabling selection of superior material from the population. Thus, neural networks proved to be a powerful methodology for upgrading the selection of eucalyptus clones.


P08:
Simulation and Increment of Data for Use in Neural Networks in Studies of Classification and Discrimination of Populations

Presenting Author: Leonardo Bhering, University of Viçosa

Author(s):
Caio Césio, University of Viçosa, Brazil
Cosme Damião, University of Viçosa, Brazil

Abstract:
Several techniques are used to assess genetic diversity within and among populations. Among these are conventional methods, such as the discriminant analyses proposed by Anderson and by Fisher, which have been widely used to discriminate populations. However, techniques such as Artificial Neural Networks have been shown to be very promising because they may outperform the conventional models used in problem solving. This technique is based on mathematical or computational models inspired by the human brain. The objective of this work is to evaluate the efficiency of neural networks in the discrimination of populations, compared with multivariate procedures. Classification of individuals in both populations by the Anderson discriminant function and by the Fisher discriminant function was similar, with an apparent error rate of 20.3% for both. Cross-validation showed behavior similar to that observed with the discriminant analyses used in this study; however, the apparent error rate was 21.67%. The neural network methodology, applied to the original data set, provided more satisfactory results than those obtained with the Fisher or Anderson discriminant functions, with an apparent error rate below 1%.


P09:
What is in Control of Replication Timing?

Presenting Author: Sven Bilke, National Cancer Institute

Author(s):
Yevgeniy Gindin, NCI, United States

Abstract:
Initiation of DNA replication at specific sites in the genome is known to be far from perfect; known initiation sites may or may not fire in a given single cell during a specific cell cycle. At the same time, from a larger-scale perspective, DNA replication timing is highly orchestrated: specific regions in the genome are copied at specific, reproducible time points throughout S-phase. In an attempt to reconcile these two somewhat discordant observations, many researchers have assumed that a yet unknown, active regulatory mechanism in the cell is responsible for coordinating replication timing. Here we present a mechanistic explanation for genome-wide replication timing patterns. Our Random Diffusion Model does not contain any active control components and reproduces existing replication timing data with extreme fidelity (correlation > 0.9, about the same as the correlation between technical replicates of the experiment), demonstrating that (a) genome-wide replication timing patterns do not necessitate the existence of an active control system in the cell, (b) replication timing is a “systems” phenomenon that cannot be understood by investigating individual initiation sites, (c) integrative analysis of recently released ENCODE data identifies semi-local chromatin structure as the main determinant of replication timing, and (d) plasticity of replication timing across different cell types is a direct consequence of changes in the chromatin structure.


P10:
Identifying Driver Mutations in Cancer with Protein Structures

Presenting Author: Andrew Bordner, Mayo Clinic

Author(s):
Barry Zorman, Mayo Clinic, United States

Abstract:
Ongoing cancer resequencing projects are producing large quantities of data that give a comprehensive view of somatic genome variations occurring in different cancer types. Because cancer cells originate through an evolutionary process of mutation, selection, and clonal expansion, most mutations are expected to be functionally neutral. Therefore, a key challenge is to differentiate functional, or driver, mutations responsible for tumorigenesis and cancer progression from neutral, or passenger, mutations. Furthermore, the identification of driver mutations has potential biomedical applications in target discovery for drug development, personalized therapies, and cancer biomarker development.
We focus on protein coding regions since molecular modeling can potentially be used to predict their functional effects. Our computational method for detecting driver mutations in the exome relies on large-scale homology modeling of human protein complexes. Functional sites in the human protein models are predicted by homology with experimental structures of complexes, machine learning-based prediction, and clustering of known driver mutations. In addition, molecular mechanics simulations are used to predict the specific effects of mutations on protein stability, protein-protein interactions, and protein-ligand binding. An important feature of the method is that it not only identifies driver mutations, but also infers their specific biochemical consequences.


P11:
KBase: An Integrated Knowledgebase for Predictive Biology and Environmental Research

Presenting Author: Kevin Keegan, Argonne National Laboratory

Author(s):
Adam Arkin, Lawrence Berkeley National Laboratory, United States
Robert Cottingham, Oak Ridge National Laboratory, United States
Sergei Maslov, Brookhaven National Laboratory, United States
Rick Stevens, Argonne National Laboratory, United States
Dylan Chivian, Lawrence Berkeley National Laboratory, United States
Paramvir Dehal, Lawrence Berkeley National Laboratory, United States
Christopher Henry, Argonne National Laboratory, United States
Folker Meyer, Argonne National Laboratory, United States
Jennifer Salazar, Argonne National Laboratory, United States
Doreen Ware, Cold Spring Harbor Laboratory, United States
David Weston, Oak Ridge National Laboratory, United States
Elizabeth Glass, Argonne National Laboratory, United States
Thomas Brettin, Argonne National Laboratory, United States

Abstract:
The Systems Biology Knowledgebase (KBase) has two central goals. The scientific goal is to produce predictive models, reference datasets and analytical tools and demonstrate their utility in DOE biological research relating to bioenergy, carbon cycle, and the study of subsurface microbial communities. The operational goal is to create the integrated software and hardware infrastructure needed to support the creation, maintenance and use of predictive models and methods in the study of microbes, microbial communities and plants.

KBase is a collaborative effort designed to accelerate our understanding of microbes, microbial communities, and plants. It will be a community-driven, extensible and scalable open-source software framework and application system. KBase will offer free and open access to data models and simulations, enabling scientists and researchers to build new knowledge, test hypotheses, design experiments, and share their findings to accelerate the use of predictive biology. Our immediate 18-month goal is to have a beta-version completed by February 2013.


P12:
Compression of structured high-throughput sequencing data

Presenting Author: Fabien Campagne, Weill Medical College of Cornell University

Author(s):
Kevin Dorff, Weill Medical College of Cornell University, United States
Nyasha Chambwe, Weill Medical College of Cornell University, United States

Abstract:
Large biological datasets are being produced at a rapid pace and create substantial storage challenges, particularly in the domain of high-throughput sequencing (HTS). Most approaches currently used to store HTS data are either unable to quickly adapt to the requirements of new sequencing or analysis methods (they do not support schema evolution), or fail to provide state-of-the-art compression of the datasets. We have devised new approaches to store HTS data that support seamless data schema evolution and compress datasets substantially better than existing approaches. We report compressing data to less than 50% of the size that can be achieved with BZip2 for a similar computational cost. Building on these new approaches, we discuss and demonstrate how a multi-tier data organization can dramatically reduce the storage, computational and network burden of collecting, analyzing, and archiving large sequencing datasets. For instance, we show that spliced RNA-Seq alignments can be stored in less than 4% of the size of a BAM file.
Compared to the previous HTS compression state of the art, these methods reduce file size by more than 20% when storing gene expression and epigenetic datasets. A robust implementation of these approaches is available in Goby 2 (http://goby.campagnelab.org) and is used natively in the GobyWeb analysis platform (http://gobyweb.campagnelab.org). The platform supports analysis of HTS gene expression data (RNA-Seq and small RNA-Seq) as well as analysis of DNA methylation data (RRBS or Methyl-Seq assays). GobyWeb offers intuitive user interfaces, small data footprints and efficient parallel computations.


P13:
A correlation-based gene selection method on microarray data using Support Vector Machines

Presenting Author: Juana Canul-Reich, Universidad Juarez Autonoma de Tabasco

Abstract:
Gene expression microarray datasets often consist of a limited number of samples with a large number of expression measurements, usually on the order of thousands of genes. Therefore, dimensionality reduction through a gene selection process is crucial prior to any classification task. This work introduces CbGS-SVM, an embedded gene selector for microarray data. The basic idea of CbGS-SVM is to select genes that are not correlated with any genes selected in previous iterations, as correlated genes would be redundant. In addition, candidate genes are selected only if they lead to higher accuracy. The final set of selected genes is considered relevant for classification. The current implementation of CbGS-SVM uses a support vector machine with a linear kernel as the classification algorithm, follows a sequential forward approach, and uses the Pearson correlation coefficient.
CbGS-SVM is applied to four cancer microarray datasets: colon cancer (cancer vs. normal), leukemia (subtype classification), Moffitt colon cancer (prognosis predictor) and lung cancer (prognosis predictor). Our method is evaluated and results are presented as average accuracies over five runs of a 10-fold cross-validation process. We compare these results with those obtained with the Iterative Feature Perturbation method (IFP).
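
The general shape of a correlation-filtered sequential forward selection can be sketched as follows. This is not the CbGS-SVM implementation: the correlation threshold `max_corr`, the gene visiting order, and the `evaluate` callable (which in the abstract's setting would be the cross-validated accuracy of a linear-kernel SVM) are all assumptions made for illustration, and the toy evaluator below is a placeholder rather than a classifier.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

def select_genes(expr, evaluate, max_corr=0.8):
    """Forward selection that skips genes correlated with earlier picks.

    expr: {gene name: list of expression values, one per sample}
    evaluate: callable taking a list of gene names and returning an
              accuracy estimate (hypothetical stand-in for a classifier).
    max_corr: assumed threshold on |Pearson r| for calling a gene redundant.
    """
    selected, best_accuracy = [], 0.0
    for gene in expr:
        redundant = any(
            abs(pearson(expr[gene], expr[s])) > max_corr for s in selected
        )
        if redundant:
            continue
        accuracy = evaluate(selected + [gene])
        if accuracy > best_accuracy:  # keep the gene only if accuracy improves
            selected.append(gene)
            best_accuracy = accuracy
    return selected

# Toy data (hypothetical): g2 is perfectly correlated with g1.
expr_toy = {"g1": [1, 2, 3, 4], "g2": [2, 4, 6, 8], "g3": [4, 1, 3, 2]}
toy_evaluate = lambda genes: len(genes) / 10  # placeholder, not a real classifier
chosen = select_genes(expr_toy, toy_evaluate)
```

With the toy inputs, g2 is rejected as redundant with g1, while g3 is only weakly correlated with g1 and is kept because the placeholder accuracy increases.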


P14:
A web portal for the integration of genomic and proteomic data in Huntington’s Disease

Presenting Author: Greg Ceniceroz, The Buck Institute

Author(s):
Matthew Fleisch, The Buck Institute, United States
Biao Li, The Buck Institute, United States
Ari Berman, The Buck Institute, United States
Tobias Wittkop, The Buck Institute, United States
Brad Gibson, The Buck Institute, United States
Lisa Ellerby, The Buck Institute, United States
Robert Hughes, The Buck Institute, United States
Sean Mooney, The Buck Institute, United States

Abstract:
Huntington’s Disease is a highly penetrant neurodegenerative genetic disease that is caused by abnormal trinucleotide CAG repeats in the huntingtin gene. There are few treatments and the underlying molecular causes of the disease remain elusive. We are using bioinformatic methods to investigate the underlying causes and to prioritize potential targets for treatment. To this end, we are integrating public datasets for protein interactions, gene expression and other genomic data for interrogation. The web resource, which is not yet publicly available, is developed with Cytoscape Web and the Drupal content management system. Protein interaction data from the Human Protein Reference Database (HPRD) and gene expression datasets from the Gene Expression Omnibus are included and visualized as a network. In addition, known, predicted and homologous post-translational modifications in the Huntingtin protein and its interactors are included and browsable with citations.


P15:
An evolutionary model for asymmetric expression between paralogs

Presenting Author: Thomas Clarke, University of North Carolina at Chapel Hill

Abstract:
In several model organisms, it has been observed that approximately half or more of duplicated gene pairs show a pattern of expression divergence in which one of the genes has a higher level of mRNA abundance than its paralog across all conditions in which either gene is known to be expressed. Though the prevalence of expression asymmetry suggests it must be driven by factors that are widespread, the pattern is not predicted by current models of duplicate gene retention and evolution. We hypothesized that a highly pleiotropic regulatory architecture, in which regulatory factors govern the level of expression across many different conditions, would predispose gene pairs to diverge asymmetrically. To test this idea, we developed a toy model for the evolution of expression in duplicated genomes in which the protein products are indistinguishable, and reproductive success is solely a function of the combined expression level. Though mutations occurred with equal rate and magnitude in both genes of a pair, we nonetheless found that expression asymmetry arose naturally across a wide range of conditions. The population frequency of asymmetry was sensitive to regulatory architecture but relatively insensitive to other parameters, including the form of the fitness function. Experimental data from yeast on the level of expression divergence between asymmetric and non-asymmetric paralogs support the predictions of the model. We conclude that the asymmetric expression of duplicate genes may be a natural consequence of a highly pleiotropic regulatory architecture, and does not require or imply differentiation between the two protein products.


P16:
Genomic and Computational analysis of tuberculosis drug resistance

Presenting Author: Gargi Datta, University of Colorado School of Medicine; National Jewish Health

Author(s):
Rebecca Davidson, National Jewish Health, United States
Benjamin Garcia, University of Colorado School of Medicine; National Jewish Health, United States
Michael Strong, National Jewish Health; University of Colorado School of Medicine, United States

Abstract:
Tuberculosis is a leading cause of death due to an infectious disease, with an estimated 1.7 million deaths in 2006. The recent emergence of multiple drug-resistant strains of M. tuberculosis is threatening to make the disease incurable with current pharmaceutical drugs. There is an urgent need to rapidly identify drug-sensitivity profiles of M. tb strains and to find new drugs to treat resistant strains. A mutation identification, annotation, and functional impact pipeline was developed to investigate the progression of drug-resistant tuberculosis strains pre- and post-treatment. The pipeline was later extended to find correlations between mutations and drug resistance or susceptibility in a dozen M. tb strains of varying drug resistance. We plan to extend this pipeline to improve the predictive power of a project aimed at probabilistically determining the effect of a genetic mutation on drug resistance based upon proximity to known drug-resistance-conferring mutations. A naïve Bayesian network was initially implemented to predict the impact of each mutation on drug resistance. To aid fast and efficient treatment of tuberculosis in the field, especially in rural areas in developing countries, a mobile application is also being developed to look up pre-existing drug resistance information for TB and calculate alternative drug choices for strains that are likely resistant to particular drugs or that have specific genetic mutations. Together, these projects aim to improve the diagnosis and treatment of tuberculosis in such settings.


P17:
Analyzing "Unlikely" Semantic Similarity Across Multiple Ontologies

Presenting Author: Darcy Davis, Buck Institute for Research on Aging

Author(s):
Sean Mooney, Buck Institute for Research on Aging, United States
Tobias Wittkop, Buck Institute for Research on Aging, United States

Abstract:
Semantic similarity has often been used by computational methods for inference about gene function, protein interactions, disease involvement, drug targets, and other interesting bioinformatics problems. Traditional semantic similarity measures consider lists of identical or highly overlapping terms to be highly similar, regardless of the characteristics and meaning of the terms. We posit that genes (or other biological entities) associated with highly overlapping generic term sets may only be similar with respect to our lack of knowledge, and that a correct metric should indicate no evidence of “meaningful” similarity. Many newer approaches in bioinformatics incorporate the concept of term information content, which is usually done by weighting terms by their inverse frequency (rare terms are more valuable) or ontological depth.

We extend this idea with a metric that targets "unlikely" gene similarity, relative to random expectation with respect to both gene and term specificity. Furthermore, we account for the effect of term dependencies that are purely artifacts of ontology structure, such as long hierarchical chains, which can bias results. The random distribution is constructed through weighted random sampling without replacement and with parent-child dependencies preserved. Our approach is ontology-independent and compatible with term lists derived from many sources. We will present results for term-to-gene annotations produced by the NCBO automatic annotation service, covering 41 biomedical ontologies. The evaluation includes comparison to other widely used or state-of-the-art similarity metrics, with careful controls to isolate the factors which contribute most to performance.
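
The information-content weighting that this abstract builds on can be illustrated with a short Python sketch. It is a generic example, assuming inverse-frequency information content and a weighted Jaccard overlap; it is not the authors' metric, which additionally models random expectation and ontology structure. The function names and the toy annotations are hypothetical.

```python
import math

def information_content(annotations):
    """Compute -log(frequency) for each term.

    annotations: {gene: set of terms}. A term attached to every gene
    gets information content 0; rare terms get larger values.
    """
    n = len(annotations)
    counts = {}
    for terms in annotations.values():
        for term in terms:
            counts[term] = counts.get(term, 0) + 1
    return {term: -math.log(c / n) for term, c in counts.items()}

def ic_weighted_jaccard(terms_a, terms_b, ic):
    """Jaccard overlap in which each term counts by its information content."""
    union = terms_a | terms_b
    total = sum(ic[t] for t in union)
    if total == 0:
        # Only uninformative terms are involved: no evidence of similarity.
        return 0.0
    return sum(ic[t] for t in terms_a & terms_b) / total

# Toy annotations (hypothetical): "binding" is generic, "kinase" is specific.
annotations = {
    "g1": {"binding", "kinase"},
    "g2": {"binding", "kinase"},
    "g3": {"binding"},
    "g4": {"binding"},
}
ic = information_content(annotations)
specific = ic_weighted_jaccard(annotations["g1"], annotations["g2"], ic)
generic = ic_weighted_jaccard(annotations["g3"], annotations["g4"], ic)
```

An unweighted Jaccard index would rate both gene pairs as identical (1.0). With the weighting, the pair that shares only the generic term scores 0, reflecting the argument that overlap in generic terms is not evidence of meaningful similarity.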


P18:
Accessing ontology web-services of the National Center for Biomedical Ontology using Cytoscape

Presenting Author: Chaoyi Du, Buck Institute for Research on Aging

Author(s):
Tobias Wittkop, Buck Institute for Research on Aging, United States
Nigam Shah, Stanford University, United States
Sean Mooney, Buck Institute for Research on Aging, United States

Abstract:
Currently, the National Center for Biomedical Ontology (NCBO) hosts over 250 biomedical ontologies, providing organized controlled vocabularies for a variety of fields such as drugs, diseases, gene functions, and phenotypes. Furthermore, the NCBO provides many web-services to access this data. One of these services is the annotator web-service, which applies advanced string matching to map user-provided text to concepts from the available ontologies. In our previous project STOP (Statistical Tracking of Ontological Phrases; available at www.mooneygroup.org/stop) we utilized this service to map genes and proteins to concepts as the basis for a web application that performs term enrichment on gene sets using these novel annotations. In addition to our own work, the NCBO systematically applied the annotator web-service to various sources such as PubMed abstracts, GEO expression data descriptions, various entries in public pathway databases, and entries in PubChem and DrugBank, to name a few.
In an effort to make this valuable information more accessible and applicable, we are building an NCBO plugin for the widely used Cytoscape framework. Cytoscape is a graph visualization and analysis platform with a well-established plugin structure that already includes many data web-services and network analysis methods. We think that our plugin will enhance researchers’ analyses of biomedical networks by providing access to novel annotation data that can improve existing networks and create new ones.


P19:
THE EFFECTS OF TEMPERATURE AND WATER TYPE ON SEED GERMINATION IN ASCLEPIAS INCARNATA

Presenting Author: Natalya Dungee, North Carolina Agricultural & Technical State University

Abstract:
As an intern at the Kellogg Biological Station located in Hickory Corners, Michigan, I was able to experience evolution in Asclepias incarnata, swamp milkweed, and study the possible limitations of growth when affected by temperature and water type. For my research I conducted an experiment in which I tested a range of temperatures from 27°C to 35°C, in increments of two degrees, and treated the experimental seeds at each temperature with either deionized or tap water to see if either water type would affect germination once out of the cold chamber. Is germination success affected by temperature? Which temperature had the greatest effect on milkweed seed germination? Is germination success affected by water that contains ions? My hypothesis was that the swamp milkweeds treated with the deionized water at higher temperatures would have a better outcome of growth when compared to the milkweeds treated with tap water because of the purity of deionized water. Upon completing my experiment I was able to determine that Asclepias incarnata would be able to grow in extreme temperatures above the normal range without temperature or water type having an effect on growth.


P20:
Characterizing the SOS response of the human microbiome

Presenting Author: Ivan Erill, University of Maryland Baltimore County

Author(s):
Joseph Cornish, University of Maryland Baltimore County, United States

Abstract:
In most bacteria, DNA damage triggers the activation of a regulatory network commonly known as the SOS response, which has been experimentally characterized in a large number of bacteria. In Escherichia coli and other well-studied bacteria, the SOS response coordinates the expression of more than 40 genes involved in DNA repair, translesion synthesis and cell division inhibition. In addition, the SOS response has been shown to control bacteriophage activity, dissemination of antibiotic resistance genes, integron recombination and toxin production. Recent efforts to map the human microbiome have generated an enormous wealth of data that can be analyzed to gain insight into the complex interactions among the genetic networks of gut bacteria. Here we use known data on the SOS response of Gram-positive and Gram-negative bacteria and conventional search methods to analyze the composition of the SOS response in the microbiome of healthy individuals. Our findings indicate that the SOS response is a key component of the shared microbiome gene network, linking multiple gene dissemination pathways. The gene function profile of the microbiome SOS response is remarkably similar to that observed in reference organisms, suggesting that it is conserved among gut bacteria. However, we also report an abundance of regulated toxin-antitoxin genes, highlighting the importance of plasmid-borne contributions to this system. We validate some of these findings experimentally and we discuss their implications with regard to the analysis of metagenomic data and the nature of the SOS response.


P21:
Automated concept annotation: how are we doing?

Presenting Author: Christopher Funk, University of Colorado

Author(s):
William Baumgartner, University of Colorado, United States
Chris Roeder, University of Colorado, United States
Ben Garcia, University of Colorado, United States
Kevin Cohen, University of Colorado, United States
Karin Verspoor, National ICT Australia, Australia

Abstract:
Mining the scientific literature and extracting important pieces of information from journal articles will help drive scientific discovery through integration with other sources of data and assist us in filling in gaps in biological databases. Normalizing concepts to ontologies is one way to link texts together to facilitate this discovery. Recognizing ontological terms is difficult because of variability and ambiguity in the terms themselves and words used to represent them in text. There are a number of competing approaches to recognizing ontological terms out there; we would like to know how well these approaches actually work and if there are any interesting interactions between an ontology and the optimal parameter setting for a system.
In this work, we compare three widely used automated concept annotation systems (MetaMap, NCBO Annotator, and ConceptMapper) on their ability to recognize terms from five ontologies annotated in the CRAFT corpus (Sequence Ontology, Gene Ontology, NCBI Taxonomy, ChEBI, and Cell Ontology). The parameter space of each system is explored, and the analysis gives insight into how characteristics of different ontologies are affected by the parameters of each system. As a result of this work, two important conclusions are presented: 1) For a given ontology, the best performing system and parameters are known and can be used in further work. 2) Baselines for ontology recognition against a human annotated gold standard are established.


P22:
Functional Annotation Prediction, Assessment, and Enrichment in Pathogenic Mycobacteria

Presenting Author: Benjamin Garcia, University of Colorado Anschutz Medical Campus, National Jewish Health

Author(s):
Gargi Datta, National Jewish Health, University of Colorado Anschutz Medical Campus, United States
Rebecca Davidson, National Jewish Health, United States
Michael Strong, National Jewish Health, University of Colorado Anschutz Medical Campus, United States
Benjamin Garcia, National Jewish Health, University of Colorado Anschutz Medical Campus, United States

Abstract:
Often, in nonmodel organisms, a large percentage of genes in a genome are understudied or lack suggestive knowledge of protein function. In Mycobacterium tuberculosis and Mycobacterium abscessus, for instance, approximately 40% of the genes lack annotation. This makes the analysis of gene enrichment and differential expression analysis difficult in some cases, as many genes identified in omic scale studies have no known function. Utilizing reliable functional annotation prediction methodologies helps us to better interrogate large-scale biological datasets resulting from transcriptome, proteomic, and other large-scale experiments. Once gene functions are predicted, they can then be assessed and used in gene enrichment studies to better understand the overall impact of biological perturbations or conditional comparison studies, revealing a deeper understanding and identification of genes potentially involved in virulence, survivability, drug resistance, and general metabolism.


P23:
PROGmiR: An online tool to study prognostic implications of miRNAs in a variety of cancers

Presenting Author: Chirayu Goswami, IU School of Medicine, Indiana University Purdue University Indianapolis

Author(s):
Harikrishna Nakshatri, Indiana University Purdue University Indianapolis, United States

Abstract:
miRNAs have been discovered as important biomarkers, and their prognostic implications have been studied in multiple cancers. A lot of data from such studies is already available in the public domain, and more is being deposited. However, no resource to date allows users to study the prognostics of miRNAs of interest utilizing the wealth of available data. We have developed a tool that allows users to study prognostics of miRNAs in a variety of cancers. We have compiled data from public repositories such as GEO and the recently developed Cancer Genome Atlas (TCGA) to create a unified database of miRNA prognostics in cancers. This tool is named “PROGmiR” and it is available at www.compbio.iupui.edu/progmir.
Currently our tool can be used to study 'overall survival' implications of approximately 1050 human miRNAs in 16 cancer types. The data has been compiled from both miRNA array (TCGA and GEO) and sequencing (TCGA) platforms. Plots can be created to study 3-year, 5-year, and full follow-up survival. As more data becomes available in the future, we plan to expand the repository of our tool to analyze prognostics of miRNAs in more cancers, and to incorporate more survival functions so that users can study other prognostic features, such as metastasis-free survival and relapse-free survival. We believe this tool will be very helpful for users to link miRNA expression with cancer outcome and to design mechanistic studies.


P24:
In-silico identification of an epithelial core signature in human tumors

Presenting Author: Chirayu Goswami, IU School of Medicine, Indiana University Purdue University Indianapolis

Author(s):
Oscar Cano, Indiana University Purdue University Indianapolis, United States
Yesim Polar, Indiana University Purdue University Indianapolis, United States
Murat Dundar, Indiana University Purdue University Indianapolis, United States
Sunil Badve, Indiana University Purdue University Indianapolis, United States

Abstract:
Gene expression analysis is performed on grossly selected specimens, often without any microscopic analysis of tumor content. The variability in the amount of epithelial and stromal cells may lead to misleading differential expression analysis and selection of erroneous targets for therapeutics. It is also often unclear whether the genes identified are stromal or epithelial in origin. The goal of this study was to identify genes that define the core epithelial phenotype. These genes could provide a means of normalizing expression data.
We used caBIG GSK cell line data consisting of 950 samples to identify an epithelial signature which can be used to classify unknown tumor samples into High or Low epithelial content categories. Ten carcinoma samples from each of 11 tissue types (n=110) and an equal number of non-carcinoma samples (n=110) were randomly selected. We used a three-step bioinformatics approach to identify the epithelial signature. First, we identified 1455 differentially expressed genes between the carcinoma and non-carcinoma groups using ANOVA. Second, using these 1455 genes in the carcinoma group, we identified 5 gene co-expression modules using Weighted Gene Co-expression Network Analysis (WGCNA). Lastly, we performed PAM analysis, identifying 64 genes capable of discriminating carcinoma from non-carcinoma samples. We merged the PAM results with the WGCNA results from step 2 to identify a smaller 22-gene core epithelial signature and a larger 42-gene accessory epithelial signature. A smaller subset of the signature can successfully classify tumor samples into high and low epithelial content categories.


P25:
Perception of Local People and Concerned Authority Regarding Poaching of Wild Animals in Nepal (A case study of Chitwan National Park)

Presenting Author: Shailesh Acharya, Hedmark University College

Author(s):
Prabhakar Guragain, Hedmark University College, Norway

Abstract:
Poaching threatens the biodiversity of endangered wild species in Nepal because of their high demand in the international market. This study was conducted with the broad objective of understanding the antipoaching activities prevailing in Nepal, their efficiency, and their effect on the tourism sector. Chitwan National Park (CNP) is renowned for its unique biodiversity and outstanding natural beauty, but poaching was found to be one of its biggest problems. The study set out to determine the causes of poaching, to identify the trade routes, to find the loopholes in biodiversity conservation, and to contribute to biodiversity conservation. Data were collected from both primary and secondary sources, including local people living in the buffer zone of CNP, the park authorities, and the army. These data show that the poaching problem was mainly due to high market value, lack of awareness, unemployment, and insurgency. About 50% of the revenue collected from the tourism sector was found to be allocated to the development of the buffer zone and to sustaining biodiversity through an effective antipoaching program. To sustain the antipoaching program, we recommend reviewing and strengthening security measures in all areas to minimize poachers' access. A high-level antipoaching unit is required, comprising the chief warden, the district forest officer, and regular game scouts, since such a unit would have provision for informants outside the protected areas; a national-level committee should also be formed.


P26:
Understanding Heart Remodelling in Congestive Heart Failure through Analysis of High-Dimensional miRNA, mRNA, and Clinical Data

Presenting Author: Michael Hinterberg, University of Colorado School of Medicine

Author(s):
David P. Kao, M.D., University of Colorado, United States
Lawrence E. Hunter, Ph.D., University of Colorado, United States

Abstract:
Congestive heart failure (CHF) is the leading cause of hospitalization among Medicare patients, leading to costly treatment and reduced quality of life for over 5 million people annually in the U.S. alone. Cardiac impairment can be reduced pharmacologically in some but not all patients through beta blocker treatment, but the mechanisms of heart remodeling are not well understood. The “Beta Blocker Effects on Remodeling and Gene Expression” (BORG) study contains genotype information in the form of miRNA and mRNA expression data from CHF heart tissue, as well as corresponding phenotype information from electronic medical records, across multiple timepoints during treatment. By analyzing these high-dimensional data temporally (before, during, and after treatment), we can elucidate possible differences in CHF heart damage and remodeling, and therefore predict responsiveness of patients to beta blockers.

Using non-linear and linear classification techniques, we identified a miRNA that was highly predictive for beta blocker responsiveness. This miRNA has previously been implicated in cardiomyocyte proliferation and cardiac remodeling. Additionally, we have identified a set of genes functionally involved in metabolic syndrome (ADIPOQ, LEP, and PLIN1) which exhibit a highly-correlated change in expression during treatment. These findings may represent a new understanding of cardiac improvement during beta blocker therapy and possible interactions with metabolic syndrome, as well as enhanced sub-categorization of cardiomyopathy phenotypes.


P27:
A new Markov random field approach to predicting gene-disease relationships

Presenting Author: Yuxiang Jiang, Indiana University Bloomington

Author(s):
Sean Mooney, Buck Institute for Research on Aging, United States
Predrag Radivojac, Indiana University Bloomington, United States

Abstract:
Identification of disease-associated genes on a genome-wide scale is one of the most important problems in bioinformatics, especially as the number of high-throughput data sets increases. This problem becomes more challenging when multiple different relational ties exist between genes, including protein-protein interactions, sequence similarity, and gene co-expression levels. Therefore, it is crucial to integrate these heterogeneous information sources in order to provide accurate predictions. In this work, we adopted a Markov random field approach to depict the overall picture of association probability dependencies among genes. In our approach, the edge weights of the Markov network, which parameterize the conditional probabilities, are trained using machine learning techniques (kernel linear regression), and this serves as an essential step in fusing multiple data sets.

Most, if not all, existing work on candidate gene prioritization provides scores for each disease independently. In our approach, we attempt to incorporate the relationships between diseases and also predict gene-disease associations jointly for all diseases. Finally, we present a new large and manually curated data set of gene-disease relationships.


P28:
Consensus Gene Ranking Method for Genome-wide Functional Genetic Screens

Presenting Author: Jihye Kim, University of Colorado Denver

Abstract:
Recently, targeted therapies have shown promising results in the clinic for treating tumors driven by oncogenes. However, therapeutic and targeted inhibition of most oncogenes is not possible, and in those cases in which it is, the tumors eventually acquire new mutations in oncogenes or activate compensatory pathways to bypass blockade by these targeted therapies. Genome-wide functional genetic screens can reveal these compensatory pathways, suggesting rational combinations with targeted therapies for cancer treatment. Analyzing and interpreting these massive sequencing data for reliable hits remains a challenge. Previously, we developed BiNGS!SL-seq, a bioinformatics workflow to identify synthetic lethality genes from functional genetic screens by deep sequencing. We have employed this computational method to identify synthetic lethal pathways for leukemia, lung and colorectal cancers. In our original implementation, we used a weighted Z-score to rank and score synthetic lethality genes. Although we have identified validated hits from the screens, a high false-positive rate remains an issue. Here, we extend our computational method to incorporate other statistical methods, such as Redundant siRNA Activity (RSA) analysis and the Kolmogorov-Smirnov (KS) test, to reduce false positives. We have implemented these statistical methods and systematically compared them with the weighted Z-score method. We have compared the ranking and reproducibility of the synthetic lethality genes and the enriched pathways from these different statistical methods. We also compared the number of validated hits from these statistical methods to determine false-positive rates. Overall, the consensus gene ranking from these methods yielded the best results.
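The abstract does not say how the consensus ranking is formed. As one plausible reading, the sketch below converts each method's scores to ranks and orders genes by mean rank. The mean-rank rule, the function names, and the scores are all assumptions made for illustration.

```python
def ranks(scores, higher_is_better=True):
    """Return gene -> rank (1 = strongest hit) for one method's scores."""
    ordered = sorted(scores, key=scores.get, reverse=higher_is_better)
    return {gene: i + 1 for i, gene in enumerate(ordered)}

def consensus_ranking(rank_tables):
    """Order genes by mean rank across methods; ties are broken by gene name."""
    genes = set.intersection(*(set(r) for r in rank_tables))
    mean_rank = {g: sum(r[g] for r in rank_tables) / len(rank_tables) for g in genes}
    return sorted(genes, key=lambda g: (mean_rank[g], g))

# Hypothetical per-gene scores from three methods; not results from the poster.
z_score = {"G1": 3.1, "G2": 2.0, "G3": 0.5, "G4": 2.5}      # higher = stronger hit
rsa_p = {"G1": 0.001, "G2": 0.02, "G3": 0.4, "G4": 0.03}    # lower = stronger hit
ks_p = {"G1": 0.01, "G2": 0.005, "G3": 0.3, "G4": 0.2}      # lower = stronger hit

consensus = consensus_ranking(
    [ranks(z_score), ranks(rsa_p, higher_is_better=False), ranks(ks_p, higher_is_better=False)]
)
```

Working in ranks rather than raw scores sidesteps the fact that Z-scores and p-values are on different scales.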


P29:
Biologically plausible enhancements to models of protein interaction network evolution

Presenting Author: Scott Kirkpatrick, California State University, Chico

Author(s):
Ian Eckert, California State University, Chico, United States
Luis Cheung, California State University, Chico, United States
Todd Gibson, California State University, Chico, United States

Abstract:
Models of biological evolution play an important role in inferring evolutionary mechanisms underlying the formation of protein interaction networks. In particular, generative mechanics which model biological and evolutionary phenomena allow direct inference of the role each mechanic plays in the evolution of the interaction network.

In previously-published research we enhanced the traditional duplication and divergence network evolution model by associating each interaction a protein participates in with an interaction site on the protein. These interaction sites, which are analogous to protein domains, are modeled as heritable traits of the protein, as well as the unit of post-duplication divergence (subfunctionalization). This enhanced model of network evolution produces networks with topological characteristics which more closely match empirical values than the model absent the interaction site enhancement. Here, we introduce further biologically-plausible mechanics to the model. In particular, we integrate gene fusion and gene fission events into the model, report on its effect on network topology, and infer the role these mechanics play in the formation of empirical protein interaction networks.
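For readers unfamiliar with the baseline, a minimal sketch of a plain duplication and divergence model follows. It omits the interaction-site, gene fusion, and gene fission mechanics described above, and the parameter `p_retain` and the function name are illustrative assumptions.

```python
import random

def duplication_divergence(n_nodes, p_retain, seed=0):
    """Grow an undirected network by repeated node duplication.

    Each step copies a randomly chosen protein; the copy keeps each of the
    parent's interactions independently with probability p_retain
    (a simplified form of post-duplication divergence).
    """
    rng = random.Random(seed)
    adj = {0: {1}, 1: {0}}  # seed network: one interacting pair
    while len(adj) < n_nodes:
        parent = rng.choice(sorted(adj))
        child = len(adj)
        adj[child] = set()
        for neighbor in sorted(adj[parent]):
            if rng.random() < p_retain:
                adj[child].add(neighbor)
                adj[neighbor].add(child)
    return adj

network = duplication_divergence(n_nodes=50, p_retain=0.4, seed=1)
n_edges = sum(len(neighbors) for neighbors in network.values()) // 2
```

Topological statistics such as the degree distribution of `network` could then be compared against an empirical interaction network.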


P30:
Analytical Characterization of a Discrete Next Generation Sequencing Panel for Clinical Testing.

Presenting Author: Eric Klee, Mayo Clinic

Author(s):
ZJ Tu, Mayo Clinic, United States
Lisa Peterson, Mayo Clinic, United States
Xiao-yu Liu, University of Wisconsin, United States
Matthew Ferber, Mayo Clinic, United States

Abstract:
The field of clinical diagnostics has embraced the opportunity that next generation sequencing (NGS) provides to expand the portfolio of genetic targets profiled for a given disease. This shift from traditional Sanger sequencing to NGS has necessitated bioinformatics solutions focused on clinical testing requirements, often requiring retooling of software solutions initially developed to answer research questions. While developing a hereditary colon cancer (HCC) genetic test, we analyzed 14 genes across 40 samples with both Sanger sequencing and Illumina NGS. We also developed a synthetic dataset with a diverse mutational profile. These datasets enabled us to accurately characterize the analytic performance of our HCC panel using Illumina 100 base single-read sequencing and analysis with the commercial CLC Server bioinformatics toolset. We evaluated measurements of sensitivity, specificity, accuracy, reproducibility, and limit of detection. The synthetic dataset was used to identify thresholds for detection of insertions and deletions. Finally, in light of decreasing sequencing costs and increasing interest in whole exome and whole genome sequencing, we evaluated the ramifications of using these methods to profile our 14 gene panel at a clinical sequencing depth of 100x. Together, this study defined the analytic limits associated with our NGS clinical panel test and provided cursory guidance on the evolution of testing methods away from discrete target panels and toward whole exome or whole genome sequencing.
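The sensitivity, specificity, and accuracy mentioned above follow the standard confusion-matrix definitions. The sketch below computes them by comparing NGS calls with reference (for example, Sanger) calls over a set of assayed positions. The function name and example positions are hypothetical, and both call sets are assumed to lie within the assayed positions.

```python
def analytic_performance(called, truth, assayed):
    """Compare variant calls with reference calls over the assayed positions."""
    called, truth, assayed = set(called), set(truth), set(assayed)
    tp = len(called & truth)            # variants found by both methods
    fp = len(called - truth)            # called by NGS only
    fn = len(truth - called)            # missed by NGS
    tn = len(assayed - called - truth)  # reference base by both methods

    def ratio(numerator, denominator):
        return numerator / denominator if denominator else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "accuracy": ratio(tp + tn, tp + fp + fn + tn),
    }

# Hypothetical positions; not data from the study.
assayed_positions = range(100)
sanger_variants = {3, 17, 42, 88}
ngs_variants = {3, 17, 42, 60}
performance = analytic_performance(ngs_variants, sanger_variants, assayed_positions)
```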


P31:
Flexible Detection of RNA Polymerase II Pausing Changes

Presenting Author: Junnam Lee, Soongsil University

Author(s):
Sansoo Kim, Soongsil University, Republic of Korea
David Bentley, University of Colorado Denver, United States
Hyunmin Kim, University of Colorado Denver, United States

Abstract:
Transcription is the first fundamental regulatory step of gene expression and is controlled in part by dynamic changes in RNA Polymerase II (Pol II) phosphorylation. The proper expression of mRNAs requires timely delivery of Pol II to the appropriate promoters, productive elongation, co-transcriptional splicing, and cleavage/polyadenylation at the polyA site. All of these steps are integrated with one another. Flux control of elongating Pol II through the promoter-proximal pause sites is important for regulation of a large fraction of genes in multicellular organisms. As a result of promoter-proximal pausing and premature termination, Pol II binding patterns are characterized by high densities at the promoter-proximal sites and lower densities within the gene body. Altering the rate of the flux from pausing to productive elongation is widely thought to be a mechanism for rapidly regulating transcription in response to developmental and environmental cues.
Recently, a series of new high-throughput assays such as ChIP-seq, GRO-seq and NET-seq became available to study Pol II pausing dynamics. Mathematically, Pol II occupancy densities produce mixed phases of sharp and broad peaks whose amplitudes exceed the background noise. Because computational tools for recognizing and differentiating these Pol II patterns are lacking, we propose a new method which addresses the following issues: (1) defining peak boundaries; (2) normalizing signals based on assay types; (3) calculating a flux index between the defined regions; (4) comparing flux patterns from two experiments. To validate the method we use ChIP-seq and NET-seq datasets.
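The abstract does not define its flux index. As a stand-in, the sketch below uses the commonly used pausing index (promoter-proximal density divided by gene-body density) and compares it between two experiments on a log2 scale. The fixed window boundary, the pseudocount, and the coverage values are assumptions.

```python
import math

def mean_density(coverage, start, end):
    """Average per-base signal over coverage[start:end]."""
    window = coverage[start:end]
    return sum(window) / len(window)

def pausing_index(coverage, pause_end, pseudocount=1.0):
    """Promoter-proximal density over gene-body density.

    coverage is per-base signal starting at the TSS; pause_end marks the
    end of the promoter-proximal region (assumed to be known in advance).
    """
    proximal = mean_density(coverage, 0, pause_end)
    body = mean_density(coverage, pause_end, len(coverage))
    return (proximal + pseudocount) / (body + pseudocount)

def log2_index_change(cov_a, cov_b, pause_end):
    """Negative values mean weaker pausing (more release) in experiment b."""
    return math.log2(pausing_index(cov_b, pause_end) / pausing_index(cov_a, pause_end))

# Hypothetical per-base Pol II signal for one gene in two conditions.
control = [40] * 10 + [4] * 90
treated = [40] * 10 + [19] * 90
change = log2_index_change(control, treated, pause_end=10)
```

In this toy case the promoter-proximal signal is unchanged while the gene-body signal rises, so the index drops fourfold, consistent with increased release into productive elongation.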


P32:
Identifying Huntington’s Disease-Associated Genes Through Integrative Regulatory Network Analysis

Presenting Author: Biao Li, The Buck Institute for Research on Aging

Author(s):
Sean Mooney, The Buck Institute for Research on Aging, United States

Abstract:
A challenge in elucidating genetic factors involved in human disease is to locate genes with low-to-medium penetrance or epistatic effects, which are typically difficult to detect in cohorts with modest sample size. Recently, high-throughput technologies have made possible the generation of massive data that could reveal cellular regulatory and transcriptional activities. Analyses integrating this system-wide information, especially network data, have therefore been proposed and have shown potential to identify disease-relevant genes that would be hard to discover otherwise. Here we present such a network analysis of Huntington’s Disease (HD), a neurodegenerative disorder, based on genotype and gene expression data collected from 217 patients and 171 healthy controls. First, we created two HD-specific networks from (1) expression quantitative trait loci (eQTL) mapping and (2) mining regulatory data from the ENCODE project. We then prioritized genes via RankBoost, a boosting-based technique. Through this work we show the strength of network analysis with respect to supplementing locus-based complex disease gene mapping.


P33:
Hypergraph kernels for protein function prediction using protein complexes

Presenting Author: Jose Lugo-Martinez, Indiana University

Author(s):
Predrag Radivojac, Indiana University, United States

Abstract:
Computational approaches to predict protein function play a key role in understanding life at the molecular level. A common approach involves the use of kernel methods and SVMs as a means of assigning function from protein-protein interaction (PPI) networks. PPI networks consist of proteins as vertices and various interactions between proteins as edges. Several graph-based kernel methods have been proposed in the literature, including random walk kernels and shortest path kernels. However, these methods are subject to the inherent limitations of graph representations: graphs suffer from information loss because every edge can connect only two nodes. For example, a protein complex cannot be distinguished from a set of proteins that interact only pairwise, perhaps even in different tissues or at different stages during the cell lifetime. Therefore, there is a need for kernel methods defined over hypergraphs. In this study, we developed a hypergraph kernel approach for the task of predicting protein function from protein complex data.

Using protein complex data from CORUM, we predict functional class membership between metabolic and energy annotated proteins. We expect our approach to perform favorably against established approaches, as well as to provide evidence of the modeling power of hypergraphs.
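The information loss described above can be seen in a small example: once a hypergraph is flattened into pairwise edges (a clique expansion), a three-protein complex and three separate pairwise interactions become indistinguishable. The sketch below only illustrates this motivation and does not implement the hypergraph kernel itself; the protein names are placeholders.

```python
from itertools import combinations

def clique_expansion(hyperedges):
    """Flatten hyperedges into the set of pairwise edges they imply."""
    edges = set()
    for members in hyperedges:
        edges.update(frozenset(pair) for pair in combinations(sorted(members), 2))
    return edges

# One complex containing A, B and C together ...
complex_of_three = [{"A", "B", "C"}]
# ... versus three interactions that only ever occur pairwise.
pairwise_only = [{"A", "B"}, {"B", "C"}, {"A", "C"}]

same_graph = clique_expansion(complex_of_three) == clique_expansion(pairwise_only)
same_hypergraph = set(map(frozenset, complex_of_three)) == set(map(frozenset, pairwise_only))
```

Here `same_graph` is true while `same_hypergraph` is false, so only the hypergraph representation retains the distinction.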


P34:
Discovering Transcription Factor Regulatory Targets Using Gene Expression and Binding Data

Presenting Author: Mark Maienschein-Cline, University of Chicago

Author(s):
Roger Sciammas, University of Chicago, United States
Aaron Dinner, University of Chicago, United States

Abstract:
Identifying the target genes regulated by transcription factors (TFs) is the most basic step in understanding gene regulation. Recent advances in high-throughput sequencing technology, together with chromatin immunoprecipitation (ChIP), enable mapping TF binding sites genome-wide, but it is not possible to infer function from binding alone. This is especially true in mammalian systems, where regulation often occurs through long-range enhancers in gene-rich neighborhoods, rather than proximal promoters, preventing straightforward assignment of a binding site to a target gene.

Here, I present EMBER (Expectation Maximization of Binding and Expression pRofiles), a novel method that integrates high-throughput binding data (e.g., ChIP-chip or ChIP-seq) with gene expression data (e.g., DNA microarray or RNA-seq) via an unsupervised machine learning algorithm for inferring the gene targets of sets of TF binding sites. Genes selected are those that match over-represented expression patterns, and the output of the method includes both the putative gene targets and the inferred mode of gene regulation. I will present two applications of the method, time permitting. The first validates the putative targets from EMBER by confirming a role for the TFs ERα, RARα, and RARγ in human breast cancer development, whereas the conventional approach of assigning regulatory targets based on proximity does not. The second elucidates the context-dependent regulatory modes of the TF IRF-4 in mouse effector B cell fate choice.


P35:
A proposal for web-based reviews of supplements and other nutraceuticals in aging and aging related disease

Presenting Author: Jackson Miller, The Buck Institute for Research on Aging

Abstract:
There is perhaps no greater market for nutraceuticals than in aging and in aging related diseases. Vendors market products ranging from scientific fact to scientific quackery, none with FDA approval. We are proposing to develop a forum for scientific review of the classes of these products using web-based technologies. Our currently titled resource, MedVerified, is focused on providing consumers with reviews of products and services, with clear coverage of topics related to aging and age associated conditions. The main objective is to inform the user, and MedVerified will provide unbiased advice and reviews to individuals without medical expertise.
MedVerified seeks to provide product reviews from experienced and knowledgeable professionals. The core of the reviews will be science based and will be disseminated to and for the public investigating matters of clinical relevance. MedVerified will be a network of consumers, volunteers and medical advisors. We are providing individual evaluations of the safety and efficacy of products and services related to aging and, if known, the organization behind them. Users of the resource will rely upon the presented information to be impartial, scientifically accurate, and unbiased from commercial interests or competing products.
Although MedVerified is presently in the concept stage, our objective is to help and inform individuals. We will be simultaneously promoting health and health related products. MedVerified will be a resource to help people make informed decisions to maintain their health as well as help them if they do become diagnosed with an age associated condition.


P36:
Genome-wide Conservation of Codon Adaptation in fast-growing Bacteria

Presenting Author: Patrick O'Neill, University of Maryland Baltimore County

Abstract:
Protein coding sequences do not make use of synonymous codons
with equal frequency, but instead exhibit distinct patterns of
codon usage, or codon usage bias (CUB). Despite its importance
both in genetic engineering applications and comparative
genomics, CUB is not yet fully understood. To this end, we
introduce the self-consistent normalized Relative Codon
Adaptation (scnRCA) method, an index of CUB in bacterial genomes
resilient to both amino acid and compositional biases which
requires no external reference set. We conducted a comprehensive
benchmarking of scnRCA using available microarray expression data
for moderate- and fast-growing bacterial species, assessed by
correlation of scnRCA scores with gene expression values. In all
cases, the performance of the scnRCA algorithm was compared
directly with the self-consistent implementation of the Codon
Adaptation Index (scCAI), the de facto standard. We find that
scnRCA provides a significant increase in correlation with
expression data for genomes that exhibit compositional biases, on
which scCAI performs poorly.

We exploited the improved performance of scnRCA on
compositionally biased genomes to analyze the 12 available
complete genome sequences of Pseudomonas and Psychrobacter
species, whose divergent %GC contents provide an ideal case study
for the co-evolution of mutational and translational biases. Our
results reveal strong correlation of codon adaptation among
orthologous genes despite negative correlation in compositional
bias. This agreement, moreover, is not limited to highly
expressed genes but manifests throughout the genome. Most
surprisingly, correlation in scnRCA values extends across several
bacterial phyla (Actinobacteria, Firmicutes and Proteobacteria),
suggesting a universal "architecture" for fast-growers.


P37:
Integrated database for analyzing the impact of coding SNPs in the human genome

Presenting Author: Tal Oron, Buck Institute

Author(s):
Judy Shi, Buck Institute, United States
Biao Li, Buck Institute, United States
Predrag Radivojac, Indiana University, United States
Sean Mooney, Buck Institute, United States

Abstract:
We have developed a mutation database as a platform to investigate the effect of DNA single base substitutions on the function of human proteins. Our database adopts a dual-centric approach: it provides annotations on the gene/DNA level as well as on the protein level while preserving the associations between them. We have integrated somatic and germ-line mutations from the public databases UniProt, dbSNP, and Cosmic by mapping all mutations to a pre-defined set of protein sequences and their associated transcript sequences, and to the GRCh37 genome assembly. This process has enabled us to converge identical mutations and resulted in a non-redundant set. For each substitution, we have calculated the related positions on all of the gene’s alternative splicing products. In addition, we have calculated a MutPred score for each amino-acid substitution to assess the potential impact of the substitution on the function and structure of each protein. We have also mapped all protein variants to PDB files using Dunbrack’s PDB database to provide structure positions. In total, we have retrieved 112,251; 150,497; and 520,131 single amino-acid protein substitutions from UniProt, Cosmic, and dbSNP, respectively. These amino-acid substitutions were mapped to 59,571; 78,783; and 320,173 DNA single base substitutions, respectively. Our joined non-redundant database, after converging identically annotated mutations, consists of 685,659 single amino-acid substitutions on the protein level, which originated from 420,283 DNA single base substitutions.


P38:
Provenance of unaligned reads in ChIP-Seq studies

Presenting Author: Zachary Ouma, Ohio State University

Author(s):
Erich Grotewold, Ohio State University, United States
Katherine Mejia-Guerra, Ohio State University, United States

Abstract:
Chromatin Immunoprecipitation followed by massively-parallel sequencing (ChIP-Seq) is an indispensable tool in understanding the dynamics and evolution of regulatory circuitry of prokaryotes and eukaryotes. ChIP-Seq studies aim to decipher gene regulatory mechanisms by mapping genome-wide transcription factor binding sites (TFBSs). Aligning millions of short sequences (reads) to the reference genome is the first fundamental step in the analysis pipeline.

Although not all reads align to their reference genomes, the source of unaligned reads has not been systematically explored. We describe a computational approach to establish the source of unaligned reads from several major model organisms (Homo sapiens, Drosophila melanogaster, Caenorhabditis elegans, Arabidopsis thaliana) and Zea mays, in which ChIP-Seq efforts are underway as a first step in establishing the architecture of gene regulatory networks (GRN). The analysis of raw reads obtained from the NCBI Short Read Archive (SRA) revealed a significant level of contamination in ChIP-Seq unaligned reads with sequences of bacterial and metazoan origin, irrespective of the source of chromatin used for the ChIP-Seq studies. In agreement with other sequencing studies, our results indicated that human sequences are the main source of contamination. Unexpected, however, was the observation that some of the selected unaligned-read data sets contained significant numbers of legitimate, mappable reads that were missed in the alignment process. This highlights a need to improve the alignment algorithms currently in use.


P39:
Identifying molecular mechanisms underlying variable penetrance

Presenting Author: Kymberleigh Pagel, Indiana University Bloomington

Author(s):
Predrag Radivojac, Indiana University Bloomington, United States
Sean Mooney, Buck Institute, United States

Abstract:
Penetrance is the proportion of individuals who present with a particular disease phenotype given that they have the underlying genotype. The causes of variable penetrance include environmental factors and epistatic interactions, among many others. Penetrance of hereditary diseases is an important consideration for mutation carriers undergoing genetic counseling, as well as for researchers trying to determine the molecular mechanisms of a disease. We seek to investigate the molecular effects of missense mutations among disease causative mutations with varying levels of penetrance. A dataset of single amino acid substitutions with known clinical significance will be compiled from sources such as dbSNP and the OMIM database for use in this study. To draw conclusions about the molecular differences between completely penetrant and incompletely penetrant mutations, we will compare sequence variation and predicted properties of residues. Several different methods of comparison between mutations will be shown to isolate the most informative features relevant to mutation penetrance. Analysis of penetrant mutations will include comparison of the evolutionary conservation both in the mutated residues and in the entire protein, assessment of the severity of amino acid substitutions with regard to physicochemical properties, and calculation of the relative degree of variation in genes which are subject only to completely penetrant mutations versus genes with incompletely penetrant mutations.


P40:
Modelling three protozoan parasitic diseases using host-parasite protein-protein interaction networks

Presenting Author: Paurush Praveen, University of Bonn

Author(s):
Erfan Younesi, Fraunhofer SCAI, Germany
Martin Hofmann-Apitius, Fraunhofer SCAI, Germany

Abstract:
The drive towards understanding parasitic diseases can be aided by protein-protein interaction networks that unravel the disease mechanism at the molecular level. Since parasitic diseases such as malaria and sleeping sickness involve the emergence of cross-species interactions at the protein level, a model describing such inter-species interactions can capture the phenomenon more rationally and synergistically than intra-species networks and help to understand their pathophysiology. However, such studies are limited by the lack of protein-protein interaction information between host and pathogen proteins.
Here we propose and present an approach to overcome this limitation in disease modelling by expanding the protein-protein interaction data for three parasitic diseases, namely malaria, sleeping sickness and cattle theileriosis (East Coast fever). The information space for host-pathogen protein-protein interactions was created using literature and a prediction approach based on homology, gene ontology terms and protein domain data, in addition to the interaction databases. Our network models could explain the mechanisms involved in East Coast fever, malaria, and trypanosomiasis. Furthermore, they also suggest possible hypotheses for disease mechanisms that can be tested experimentally.
In future, we plan to set up joint efforts with the scientific community for a compendium of models representing host-pathogen interactions for a wide variety of neglected diseases, ultimately leading to a freely accessible network model database for such diseases that can be used for the identification of new knowledge and application in the domain.


P41:
Constructing informative prior from multiple knowledge sources to improve network inference.

Presenting Author: Paurush Praveen, University of Bonn

Author(s):
Holger Fröhlich, University of Bonn, Germany

Abstract:
Statistical methods to infer cellular networks from high throughput experiments have gained popularity in recent years. However, the inherent noise in experimental data together with low sample size restricts their performance resulting in high false positives and false negatives. Incorporating prior knowledge during the learning process has been proposed as a way to address this problem, and in principle a mechanism has been devised for this (Mukherjee & Speed, 2008). However, so far little attention has been paid to the fact that prior knowledge is typically distributed among multiple, heterogeneous knowledge sources (e.g. GO, KEGG, HPRD etc.).
We present two approaches to construct an informative network prior from multiple knowledge sources. The first method is a latent factor model using Bayesian inference. Our second model is the Noisy-OR model, which assumes that the overall prior is a non-deterministic effect of participating information sources. Both models are compared to a naïve method, which assumes knowledge sources to be independent. Extensive simulation studies on artificially created networks as well as full KEGG pathways reveal a significant improvement by both suggested methods compared to the naïve model. The performance of the latent factor model increases with larger network sizes, whereas for smaller networks the Noisy-OR model appears superior. Furthermore, we show that our informative priors significantly enhance the reconstruction accuracy of Bayesian Networks and Nested Effects Models. Finally, two examples, one from breast cancer and one from murine stem cell development, illustrate the application of our approach.
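As a rough illustration of the Noisy-OR idea mentioned above, the sketch below combines per-source edge confidences into a single prior using the textbook Noisy-OR rule, 1 - (1 - leak) * prod(1 - c_i). The abstract does not give the authors' parameterization, so the function names, the `leak` parameter, and the example source scores are all assumptions made for illustration only.

```python
from math import prod


def noisy_or(confidences, leak=0.0):
    """Textbook Noisy-OR: an edge is 'on' unless every source fails to support it.

    confidences: per-source support values in [0, 1] (hypothetical inputs).
    leak: assumed baseline probability of an edge with no support at all.
    """
    if not all(0.0 <= c <= 1.0 for c in confidences):
        raise ValueError("confidences must lie in [0, 1]")
    return 1.0 - (1.0 - leak) * prod(1.0 - c for c in confidences)


def combine_sources(sources):
    """Build an edge prior from several knowledge sources.

    sources: {source_name: {(gene_a, gene_b): confidence}}.
    An edge missing from a source is treated as confidence 0 there.
    """
    edges = set().union(*sources.values())
    return {
        edge: noisy_or([scores.get(edge, 0.0) for scores in sources.values()])
        for edge in sorted(edges)
    }


# Illustrative, made-up confidences for two knowledge sources.
example_sources = {
    "GO": {("A", "B"): 0.5, ("B", "C"): 0.2},
    "KEGG": {("A", "B"): 0.5},
}
edge_prior = combine_sources(example_sources)
```

With these made-up inputs, the edge supported by both sources receives a higher prior than either source alone would give, which is the qualitative behaviour a Noisy-OR combination is meant to capture.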


P42:
Classifying metagenomic reads based upon compositional properties.

Presenting Author: Zachary Romer, Loyola University Chicago

Author(s):
Bryan Quach, Loyola University Chicago, United States
Dannish Ghazali, Loyola University Chicago, United States
Catherine Putonti, Loyola University Chicago, United States

Abstract:
Mapping reads from large metagenomic samples of thousands of potentially unknown species presents a challenge when considering sequences beyond marker genes. Several tools have therefore been developed which, given the sheer number of metagenomic reads generated by next generation technologies, must rely on heuristics to perform their search. Both similarity-based means (often utilizing BLAST-like searches) and composition-based means of ascertaining the putative species and/or gene from which a read came have been developed. While the former have greater accuracy and specificity, composition-based approaches are less costly with respect to both time and memory. Herein we present an efficient classifier of metagenomic data that integrates composition and homology based comparisons, providing increased accuracy and expediency. This method for DIScovery through COmpositional profiles, known as DISCO, utilizes multidimensional k-mer frequency information combined with a probabilistic framework to predict the taxonomic group to which sequencing reads belong. Integrating multiple values of k facilitates the recognition of various sequence signals, e.g. codon usage, tetranucleotide usage, etc. Both sequencing reads and available genomic sequences are mapped to this multidimensional compositional space, generating the probability that each read belongs to particular species and genera.
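The compositional part of this idea can be sketched in a few lines. The example below is not DISCO itself: it builds k-mer frequency profiles for several values of k and scores a read against each reference with a simple naive-Bayes-style log-likelihood, which is an assumed stand-in for the probabilistic framework the abstract mentions. The function names, the pseudocount, and the toy sequences are hypothetical.

```python
from collections import Counter
from math import log


def kmer_profile(seq, k):
    """Relative frequencies of the k-mers in seq (only A/C/G/T k-mers are kept)."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    valid = {kmer: n for kmer, n in counts.items() if set(kmer) <= set("ACGT")}
    total = sum(valid.values())
    return {kmer: n / total for kmer, n in valid.items()} if total else {}


def log_likelihood(read, profile, k, pseudo=1e-6):
    """Sum of log frequencies of the read's k-mers under a reference profile.

    Unseen k-mers get an assumed small pseudo-frequency instead of zero.
    """
    read = read.upper()
    return sum(
        log(profile.get(read[i:i + k], pseudo))
        for i in range(len(read) - k + 1)
    )


def classify(read, references, ks=(2, 3, 4)):
    """Pick the reference whose profiles, summed over several k, best explain the read."""
    scores = {
        name: sum(log_likelihood(read, kmer_profile(seq, k), k) for k in ks)
        for name, seq in references.items()
    }
    return max(scores, key=scores.get)


# Toy references, invented for illustration.
toy_references = {
    "at_rich": "ATATATTTAAATATAT" * 5,
    "gc_rich": "GCGCGGCCGCGCGGGC" * 5,
}
```

In practice the reference profiles would be computed once and cached rather than rebuilt for every read; they are recomputed here only to keep the sketch short.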


P43:
Pharmacogenetic Annotation on Human Non-synonymous Variants By Machine Learning

Presenting Author: Chet Seligman, Buck Institute for Research in Aging

Author(s):
Biao Li, Buck Institute for Research in Aging, United States
Sean Mooney, Buck Institute for Research in Aging, United States
Janita Thusberg, Buck Institute for Research in Aging, United States

Abstract:
Advances in sequencing technologies have enabled large-scale genotyping and resulted in massive human genomic variant data. Interpreting these variants in the context of biomedical relevance, however, poses great challenges due to the sheer size of the data and the lack of standard annotation tools. One important property of human variants is their influence on drug response, and a small number of non-synonymous variants have been established to have pharmacodynamic (PD) and pharmacokinetic (PK) consequences. In this study, we built an automatic annotation tool through supervised learning to expand the set of known human pharmacogenetic variants. Based on PharmGKB, we manually curated 71 PK, 55 PD, and 3 PK-PD variants through literature search. An array of 130 evolutionary and physicochemical features was then generated for the PK and PD variants, as well as for 36,889 neutral variants from UniProt. We created a random forest-based classifier and verified it on an independent set of annotated PK/PD variants. Finally, we applied the model to all variants on plausible pharmacogenetic proteins and provided a list of novel pharmacogenetic variant candidates.


P44:
Investigating the Prevalence of CRISPRs in Bacterial Genomes

Presenting Author: Michael Shaffer, Loyola University Chicago

Author(s):
Catherine Putonti, Loyola University Chicago, United States

Abstract:
The Clustered Regularly Interspaced Short Palindromic Repeats (CRISPR) system is a defense mechanism that various bacteria and archaea have developed to protect against viral infections. CRISPR arrays within the prokaryotic genome include subsequences of viral genomic sequence(s) that provide “immunity” to the virus: when the prokaryote is infected by this virus in the future, the CRISPR system will recognize the virus and thwart its attack. As these arrays contain what are known as spacer sequences for the recognition of numerous different viral species, the CRISPR array provides insight into present as well as past viral attacks. Furthermore, investigation of these spacers can uncover new viral species as well as novel proteins. Identifying viruses from CRISPR arrays is a challenge. Given the high mutation rates of viruses and the assumption that viral sequences in nature will vary from sequenced isolates, classifying CRISPR spacers is an instance of the fuzzy string search problem. We have developed a new algorithmic approach for identifying and classifying CRISPR spacers based upon viral sequence data currently available. Our approach starts with the use of a seeding algorithm to look efficiently for exact matches through a lazy suffix tree implementation. The functionality of putative matches is then further examined, referencing GO. Various seed sizes and close match search methods have been tested to limit the number of false seeds while also keeping performance at a reasonable level. In addition, we present the results of our analysis of CRISPR arrays within publicly available prokaryotic genomes.
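The seed-then-verify strategy described above can be illustrated with a small sketch. Note the substitution: instead of the lazy suffix tree used by the authors, this example indexes exact seeds in a plain hash table, and it verifies candidates with an ungapped mismatch count, which is only one possible "close match" method. All names, the seed length, the mismatch threshold, and the toy sequences are assumptions.

```python
from collections import defaultdict


def build_seed_index(genomes, seed_len):
    """Map every substring of length seed_len to its (genome, position) occurrences."""
    index = defaultdict(list)
    for name, seq in genomes.items():
        for i in range(len(seq) - seed_len + 1):
            index[seq[i:i + seed_len]].append((name, i))
    return index


def match_spacer(spacer, genomes, index, seed_len, max_mismatches=2):
    """Find ungapped placements of a spacer that share an exact seed with a genome.

    Returns sorted (genome, start, mismatches) tuples with at most
    max_mismatches substitutions over the full spacer length.
    """
    hits = set()
    for offset in range(len(spacer) - seed_len + 1):
        seed = spacer[offset:offset + seed_len]
        for name, pos in index.get(seed, ()):
            start = pos - offset
            if start < 0:
                continue
            target = genomes[name][start:start + len(spacer)]
            if len(target) != len(spacer):
                continue  # spacer would run past the end of the genome
            mismatches = sum(a != b for a, b in zip(spacer, target))
            if mismatches <= max_mismatches:
                hits.add((name, start, mismatches))
    return sorted(hits)


# Toy viral sequences and a spacer carrying one substitution (invented data).
toy_viruses = {
    "phageA": "TTGACCGTAGGCTAACGTTAGC",
    "phageB": "GGGGGGGGGGGGGGGGGGGG",
}
seed_index = build_seed_index(toy_viruses, seed_len=5)
spacer_hits = match_spacer("CCGTAGGATAACG", toy_viruses, seed_index, seed_len=5)
```

Shorter seeds tolerate more divergence but produce more false candidates to verify, which mirrors the trade-off between seed size and performance noted in the abstract.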


P45:
Computational Aspects of the Molecular Evolutionary Mechanism of Occult Hepatitis B Virus Infection

Presenting Author: Guifang Shang, University of Canberra

Author(s):
Simon Easteal, Australian National University, Australia
Alice Richardson, University of Canberra, Australia
Michelle Gahan, University of Canberra, Australia
Brett Lidbury, Australian National University, Australia

Abstract:
Occult hepatitis B virus infection (OBI) is characterized by the presence of hepatitis B virus (HBV) DNA and the absence of detectable hepatitis B surface antigen. HBV contains a 3.2-kb partially double-stranded DNA genome encoding the S, C, X and P proteins. This project aims to identify computationally the molecular evolutionary mechanism of OBI.

More than three thousand complete HBV genomes were retrieved from GenBank, including 137 of OBI and 3547 of overt HBV infection (nonOBI). These genomes were aligned using Geneious MAFFT with the default input parameters, and phylogenetic trees for OBI and nonOBI sequences were drawn. Fisher’s exact test was used to identify the amino acid sequence differences between OBI and nonOBI site-by-site. Q values were obtained based on false discovery rate (FDR) using the QVALUES package.

Phylogenetic tree analyses showed that OBI and nonOBI nucleotide sequences were closely related. However, the Fisher exact test analysis showed that there are statistically significant differences between OBI and nonOBI amino acid sequences. The most variable viral protein was the S region, with 92.07% of OBI amino acids significantly different compared to nonOBI HBV.

This work shows that the amino acid variations in the S region appear to play an important role in the molecular evolution of OBI. Amino acid variation in the S region also explains the limitations of hepatitis B surface antigen immunoassays for detecting OBI strains.
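The site-by-site testing described in this abstract can be sketched as follows. This is an illustrative reconstruction, not the authors' pipeline: the way residues are tabulated at each site (consensus versus non-consensus) is an assumption, and the multiple-testing step uses Benjamini-Hochberg adjusted p-values as a simple stand-in for the q-values the authors computed with their package.

```python
from collections import Counter
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables that are no more
    likely than the observed one (margins held fixed).
    """
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def prob(x):
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    total = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, total)


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def site_table(obi_column, non_obi_column):
    """Assumed tabulation: variant vs consensus residue counts in each group.

    The consensus is taken as the most common residue in the nonOBI column.
    """
    consensus = Counter(non_obi_column).most_common(1)[0][0]
    obi_var = sum(r != consensus for r in obi_column)
    non_var = sum(r != consensus for r in non_obi_column)
    return (obi_var, len(obi_column) - obi_var,
            non_var, len(non_obi_column) - non_var)
```

A typical use would compute `fisher_exact_two_sided(*site_table(obi_col, non_obi_col))` for every alignment column and then pass the resulting list of p-values to `benjamini_hochberg`.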


P46:
Detection of Differential Polyadenylation as an Alternative Cancer Marker

Presenting Author: Jimin Shin, Soongsil University

Author(s):
David Bentley, University of Colorado Denver, United States
Chaeyoung Lee, Soongsil University, Korea, Rep
Hyunmin Kim, University of Colorado Denver, United States

Abstract:
RNA processing by cleavage and polyadenylation (polyA) is the mechanism by which almost all eukaryotic messenger RNAs acquire their mature 3’ ends. Regulation of 3’ end formation by altered polyA site choice or alternative polyadenylation (APA) is a widespread phenomenon associated with differentiation, proliferation and oncogenesis. With advanced NGS technology, large amounts of polyA site mapping data are becoming available including studies of normal versus cancer cells. The computational analysis of APA is currently limiting progress due to the lack of proper statistical background models and quantitative measurements of changes in use of multiple polyA sites at individual genes.
Therefore, we developed a reproducible computational framework to calculate the statistical significance of differential APA profiles. There are two main tasks in the framework: (1) peak identification and (2) peak comparison. For the first task, we propose to conduct a non-parametric normalization using the LASSO algorithm to penalize peak patterns that imply artifacts. For the second task, we have devised a new method called PSI (PolyA Shifting Index). This method has advantages over the linear trend statistic (Agresti 2002). First, PSI can capture non-linear trends in which changes in the use of polyA sites at intermediate locations are considered. Second, the PSI statistic is not biased toward changes in polyA site choice over a long distance. Using public cancer datasets, we will conduct a comparison study focusing on how statistical significance reveals biological meaning.


P47:
MOLECULAR MODELING STUDY OF ONCOGENIC Y220C MUTATION OF TUMOR SUPPRESSOR P53: STUDY TOWARDS THE RATIONAL DESIGN OF P53 STABILIZING DRUGS

Presenting Author: Nisha Shri, The Tamilnadu Dr.M.G.R. Medical University, Chennai, India

Author(s):
Naresh Chandra, K.K College of Pharmacy, India
Keja Lakshmi, K.K College of Pharmacy, India
Nisha Shri, The Tamilnadu Dr.M.G.R. Medical University, India
Asif Naqvi, BioDiscovery Solutions for future, India

Abstract:
The most frequent genetic alteration in human cancer is mutation of the tumor suppressor p53, which is an essential mediator of the cellular response to oncogenic stresses. The tumor suppressor protein p53 is a transcription factor that plays a key role in the prevention of cancer development. The p53 cancer mutation Y220C induces formation of a cavity on the protein's surface that can accommodate stabilizing small molecules.
On the presumption that binding of a small molecule targeted to the mutational cavity will stabilize the protein, we used virtual screening and molecular docking with the Lamarckian Genetic Algorithm to study the binding of the p53 cancer mutant Y220C with different classes of benzothiazoles.
A total of 3000 molecules, selected on the basis of structural similarity and substructure search of benzothiazole, were studied for binding with p53-Y220C (PDB ID 2X0U). Molecular docking was carried out, and the docking results for the 3000 molecules showed binding energies in the range of -8.86 kcal/mol to -1.27 kcal/mol, with a minimum binding energy of -8.86 kcal/mol. The molecule G91 showed a drug-likeness score of -0.40, with a MolPSA of 56.17 Å² and a MolVol of 315.66 Å³. The MolLogS was -4.11, with a solubility of -4.48 and a drug score of 0.59. The molecule showed no indication of mutagenicity or tumorigenicity, and no indication of irritant or reproductive effects was found.


Our research provides a blueprint for the design of more potent and specific drugs that rescue p53-Y220C. Further in vitro and in vivo study of these molecules is required.


P48:
Scaffold_builder for combining de novo and reference-guided assembly

Presenting Author: Genivaldo Silva, SDSU

Author(s):
T. David Matthews, SDSU, United States
Keri Elkins, SDSU, United States
Elizabeth Dinsdale, SDSU, United States
Robert Edwards, SDSU, United States
Bas Dutilh, Nijmegen Centre for Molecular Life Sciences, Radboud University Nijmegen Medical Centre, Netherlands

Abstract:
Genome sequencing is routine; however, genome assembly remains a challenge despite the computational advances of the last decade. In particular, the abundance of repeat elements in the genome makes it difficult to assemble a single complete sequence. Identical repeats shorter than the average read length will generally be assembled without issue. However, longer repeats such as ribosomal RNA (rrn) operons cannot be accurately assembled using existing tools. The application scaffold_builder was designed to generate scaffolds (super contigs of sequences joined by N-bases) using the homology provided by a closely related reference sequence. The application was evaluated using simulated pyrosequencing reads of the bacterial genomes Escherichia coli 042, Lactobacillus salivarius UCC118 and Salmonella enterica subsp. enterica serovar Typhi str. P-stx-12. Moreover, we sequenced two genomes, Salmonella Typhimurium LT2 G455 and Salmonella Typhimurium SDT1291, and show that scaffold_builder decreases the number of contigs by 69.5% while increasing their average length by 225%. Scaffold_builder was written in Python and can be downloaded from http://edwards.sdsu.edu/research/scaffold_builder.


P49:
A molecular dynamics approach for the analysis of the docking of ligands to buried protein binding sites

Presenting Author: Awantika Singh, University of Arkansas for Medical Sciences and University of Arkansas at Little Rock

Author(s):
Philip Breen, UAMS, United States
Martin Hauer-Jensen, UAMS, United States
Cesar Compadre, UAMS, United States

Abstract:
Awantika Singh (1,2), Philip J. Breen (2), Martin Hauer-Jensen (2), and Cesar M. Compadre (2). (1) UAMS/UALR Joint Bioinformatics Program; (2) Department of Pharmaceutical Sciences, University of Arkansas for Medical Sciences
Molecular docking protocols are widely used in the identification of new prototype compounds in drug development programs. In many cases these methods are successful in identifying interesting leads. However, when the docking cavity is deeply buried inside a protein or is an open-close cavity, they often fail to produce reasonable results. In this study, we used molecular dynamics simulations to explore the binding of ligands to the α-tocopherol transfer protein (ATTP). This protein is responsible for maintaining the plasma levels of α-tocopherol and the other vitamin E analogues. Although a high-resolution X-ray structure of the protein is available, the binding modes predicted by the most popular docking techniques did not correlate well with the experimental binding of the various vitamin analogues. In our approach, molecular dynamics simulations were performed in vacuo with a dielectric constant of 4 using an NVT ensemble. For the analysis, 20 ns production simulations were conducted. A total of 100,000 conformations for each molecule were saved, and the structural, dynamic, and energetic properties were derived from the analyses of these snapshots. The results of the analysis correlated successfully with the experimental results. However, optimization of the approach is necessary for automated analysis of large datasets.


P50:
Using Index-based alignments to determine clade specificity in metagenomic samples

Presenting Author: Ashok Sivakumar, Johns Hopkins University

Author(s):
Benjamin Langmead, Johns Hopkins University, United States

Abstract:
Current taxonomic profiling tools for metagenomic shotgun sequencing often rely upon biomarkers determined a priori from known databases to assign clade-specific abundance levels. In contrast, this work uses index-based alignment results to evaluate similarity across sequences and identifies unique k-mers to establish potential taxonomic profiles. This approach is demonstrated for a random subset of bacterial genomes with multiple levels of clade-specificity.
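The notion of clade-specific unique k-mers can be illustrated with a minimal sketch. It deliberately replaces the index-based alignment used in this work with a plain in-memory hash table, so it shows only the set logic, not the authors' method. The function name, the clade labels, and the toy genomes are hypothetical.

```python
from collections import defaultdict


def clade_specific_kmers(genomes, clades, k):
    """Return {clade: set of k-mers observed only in genomes of that clade}.

    genomes: {genome_name: sequence}
    clades:  {genome_name: clade_label} at whichever taxonomic level is of interest.
    """
    seen_in = defaultdict(set)  # k-mer -> set of clades containing it
    for name, seq in genomes.items():
        for i in range(len(seq) - k + 1):
            seen_in[seq[i:i + k]].add(clades[name])

    specific = defaultdict(set)
    for kmer, owners in seen_in.items():
        if len(owners) == 1:  # k-mer never appears outside one clade
            specific[next(iter(owners))].add(kmer)
    return dict(specific)


# Toy genomes grouped into two invented clades.
toy_genomes = {"g1": "ACGTAC", "g2": "ACGTTT", "g3": "GGGTTT"}
toy_clades = {"g1": "cladeA", "g2": "cladeA", "g3": "cladeB"}
unique_3mers = clade_specific_kmers(toy_genomes, toy_clades, k=3)
```

Re-running the same function with clade labels assigned at a different rank (for example genus instead of family) gives the multiple levels of clade-specificity mentioned in the abstract.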


P51:
Multiple Sequence Alignment Using Motif Assembly

Presenting Author: Charnelle Smoak, North Carolina Agricultural and Technical State University

Author(s):
Albert Esterline, North Carolina A&T State University, United States
Gregory Goins, North Carolina A&T State University, United States
Alex Ropelewski, Pittsburgh Supercomputing Center, United States

Abstract:
Multiple sequence alignments are useful for conducting phylogenetic inference and are used as a means to infer the structure and function of a protein in conjunction with previously obtained knowledge. Current multiple sequence alignment methods rarely produce the best model of sequence evolution, requiring hand-editing of alignments to ensure that important biological motifs are aligned. We devised a multiple sequence alignment pipeline which anchors these motifs and builds a multiple sequence alignment around them, eliminating the need to hand-adjust alignments around key motifs. This method produces alignments superior to those of native alignment algorithms.


P52:
Protein-level annotation of missense mutations in ovarian cancer, acute myeloid leukemia and glioblastoma multiforme

Presenting Author: Janita Thusberg, Buck Institute for Research on Aging

Author(s):
Sean Mooney, Buck Institute for Research on Aging, United States

Abstract:
Heterogeneity of mutation profiles in cancer patients calls for detailed annotation of the putative downstream effects of mutations. A small subset of somatic mutations is expected to function as drivers of tumor progression, and identifying the variants responsible for tumor progression among dozens, or in some cases even hundreds, within a patient is not a trivial task. Protein-level annotation of mutations may reveal functions beyond the analysis of genes enriched in mutations, providing molecular level explanations for the putative role of missense mutations in cancer progression. By elucidating protein-level effects of missense mutations, common patterns among different cancers can be found.
Coding variants identified from exon-capture and whole genome sequencing of ovarian cancer, glioblastoma multiforme, and acute myeloid leukemia samples have been bioinformatically characterized and annotated by a suite of applications to identify variants likely to disrupt molecular function or clinical phenotype and hypothesize their molecular effects. Based on a training set of disease causing mutations and neutral polymorphisms, the program MutPred utilizes a Random Forest method to predict functional variants using proteomic, genomic and bioinformatic attributes. Since these attributes are based on known and predicted protein functional sites, the disrupted function can be quantitatively hypothesized. The MutPred scores and functional attributes will identify somatic variants likely to drive tumor progression. The protein-level interpretations of the impact of mutations extend the frequently mutated gene model used in identification of cancer drivers.


P53:
Systems biology study of automated gene functional annotations reveals highly predictable concepts in various biomedical ontologies.

Presenting Author: Tobias Wittkop, Buck Institute for Research on Aging

Author(s):
Adrian Bivol, Buck Institute for Research on Aging, United States
Darcy Davis, Buck Institute for Research on Aging, United States
Sean Mooney, Buck Institute for Research on Aging, United States

Abstract:
Many tools have been developed for prediction of gene-disease associations or functions of genes and proteins, and this continues to be a highly active area of bioinformatics research. Typically, these methods predict terms, or concepts, from ontologies such as Gene Ontology (GO) or prioritize genes based on a list of known disease genes. These efforts have typically focused on a relatively limited set of terms, focusing on terminologies that only describe functions and diseases, largely overlooking other ontologies that are available. In our previous work we applied advanced string matching (via the NCBO annotator web-service) on highly curated descriptive texts for genes in order to associate genes with terms from over 250 ontologies from the National Center for Biomedical Ontology (NCBO).
Here, we set out to broadly evaluate these novel annotations and identify those concepts of publicly available ontologies that can be predicted using a generalized tool for prediction of annotations. We analyzed hundreds of thousands of terms using the fast and accurate gene prioritization tool GeneMANIA in a systematic cross-validation of all terms of size 5 to 1000 genes. We identified terms that perform better than expected by chance using randomly generated gene sets and show that both manually curated terms in GO and automatically recognized terms can be used to develop reasonable predictive models. In all, we characterize terms in over 250 ontologies and identify more than 127,000 statistically significant terms that can be predicted on human genes.


P54:
Integration of Multiple Genomic Assays in Predicting Breast Cancer Survival

Presenting Author: Howard Yang, National Cancer Institute

Abstract:
We developed a molecular signature by integrating multiple genomic and epigenomic data types. These include DNA copy number, gene expression, microRNA, and methylation data from TCGA breast tumor samples. Our signature consisted of gene, methylation and microRNA markers that were selected based on histopathological data and the information content of the high-dimensional data. Given the background of genome-wide DNA copy number variations, we studied gene regulation through the interactions among gene expression, microRNA expression and DNA methylation. We applied the Kaplan-Meier model to study the association of the signature with overall survival or distant metastasis free survival. We also conducted multivariate analysis using a Cox proportional hazards regression model and showed that our signature had an independent effect in survival prediction after adjusting for variables such as age, ER, tumor size, grade and node. The signature was applied to independent datasets, including GEO dataset GSE6532, NKI, and Breast Tumor 2000, to examine the generalization of the signature in survival prediction. We used principal component analysis to construct a baseline signature using the selected genes or a random set of genes. Our signature consistently outperformed the baseline signature.
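For readers unfamiliar with the Kaplan-Meier model mentioned above, the sketch below computes the standard product-limit survival estimate from follow-up times and event indicators. It illustrates the estimator only; it is not the authors' analysis code, and the function name and the toy follow-up data are invented for illustration.

```python
def kaplan_meier(times, events):
    """Product-limit survival estimate.

    times:  follow-up time for each patient.
    events: 1 (or True) if the event was observed at that time, 0 if censored.
    Returns a list of (time, survival_probability) at each distinct event time.
    """
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        deaths = sum(1 for tt, e in zip(times, events) if tt == t and e)
        leaving = sum(1 for tt in times if tt == t)  # events plus censored at t
        if deaths:
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve


# Toy cohort: four patients, the second one censored at time 2.
toy_curve = kaplan_meier([1, 2, 3, 4], [1, 0, 1, 1])
```

To compare patients with high and low signature scores, one such curve would be computed per group; a formal comparison would additionally need a test such as the log-rank test, which is not shown here.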


P55:
Gene Content Based Functional Trait Analysis through Comparative Microbial Whole Genome Sequences

Presenting Author: Erliang Zeng, University of Notre Dame

Author(s):
Wei Zhang, University of Notre Dame, United States
Dan Liu, University of Notre Dame, United States
Stuart Jones, University of Notre Dame, United States
Scott Emrich, University of Notre Dame, United States

Abstract:
Microbial communities perform many important ecological functions across a wide range of natural and man-made environments. Recently, trait-based approaches for microbial communities have been identified as an emerging area because the increasing availability of whole genome sequences has provided a growing opportunity to explore the genetic foundations of a variety of ecologically important traits. We proposed a machine learning framework to quantitatively link genomic features with functional traits. Using genes from bacterial genomes grouped into Clusters of Orthologous Groups (COGs) as starting features, the TF-IDF technique from the text mining domain was applied to account for the abundance and importance of each COG. After TF-IDF processing, COGs were ranked using different feature selection methods to help identify their relevance to the functional traits. Network analysis of the top-ranked COGs showed that functional trait related COGs were enriched and modularized for the trait of interest. The use of comparative whole genome sequence data allows functional traits to be connected with genomic contexts. Our extensive experimental results demonstrate that functional traits of interest can be detected using our approach. Significantly, this method has the potential to provide novel biological insights in ecological contexts, not only by connecting genes to traits but also by looking at combinations of these genes and traits in increasingly available whole genome data.
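
The TF-IDF step can be pictured by treating each genome as a "document" and each COG as a "word". The sketch below uses one common TF-IDF variant (relative frequency times the natural log of the inverse genome frequency); the abstract does not state which variant was used, and the function name `tfidf`, the genome labels, and the COG identifiers are placeholders.

```python
import math


def tfidf(counts):
    """counts: {genome: {cog_id: gene copies}} -> {genome: {cog_id: weight}}."""
    n_genomes = len(counts)
    # Document frequency: number of genomes that contain each COG
    df = {}
    for cogs in counts.values():
        for cog, c in cogs.items():
            if c > 0:
                df[cog] = df.get(cog, 0) + 1
    weights = {}
    for genome, cogs in counts.items():
        total = sum(cogs.values())
        # Term frequency (share of the genome's COG-assigned genes) times IDF
        weights[genome] = {
            cog: (c / total) * math.log(n_genomes / df[cog])
            for cog, c in cogs.items()
            if c > 0
        }
    return weights


# Toy, made-up gene counts; identifiers are placeholders
toy = {
    "genomeA": {"COG0001": 2, "COG0002": 1},
    "genomeB": {"COG0001": 1},
    "genomeC": {"COG0001": 3, "COG0003": 1},
}
print(tfidf(toy)["genomeA"])
```

A COG present in every genome gets weight zero under this variant, which is the intended effect: ubiquitous housekeeping COGs carry no information about a trait that varies between genomes.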


P56:
PePr: a Peak-calling and prioritization pipeline to test group differences in ChIP-Seq data

Presenting Author: Yanxiao Zhang, University of Michigan

Author(s):
Yu-Hsuan Lin, University of Michigan, United States
Timothy Johnson, University of Michigan, United States
Laura Rozek, University of Michigan, United States
Jaeseok Han, University of Michigan, United States
Maureen Sartor, University of Michigan, United States

Abstract:
ChIP-Seq is now the standard method to identify genome-wide DNA-binding sites for transcription factors and histone modifications. As use of this technique grows, there is a growing need to analyze experiments with biological replicates, especially for epigenomic experiments where variation among biological samples is high. A substantial number of peak-calling programs exist to automatically identify binding sites from ChIP-Seq data; however, to our knowledge, no existing program takes into account variance among biological replicates. Even with transcription factor experiments, accounting for variation has the potential to identify binding sites that are more consistently functionally relevant in the context under study.
With this in mind, we developed a novel ChIP-Seq Peak-calling and Prioritization pipeline (PePr) that uses a sliding window approach and models read counts across replicates and between groups with a negative binomial distribution. PePr empirically estimates the optimal shift/fragment size and sliding window width, and estimates dispersion from the local genomic area. Regions with less variability across replicates are ranked more favorably than regions with greater variability. We compared the performance of PePr to commonly used existing methods on both transcription factor and histone modification datasets containing biological replicates. With the transcription factor ChIP-Seq data, PePr outperformed other methods in both motif analysis and visual inspection of the peaks that are uniquely identified or ranked much higher by each method. For the histone modification data, which exhibited significantly higher variation among samples, PePr achieved better specificity than alternative approaches by identifying regions that are more consistently differential between the two sample groups.
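
The two ingredients named above, sliding-window read counts and a negative binomial dispersion, can be sketched as follows. This is not PePr's implementation: the function names, the fixed window and step sizes, and the method-of-moments estimator (solving var = mu + phi * mu^2 for phi) are assumptions chosen to keep the illustration short.

```python
def window_counts(read_starts, window, step, length):
    """Count read start positions falling in each sliding window [s, s + window)."""
    starts = range(0, max(length - window, 0) + 1, step)
    return [sum(1 for r in read_starts if s <= r < s + window) for s in starts]


def nb_dispersion(counts):
    """Method-of-moments dispersion phi for var = mu + phi * mu**2.

    counts : read counts for one window across at least two replicates.
    Returns 0.0 when the counts are no more variable than a Poisson model.
    """
    n = len(counts)
    mu = sum(counts) / n
    if mu == 0:
        return 0.0
    var = sum((c - mu) ** 2 for c in counts) / (n - 1)
    return max((var - mu) / mu ** 2, 0.0)


# Toy, made-up read positions and replicate counts
print(window_counts([1, 2, 10, 11, 12, 25], window=10, step=5, length=30))
print(nb_dispersion([10, 30]))
```

A larger dispersion means the replicates disagree more in that window, which is the kind of region the abstract says is ranked less favorably.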


P57:
Analysis of Genome Wide DNA Methylation Changes in Human Skeletal Muscle Associated with Aging

Presenting Author: Artem Zykovich, Buck Institute for Research on Aging

Author(s):
Alan Hubbard, University of California Berkeley, United States
Mark Tarnopolsky, McMaster University, United States
Simon Melov, Buck Institute for Research on Aging, United States
Sean Mooney, Buck Institute for Research on Aging, United States

Abstract:
Aging is accompanied by a reduction of muscle mass and function, and leads to a decrease in mobility. Here we report for the first time a genome-wide study of DNA methylation dynamics in skeletal muscle derived from healthy individuals during normal human aging. Biopsies of 24 young and 24 older males were taken, and DNA was extracted, labeled, and applied to the 450K Illumina methylation array. We identified 5963 CpG sites that are differentially methylated with age (dmCpG). Of these, 92% were hypermethylated (became more methylated with age), while the remaining 8% were hypomethylated (became less methylated with age). These dmCpG sites were significantly enriched within genes. Structurally, dmCpG sites were underrepresented in promoters and overrepresented in the middle and 3’ end of genes. We found that dmCpG sites tend to lie in CpG islands (CGI) and CGI shores, but only if they are located within a gene body and do not overlap with a promoter. We found that the majority of genes whose methylation was altered with age are characterized by unchanged expression. For genes that have changes in both methylation and expression, we found that the majority of hypomethylated genes had an increase in gene expression, and for hypermethylated genes the opposite was true. This result suggests that some minimal number of dmCpG sites or select sites are required to be altered in order to correlate with gene expression. Finally, we identified over 200 biomarkers that near perfectly predict human biological age.


P58:
Deletion of COX-2 gene augments atherosclerosis and vascular inflammation in ApoE-deficient mice, independently of local prostacyclin production

Presenting Author: Nicholas Kirkby, Imperial College London

Author(s):
Martina Lundberg, Imperial College London, United Kingdom
Tim Warner, Barts & the London School of Medicine, United Kingdom
Mark Paul-Clark, Imperial College London, United Kingdom
Jane Mitchell, Imperial College London, United Kingdom

Abstract:
Widely used COX-2 inhibitor drugs (e.g. Vioxx, Aleve, Voltaren) are associated with an increase in atherothrombotic events in man, which is often said to reflect reduced vascular production of the cardioprotective hormone, prostacyclin. Here we attempted to model this by determining the effect of COX-2 deletion in atherosclerosis-prone ApoE-null mice. Deletion of COX-2 in ApoE-deficient mice increased atherosclerotic lesion development, but did not alter the ability of affected vessels to synthesise prostacyclin. In order to identify novel pathways by which COX-2 at other sites might regulate atherosclerotic disease, we performed transcriptomic analysis of tissues displaying high (lung) and low (liver) COX-2 gene expression. These experiments identified 91 (lung) and 94 (liver) genes differentially expressed in COX-2-deficient animals. Ingenuity pathway analysis of these genes demonstrated a significant up-regulation of pathways associated with T- and B-lymphocyte function in both tissues. To validate these findings, we measured circulating inflammatory cell counts and cytokine levels, and found that deletion of COX-2 increased lymphocyte counts and the lymphocyte-related cytokines, IL-2 and IL-12. In addition, immunohistochemistry indicated that COX-2 deletion resulted in accumulation of T-lymphocytes in the adventitial layer of atherosclerotic vessels. Taken together, these data demonstrate that COX-2 normally acts to limit atherosclerotic disease, independent of local prostacyclin synthesis, and suggest a novel mechanism by which COX-2 may influence atherosclerotic disease at distant sites by regulation of lymphocyte function. Using studies of this type, we hope to identify the mechanisms of cardiovascular toxicity of COX-2 inhibitors, in order to develop a new generation of therapies.

