19th Annual International Conference on
Intelligent Systems for Molecular Biology and
10th European Conference on Computational Biology

Proceedings Track Presentations

Presenting Author identified in bold


PT01 Sunday, July 17: 10:45 a.m. - 11:10 a.m.
Sorting the nuclear proteome

Room: Hall L/M
Subject: Protein Structure and Function

Author(s):
Denis Bauer, The University of Queensland, Australia
Kai Willadsen, The University of Queensland, Australia
Fabian Buske, The University of Queensland, Australia
Kim-Anh Le Cao, The University of Queensland, Australia
Timothy Bailey, The University of Queensland, Australia
Graham Dellaire, Dalhousie University, Canada
Mikael Boden, The University of Queensland, Australia

Session Chair: Anna Tramontano

Motivation: Quantitative experimental analyses of the nuclear interior reveal a morphologically structured yet dynamic mix of membraneless compartments. Major nuclear events depend on the functional integrity and timely assembly of these intra-nuclear compartments. As-yet-unknown drivers of protein mobility ensure that proteins are in the right place when they are needed. Results: This study investigates determinants of associations between eight intra-nuclear compartments and their proteins in heterogeneous genome-wide data. We develop a model based on a range of candidate determinants, capable of mapping the intra-nuclear organisation of proteins. The model integrates protein interactions, protein domains, post-translational modification sites and protein sequence data. The predictions of our model are accurate, with a mean AUC (over all compartments) of 0.71. We present a complete map of the association of 3567 mouse nuclear proteins with intra-nuclear compartments. Each decision is explained in terms of essential interactions and domains, and qualified with a false discovery assessment. Using this resource, we uncover the collective role of transcription factors in each of the compartments. We create diagrams illustrating the outcomes of a Gene Ontology enrichment analysis. The analysis suggests that PML bodies, which are associated with an extensive range of transcription factors, coordinate regulatory immune responses.


PT02 Sunday, July 17: 10:45 a.m. - 11:10 a.m.
vipR: Variant identification in pooled DNA using R

Room: A353
Subject: Sequence Analysis

Author(s):
Andre Altmann, Max Planck Institute of Psychiatry, Germany
Peter Weber, Max Planck Institute of Psychiatry, Germany
Carina Quast, Max Planck Institute of Psychiatry, Germany
Monika Rex-Haffner, Max Planck Institute of Psychiatry, Germany
Elisabeth B. Binder, Max Planck Institute of Psychiatry, Germany
Bertram Müller-Myhsok, Max Planck Institute of Psychiatry, Germany

Session Chair: John Kececioglu

Motivation: High-throughput sequencing (HTS) technologies are the method of choice for screening the human genome for rare sequence variants causing susceptibility to complex diseases. Unfortunately, preparation of samples for a large number of individuals is still very cost- and labor-intensive. Thus, screens for rare sequence variants have recently been carried out in samples of pooled DNA, in which equimolar amounts of DNA from multiple individuals are mixed prior to sequencing with HTS. The resulting sequence data, however, pose a bioinformatics challenge: the discrimination of sequencing errors from real sequence variants present at a low frequency in the DNA pool. Results: Our method vipR uses data from multiple DNA pools in order to compensate for differences in sequencing error rates along the sequenced region. More precisely, instead of aiming to discriminate sequence variants from sequencing errors, vipR identifies sequence positions that exhibit significantly different minor allele frequencies in at least two DNA pools using the Skellam distribution. The performance of vipR was compared to three other models on data from a targeted re-sequencing study of the TMEM132D locus in 600 individuals distributed over four DNA pools. Performance of the methods was computed on SNPs that were also genotyped individually using a MALDI-TOF technique. On a set of 82 sequence variants, vipR achieved an average sensitivity of 0.80 at an average specificity of 0.92, thus outperforming the reference methods by at least 0.17 in specificity at comparable sensitivity. Availability: The code of vipR is freely available via http://sourceforge.net/projects/htsvipr/
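To make the pooled-comparison idea concrete, here is a minimal Python sketch that scores a single position using the Skellam distribution (via scipy); the depths, counts and error rate are invented values, and this is not the vipR code itself, which is distributed as an R package.

```python
# Sketch of a Skellam-based test for differing minor-allele frequencies in
# two DNA pools, in the spirit of vipR (the real tool is an R package; the
# counts and error rate below are made-up illustrative values).
from scipy.stats import skellam

def pool_difference_pvalue(minor1, depth1, minor2, depth2, error_rate=0.005):
    """Two-sided p-value for the difference of minor-allele counts in two
    pools, modeled as the difference of two Poisson counts (Skellam).
    Under the null, both counts arise from the shared sequencing error rate."""
    mu1 = depth1 * error_rate        # expected error-driven minor count, pool 1
    mu2 = depth2 * error_rate        # expected error-driven minor count, pool 2
    diff = minor1 - minor2
    # Skellam distribution of (Poisson(mu1) - Poisson(mu2)).
    p_hi = skellam.sf(abs(diff) - 1, mu1, mu2)   # P(D >= |diff|)
    p_lo = skellam.cdf(-abs(diff), mu1, mu2)     # P(D <= -|diff|)
    return min(1.0, p_hi + p_lo)

if __name__ == "__main__":
    # Position covered 4000x in each pool; 60 vs 18 minor-allele reads.
    print(pool_difference_pvalue(60, 4000, 18, 4000))
```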


PT03 Sunday, July 17: 11:15 a.m. - 11:40 a.m.
Environment specific substitution tables improve membrane protein alignment

Room: Hall L/M
Subject: Protein Structure and Function

Author(s):
Jamie R. Hill, University of Oxford, United Kingdom
Sebastian Kelm, University of Oxford, United Kingdom
Jiye Shi, UCB Celltech, United Kingdom
Charlotte M. Deane, University of Oxford, United Kingdom

Session Chair: Anna Tramontano

Motivation: Membrane proteins are both abundant and important in cells, but the small number of solved structures restricts our understanding of them. Here we consider whether membrane proteins undergo different substitutions from their soluble counterparts and whether these can be used to improve membrane protein alignments, and therefore improve prediction of their structure. Results: We construct environment-specific substitution tables for membrane proteins. As data is scarce, we develop a general metric to assess the quality of asymmetric substitution tables. Membrane proteins show markedly different substitution preferences from soluble proteins. For example, substitution preferences in lipid tail-contacting parts of membrane proteins are found to be distinct from all environments in soluble proteins, including buried residues. From a principal component analysis of the tables, the first component can be broadly identified as a measure of hydrophobicity, whilst the second relates to secondary structure. We demonstrate the use of our tables in sequence-to-structure alignments of membrane proteins using the FUGUE alignment program. On average, in the 10-25% sequence identity range, alignments are improved by 28 correctly aligned residues compared with the default FUGUE tables. Coordinate generation from our alignments yields improved structure models. Availability: Substitution tables are available at http://www.stats.ox.ac.uk/proteins/resources.
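The following is a small, hedged Python sketch of how an environment-specific substitution table can be tabulated as log-odds scores from aligned residue pairs; the environment labels, pairs and pseudocount are illustrative and do not reproduce the authors' tables.

```python
# Minimal sketch of building an environment-specific substitution table as
# log-odds scores from counted substitutions.  The environments, counts and
# pseudocount below are illustrative, not those used by the authors.
import numpy as np

AA = "ACDEFGHIKLMNPQRSTVWY"
IDX = {a: i for i, a in enumerate(AA)}

def substitution_log_odds(aligned_pairs, pseudocount=1.0):
    """aligned_pairs: iterable of (env, template_aa, target_aa) tuples taken
    from structure-based alignments, e.g. env = 'lipid-tail-contacting'."""
    tables = {}
    for env, a, b in aligned_pairs:
        counts = tables.setdefault(env, np.full((20, 20), pseudocount))
        counts[IDX[a], IDX[b]] += 1
    log_odds = {}
    for env, counts in tables.items():
        joint = counts / counts.sum()                      # P(a, b | env)
        expected = np.outer(joint.sum(1), joint.sum(0))    # P(a|env) P(b|env)
        log_odds[env] = np.log2(joint / expected)          # asymmetric table
    return log_odds

pairs = [("lipid-tail", "L", "I"), ("lipid-tail", "V", "L"),
         ("aqueous", "D", "E"), ("aqueous", "K", "R")]
tables = substitution_log_odds(pairs)
print(tables["lipid-tail"][IDX["L"], IDX["I"]])
```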


PT04 Sunday, July 17: 11:15 a.m. - 11:40 a.m.
IPknot: fast and accurate prediction of RNA secondary structures with pseudoknots using integer programming

Room: A353
Subject: Sequence Analysis

Author(s):
Kengo Sato, University of Tokyo, Japan
Yuki Kato, Nara Institute of Science and Technology, Japan
Michiaki Hamada, University of Tokyo, Japan
Tatsuya Akutsu, Kyoto University, Japan
Kiyoshi Asai, University of Tokyo, Japan

Session Chair: John Kececioglu

Motivation: Pseudoknots found in secondary structures of a number of functional RNAs play various roles in biological processes. Recent methods for predicting RNA secondary structures cover certain classes of pseudoknotted structures, but only a few of them achieve satisfying predictions in terms of both speed and accuracy. Results: We propose IPknot, a novel computational method for predicting RNA secondary structures with pseudoknots based on maximizing expected accuracy of a predicted structure. IPknot decomposes a pseudoknotted structure into a set of pseudoknot-free substructures and approximates a base-pairing probability distribution that considers pseudoknots, leading to the capability of modeling a wide class of pseudoknots and running quite fast. In addition, we propose a heuristic algorithm for refining base-pairing probabilities to improve the prediction accuracy of IPknot. The problem of maximizing expected accuracy is solved by using integer programming with threshold cut. We also extend IPknot so that it can predict the consensus secondary structure with pseudoknots when a multiple sequence alignment is given. IPknot is validated through extensive experiments on various data sets, showing that IPknot achieves better prediction accuracy and faster running time as compared with several competitive prediction methods. Availability: The IPknot program is available at http://www.ncrna.org/software/ipknot/. IPknot is also available as a web server at http://rna.naist.jp/ipknot/.


PT05 Sunday, July 17: 11:45 a.m. - 12:10 p.m.
Sequence-based prediction of protein crystallization, purification, and production propensity

Room: Hall L/M
Subject: Protein Structure and Function

Author(s):
Marcin J. Mizianty, University of Alberta, Canada
Lukasz Kurgan, University of Alberta, Canada

Session Chair: Anna Tramontano

Motivation: X-ray crystallography-based protein structure determination, which accounts for the majority of solved structures, is characterized by relatively low success rates. One solution is to build tools which support selection of targets that are more likely to crystallize. Several in-silico methods that predict the propensity of diffraction-quality crystallization from protein chains were developed. We show that their predictive quality deteriorates over time, which calls for new solutions. We propose a novel approach that alleviates drawbacks of the existing methods by using a recent dataset and an improved protocol to annotate progress along the crystallization process, by predicting the success of the entire process and the steps which result in failed attempts, and by utilizing a compact and comprehensive set of sequence-derived inputs to generate accurate predictions. Results: The proposed PPCpred predicts the propensity for production of diffraction-quality crystals, production of crystals, purification, and production of the protein material. PPCpred utilizes comprehensive inputs based on energy and hydrophobicity indices, composition of certain amino acid types, predicted disorder, secondary structure and solvent accessibility, and content of certain buried and exposed residues. Our method significantly outperforms alignment-based predictions and several modern crystallization propensity predictors. ROC curves show that PPCpred is particularly useful for users who desire high TP-rates, i.e., a low rate of mispredictions for solvable chains. Our model reveals several intuitive factors that influence the success of individual steps and the entire crystallization process, including the content of Cys, buried His and Ser, hydrophobic/hydrophilic segments, and the number of disordered segments. Availability: http://biomine.ece.ualberta.ca/PPCpred/ Supplementary information: http://biomine.ece.ualberta.ca/PPCpred/Supplement.pdf
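As a loose illustration of the kind of sequence-derived inputs such predictors rely on, the short Python sketch below computes a toy feature vector (amino acid composition, mean Kyte-Doolittle hydropathy, Cys content); the feature set is illustrative and is not PPCpred's.

```python
# Sketch of a toy sequence-derived feature vector: amino-acid composition,
# mean hydropathy and cysteine content.  Illustrative only; PPCpred's actual
# inputs and classifier are considerably richer.
from collections import Counter

KD = {"A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5,
      "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9,
      "M": 1.9, "F": 2.8, "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9,
      "Y": -1.3, "V": 4.2}   # Kyte-Doolittle hydropathy scale

def features(seq):
    seq = seq.upper()
    counts = Counter(seq)
    n = len(seq)
    composition = [counts.get(a, 0) / n for a in sorted(KD)]   # 20 fractions
    mean_hydropathy = sum(KD.get(a, 0.0) for a in seq) / n
    cys_content = counts.get("C", 0) / n                       # flagged as informative in the abstract
    return composition + [mean_hydropathy, cys_content]

print(len(features("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQ")))
```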


PT06 Sunday, July 17: 11:45 a.m. - 12:10 p.m.
Meta-IDBA: De Novo Assembler for Mixed Species in Metagenomic Data

Room: A353
Subject: Sequence Analysis

Author(s):
Yu Peng, The University of Hong Kong, Hong Kong
Henry C.M. Leung, The University of Hong Kong, Hong Kong
S.M. Yiu, The University of Hong Kong, Hong Kong
Francis Chin, The University of Hong Kong, Hong Kong

Session Chair: John Kececioglu

Motivation: Next-generation sequencing techniques allow us to sequence reads from a microbial environment in order to analyze the microbial community. However, assembling mixed reads from different species into contigs is a bottleneck of metagenomic research. Although there are many assemblers for assembling reads from a single genome, there is no assembler for assembling reads in metagenomic data without a reference genome sequence. Moreover, the performance of these assemblers on metagenomic data is poor, because common regions in the genomes of subspecies and species make the assembly problem much more complicated. Results: In this paper, we introduce the Meta-IDBA algorithm for assembling reads in metagenomic data that contain multiple genomes from different species. Meta-IDBA can separate reads from different species based on certain properties of the de Bruijn graph. It can also capture slight variants of the genomes of subspecies from the same species by multiple alignment and represents the genome of each species by a consensus sequence. Comparison of the performance of Meta-IDBA with the existing assemblers Velvet and Abyss on different metagenomic data sets shows that Meta-IDBA can reconstruct longer contigs. Availability: The Meta-IDBA toolkit is available at our website http://www.cs.hku.hk/~alse/metaidba.
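For readers unfamiliar with the underlying data structure, the toy Python sketch below builds the k-mer de Bruijn graph of a handful of reads; Meta-IDBA's graph partitioning, multiple alignment and consensus steps are not reproduced here.

```python
# Toy de Bruijn graph construction from reads, to illustrate the data
# structure Meta-IDBA operates on (the real assembler's graph handling,
# multiple alignment and consensus steps are far more involved).
from collections import defaultdict

def de_bruijn_graph(reads, k=21):
    """Return adjacency and coverage maps of the k-mer de Bruijn graph:
    nodes are (k-1)-mers, edges are k-mers observed in the reads."""
    adj = defaultdict(set)
    coverage = defaultdict(int)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            adj[kmer[:-1]].add(kmer[1:])
            coverage[kmer] += 1
    return adj, coverage

reads = ["ATGGCGTGCAATGCCGTACGATCGT", "GCGTGCAATGCCGTACGATCGTTTA"]
adj, cov = de_bruijn_graph(reads, k=11)
print(len(adj), "nodes;", sum(len(v) for v in adj.values()), "edges")
```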


PT07 Sunday, July 17: 12:15 p.m. - 12:40 p.m.
A method for probing the mutational landscape of amyloid structure

Room: Hall L/M
Subject: Protein Structure and Function

Author(s):
Charles W. O'Donnell, Massachusetts Institute of Technology, United States
Jerome Waldispuhl, McGill University, Canada
Mieszko Lis, Massachusetts Institute of Technology, United States
Randal Halfmann, Whitehead Institute for Biomedical Research, United States
Srinivas Devadas, Massachusetts Institute of Technology, United States
Susan Lindquist, Whitehead Institute for Biomedical Research, United States
Bonnie Berger, Massachusetts Institute of Technology, United States

Session Chair: Anna Tramontano

Motivation: Proteins of all kinds can self-assemble into highly-ordered beta-sheet aggregates known as amyloid fibrils, important both biologically and clinically. However, the specific molecular structure of a fibril can vary widely depending on sequence and environmental conditions — even single point mutations can drastically alter function and pathogenicity. Unfortunately, experimental structure determination has proven extremely difficult with only a handful of NMR-based models proposed, suggesting a need for computational methods. Results: We present AmyloidMutants, a statistical mechanics approach for de novo prediction and analysis of wild-type and mutant amyloid structures. Based on the premise of protein mutational landscapes, AmyloidMutants energetically quantifies the effects of sequence mutation on fibril conformation and stability. Tested on full-length, non-mutant amyloid structures with known NMR chemical shift data, AmyloidMutants provides a two-fold improvement in prediction accuracy over existing tools. Moreover, AmyloidMutants is the only method to predict complete super-secondary structures, enabling accurate discrimination of topologically-dissimilar amyloid conformations that correspond to the same sequence locations. Applied to mutant prediction, AmyloidMutants identifies a global conformational switch between ABeta and its highly-toxic “Iowa” mutant in agreement with a recent experimental model based on partial chemical shift data. Predictions on mutant, yeast-toxic strains of HET-s suggest similar alternate folds. When applied to wild-type HET-s and a HET-s mutant with core asparagines replaced by glutamines (both highly-amyloidogenic chemically-similar residues abundant in many amyloids), AmyloidMutants surprisingly predicts a dramatically reduced capacity of the glutamine mutant to form amyloid. We confirm this finding by conducting mutagenesis experiments.


PT08 Sunday, July 17: 12:15 p.m. - 12:40 p.m.
A Conditional Random Fields Method for RNA Sequence-Structure Relationship Modeling and Conformation Sampling

Room: A353
Subject: Sequence Analysis

Author(s):
Zhiyong Wang, Toyota Technological Institute at Chicago, United States
Jinbo Xu, Toyota Technological Institute at Chicago, United States

Session Chair: John Kececioglu

Accurate tertiary structures are very important for the functional study of non-coding RNA molecules. However, predicting RNA tertiary structures is extremely challenging because of the large conformation space to be explored and the lack of an accurate scoring function differentiating the native structure from decoys. Fragment-based conformation sampling methods (e.g., FARNA) suffer from the shortcoming that a fragment library of limited size cannot represent all possible conformations well. A recent dynamic Bayesian network method, BARNACLE, overcomes the issue of fragment assembly. In addition, neither of these methods makes use of sequence information in sampling conformations. Here, we present a new probabilistic graphical model, Conditional Random Fields (CRFs), to model the RNA sequence-structure relationship, which enables us to accurately estimate the probability of an RNA conformation from sequence. Coupled with a novel tree-guided sampling scheme, our CRF model is then applied to RNA conformation sampling. Experimental results show that our CRF method can model the RNA sequence-structure relationship well and that sequence information is important for conformation sampling. Our method, named TreeFolder, generates a much higher percentage of native-like decoys than FARNA and BARNACLE, although we use the same simple energy function as BARNACLE.


PT09 Sunday, July 17: 2:30 p.m. - 2:55 p.m.
Piecewise Linear Approximation of Protein Structures using the Principle of Minimum Message Length

Room: Hall L/M
Subject: Protein Structure and Function

Author(s):
Arun Konagurthu, Monash University, Australia
Lloyd Allison, Monash University, Australia
Peter Stuckey, University of Melbourne, Australia
Arthur Lesk, Pennsylvania State University, United States

Session Chair: Yves Moreau

Simple and concise representations of protein folding patterns provide powerful abstractions for visualizations, comparisons, classifications, searching and aligning structural data. Structures are often abstracted by replacing standard secondary structural features -- that is, helices and strands of sheet -- by vectors or linear segments. Relying solely on standard secondary structure may result in a significant loss of structural information. Further, traditional methods of simplification crucially depend on the consistency and accuracy of external methods to assign secondary structures to protein coordinate data. Although many methods exist to identify secondary structure automatically, the imprecision of definitions, along with errors and inconsistencies in experimental structure data, drastically limits their applicability for generating reliable simplified representations, especially for structural comparison. This paper introduces a mathematically rigorous algorithm to delineate protein structure using the elegant statistical and inductive inference framework of Minimum Message Length (MML). Our method generates consistent and statistically robust piecewise linear explanations of protein coordinate data, resulting in a powerful and concise representation of the structure. The delineation is completely independent of the hydrogen-bonding patterns and local substructural geometry that current methods rely on. Indeed, as is common with applications of the MML criterion, this method is free of parameters and thresholds, in striking contrast to existing programs, which are often beset by them. The analysis of results over a large number of proteins suggests that the method produces consistent delineations of structure that encompass, among others, the segments corresponding to standard secondary structure.


PT10 Sunday, July 17: 2:30 p.m. - 2:55 p.m.
Discovering and visualising indirect associations between biomedical concepts

Room: A353
Subject: Text Mining

Author(s):
Yoshimasa Tsuruoka, Japan Advanced Institute of Science and Technology, Japan
Makoto Miwa, The University of Tokyo, Japan
Kaisei Hamamoto, The National Centre for Text Mining, Japan
Jun'ichi Tsujii, The University of Tokyo; The National Centre for Text Mining; The University of Manchester, Japan
Sophia Ananiadou, The University of Manchester, United Kingdom

Session Chair: Eric Xing

Motivation: Discovering useful associations between biomedical concepts has been one of the main goals in biomedical text mining, and understanding their biomedical contexts is crucial in the discovery process. Hence, we need a text mining system that helps users explore various types of (possibly hidden) associations in an easy and comprehensible manner. Results: This paper describes FACTA+, a real-time text mining system for finding and visualising indirect associations between biomedical concepts from MEDLINE abstracts. The system can be used as a text search engine like PubMed with additional features to help users discover and visualise indirect associations between important biomedical concepts such as genes, diseases, and chemical compounds. FACTA+ inherits all functionality from its predecessor, FACTA, and extends it by incorporating three new features: (a) detecting bio-molecular events in text using a machine learning model, (b) discovering hidden associations using co-occurrence statistics between concepts, and (c) visualising associations to improve the interpretability of the output. To the best of our knowledge, FACTA+ is the first real-time web application that offers the functionality of finding concepts involving bio-molecular events and visualising indirect associations of concepts with both their categories and importance. Availability: FACTA+ is available as a web application at http://refine1-nactem.mc.man.ac.uk/facta/, and its visualiser is available at http://refine1-nactem.mc.man.ac.uk/facta-visualizer/.
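A minimal Python sketch of the indirect-association idea follows, scoring an A-C association by the concepts that co-occur with both A and C; the documents and the scoring rule are illustrative and much simpler than the statistics used by FACTA+.

```python
# Sketch of indirect-association discovery via shared intermediate concepts,
# in the spirit of FACTA+ (the real system uses more elaborate co-occurrence
# statistics and event extraction; the documents below are made up).
from collections import defaultdict
from itertools import combinations

docs = [{"TP53", "apoptosis", "cancer"},
        {"TP53", "MDM2", "apoptosis"},
        {"MDM2", "cancer", "chemotherapy"}]

cooc = defaultdict(int)
for concepts in docs:
    for a, b in combinations(sorted(concepts), 2):
        cooc[(a, b)] += 1

def direct(a, b):
    return cooc.get(tuple(sorted((a, b))), 0)

def indirect_score(a, c):
    """Score an indirect A-C association by summing evidence through every
    intermediate concept B that co-occurs with both A and C."""
    intermediates = {x for pair in cooc for x in pair} - {a, c}
    return sum(min(direct(a, b), direct(b, c)) for b in intermediates)

print(indirect_score("TP53", "chemotherapy"))   # linked only via MDM2 / cancer
```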


PT11 Sunday, July 17: 3:00 p.m. - 3:25 p.m.
QAARM: Quasi-anharmonic auto-regressive model reveals molecular recognition pathways in ubiquitin

Room: Hall L/M
Subject: Protein Structure and Function

Author(s):
Andrej Savol, University of Pittsburgh, United States
Virginia Burger, University of Pittsburgh, United States
Pratul Agarwal, Oak Ridge National Laboratory, United States
Arvind Ramanathan, Oak Ridge National Laboratory, United States
Chakra Chennubhotla, University of Pittsburgh, United States

Session Chair: Yves Moreau

Motivation: Molecular dynamics (MD) simulations have dramatically improved the atomistic understanding of protein motions, energetics, and function. These growing data sets have necessitated a corresponding emphasis on trajectory analysis methods for characterizing simulation data, particularly since functional protein motions and transitions are often rare and/or intricate events. Observing that such events give rise to long-tailed spatial distributions, we recently introduced a higher-order statistics based dimensionality reduction method, called quasi-anharmonic analysis (QAA), for identifying biophysically-relevant reaction coordinates and substates within MD simulations. Lacking a predictive component, QAA is extended here within a general auto-regressive (AR) model that accounts for the trajectory's temporal dependencies and the specific, local dynamics accessible to a protein within identified energy wells. Within a QAA-derived subspace, these metastable states and their transition rates are extracted using hierarchical Markov clustering and provide parameter sets for the second-order AR model. We show the learned model can be extrapolated to synthesize trajectories of arbitrary length and, by relating local fluctuations to damped oscillators, that compact linear models can exploit the statistical regularities of protein dynamics. Results: Our model uses hierarchical clustering to learn key sub-states and dynamic modes of motion from a 0.5 μs ubiquitin simulation. Auto-regressive modeling within and between states enables a compact and generative description of the conformational landscape as it relates to functional transitions between binding poses.
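The sketch below illustrates only the auto-regressive component: fitting a second-order vector AR model to a low-dimensional trajectory projection with numpy and synthesizing new frames. The QAA projection and the Markov-clustered substates are omitted, and the input is random stand-in data.

```python
# Sketch of the auto-regressive part of the approach: fit a second-order
# vector AR model to a low-dimensional projection of an MD trajectory and
# synthesize new frames.  The QAA projection and clustered substates are
# omitted; `proj` below is random stand-in data, not a real trajectory.
import numpy as np

rng = np.random.default_rng(0)
proj = np.cumsum(rng.normal(size=(5000, 3)), axis=0)   # T frames x d dimensions

def fit_ar2(x):
    """Least-squares fit of x_t = A1 x_{t-1} + A2 x_{t-2} + c + noise."""
    past = np.hstack([x[1:-1], x[:-2], np.ones((len(x) - 2, 1))])
    target = x[2:]
    coef, *_ = np.linalg.lstsq(past, target, rcond=None)
    resid = target - past @ coef
    return coef, np.cov(resid.T)

def synthesize(coef, noise_cov, x0, x1, steps, rng):
    """Extrapolate a trajectory of arbitrary length from the learned model."""
    d = len(x0)
    out = [np.asarray(x0, float), np.asarray(x1, float)]
    for _ in range(steps):
        past = np.concatenate([out[-1], out[-2], [1.0]])
        nxt = past @ coef + rng.multivariate_normal(np.zeros(d), noise_cov)
        out.append(nxt)
    return np.array(out)

coef, noise_cov = fit_ar2(proj)
synthetic = synthesize(coef, noise_cov, proj[-2], proj[-1], 1000, rng)
print(synthetic.shape)
```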


PT12 Sunday, July 17: 3:00 p.m. - 3:25 p.m.
MeSH: A Window into Full-Text for Document Summarization

Room: A353
Subject: Text Mining

Author(s):
Sanmitra Bhattacharya, The University of Iowa, United States
Viet Ha-Thuc, The University of Iowa, United States
Padmini Srinivasan, The University of Iowa, United States

Session Chair: Eric Xing

Motivation: Previous research in the biomedical domain has historically been limited to titles, abstracts and metadata available in MEDLINE records. Recent research initiatives such as TREC Genomics and BioCreAtIvE strongly point to the merits of moving beyond abstracts and into the realm of full-texts. Full-texts are, however, more expensive to process not only in terms of resources needed but also in terms of accuracy. Since full-texts contain embellishments that elaborate, contextualize, contrast, supplement, etc., there is greater risk for false positives. Motivated by this, we explore an approach that offers a compromise between the extremes of abstracts and full-texts. Specifically, we create reduced versions of full-text documents that contain only important portions. Long-term, our goal is to explore the use of such summaries for functions such as document retrieval and information extraction. Here we focus on designing summarization strategies. In particular we explore the use of MeSH terms, manually assigned to documents by trained annotators, as clues to select important text segments from the full-text documents.
Results: Our experiments confirm the ability of our approach to pick the important text portions. Using the ROUGE measures for evaluation we were able to achieve maximum ROUGE-1, ROUGE-2 and ROUGE-SU4 F-scores of 0.4150, 0.1435 and 0.1782, respectively for our MeSH term-based method versus the maximum baseline scores of 0.3815, 0.1353 and 0.1428, respectively. Using a MeSH profile-based strategy we were able to achieve maximum ROUGE F-scores of 0.4320, 0.1497 and 0.1887, respectively. Human evaluation of the baselines and our proposed strategies further corroborates the ability of our method to select important sentences from the full-texts.
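A toy Python version of the MeSH-guided selection idea is sketched below; the sentence scoring (a count of MeSH terms mentioned) is illustrative and is not the strategy evaluated in the paper.

```python
# Toy version of MeSH-term-guided sentence selection: score each full-text
# sentence by how many of the record's MeSH terms it mentions and keep the
# top-scoring ones.  The scoring rule and example text are illustrative only.
import re

def mesh_guided_summary(full_text, mesh_terms, n_sentences=3):
    sentences = re.split(r"(?<=[.!?])\s+", full_text)
    terms = [t.lower() for t in mesh_terms]

    def score(sentence):
        s = sentence.lower()
        return sum(1 for t in terms if t in s)

    ranked = sorted(sentences, key=score, reverse=True)
    keep = set(ranked[:n_sentences])
    # Preserve the original ordering of the selected sentences.
    return " ".join(s for s in sentences if s in keep)

text = ("Apoptosis was measured in tumor cells. The weather was nice. "
        "Caspase-3 activation increased after drug treatment. "
        "We thank the funding agencies.")
print(mesh_guided_summary(text, ["Apoptosis", "Caspase-3", "Neoplasms"], 2))
```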


PT13 Sunday, July 17: 3:30 p.m. - 3:55 p.m.
Multi-view Methods for Protein Structure Comparison using Latent Dirichlet Allocation

Room: Hall L/M
Subject: Protein Structure and Function

Author(s):
Shivashankar S, IIT Madras, India
Srivathsan S, Anna University, India
Ravindran B, IIT Madras, India
Ashish Tendulkar, IIT Madras, India

Session Chair: Yves Moreau

With rapidly expanding protein structure databases, efficiently retrieving structures similar to a given protein is an important problem. It involves two major issues: (i) an effective protein structure representation that captures the inherent relationships between fragments and facilitates efficient comparison between structures, and (ii) an effective framework to address different retrieval requirements. Recently, researchers proposed a vector space model of proteins using a bag-of-fragments representation (FragBag), which corresponds to the basic information retrieval model. In this paper we propose an improved representation of protein structures using the Latent Dirichlet Allocation (LDA) topic model. Another important requirement is to retrieve similar proteins whether they are close or remote homologs. In order to meet these diverse objectives, we propose a multi-viewpoint based framework that combines multiple representations and retrieval techniques. We compare the proposed representation and retrieval framework on the benchmark dataset developed by Kolodny and co-workers. The results indicate that the proposed techniques outperform the state-of-the-art methods.
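The sketch below shows one way the LDA representation could be set up with scikit-learn: fit a topic model on bag-of-fragments counts and rank structures by topic-mixture similarity. The fragment counts are random stand-ins and the retrieval step is deliberately simplified.

```python
# Sketch of representing structures as topic mixtures over a fragment
# vocabulary with LDA and ranking by similarity, in the spirit of the
# FragBag/LDA idea.  The counts below are random stand-ins for real
# bag-of-fragments vectors, and the retrieval is a plain cosine ranking.
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(0)
n_structures, n_fragments = 200, 400
counts = rng.poisson(1.0, size=(n_structures, n_fragments))  # bag of fragments

lda = LatentDirichletAllocation(n_components=20, random_state=0)
topic_mix = lda.fit_transform(counts)        # each structure as a topic mixture

def retrieve(query_index, top_k=5):
    """Rank database structures by cosine similarity of topic mixtures."""
    sims = cosine_similarity(topic_mix[query_index:query_index + 1], topic_mix)[0]
    order = np.argsort(-sims)
    return [i for i in order if i != query_index][:top_k]

print(retrieve(0))
```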


PT14 Sunday, July 17: 3:30 p.m. - 3:55 p.m.
A Folding Algorithm for Extended RNA Secondary Structures

Room: A353
Subject: Sequence Analysis

Author(s):
Christian Hoener Zu Siederdissen, University of Vienna, Austria
Stephan Bernhart, University of Vienna, Austria
Peter Stadler, Leipzig University, Germany
Ivo Hofacker, University of Vienna, Austria

Session Chair: Eric Xing

Motivation: RNA secondary structure contains many non-canonical base pairs of different pair families. Successful prediction of these structural features leads to improved secondary structures with applications in tertiary structure prediction and simultaneous folding and alignment. Results: We present a theoretical model capturing both RNA pair families and extended secondary structure motifs with shared nucleotides using 2-diagrams. We accompany this model with a number of programs for optimization of parameters and prediction of structures. Availability: All sources (optimization routines, RNA folding, RNA evaluation, extended secondary structure visualization) are published under the GPLv3 and available at: www.tbi.univie.ac.at/~choener/segfold/


PT15 Sunday, July 17: 4:00 p.m. - 4:25 p.m.
Template-free detection of macromolecular complexes in cryo electron tomograms

Room: Hall L/M
Subject: Protein Structure and Function

Author(s):
Min Xu, University of Southern California, United States
Martin Beck, European Molecular Biology Laboratory, Germany
Frank Alber, University of Southern California, United States

Session Chair: Yves Moreau

Motivation: Cryo electron tomography (cryoET) produces 3D density maps of biological specimens in their near-native states. Applied to small cells, cryoET produces 3D snapshots of the cellular distributions of large complexes. However, retrieving this information is non-trivial due to the low resolution and low signal-to-noise ratio in tomograms. Current pattern recognition methods identify complexes by matching known structures to the cryo electron tomogram. However, so far only a small fraction of all protein complexes have been structurally resolved. It is therefore of great importance to develop template-free methods for the discovery of previously unknown protein complexes in cryo electron tomograms. Results: Here, we have developed an inference method for the template-free discovery of frequently occurring protein complexes in cryo electron tomograms. We provide a first proof-of-principle of the approach and assess its applicability using realistically simulated tomograms, allowing for the inclusion of noise and distortions due to missing wedge and electron optical factors. Our method is a step towards the template-free discovery of the shapes, abundance and spatial distributions of previously unknown macromolecular complexes in whole cell tomograms.


PT16 Sunday, July 17: 4:00 p.m. - 4:25 p.m.
Error correction of high-throughput sequencing datasets with non-uniform coverage

Room: A353
Subject: Sequence Analysis

Author(s):
Paul Medvedev, University of California, San Diego, United States
Eric Scott, University of California, San Diego, United States
Pavel Pevzner, University of California, San Diego, United States

Session Chair: Eric Xing

The continuing improvements to high-throughput sequencing (HTS) platforms have begun to unfold a myriad of new and exciting applications. As a result, error correction of sequencing reads remains a key component in many applications. Though several tools do an excellent job of correcting datasets where the reads are sampled close to uniformly, the problem of correcting reads coming from drastically non-uniform datasets, such as those from single cell sequencing, remains open.


PT17 Monday, July 18: 10:45 a.m. - 11:10 a.m.
Generative Probabilistic Models for Protein-Protein Interaction Networks – The Biclique Perspective

Room: Hall L/M
Subject: Protein Interactions and Molecular Networks

Author(s):
Regev Schweiger, The Hebrew University of Jerusalem, Israel
Michal Linial, The Hebrew University of Jerusalem, Israel
Nathan Linial, The Hebrew University of Jerusalem, Israel

Session Chair: Robert Russell

Motivation: Much of the large-scale molecular data from living cells can be represented in terms of networks. Such networks occupy a central position in cellular systems biology. In the protein-protein interaction (PPI) network, nodes represent proteins and edges represent connections between them, based on experimental evidence. PPI networks are rich and complex, so a mathematical model is sought to capture their properties and shed light on PPI evolution. The mathematical literature contains various generative models of random graphs. It is a major and still largely open question which of these models (if any) can properly reproduce various biologically interesting networks. Here we consider this problem where the graph at hand is the PPI network of Saccharomyces cerevisiae. We try to distinguish between a model family which performs a process of copying neighbors, represented by the Duplication-Divergence (DD) model, and models which do not copy neighbors, with the Barabási-Albert (BA) preferential attachment model as a leading example. Results: The property of the network that we observe is the distribution of maximal bicliques in the graph. This is a novel criterion for distinguishing between models in this area. It is particularly appropriate for this purpose, since it reflects the graph's growth pattern under either model. This test clearly favors the DD model. In particular, for the BA model the vast majority (92.9%) of the bicliques with both sides ≥4 must already be embedded in the model's seed graph, whereas the corresponding figure for the DD model is only 5.1%. Our results, based on the biclique perspective, conclusively show that a naïve unmodified DD model can capture a key aspect of PPI networks. Supplementary information: Supplementary data are available at Bioinformatics online.
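As a rough illustration, the Python sketch below generates both model families with networkx and compares a crude neighbor-copying statistic (node pairs sharing at least four neighbors) as a stand-in for the paper's maximal-biclique analysis; the graph sizes and parameters are illustrative.

```python
# Sketch contrasting the two generative models with networkx.  Counting
# maximal bicliques exactly is expensive, so as a crude stand-in we count
# node pairs that share at least four neighbors, a signature that
# neighbor-copying growth produces readily.  Parameters are illustrative.
from itertools import combinations
import networkx as nx

def shared_neighbor_pairs(g, threshold=4):
    count = 0
    for u, v in combinations(g.nodes(), 2):
        if len(set(g[u]) & set(g[v])) >= threshold:
            count += 1
    return count

n = 300
dd = nx.duplication_divergence_graph(n, p=0.4, seed=1)   # copies neighbors
ba = nx.barabasi_albert_graph(n, m=3, seed=1)            # preferential attachment

print("DD pairs with >=4 shared neighbors:", shared_neighbor_pairs(dd))
print("BA pairs with >=4 shared neighbors:", shared_neighbor_pairs(ba))
```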


PT18 Monday, July 18: 10:45 a.m. - 11:10 a.m.
Epistasis detection on quantitative phenotypes by exhaustive enumeration using GPUs

Room: A353
Subject: Disease Models and Epidemiology

Author(s):
Tony Kam-Thong, Max Planck Institute of Psychiatry, Germany
Benno Puetz, Max Planck Institute of Psychiatry, Germany
Nazanin Karbalai, Max Planck Institute of Psychiatry, Germany
Bertram Mueller-Myhsok, Max Planck Institute of Psychiatry, Germany
Karsten Borgwardt, Max-Planck-Institutes, Germany

Session Chair: Niko Beerenwinkel

Motivation: In recent years, numerous genome-wide association studies (GWAS) have been conducted to identify genetic makeup that explains phenotypic differences observed in human populations. Analytical tests on single loci are readily available and embedded in common genome analysis software toolsets. The search for significant epistasis (gene-gene interactions) still poses a computational challenge for modern day computing systems, due to the large number of hypotheses that have to be tested. In this article, we present an approach to epistasis detection by exhaustive testing of all possible SNP pairs. Results: The search strategy based on the Hilbert-Schmidt Independence Criterion (HSIC) can help delineate various forms of statistical dependence between the genetic markers and the phenotype. The actual implementation of this search is done on the highly parallelized architecture available on Graphics Processing Units (GPUs), rendering the completion of the full search feasible within a day.
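The numpy sketch below computes a linear-kernel HSIC statistic between one SNP pair and a simulated quantitative phenotype; the GPU-parallel exhaustive scan over all pairs and the significance machinery of the actual method are not shown.

```python
# CPU-only numpy sketch of the HSIC statistic between a SNP pair and a
# quantitative phenotype (the actual method evaluates all pairs exhaustively
# on GPUs; genotypes and phenotype below are simulated).
import numpy as np

def hsic(x, y):
    """Biased empirical HSIC with linear kernels on centered variables."""
    n = len(x)
    x = x - x.mean(axis=0)
    y = y - y.mean(axis=0)
    kx = x @ x.T                     # kernel on the SNP-pair features
    ky = y @ y.T                     # kernel on the phenotype
    return np.trace(kx @ ky) / (n - 1) ** 2

rng = np.random.default_rng(0)
n = 500
snp_a = rng.integers(0, 3, n)        # genotypes coded 0/1/2
snp_b = rng.integers(0, 3, n)
phenotype = 0.5 * snp_a * snp_b + rng.normal(size=n)   # epistatic effect

pair = np.column_stack([snp_a, snp_b, snp_a * snp_b]).astype(float)
print("HSIC(pair, phenotype):", hsic(pair, phenotype.reshape(-1, 1)))
```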


PT19 Monday, July 18: 11:15 a.m. - 11:40 a.m.
RINQ: Reference-based Indexing for Network Queries

Room: Hall L/M
Subject: Protein Interactions and Molecular Networks

Author(s):
Gunhan Gulsoy, University of Florida, United States
Tamer Kahveci, University of Florida, United States

Session Chair: Robert Russell

We consider the problem of similarity queries in biological network databases. Given a database of networks, a similarity query returns all the database networks whose similarity (i.e., alignment score) to a given query network is at least a specified similarity cutoff value. Alignment of two networks is a very costly operation, which makes exhaustive comparison of all the database networks with a query impractical. To tackle this problem, we develop a novel indexing method, named RINQ (Reference-based Indexing for Biological Network Queries). Our method uses a set of reference networks to eliminate a large portion of the database quickly for each query. A reference network is a small biological network. We precompute and store the alignments of all the references with all the database networks. When our database is queried, we align the query network with all the reference networks. Using these alignments, we calculate upper and lower bounds on the alignment score of each database network with the query network. With the help of these upper and lower bounds, we eliminate the majority of the database networks without aligning them to the query network. We also quickly identify a small portion of these as guaranteed to be similar to the query. We perform pairwise alignment only for the remaining networks. We also propose a supervised method to pick references that have a large chance of filtering out unpromising database networks. Extensive experimental evaluation suggests that (1) our method reduced the running time of a single query on a database of around 300 networks from over two days to only eight hours; (2) our method outperformed the state-of-the-art methods Closure Tree and SAGA by a factor of three or more; and (3) our method successfully identified statistically and biologically significant relationships across networks and organisms.


PT20 Monday, July 18: 11:15 a.m. - 11:40 a.m.
Detecting epistatic effect in association studies at a genomic level based on an ensemble approach

Room: A353
Subject: Disease Models and Epidemiology

Author(s):
Jing Li, Case Western Reserve University, United States
Benjamin Horstman, Case Western Reserve University, United States
Yixuan Chen, Case Western Reserve University, United States

Session Chair: Niko Beerenwinkel

Background: Most complex diseases involve multiple genes and their interactions. Although genome-wide association studies (GWAS) have shown some success in identifying genetic variants underlying complex diseases, most existing studies are based on limited single-locus approaches, which detect SNPs essentially based on their marginal associations with phenotypes. Methodology/Principal Findings: We extend the basic AdaBoost algorithm by incorporating an intuitive importance score based on Gini impurity to select candidate SNPs. Permutation tests are used to control the statistical significance. We have performed extensive simulation studies using three interaction models to evaluate the efficacy of our approach at realistic GWAS sizes, and have compared it to existing epistasis detection algorithms. Conclusions/Significance: Our results indicate that our approach is valid and efficient for GWAS, and that it has more power than existing programs on disease models with epistasis.
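A hedged scikit-learn sketch of the ensemble idea follows: boosted shallow trees, impurity-based SNP importances and a small permutation test. The data are simulated and this is not the authors' extended AdaBoost implementation.

```python
# Sketch of the ensemble idea with scikit-learn: boost shallow trees, rank
# SNPs by impurity-based importance, and assess one SNP's score with a small
# permutation test.  Data are simulated; this is not the authors' code.
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n, n_snps = 400, 50
genotypes = rng.integers(0, 3, size=(n, n_snps)).astype(float)
# Disease status driven by an interaction between SNP 3 and SNP 7.
risk = (genotypes[:, 3] > 0) & (genotypes[:, 7] > 0)
status = (risk | (rng.random(n) < 0.1)).astype(int)

def importance(x, y):
    model = AdaBoostClassifier(DecisionTreeClassifier(max_depth=2),
                               n_estimators=100, random_state=0)
    model.fit(x, y)
    return model.feature_importances_     # Gini-impurity-based importances

obs = importance(genotypes, status)
print("Top SNPs:", np.argsort(-obs)[:5])

# Permutation test for SNP 3: how often does a label shuffle beat its score?
null = [importance(genotypes, rng.permutation(status))[3] for _ in range(20)]
print("Permutation p-value (SNP 3):", (np.sum(np.array(null) >= obs[3]) + 1) / 21)
```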


PT21 Monday, July 18: 11:45 a.m. - 12:10 p.m.
Construction of Co-complex Score Matrix for Protein Complex Prediction from AP-MS Data

Room: Hall L/M
Subject: Protein Interactions and Molecular Networks

Author(s):
Zhipeng Xie, Fudan University, China
Chee Keong Kwoh, Nanyang Technological University, Singapore
Xiaoli Li, Singapore Institute of Infocomm Research, Singapore
Min Wu, Nanyang Technological University, Singapore

Session Chair: Robert Russell

Motivation: Protein complexes are of great importance for unraveling the secrets of cellular organization and function. The AP-MS technique has provided an effective high-throughput screening approach to directly measure the co-complex relationship among multiple proteins, but its performance suffers from both false positives and false negatives. To computationally predict complexes from AP-MS data, most existing approaches either require additional knowledge from known complexes (supervised learning) or have numerous parameters to tune. Method: In this article, we propose a novel unsupervised approach that does not rely on knowledge of existing complexes. Our method probabilistically calculates the affinity between two proteins, where the affinity is evaluated by a Co-Complexed Score (C2S for short). In particular, our method measures the log likelihood ratio of two proteins being co-complexed to being drawn randomly, and we then predict protein complexes by applying a hierarchical clustering algorithm to the C2S score matrix. Results: Compared with existing approaches, our approach is computationally efficient and easy to implement. It has just one parameter to set, and its value has little effect on the results. It can be applied to different species as long as AP-MS data are available. Despite its simplicity, it is competitive with or superior to the state-of-the-art supervised and unsupervised approaches over many aspects of performance.
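The sketch below walks through the overall pipeline on toy AP-MS data: a log-ratio co-complex score followed by hierarchical clustering of the score matrix. The scoring formula used here is a simplified stand-in for the paper's C2S.

```python
# Sketch of the overall pipeline: derive a log-likelihood-ratio style
# co-complex score from co-purification counts, then cut a hierarchical
# clustering of the score matrix into predicted complexes.  The score below
# is a simplified stand-in for C2S and the purifications are made up.
import numpy as np
from scipy.cluster.hierarchy import average, fcluster
from scipy.spatial.distance import squareform

# Toy AP-MS data: each purification is the set of proteins it pulled down.
purifications = [{"A", "B", "C"}, {"A", "B"}, {"B", "C"}, {"D", "E"},
                 {"D", "E", "F"}, {"A", "C"}]
proteins = sorted(set().union(*purifications))
idx = {p: i for i, p in enumerate(proteins)}
n, m = len(proteins), len(purifications)

freq = np.array([sum(p in pur for pur in purifications) / m for p in proteins])
score = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        a, b = proteins[i], proteins[j]
        together = sum(a in pur and b in pur for pur in purifications) / m
        expected = freq[i] * freq[j]            # co-purification by chance
        s = np.log((together + 1e-3) / (expected + 1e-3))
        score[i, j] = score[j, i] = s

# Convert scores to distances and cluster; larger score => smaller distance.
dist = score.max() - score
np.fill_diagonal(dist, 0.0)
labels = fcluster(average(squareform(dist, checks=False)), t=2, criterion="maxclust")
print(dict(zip(proteins, labels)))
```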


PT22 Monday, July 18: 11:45 a.m. - 12:10 p.m.
Efficient spatial segmentation of large imaging mass spectrometry data sets with spatially-aware clustering

Room: A353
Subject: Mass Spectrometry and Proteomics

Author(s):
Theodore Alexandrov, University of Bremen, Germany
Jan Hendrik Kobarg, University of Bremen, Germany

Session Chair: Niko Beerenwinkel

Imaging mass spectrometry (IMS) is one of the few measurement technologies in biochemistry which, given a thin sample, is able to reveal its spatial chemical composition across the full molecular range. IMS produces a hyperspectral image, in which a high-dimensional mass spectrum is measured for each pixel. The technology is now mature, and one of the major problems preventing its wider adoption is the underdevelopment of computational methods for mining huge IMS data sets. This paper proposes a novel approach for spatial segmentation of an IMS data set which is constructed with the important issue of pixel-to-pixel variability in mind. We segment pixels by clustering their mass spectra. Importantly, we incorporate spatial relations between pixels into the clustering, so that pixels are clustered together with their neighbors. We propose two methods. One is non-adaptive, where pixel neighborhoods are selected in the same manner for all pixels. The second respects the structure observable in the data: for a pixel, its neighborhood is defined taking into account the similarity of its spectrum to the spectra of adjacent pixels. Both methods have linear time complexity and require linear memory (in the number of spectra). The proposed segmentation methods are evaluated on two IMS data sets: a rat brain coronal section and a section of a neuroendocrine tumor. They discover anatomical structure, discriminate the tumor region, and highlight functionally similar regions. Moreover, our methods provide segmentation maps of similar or better quality compared with other state-of-the-art methods, but outperform them in runtime and/or memory.
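A minimal Python sketch of the non-adaptive variant follows: smooth each pixel's spectrum over a 3x3 window, then cluster the smoothed spectra. The data cube is simulated and the adaptive, structure-respecting neighborhoods of the paper are omitted.

```python
# Sketch of the non-adaptive variant of spatially-aware segmentation: replace
# each pixel's spectrum by the mean over a small spatial window, then cluster
# the smoothed spectra with k-means.  The data cube is simulated.
import numpy as np
from scipy.ndimage import uniform_filter
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
h, w, channels = 60, 60, 40
cube = rng.normal(size=(h, w, channels))            # one spectrum per pixel
cube[20:40, 20:40, :10] += 3.0                      # region with a distinct signature

# Spatial smoothing: average over a 3x3 pixel neighborhood, channel-wise.
smoothed = uniform_filter(cube, size=(3, 3, 1))

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(
    smoothed.reshape(h * w, channels))
segmentation_map = labels.reshape(h, w)
print(np.unique(segmentation_map, return_counts=True))
```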


PT23 Monday, July 18: 12:15 p.m. - 12:40 p.m.
Uncover disease genes by maximizing information flow in the phenome-interactome network

Room: Hall L/M
Subject: Protein Interactions and Molecular Networks

Author(s):
Yong Chen, Tsinghua University, China
Tao Jiang, University of California, Riverside, United States
Rui Jiang, Tsinghua University, China

Session Chair: Robert Russell

Motivation: Pinpointing genes that underlie human inherited diseases among candidate genes in susceptibility genetic regions is the primary step towards the understanding of the pathogenesis of diseases. Although several probabilistic models have been proposed to prioritize candidate genes using phenotype similarities and protein-protein interactions, no combinatorial approaches have been proposed in the literature. Results: We propose the first combinatorial approach for prioritizing candidate genes. We first construct a phenome-interactome network by integrating the given phenotype similarity profile, protein-protein interaction network and associations between diseases and genes. Then we introduce a computational method called MAXIF to maximize the information flow in this network for uncovering genes that underlie diseases. We demonstrate the effectiveness of this method in prioritizing candidate genes through a series of cross-validation experiments, and we show the possibility of using this method to identify diseases with which a query gene may be associated. We demonstrate the competitive performance of our method through a comparison with two existing state-of-the-art methods, and we analyze the robustness of our method with respect to the parameters involved. As an example application, we apply our method to predict driver genes in 50 copy number aberration regions of melanoma. Our method not only identifies several driver genes that have been reported in the literature, it also sheds new biological insight on the modular properties and transcriptional regulation of these driver genes.
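To illustrate the flavor of a flow-based prioritization, the toy networkx sketch below scores two candidate genes by the maximum flow they receive from a query disease; the network, capacities and scoring are illustrative and do not reproduce the MAXIF formulation.

```python
# Toy sketch of prioritizing candidate genes by the information flow they can
# receive from a query disease through a phenome-interactome network, using
# networkx's max-flow routine.  Nodes, edges and capacities are made up.
import networkx as nx

g = nx.DiGraph()
# Disease-disease similarity edge (phenome side).
g.add_edge("disease:query", "disease:similar", capacity=0.8)
# A known disease-gene association bridges the two layers.
g.add_edge("disease:similar", "gene:KNOWN1", capacity=1.0)
# Protein-protein interactions (interactome side), made bidirectional.
for a, b, c in [("gene:KNOWN1", "gene:CAND1", 0.7),
                ("gene:KNOWN1", "gene:CAND2", 0.2),
                ("gene:CAND1", "gene:CAND2", 0.3)]:
    g.add_edge(a, b, capacity=c)
    g.add_edge(b, a, capacity=c)

def score(candidate):
    flow_value, _ = nx.maximum_flow(g, "disease:query", candidate)
    return flow_value

for cand in ["gene:CAND1", "gene:CAND2"]:
    print(cand, score(cand))
```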


PT24 Monday, July 18: 12:15 p.m. - 12:40 p.m.
Automatic Tracing of 3D Neuron Structures Using All-Path Pruning

Room: A353
Subject: Bioimaging

Author(s):
Hanchuan Peng, Howard Hughes Medical Institute, United States
Fuhui Long, Howard Hughes Medical Institute, United States
Gene Myers, Howard Hughes Medical Institute, United States

Session Chair: Niko Beerenwinkel

Motivation: Digital reconstruction, or tracing, of 3D neuron structures is critical toward reverse engineering the wiring and functions of a brain. However, despite a number of existing studies, this task is still challenging, especially when a 3D microscopic image has a low signal-to-noise ratio and fragmented neuron segments. Published work can handle these hard situations only by introducing global prior information, such as where a neurite segment starts and terminates. However, manually incorporating such global information can be very time-consuming. Thus a completely automatic approach for these hard situations is highly desirable. Results: We have developed an automatic graph algorithm, called All-Path Pruning (APP), to trace the 3D structure of a neuron. To avoid potential mis-tracing of some parts of a neuron, APP first produces an initial over-reconstruction by tracing the optimal geodesic shortest path from the seed location to every possible destination voxel/pixel location in the image. Since the initial reconstruction contains all the possible paths and thus could contain redundant structural components, we simplify the entire reconstruction without compromising its connectedness by pruning the redundant structural elements, using a new maximal-covering minimal-redundant (MCMR) subgraph algorithm. We show that MCMR has linear computational complexity. We also prove that it converges. We examined the performance of our method using challenging 3D neuronal image datasets of different model organisms, including fruit fly and mouse.


PT25 Monday, July 18: 2:30 p.m. - 2:55 p.m.
Physical Module Networks: an Integrative Approach for Reconstructing Transcription Regulation

Room: Hall L/M
Subject: Protein Interactions and Molecular Networks

Author(s):
Noa Novershtern, The Hebrew University of Jerusalem, Israel
Aviv Regev, The Broad Institute, United States
Nir Friedman, The Hebrew University of Jerusalem, Israel

Session Chair: Trey Ideker

Motivation: Deciphering the complex mechanisms by which regulatory networks control gene expression remains a major challenge. While some studies infer regulation from dependencies between the expression levels of putative regulators and their targets, others focus on measured physical interactions. Results: Here, we present Physical Module Networks, a unified framework that combines a Bayesian model describing modules of co-expressed genes and their shared regulation programs, and a Physical Interaction Graph, describing the protein-protein interactions and protein-DNA binding events that coherently underlie this regulation. Using synthetic data, we demonstrate that a Physical Module Network model has similar recall and improved precision compared to a simple Module Network, as it omits many false positive regulators. Finally, we show the power of Physical Module Networks to reconstruct meaningful regulatory pathways in the genetically perturbed yeast and during the yeast cell cycle, as well as during the response of primary epithelial human cells to infection with H1N1 influenza. Availability: The PMN software is available, free for academic use at http://www.compbio.cs.huji.ac.il/PMN/.


PT26 Monday, July 18: 2:30 p.m. - 2:55 p.m.
Tanglegrams for Rooted Phylogenetic Trees and Networks

Room: A353
Subject: Evolution and Comparative Genomics

Author(s):
Celine Scornavacca, ZBIT, University of Tuebingen, Germany
Daniel H. Huson, ZBIT, University of Tuebingen, Germany

Session Chair: Teresa Przytycka

Motivation: In systematic biology, one is often faced with the task of comparing different phylogenetic trees, in particular in multi-gene analyses or cospeciation studies. One approach is to use a tanglegram, in which two rooted phylogenetic trees are drawn opposite each other, using auxiliary lines to connect matching taxa. There is an increasing interest in using rooted phylogenetic networks to represent evolutionary history, so as to explicitly represent reticulate events such as horizontal gene transfer, hybridization or reassortment. The question arises how to define and compute a tanglegram for such networks. Results: In this paper we present the first formal definition of a tanglegram between two rooted phylogenetic networks and present a heuristic approach for computing one. We compare the performance of our method with existing tree tanglegram algorithms and also show a typical application to a real biological data set. For maximum usability, the algorithm does not require that the trees or networks are bifurcating or bicombining, or that they are on identical taxon sets. Availability: The algorithm is implemented in our program Dendroscope 3, which is freely available from www.dendroscope.org.


PT27 Monday, July 18: 3:00 p.m. - 3:25 p.m.
Identification of metabolic network models from incomplete high-throughput datasets

Room: Hall L/M
Subject: Protein Interactions and Molecular Networks

Author(s):
Sara Berthoumieux, INRIA, France
Matteo Brilli, CNRS, INRIA, France
Hidde De Jong, INRIA, France
Daniel Kahn, CNRS, INRA, France
Eugenio Cinquemani, INRIA, France

Session Chair: Trey Ideker

Motivation: High-throughput measurement techniques for metabolism and gene expression provide a wealth of information for the identification of metabolic network models. Yet, missing observations scattered over the dataset restrict the number of effectively available datapoints and make classical regression techniques inaccurate or inapplicable. Thorough exploitation of the data by identification techniques that explicitly cope with missing observations is therefore of major importance.


PT28 Monday, July 18: 3:00 p.m. - 3:25 p.m.
Mapping ancestral genomes with massive gene loss: a matrix sandwich problem

Room: A353
Subject: Evolution and Comparative Genomics

Author(s):
Haris Gavranovic, University of Sarajevo, Bosnia and Herzegovina
Cedric Chauve, Simon Fraser University, Canada
Jerome Salse, INRA Clermont Ferrand, France
Eric Tannier, INRIA, France

Session Chair: Teresa Przytycka

Motivation: Ancestral genomes can provide a better way to understand the structural evolution of genomes than the comparison of extant genomes. Most ancestral genome reconstruction methods rely on universal markers, that is, homologous families of DNA segments present in exactly one exemplar in every considered species. Complex histories of genes or other markers, undergoing duplications and losses, are rarely taken into account. It follows that some ancestors are inaccessible by these methods, such as the proto-monocotyledon whose evolution involved massive gene loss following a whole genome duplication. Results: We propose a mapping approach based on the combinatorial notion of "sandwich consecutive ones matrix", which explicitly takes gene losses into account. We introduce combinatorial optimization problems related to this concept, and propose several results, including a heuristic and a lower bound on the optimal solution. We use these results to analyze real datasets of mammalian and plant genomes. We propose a configuration for proto-chromosomes of the monocot ancestor, and study the accuracy of this configuration. We also use our method to reconstruct the ancestral boreoeutherian genomes, which illustrates that the framework we propose is not specific to plant paleogenomics but is adapted to reconstruct ancestral genomes from extant genomes with a highly heterogeneous marker content.


PT29 Monday, July 18: 3:30 p.m. - 3:55 p.m.
TREEGL: Reverse Engineering Tree-Evolving Gene Networks Underlying Developing Biological Lineages

Room: Hall L/M
Subject: Protein Interactions and Molecular Networks

Author(s):
Ankur Parikh, Carnegie Mellon University, United States
Wei Wu, University of Pittsburgh Medical Center, United States
Ross Curtis, Carnegie Mellon University, United States
Eric Xing, Carnegie Mellon University, United States

Session Chair: Trey Ideker

Estimating gene regulatory networks over biological lineages is central to a deeper understanding of how cells evolve during development and differentiation. However, one challenge in estimating such evolving networks is that their host cells not only contiguously evolve, but also branch over time. For example, a stem cell evolves into two more specialized daughter cells at each division, forming a tree of networks. Another example is in a laboratory setting: a biologist may apply several different drugs individually to malignant cancer cells to analyze the effects of each drug on the cells; the cells treated by one drug may not be intrinsically similar to those treated by another, but rather to the malignant cancer cells they were derived from. We propose a novel algorithm, Treegl, an l1 plus total variation penalized linear regression method, to effectively estimate multiple gene networks corresponding to cell types related by a tree-genealogy, based on only a few samples from each cell type. Treegl takes advantage of the similarity between related networks along the biological lineage, while at the same time exposing sharp differences between the networks. We demonstrate that our algorithm performs significantly better than existing methods via simulation. Furthermore we explore an application to a breast cancer dataset, and show that our algorithm is able to produce biologically valid results that provide insight into the progression and reversion of breast cancer cells.
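A hedged cvxpy sketch of the penalty structure follows: per-gene lasso regressions for a parent and a child cell type coupled by an l1 fusion term on the coefficient difference. The data are simulated and this is not the authors' Treegl implementation.

```python
# Sketch of the penalty structure behind an l1-plus-total-variation approach:
# neighborhood (lasso) regression in a parent and a child cell type, with an
# extra l1 term that ties the two coefficient vectors along the lineage.
# Data are simulated; penalty weights are illustrative.
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(0)
n, p = 40, 15                       # samples per cell type, candidate regulators
x_parent = rng.normal(size=(n, p))
x_child = rng.normal(size=(n, p))
parent_true = np.zeros(p)
parent_true[[1, 4]] = 1.5
y_parent = x_parent @ parent_true + 0.1 * rng.normal(size=n)
child_true = parent_true.copy()
child_true[4] = 0.0                 # one regulatory edge is lost in the child
y_child = x_child @ child_true + 0.1 * rng.normal(size=n)

b_parent, b_child = cp.Variable(p), cp.Variable(p)
lam1, lam2 = 0.5, 0.5
objective = (cp.sum_squares(y_parent - x_parent @ b_parent)
             + cp.sum_squares(y_child - x_child @ b_child)
             + lam1 * (cp.norm1(b_parent) + cp.norm1(b_child))
             + lam2 * cp.norm1(b_parent - b_child))   # sparse change along the tree
cp.Problem(cp.Minimize(objective)).solve()

print("parent edges:", np.flatnonzero(np.abs(b_parent.value) > 0.1))
print("child edges: ", np.flatnonzero(np.abs(b_child.value) > 0.1))
```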


PT30 Monday, July 18: 3:30 p.m. - 3:55 p.m.
Predicting site-specific human selective pressure using evolutionary signatures

Room: A353
Subject: Evolution and Comparative Genomics

Author(s):
Javad Sadri, McGill University, Canada
Abdoulaye Banire Diallo, U. du Quebec a Montreal, Canada
Mathieu Blanchette, McGill University, Canada

Session Chair: Teresa Przytycka

Motivation: The identification of non-coding functional regions of the human genome remains one of the main challenges of genomics. By observing how a given region evolved over time, one can detect signs of negative or positive selection hinting that the region may be functional. With the quickly increasing number of vertebrate genomes to compare to our own, this type of approach is set to become extremely powerful, provided the right analytical tools are available. Results: A large number of approaches have been proposed to measure signs of past selective pressure, usually in the form of a reduced mutation rate. Here, we propose a radically different approach to the detection of non-coding functional regions: instead of measuring past evolutionary rates, we build a machine learning classifier to predict current mutation rates in humans based on the inferred evolutionary events that affected the region during vertebrate evolution. We show that different types of evolutionary events, occurring along different branches of the phylogenetic tree, bring very different amounts of information. We propose a number of simple machine learning classifiers and show that a Support Vector Machine (SVM) predictor clearly outperforms existing tools at predicting human non-coding functional sites. Comparison to external evidence of selection and regulatory function confirms that these SVM predictions are more accurate than those of other approaches. Availability: The predictor and predictions made are available at http://www.mcb.mcgill.ca/~blanchem/sadri
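The scikit-learn sketch below mirrors the learning setup at toy scale: an SVM trained on per-branch evolutionary event counts to separate constrained from neutral regions; the features and labels are simulated stand-ins for the real inferred-event data.

```python
# Sketch of the learning setup with scikit-learn: an SVM trained on per-branch
# counts of inferred evolutionary events to separate regions under selection
# from neutral ones.  Features and labels are simulated.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_regions, n_branches = 1000, 30
# Event counts per phylogenetic branch for each genomic region.
events = rng.poisson(3.0, size=(n_regions, n_branches)).astype(float)
constrained = rng.random(n_regions) < 0.5
events[constrained] *= 0.6                       # constrained regions accumulate fewer events
labels = constrained.astype(int)

model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
scores = cross_val_score(model, events, labels, cv=5, scoring="roc_auc")
print("cross-validated AUC: %.2f" % scores.mean())
```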


PT31 Monday, July 18: 4:00 p.m. - 4:25 p.m.
Optimal Discriminating Subnetwork Markers Predict Responses To Cancer Treatments

Room: Hall L/M
Subject: Protein Interactions and Molecular Networks

Author(s):
Phuong Dao, Simon Fraser University, Canada
Kendric Wang, University of British Columbia, Canada
Colin Collins, Vancouver Prostate Center, Canada
Martin Ester, Simon Fraser University, Canada
Anna Lapuk, Vancouver Prostate Center, Canada
Cenk Sahinalp, Simon Fraser University, Canada

Session Chair: Trey Ideker

Motivation: Molecular profiles of tumor samples have been widely and successfully used for classification problems. A number of algorithms have been proposed to predict classes of tumor samples based on expression profiles with relatively high performance. However, the main problem in this field is the lack of reproducibility and generalizability of predictive markers developed even for the most studied cancers, such as breast cancer. Furthermore, prediction of response to treatment has proved to be even more challenging. Recent studies have clearly demonstrated the advantages of integrating protein-protein interaction data with gene expression profiles for the development of subnetwork markers in classification problems. Novel approaches with improved reproducibility and generalizability are still highly needed for prediction of response to treatment in cancer and other diseases. Results: We describe a novel network-based classification algorithm that uses a color coding technique to identify optimal subnetwork markers. Focusing on PPI networks, we apply our algorithm to drug response studies: we evaluate it using published cohorts of breast cancer patients treated with combination chemotherapy. We show that our method improves over previously published ones and demonstrate the higher and more stable performance of our subnetwork predictors compared with other subnetwork and single-gene markers. We also show that subnetwork markers give more reproducible results across independent cohorts and provide valuable insight into the biological processes underlying mechanisms of response to therapy.
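A compact Python sketch of the color-coding idea follows, searching a toy PPI network for a high-scoring path of k genes via random colorings and a dynamic program over color subsets; the network, node scores and path-shaped markers are illustrative, not the authors' algorithm.

```python
# Sketch of color coding: repeatedly color nodes with k random colors and run
# a dynamic program over color subsets to find a high-scoring simple path of
# k genes (a stand-in for a subnetwork marker).  Network and scores are toys.
import random

def best_colorful_path(adj, score, k, trials=50, seed=0):
    rng = random.Random(seed)
    nodes = list(adj)
    best_value, best_path = float("-inf"), None
    for _ in range(trials):
        color = {v: rng.randrange(k) for v in nodes}
        # dp[(v, colorset)] = (score, path) of the best colorful path ending at v.
        dp = {(v, 1 << color[v]): (score[v], [v]) for v in nodes}
        for _ in range(k - 1):
            new = {}
            for (v, used), (val, path) in dp.items():
                for u in adj[v]:
                    bit = 1 << color[u]
                    if used & bit:
                        continue
                    key, cand = (u, used | bit), (val + score[u], path + [u])
                    if key not in new or cand[0] > new[key][0]:
                        new[key] = cand
            dp = new
            if not dp:
                break
        for (_, _), (val, path) in dp.items():
            if len(path) == k and val > best_value:
                best_value, best_path = val, path
    return best_value, best_path

adj = {"A": ["B", "C"], "B": ["A", "C", "D"], "C": ["A", "B"],
       "D": ["B", "E"], "E": ["D"]}
score = {"A": 0.9, "B": 0.8, "C": 0.1, "D": 0.7, "E": 0.6}
print(best_colorful_path(adj, score, k=3))
```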


top
PT32Monday, July 18: 4:00 p.m. - 4:25 p.m.
PhyloCSF: a comparative genomics method to distinguish protein-coding and non-coding regions

Room: A353
Subject: Evolution and Comparative Genomics

Author(s):
Michael Lin, Massachusetts Institute of Technology, United States
Irwin Jungreis, Massachusetts Institute of Technology, United States
Manolis Kellis, Massachusetts Institute of Technology, United States

Session Chair: Teresa Przytycka

As high-throughput transcriptome sequencing provides evidence for novel transcripts in many species, there is a renewed need for accurate methods to classify small genomic regions as protein-coding or non-coding. We present PhyloCSF, a novel comparative genomics method that analyzes a multi-species nucleotide sequence alignment to determine whether it is likely to represent a conserved protein-coding region, based on a formal statistical comparison of phylogenetic codon models. We show that PhyloCSF’s classification performance in 12-species Drosophila genome alignments exceeds that of all other methods we compared in a previous study, and we provide a software implementation for use by the community. We anticipate that this method will be widely applicable as the transcriptomes of many additional species, tissues, and subcellular compartments are sequenced, particularly in the context of ENCODE and modENCODE.


top
PT33Tuesday, July 19: 10:45 a.m. - 11:10 a.m.
The role of proteasome-mediated proteolysis in modulating the activity of potentially harmful transcription factors in Saccharomyces cerevisiae

Room: Hall L/M
Subject: Applied Bioinformatics

Author(s):
Nicola Bonzanni, VU University Amsterdam, Netherlands
Nianshu Zhang, Cambridge Systems Biology Centre, United Kingdom
Stephen G Oliver, Cambridge Systems Biology Centre, United Kingdom
Jasmin Fisher, Microsoft Research Cambridge, United Kingdom

Session Chair: Phil Bourne

The appropriate modulation of the stress response to variable environmental conditions is necessary to maintain sustained viability in Saccharomyces cerevisiae. In particular, controlling the abundance of proteins that may have detrimental effects on cell growth is crucial to a rapid recovery from stress-induced quiescence. Prompted by qualitative modeling of the nutrient starvation response in yeast, we investigated in vivo the effect of proteolysis after nutrient starvation, showing that, for the Gis1 transcription factor at least, proteasome-mediated control is crucial for a rapid return to growth. Additional bioinformatics analyses show that potentially toxic transcriptional regulators have a significantly lower protein half-life, a higher fraction of unstructured regions, and more potential PEST motifs than the non-detrimental ones. Furthermore, inhibiting proteasome activity tends to increase the expression of genes induced during the Environmental Stress Response more than those in the rest of the genome. Our combined results suggest that proteasome-mediated proteolysis of potentially toxic transcription factors tightly modulates the stress response in yeast.


top
PT34Tuesday, July 19: 10:45 a.m. - 11:10 a.m.
ccSVM: Correcting Support Vector Machines for confounding factors in biological data classification

Room: A353
Subject: Databases and Ontologies

Author(s):
Limin Li, MPIs Tübingen, Germany
Barbara Rakitsch, MPIs Tübingen, Germany
Karsten Borgwardt, MPIs Tübingen, Germany

Session Chair: Hagit Shatkay

Motivation: Classifying biological data into different groups is a central task of bioinformatics, for instance, to predict the function of a gene or protein, the disease state of a patient, or the phenotype of an individual based on its genotype. Support Vector Machines are a widespread approach for classifying biological data, due to their high accuracy, their ability to deal with structured data such as strings, and the ease with which various types of data can be integrated. However, it is unclear how to correct for confounding factors such as population structure, age, gender, or experimental conditions in Support Vector Machine classification. Results: In this article, we present a Support Vector Machine classifier that can correct the prediction for observed confounding factors. This is achieved by minimizing the statistical dependence between the classifier and the confounding factors. We prove that this formulation can be transformed into a standard Support Vector Machine with rescaled input data. In our experiments, our confounder correcting SVM (ccSVM) improves tumor diagnosis based on samples from different labs, tuberculosis diagnosis in patients of varying age, ethnicity, and gender, and phenotype prediction in the presence of population structure, and outperforms state-of-the-art methods in terms of prediction accuracy. Availability: A ccSVM implementation will be made publicly available along with the publication.
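
As a rough illustration of why confounder correction matters, the sketch below (simulated data, hypothetical confounder) removes the confounder-explained part of each feature by least squares before fitting a standard linear SVM, then measures how strongly the decision values depend on the confounder. The actual ccSVM formulation minimizes statistical dependence and is proven equivalent to an SVM on rescaled inputs; this simple residualization does not reproduce that result.

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(1)
    n, p = 200, 50
    confounder = rng.normal(size=(n, 1))      # e.g. age or lab/batch (simulated)
    signal = rng.normal(size=(n, 1))          # true class-related signal (simulated)
    X = (signal @ rng.normal(size=(1, p))
         + confounder @ rng.normal(size=(1, p))
         + rng.normal(size=(n, p)))
    y = (signal[:, 0] + 0.5 * confounder[:, 0] > 0).astype(int)

    # Project the confounder out of every feature (ordinary least squares).
    beta, *_ = np.linalg.lstsq(confounder, X, rcond=None)
    X_corrected = X - confounder @ beta

    for name, data in [("raw features", X), ("confounder-corrected", X_corrected)]:
        clf = SVC(kernel="linear").fit(data, y)
        dep = np.corrcoef(clf.decision_function(data), confounder[:, 0])[0, 1]
        print(f"{name}: |corr(decision value, confounder)| = {abs(dep):.2f}")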


top
PT35Tuesday, July 19: 11:15 a.m. - 11:40 a.m.
Mixed Model Coexpression (MMC): calculating gene coexpression while accounting for expression heterogeneity

Room: Hall L/M
Subject: Applied Bioinformatics

Author(s):
Nicholas Furlotte, University of California Los Angeles, United States
Hyun Min Kang, University of Michigan, United States
Chun Ye, The Broad Institute, United States
Eleazar Eskin, University of California Los Angeles, United States

Session Chair: Phil Bourne

The analysis of gene coexpression is at the core of many types of genetic analysis. The coexpression between two genes can be calculated using a traditional Pearson’s correlation coefficient. However, unobserved confounding effects may cause inflation of the Pearson’s correlation so that uncorrelated genes appear correlated. Many general methods have been suggested which aim to remove the effects of confounding from gene expression data. However, the residual confounding which is not accounted for by these generic correction procedures has the potential to induce correlation between genes. Therefore, a method that specifically aims to calculate gene coexpression from gene expression arrays, while accounting for confounding effects, is desirable. In this paper, we present a statistical model for calculating gene coexpression called Mixed Model Coexpression (MMC), which models coexpression within a mixed model framework. Confounding effects are expected to be encoded in the matrix representing the correlation between arrays, the inter-sample correlation matrix. By conditioning on the information in the inter-sample correlation matrix, MMC is able to produce gene coexpressions that are not influenced by global confounding effects and thus significantly reduce the number of spurious coexpressions observed. We applied MMC to both human and yeast datasets and show that it prioritizes strong coexpressions more effectively than either a traditional Pearson’s correlation or a Pearson’s correlation applied to data corrected with Surrogate Variable Analysis (SVA).
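
The intuition behind conditioning on the inter-sample correlation matrix can be seen in the small simulation below: a hidden per-sample factor inflates gene-gene correlations, and whitening the expression matrix with respect to the estimated inter-sample correlation matrix removes much of that inflation. This is only an illustration with simulated data; MMC itself fits a full mixed model rather than a simple whitening step.

    import numpy as np

    rng = np.random.default_rng(2)
    n_genes, n_samples = 200, 40

    # Hidden per-sample factor (e.g. a batch effect) loading on every gene
    # inflates gene-gene correlations even for unrelated genes.  Simulated data.
    factor = rng.normal(size=n_samples)
    loadings = rng.normal(size=(n_genes, 1))
    expr = loadings @ factor[None, :] + rng.normal(size=(n_genes, n_samples))

    # Inter-sample correlation matrix and its inverse square root.
    K = np.corrcoef(expr, rowvar=False)
    vals, vecs = np.linalg.eigh(K)
    W = vecs @ np.diag(1.0 / np.sqrt(np.maximum(vals, 1e-8))) @ vecs.T
    expr_w = expr @ W

    iu = np.triu_indices(n_genes, 1)
    print("mean |r|, raw:     ", np.abs(np.corrcoef(expr))[iu].mean().round(3))
    print("mean |r|, adjusted:", np.abs(np.corrcoef(expr_w))[iu].mean().round(3))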


top
PT36Tuesday, July 19: 11:15 a.m. - 11:40 a.m.
Ontology Patterns for Tabular Representations of Knowledge on Neglected Tropical Diseases

Room: A353
Subject: Databases and Ontologies

Author(s):
Filipe Santana Da Silva, Universidade Federal de Pernambuco, Brazil
Daniel Schober, Universitäts Freiburg, Germany
Zulma Medeiros, FIOCRUZ, Brazil
Fred Freitas, Universidade Federal de Pernambuco, Brazil
Stefan Schulz, Medical University of Graz, Austria

Session Chair: Hagit Shatkay

Motivation: Ontology-like domain knowledge is frequently published in a tabular format embedded in scientific publications. We explore the re-use of such tabular content in the process of building NTDO, an ontology of neglected tropical diseases, where the representation of the interdependencies between hosts, pathogens, and vectors plays a crucial role. Results: As a proof of concept we analyzed a tabular compilation of knowledge about the pathogens, vectors and geographic locations involved in the transmission of neglected tropical diseases. After a thorough ontological analysis of the domain of interest we formulated a comprehensive design pattern, rooted in the biomedical domain upper level ontology BioTop. This pattern was implemented in a VBA script which takes cell contents of an Excel spreadsheet and transforms them into OWL-DL. After minor manual postprocessing the correctness and completeness of the ontology was tested using pre-formulated competence questions as DL queries. The expected results could be reproduced by the ontology. The proposed approach is recommended for optimizing the acquisition of domain knowledge from tabular representations. Availability and Implementation: Domain examples, source code, and ontology are freely available on the web at http://purl.org/steschu. Keywords: Knowledge Acquisition, Description Logics, Neglected Tropical Diseases


top
PT37Tuesday, July 19: 11:45 a.m. - 12:10 p.m.
A generalized model for multi-marker analysis of cell-cycle progression in synchrony experiments

Room: Hall L/M
Subject: Applied Bioinformatics

Author(s):
Michael B. Mayhew, Duke University, United States
Joshua W. Robinson, Duke University, United States
Boyoun Jung, Duke University, United States
Steven B. Haase, Duke University, United States
Alexander J. Hartemink, Duke University, United States

Session Chair: Phil Bourne

Motivation: To advance understanding of eukaryotic cell division, it is important to observe the process precisely. To this end, researchers monitor changes in dividing cells as they traverse the cell cycle, with the presence or absence of morphological or genetic markers indicating each cell’s position in a particular interval of the cell cycle. A wide variety of marker data is available, including information-rich cellular imaging data. However, few formal statistical methods have been developed to use these valuable data sources in estimating cell-cycle progression of a cell population. Furthermore, existing methods are designed to handle only a single binary marker of cell-cycle progression at a time. Consequently, they cannot facilitate comparison of experiments involving different sets of markers. Results: Here, we develop a new sampling model to accommodate an arbitrary number of different binary markers that characterize the progression of a population of dividing cells along a branching process. We engineer a strain of S. cerevisiae with fluorescently labeled markers of cell-cycle progression, and apply our new model to two image datasets we collected from the strain, as well as an independent dataset of different markers. We also use our model to estimate the duration of post-cytokinetic attachment between a Saccharomyces cerevisiae mother and daughter cell. The implementation is fast and extensible, and includes a graphical user interface. Our model provides a powerful and flexible cell-cycle analysis tool, suitable to any type or combination of binary markers.


top
PT38Tuesday, July 19: 11:45 a.m. - 12:10 p.m.
Detection and interpretation of metabolite-transcript co-responses using combined profiling data

Room: A353
Subject: Gene Regulation and Transcriptomics

Author(s):
Henning Redestig, RIKEN Plant Science Center, Japan
Ivan Gesteira Costa, Federal University of Pernambuco, Brazil

Session Chair: Hagit Shatkay

Motivation: Studying the interplay between gene expression and metabolite levels can provide important information about the physiology of stress responses and adaptation strategies. Performing transcriptomics and metabolomics in parallel during time-series experiments provides a systematic way to do this. Several such combined profiling data sets have been added to the public domain and form a valuable resource for hypothesis generating studies. Unfortunately, detecting co-responses between transcript levels and metabolite abundances is non-trivial since they cannot be assumed to overlap directly with underlying biochemical pathways, and may be subject to time delays and obscured by considerable noise. Results: Our aim is to predict pathway co-memberships between metabolites and genes based on their co-responses to applied stresses. We found that in the presence of strong noise and time-shifted responses, a hidden Markov model based similarity outperforms the simpler Pearson correlation, but performs comparably or worse in their absence. Therefore, we propose a supervised method that uses pathway information to summarize similarity statistics into a consensus statistic that is more informative than any of the single measures. Using four combined profiling data sets, we show that co-membership between metabolites and genes can be predicted for numerous KEGG pathways, opening opportunities both for detecting transcriptionally regulated pathways and novel metabolically related genes.
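
The time-delay problem can be seen in a tiny simulated example: a metabolite that tracks a transcript with a two-time-point delay shows a weak same-time Pearson correlation, while a lag-aware similarity recovers the link. The maximum lagged correlation used below is only a simple stand-in for the HMM-based similarity of the paper, and all data are simulated.

    import numpy as np

    def max_lagged_corr(x, y, max_lag=3):
        """Largest absolute Pearson correlation over small time shifts."""
        best = 0.0
        for lag in range(-max_lag, max_lag + 1):
            if lag >= 0:
                a, b = x[lag:], y[:len(y) - lag]
            else:
                a, b = x[:lag], y[-lag:]
            if len(a) > 2:
                best = max(best, abs(np.corrcoef(a, b)[0, 1]))
        return best

    rng = np.random.default_rng(3)
    t = np.arange(20)
    transcript = np.sin(t / 1.5) + rng.normal(0, 0.1, 20)
    metabolite = np.roll(transcript, 2)    # responds two time points later

    print("same-time Pearson:", round(np.corrcoef(transcript, metabolite)[0, 1], 2))
    print("max lagged corr:  ", round(max_lagged_corr(transcript, metabolite), 2))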


top
PT39Tuesday, July 19: 12:15 p.m. - 12:40 p.m.
Systematic exploration of error sources in pyrosequencing flowgram data

Room: Hall L/M
Subject: Applied Bioinformatics

Author(s):
Susanne Balzer, Institute of Marine Research, Norway
Ketil Malde, Institute of Marine Research, Norway
Inge Jonassen, University of Bergen, Norway

Session Chair: Phil Bourne

Motivation: 454 Pyrosequencing, by Roche Diagnostics, has emerged as an alternative to Sanger sequencing when it comes to read lengths, performance, and cost, but shows higher per-base error rates. Although there are several tools available for noise-removal, targeting different application fields, data interpretation would benefit from a better understanding of the different error types. Results: By exploring 454 raw data, we quantify to what extent different factors account for sequencing errors. We use our findings to extend the flowsim pipeline with functionalities to simulate these errors, and thus enable a more realistic simulation of 454 pyrosequencing data with flowsim. Availability: The flowsim pipeline is freely available under the General Public License from http://biohaskell.org/Applications/FlowSim.
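
A toy sketch of the kind of error behaviour such a simulation has to capture is shown below: each flow's observed light intensity is drawn around the true homopolymer length, with a spread that grows with length, and the base caller rounds the intensity back to an integer. The distribution parameters here are invented for illustration; flowsim relies on empirically derived characteristics rather than this simple model.

    import numpy as np

    rng = np.random.default_rng(4)

    def simulate_flow(true_length, n=10000):
        """Fraction of flows whose called homopolymer length is wrong (toy model)."""
        sigma = 0.05 + 0.03 * true_length          # spread grows with length (assumed)
        intensities = rng.normal(loc=true_length, scale=sigma, size=n)
        called = np.clip(np.rint(intensities), 0, None).astype(int)
        return (called != true_length).mean()

    for length in range(1, 8):
        print(f"homopolymer length {length}: miscall rate {simulate_flow(length):.4f}")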


top
PT40Tuesday, July 19: 12:15 p.m. - 12:40 p.m.
From Sets to Graphs: Towards a Realistic Enrichment Analysis of Transcriptomic Systems

Room: A353
Subject: Gene Regulation and Transcriptomics

Author(s):
Ludwig Geistlinger, LMU München, Germany
Gergely Csaba, LMU München, Germany
Robert Küffner, LMU München, Germany
Nicola Mulder, University of Cape Town, South Africa
Ralf Zimmer, LMU München, Germany

Session Chair: Hagit Shatkay

Motivation: Current gene set enrichment approaches do not take interactions and associations between set members into account. Mutual activation and inhibition causing positive and negative correlation among set members are thus neglected. As a consequence, inconsistent regulations and contextless expression changes are reported and, thus, the biological interpretation of the result is impeded. Results: We analyzed established gene set enrichment methods and their result sets in a large-scale investigation of 1000 expression datasets. The reported statistically significant gene sets exhibit only average consistency between the observed patterns of differential expression and known regulatory interactions. We present Gene Graph Enrichment Analysis (GGEA) to detect consistently and coherently enriched gene sets, based on prior knowledge derived from directed gene regulatory networks (GRNs). Firstly, GGEA improves the concordance of pairwise regulation with individual expression changes in respective pairs of regulating and regulated genes, compared to set enrichment methods. Secondly, GGEA yields result sets where a large fraction of relevant expression changes can be explained by nearby regulators, such as transcription factors, again improving on set-based methods. Thirdly, we demonstrate in additional case studies that GGEA can be applied to human regulatory pathways, where it sensitively detects very specific regulation processes that are altered in tumors of the central nervous system. GGEA significantly increases the detection of gene sets where measured positively or negatively correlated expression patterns coincide with directed inducing or repressing relationships, thus facilitating further interpretation of gene expression data. Availability: The method and accompanying visualization capabilities have been bundled into an R package and tied to a graphical user interface, the Galaxy workflow environment, which runs as a web server.
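
A stripped-down version of the consistency idea is sketched below: each regulatory edge induced by a gene set is scored by whether the observed fold changes of regulator and target agree with the edge's sign (activation or inhibition), and the scores are averaged. This is a simplified stand-in with toy numbers, not GGEA's actual scoring and significance procedure.

    def edge_consistency(fold_change, edges, gene_set):
        """Average sign-agreement score over edges whose ends are both in the set.

        edges: list of (regulator, target, sign) with sign +1 (activation)
        or -1 (inhibition); fold_change: dict of log2 fold changes.
        """
        scores = []
        for reg, tgt, sign in edges:
            if reg in gene_set and tgt in gene_set:
                scores.append(sign * fold_change[reg] * fold_change[tgt])
        return sum(scores) / len(scores) if scores else 0.0

    # Toy log2 fold changes and a toy regulatory network (invented values).
    fold_change = {"TF1": 1.8, "G1": 1.2, "G2": -0.9, "TF2": -1.1, "G3": 0.7}
    edges = [("TF1", "G1", +1), ("TF1", "G2", -1), ("TF2", "G3", -1)]

    print("consistency:",
          edge_consistency(fold_change, edges, {"TF1", "G1", "G2", "TF2", "G3"}))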


top
PT41Tuesday, July 19: 2:30 p.m. - 2:55 p.m.
An enhanced Petri-net model to predict synergistic effects of pairwise drug combinations from gene microarray data

Room: Hall L/M
Subject: Applied Bioinformatics

Author(s):
Jin Guangxu, The Methodist Hospital Research Institute & Cornell Weill Medical College, United States
Hong Zhao, The Methodist Hospital Research Institute & Cornell Weill Medical College, United States
Xiaobo Zhou, The Methodist Hospital Research Institute & Cornell Weill Medical College, United States
Stephen Wong, The Methodist Hospital Research Institute & Cornell Weill Medical College, United States

Session Chair: Eleazer Eskin

Motivation: Prediction of synergistic effects of drug combinations has traditionally relied on phenotypic response data. However, such methods cannot be used to identify the molecular signaling mechanisms of synergistic drug combinations. In this paper, we propose an Enhanced Petri-Net (EPN) model to recognize the synergistic effects of drug combinations from molecular response profiles, i.e., drug-treated microarray data. Methods: We addressed the downstream signaling network of the targets for the two individual drugs used in the combinations. Then, EPN was applied to the identified targeted signaling network. In EPN, drugs and signaling molecules are assigned to different types of places, and drug doses and the expression of molecules are denoted by different color tokens. The expression changes of molecules caused by drug treatments are simulated by two actions of EPN, firing and blasting. Firing transfers drug and molecule tokens from one node, or place, to another, and blasting uses drug tokens to reduce the number of molecule tokens in a molecule node. The goal of EPN is to move the system from the state characterized by the untreated control condition to the treated state, and to depict the drug effects on molecules through the drug tokens. Results: We applied EPN to our generated pairwise drug combination microarray data. The synergistic predictions using EPN are consistent with those predicted using phenotypic response data. The molecules responsible for the synergistic effects, with their associated feedback loops, display the mechanisms of synergism.
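
For readers unfamiliar with the underlying formalism, the sketch below shows a firing step of a classical Petri net with toy places and tokens: a transition fires when its input places hold enough tokens, consuming them and producing tokens in its output places. The EPN's colored tokens, dose handling and blasting action are not modeled here; this is only the generic formalism the EPN extends.

    def fire(marking, transition):
        """Fire a transition if enabled; return True on success (classical semantics)."""
        inputs, outputs = transition
        if all(marking.get(p, 0) >= n for p, n in inputs.items()):
            for p, n in inputs.items():
                marking[p] -= n
            for p, n in outputs.items():
                marking[p] = marking.get(p, 0) + n
            return True
        return False

    # Toy network: a "drug" place feeding a signaling "molecule" place.
    marking = {"drug": 2, "molecule": 0}
    t_activate = ({"drug": 1}, {"molecule": 1})   # consume one drug token, add one molecule token

    while fire(marking, t_activate):
        print(marking)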


top
PT42Tuesday, July 19: 2:30 p.m. - 2:55 p.m.
Small sets of interacting proteins suggest functional linkage mechanisms via Bayesian analogical reasoning

Room: A353
Subject: Gene Regulation and Transcriptomics

Author(s):
Ricardo Silva, University College London, United Kingdom
Katherine A. Heller, University of Cambridge, United Kingdom
Edoardo M. Airoldi, Harvard University, United States

Session Chair: Robert Murphy

Proteins coordinate their activity and interact in a number of ways to regulate cellular functions. Signal transduction pathways carry molecular signals from membrane proteins to transcription factor proteins that initiate or inhibit transcription. Chaperone protein complexes assemble to assist the folding process of other proteins. Cellular processes and molecular functions require the concerted action of proteins and protein complexes. In several experimental settings, including synthetic genetic array analyses, genetic perturbation experiments and RNAi screens, scientists identify a small set of protein interactions of interest. A reasonable hypothesis that needs verification is that these interactions are the result of some functional biological logic, which is not directly observable and needs to be discovered. The next step in such analyses is to find other pairs of proteins that interact as a result of the same unknown biological logic, if any exist. A battery of functional enrichment tests is then performed on the combined set of protein pairs, which is larger and adds confidence in the results. However, extant methods for predicting protein interactions lead to predictions with low confidence when applied to small sets of protein interactions. These methods leverage the attributes of the individual proteins directly, in a supervised learning setting, in order to rank protein pairs. A small set of protein interactions provides a small sample to train the parameters of prediction methods, thus leading to low confidence. Here, we develop a computational approach to ranking protein interactions that leverages statistical analogical reasoning, that is, the ability to learn and generalize relations between objects. Our approach is tailored to situations where the input set of protein interactions is small, and leverages the attributes of the individual proteins indirectly, in a Bayesian ranking setting that is perhaps closest to propensity scoring in mathematical psychology. We find that using analogical reasoning leads to good performance in identifying additional interactions starting from a small evidence set of interacting proteins, for which an underlying biological logic in terms of functional processes and signaling pathways can be established with some confidence. Our approach is scalable and can be applied to large databases with minimal computational overhead. Overall, our results suggest that analogical reasoning within a Bayesian ranking problem is a promising new approach in support of real-time biological discovery. Java code is available at: http://www.gatsby.ucl.ac.uk/~rbas/.


top
PT43Tuesday, July 19: 3:00 p.m. - 3:25 p.m.
Accurate Estimation of Heritability in Genome Wide Studies using Random Effects Models

Room: Hall L/M
Subject: Population Genomics

Author(s):
David Golan, Tel Aviv University, Israel
Saharon Rosset, Tel Aviv University, Israel

Session Chair: Eleazer Eskin

Random effects models have recently been introduced as an approach for analyzing genome-wide association studies (GWAS), which allows estimation of the overall heritability of traits without explicitly identifying the genetic loci responsible. Using this approach, Yang et al. (2010) have demonstrated that the heritability of height is much higher than the ~10% associated with identified genetic factors. However, Yang et al. (2010) relied on a heuristic for performing estimation in this model. We adopt the model framework of Yang et al. (2010) and develop a method for maximum likelihood (ML) estimation in this framework. Our method is based on MCEM (Wei et al., 1990), an expectation-maximization algorithm wherein a Markov chain Monte Carlo approach is used in the E-step. We demonstrate that this method leads to more stable and accurate heritability estimation compared to Yang et al.’s approach, and it also allows us to find ML estimates of the proportion of markers that are causal, indicating whether the heritability stems from a small number of powerful genetic factors or a large number of less powerful ones.
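
The variance-component model underlying this kind of analysis can be sketched compactly: with a genetic relationship matrix K, the phenotype is modeled as y ~ N(0, h2*K + (1 - h2)*I), where h2 is the heritability. The sketch below estimates h2 on simulated genotypes by a simple grid search over the likelihood (via the eigendecomposition of K); it illustrates the model only and does not reproduce the paper's MCEM algorithm or its estimate of the causal-marker proportion.

    import numpy as np

    rng = np.random.default_rng(5)
    n, m = 500, 1000                        # individuals, markers (simulated)
    G = rng.binomial(2, 0.3, size=(n, m)).astype(float)
    G = (G - G.mean(0)) / G.std(0)          # standardized genotypes
    K = G @ G.T / m                         # genetic relationship matrix

    h2_true = 0.6
    g = G @ rng.normal(0.0, np.sqrt(h2_true / m), m)   # genetic values, cov = h2*K
    y = g + rng.normal(0.0, np.sqrt(1 - h2_true), n)
    y = (y - y.mean()) / y.std()

    vals, vecs = np.linalg.eigh(K)
    y_rot = vecs.T @ y                      # rotate so the covariance is diagonal

    def neg_loglik(h2):
        var = h2 * vals + (1 - h2)
        return 0.5 * (np.log(var).sum() + (y_rot ** 2 / var).sum())

    grid = np.linspace(0.01, 0.99, 99)
    print("estimated h2:", grid[np.argmin([neg_loglik(h) for h in grid])])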


top
PT44Tuesday, July 19: 3:00 p.m. - 3:25 p.m.
Characterization and improvement of RNA-Seq precision in quantitative transcript expression profiling

Room: A353
Subject: Gene Regulation and Transcriptomics

Author(s):
Paweł P. Łabaj, Boku University Vienna, Austria
German G. Leparc, Boku University Vienna, Austria
Bryan E. Linggi, Pacific Northwest National Laboratory, United States
Lye Meng Markillie, Pacific Northwest National Laboratory, United States
H. Steven Wiley, Pacific Northwest National Laboratory, United States
David P. Kreil, Boku University Vienna, Austria

Session Chair: Robert Murphy

Motivation: Measurement precision determines the power of any analysis to reliably identify significant signals, such as in screens for differential expression, independent of whether the experimental design incorporates replicates or not. With the compilation of large-scale RNA-Seq data sets with technical replicate samples, however, we can now, for the first time, perform a systematic analysis of the precision of expression level estimates from massively parallel sequencing technology. This then allows considerations for its improvement by computational or experimental means. Results: We report on a comprehensive study of target coverage and measurement precision, including their dependence on transcript expression levels, read depth and other parameters. In particular, an impressive target coverage of 84% of the estimated true transcript population could be achieved with 331 million 50 bp reads, with diminishing returns from longer read lengths and even smaller gains from increased sequencing depths. Most of the measurement power (75%) is spent on only 7% of the known transcriptome, however, making less strongly expressed transcripts harder to measure. Consequently, less than 30% of all transcripts could be quantified reliably with a relative error < 20%. Based on established tools, we then introduce a new approach for mapping and analysing sequencing reads that yields substantially improved performance in gene expression profiling, increasing the number of transcripts that can reliably be quantified to over 40%. Extrapolations to higher sequencing depths highlight the need for efficient complementary steps. In the discussion we outline possible experimental and computational strategies for further improvements in quantification precision.
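
The kind of precision summary described above can be mimicked on simulated counts: compute each transcript's coefficient of variation across technical replicates and report the fraction quantified with relative error below 20%. The negative binomial parameters below are invented for illustration, and nothing here reproduces the study's mapping or analysis pipeline.

    import numpy as np

    rng = np.random.default_rng(6)
    n_transcripts, n_replicates = 5000, 7

    # Log-uniform expression levels and overdispersed (negative binomial) counts.
    mean_counts = 10 ** rng.uniform(-1, 4, size=n_transcripts)
    dispersion = 0.02                                 # assumed technical overdispersion
    p = 1.0 / (1.0 + dispersion * mean_counts)
    counts = rng.negative_binomial(1.0 / dispersion, p[:, None],
                                   size=(n_transcripts, n_replicates))

    cv = counts.std(axis=1, ddof=1) / np.maximum(counts.mean(axis=1), 1e-9)
    print("fraction with relative error < 20%:", np.mean(cv < 0.2).round(3))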


top
PT45Tuesday, July 19: 3:30 p.m. - 3:55 p.m.
StructHDP: Automatic inference of number of clusters from admixed genotype data

Room: Hall L/M
Subject: Population Genomics

Author(s):
Suyash Shringarpure, Carnegie Mellon University, United States
Daegun Won, Carnegie Mellon University, United States
Eric Xing, Carnegie Mellon University, United States

Session Chair: Eleazer Eskin

Motivation: Clustering of genotype data is an important way of understanding similarities and differences between populations. A summary of populations through clustering allows us to make inferences about the evolutionary history of the populations. Many methods have been proposed to perform clustering on multi-locus genotype data. However, most of these methods do not directly address the question of how many clusters the data should be divided into and leave that choice to the user. Methods: We present StructHDP, a method for automatically inferring the number of clusters from genotype data in the presence of admixture. Our method is an extension of two existing methods, Structure and Structurama. Using a Hierarchical Dirichlet Process, we model the presence of admixture of an unknown number of ancestral populations in a given sample of genotype data. We use a Gibbs sampler to perform inference on the resulting model and infer the ancestry proportions and the number of clusters that best explain the data. Results: To demonstrate our method, we simulated data from an island model using the neutral coalescent. Comparing the results of StructHDP with Structurama shows the utility of combining HDPs with the Structure model. We also used StructHDP to analyze a dataset of 155 Taita thrushes (Turdus helleri), which has previously been analyzed using Structure and Structurama. StructHDP correctly picks the optimal number of populations to cluster the data. The clustering based on the inferred ancestry proportions also agrees with that inferred using Structure for the optimal number of populations. We also analyzed data from 1048 individuals from 53 world populations in the Human Genome Diversity Project. We found that the clusters obtained correspond to major geographical divisions of the world, which is in agreement with previous analyses of the dataset.
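
The reason a (hierarchical) Dirichlet process lets the number of clusters be inferred rather than fixed can be seen in the Chinese restaurant process view of its prior, sketched below with toy draws: each new observation joins an existing cluster in proportion to its size, or opens a new cluster with probability proportional to the concentration parameter alpha. The full StructHDP model (admixed ancestries, Gibbs sampling over genotypes) is much richer than this prior-only illustration.

    import random

    def chinese_restaurant_process(n, alpha, seed=0):
        """Sample cluster sizes for n observations under a CRP prior."""
        rnd = random.Random(seed)
        counts = []                      # cluster sizes
        for i in range(n):
            weights = counts + [alpha]   # existing clusters, plus a new one
            r = rnd.uniform(0, i + alpha)
            acc, choice = 0.0, len(counts)
            for j, w in enumerate(weights):
                acc += w
                if r < acc:
                    choice = j
                    break
            if choice == len(counts):
                counts.append(1)
            else:
                counts[choice] += 1
        return counts

    for alpha in (0.5, 2.0, 10.0):
        print("alpha =", alpha, "-> cluster sizes:",
              chinese_restaurant_process(200, alpha))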


top
PT46Tuesday, July 19: 3:30 p.m. - 3:55 p.m.
An Integrative Clustering and Modeling Algorithm for Dynamical Gene Expression Data

Room: A353
Subject: Gene Regulation and Transcriptomics

Author(s):
Julia Sivriver, Hebrew University of Jerusalem, Israel
Naomi Habib, Hebrew University of Jerusalem, Israel
Nir Friedman, Hebrew University of Jerusalem, Israel

Session Chair: Robert Murphy

Motivation: The precise dynamics of gene expression is often crucial for proper response to stimuli. Time-course gene-expression profiles can provide insights about the dynamics of many cellular responses, but are often noisy and measured at arbitrary intervals, posing a major analysis challenge. Results: We developed an algorithm that interleaves clustering time-course gene-expression data with estimation of dynamic models of their response by biologically meaningful parameters. In combining these two tasks we overcome obstacles posed in each one. Moreover, our approach provides an easy way to compare responses to different stimuli at the dynamical level. We use our approach to analyze the dynamical transcriptional responses to inflammation and anti-viral stimuli in mouse primary dendritic cells, and extract a concise representation of the different dynamical response types. We analyze the similarities and differences between the two stimuli and identify potential regulators of this complex transcriptional response. Availability: The code for our method is freely available (http://www.compbio.cs.huji.ac.il/~DynaMiteC)
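
One ingredient of such an approach, fitting a parametric response curve with interpretable parameters to a noisy time course, can be sketched as follows. The double-sigmoid pulse used here is a common generic choice and not necessarily the authors' model; the data are simulated, and the interleaving with clustering is not shown.

    import numpy as np
    from scipy.optimize import curve_fit
    from scipy.special import expit

    def pulse(t, base, amp, t_on, t_off, slope):
        """Baseline plus a smooth rise-then-decay pulse (two sigmoids)."""
        return base + amp * expit(slope * (t - t_on)) * expit(-slope * (t - t_off))

    rng = np.random.default_rng(7)
    t = np.array([0, 0.5, 1, 2, 4, 6, 8, 12, 24], dtype=float)   # hours (toy design)
    true_curve = pulse(t, base=1.0, amp=3.0, t_on=1.0, t_off=6.0, slope=2.0)
    expr = true_curve + rng.normal(0, 0.15, t.size)

    params, _ = curve_fit(pulse, t, expr,
                          p0=[expr.min(), expr.max() - expr.min(), 1.0, 8.0, 1.0],
                          maxfev=10000)
    print(dict(zip(["base", "amp", "t_on", "t_off", "slope"], params.round(2))))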


top
PT47Tuesday, July 19: 4:00 p.m. - 4:25 p.m.
Reconstruction of genealogical relationships with applications to Phase III of HapMap

Room: Hall L/M
Subject: Population Genomics

Author(s):
Sofia Kyriazopoulou-Panagiotopoulou, Stanford, United States
Dorna Kashefhaghighi, Stanford, United States
Sarah J. Aerni, Stanford, United States
Andreas Sundquist, Stanford, United States
Sivan Bercovici, Stanford, United States
Serafim Batzoglou, Stanford, United States

Session Chair: Eleazer Eskin

Motivation: Accurate inference of genealogical relationships between pairs of individuals is paramount in association studies, forensics and evolutionary analyses of wildlife populations. Current methods for relationship inference consider only a small set of close relationships and have little or no power to distinguish between relationships with the same number of meioses separating the individuals under consideration (e.g. aunt-niece vs niece-aunt, or first cousins vs great aunt-niece). In association studies, these limitations can lead to unnecessary exclusions of putative relatives. Results: We present CARROT (ClAssification of Relationships with ROTations), a novel framework for relationship inference that leverages linkage information to differentiate between rotated relationships, that is, between relationships with the same number of common ancestors and the same number of meioses separating the individuals under consideration. We demonstrate that CARROT clearly outperforms existing methods on simulated data. We also applied CARROT to four populations from Phase III of the HapMap Project and detected previously unreported pairs of third- and fourth-degree relatives.


top
PT48Tuesday, July 19: 4:00 p.m. - 4:25 p.m.
A novel computational framework for simultaneous integration of multiple functional genomic data to identify microRNA-gene regulatory modules

Room: A353
Subject: Gene Regulation and Transcriptomics

Author(s):
Shihua Zhang, University of Southern California, United States
Qingjiao Li, University of Southern California, United States
Juan Liu, Wuhan University, China
Xianghong Jasmine Zhou, University of Southern California, United States

Session Chair: Robert Murphy

It is well known that microRNAs (miRNAs) and genes work cooperatively to form the key part of gene regulatory networks. However, the specific functional roles of most miRNAs and their combinatorial effects in cellular processes are still unclear. The availability of multiple types of functional genomic data provides unprecedented opportunities to study miRNA-gene regulation. A major challenge is how to integrate the diverse genomic data to identify the regulatory modules of miRNAs and genes. Here we propose an effective data integration framework to identify miRNA-gene regulatory co-modules. The miRNA and gene expression profiles are jointly analyzed in a multiple non-negative matrix factorization framework, and additional network data are simultaneously integrated in a regularized manner. We also impose sparsity penalties on the variables to achieve modular solutions. The mathematical formulation can be effectively solved by an iterative multiplicative updating algorithm. We apply the proposed method to integrate a set of heterogeneous data sources comprising the expression profiles of miRNAs and genes in 385 human ovarian cancer samples, computationally predicted miRNA-gene interactions, and gene-gene interactions. We demonstrate that the miRNAs and genes in most regulatory co-modules are significantly associated. Moreover, the co-modules are significantly enriched in known functional sets such as miRNA clusters, GO biological processes and KEGG pathways. Furthermore, many miRNAs and genes in the co-modules are related to various cancers, including ovarian cancer. Finally, we show that the co-modules can stratify patients (samples) into groups with significant clinical characteristics.
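
The core machinery, non-negative matrix factorization with a sparsity penalty solved by multiplicative updates, can be sketched on a single toy matrix as below. The paper's framework jointly factorizes the miRNA and gene expression matrices and adds network-based regularization terms, all of which are omitted from this simplified illustration.

    import numpy as np

    def sparse_nmf(X, k, lam=0.1, n_iter=500, seed=0):
        """NMF minimizing ||X - WH||^2 + lam*||H||_1 via multiplicative updates."""
        rng = np.random.default_rng(seed)
        n, m = X.shape
        W = rng.random((n, k))
        H = rng.random((k, m))
        eps = 1e-9
        for _ in range(n_iter):
            W *= (X @ H.T) / (W @ H @ H.T + eps)
            H *= (W.T @ X) / (W.T @ W @ H + lam + eps)
        return W, H

    rng = np.random.default_rng(8)
    X = rng.random((60, 3)) @ rng.random((3, 200))     # toy non-negative expression matrix
    W, H = sparse_nmf(X, k=3, lam=0.5)
    print("relative reconstruction error:",
          round(np.linalg.norm(X - W @ H) / np.linalg.norm(X), 3))
    print("fraction of near-zero loadings in H:", (H < 1e-3).mean().round(2))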


top