Proceedings Track Presentations

Highlights, Late Breaking Research and Proceedings Track presentations will be presented by Theme.

Data:

Includes data and text-mining, ontologies, databases and machine learning approaches that do not fit in other categories.

Disease:

Includes analysis of mutations, phenotypes, drugs, epidemiology and other clinically relevant areas.

Genes:

Includes work in genes (including non-coding RNA), transcriptomes, genomes and variation.

Proteins:

Includes analysis of proteins and their structures and proteomics.

Systems:

This theme includes higher level systems such as cells, tissues, whole organisms and ecosystems. Includes systems biology, molecular interactions and genetic regulation.

Other:

Research areas that do not fall within the five (5) main thematic areas. The organizers may, at their discretion, move submissions to other thematic areas.

Conference proceedings will be available as an open access, online-only issue of Bioinformatics after June 15, 2015.

Data
Date: Tuesday, July 14 10:30 am - 10:50 am
Room: Liffey Hall 2

Authors:
Sheng Wang, University of Illinois at Urbana-Champaign, United States
Hyunghoon Cho, Massachusetts Institute of Technology, United States
Chengxiang Zhai, University of Illinois at Urbana-Champaign, United States
Bonnie Berger, Massachusetts Institute of Technology, United States
Jian Peng, University of Illinois at Urbana-Champaign, United States

Area Session Chair: Ioannis Xenarios

Presentation Overview:
Motivation: Systematically predicting gene (or protein) function based on molecular interaction networks has become an important tool in refining and enhancing the existing annotation catalogs, such as the Gene Ontology (GO) database. However, functional labels with only a few (<10) annotated genes, which constitute about half of the GO terms in yeast, mouse and human, pose a unique challenge in that any prediction algorithm that independently considers each label faces a paucity of information and thus is prone to capture non-generalizable patterns in the data, resulting in poor predictive performance. There exist a variety of algorithms for function prediction, but none properly address this “overfitting” issue of sparsely annotated functions, or do so in a manner scalable to tens of thousands of functions in the human catalog.

Results: We propose a novel function prediction algorithm, clusDCA, which transfers information between similar functional labels to alleviate the overfitting problem for sparsely annotated functions. Our method is scalable to datasets with a large number of annotations. In a cross-validation experiment in yeast, mouse and human, our method greatly outperformed previous state-of-the-art function prediction algorithms in predicting sparsely annotated functions, without sacrificing the performance on labels with sufficient information. Furthermore, we show that our method can accurately predict genes that will be assigned a functional label that has no known annotations, based only on the ontology graph structure and genes associated with other labels, which further suggests that our method effectively utilizes the similarity between gene functions.

Availability: https://github.com/wangshenguiuc/clusDCA

Date: Tuesday, July 14 12:20 pm - 12:40 pm
Room: Liffey Hall 2

Authors:
Davy Weissenbacher, Arizona State University, United States
Tasnia Tahsin, Arizona State University, United States
Rachel Beard, Arizona State University, United States
Mari Firago, Arizona State University, United States
Robert Rivera, Arizona State University, United States
Matthew Scotch, Arizona State University, United States
Graciela Gonzalez, Arizona State University, United States

Area Session Chair: Ioannis Xenarios

Presentation Overview:
Diseases caused by zoonotic viruses (viruses transmittable between humans and animals) are a major threat to public health throughout the world. By studying virus migration and mutation patterns, the field of phylogeography provides a valuable tool for improving their surveillance. A key component in phylogeographic analysis of zoonotic viruses involves identifying the specific locations of relevant viral sequences. This is usually accomplished by querying public databases such as GenBank and examining the geospatial metadata in the record. When sufficient detail is not available, a logical next step is for the researcher to conduct a manual survey of the corresponding published articles.

In this paper, we present a system for detection and disambiguation of locations (toponym resolution) in full-text articles in order to automate the retrieval of sufficient metadata. Our system has been tested on a manually annotated corpus of journal articles related to phylogeography using integrated heuristics for location disambiguation including a distance heuristic, a population heuristic, and a novel heuristic utilizing knowledge obtained from GenBank metadata (i.e. a "metadata heuristic").

For detecting and disambiguating locations, our system performed best using the metadata heuristic (0.54 Precision, 0.89 Recall and 0.68 F-score). Precision reaches 0.88 when examining only the disambiguation of location names. Our error analysis showed that a noticeable increase in the accuracy of toponym resolution is possible by improving the geospatial location detection. By improving these fundamental automated tasks, our system can be a useful resource to phylogeographers that rely on geospatial metadata of GenBank sequences.

Date: Monday, July 13 3:30 pm - 3:50 pm
Room: Wicklow Hall 2A

Authors:
Andrew Palmer, EMBL, Germany
Ekaterina Ovchinnikova, EMBL, Germany
Mikael Thune, Denator, Sweden
Regis Lavigne, Inserm U1085, France
Blandine Guevel, Inserm U1085, France
Andrey Dyatlov, Uni Bremen, Germany
Olga Vitek, Northeastern University, United States
Charles Pineau, Inserm U1085, France
Mats Boren, Denator, Sweden
Theodore Alexandrov, EMBL, Germany

Area Session Chair: Robert F. Murphy

Presentation Overview:
Motivation: Imaging Mass Spectrometry (IMS) is a maturing technique of molecular imaging. Confidence in the reproducible quality of IMS data is essential for its integration into routine use. However, the predominant method for assessing quality is visual examination, a time-consuming, unstandardised and non-scalable approach. So far, the problem of assessing the quality has only been marginally addressed and existing measures do not account for the spatial information of IMS data. Importantly, no approach exists for unbiased evaluation of potential quality measures.

Results: We propose a novel approach for evaluating potential measures by creating a gold-standard set using collective expert judgements upon which we evaluated image-based measures. To produce a gold standard, we engaged 80 IMS experts, each to rate the relative quality between 52 pairs of ion images from MALDI-TOF IMS datasets of rat brain coronal sections. Experts’ optional feedback on their expertise, the task and the survey showed that (i) they had diverse backgrounds and sufficient expertise, (ii) the task was properly understood, and (iii) the survey was comprehensible. A moderate inter-rater agreement was achieved with Krippendorff’s alpha of 0.5. A gold-standard set of 634 pairs of images with accompanying ratings was constructed and showed a high agreement of 0.85. Eight families of potential measures with a range of parameters and statistical descriptors, giving 143 in total, were evaluated. Both signal-to-noise and spatial chaos based measures performed highly with a correlation of 0.7 to 0.9 with the gold standard ratings. Moreover, we showed that a composite measure with the linear coefficients (trained on the gold standard with regularised least squares optimisation and lasso) showed a strong linear correlation of 0.94 and an accuracy of 0.98 in predicting which image in a pair was of higher quality.

Availability: The anonymised data collected from the survey and the Matlab source code for data processing can be found at: https://github.com/alexandrovteam/IMS_quality.

Date: Monday, July 13 4:10 pm - 4:30 pm
Room: Wicklow Hall 2A

Authors:
Alice Schoenauer Sebag, Mines ParisTech - INSERM - Agro Paristech, France
Sandra Plancade, INRA, France
Céline Raulet-Tomkiewicz, INSERM - Paris V, France
Robert Barouki, INSERM - Paris V, France
Jean-Philippe Vert, Mines ParisTech - Institut Curie, France
Thomas Walter, Institut Curie, France

Area Session Chair: Robert F. Murphy

Presentation Overview:
Motivation: Motility is a fundamental cellular attribute, which plays a major part in processes ranging from embryonic development to metastasis. Traditionally, single cell motility is often studied by live cell imaging. Yet, such studies were so far limited to low throughput. In order to systematically study cell motility at a large scale, we need robust methods to quantify cell trajectories in live cell imaging data.

Results: The primary contribution of this paper is to present MotIW, a generic workflow for the study of single cell motility in High-Throughput (HT) time-lapse screening data. It is composed of cell tracking, cell trajectory mapping to an original feature space, and hit detection according to a new statistical procedure. We show that this workflow is scalable and demonstrate its power by application to simulated data, as well as large-scale live cell imaging data. This application enables the identification of an ontology of cell motility patterns in a fully unsupervised manner.

Availability: Python code and examples available at http://cbio.ensmp.fr/~aschoenauer/motiw.html
Contact: thomas.walter@mines-paristech.fr

Date: Tuesday, July 14 12:00 pm - 12:20 pm
Room: Liffey Hall 2

Authors:
Ke Liu, Fudan University, China
Shengwen Peng, Fudan University, China
Junqiu Wu, Central South University, China
Chengxiang Zhai, UIUC, United States
Hiroshi Mamitsuka, Kyoto University, Japan
Shanfeng Zhu, Fudan University, China

Area Session Chair: Ioannis Xenarios

Presentation Overview:
Motivation: Medical Subject Headings (MeSH) are used by the National Library of Medicine (NLM) to index almost all citations in MEDLINE, which greatly facilitates the applications of biomedical information retrieval and text mining. To reduce the time and financial cost of manual annotation, NLM has developed a software package, Medical Text Indexer (MTI), for assisting MeSH annotation, which uses k-nearest neighbors (KNN), pattern matching and indexing rules. Other types of information, such as prediction by MeSH classifiers (trained separately), can also be used for automatic MeSH annotation. However, existing methods cannot effectively integrate multiple evidence for MeSH annotation.

Methods: We propose a novel framework, MeSHLabeler, to integrate multiple evidence for accurate MeSH annotation by using "learning to rank". Evidence includes numerous predictions from MeSH classifiers, KNN, pattern matching, MTI and the correlation between different MeSH terms, etc. Each MeSH classifier is trained independently, and thus prediction scores from different classifiers are incomparable. To address this issue, we have developed an effective score normalization procedure to improve the prediction accuracy.

Results: MeSHLabeler won first place in Task 2A of the 2014 BioASQ challenge, achieving a Micro F-measure of 0.6248 for 9,040 citations provided by the BioASQ challenge. Note that this accuracy is around 9.15% higher than 0.5724, obtained by MTI.

Availability: The software is available upon request.
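
The relative improvement quoted above can be checked with a short calculation. This is a minimal sketch; the function name `relative_improvement` is introduced here for illustration and is not part of MeSHLabeler or MTI.

```python
def relative_improvement(new_score, baseline_score):
    """Return the gain of new_score over baseline_score as a fraction of the baseline."""
    return (new_score - baseline_score) / baseline_score


# Micro F-measures reported in the abstract.
meshlabeler_f = 0.6248
mti_f = 0.5724

gain = relative_improvement(meshlabeler_f, mti_f)
print(f"Relative improvement over MTI: {gain:.2%}")
```

The gain is expressed relative to the MTI baseline, which is why it is larger than the absolute difference of 0.0524 between the two scores.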

Disease
Date: Sunday, July 12 3:30 pm - 3:50 pm
Room: The Auditorium

Authors:
A. Grant Schissler, The University of Arizona, United States
Vincent Gardeux, The University of Arizona, United States
Qike Li, The University of Arizona, United States
Ikbel Achour, The University of Arizona, United States
Haiquan Li, The University of Arizona, United States
Walter W Piegorsch, The University of Arizona, United States
Yves A Lussier, The University of Arizona, United States

Area Session Chair: Yana Bromberg

Presentation Overview:
Motivation: The conventional approach to personalized medicine relies on molecular data analytics across multiple patients. The path to precision medicine lies with molecular data analytics that can discover interpretable single-subject signals (N-of-1). We developed a global framework, N-of-1-pathways, for a mechanistic-anchored approach to single-subject gene expression data analysis. We previously employed a metric that could prioritize the statistical significance of a deregulated pathway in single subjects; however, it lacked quantitative interpretability (e.g., the equivalent to a gene expression fold-change).

Results: In this study, we extend our previous approach with the application of statistical Mahalanobis distance to quantify personal pathway-level deregulation. We demonstrate that this approach, N-of-1-pathways Paired Samples Mahalanobis Distance (N-OF-1-PATHWAYS-MD), detects deregulated pathways (empirical simulations), while not inflating false positive rate using a study with biological replicates. Finally, we establish that N-OF-1-PATHWAYS-MD scores are biologically significant, clinically relevant, and predictive of breast cancer survival (p<0.05, n=80 invasive carcinoma; TCGA RNA-sequences).
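
For readers unfamiliar with the Mahalanobis distance mentioned above, the following is a minimal sketch of the generic textbook definition in two dimensions. It is not the authors' paired-sample formulation, whose details are not given in this abstract; the function name `mahalanobis_2d` and the example values are hypothetical.

```python
import math


def mahalanobis_2d(point, mean, cov):
    """Generic Mahalanobis distance of a 2-D point from a distribution.

    cov is a 2x2 covariance matrix given as ((a, b), (c, d)).
    """
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det == 0:
        raise ValueError("covariance matrix is singular")
    dx = point[0] - mean[0]
    dy = point[1] - mean[1]
    # Quadratic form (dx, dy) * inverse(cov) * (dx, dy)^T,
    # with the 2x2 inverse written out explicitly.
    q = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    return math.sqrt(q)


# With an identity covariance the distance reduces to the Euclidean distance.
euclidean_case = mahalanobis_2d((3.0, 4.0), (0.0, 0.0), ((1.0, 0.0), (0.0, 1.0)))

# A larger variance along the first axis shrinks distances in that direction.
scaled_case = mahalanobis_2d((2.0, 0.0), (0.0, 0.0), ((4.0, 0.0), (0.0, 1.0)))
```

Unlike a plain Euclidean distance, the result is expressed in units of the distribution's own spread, which is what makes it usable as a quantitative deregulation score.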

Conclusion: N-of-1-pathways MD provides a practical approach towards precision medicine. The method generates the magnitude and the biological significance of personal deregulated pathways results derived solely from the patient’s transcriptome. These pathways offer the opportunities for deriving clinically actionable decisions that have the potential to complement the clinical interpretability of personal polymorphisms obtained from DNA acquired or inherited polymorphisms and mutations. In addition, it offers an opportunity for applicability to diseases in which DNA changes may not be relevant, and thus expand the “interpretable ‘omics” of single subjects (e.g. personalome).

Availability: http://www.lussierlab.net/publications/N-of-1-pathways
Contact: yves@email.arizona.edu

Date: Sunday, July 12 3:50 pm - 4:10 pm
Room: The Auditorium

Authors:
Amin Allahyar, Delft University of Technology, Netherlands
Jeroen De Ridder, Delft University of Technology, Netherlands

Area Session Chair: Yana Bromberg

Presentation Overview:
Motivation: Breast cancer outcome prediction based on gene expression profiles is an important strategy for personalized patient care. To improve performance and consistency of discovered markers of the initial molecular classifiers, Network based Outcome Prediction methods (NOPs) have been proposed. In spite of the initial claims, recent studies revealed that neither performance nor consistency can be improved using these methods. NOPs typically rely on the construction of meta-genes by averaging the expression of several genes connected in a network that encodes protein interactions or pathway information. In this paper, we expose several fundamental issues in NOPs that impede the prediction power and the consistency of discovered markers, and obscure biological interpretation.

Results: To overcome these issues, we propose FERAL, a network-based classifier that hinges upon the Sparse Group Lasso, which performs simultaneous selection of marker genes and training of the prediction model. An important feature of FERAL, and a significant departure from existing NOPs, is that it uses multiple operators to summarize genes into meta-genes. This gives the classifier the opportunity to select the most relevant meta-gene for each gene set. Extensive evaluation revealed that the discovered markers are markedly more stable across independent datasets. Moreover, interpretation of the marker genes detected by FERAL reveals valuable mechanistic insight into the aetiology of breast cancer.
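
The idea of summarizing one gene set with several operators, rather than with the average alone, can be sketched as follows. The abstract does not list which operators FERAL uses, so the set chosen here (mean, median, min, max), the function name `meta_genes` and the expression values are assumptions made purely for illustration.

```python
from statistics import mean, median

# Hypothetical operator set; the operators actually used by FERAL are not
# stated in the abstract.
OPERATORS = {"mean": mean, "median": median, "min": min, "max": max}


def meta_genes(expression, gene_set):
    """Summarize the expression of one gene set with every operator.

    expression maps gene name -> expression value for a single sample.
    Returns one candidate meta-gene value per operator.
    """
    values = [expression[gene] for gene in gene_set]
    return {name: op(values) for name, op in OPERATORS.items()}


# Made-up expression values for a three-gene set.
sample = {"GENE_A": 1.0, "GENE_B": 3.0, "GENE_C": 8.0}
candidates = meta_genes(sample, ["GENE_A", "GENE_B", "GENE_C"])
```

A sparse classifier trained on all candidates can then keep whichever summary is most predictive for each gene set, which is the selection step the abstract attributes to the Sparse Group Lasso.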

Date: Tuesday, July 14 10:30 am - 10:50 am
Room: The Liffey A

Authors:
Yoo-Ah Kim, NCBI/NLM/NIH, United States
Dongyeon Cho, NCBI/NLM/NIH, United States
Phuong Dao, NCBI/NLM/NIH, United States
Teresa Przytycka, NCBI/NLM/NIH, United States

Area Session Chair: Louxin Zhang

Presentation Overview:
The data gathered by the Pan-Cancer initiative has created an unprecedented opportunity for illuminating common features across different cancer types. However, separating tissue-specific features from across-cancer signatures has proven to be challenging. One of the often-observed properties of the mutational landscape of cancer is the mutual exclusivity of cancer driving mutations. Even though studies based on individual cancer types suggested that mutually exclusive pairs often share the same functional pathway, the relationship between across-cancer mutual exclusivity and functional connectivity has not been previously investigated. Here we introduce a classification of mutual exclusivity into three basic classes: within tissue type exclusivity, across tissue type exclusivity, and between tissue type exclusivity. We then combined across-cancer mutual exclusivity with interactions data to uncover pan-cancer dysregulated pathways. Our new method, Mutual Exclusivity Module Cover (MEMCover), not only identified previously known Pan-Cancer dysregulated subnetworks but also novel subnetworks whose across-cancer role has not been well appreciated before. In addition, we demonstrate the existence of mutual exclusivity hubs, putatively corresponding to cancer drivers with strong growth advantages. Finally, we show that while mutually exclusive pairs within or across cancer types are predominantly functionally interacting, the pairs in the between-cancer mutual exclusivity class are more often disconnected in functional networks.
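
Mutual exclusivity of two mutated genes is commonly assessed by asking how unlikely their small overlap would be by chance. The sketch below uses a standard one-sided hypergeometric test for that purpose; it is an assumption made for illustration and not necessarily the scoring used by MEMCover, and the function name `exclusivity_pvalue` and the sample sets are hypothetical.

```python
from math import comb


def exclusivity_pvalue(n_samples, mutated_a, mutated_b):
    """One-sided hypergeometric p-value for mutual exclusivity.

    mutated_a and mutated_b are sets of sample identifiers carrying a
    mutation in gene A and gene B. The p-value is the probability of seeing
    an overlap this small or smaller if the mutations were independent.
    """
    a = len(mutated_a)
    b = len(mutated_b)
    overlap = len(mutated_a & mutated_b)
    total = comb(n_samples, b)
    # Sum the hypergeometric probabilities for overlaps 0..observed.
    favourable = sum(
        comb(a, k) * comb(n_samples - a, b - k) for k in range(overlap + 1)
    )
    return favourable / total


# Ten samples: gene A mutated in samples 0-4, gene B in samples 5-9.
p_value = exclusivity_pvalue(10, set(range(0, 5)), set(range(5, 10)))
```

A small p-value indicates that the two genes are mutated together less often than expected, which is the pattern the abstract calls mutual exclusivity.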

Date: Sunday, July 12 4:10 pm - 4:30 pm
Room: The Auditorium

Authors:
Nora Katharina Speicher, Max Planck Institute for Informatics, Germany
Nico Pfeifer, Max Planck Institute for Informatics, Germany

Area Session Chair: Yana Bromberg

Presentation Overview:
Despite ongoing cancer research, available therapies are still limited in quantity and effectiveness, and making treatment decisions for individual patients remains a hard problem. Established subtypes, which help guide these decisions, are mainly based on individual data types. However, the analysis of multidimensional patient data involving the measurements of various molecular features could reveal intrinsic characteristics of the tumor. Large-scale projects accumulate this kind of data for various cancer types, but we still lack the computational methods to reliably integrate this information in a meaningful manner. Therefore, we apply and extend current multiple kernel learning for dimensionality reduction approaches. On the one hand, we add a regularization term to avoid overfitting during the optimization procedure, and on the other hand, we show that one can even use several kernels per data type and thereby alleviate the user from having to choose the best kernel functions and kernel parameters for each data type beforehand.

We have identified biologically meaningful subgroups for five different cancer types. Survival analysis has revealed significant differences between the survival times of the identified subtypes, with P-values comparable or even better than state-of-the-art methods. Moreover, our resulting subtypes reflect combined patterns from the different data sources, and we demonstrate that input kernel matrices with only little information have less impact on the integrated kernel matrix. Our subtypes show different responses to specific therapies, which could eventually assist in treatment decision making.

Date: Monday, July 13 3:50 pm - 4:10 pm
Room: The Auditorium

Authors:
Yang Chen, Case Western Reserve University, United States
Li Li, Case Western Reserve University, United States
Guo-Qiang Zhang, Case Western Reserve University, United States
Rong Xu, Case Western Reserve University, United States

Area Session Chair: Paul Horton

Presentation Overview:
Motivation: Discerning genetic contributions to diseases not only enhances our understanding of disease mechanisms, but also leads to translational opportunities for drug discovery. Recent computational approaches incorporate disease phenotypic similarities to improve the prediction power of disease gene discovery. However, most current studies used only one data source of human disease phenotype. We present an innovative and generic strategy for combining multiple different data sources of human disease phenotype and predicting disease associated genes from integrated phenotypic and genomic data.

Methods: To demonstrate our approach, we explored a new phenotype database from biomedical ontologies and constructed Disease Manifestation Network (DMN). We combined DMN with mimMiner, which was a widely-used phenotype database in disease gene prediction studies. We developed a network analysis approach to predict disease-gene associations from the integrated disease phenotype networks and a gene network.

Results: Our approach achieved significantly improved performance over a baseline method, which used only one phenotype data source. In the leave-one-out cross validation and de novo gene prediction analysis, our approach achieved the area under the curves (AUCs) of 90.7% and 90.3%, which are significantly higher than 84.2% (p

Date: Monday, July 13 10:10 am - 10:30 am
Room: Liffey Hall 2

Authors:
David Amar, Tel Aviv University, Israel
Daniel Yekutieli, Tel Aviv University, Israel
Adi Maron-Katz, Tel Aviv University, Israel
Talma Hendler, Tel Aviv University, Israel
Ron Shamir, Tel Aviv University, Israel

Area Session Chair: Yves Moreau

Presentation Overview:
Motivation: Detecting modules of coordinated activity is fundamental in the analysis of large biological studies. For two-dimensional data (e.g. genes x patients) this is often done via clustering or biclustering. More recently, studies monitoring patients over time have added another dimension. Analysis is much more challenging in this case, especially when time measurements are not synchronized. New methods that can analyze 3-way data are thus needed.

Results: We present a new algorithm for finding coherent and flexible modules in 3-way data. Our method can identify both core modules that appear in multiple patients and patient-specific augmentations of these core modules that contain additional genes. Our algorithm is based on a hierarchical Bayesian data model and Gibbs sampling. The algorithm outperforms extant methods on both simulated and real data. The method successfully dissected key components of septic shock response from time series measurements of gene expression. Detected patient-specific module augmentations were informative for disease outcome. In analyzing brain fMRI time series of subjects at rest, it detected the pertinent brain regions involved.

Availability: R code and data are available at http://acgt.cs.tau.ac.il/twigs/

Genes
Date: Monday, July 13 2:00 pm - 2:20 pm
Room: The Liffey A

Authors:
Andrea Ocone, Helmholtz Center Munich, Germany
Laleh Haghverdi, Helmholtz Center Munich, Germany
Nikola S. Mueller, Helmholtz Center Munich, Germany
Fabian J. Theis, Helmholtz Zentrum München; German Research Center for Environmental Health, Germany

Area Session Chair: Uwe Ohler

Presentation Overview:
Motivation: High-dimensional single-cell snapshot data is becoming widespread in the systems biology community, as a means to understand biological processes at the cellular level. However, as temporal information is lost with such data, mathematical models have been limited to capturing only static features of the underlying cellular mechanisms.

Results: Here, we present a modular framework that makes it possible to recover the temporal behaviour from single-cell snapshot data and reverse engineer the dynamics of gene expression. The framework combines a dimensionality reduction method with a cell time-ordering algorithm to generate pseudo time-series observations. These are in turn used to learn transcriptional ODE models and do model selection on structural network features. We apply it to synthetic data and then to real hematopoietic stem cell data, to reconstruct gene expression dynamics during differentiation pathways and infer the structure of a key gene regulatory network.

Date: Sunday, July 12 12:00 pm - 12:20 pm
Room: The Liffey A

Authors:
Timothy J. Close, University of California, Riverside, United States
Stefano Lonardi, University of California, Riverside, United States
Hamid Mirebrahim, University of California, Riverside, United States

Area Session Chair: Siu Ming Yiu

Presentation Overview:
We introduce a new divide and conquer approach to deal with the problem of de novo genome assembly in the presence of ultra-deep sequencing data (i.e., coverage of 1,000x or higher). Our proposed meta-assembler SLICEMBLER partitions the input data into optimal-sized “slices” and uses a standard assembly tool (e.g., Velvet, SPAdes, IDBA, Ray) to assemble each slice individually. SLICEMBLER uses majority voting among the individual assemblies to identify long contigs that can be merged to the consensus assembly.

To improve its efficiency, SLICEMBLER uses a generalized suffix tree to identify these frequent contigs (or fraction thereof). Extensive experimental results on real ultra-deep sequencing data (8,000x coverage) and simulated data show that SLICEMBLER significantly improves the quality of the assembly compared to the performance of the base assembler. In fact, most of the time SLICEMBLER generates error-free assemblies. We also show that SLICEMBLER is much more resistant against high sequencing error rate than the base assembler. SLICEMBLER can be accessed at http://slicembler.cs.ucr.edu/
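
The slice-then-vote idea described above can be illustrated with a deliberately simplified sketch. It assumes exact string matching between whole contigs, whereas SLICEMBLER itself uses a generalized suffix tree and also considers fractions of contigs; the function names, the slice size and the toy contig strings are all hypothetical, and the per-slice assemblies are supplied by hand rather than produced by a real assembler.

```python
from collections import Counter


def slice_reads(reads, slice_size):
    """Partition the list of reads into consecutive slices of at most slice_size."""
    return [reads[i:i + slice_size] for i in range(0, len(reads), slice_size)]


def majority_contigs(assemblies):
    """Keep contigs that appear in more than half of the slice assemblies.

    assemblies is a list with one collection of contig strings per slice.
    Each contig is counted at most once per slice.
    """
    counts = Counter(contig for asm in assemblies for contig in set(asm))
    threshold = len(assemblies) / 2
    return {contig for contig, n in counts.items() if n > threshold}


# Toy input: five read identifiers split into slices of two.
slices = slice_reads(["r1", "r2", "r3", "r4", "r5"], 2)

# Hand-written stand-ins for the output of a base assembler on three slices.
assemblies = [["ACGTAC", "TTGA"], ["ACGTAC"], ["ACGTAC", "GGCC"]]
consensus = majority_contigs(assemblies)
```

Contigs that only one slice produces are treated as likely assembly errors and dropped, while contigs reproduced across most slices are kept for the consensus.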

Date: Sunday, July 12 12:20 pm - 12:40 pm
Room: The Liffey A

Authors:
Martin Muggli, Colorado State University, United States
Simon Puglisi, University of Helsinki, Finland
Roy Ronen, University of California, San Diego, United States
Christina Boucher, Colorado State University, United States

Area Session Chair: Siu Ming Yiu

Presentation Overview:
Motivation: A crucial problem in genome assembly is the discovery and correction of misassembly errors in draft genomes. We develop a method called MISSEQUEL that enhances the quality of draft genomes by identifying misassembly errors and their breakpoints using paired-end sequence reads and optical mapping data. Our method also fulfills the critical need for open source computational methods for analyzing optical mapping data. We apply our method to various assemblies of the loblolly pine, Francisella tularensis, rice and budgerigar genomes. We generated and used simulated optical mapping data for loblolly pine and Francisella tularensis, and used real optical mapping data for rice and budgerigar.

Results: Our results demonstrate that we detect more than 54% of extensively misassembled contigs and more than 60% of locally misassembled contigs in assemblies of Francisella tularensis, and between 31% and 100% of extensively misassembled contigs and between 57% and 73% of locally misassembled contigs in assemblies of loblolly pine. Using the real optical mapping data, we correctly identified 75% of extensively misassembled contigs and 100% of locally misassembled contigs in rice, and 77% of extensively misassembled contigs and 80% of locally misassembled contigs in budgerigar.

Availability: MISSEQUEL can be used as a post-processing step in combination with any genome assembler and is freely available at http://www.cs.colostate.edu/seq/

Date: Sunday, July 12 2:40 pm - 3:00 pm
Room: The Liffey A

Authors:
Salem Malikic, Simon Fraser University, Canada
Ibrahim Numanagić, Simon Fraser University, Canada
Victoria Pratt, Indiana University School of Medicine, United States
Todd Skaar, IUPUI, United States
David A. Flockhart, Indiana University School of Medicine, United States
S. Cenk Sahinalp, Simon Fraser University, Canada

Area Session Chair: Reinhard Schneider

Presentation Overview:
Motivation: CYP2D6 is a highly polymorphic gene which encodes the (CYP2D6) enzyme, involved in the metabolism of 20-25% of all clinically prescribed drugs and other xenobiotics in the human body. CYP2D6 genotyping is recommended prior to treatment decisions involving one or more of the numerous drugs sensitive to CYP2D6 allelic composition. In this context, High Throughput Sequencing (HTS) technologies provide a promising time-efficient and cost-effective alternative to currently used genotyping techniques. In order to achieve accurate interpretation of HTS data, however, one needs to overcome several obstacles such as high sequence similarity and genetic recombinations between CYP2D6 and evolutionarily related pseudogenes CYP2D7 and CYP2D8, high copy number variation among individuals, and short read lengths generated by HTS technologies.

Results: In this work, we present the first algorithm to computationally infer CYP2D6 genotype at basepair resolution from HTS data. Our algorithm is able to resolve complex genotypes, including alleles that are the products of duplication, deletion and fusion events involving CYP2D6 and its evolutionarily related cousin CYP2D7. Through extensive experiments using simulated and real datasets we show that our algorithm accurately solves this important problem with potential clinical implications.

Availability: Cypiripi is available at http://sfu-compbio.github.io/cypiripi.
Contact: S. Cenk Sahinalp (cenk@sfu.ca)

Date: Monday, July 13 10:10 am - 10:30 am
Room: The Liffey A

Authors:
Cheng Yuan, Michigan State University, United States
Jikai Lei, Michigan State University, United States
James Cole, Michigan State University, United States
Yanni Sun, Michigan State University, United States

Area Session Chair: Jerome Waldispuhl

Presentation Overview:
Metagenomic data, which contain sequenced DNA reads of uncultured microbial species from environmental samples, provide a unique opportunity to thoroughly analyze microbial species that have never been identified before. Reconstructing 16S ribosomal RNA, a phylogenetic marker gene, is usually required to analyze the composition of the metagenomic data. However, the massive volume of data, high sequence similarity between related species, skewed microbial abundance, and lack of reference genes make 16S rRNA reconstruction difficult. Generic de novo assembly tools are not optimized for assembling 16S rRNA genes.
In this work, we introduce a targeted rRNA assembly tool, REAGO (REconstruct 16S ribosomal RNA Genes from metagenOmic data). It addresses the above challenges by combining secondary structure-aware homology search, properties of rRNA genes, and de novo assembly. Our experimental results show that our tool can correctly recover more rRNA genes than several popular generic metagenomic assembly tools and specially designed rRNA construction tools.

Availability: The source code of REAGO is freely available on GitHub.
Contact: chengy@msu.edu and yannisun@msu.edu

Date: Monday, July 13 12:20 pm - 12:40 pm
Room: The Liffey A

Authors:
Valentin Zulkower, INRIA Grenoble-Rhône-Alpes, France
Michel Page, INRIA Grenoble-Rhône-Alpes, IAE Grenoble, France
Delphine Ropers, INRIA Grenoble-Rhône-Alpes, UJF Grenoble, France
Johannes Geiselmann, INRIA Grenoble-Rhône-Alpes, UJF Grenoble, France
Hidde de Jong, INRIA Grenoble-Rhône-Alpes, France

Area Session Chair: Jerome Waldispuhl

Presentation Overview:
Motivation: Time-series observations from reporter gene experiments
are commonly used for inferring and analyzing dynamical models
of regulatory networks. The robust estimation of promoter activities
and protein concentrations from primary data is a difficult problem
due to measurement noise and the indirect relation between the
measurements and quantities of biological interest.

Results: We propose a general approach based on regularized linear inversion to solve a range of estimation problems in the analysis of reporter gene data, notably the inference of growth rate, promoter activity, and protein concentration profiles. We evaluate the validity of the approach using in-silico simulation studies, and observe that the methods are more robust and less biased than indirect approaches usually encountered in the experimental literature based on smoothing and subsequent processing of the primary data. We apply the methods to the analysis of fluorescent reporter gene data acquired in kinetic experiments with Escherichia coli. The methods are capable of reliably reconstructing time-course profiles of growth rate, promoter activity, and protein concentration from weak and noisy signals at low population volumes. Moreover, they capture critical features of those profiles, notably rapid changes in gene expression during growth transitions.
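As a rough illustration of what regularized linear inversion means in general, the sketch below recovers a rate profile x from accumulated observations y = A x by minimizing ||A x - y||^2 + lam * ||D x||^2. The integration operator A, the first-difference penalty D, and the function names are assumptions made for this example; they are not taken from the authors' Python package, which models reporter dynamics in more detail.

```python
def solve(M, b):
    """Solve M x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(M)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= f * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x


def regularized_inversion(y, lam, dt=1.0):
    """Estimate a rate profile x from accumulated signal y (hypothetical model).

    Assumed forward model: y[i] = dt * sum(x[0..i]), i.e. pure accumulation
    with no degradation or dilution. The penalty lam * ||D x||^2 discourages
    abrupt jumps between neighbouring time points.
    """
    n = len(y)
    # A: lower-triangular integration operator.
    A = [[dt if j <= i else 0.0 for j in range(n)] for i in range(n)]
    # D: first-difference operator with n - 1 rows.
    D = [[-1.0 if j == i else 1.0 if j == i + 1 else 0.0 for j in range(n)]
         for i in range(n - 1)]
    # Normal equations: (A^T A + lam * D^T D) x = A^T y.
    M = [[sum(A[k][i] * A[k][j] for k in range(n))
          + lam * sum(D[k][i] * D[k][j] for k in range(n - 1))
          for j in range(n)] for i in range(n)]
    rhs = [sum(A[k][i] * y[k] for k in range(n)) for i in range(n)]
    return solve(M, rhs)


if __name__ == "__main__":
    # A constant production rate of 1 per time step accumulates linearly.
    print(regularized_inversion([1.0, 2.0, 3.0, 4.0, 5.0], lam=10.0))
```

Estimating the profile directly through the forward model, rather than smoothing y first and differentiating afterwards, is the general idea the abstract contrasts with indirect approaches.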

Availability: The methods described in this paper are made available as a Python package (LGPL licence) and also accessible through a web interface. For more information, see https://team.inria.fr/ibis/wellinverter.
Contact: Hidde.de-Jong@inria.fr

Date: Sunday, July 12 2:20 pm - 2:40 pm
Room: The Liffey B

Authors:
Mingfu Shao, EPFL, Switzerland
Bernard Moret, EPFL, Switzerland

Area Session Chair: Janet Kelso

Presentation Overview:
Motivation: Large-scale evolutionary events such as genomic rearrangements and segmental duplications form an important part of the evolution of genomes and are widely studied from both biological and computational perspectives. A basic computational problem is to infer these events in the evolutionary history for given modern genomes, a task for which many algorithms have been proposed under various constraints. Algorithms that can handle both rearrangements and content-modifying events such as duplications and losses remain few and limited in their applicability.

Results: We study the comparison of two genomes under a model including general rearrangements (through DCJ) and segmental duplications. We formulate the comparison as an optimization problem, and describe an exact algorithm to solve it by using an integer linear program. We also devise a sufficient condition and an efficient algorithm to identify optimal substructures, which can simplify the problem while preserving optimality. Using the optimal substructures with the ILP formulation yields an exact, yet practical, algorithm: the first practical method to provide exact solutions to the problem of comparing two arbitrary genomes under rearrangements and duplications. We then apply our algorithm to assign in-paralogs and orthologs (a necessary step in handling duplications), and compare its performance with that of the state-of-the-art method MSOAR (an approximation method), using both simulations and real data. On simulated datasets our method outperforms MSOAR by a significant margin. On 5 well-annotated species, MSOAR achieves high accuracy, yet our method performs slightly better on each of the 10 pairwise comparisons.

Availability: http://lcbb.epfl.ch/softwares/coser
Contact: mingfu.shao@epfl.ch

Date: Sunday, July 12 3:30 pm - 3:50 pm
Room: The Liffey B

Authors:
Damian Roqueiro, ETH Zurich, Switzerland
Menno Witteveen, ETH Zurich, Switzerland
Verneri Anttila, Broad Institute of MIT and Harvard, United States
Gisela Terwindt, Leiden University Medical Center, Netherlands
Arn van den Maagdenberg, Leiden University Medical Center, Netherlands
Karsten Borgwardt, ETH Zurich, Switzerland

Area Session Chair: Janet Kelso

Presentation Overview:
Motivation: Predicting disease phenotypes from genotypes is a key challenge in medical applications in the postgenomic era. Large training datasets of patients that have been both genotyped and phenotyped are the key requisite when aiming for high prediction accuracy. With current genotyping projects producing genetic data for hundreds of thousands of patients, large-scale phenotyping has become the bottleneck in disease phenotype prediction.

Results: Here we present an approach for imputing missing disease phenotypes given the genotype of a patient. Our approach is based on co-training, which predicts the phenotype of unlabeled patients based on a second class of information, e.g. clinical health record information. Augmenting training datasets by this type of in silico phenotyping can lead to significant improvements in prediction accuracy. We demonstrate this on a dataset of patients with two diagnostic types of migraine, termed migraine with aura and migraine without aura, from the International Headache Genetics Consortium.

Conclusions: Imputing missing disease phenotypes for patients via co-training leads to larger training datasets and improved prediction accuracy in phenotype prediction.

Date: Tuesday, July 14 11:40 am - 12:00 pm
Room: The Auditorium

Authors:
Salim Akhter Chowdhury, Carnegie Mellon University, United States
E. Michael Gertz, NCBI/NLM/NIH, United States
Darawalee Wangsa, NCI/NIH, United States
Kerstin Heselmeyer-Haddad, NCI/NIH, United States
Thomas Ried, NCI/NIH, United States
Alejandro Schaffer, NCBI/NLM/NIH, United States
Russell Schwartz, Carnegie Mellon University, United States

Area Session Chair: Niko Beerenwinkel

Presentation Overview:
Motivation: Phylogenetic algorithms have begun to see widespread use in cancer research to reconstruct processes of evolution in tumor progression. Developing reliable phylogenies for tumor data requires quantitative models of cancer evolution that include the unusual genetic mechanisms by which tumors evolve, such as chromosome abnormalities, and allow for heterogeneity between tumor types and individual patients. Previous work on inferring phylogenies of single tumors by copy number evolution assumed models of uniform rates of genomic gain and loss across different genomic sites and scales, a substantial oversimplification necessitated by a lack of algorithms and quantitative parameters for fitting to more realistic tumor evolution models.

Results: We propose a framework for inferring models of tumor progression from single-cell gene copy number data, including variable rates for different gain and loss events. We propose a new algorithm for identification of most parsimonious combinations of single gene and single chromosome events. We extend it via dynamic programming to include genome duplications. We implement an expectation maximization (EM)-like method to estimate mutation-specific and tumor-specific event rates concurrently with tree reconstruction. Application of our algorithms to real cervical cancer data identifies key genomic events in disease progression consistent with prior literature. Classification experiments on cervical and tongue cancer datasets lead to improved prediction accuracy for the metastasis of primary cervical cancers and for tongue cancer survival.

Availability: Our software (FISHtrees) and two datasets are available at ftp://ftp.ncbi.nlm.nih.gov/pub/FISHtrees.

Date: Tuesday, July 14 12:00 pm - 12:20 pm
Room: The Auditorium

Authors:
Mohammed El-Kebir, Brown University, United States
Layla Oesper, Brown University, United States
Hannah Acheson-Field, Brown University, United States
Ben Raphael, Brown University, United States

Area Session Chair: Niko Beerenwinkel

Presentation Overview:
Motivation: DNA sequencing of multiple samples from the same tumor provides data to analyze the process of clonal evolution in the population of cells that give rise to a tumor.

Results: We formalize the problem of reconstructing the clonal evolution of a tumor using single-nucleotide mutations as the Variant Allele Frequency Factorization Problem (VAFFP). We derive a combinatorial characterization of the solutions to this problem and show that the problem is NP-complete. We derive an integer linear programming solution to the VAFFP in the case of error-free data and extend this solution to real data with a probabilistic model for errors. The resulting AncesTree algorithm is better able to identify ancestral relationships between individual mutations than existing approaches, particularly in ultra-deep sequencing data when high read counts for mutations yield high confidence variant allele frequencies.
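To make the factorization idea concrete, the following sketch builds the forward direction of such a model: given a clonal tree and the mixing proportions of clones in each sample, it computes the variant allele frequencies that would be observed without error. The relation F = 1/2 * U * B (heterozygous mutations in diploid cells), the matrix names, and the helper functions are assumptions for illustration only. The abstract does not spell out the formulation, and the hard inverse problem that AncesTree addresses (recovering U and B from F) is not attempted here.

```python
def clonal_matrix(parent):
    """Build the binary clone-by-mutation matrix B from a tree.

    parent[j] is the parent of node j, or None for the root. Node j stands
    both for mutation j and for the clone in which it first appears, so
    clone c carries mutation j exactly when j is c or an ancestor of c.
    """
    n = len(parent)
    B = [[0] * n for _ in range(n)]
    for clone in range(n):
        node = clone
        while node is not None:
            B[clone][node] = 1
            node = parent[node]
    return B


def frequencies(U, B):
    """Error-free variant allele frequencies under the assumed model F = 1/2 * U * B.

    U[s][c] is the fraction of cells in sample s that belong to clone c.
    """
    n = len(B)
    return [[0.5 * sum(u_row[c] * B[c][j] for c in range(n)) for j in range(n)]
            for u_row in U]


if __name__ == "__main__":
    # Root mutation 0 with two child clones 1 and 2.
    B = clonal_matrix([None, 0, 0])
    U = [[0.2, 0.5, 0.3]]  # one sample: 20% clone 0, 50% clone 1, 30% clone 2
    print(B)
    print(frequencies(U, B))
```

In this toy tree the root mutation is present in every cell, so its frequency is the largest in the sample, and each child mutation's frequency is bounded by that of its parent.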

Date: Monday, July 13 3:30 pm - 3:50 pm
Room: Liffey Hall 2

Authors:
Siavash Mirarab, University of Texas at Austin, United States
Tandy Warnow, The University of Illinois at Urbana-Champaign, United States

Area Session Chair: Knut Reinert

Presentation Overview:
Motivation: The estimation of species phylogenies requires multiple loci, since different loci can have different trees due to incomplete lineage sorting (ILS), modelled by the multi-species coalescent model. We recently developed a coalescent-based method, ASTRAL (ECCB 2014), which is statistically consistent under the multi-species coalescent model and which is more accurate than other coalescent-based methods on the datasets we examined (Mirarab et al., 2014a). ASTRAL heuristically solves an NP-hard problem in polynomial time, by constraining the search space through a set of allowed “bipartitions”. Despite the limitation to allowed bipartitions, ASTRAL is statistically consistent.

Results: We present a new version of ASTRAL, which we call ASTRAL-II. We will show that ASTRAL-II has substantial advantages over ASTRAL: it is faster, can analyze much larger datasets (up to 1000 species and 1000 genes), and has substantially better accuracy under some conditions. ASTRAL’s running time is $O(n^2k|X|^2)$, and ASTRAL-II’s running time is $O(nk|X|^2)$, where n is the number of species, k is the number of loci, and X is the set of allowed bipartitions for the search space.

Availability: ASTRAL-II is available in open source at https://github.com/smirarab/ASTRAL.
Contact: smirarab@gmail.com

Date: Sunday, July 12 4:10 pm - 4:30 pm
Room: The Liffey A

Authors:
Yeu-Chern Harn, University of North Carolina, Chapel Hill, United States
Matthew Powers, University of North Carolina, Chapel Hill, United States
Elizabeth Shank, University of North Carolina, Chapel Hill, United States
Vladimir Jojic, University of North Carolina, Chapel Hill, United States

Area Session Chair: Reinhard Schneider

Presentation Overview:
Motivation: The interactions between microbial colonies through chemical signaling are not well understood. A microbial colony can use different molecules to inhibit or accelerate the growth of other colonies. A better understanding of the molecules involved in these interactions could lead to advancements in health and medicine. Imaging mass spectrometry (IMS) applied to co-cultured microbial communities aims to capture the spatial characteristics of the colonies’ molecular fingerprints. These data are high-dimensional and require computational analysis methods to interpret.

Results: Here we present a dictionary learning method that deconvolves spectra of different molecules from IMS data. We call this method MOLecular Dictionary Learning (MOLDL). Unlike standard dictionary learning methods which assume Gaussian-distributed data, our method uses the Poisson distribution to capture the count nature of the mass spectrometry data. Also, our method incorporates universally applicable information on common ion types of molecules in MALDI mass spectrometry. This greatly reduces model parametrization and increases deconvolution accuracy by eliminating spurious solutions. Moreover, our method leverages the spatial nature of IMS data by assuming that nearby locations share similar abundances, thus avoiding overfitting to noise. Tests on simulated data sets show that this method has good performance in recovering molecule dictionaries. We also tested our method on real data measured on a microbial community composed of two species. We confirmed through follow-up validation experiments that our method recovered true and complete signatures of molecules. These results indicate that our method can discover molecules in IMS data reliably, and hence can help advance the study of interaction of microbial colonies.

Availability: The code used in this paper is available at: https://github.com/frizfealer/IMS_project

Proteins
Date: Sunday, July 12 12:00 pm - 12:20 pm
Room: The Liffey B

Authors:
Ramanuja Simha, University of Delaware, United States
Sebastian Briesemeister, University of Tuebingen, Germany
Oliver Kohlbacher, University of Tuebingen, Germany
Hagit Shatkay, University of Delaware, United States

Area Session Chair: Anna Tramontano

Presentation Overview:
Motivation: Proteins are responsible for a multitude of vital tasks in all living organisms. Given that a protein's function and role are strongly related to its subcellular location, protein location prediction is an important research area. While proteins move from one location to another and can localize to multiple locations, most existing location prediction systems assign only a single location per protein. A few recent systems attempt to predict multiple locations for proteins; however, their performance leaves much room for improvement. Moreover, such systems do not capture dependencies among locations and usually consider locations as independent. We hypothesize that a multi-location predictor that captures location inter-dependencies can improve location predictions for proteins.

Results: We introduce a probabilistic generative model for protein localization, and develop a system based on it, which we call MDLoc, that utilizes inter-dependencies among locations to predict multiple locations for proteins. The model captures location inter-dependencies using Bayesian networks and represents dependency between features and locations using a mixture model. We use iterative processes for learning model parameters and for estimating protein locations. We evaluate our classifier, MDLoc, on a dataset of single- and multi-localized proteins derived from the DBMLoc dataset, which is the most comprehensive protein multi-localization dataset currently available. Our results, obtained by using MDLoc, significantly improve upon results obtained by an initial simpler classifier, as well as on results reported by other top systems.

MDLoc is available at: http://www.eecis.udel.edu/~compbio/mdloc.

Date: Tuesday, July 14 10:50 am - 11:10 am
Room: The Liffey B

Authors:
Renzhi Cao, University of Missouri-Columbia, United States
Debswapna Bhattacharya, University of Missouri-Columbia, United States
Badri Adhikari, University of Missouri-Columbia, United States
Jilong Li, University of Missouri-Columbia, United States
Jianlin Cheng, University of Missouri-Columbia, United States

Area Session Chair: Francisco Melo Ledermann

Presentation Overview:
Motivation: Sampling structural models and ranking them are the two major challenges of protein structure prediction. Traditional protein structure prediction methods generally use one or a few quality assessment methods to select the best-predicted models, which cannot consistently select relatively better models and rank a large number of models well.

Results: Here, we develop a novel large-scale model quality assessment method in conjunction with model clustering to rank and select protein structural models. For the first time, it applies 14 model quality assessment methods to generate consensus model rankings, followed by model refinement based on model combination (i.e., averaging). Our experiment demonstrates that the large-scale model quality assessment approach is more consistent and robust in selecting models of better quality than any individual quality assessment method. Our method was blindly tested during the 11th Critical Assessment of Techniques for Protein Structure Prediction (CASP11) as the MULTICOM group. It was officially ranked 3rd out of all 143 human and server predictors according to the total scores of the first models predicted for 78 CASP11 protein domains and 2nd according to the total scores of the best of the five models predicted for these domains. MULTICOM’s outstanding performance in the extremely competitive 2014 CASP11 experiment indicates that our large-scale quality assessment approach together with model clustering is a promising solution to one of the two major problems in protein structure modeling.
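One simple way to combine several quality assessment (QA) methods into a consensus ranking is to average each model's rank across methods. The sketch below shows that idea only. The average-rank rule, the function name, and the toy scores are assumptions for illustration, and the abstract does not state how MULTICOM actually aggregates its 14 methods.

```python
def consensus_ranking(scores_by_method):
    """Rank models by their average rank across QA methods.

    scores_by_method maps a method name to a dict of model -> score,
    where a higher score means a better predicted quality. Every method
    is assumed to score the same set of models.
    """
    models = sorted(next(iter(scores_by_method.values())))
    total_rank = {m: 0.0 for m in models}
    for scores in scores_by_method.values():
        # Rank 1 is the best model according to this method.
        ordered = sorted(models, key=lambda m: -scores[m])
        for rank, model in enumerate(ordered, start=1):
            total_rank[model] += rank
    n_methods = len(scores_by_method)
    avg_rank = {m: total_rank[m] / n_methods for m in models}
    return sorted(models, key=lambda m: avg_rank[m]), avg_rank


if __name__ == "__main__":
    # Hypothetical scores from three QA methods for three candidate models.
    scores = {
        "qa1": {"A": 0.9, "B": 0.8, "C": 0.1},
        "qa2": {"A": 0.7, "B": 0.9, "C": 0.2},
        "qa3": {"A": 0.8, "B": 0.6, "C": 0.3},
    }
    order, avg = consensus_ranking(scores)
    print(order, avg)
```

Working with ranks rather than raw scores avoids having to calibrate methods whose scores live on different scales, at the cost of discarding the size of the score differences.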

Availability: The web server is available at: http://sysbio.rnet.missouri.edu/multicom_cluster/human/.
Contact: chengji@missouri.edu

Date: Sunday, July 12 10:10 am - 10:30 am
Room: The Liffey B

Authors:
Franziska Zickmann, Robert Koch Institute, Germany
Bernhard Renard, Robert Koch Institute, Germany

Area Session Chair: Anna Tramontano

Presentation Overview:
Summary: Ongoing advances in high-throughput technologies have facilitated accurate proteomic measurements and provide a wealth of information at the genomic and transcript levels. In proteogenomics, this multi-omics data is combined to analyze unannotated organisms and to allow more accurate sample-specific predictions.
Existing analysis methods still mainly depend on six-frame translations or reference protein databases that are extended by transcriptomic information or known single nucleotide polymorphisms (SNPs). However, six-frame translations introduce an artificial six-fold increase of the target database, and SNP integration requires a suitable database summarizing results from previous experiments.
We overcome these limitations by introducing MSProGene, a new method for integrative proteogenomic analysis based on customized RNA-Seq driven transcript databases. MSProGene is independent of existing reference databases or annotated SNPs and avoids large six-frame translated databases by constructing sample-specific transcripts. In addition, it creates a network combining RNA-Seq and peptide information that is optimized by a maximum-flow algorithm. It thereby also allows resolving the ambiguity of shared peptides for protein inference.
We applied MSProGene to three data sets and show that it facilitates database-independent, reliable and accurate prediction at the gene and protein levels and additionally identifies novel genes.

Availability: MSProGene is written in Java and Python. It is open source and available at http://sourceforge.net/projects/msprogene/

Date: Sunday, July 12 10:50 am - 11:10 am
Room: The Liffey B

Authors:
Yana Safonova, St. Petersburg State University, Russian Federation
Stefano Bonissone, University of California, San Diego, United States
Eugene Kurpilyansky, St. Petersburg Academic University, Russian Federation
Ekaterina Starostina, St. Petersburg State University, Russian Federation
Alla Lapidus, St. Petersburg State University, Russian Federation
Jeremy Stinson, Genentech, United States
Laura Depalatis, Genentech, United States
Wendy Sandoval, Genentech, United States
Jennie Lill, Genentech, United States
Pavel Pevzner, University of California, San Diego, United States

Area Session Chair: Anna Tramontano

Presentation Overview:
The analysis of concentrations of circulating antibodies in serum (antibody repertoire) is a fundamental, yet poorly studied, problem in immunoinformatics. The two current approaches to the analysis of antibody repertoires (next generation sequencing (NGS) and mass spectrometry (MS)) present difficult computational challenges since antibodies are not directly encoded in the germline but are extensively diversified by somatic recombination and hypermutations. Therefore, the protein database required for the interpretation of spectra from circulating antibodies is custom for each individual. While such a database can be constructed via NGS, the reads generated by NGS are error-prone and even a single nucleotide error precludes identification of a peptide by the standard proteomics tools. Here, we present the IgRepertoireConstructor algorithm that performs error-correction of immunosequencing reads and uses mass spectra to validate the constructed antibody repertoires.

Availability: IgRepertoireConstructor is open source and freely available as a C++ and Python program running on all Unix-compatible platforms.
The source code is available from http://bioinf.spbau.ru/igtools.
Contact: ppevzner@ucsd.edu

Date: Tuesday, July 14 11:40 am - 12:00 pm
Room: The Liffey B

Authors:
Yang Shen, Texas A&M University, United States
Tomasz Oliwa, Toyota Technological Institute at Chicago, United States

Area Session Chair: Francisco Melo Ledermann

Presentation Overview:
Motivation: It remains both a fundamental and practical challenge to understand and anticipate motions and conformational changes of proteins during their associations. Conventional normal mode analysis (NMA) based on the anisotropic network model (ANM) addresses the challenge by generating normal modes reflecting intrinsic flexibility of proteins, which follows a conformational selection model for protein-protein interactions. But earlier studies have also found cases where conformational selection alone could not adequately explain conformational changes, and other models have been proposed. Moreover, there is a pressing demand for constructing a much reduced but still relevant subset of protein conformational space in order to improve computational efficiency and accuracy in protein docking, especially for the difficult cases with significant conformational changes.

Method and Results: With both conformational selection and induced fit models considered, we extend ANM to include concurrent but differentiated intra- and inter-molecular interactions and develop an encounter complex-based NMA (cNMA) framework. Theoretical analysis and empirical results over a large data set of significant conformational changes indicate that cNMA is capable of generating conformational vectors considerably better at approximating conformational changes with contributions from both intrinsic flexibility and inter-molecular interactions than conventional NMA only considering intrinsic flexibility does. The empirical results also indicate that a straightforward application of conventional NMA to an encounter complex often does not improve upon NMA for an individual protein under study and intra- and inter-molecular interactions need to be differentiated properly. Moreover, in addition to induced motions of a protein under study, the induced motions of its binding partner as well as the coupling between the two sets of protein motions present in a near-native encounter complex lead to the improved performance. A study to isolate and assess the sole contribution of intermolecular interactions towards improvements against conventional NMA further validates the additional benefit from induced-fit effects. Taken together, these results provide new insights into molecular mechanisms underlying protein interactions and new tools for dimensionality reduction for flexible protein docking.

Availability: Source code is available upon request.

Date: Tuesday, July 14 12:20 pm - 12:40 pm
Room: The Liffey B

Authors:
Xuefeng Cui, King Abdullah University of Science and Technology, Saudi Arabia
Hammad Naveed, King Abdullah University of Science and Technology, Saudi Arabia
Xin Gao, King Abdullah University of Science and Technology, Saudi Arabia

Area Session Chair: Francisco Melo Ledermann

Presentation Overview:
Motivation: Biological molecules perform their functions through interactions with other molecules. Structure alignment of interaction interfaces between biological complexes is an indispensable step in detecting their structural similarities, which are key to understanding their evolutionary histories and functions. Although various structure alignment methods have been developed to successfully assess the similarities of protein structures or certain types of interaction interfaces, existing alignment tools cannot directly align arbitrary types of interfaces formed by protein, DNA or RNA molecules. Specifically, they require a "blackbox preprocessing" to standardize interface types and chain identifiers. Yet their performance is limited and sometimes unsatisfactory.

Results: Here we introduce a novel method, PROSTA-inter, that automatically determines and aligns interaction interfaces between two arbitrary types of complex structures. Our method uses sequentially remote fragments to search for the optimal superimposition. The optimal residue matching problem is then formulated as a maximum weighted bipartite matching problem to detect the optimal sequence order-independent alignment. Benchmark evaluation on all non-redundant protein-DNA complexes in PDB shows significant performance improvement of our method over TM-align and iAlign (with the "blackbox preprocessing"). Two case studies where our method discovers, for the first time, structural similarities between two pairs of functionally related protein-DNA complexes are presented. We further demonstrate the power of our method on detecting structural similarities between a protein-protein complex and a protein-RNA complex, which is biologically known as a protein-RNA mimicry case.
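The residue matching step can be pictured with a tiny maximum weighted bipartite matching example. The sketch below finds the best one-to-one assignment by exhaustive search over permutations, which is exact but only feasible for a handful of residues; a practical tool would use a polynomial-time method such as the Hungarian algorithm. The weight matrix and function name are hypothetical, and nothing here reflects how PROSTA-inter scores residue pairs.

```python
from itertools import permutations


def max_weight_matching(weights):
    """Exact maximum weighted bipartite matching for a small square matrix.

    weights[i][j] is a hypothetical similarity score for pairing residue i
    of one interface with residue j of the other. Returns the best total
    score and the list of (i, j) pairs.
    """
    n = len(weights)
    best_score, best_perm = float("-inf"), None
    for perm in permutations(range(n)):
        score = sum(weights[i][perm[i]] for i in range(n))
        if score > best_score:
            best_score, best_perm = score, perm
    return best_score, list(enumerate(best_perm))


if __name__ == "__main__":
    weights = [
        [1, 5, 0],
        [4, 1, 0],
        [0, 0, 3],
    ]
    print(max_weight_matching(weights))
```

In this toy matrix the best assignment pairs residue 0 with 1 and residue 1 with 0, so the matched pairs cross. A sequence order-independent alignment allows exactly this kind of crossing, which an order-preserving alignment would forbid.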

Date: Tuesday, July 14 3:50 pm - 4:10 pm
Room: The Liffey B

Authors:
Mu Zhu, University of Waterloo, Canada
Forbes Burkowski, University of Waterloo, Canada

Area Session Chair: Donna Slonim

Presentation Overview:
Motivation: Inferring structural dependencies among a protein’s side chains helps us understand their coupled motions. It is known that coupled fluctuations can reveal pathways of communication used for information propagation in a molecule. Side-chain conformations are commonly represented by multivariate angular variables, but existing partial correlation methods that can be applied to this inference task are not capable of handling multivariate angular data. We propose a novel method to infer direct couplings from this type of data, and show that this method is useful for identifying functional regions and their interactions in allosteric proteins.

Results: We developed a novel extension of canonical correlation analysis (CCA), which we call “kernelized partial CCA” (or simply KPCCA), and used it to infer direct couplings between side chains, while disentangling these couplings from indirect ones. Using the conformational information and fluctuations of the inactive structure alone for allosteric proteins in the Ras and other Ras-like families, our method identified allosterically important residues not only as strongly coupled ones but also in densely connected regions of the interaction graph formed by the inferred couplings. Our results were in good agreement with other empirical findings. By studying distinct members of the Ras, Rho, and Rab sub-families, we show further that KPCCA was capable of inferring common allosteric characteristics in the small G protein super-family.

Systems
Date: Sunday, July 12 10:10 am - 10:30 am
Room: Liffey Hall 2

Authors:
Zongliang Yue, Indiana University-Purdue University Indianapolis, United States
Madhura Kshirsagar, Indiana University-Purdue University Indianapolis, United States
Thanh Nguyen, Indiana University–Purdue University Indianapolis, United States
Michael Neylon, Indiana University-Purdue University Indianapolis, United States
Liugen Zhu, Indiana University–Purdue University Indianapolis, United States
Timothy Ratliff, Purdue University, United States
Jake Chen, Indiana University-Purdue University Indianapolis, United States

Area Session Chair: Igor Jurisica

Presentation Overview:
In this paper, we describe a new database framework to perform integrative “gene-set, network, and pathway analysis” (GNPA). In this framework, we integrated heterogeneous data on pathways, annotated lists, and gene-sets (PAGs) into a PAG electronic repository (PAGER). PAGs in the database are organized into P-type, A-type, and G-type PAGs with a three-letter-code standard naming convention. The PAGER database currently compiles 44,313 genes from 5 species including human, 38,663 PAGs, 324,830 gene-gene relationships, and 3,174,323 PAG-PAG regulatory relationships of two types: co-membership based and regulatory relationship based. To help users assess each PAG’s biological relevance, we developed a cohesion measure called Cohesion Coefficient (CoCo), which is capable of disambiguating between biologically significant PAGs and random PAGs with an Area-Under-Curve (AUC) performance of 0.98. The PAGER database was set up to help users search and retrieve PAGs from its online web interface. PAGER enables advanced users to build PAG-PAG regulatory networks that provide complementary biological insights not found in gene set analysis or individual gene network analysis. We provide a case study using cancer functional genomics data sets to demonstrate how integrative GNPA helps improve network biology data coverage and therefore biological interpretability.

The PAGER database is openly accessible at http://discovery.informatics.iupui.edu/PAGER/.

Date: Sunday, July 12 2:00 pm - 2:20 pm
Room: Liffey Hall 2

Authors:
Felipe Llinares-Lopez, ETH Zurich, Switzerland
Dominik Grimm, ETH Zurich, Switzerland
Dean Bodenham, ETH Zurich, Switzerland
Udo Gieraths, ETH Zurich, Switzerland
Mahito Sugiyama, Osaka University, Japan
Beth Rowan, Max Planck Institute for Developmental Biology, Germany
Karsten Borgwardt, ETH Zurich, Switzerland

Area Session Chair: Nicolas Le Novere

Presentation Overview:
Motivation: Genetic heterogeneity, the fact that several sequence variants give rise to the same phenotype, is a phenomenon that is of the utmost interest in the analysis of complex phenotypes. Current approaches for finding regions in the genome that exhibit genetic heterogeneity suffer from at least one of two shortcomings: 1) they require the definition of an exact interval in the genome that is to be tested for genetic heterogeneity, potentially missing intervals of high relevance, or 2) they suffer from an enormous multiple hypothesis testing problem due to the large number of potential candidate intervals being tested, which results in either many false positives or a lack of power to detect true intervals.

Results: Here, we present an approach that overcomes both problems: It allows one to automatically find all contiguous sequences of SNPs in the genome that are jointly associated with the phenotype. It also solves both the inherent computational efficiency problem and the statistical problem of multiple hypothesis testing, which are both caused by the huge number of candidate intervals. We demonstrate on Arabidopsis thaliana GWAS data that our approach can discover regions that exhibit genetic heterogeneity and would be missed by single-locus mapping.
Conclusions: Our novel approach can contribute to the genome-wide discovery of intervals that are involved in the genetic heterogeneity underlying complex phenotypes.
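The scale of the multiple testing problem mentioned above can be seen from a simple count: n SNPs in a row contain n(n+1)/2 contiguous intervals. The sketch below enumerates those intervals and computes the per-test threshold that a naive Bonferroni correction would impose. It is only meant to show why testing every interval naively loses power; the function names are hypothetical and this is not the correction procedure used by the authors.

```python
def intervals(n_snps):
    """Yield every contiguous interval (start, end), inclusive, over n_snps positions."""
    for start in range(n_snps):
        for end in range(start, n_snps):
            yield (start, end)


def num_intervals(n_snps):
    """Closed-form count of contiguous intervals: n(n+1)/2."""
    return n_snps * (n_snps + 1) // 2


def bonferroni_threshold(alpha, n_snps):
    """Per-test significance level if every interval were tested naively."""
    return alpha / num_intervals(n_snps)


if __name__ == "__main__":
    print(list(intervals(3)))
    print(num_intervals(100000))
    print(bonferroni_threshold(0.05, 100000))
```

Because the count grows quadratically with the number of SNPs, the naive threshold becomes extremely small for genome-scale data, which is the loss of power the abstract refers to.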

Availability: The code can be obtained at: http://www.bsse.ethz.ch/mlcb/research/bioinformatics-and-computational-biology/sis.html
Contact: felipe.llinares@bsse.ethz.ch

Date: Sunday, July 12 3:50 pm - 4:10 pm
Room: Liffey Hall 2

Authors:
Marinka Zitnik, University of Ljubljana, Slovenia
Blaz Zupan, University of Ljubljana, Slovenia

Area Session Chair: Nicolas Le Novere

Presentation Overview:
Motivation: Markov networks are undirected graphical models that are widely used to infer relations between genes from experimental data. Their state-of-the-art inference procedures assume the data arise from a Gaussian distribution. High-throughput omics data, such as that from next generation sequencing, often violates this assumption. Furthermore, when collected data arise from multiple related but otherwise nonidentical distributions, their underlying networks are likely to have common features. New principled statistical approaches are needed that can deal with different data distributions and jointly consider collections of data sets.

Results: We present FuseNet, a Markov network formulation that infers networks from a collection of nonidentically distributed data sets. Our approach is computationally efficient and general: given any number of distributions from an exponential family, FuseNet represents model parameters through shared latent factors that define neighborhoods of network nodes. In a simulation study we demonstrate good predictive performance of FuseNet in comparison to several popular graphical models. We show its effectiveness in an application to breast cancer RNA-sequencing and somatic mutation data, a novel application of graphical models. Fusion of data sets offers substantial gains relative to inference of separate networks for each data set. Our results demonstrate that network inference methods for non-Gaussian data can help in accurate modeling of the data generated by emergent high-throughput technologies.

Date: Sunday, July 12 4:10 pm - 4:30 pm
Room: Liffey Hall 2

Authors:
Christopher Penfold, University of Warwick, United Kingdom
Jonathan Millar, University of Warwick, United Kingdom
David Wild, University of Warwick, United Kingdom

Area Session Chair: Nicolas Le Novere

Presentation Overview:
Motivation: The ability to jointly learn gene regulatory networks (GRNs) in, or leverage GRNs between, related species would allow the vast amount of legacy data obtained in model organisms to inform the GRNs of more complex, or economically or medically relevant counterparts. Examples include transferring information from Arabidopsis thaliana into related crop species for food security purposes, or from mice into humans for medical applications. Here we develop two related Bayesian approaches to network inference that allow GRNs to be jointly inferred in, or leveraged between, several related species: in one framework network information is directly propagated between species; in the second hierarchical approach, network information is propagated via an unobserved “hypernetwork”. In both frameworks information about network similarity is captured via graph kernels, with the networks additionally informed by species-specific time series gene expression data, when available, using Gaussian processes to model the dynamics of gene expression.

Results: Results on in silico benchmarks demonstrate that joint inference, and leveraging of known networks between species, offers better accuracy than stand-alone inference. The direct propagation of network information via the non-hierarchical framework is more appropriate when there are relatively few species, whilst the hierarchical approach is better suited when there are many species. Both methods are robust to small amounts of mislabelling of orthologues. Finally, the use of S. cerevisiae data and networks to inform inference of networks in the fission yeast S. pombe predicts a novel role in cell cycle regulation for Gas1 (SPAC19B12.02c), a 1,3-beta-glucanosyltransferase.

Availability: Matlab code is available from a temporary anonymous URL for peer review: http://www2.warwick.ac.uk/fac/sci/systemsbiology/research/software/

Date: Sunday, July 12 10:30 am - 10:50 am
Room: Liffey Hall 2

Authors:
Yoshihiro Yamanishi, Kyushu University, Japan
Yasuo Tabei, Japan Science and Technology Agency, Japan
Masaaki Kotera, Tokyo Institute of Technology, Japan

Area Session Chair: Igor Jurisica

Presentation Overview:
Motivation: Recent advances in mass spectrometry and the related metabolomics technology enable rapid and comprehensive analysis of a huge number of metabolites. However, biosynthetic and biodegradation pathways are known only for a small portion of metabolites, and most metabolic pathways remain uncharacterized.

Results: In this study, we develop a novel method for supervised de novo metabolic pathway reconstruction with an improved graph alignment-based approach in the reaction-filling framework. We propose a novel chemical graph alignment algorithm, which we call PACHA (Pairwise Chemical Aligner), in order to detect regioisomer-sensitive connectivities between aligned substructures of two compound structures. Unlike other existing graph alignment methods, PACHA can efficiently detect only one common subgraph between two compounds. Our results show that the proposed method outperforms previous descriptor-based methods or existing graph alignment-based methods in the enzymatic reaction-likeness prediction for isomer-enriched reactions, and it is also useful for reaction annotation that assigns potential reaction characteristics such as EC numbers and PIERO terms to substrate-product pairs. Finally, we make a comprehensive enzymatic reaction-likeness prediction for all possible uncharacterized compound pairs, suggesting potential metabolic pathways of newly predicted substrate-product pairs.

Date: Sunday, July 12 11:40 am - 12:00 pm
Room: Liffey Hall 2

Authors:
Dorothee Childs, European Molecular Biology Laboratory, Heidelberg, Germany
Sergio Grimbs, Jacobs University Bremen, Germany
Joachim Selbig, University of Potsdam and Max-Planck Institute for Molecular Plant Physiology, Germany

Area Session Chair: Igor Jurisica

Presentation Overview:
Motivation: Structural kinetic modeling (SKM) is a framework to analyse whether a metabolic steady state remains stable under perturbation, without requiring detailed knowledge about individual rate equations.
It provides a representation of the system's Jacobian matrix that depends solely on the network structure, steady state measurements, and the elasticities at the steady state.
For a measured steady state, stability criteria can be derived by generating a large number of structural kinetic models (SK-models) with randomly sampled elasticities and evaluating the resulting Jacobian matrices. The elasticity space can be analysed statistically in order to detect network positions that contribute significantly to the perturbation response.
Here we extend this approach by examining the kinetic feasibility of the elasticity combinations created during Monte Carlo sampling.
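
As a rough illustration of the sampling procedure described above, the following sketch builds the Jacobian of a two-metabolite toy pathway as the product of a structure-and-steady-state matrix and a sampled elasticity matrix, then estimates the fraction of stable models by Monte Carlo sampling. The pathway, the steady-state values, the uniform sampling range and all function names are hypothetical choices made for this example; it does not include the kinetic feasibility criterion introduced in the presentation.

```python
import random

# Toy pathway (hypothetical): -> X1 -> X2 ->, with X2 acting on the influx v1.
# Stoichiometric matrix N (rows: X1, X2; columns: v1, v2, v3).
N = [[1, -1, 0],
     [0, 1, -1]]
V0 = [1.0, 1.0, 1.0]   # assumed steady-state fluxes
X0 = [1.0, 1.0]        # assumed steady-state concentrations


def jacobian(theta_21, theta_32, theta_fb):
    """Return J = Lambda * Theta for the toy pathway.

    Lambda[i][j] = N[i][j] * V0[j] / X0[i] carries structure and steady state;
    Theta[j][k] is the elasticity of reaction j with respect to metabolite k.
    """
    lam = [[N[i][j] * V0[j] / X0[i] for j in range(3)] for i in range(2)]
    theta = [[0.0, theta_fb],    # v1 responds to X2 (feedback)
             [theta_21, 0.0],    # v2 responds to its substrate X1
             [0.0, theta_32]]    # v3 responds to its substrate X2
    return [[sum(lam[i][j] * theta[j][k] for j in range(3)) for k in range(2)]
            for i in range(2)]


def is_stable(j):
    """A real 2x2 matrix has only eigenvalues with negative real part
    exactly when its trace is negative and its determinant is positive."""
    trace = j[0][0] + j[1][1]
    det = j[0][0] * j[1][1] - j[0][1] * j[1][0]
    return trace < 0 and det > 0


def stable_fraction(n_samples, seed=0):
    """Monte Carlo estimate of the fraction of stable SK-models."""
    rng = random.Random(seed)
    stable = 0
    for _ in range(n_samples):
        # Elasticities drawn uniformly from [0, 1): an illustrative choice.
        j = jacobian(rng.random(), rng.random(), rng.random())
        stable += is_stable(j)
    return stable / n_samples
```

In this toy system the steady state is stable whenever the elasticity of the efflux exceeds that of the positive feedback, so roughly half of the uniformly sampled models are stable.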

Results: Using a set of small example systems, we show that the majority of sampled SK-models would yield negative kinetic parameters if they were translated back into kinetic models. To overcome this problem, a simple criterion is formulated that mitigates such infeasible models.
After evaluating the small example pathways, the methodology was used to study two steady states of the neuronal TCA cycle and the intrinsic mechanisms responsible for their stability or instability. The findings of the statistical elasticity analysis confirm that several elasticities are jointly coordinated to control stability and that the main source of potential instabilities is mutations in the enzyme alpha-ketoglutarate dehydrogenase.

Date: Sunday, July 12 2:20 pm - 2:40 pm
Room: Liffey Hall 2

Authors:
Farhad Hormozdiari, University of California, Los Angeles, United States
Gleb Kichaev, University of California, Los Angeles, United States
Wen-Yun Yang, University of California, Los Angeles, United States
Bogdan Pasaniuc, University of California, Los Angeles, United States
Eleazar Eskin, University of California, Los Angeles, United States

Area Session Chair: Nicolas Le Novere

Presentation Overview:
Motivation: Although genome-wide association studies (GWAS) have identified thousands of variants associated with common diseases and complex traits, only a handful of these variants are validated to be causal. We consider “causal variants” as variants which are responsible for the association signal at a locus. As opposed to association studies that benefit from linkage disequilibrium (LD), the main challenge in identifying causal variants at associated loci lies in distinguishing among the many closely correlated variants due to LD. This is particularly important for model organisms such as inbred mice, where LD extends much further than in human populations, resulting in large stretches of the genome with significantly associated variants. Furthermore, these model organisms are highly structured, and require correction for population structure to remove potential spurious associations.

Results: In this work, we propose CAVIAR-Gene, a novel method that is able to operate across large LD regions of the genome while also correcting for population structure. A key feature of our approach is that it provides as output a minimally sized set of genes that captures the genes which harbor causal variants with probability . Through extensive simulations, we demonstrate that our method not only speeds up computation, but also has an average of 10% higher recall rate compared to the existing approaches. We validate our method using real mouse high-density lipoprotein (HDL) data and show that CAVIAR-Gene is able to identify Apoa2 (a gene known to harbor causal variants for HDL), while reducing the number of genes that need to be tested for functionality by a factor of 2.

Availability: The software is freely available for download at genetics.cs.ucla.edu/caviar

Date: Sunday, July 12 3:30 pm - 3:50 pm
Room: Liffey Hall 2

Authors:
Zhidong Tu, Icahn School of Medicine at Mount Sinai, United States
Pei Wang, Icahn School of Medicine at Mount Sinai, United States
Jialiang Yang, Icahn School of Medicine at Mount Sinai, United States
Francesca Petralia, Icahn School of Medicine at Mount Sinai, United States

Area Session Chair: Nicolas Le Novere

Presentation Overview:
Motivation: Gene regulatory network (GRN) inference based on genomic data is one of the most actively pursued computational biological problems. Since different types of biological data usually provide complementary information regarding the underlying GRN, a model that integrates big data of diverse types is expected to increase both the power and accuracy of GRN inference. Towards this goal, we propose a novel algorithm named iRafNet: integrative random forest for gene regulatory network inference.

Results: iRafNet is a flexible, unified integrative framework that allows information from heterogeneous data, such as protein-protein interactions, transcription factor (TF)-DNA binding and gene knock-down, to be jointly considered for GRN inference. Using test data from the DREAM4 and DREAM5 challenges, we demonstrate that iRafNet outperforms the original random forest based network inference algorithm (GENIE3), and is highly comparable to the community learning approach. We apply iRafNet to construct a GRN in Saccharomyces cerevisiae and demonstrate that it improves the performance in predicting TF-target gene regulations and provides additional functional insights to the predicted gene regulations.

Date: Monday, July 13 10:10 am - 10:30 am
Room: The Liffey B

Authors:
James Zou, Microsoft Research, United States
Eran Halperin, Tel Aviv University, Israel
Esteban Burchard, University of California San Francisco, United States
Sriram Sankararaman, Harvard Medical School, United States

Area Session Chair: Hidde de Jong

Presentation Overview:
Motivation: A basic problem of broad public and scientific interest is to use the DNA of an individual to infer the genomic ancestries of the parents. In particular, we are often interested in the fraction of each parent's genome that comes from specific ancestries (e.g. European, African, Native American, etc.). This has many applications ranging from understanding the inheritance of ancestry-related risks and traits to quantifying human assortative mating patterns.

Results: We model the problem of parental genomic ancestry inference as a pooled semi-Markov process. We develop a general mathematical framework for pooled semi-Markov processes and construct efficient inference algorithms for these models. Applying our inference algorithm to genotype data from 231 Mexican trios and 258 Puerto Rican trios where we have the true genomic ancestry of each parent, we demonstrate that our method accurately infers parameters of the semi-Markov processes and parents' genomic ancestries. We additionally validated the method on simulations. Our pooled semi-Markov process model and inference algorithms may be of independent interest in other settings in genomics and machine learning.

Date: Monday, July 13 10:30 am - 10:50 am
Room: The Liffey B

Authors:
Danny Park, University of California San Francisco, United States
Brielin Brown, University of California at Berkeley, United States
Celeste Eng, University of California San Francisco, United States
Scott Huntsman, University of California San Francisco, United States
Donglei Hu, University of California San Francisco, United States
Dara Torgerson, University of California San Francisco, United States
Esteban Burchard, University of California San Francisco, United States
Noah Zaitlen, University of California San Francisco, United States

Area Session Chair: Hidde de Jong

Presentation Overview:
Motivation: Approaches to identifying new risk loci, training risk prediction models, imputing untyped variants, and fine-mapping causal variants from summary statistics of genome-wide association studies are playing an increasingly important role in the human genetics community. Current summary statistics based methods rely on global “best guess” reference panels in order to model the genetic correlation structure of the dataset being studied. This approach, especially in admixed populations, has the potential to produce misleading results, ignores variation in local structure, and is not feasible when appropriate reference panels are missing or small. Here we develop a method, Adapt-Mix, that combines information across all available reference panels to produce estimates of local genetic correlation structure for summary statistics based methods in arbitrary populations.

Results: We applied Adapt-Mix to estimate the genetic correlation structure of both admixed and non-admixed individuals using simulated and real data. We evaluated our method by measuring the performance of two summary statistics based methods: imputation and joint-testing. When using our method as opposed to the current standard of “best guess” reference panels, we observed a 28% decrease in mean-squared error for imputation and a 73.7% decrease in mean-squared error for joint-testing.

Availability: Our method is publicly available as a software package, ADAPT-Mix, at https://github.com/dpark27/adapt_mix

Date: Tuesday, July 14 3:30 pm - 3:50 pm
Room: The Auditorium

Authors:
Yuriy Hulovatyy, University of Notre Dame, United States
Huili Chen, University of Notre Dame, United States
Tijana Milenkovic, University of Notre Dame, United States

Area Session Chair: Natasa Przulj

Presentation Overview:
Motivation: With the increasing availability of temporal real-world networks, how can these data be studied efficiently? One can model a temporal network as a single aggregate static network, or as a series of time-specific snapshots, each being an aggregate static network over the corresponding time window. One can then use established methods for static analysis on the resulting aggregate network(s), but in the process valuable temporal information is lost, either completely or at the interface between different snapshots, respectively. Here, we develop a novel approach for studying a temporal network more explicitly, by capturing inter-snapshot relationships.
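
The two static representations mentioned above can be sketched as follows. This is a minimal illustration that assumes a temporal network is given as a list of timestamped undirected edges (u, v, t); the function names and the fixed-width windowing are assumptions made for the example, and the sketch does not implement dynamic graphlets.

```python
from collections import defaultdict


def aggregate_network(events):
    """Collapse timestamped edges (u, v, t) into one static edge set."""
    return {frozenset((u, v)) for u, v, _ in events}


def snapshots(events, window):
    """Group timestamped edges into consecutive time windows.

    Returns a list of (window_index, edge_set) pairs sorted by window index,
    where window k covers times in [k * window, (k + 1) * window).
    """
    by_window = defaultdict(set)
    for u, v, t in events:
        by_window[int(t // window)].add(frozenset((u, v)))
    return sorted(by_window.items())
```

The aggregate network discards all timing, while the snapshot series keeps the order of windows but not how an edge in one window relates to edges in the next, which is the inter-snapshot information the presentation aims to capture.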

Results: We base our methodology on well-established graphlets (subgraphs), which have been proven in numerous contexts in static network research. We develop new theory to allow for graphlet-based analyses of temporal networks. Our new notion of dynamic graphlets is different from existing dynamic network approaches that are based on temporal motifs (statistically significant subgraphs). The latter have limitations: their results depend on the choice of a null network model that is required to evaluate the significance of a subgraph, and choosing a good null model is non-trivial. Our dynamic graphlets overcome the limitations of the temporal motifs. Also, when we aim to characterize the structure and function of an entire temporal network or of individual nodes, our dynamic graphlets outperform the static graphlets. Clearly, accounting for temporal information helps. We apply dynamic graphlets to temporal age-specific molecular network data to deepen our limited knowledge about human aging.

Date: Monday, July 13 2:40 pm - 3:00 pm
Room: Liffey Hall 2

Authors:
Hui Liu, Changzhou University, China
Jianjiang Sun, Fudan University, China
Yanni Sun, Fudan University, China
Jihong Guan, Tongji University, China
Jie Zheng, Nanyang Technological University, Singapore
Shuigeng Zhou, Fudan University, China

Area Session Chair: Knut Reinert

Presentation Overview:
Motivation: Computational prediction of compound-protein interactions is of great importance for drug design and development, as genome-scale experimental validation of compound-protein interactions is not only time-consuming but also prohibitively expensive. With the availability of an increasing number of validated interactions, the performance of computational prediction approaches is severely impeded by the lack of reliable negative compound-protein interaction samples. A systematic method of screening reliable negative samples becomes critical to improving the performance of in silico prediction methods.

Results: This paper aims at building up a set of highly credible negative samples of compound-protein interactions via an in silico screening method. As most existing computational models assume that similar compounds are likely to interact with similar target proteins and achieve remarkable performance, it is rational to identify potential negative samples based on the converse negative proposition that the proteins dissimilar to every known/predicted target of a compound are not likely to be targeted by the compound, and vice versa. We integrated various resources, including chemical structures, chemical expression profiles and side effects of compounds, amino acid sequences, protein-protein interaction network, and functional annotations of proteins, into a systematic screening framework. We first tested the screened negative samples on six classical classifiers, and all these classifiers achieved remarkably higher performance on our negative samples than on randomly-generated negative samples for both human and C. elegans. We then verified the negative samples on three existing prediction models, namely the bipartite local model, Gaussian kernel profile and Bayesian matrix factorization, and found that the performance of these models is also significantly improved on the screened negative samples. Moreover, we validated the screened negative samples on a drug bioactivity dataset. Finally, we derived two sets of new interactions by training an SVM classifier on the positive interactions annotated in DrugBank and our screened negative interactions. The screened negative samples and the predicted interactions provide the research community with a useful resource for identifying new drug targets and a helpful supplement to the current curated compound-protein databases.
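
The screening idea stated above, that a protein dissimilar to every known target of a compound is a candidate negative, can be sketched as below. This is a simplified illustration only: it uses a single protein similarity score, a hypothetical fixed threshold, and treats missing similarity values as 0.0, whereas the presentation describes integrating several compound and protein resources and also applying the converse rule on the compound side.

```python
def screen_negatives(similarity, known_targets, proteins, threshold):
    """Return (compound, protein) pairs proposed as negative samples.

    similarity:    dict mapping frozenset({p, q}) -> similarity in [0, 1]
    known_targets: dict mapping compound -> set of known target proteins
    proteins:      iterable of all candidate proteins
    threshold:     a protein qualifies only if its similarity to every
                   known target of the compound is below this value
    """
    proteins = list(proteins)
    negatives = set()
    for compound, targets in known_targets.items():
        if not targets:
            continue  # nothing to compare against, so make no claim
        for protein in proteins:
            if protein in targets:
                continue
            # Missing similarity entries default to 0.0 (an assumption).
            if all(similarity.get(frozenset((protein, t)), 0.0) < threshold
                   for t in targets):
                negatives.add((compound, protein))
    return negatives
```

For example, with one compound whose only known target is P1, a protein P3 with similarity 0.1 to P1 would be proposed as a negative at a threshold of 0.3, while a protein P2 with similarity 0.9 would not.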

Availability: Supplementary files and a preliminary Web server of this work are available at: http://admis.fudan.edu.cn/negative-cpi/