Proceedings Track Presentations
As of May 13, 2014 (schedule subject to change)

Attention Conference Presenters - please review the Speaker Information Page available here.

All Highlights and Proceedings Track presentations are presented by scientific area part of the combined Paper Presentation schedule.
A full schedule of Paper Presentations can be found here.

Presenting Authors are are shown in bold:

PP02 Ragout - A reference-assisted assembly tool for bacterial genomes
Room: 304
Date: Sunday, July 13, 10:30 am - 10:55 am

Son Pham, University of California, San Diego, United States
Mikhail Kolmogorov, Saint-Petersburg Academic University, Russia
Benedict Paten, University of California, Santa Cruz, United States
Brian Raney, University of California, Santa Cruz, United States

Session Chair: Serafim Batzoglou
Abstract Show

Bacterial genomes are simpler than mammalian ones, and
yet assembling the former from the data currently generated by
high-throughput short read sequencing machines still results in
hundreds of contigs. To improve assembly quality, recent studies
have utilized longer Pacific Biosciences (PacBio) reads or jumping
libraries to connect contigs into larger scaffolds or help assemblers
resolve ambiguities in repetitive regions of the genome. However,
their popularity in contemporary genomic research is still limited
by high cost and error rates.

In this work, we explore the
possibility of improving assemblies by using complete genomes
from closely related species/strains. We present Ragout, a genome
rearrangement approach, to address this problem. In contrast with
most reference-guided algorithms, where only one reference genome
is used, Ragout uses multiple references along with the evolutionary
relationship among these references in order to determine the correct
order of the contigs. Additionally, Ragout uses the assembly graph
and multi-scale synteny blocks to reduce assembly gaps caused
by small contigs from the input assembly. In simulations as well
as real datasets, we believe that for common bacterial species,
where many complete genome sequences from related strains have
been available, the current high-throughput short read sequencing
paradigm is sufficient to obtain a single high-quality scaffold for
each chromosome. The Ragout software is freely available at:

PP04 AlignGraph: algorithm for secondary de novo genome assembly guided by closely related references
Room: 304
Date: Sunday, July 13, 11:00 am - 11:25 am

Ergude Bao, University of California, Riverside, United States
Tao Jiang, University of California, Riverside, United States
Thomas Girke, University of California, Riverside, United States

Session Chair: Serafim Batzoglou
Abstract Show

De novo assemblies of genomes remain one of the most challenging applications in next generation sequencing. Usually, their results are incomplete and fragmented into hundreds of contigs. Repeats in genomes and sequencing errors are the main reasons for these complications. With the rapidly growing number of sequenced genomes, it is now feasible to improve assemblies by guiding them with genomes from related species.

Here we introduce AlignGraph, an algorithm for extending and joining de novo assembled contigs or scaffolds guided by closely related reference genomes. It aligns paired-end (PE) reads and pre-assembled contigs or scaffolds to a close reference. From the obtained alignments, it builds a novel data structure, called the paired-end multi-positional de Bruijn graph. The incorporated positional information from the alignments and PE reads allows us to extend the initial assemblies, while avoiding incorrect extensions and early terminations. In our performance tests, AlignGraph was able to substantially improve the contigs and scaffolds from several assemblers. For instance, 28.7-62.3% of the contigs of Arabidopsis thaliana and human could be extended, resulting in improvements of common assembly metrics, such as an increase of the N50 of the extendable contigs by 89.9-94.5% and 80.3-165.8%, respectively. In another test, AlignGraph was able to improve the assembly of a published genome (Arabidopsis strain Landsberg) by increasing the N50 of its extendable scaffolds by 86.6%. These results demonstrate AlignGraph’s efficiency in improving genome assemblies by taking advantage of closely related references.

PP05 Cross-study validation for assessment of prediction models and algorithms
Room: 302
Date: Sunday, July 13, 11:00 am - 11:25 am

Christoph Bernau, Leibniz Supercomputing Center, Germany
Markus Riester, Harvard School of Public Health, United States
Anne-Laure Boulesteix, LMU Munich, Germany
Giovanni Parmigiani, Dana-Farber Cancer Institute, United States
Curtis Huttenhower, Harvard School of Public Health, United States
Levi Waldron, City University of New York School of Public Health, United States
Lorenzo Trippa, Dana-Farber Cancer Institute,, United States

Session Chair: Bonnie Berger
Abstract Show

Motivation: Numerous competing algorithms for prediction modeling
in high-dimensional settings have been developed in the statistical
and machine learning literature. Learning algorithms and the
prediction models they generate are typically evaluated on the basis
of cross-validation error estimates in a few examplary datasets.
However, in most applications, the ultimate goal of prediction
modeling is to provide accurate predictions for independent samples
processed in different laboratories, and cross-validation within
examplary datasets may not adequately reflect performance in this

Methods: Systematic cross-study validation is performed in
simulations and in a collection of eight estrogen-receptor positive
breast cancer microarray gene expression datasets, with the objective
of predicting distant metastasis-free survival (DMFS). An evaluation
statistic, in this paper the C-index, is computed for all pairwise
combinations of training and validation datasets. We evaluate several
alternatives for summarizing the pairwise validation statistics, and
compare these to conventional cross-validation.

Results: We develop a systematic approach to “cross-study
validation” to replace or supplement conventional cross-validation for
evaluation of high-dimensional prediction models when independent
datasets are available. In data-driven simulations and in our
application to survival prediction with eight breast cancer microarray
datasets, standard cross-validation suggests inflated discrimination
accuracy for all competing algorithms when compared to cross-study
validation. Furthermore, the ranking of learning algorithms differs,
suggesting that algorithms performing best in cross-validation may
be suboptimal when evaluated through independent validation.
Availability: The survHD: Survival in High Dimensions package
( will be made available
through Bioconductor.

PP07 ExSPAnder: a Universal Repeat Resolver for DNA Fragment Assembly
Room: 304
Date: Sunday, July 13, 11:30 am - 11:55 pm

Andrey D. Prjibelski, St. Petersburg Academic University, Russia
Irina Vasilinetc, St. Petersburg Academic University, Russia
Anton Bankevich, St. Petersburg Academic University, Russia
Alexey Gurevich, St. Petersburg Academic University, Russia
Tatiana Krivosheeva, St. Petersburg Academic University, Russia
Sergey Nurk, St. Petersburg Academic University, Russia
Son Pham, University of California, San Diego, United States
Anton Korobeynikov, St. Petersburg Academic University, Russia
Alla Lapidus, St. Petersburg Academic University, Russia
Pavel Pevzner, University of California, San Diego, United States

Session Chair: Serafim Batzoglou
Abstract Show

Next-generation sequencing (NGS) technologies have raised a challenging de novo genome assembly problem that is furtheramplified in recently emerged single-cell sequencing projects. Whilevarious NGS assemblers can utilize information from several libraries of read-pairs, most of them were originally developed for a singlelibrary and do not fully benefit from multiple libraries. Moreover, mostassemblers assume uniform read coverage, condition that does nothold for single-cell projects where utilization of read-pairs is evenmore challenging. We have developed an exSPAnder algorithm thataccurately resolves repeats in the case of both single and multiplelibraries of read-pairs in both standard and single-cell assemblyprojects.

PP08 Large scale analysis of signal reachability
Room: 311
Date: Sunday, July 13, 12:00 pm - 12:25 pm

Andrei Todor, University of Florida, United States
Haitham Gabr, University of Florida, United States
Alin Dobra, University of Florida, United States
Tamer Kahveci, University of Florida, United States

Session Chair: Terry Gaasterland
Abstract Show

Motivation: Major disorders, such as leukemia, have been shown to alter the
transcription of genes. Understanding how gene regulation is
affected by such aberrations is of utmost importance. One promising
strategy towards this objective is to compute whether signals can
reach to the transcription factors through the transcription
regulatory network. Due to the uncertainty of the regulatory
interactions, this is a #P-complete problem and thus solving it for
very large transcription regulatory networks remains to be a

Results: We develop a novel and scalable method to compute the probability that a signal originating at any given set of source genes can
arrive at any given set of target genes (i.e., transcription
factors) when the topology of the underlying signaling network is
uncertain. Our method tackles this problem for large networks while
providing a provably accurate result. Our method follows a
divide-and-conquer strategy. We break down the given network into a
sequence of non-overlapping subnetworks such that reachability can
be computed autonomously and sequentially on each subnetwork. We
represent each interaction using a small polynomial. The product of
these polynomials express different scenarios when a signal can or
cannot reach to target genes from the source genes. We introduce
polynomial collapsing operators for each subnetwork. These operators
reduce the size of the resulting polynomial and thus the
computational complexity dramatically. We show that our method
scales to entire human regulatory networks in only seconds, while
the existing methods fail beyond a few tens of genes and
interactions. We demonstrate that our method can successfully
characterize key reachability characteristics of the entire
transcriptions regulatory networks of patients affected by eight
different subtypes of leukemia, as well as those from healthy
control samples.

PP10 GRASP: Analysis of genotype-phenotype results from 1,390 genome-wide association studies and corresponding open access database
Room: 311
Date: Sunday, July 13, 3:05 pm - 3:30 pm

Richard Leslie, University of Massachusetts Medical School, United States
Christopher O'Donnell, National Institute's of Health, United States
Andrew Johnson, National Institute's of Health, United States

Session Chair: Fran Lewitter
Abstract Show

We created a deeply extracted and annotated database of GWAS study results. GRASP v1.0 contains >6.2 million SNP-phenotype association from among 1,390 GWAS studies. We re-annotated GWAS results with 16 annotation sources including some rarely compared to GWAS results (e.g., RNAediting sites, lincRNAs, PTMs). RESULTS. GWAS have grown exponentially, with increases in sample sizes and markers tested, and continuing bias toward European ancestry samples. GRASP contains >100,000 phenotypes, roughly: eQTLs (71.5%), metabolite QTLs (21.2%), methylation QTLs (4.4%), and diseases, biomarkers and other traits (2.8%). cis-eQTLs, meQTLs, mQTLs and MHC region SNPs are highly enriched among significant results. After removing these categories, GRASP still contains a greater proportion of studies and results than comparable GWAS catalogs.

Cardiovascular disease and related risk factors predominate remaining GWAS results, followed by immunological, neurological and cancer traits. Significant results in GWAS display a highly gene-centric tendency. Sex chromosome X (OR=0.18[0.16-0.20]) and Y (OR=0.003[0.001-0.01]) genes are depleted for GWAS results. Gene length is correlated with GWAS results at nominal significance (P<0.05) levels. We show this gene length correlation decays at increasingly more stringent P-value thresholds. Potential pleiotropic genes and SNPs enriched for multi-phenotype association in GWAS are identified. However, we note possible population stratification at some of these loci. Finally, via re-annotation we identify compelling functional hypotheses at GWAS loci, in some cases unrealized in studies to date. CONCLUSION. Pooling summary-level GWAS results and re-annotating with bioinformatics predictions and molecular features provides a good platform for new insights. The GRASP database is available at

PP14 BlockClust: efficient clustering and classification of non-coding RNAs from short read profiles
Room: 304
Date: Sunday, July 13, 3:35 pm - 4:00 pm

Fabrizio Costa, University of Freiburg, Germany
Dominic Rose, University of Freiburg, Germany
Rolf Backofen, University of Freiburg, Germany
Pavankumar Videm, University of Freiburg, Germany

Session Chair: Cenk Sahinalp
Abstract Show

Non-coding RNAs play a vital role in many cellular processes such as
RNA splicing, translation, gene regulation. However the vast
majority of ncRNAs still have no functional annotation. One prominent
approach for putative function assignment is clustering of
transcripts according to sequence and secondary structure. However
sequence information is changed by post-transcriptional
modifications, and secondary structure is only a proxy for the true
three dimensional conformation of the RNA polymer. A different type
of information that does not suffer from these issues and that can
be used for the detection of RNA classes, is the pattern of
processing and its traces in small RNA-seq reads data.

Here we introduce BlockClust, an efficient approach to detect
transcripts with similar processing patterns. We propose a novel way
to encode expression profiles in compact discrete structures, which
can then be processed using fast graph kernel techniques. We perform
both unsupervised clustering and develop family specific
discriminative models; finally we show how the proposed approach is
both scalable, accurate and robust across different organisms,
tissues and cell lines.

PP15 Tertiary structure-based prediction for conformational B-cell epitopes through B factors
Room: 302
Date: Sunday, July 13, 3:35 pm - 4:00 pm

Jing Ren, University of Technology Sydney, Australia
Qian Liu, University of Technology Sydney, Australia
John Ellis, University of Technology Sydney, Australia
Jinyan Li, University of Technology Sydney, Australia

Session Chair: Toni Kazic
Abstract Show

Motivation: B-cell epitope is a small area on the surface of an antigen that binds to an antibody. Accurately locating epitopes is of critical importance for vaccine development. Compared with wet-lab methods, computational methods have strong potential for efficient and large-scale epitope prediction for antigen candidates at much lower cost. However, it is still not clear which features are good determinants for accurate epitope prediction, leading to the unsatisfactory performance of existing prediction methods. Method and results: We propose a much more accurate B-cell epitope prediction method. Our method uses a new feature B factor (obtained from X-ray crystallography), combined with other basic physicochemical, statistical, evolutionary and structural features of each residue. These basic features are extended by a sequence window and a structure window. All these features are then learned by a two-stage random forest model to identify clusters of antigenic residues and to remove isolated outliers. Tested on a dataset of 55 epitopes from 45 tertiary structures, we prove that our method significantly outperforms all three existing structure-based epitope predictors. Following comprehensive analysis, it is found that features such as B factor, relative accessible surface area and protrusion index play an important role in characterizing B-cell epitopes. Our detailed case studies on an HIV antigen and an influenza antigen confirm that our second stage learning is effective for clustering true antigenic residues and for eliminating self-made prediction errors introduced by the first-stage learning.

PP18 An Efficient Parallel Algorithm for Accelerating Computational Protein Design
Room: 302
Date: Sunday, July 13, 4:05 pm - 4:30 pm

Yichao Zhou, Tsinghua University, China
Wei Xu, Tsinghua University, China
Bruce R. Donald, Duke University, United States
Jianyang Zeng, Tsinghua University, China

Session Chair: Toni Kazic
Abstract Show

Motivation: Structure-based computational protein design is an
important topic in protein engineering. Under the assumption of a rigid
backbone and a finite set of discrete conformations for side-chains,
various methods have been proposed to address this problem. A
popular method is to combine the Dead-End Elimination (DEE) and
A* tree search algorithms, which provably finds the Global Minimum
Energy Conformation (GMEC) solution.

Results: In this paper, we improve the efficiency of computing A*
heuristic functions for protein design and also propose a variant of A*
algorithm in which the search process can be performed on GPUs in
a massively parallel fashion . In addition, we made some efforts to
address the memory exceeding problems in A* search. As a result,
our enhancements can achieve a significant speedup of the original
A* search for protein design in four orders of magnitude on big scale
test data, while still maintaining an acceptable memory overhead. Our
parallel A* search algorithm can be combined with iMinDEE, a recent
DEE criterion for rotamer pruning to further improve structure-based
computational protein design with the consideration of side-chain

PP19 Inductive Matrix Completion for Predicting Gene-Disease Associations
Room: 311
Date: Sunday, July 13, 4:35 pm - 5:00 pm

Nagarajan Natarajan, University of Texas at Austin, United States
Inderjit Dhillon, University of Texas at Austin, United States

Session Chair: Fran Lewitter
Abstract Show

Motivation: Most existing methods for predicting causal disease genes rely on specific type of evidence, and are therefore limited in terms of applicability. More often than not, the type of evidence available for diseases varies --- for example, we may know linked genes, keywords associated with the disease obtained by mining text, or co-occurrence of disease symptoms by observing patients. Similarly, the type of evidence available for genes varies --- for example, specific microarray probes convey information only for certain sets of genes. In this paper, we apply a novel matrix completion method recently developed by Jain 2013 to the problem of predicting gene-disease associations; it combines multiple types of evidence (features) for diseases and genes to learn latent factors that explain the observed gene-disease associations. We construct features from different biological sources such as microarray expression data and disease-related textual data. A crucial advantage of the method is that it is inductive; it can be applied to diseases emph{not} seen at training time, unlike traditional matrix completion approaches and network-based inference methods that are transductive.

Results: Comparison with state-of-the-art methods on diseases from the Online Mendelian Inheritance in Man (OMIM) database shows that the proposed approach is substantially better - it has close to one-in-four chance of recovering a true association in the top 100 predictions, compared to the recently proposed method (second best) that has less than 15\% chance. We demonstrate that the inductive method is particularly effective for a query disease with no previously known gene associations, and for predicting novel genes, i.e., genes that are previously not linked to diseases. Thus the method is capable of predicting novel genes even for well-characterized diseases. We also validate the novelty of predictions by evaluating the method on recently reported OMIM associations and on associations recently reported in the literature curated by Bornigen.

Availability: Source code and datasets at http://www.cs.utexas edu/~naga86/research/IMC

PP20 Probabilistic Method for Detecting Copy Number Variation in a Fetal Genome using Maternal Plasma Sequencing
Room: 304
Date: Sunday, July 13, 4:35 pm - 5:00 pm

Ladislav Rampášek, University of Toronto, Canada
Aryan Arbabi, University of Toronto, Canada
Michael Brudno, University of Toronto, Canada

Session Chair: Cenk Sahinalp
Abstract Show

Motivation: The past several years have seen the development of
methodologies to identify genomic variation within a fetus through the
non-invasive sequencing of maternal blood plasma. These methods
are based on the observation that maternal plasma contains a
fraction of DNA (typically 5-15%) originating from the fetus, and such
methodologies have already been used for the detection of whole-
chromosome events (aneuploidies), and to a more limited extent for
smaller (typically several megabases long) Copy Number Variants

Results: Here we present a probabilistic method for non-invasive
analysis of de novo CNVs in fetal genome based on maternal plasma
sequencing. Our novel method combines three types of information
within a unified Hidden Markov Model: the imbalance of allelic ratios
at SNP positions, the use of parental genotypes to phase nearby
SNPs, and depth of coverage to better differentiate between various
types of CNVs and improve precision. Our simulation results, based
on in silico introduction of novel CNVs into plasma samples with 13%
fetal DNA concentration, demonstrate a sensitivity of 90% for CNVs
>400 kilobases (with 13 calls in an unaffected genome), and 40% for
50-400kb CNVs (with 108 calls in an unaffected genome).

Availability: Implementation of our model and data simulation
method is available at

PP23 RNA-Skim: a rapid method for RNA-Seq quantification at transcript level
Room: 304
Date: Monday, July 14, 10:30 am - 10:55 am

Zhaojun Zhang, UNC - Chapel Hill, United States
Wei Wang, University of California, Los Angeles, United States

Session Chair: Bernard Moret
Abstract Show

Motivation: RNA-Seq technique has been demonstrated as a revolutionary means for exploring transcriptome because it provides deep coverage and base-pair level resolution. RNA-Seq quantification is proven to be an efficient alternative to Microarray technique in gene expression study, and it is a critical component in RNA-Seq differential expression analysis. Most existing RNA-Seq quantification tools require the alignments of fragments to either a genome or a transcriptome, entailing a time-consuming and intricate alignment step. In order to improve the performance of RNA-Seq quantification, an alignment-free method, Sailfish, has been recently proposed to quantify transcript abundances using all k-mers in the transcriptome, demonstrating the feasibility of designing an efficient alignment-free method for transcriptome quantification. Even though Sailfish is substantially faster than alternative alignment-dependent methods such as Cufflinks, using all k-mers in the transcriptome quantification impedes the scalability of the method.

Results: We propose a novel RNA-Seq quantification method, RNA-Skim, which partitions the transcriptome into disjoint transcript clusters based on sequence similarity and introduces the notion of sig-mers that are a special type of k-mers uniquely associated with each cluster. We demonstrate that the sig-mer counts within a cluster are sufficient for estimating transcript abundances with accuracy comparable to any state of the art method. This enables RNA-Skim to perform transcript quantification on each cluster independently, reducing a complex optimization problem into smaller optimization tasks that can be run in parallel. As a result, RNA-Skim uses less than 4% of the k-mers and less than 10% of the CPU time required by Sailfish. It is able to finish transcriptome quantification in less than 10 minutes per sample by using just a single thread on a commodity computer, which represents more than 100 speedup over the state of the art alignment-based methods, while delivering comparable or higher accuracy.

The software is available at

PP24 Metabolite Identification through Multiple Kernel Learning on Fragmentation Trees
Room: 302
Date: Monday, July 14, 10:30 am - 10:55 am

Huibin Shen, Aalto University, Finland
Kai Dührkop, Friedrich-Schiller-University Jena, Germany
Sebastian Böcker, Friedrich-Schiller-University Jena, Germany
Juho Rousu, Aalto University, Finland

Session Chair: Yanay Ofran
Abstract Show

Motivation: Metabolite identification from tandem mass spectrometric data is a key task in metabolomics. Various computational methods has been proposed for the
identification of metabolites from tandem mass spectra. Fragmentation tree
methods explore the space of possible ways the metabolite can fragment, and
base the metabolite identification on scoring of these fragmentation trees.
Machine learning methods has been used to map mass spectra to molecular
fingerprints; predicted fingerprints, in turn, can be used to score candidate
molecular structures.

Results: Here, we combine fragmentation tree computations with kernel-based machine learning to predict molecular fingerprints and identify molecular structures.
We introduce a family of kernels capturing the similarity of fragmentation
trees, and combine these kernels using recently proposed multiple kernel
learning approaches. Experiments on two large reference datasets show that
the new methods significantly improve molecular fingerprint prediction
accuracy. These improvements result in better metabolite identification,
doubling the number of metabolites placed ranked at the top position of the
candidate list.

PP25 Methods for time series analysis of RNA-seq data with application to human Th17 cell differentiation
Room: 312
Date: Monday, July 14, 10:30 am - 10:55 am

Tarmo Äijö, Aalto University, Finland
Vincent Butty, Massachusetts Institute of Technology, United States
Zhi Chen, University of Turku, Finland
Verna Salo, University of Turku, Finland
Subhash Tripathi, University of Turku / Åbo Akademi University , Finland
Christopher Burge, Massachusetts Institute of Technology, United States
Riitta Lahesmaa, University of Turku, Finland
Harri Lähdesmäki, Aalto University,, Finland

Session Chair: Robert F. Murphy
Abstract Show

Motivation: Gene expression profiling using RNA-seq is a powerful technique for screening RNA species’ landscapes and their dynamics in an unbiased way. While several advanced methods exist for differential expression analysis of RNA-seq data, proper tools to analyze RNA-seq time-course have not been proposed.

Results: In this study, we use RNA-seq to measure gene expression during the early human T helper 17 (Th17) cell differentiation and T cell activation (Th0). To quantify Th17 specific gene expression dynamics, we present a novel statistical methodology, DyNB, for analyzing time-course RNA-seq data. We use non- parametric Gaussian process to model temporal correlation in gene expression and combine that with negative binomial likelihood for the count data. To account for experiment specific biases in gene expression dynamics, such as differences in cell differentiation efficiencies, we propose a method to rescale the dynamics between replicated measurements. We develop an MCMC sampling method to make inference of differential expression dynamics between conditions. DyNB identifies several known and novel genes involved in Th17 differentiation. Analysis of differentiation efficiencies revealed consistent patterns in gene expression dynamics between different cultures. We use qRT-PCR to validate differential expression and differentiation efficiencies for selected genes. Comparison of the results with those obtained via traditional time point wise analysis shows that time-course analysis together with time rescaling between cultures identifies differentially expressed genes which would not otherwise be detected.

Availability: An implementation of the proposed computational methods will be available at

PP30 Pipasic: Similarity and Expression Correction for Strain-Level Identification and Quantification in Metaproteomics
Room: 311
Date: Monday, July 14, 11:30 am - 11:55 pm

Anke Penzlin, Robert Koch Institute, Germany
Martin S. Lindner, Robert Koch Institute, Germany
Joerg Doellinger, Robert Koch Institute, Germany
Piotr Wojtek Dabrowski, Robert Koch Institute, Germany
Nitsche Andreas, Robert Koch Institute, Germany
Bernhard Y. Renard, Robert Koch Institute, Germany

Session Chair: Janet Kelso
Abstract Show

Motivation: Metaproteomic analysis allows studying the interplay of organisms or functional groups and has become increasingly popular also for diagnostic purposes. However, difficulties arise due to the high sequence similarity between related organisms. Further, the state of conservation of proteins between species can be correlated with their expression level which can lead to significant bias in results and interpretation. These challenges are similar but not identical to the challenges arising in the analysis of metagenomic samples and require specific solutions.

Results: We introduce Pipasic (peptide intensity-weighted proteome abundance similarity correction) as a tool which corrects identification and spectral counting based quantification results using peptide similarity estimation and expression level weighting within a non-negative lasso framework. Pipasic has distinct advantages over approaches only regarding unique peptides or aggregating results to the lowest common ancestor, as demonstrated on examples of viral diagnostics and an acid mine drainage dataset.

PP31 Deep learning of the tissue-regulated splicing code
Room: 304
Date: Monday, July 14, 11:30 am - 11:55 pm

Michael Leung, University of Toronto, Canada
Hui Xiong, University of Toronto, Canada
Leo Lee, University of Toronto, Canada
Brendan Frey, University of Toronto, Canada

Session Chair: Bernard Moret
Abstract Show

Motivation: Alternative splicing is a regulated process that directs the generation of different transcripts from single genes. A computational model that can accurately predict splicing patterns based on genomic features and cellular context is highly desirable, both in understanding this widespread phenomenon, and in exploring the effects of genetic variations on alternative splicing.

Methods: Using a deep neural network, we developed a model inferred from mouse RNA-Seq data that can predict splicing patterns in individual tissues and differences in splicing patterns across tissues. Our architecture uses hidden variables that jointly represent features in genomic sequences and tissue types when making predictions. A graphics processing unit was used to greatly reduce the training time of our models with millions of parameters.

Results: We show that the deep architecture surpasses the performance of the previous Bayesian method for predicting alternative splicing patterns. With the proper optimization procedure and selection of hyperparameters, we demonstrate that deep architectures can be beneficial, even with a moderately sparse dataset. An analysis of what the model has learned in terms of the genomic features is presented.

PP32 DrugComboRanker: Drug Combination Discovery Based on Target Network Analysis
Room: 302
Date: Monday, July 14, 11:30 am - 11:55 pm

Lei Huang, Peking University, China
Fuhai Li, Houston Methodist Hospital Research Institute , United States
Jianting Sheng, Houston Methodist Hospital Research Institute, United States
Xiaofeng Xia, Houston Methodist Hospital Research Institute, United States
Jinwen Ma, Peking University
Ming Zhan, Houston Methodist Hospital Research Institute, United States
Stephen Wong, Houston Methodist Hospital Research Institute, United States

Session Chair: Yanay Ofran
Abstract Show

Motivation: Currently there are no curative anti-cancer drugs and drug resistance is often acquired after drug treatment. One of the reasons is that cancers are complex diseases, regulated by multiple signaling pathways and cross-talks among the pathways. It is expected that drug combinations can reduce drug resistance and improve patients’ outcomes. In clinical practice, the ideal and feasible drug combinations are combinations of existing FDA approved drugs or bioactive compounds that are already used on patients or have entered clinical trials and passed safety tests. These drug combinations could directly be used on patients with less concern of toxic effects. However, there is so far no effective computational approach to search effective drug combinations from the enormous number of possibilities.

Results: In this study, we propose a novel systematic computational tool DrugComboRanker to prioritize synergistic drug combinations and uncover their mechanisms of action. We first build a drug functional network based on their genomic profiles, and partition the network into numerous drug network communities by using a Bayesian non-negative matrix factorization approach. As drugs within overlapping community share common mechanisms of action, we next uncover potential targets of drugs by applying a recommendation system on drug communities. We meanwhile build disease-specific signaling networks based on patients’ genomic profiles and interactome data. We then identify drug combinations by searching drugs whose targets are enriched in the complementary signaling modules of the disease signaling network. The novel method was evaluated on lung adenocarcinoma and endocrine receptor (ER) positive breast cancer, and compared with other drug combination approaches. These case studies discovered a set of effective drug combinations top ranked in our prediction list, and mapped the drug targets on the disease signaling network to highlight the mechanisms of action of the drug combinations.

PP33 Automated detection and tracking of many cells by using 4D live-cell imaging data
Room: 312
Date: Monday, July 14, 11:30 am - 11:55 pm

Terumasa Tokunaga, The Institute of Statistical Mathematics, Japan
Osamu Hirose, Kanazawa University, Japan
Shotaro Kawaguchi, Kanazawa University, Japan
Yu Toyoshima, The University of Tokyo, Japan
Takayuki Teramoto, Kyushu University, Japan
Hisaki Ikebata, Graduate University of Advanced Studies, Japan
Sayuri Kuge, Kyushu University, Japan
Takeshi Ishihara, Kyushu University, Japan
Yuichi Iino, The University of Tokyo, Japan
Ryo Yoshida, The Institute of Statistical Mathematics, Japan

Session Chair: Robert F. Murphy
Abstract Show

Automated fluorescence microscopes produce massive amounts of images
observing cells often in four dimensions of space and time. This study
addresses two tasks of time-lapse imaging analyses; detection and
tracking of many imaged cells, especially intended for 4D live-cell
imaging of neuronal nuclei of C. elegans. Cells of interest appear
in little more generic forms than ellipsoids. They distribute densely
and move rapidly in a series of 3D images. In such cases, existing
tracking methods often fail due to that, for instance, many trackers
transit from one to the other of different objects during rapid moves.

The present method starts from converting each 3D image to a smooth
continuous function by performing the kernel density estimation. Cell
bodies in an image are assumed to lie in regions around multiple local
maxima of the density function. Then, the tasks of detecting and
tracking cells are addressed with two hill-climbing algorithms that we
derive. By applying the cell detection method to an image at the first
frame, the positions of trackers are initialized. The tracking algorithm
keeps attracting them to around local maxima changing over time in a
subsequent image sequence. To prevent the trackers from turnovers and
coalescences, we employ Markov random fields (MRFs) to model spatial and
temporal covariation of cells, and maximize the image forces and the MRF-
induced constraints on transitions of the trackers. The tracking
procedure is demonstrated with dynamic 3D images containing more than
one hundred neurons of C. elegans.

PP41 Pareto-Optimal Phylogenetic Tree Reconciliation
Room: 312
Date: Monday, July 14, 2:10 pm - 2:35 pm

Ran Libeskind-Hadas, Harvey Mudd College, United States
Yi-Chieh Wu, Massachusetts Institute of Technology, United States
Mukul S. Bansal, University of Connecticut, United States
Manolis Kellis, Massachusetts Institute of Technology, United States

Session Chair: Russell Schwartz
Abstract Show

Motivation: Phylogenetic tree reconciliation is a widely-used method for reconstructing the evolutionary histories of gene families and species, hosts and parasites, and other dependent pairs of entities. Reconciliation is typically performed using maximum parsimony, in which each evolutionary event type is assigned a cost and the objective is to find a reconciliation of minimum total cost. It is generally understood that reconciliations are sensitive to event costs, but little is understood about the relationship between events costs and solutions. Moreover, choosing appropriate event costs is a notoriously difficult problem.

Results: We address this problem by giving an efficient algorithm for computing Pareto-optimal sets of reconciliations, thus providing the first systematic method for understanding the relationship between event costs and reconciliations. This, in turn, results in new techniques for computing event support values and, for cophylogenetic analyses, performing robust statistical tests. We provide new software tools and demonstrate their use on a number of datasets from evolutionary genomic and cophylogenetic studies.

Availability: Our Python tools are freely available at

PP42 Stochastic EM-based TFBS motif discovery with MITSU
Room: 311
Date: Monday, July 14, 2:40 pm - 3:05 pm

Alastair Kilpatrick, The University of Edinburgh, United Kingdom
Bruce Ward, The University of Edinburgh, United Kingdom
Stuart Aitken, The University of Edinburgh, United Kingdom

Session Chair: Reinhard Schneider
Abstract Show

Motivation: The Expectation-Maximisation (EM) algorithm has been successfully applied to the problem of transcription factor binding site (TFBS) motif discovery and underlies the most widely used motif discovery algorithms. In the wider field of probabilistic modelling, the stochastic EM (sEM) algorithm has been used to overcome some of the limitations of the EM algorithm; however, the application of sEM to motif discovery has not been fully explored.

Results: We present MITSU, a novel algorithm for motif discovery which combines sEM with an improved approximation to the likelihood function which is unconstrained with regard to the distribution of motif occurrences within the input dataset. The algorithm is evaluated quantitatively on realistic synthetic data and several collections of characterised prokaryotic TFBS motifs and shown to outperform EM and an alternative sEM-based algorithm, particularly in terms of site-level positive predictive value.

PP46 Gene network inference by probabilistic scoring of relationships from a factorized model of interactions
Room: 311
Date: Monday, July 14, 3:10 pm - 3:35 pm

Marinka Zitnik, University of Ljubljana, Slovenia
Blaz Zupan, University of Ljubljana, Slovenia

Session Chair: Reinhard Schneider
Abstract Show

Motivation: Epistasis analysis is an essential tool of classical genetics for inferring the order of function of genes in a common pathway. Typically, it considers single and double mutant phenotypes and for a pair of genes observes if a change in the first gene masks the effects of the mutation in the second gene. Despite the recent emergence of biotechnology techniques that can provide gene interaction data on a large, possibly genomic scale, very few methods are available for quantitative epistasis analysis and epistasis-based network reconstruction.

Results: We here propose a conceptually new probabilistic approach to gene network inference from quantitative interaction data. The approach is founded on epistasis analysis. Its features are joint treatment of the mutant phenotype data with a factorized model and probabilistic scoring of pairwise gene relationships that are inferred from the latent gene representation. The resulting gene network is assembled from scored pairwise relationships. In an experimental study, we show that the proposed approach can accurately reconstruct several known pathways and that it surpasses the accuracy of current approaches.

PP48 Privacy Preserving Protocol for Detecting Genetic Relatives Using Rare Variants
Room: 302
Date: Monday, July 14, 3:10 pm - 3:35 pm

Farhad Hormozdiari, University of California, Los Angeles, United States
Jong Wha Joo, University of California, Los Angeles, United States
Feng Guan, University of California, Los Angeles, United States
Akshay Wadia, University of California, Los Angeles, United States
Rafail Ostrosky, University of California, Los Angeles, United States
Amit Sahai, University of California, Los Angeles, United States
Eleazar Eskin, University of California, Los Angeles, United States

Session Chair: Dietlind Gerloff
Abstract Show

Motivation: High-throughput sequencing technologies have impacted many
areas of genetic research. One such area is the identification of
relatives from genetic data. The standard approach for the
identification of genetic relatives collects the genomic data of
all individuals and stores it in a database. Then, each pair of individuals are compared to detect the set of genetic relatives, and the matched individuals are informed. The main drawback of this approach is the requirement of sharing your genetic data with a trusted third party to perform the relatedness test.

Results: In this work, we propose a secure protocol to detect the genetic relatives from sequencing data while not exposing any information about their genomes. We assume that individuals have access to their genome sequences but do not want to share their genomes with anyone else. Unlike previous approaches, our approach uses both common and rare variants which provides the ability to detect much more distant relationships securely. We use a simulated data generated from the 1000 genomes data and illustrate that we can easily detect up to fifth degree cousins
which was not possible using the existing methods. We also show in the
1000 genomes data with cryptic relationships that our method can detect these individuals.

PP50 Functional Association Networks as Priors for Gene Regulatory Network Inference
Room: 311
Date: Monday, July 14, 3:40 pm - 4:05 pm

Matthew Studham, SciLifeLab, Sweden
Andreas Tjärnberg, SciLifeLab, Sweden
Torbjörn Nordling, SciLifeLab, Sweden
Sven Nelander, Uppsala University, Sweden
Erik Sonnhammer, SciLifeLab, Sweden

Session Chair: Reinhard Schneider
Abstract Show

Gene regulatory network (GRN) inference reveals the influences genes have on one another in cellular regulatory systems. If the experimental data is inadequate to fully explain the network, informative priors have been shown to improve the accuracy of inferences. This study explores the potential of undirected, confidence-weighted networks, such as those in functional association databases, as a prior source for GRN inference. Such networks often erroneously indicate symmetric interaction between genes and may contain mostly correlation-based interaction information. Despite these drawbacks, our testing on synthetic data sets indicates that if the prior networks have enough causal information then they can improve GRN inference accuracy, and if not then accuracy may decrease. This opens the door to the possibility that functional association databases can be used as priors to make GRN inference more reliable.

PP57 Graph Regularized Dual Lasso for Robust eQTL Mapping
Room: 312
Date: Tuesday, July 15, 10:30 am - 10:55 am

Wei Cheng, UNC at Chapel Hill, United States
Xiang Zhang, Case Western Reserve University, United States
Zhishan Guo, UNC at Chapel Hill, United States
Yu Shi, University of Science and Technology of China
Wei Wang, University of California, Los Angeles, United States

Session Chair: Toni Kazic
Abstract Show

As a promising tool for dissecting the genetic basis of complex traits, expression quantitative trait loci (eQTL) mapping has attracted increasing research interest. An important issue in eQTL mapping is how to effectively integrate networks representing interactions among genetic markers and genes. Recently, several Lasso-based methods have been proposed to leverage such network information. Despite their success, existing methods have three common limitations: 1) a preprocessing step is usually needed to cluster the networks; 2) the incompleteness of the networks and the noise in them are not considered; 3) other available information, such as location of genetic markers and pathway information, are not integrated.

To address the limitations of the existing methods, we propose Graph-regularized Dual Lasso (GDL), a robust approach for eQTL mapping. GDL integrates the correlation structures among genetic markers and traits simultaneously. It also takes into account the incompleteness of the networks and is robust to the noise. GDL utilizes graph-based regularizers to model the prior networks and does not require an explicit clustering step. Moreover, it enables further refinement of the partial and noisy networks. We further generalize GDL to incorporate the location of genetic makers and gene pathway information. We perform extensive experimental evaluations using both simulated and real datasets. Experimental results demonstrate that the proposed methods can effectively integrate various available priori knowledge and significantly outperform the state-of-the-art eQTL mapping methods.

PP61 EPIQ - Efficient detection of SNP-SNP epistatic interactions for quantitative traits
Room: 312
Date: Tuesday, July 15, 11:00 am - 11:25 am

Yaara Arkin, Tel-Aviv University, Israel
Elior Rahmani, Tel Aviv University, Israel
Marcus E. Kleber, University of Heidelberg, Germany
Reijo Laaksonen, University of Tampere, Finland
Winfried Maerz, University of Heidelberg, Germany
Eran Halperin, Tel-Aviv University, Israel

Session Chair: Toni Kazic
Abstract Show

Motivation: Gene-gene interactions are of potential biological and medical interest, as they can shed light on both the inheritance mechanism of a trait and on the underlying biological mechanisms. Evidence of epistatic interactions has been reported in both humans and other organisms. Unlike single-locus genome wide association studies (GWAS), which proved efficient in detecting numerous genetic loci related with various traits, interaction-based GWAS have so far produced very few reproducible discoveries. Such studies introduce a great computational and statistical burden by necessitating a large number of hypotheses to be tested including all pairs of SNPs. Thus, many software tools have been developed for interaction-based case- control studies, some leading to reliable discoveries. For quantitative data, on the other hand, only a handful of tools exist, and the computational burden is still substantial.

Results: We present an efficient algorithm for detecting epistasis in quantitative GWAS, achieving a substantial runtime speedup by avoiding the need to exhaustively test all SNP pairs using metric embedding and random projections. Unlike previous metric embedding methods for case-control studies, we introduce a new embedding, where each SNP is mapped to two Euclidean spaces. We implemented our method in a tool named EPIQ (EPIstasis detection for Quantitative GWAS), and we show by simulations that EPIQ requires hours of processing time where other methods require days and sometimes weeks. Applying our method to a dataset from the Ludwigshafen Risk and Cardiovascular Health study discovered a pair of SNPs with a near-significant interaction (p=2.2×10-13), in only 1.5 hours on 10 processors. Availability:

PP62 Accurate viral population assembly from ultra-deep sequencing data
Room: 311
Date: Tuesday, July 15, 11:30 am - 11:55 pm

Serghei Mangul, University of California, Los Angeles, United States
Nicholas Wu, University of California, Los Angeles, United States
Nicholas Mancuso, Georgia State University, United States
Alex Zelikovsky, Georgia State University, United States
Ren Sun, University of California, Los Angeles, United States
Eleazar Eskin, University of California, Los Angeles, United States

Session Chair: Toni Kazic
Abstract Show

Next-generation sequencing technologies sequence viruses with ultra-deep coverage, thus promising to revolutionize our understanding of the underlying diversity of viral populations. While the sequencing coverage is high enough that even rare viral variants are sequenced, the presence of sequencing errors makes it difficult to distinguish between rare variants and sequencing errors. In this paper we present a method to overcome the limitations of sequencing technologies and assemble a diverse viral population which allows for the detection of previously undiscovered rare variants.

The proposed method consists of a high-fidelity sequencing protocol and an accurate viral population assembly method, referred to as Viral Genome Assembler (VGA). The proposed protocol is able to eliminate sequencing errors by using individual barcodes attached to the sequencing fragments. Highly accurate data in combination with deep coverage allows VGA to assemble rare variants. VGA utilizes an expectation-maximization algorithm to estimate abundances of the assembled viral variants in the population. Results on both synthetic and real datasets show that our method able to accurately assemble an HIV viral population and detect rare variants previously undetectable due to sequencing errors. VGA outperforms state-of-the-art methods for genome-wide viral assembly. Furthermore, our method is the first viral assembly method which scales to millions of sequencing reads. The open source C++/Python implementation of VGA is freely available for download at Contact:

PP65 A combinatorial approach for analyzing intra-tumor heterogeneity from high-throughput sequencing data
Room: 312
Date: Tuesday, July 15, 11:30 am - 11:55 pm

Iman Hajirasouliha, Brown University, United States
Ahmad Mahmoody, Brown University, United States
Ben Raphael, Brown University, United States

Session Chair: Toni Kazic
Abstract Show

High-throughput sequencing of tumor samples has shown that most tumors exhibit extensive intra-tumor heterogeneity, with multiple subpopulations of tumor cells containing different somatic mutations. Recent studies have quantified this intra- tumor heterogeneity by clustering mutations into subpopulations according to the observed counts of DNA sequencing reads containing the variant allele. However, these clustering approaches do not consider that the population frequencies of different tumor subpopulations are correlated by their shared ancestry in the same population of cells.

In this paper, we introduce the binary tree partition, a novel combinatorial formulation of the problem of constructing the subpopulations of tumor cells from the variant allele frequencies of somatic mutations. We show that finding a binary tree partition is an NP-complete problem; derive an approximation algorithm for an optimization version of the problem; and present a recursive algorithm to find a binary tree partition with errors in the input. We show that the resulting algorithm outperforms existing clustering approaches on simulated and real sequencing data.

PP67 A statistical approach for inferring the 3D structure of the genome
Room: 304
Date: Tuesday, July 15, 12:00 pm - 12:25 pm

Nelle Varoquaux, Mines ParisTech, France
Ferhat Ay, University of Washington, United States
William Noble, University of Washington, United States
Jean-Philippe Vert, Mines ParisTech, France

Session Chair: Robert F. Murphy
Abstract Show

Motivation: Recent technological advances allow the measurement,
in a single Hi-C experiment, of the frequencies of physical contacts
among pairs of genomic loci at a genome-wide scale. The next
challenge is to infer, from the resulting DNA-DNA contact maps,
accurate three dimensional models of how chromosomes fold and
fit into the nucleus. Many existing inference methods rely upon
multidimensional scaling (MDS), in which the pairwise distances of
the inferred model are optimized to resemble pairwise distances
derived directly from the contact counts. These approaches, however,
often optimize a heuristic objective function and require strong
assumptions about the biophysics of DNA to transform interaction
frequencies to spatial distance, and thereby may lead to incorrect
structure reconstruction.

Methods: We propose a novel approach to infer a consensus three-
dimensional structure of a genome from Hi-C data. The method
incorporates a statistical model of the contact counts, assuming
that the counts between two loci follow a Poisson distribution whose
intensity decreases with the physical distances between the loci. The
method can automatically adjust the transfer function relating the
spatial distance to the Poisson intensity and infer a genome structure
that best explains the observed data.

Results: We compare two variants of our Poisson method, with or
without optimization of the transfer function, to four different MDS-
based algorithms—two metric MDS methods using different stress
functions, a nonmetric version of MDS, and ChromSDE, a recently
described, advanced MDS method—on a wide range of simulated
datasets. We demonstrate that the Poisson models reconstruct better
structures than all MDS-based methods, particularly at low coverage
and high resolution, and we highlight the importance of optimizing
the transfer function. On publicly available Hi-C data from mouse
embryonic stem cells, we show that the Poisson methods lead to
more reproducible structures than MDS-based methods when we
use data generated using different restriction enzymes, and when we
reconstruct structures at different resolutions.

Availability: A Python implementation of the proposed method is
available at

PP68 New Directions for Diffusion-Based Network Prediction of Protein Function: Incorporating Pathways with Confidence
Room: 302
Date: Tuesday, July 15, 12:00 pm - 12:25 pm

Mengfei Cao, Tufts University, United States
Christopher Pietras, Tufts University, United States
Xian Feng, Tufts University, United States
Kathryn Doroschak, University of Minnesota, United States
Thomas Schaffner, Tufts University, United States
Jisoo Park, Tufts University, United States
Hao Zhang, Tufts University, United States
Lenore Cowen, Tufts University, United States
Benjamin Hescott, Tufts University, United States

Session Chair: Predrag Radivojac
Abstract Show

Motivation: It has long been hypothesized that incorporating models of network
noise as well as edge directions and known pathway information
into the representation of protein-protein interaction networks might
improve their utility for functional inference. However, a simple way
to do this has not been obvious. We find that DSD,
our recent diffusion-based metric for measuring dissimilarity in protein-protein
interaction (PPI) networks, has natural extensions that incorporate
confidence, directions, and can even express coherent pathways by
calculating DSD on an augmented graph.

Results: We define three incremental versions of DSD which we term cDSD, caDSD,
and capDSD, where the capDSD matrix incorporates confidence, known
directed edges, and pathways into the measure of how similar each
pair of nodes is according to the structure of the PPI network. We
test four popular function prediction methods (majority
vote, weighted majority vote, multiway cut, and functional flow) using these different
matrices on the Baker's yeast PPI network in cross-validation. The
best performing method is weighted majority vote using capDSD.
We then test the performance of our augmented DSD
methods on an integrated heterogeneous set of protein association
edges from the STRING database. The superior performance of
capDSD in this context confirms that treating the pathways as
probabilistic units is more powerful than simply incorporating pathway
edges independently into the network.

Availability: All source code for calculating the confidences,
for extracting pathway information from KEGG XML files, and for
calculating the cDSD, caDSD and capDSD matrices is available from

PP71 Inferring Gene Ontologies from Pairwise Similarity Data
Room: 304
Date: Tuesday, July 15, 2:00 pm - 2:25 pm

Michael Kramer, University of California, San Diego, United States
Janusz Dutkowski, University of California, San Diego, United States
Michael Yu, University of California, San Diego, United States
Vineet Bafna, University of California, San Diego, United States
Trey Ideker, University of California, San Diego, United States

Session Chair: Lenore Cowen
Abstract Show

Motivation: While the manually curated Gene Ontology (GO) is widely used, inferring a GO directly from -omics data is a compelling new problem. Recognizing that ontologies are a directed acyclic graph (DAG) of terms and hierarchical relations, algorithms are needed that:

(1) analyze a full matrix of gene–gene pairwise similarities from -omics data;
(2) infer true hierarchical structure in these data rather than enforcing hierarchy as a computational artifact; and
(3) respect biological pleiotropy, by which a term in the hierarchy can relate to multiple higher level terms.

Methods addressing these requirements are just beginning to emerge—none has been evaluated for GO inference.

Methods: We consider two algorithms [Clique Extracted Ontology (CliXO), LocalFitness] that uniquely satisfy these requirements, compared with methods including standard clustering. CliXO is a new approach that finds maximal cliques in a network induced by progressive thresholding of a similarity matrix. We evaluate each method’s ability to reconstruct the GO biological process ontology from a similarity matrix based on (a) semantic similarities for GO itself or (b) three -omics datasets for yeast.

Results: For task (a) using semantic similarity, CliXO accurately reconstructs GO (>99% precision, recall) and outperforms other approaches (<20% precision, <20% recall). For task (b) using -omics data, CliXO outperforms other methods using two -omics datasets and achieves ~30% precision and recall using YeastNet v3, similar to an earlier approach (Network Extracted Ontology) and better than LocalFitness or standard clustering (20–25% precision, recall).

Conclusion: This study provides algorithmic foundation for building gene ontologies by capturing hierarchical and pleiotropic structure embedded in biomolecular data.

PP72 Evaluating Synteny for Improved Comparative Studies
Room: 302
Date: Tuesday, July 15, 2:00 pm - 2:25 pm

Cristina G. Ghiurcuta, EPFL, Switzerland
Bernard M.E. Moret, EPFL, Switzerland

Session Chair: Alex Bateman
Abstract Show

Motivation: Comparative genomics aims to understand the structure and
function of genomes by translating knowledge gained about some genomes
to the object of study. Early approaches used pairwise comparisons, but
today researchers are attempting to leverage the larger potential of multiway
comparisons. Comparative genomics relies on the structuring of genomes into
syntenic blocks: blocks of sequence that exhibit conserved features across the
genomes. Syntenic blocks are required for complex computations to scale to
the billions of nucleotides present in many genomes; they enable comparisons
across broad ranges of genomes because they filter out much of the individual
variability; they highlight candidate regions for in-depth studies; and they
facilitate whole-genome comparisons through visualization tools. However, the
concept of syntenic block remains loosely defined. Tools for the identification
of syntenic blocks yield quite different results, thereby preventing a systematic
assessment of the next steps in an analysis. Current tools do not include
measurable quality objectives and thus cannot be benchmarked against
themselves. Comparisons among tools have also been neglected—what few
results are given use superficial measures unrelated to quality or consistency.
Results: We present a theoretical model as well as an experimental basis with
quality measures for comparing syntenic blocks and thus also for improving
or designing tools for the identification of syntenic blocks. We illustrate the
application of the model and the measures by applying them to syntenic blocks
produced by 3 different contemporary tools (DRIMM-Synteny, i-ADHoRe and
Cyntenator) on a dataset of 8 yeast genomes. Our findings highlight the need
for a well founded, systematic approach to the decomposition of genomes into
syntenic blocks. Our experiments demonstrate widely divergent results among
these tools, throwing into question the robustness of the basic approach in
comparative genomics. We have taken the first step towards a formal approach
to the construction of syntenic blocks by developing a simple quality criterion
based on sound evolutionary principles.

PP74 Scale-space measures for graph topology link protein network architecture to function
Room: 311
Date: Tuesday, July 15, 2:30 pm - 2:55 pm

Marc Hulsman, Delft University of Technology, Netherlands
Christos Dimitrakopoulos, ETH Zurich, Switzerland
Jeroen de Ridder, Delft University of Technology, Netherlands

Session Chair: Cenk Sahinalp
Abstract Show

Motivation: The network architecture of physical protein interactions is an important determinant for the molecular functions that are carried out within each cell. To study this relation, the network architecture can be characterized by graph-topological characteristics such as shortest paths and network hubs. These characteristics have an important shortcoming: they do not take into account that interactions occur across different scales. This is important since some cellular functions may involve a single direct protein interaction (small scale) while others require more and/or indirect interactions, such as protein complexes (medium scale) and interactions between large modules of proteins (large scale).

Results: In this work, we derive generalized, scale-aware versions of known graph-topological measures based on diffusion kernels. We apply these to characterize the topology of networks across all scales simultaneously, generating a so called graph-topological scale-space. The comprehensive physical interaction network in yeast is used to show that scale-space based measures consistently give superior performance when distinguishing protein functional categories and three major types of functional interactions: genetic interaction, co-expression and perturbation interactions. Moreover, we demonstrate that graph-topological scale-spaces capture biologically meaningful features that provides new insights into the link between function and protein network architecture.

Availability: Matlab code to calculate the STMs is available from: Contact:

PP75 Using association rule mining to determine promising secondary phenotyping hypotheses
CancelledRoom: 304
Date: Tuesday, July 15, 2:30 pm - 2:55 pm

Anika Oellrich, Wellcome Trust Sanger Institute, United States
Julius Jacobsen, Wellcome Trust Sanger Institute, United Kingdom
Irene Papatheodorou, Wellcome Trust Sanger Institute, United Kingdom
The Sanger Mouse Genetics Project, Wellcome Trust Sanger Institute, United Kingdom
Damian Smedley, Wellcome Trust Sanger Institute, United Kingdom

Session Chair: Lenore Cowen
Abstract Show

Motivation: Large-scale phenotyping projects such as the Sanger Mouse Genetics project are ongoing efforts to help to identify the influences of genes and their modification on phenotypes. Gene–phenotype relations are crucial to the improvement of our understanding of human heritable diseases as well as the development of drugs. However, given that there are about 20,000 genes in higher vertebrate genomes and the experimental verification of gene–phenotype relations requires a lot of resources, methods are needed that determine good candidates for testing.

Results: In this study, we applied an association rule mining approach to the identification of promising secondary phenotype candidates. The predictions rely on a large gene–phenotype annotation set that is used to find occurrence patterns of phenotypes. Applying an association rule mining approach, we could identify 1,967 secondary phenotype hypotheses that cover 243 genes and 136 phenotypes. Using two automated and one manual evaluation strategies, we demonstrate that the secondary phenotype candidates possess biological relevance to the genes they are predicted for. From the results we conclude that the predicted secondary phenotypes constitute good candidates to be experimentally tested and confirmed.
Availability: The secondary phenotype candidates can be browsed through at gene/secondaryphenotype/list.

PP76 Robust Clinical Outcome Prediction based on Bayesian Analysis of Transcriptional Profiles and Prior Causal Networks
Room: 302
Date: Tuesday, July 15, 2:30 pm - 2:55 pm

Kourosh Zarringhalam, UMass Boston/Pfizer, United States
Ahmed Enayetallah, Biogen Idec, United States
Padmalatha Reddy, Pfizer, United States
Daniel Ziemek, Pfizer, Germany

Session Chair: Alex Bateman
Abstract Show

Motivation: Understanding and predicting an individual’s response in a clinical trial is key to better treatments and cost effective medicine. Over the coming years, more and more large-scale omics datasets will become available to characterize patients with complex and heterogeneous diseases at a molecular level. Unfortunately, genetic, phenotypical, and environmental variation is much higher in a human trial population than currently modeled or measured in most animal studies. In our experience, this high variability can lead to failure of trained predictors in independent studies and undermines the credibility and utility of promising high-dimensional datasets.

Methods: We propose a method that utilizes patient-level genome- wide expression data in conjunction with causal networks based on prior knowledge. Our approach infers a differential expression profile for each patient and uses a Bayesian approach to infer corresponding upstream regulators. These regulators and their corresponding posterior probabilities of activity are used in a regularized regression framework to predict response.

Results: We validated our approach using two clinically relevant phenotypes, namely acute rejection in kidney transplantation and response to Infliximab in ulcerative colitis. To demonstrate pitfalls in translating trained predictors across independent trials, we analyze performance characteristics of our and alternative approaches on two independent datasets for each phenotype and show that the proposed approach is able to successfully incorporate causal prior knowledge to give robust performance estimates.

PP77 Detecting independent and recurrent copy number aberrations using interval graphs
Room: 312
Date: Tuesday, July 15, 2:30 pm - 2:55 pm

Hsin-Ta Wu, Brown University, United States
Iman Hajirasouliha, Brown University, United States
Benjamin Raphael, Brown University, United States

Session Chair: Paul Horton
Abstract Show

Somatic copy number aberrations are frequent in cancer genomes, but many of these are random, passenger events. A common strategy to distinguish functional aberrations from passengers is to identify those aberrations that are recurrent across multiple samples. However, the extensive variability in the length and position of copy number aberrations makes the problem of identifying recurrent aberrations notoriously difficult. We introduce a combinatorial approach to the problem of identifying independent and recurrent copy number aberrations, focusing on the key challenging of separating the overlaps in aberrations across individuals into independent events.

We derive independent and recurrent copy number aberrations as maximal cliques in an interval graph constructed from overlaps between aberrations. We efficiently enumerate all such cliques, and derive a dynamic programming algorithm to find an optimal selection of non-overlapping cliques, resulting in a very fast algorithm, which we call RAIG (Recurrent Aberrations from Interval Graphs). We show that RAIG outperforms other methods on simulated data and performs well on data from three cancer types from The Cancer Genome Atlas (TCGA). In contrast to existing approaches that employ various heuristics to select independent aberrations, RAIG optimizes a well-defined objective function. We show that this allows RAIG to identify rare aberrations that are likely functional, but are obscured by overlaps with larger passenger aberrations.

PP80 MIRA: Mutual Information-Based Reporter Algorithm for Metabolic Networks
Room: 302
Date: Tuesday, July 15, 3:00 pm - 3:25 pm

A. Ercument Cicek, Carnegie Mellon University, United States
Kathryn Roeder, Carnegie Mellon University, United States
Gultekin Ozsoyoglu, Case Western Reserve University, United States

Session Chair: Alex Bateman
Abstract Show

Motivation: Discovering the transcriptional regulatory architecture of the metabolism has been an important topic to understand the implications of transcriptional fluctuations on metabolism. The reporter algorithm (RA) was proposed to determine the hot spots in metabolic networks, around which transcriptional regulation is focused due to a disease or a genetic perturbation. Using a z-score based scoring scheme, RA calculates the average statistical change in the expression levels of genes that are neighbors to a target metabolite in the metabolic network. The RA approach has been used in numerous studies to analyze cellular responses to the downstream genetic changes. In this paper, we propose a mutual information-based multivariate reporter algorithm (MIRA) with the goal of eliminating the following problems in detecting reporter metabolites: (1) conventional statistical methods suffer from small sample sizes, (2) as z-score ranges from minus to plus infinity, calculating average scores can lead to canceling out opposite effects, and (3) analyzing genes one by one, then aggregating results can lead to information loss. MIRA is a multivariate and combinatorial algorithm that calculates the aggregate transcriptional response around a metabolite using mutual information. We show that MIRA’s results are biologically sound, empirically significant and more reliable than RA.

Results: We apply MIRA to gene expression analysis of six knock-out strains of E. coli, and show that MIRA captures the underlying metabolic dynamics of the switch from aerobic to anaerobic respiration. We also apply MIRA, to an Autism Spectrum Disorder gene expression dataset. Results indicate that MIRA’s reports metabolites that highly overlap with recently found metabolic biomarkers in the autism literature. Overall, MIRA is a promising algorithm for detecting metabolic drug targets and understanding the relation between gene expression and metabolic activity.

PP84 Metabolome-scale prediction of intermediate compounds in multi-step metabolic pathways with a recursive supervised approach
Room: 302
Date: Tuesday, July 15, 3:30 pm - 3:55 pm

Masaaki Kotera, Kyoto University, Japan
Yasuo Tabei, Japan Science and Technology Agency, Japan
Yoshihiro Yamanishi, Kyushu University, Japan
Ai Muto, Kyoto University, Japan
Yuki Moriya, Kyoto University, Japan
Toshiaki Tokimatsu, Kyoto University, Japan
Susumu Goto, Kyoto University, Japan

Session Chair: Alex Bateman
Abstract Show

Motivation: Metabolic pathway analysis is crucial not only in syste- matic metabolic engineering but also in rational drug design. Howe- ver, the biosynthetic/biodegradation pathways are known only for a small portion of metabolites, and a vast amount of pathways remain uncharacterized. Therefore, an important challenge in metabolomics is the de novo reconstruction of potential reaction networks on a metabolome-scale.

Results: In this paper we develop a novel method to predict the multi-step reaction sequences for de novo reconstruction of metabolic pathways in the reaction-filling framework. We propose a supervised approach to learn what we refer to as ”multi-step reaction sequence likeness”, i.e., whether or not a compound-compound pair is possibly converted to each other by a sequence of enzymatic reactions. In the algorithm we propose a recursive procedure of using step-specific classifiers to predict the intermediate compounds in the multi-step reaction sequences, based on chemical substructure fingerprints of compounds. In the results, we demonstrate the usefulness of our pro- posed method on the prediction of enzymatic reaction networks from a metabolome-scale compound set, and discuss characteristic featu- res of the extracted chemical substructure transformation patterns in multi-step reaction sequences. Our comprehensively predicted reac- tion networks help to fill the metabolic gap and to infer new reaction sequences in metabolic pathways.