Accepted Papers


PT01                                                                               Sunday, July 11: 10:45 a.m. - 11:10 a.m.
Low-homology protein threading
Room: 302
Subject: Protein Structure and Function

Author(s):
Jian Peng Toyota Technological Institute of Chicago, Chicago, IL
Jinbo Xu Toyota Technological Institute of Chicago, Chicago, IL

Presenting Author: Jian Peng

Motivation: The challenge of template-based modeling lies in the recognition of correct templates and generation of accurate sequence-template alignments. Homologous information has proved to be very powerful in detecting remote homologs, as demonstrated by the state-of-the-art profile-based method HHpred. However, HHpred does not fare well when the proteins under consideration are low-homology. A protein is low-homology if we cannot obtain a sufficient amount of homologous information for it from existing protein sequence databases. Results: We present a profile-entropy dependent scoring function for low-homology protein threading. This method models correlation among various protein features and determines their relative importance according to the amount of homologous information available. When the proteins under consideration are low-homology, our method relies more on structure information; otherwise, it relies more on homologous information. Experimental results indicate that our threading method greatly outperforms the best profile-based method HHpred and all the top CASP8 servers on low-homology proteins. Tested on the CASP8 hard targets, our threading method is also better than all the top CASP8 servers except Zhang-Server, to which it is slightly inferior. This is significant considering that Zhang-Server and other top CASP8 servers use a combination of multiple structure prediction techniques, including consensus methods, multiple-template modeling, template-free modeling and model refinement, while our method is a classical single-template-based threading method without any post-threading refinement.


PT02                                                                               Sunday, July 11: 10:45 a.m. - 11:10 a.m.
TEAM: Efficient Two-Locus Epistasis Tests in Human Genome-Wide Association Study
Room: 304
Subject: Population Genomics

Author(s):
Xiang Zhang Univ of NC, Chapel Hill, NC
Shunping Huang Univ of NC, Chapel Hill, NC
Fei Zou Univ of NC, Chapel Hill, NC
Wei Wang Univ of NC, Chapel Hill, NC

Presenting Author: Xiang Zhang

As a promising tool for identifying genetic markers underlying phenotypic differences, genome-wide association study (GWAS) has been extensively investigated in recent years. In GWAS, detecting epistasis (or gene-gene interaction) is preferable to single-locus study since many diseases are known to be complex traits. A brute force search is infeasible for epistasis detection at the genome-wide scale because of the intensive computational burden. Existing epistasis detection algorithms are designed for datasets consisting of homozygous markers and small sample sizes. In human studies, however, the genotype may be heterozygous, and the number of individuals can be up to thousands. Thus existing methods are not readily applicable to human datasets. In this paper, we propose an efficient algorithm, TEAM, that significantly speeds up epistasis detection for human GWAS. Our algorithm is exhaustive, i.e., it does not ignore any epistatic interaction. Utilizing the minimum spanning tree structure, the algorithm incrementally updates the contingency tables for epistatic tests without scanning all individuals. Our algorithm has broader applicability and is more efficient than existing methods for large sample studies. It supports any statistical test that is based on contingency tables, and enables control of both the family-wise error rate (FWER) and the false discovery rate (FDR). Extensive experiments show that our algorithm only needs to examine a small portion of the individuals to update the contingency tables, and it achieves at least an order of magnitude speedup over the brute force approach.
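
For orientation, the contingency table behind a two-locus test can be sketched as below. This is the brute-force baseline that scans every individual for one SNP pair, not TEAM's minimum-spanning-tree update. The 0/1/2 genotype coding, the binary phenotype, the choice of a Pearson chi-square statistic, and the function names are all assumptions made for illustration.

```python
from collections import Counter

def two_locus_table(snp_a, snp_b, phenotype):
    """Count individuals per (genotype_a, genotype_b, phenotype) cell.

    Genotypes are assumed to be coded 0/1/2 and the phenotype 0/1
    (control/case); both codings are illustrative assumptions.
    """
    if not (len(snp_a) == len(snp_b) == len(phenotype)):
        raise ValueError("all inputs must have one entry per individual")
    return Counter(zip(snp_a, snp_b, phenotype))

def chi_square(table):
    """Pearson chi-square statistic for genotype pair vs. phenotype."""
    total = sum(table.values())
    row = Counter()  # totals per observed genotype pair
    col = Counter()  # totals per observed phenotype class
    for (a, b, y), n in table.items():
        row[(a, b)] += n
        col[y] += n
    stat = 0.0
    for pair, pair_total in row.items():
        for y, class_total in col.items():
            expected = pair_total * class_total / total
            observed = table.get(pair + (y,), 0)
            stat += (observed - expected) ** 2 / expected
    return stat
```

A brute-force search would call these two functions once per SNP pair, which is the cost the abstract describes as infeasible at genome-wide scale.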


PT03                                                                               Sunday, July 11: 11:15 a.m. - 11:40 a.m.
Fragment-free Approach to Protein Folding Using Conditional Neural Fields
Room: 302
Subject: Protein Structure and Function

Author(s):
Feng Zhao Toyota Technological Institute of Chicago, Chicago, IL
Jian Peng Toyota Technological Institute of Chicago, Chicago, IL
Jinbo Xu Toyota Technological Institute of Chicago, Chicago, IL

Presenting Author: Feng Zhao

Motivation. One of the major bottlenecks in ab initio protein folding is the lack of an effective conformation sampling algorithm that can generate native-like conformations quickly. The popular fragment assembly method generates conformations by restricting the local conformations of a protein to short structural fragments in the PDB. This method may limit generated conformations to a subspace to which the native fold does not belong because 1) a protein with a truly new fold may contain some structural fragments not in the PDB; and 2) the discrete nature of fragments may prevent them from building a native-like fold. Previously we developed a Conditional Random Fields (CRF) method for fragment-free protein folding that can sample conformations in a continuous space and demonstrated that this CRF method compares favorably to the popular fragment assembly method. However, the CRF method is still limited in its capability to generate conformations compatible with a sequence. Results. We present a new fragment-free approach to protein folding using a recently invented probabilistic graphical model, Conditional Neural Fields (CNF). This new CNF method is much more powerful than CRF in modeling the sophisticated protein sequence-structure relationship and thus enables us to generate native-like conformations more easily. We show that when coupled with a simple energy function and Replica Exchange Monte Carlo simulation, our CNF method can generate much better decoys than CRF on a variety of test proteins including the CASP8 free-modeling targets.


PT04                                                                               Sunday, July 11: 11:15 a.m. - 11:40 a.m.
Robust Unmixing of Tumor States in Array Comparative Genomic Hybridization Data
Room: 304
Subject: Disease Models and Epidemiology

Author(s):
David Allan Tolliver Carnegie Mellon Univ, Pittsburgh, PA
Charalampos Tsourakakis Carnegie Mellon Univ, Pittsburgh, PA
Ayshwarya Subramanian Carnegie Mellon Univ, Pittsburgh, PA
Stanley Shackney Drexel Univ, Philadelphia, PA
Russell Schwartz Carnegie Mellon Univ, Pittsburgh, PA

Presenting Author: David Allan Tolliver

Motivation: Tumorigenesis is an evolutionary process by which tumor cells acquire sequences of mutations leading to increased growth, invasiveness, and eventually metastasis. It is hoped that by identifying the common patterns of mutations underlying major cancer sub-types, we can better understand the molecular basis of tumor development and identify new diagnostics and therapeutic targets. This goal has motivated several attempts to apply evolutionary tree reconstruction methods to assays of tumor state. Inference of tumor evolution is in principle aided by the fact that tumors are heterogeneous, retaining remnant populations from different stages of their development along with contaminating healthy cell populations. In practice, though, this heterogeneity complicates interpretation of tumor data because distinct cell types are conflated by common methods for assaying tumor state. We previously proposed a method to computationally infer cell populations from measures of tumor-wide gene expression through a geometric interpretation of mixture type separation, but this approach deals poorly with noisy and outlier data. Results: In the present work, we propose a new method to perform tumor mixture separation efficiently and robustly to experimental error. The method builds on the prior geometric approach but uses a novel objective function allowing for robust fits that greatly reduce the sensitivity to noise and outliers. We further develop an efficient gradient optimization method to optimize this "soft geometric unmixing" objective for measurements of tumor DNA copy numbers assessed by array comparative genomic hybridization (aCGH) data. We show, on a combination of semi-synthetic and real data, that the method yields fast and accurate separation of tumor states.
Conclusions: We have shown a novel objective function and optimization method for robust separation of tumor sub-types from aCGH data and have shown that the method provides fast, accurate reconstruction of tumor states from mixed samples. Better solutions to this problem can be expected to improve our ability to accurately identify genetic abnormalities in primary tumor samples and to infer patterns of tumor evolution.


PT05                                                                               Sunday, July 11: 11:45 a.m. - 12:10 p.m.
MOTIF-EM: an Automated Computational Tool for Identifying Conserved Regions in CryoEM Structures
Room: 302
Subject: Protein Structure and Function

Author(s):
Mitul Saha Univ of TX, Galveston, TX
Michael Levitt Stanford Univ, Stanford, CA
Wah Chiu Baylor College of Medicine, Houston, TX

Presenting Author: Mitul Saha

We present a new, first-of-its-kind, fully automated computational tool, MOTIF-EM, for identifying regions, domains, or motifs in cryoEM maps of large macromolecular assemblies (such as chaperonins, viruses, etc.) that remain conformationally conserved. As a by-product, regions in structures that are not conserved are revealed: this can indicate local molecular flexibility related to biological activity. MOTIF-EM takes cryoEM volumetric maps as inputs. The technique used by MOTIF-EM to detect conserved sub-structures is inspired by a recent breakthrough in 2D object recognition. The technique works by constructing rotationally invariant, low-dimensional representations of local regions in the input cryoEM maps. Correspondences are established between the reduced representations (by comparing them using a simple metric) across the input maps. The correspondences are clustered using hash tables, and graph theory is used to retrieve conserved structural domains or motifs. MOTIF-EM has been used to extract conserved domains occurring in large macromolecular assembly maps, including those of viruses P22 and epsilon 15, Ribosome 70S, and GroEL, that remain structurally conserved in different functional states. Our method can also be used to build atomic models for some maps. We also used MOTIF-EM to identify the conserved folds shared among dsDNA bacteriophages HK97, Epsilon 15, and 29, though they have low sequence similarity.


PT06                                                                               Sunday, July 11: 11:45 a.m. - 12:10 p.m.
Sparse Multitask Regression for Identifying Common Mechanism of Response to Therapeutic Targets
Room: 304
Subject: Disease Models and Epidemiology

Author(s):
Bahram Parvin Lawrence Berkeley National Laboratory, Berkeley, CA
Kai Zhang Lawrence Berkeley National Laboratory, Berkeley, CA
Joe Gray Lawrence Berkeley National Laboratory, Berkeley, CA

Presenting Author: Kai Zhang

Motivation: Molecular association of phenotypic responses is an important step in hypothesis generation and for initiating design of new experiments. Current practices for associating gene expression data with multidimensional phenotypic data are typically (i) performed one-to-one, i.e., each gene is examined independently with a phenotypic index, and (ii) tested with one stress condition at a time, i.e., different perturbations are analyzed separately. As a result, the complex coordination among the genes responsible for a phenotypic profile is potentially lost. More importantly, univariate analysis can potentially hide new insights into common mechanisms of response. Results: In this paper, we propose a sparse, multi-task regression model together with co-clustering analysis to explore the intrinsic grouping in associating the gene expression with phenotypic signatures. The global structure of association is captured by learning an intrinsic template that is shared among experimental conditions, with local perturbations introduced to integrate effects of therapeutic agents. We demonstrate the performance of our approach on both synthetic and experimental data. Synthetic data reveal that the multitask regression achieves a greater reduction in the regression error than traditional L1 and L2 regularized regression. Experiments with cell cycle inhibitors over a panel of 14 breast cancer cell lines, in turn, demonstrate the relevance of the computed molecular predictors to the cell cycle machinery, as well as the identification of hidden variables that are not captured by the baseline regression analysis. For example, the system has identified a hidden gene, CLCA2, as a common mechanism of response for two therapeutic agents that are clinically used for cell cycle arrest in cancer treatment.


PT07                                                                               Sunday, July 11: 12:15 p.m. - 12:40 p.m.
Thermodynamics of RNA structures by Wang-Landau sampling
Room: 302
Subject: Protein Structure and Function

Author(s):
Peter G Clote Boston College, Chestnut Hill, MA
Feng Lou University of Paris XI, Paris, France

Presenting Author: Peter G Clote

Motivation: Thermodynamics-based dynamic programming RNA secondary structure algorithms have been of immense importance in molecular biology, where applications range from the detection of novel selenoproteins using EST data, to the determination of microRNA genes and their targets. Dynamic programming algorithms have been developed to compute the minimum free energy secondary structure and partition function of a given RNA sequence, the minimum free energy and partition function for the hybridization of two RNA molecules, etc. However, the applicability of dynamic programming methods depends on disallowing certain types of interactions (pseudoknots, zig-zags, etc.), as their inclusion renders structure prediction an NP-complete problem. Nevertheless, such interactions have been observed in X-ray structures. Results: A non-Boltzmannian Monte Carlo algorithm was designed by Wang and Landau to estimate the density of states for complex systems, such as the Ising model, that exhibit a phase transition. In this paper, we apply the Wang-Landau (WL) method to compute the density of states for secondary structures of a given RNA sequence, and for hybridizations of two RNA sequences. Our method is shown to be much faster than existing software, such as RNAsubopt. From the density of states, we compute the partition function over all secondary structures and over all pseudoknot-free hybridizations. The advantage of the WL method is that by adding a function to evaluate the free energy of arbitrary pseudoknotted structures and of arbitrary hybridizations, we can estimate thermodynamic parameters for situations known to be NP-complete.
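
As a rough illustration of the Wang-Landau idea named in this abstract, the sketch below estimates a density of states for a toy system rather than for RNA structures. The toy system (bit strings whose "energy" is the number of 1 bits, so the exact density of states is a binomial coefficient), the parameter values, and the function name are assumptions chosen only to keep the example small and checkable; an RNA application would replace the bit-flip move and the energy with structure moves and a free energy function.

```python
import math
import random

def wang_landau(n_bits=8, flatness=0.8, ln_f_final=1e-3, seed=0,
                max_steps_per_stage=200_000):
    """Estimate ln g(E) for bit strings of length n_bits, where E is the
    number of 1 bits (toy system; exact g(E) = C(n_bits, E))."""
    rng = random.Random(seed)
    state = [0] * n_bits
    energy = 0
    ln_g = [0.0] * (n_bits + 1)  # running estimate of ln g(E)
    ln_f = 1.0                   # modification factor, halved after each stage
    while ln_f > ln_f_final:
        hist = [0] * (n_bits + 1)
        for step in range(1, max_steps_per_stage + 1):
            i = rng.randrange(n_bits)                # propose flipping one bit
            new_energy = energy + (1 - 2 * state[i])
            # accept with probability min(1, g(E_old) / g(E_new))
            delta = ln_g[energy] - ln_g[new_energy]
            if delta >= 0 or rng.random() < math.exp(delta):
                state[i] ^= 1
                energy = new_energy
            ln_g[energy] += ln_f                     # penalise the visited level
            hist[energy] += 1
            # stop the stage once the visit histogram is roughly flat
            if step % 1000 == 0 and min(hist) >= flatness * sum(hist) / len(hist):
                break
        ln_f /= 2.0
    # normalise so that the total number of states equals 2 ** n_bits
    shift = max(ln_g)
    ln_total = shift + math.log(sum(math.exp(x - shift) for x in ln_g))
    offset = n_bits * math.log(2) - ln_total
    return [x + offset for x in ln_g]
```

Given ln g(E), a partition function at temperature T follows by summing g(E) exp(-E / kT) over the energy levels, which is the step the abstract refers to when it derives the partition function from the density of states.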


PT08                                                                               Sunday, July 11: 12:15 p.m. - 12:40 p.m.
Inference of patient-specific pathway activities from multi-dimensional cancer genomics data using PARADIGM
Room: 304
Subject: Protein Interactions and Molecular Networks

Author(s):
Joshua Stuart UCSC, Santa Cruz, CA
Charles J. Vaske Princeton Univ, Princeton, NJ
Stephen C. Benz UCSC, Santa Cruz, CA
J. Zachary Sanborn UCSC, Santa Cruz, CA
Dent Earl UCSC, Santa Cruz, CA
Christopher Szeto UCSC, Santa Cruz, CA
Jingchun Zhu UCSC, Santa Cruz, CA
David Haussler UCSC, Santa Cruz, CA & HHMI

Presenting Author: Charles J. Vaske

Motivation: High-throughput data is providing a comprehensive view of the molecular changes in cancer tissues. New technologies allow for the simultaneous genome-wide assay of the state of genome copy number variation, gene expression, DNA methylation, and epigenetics of tumor samples and cancer cell lines. Analyses of current data sets find that genetic alterations between patients can differ but often involve common pathways. It is therefore critical to identify relevant pathways involved in cancer progression and detect how they are altered in different patients. Results: We present a novel method for inferring patient-specific genetic activities incorporating curated pathway interactions among genes. A gene is modeled by a factor graph as a set of interconnected variables encoding the expression and known activity of a gene and its products, allowing the incorporation of many types of -omic data as evidence. The method predicts the degree to which a pathway's activities (e.g. internal gene states, interactions, or high-level "outputs") are altered in the patient using probabilistic inference. Compared to a competing pathway activity inference approach called SPIA, our method identifies altered activities in cancer-related pathways with fewer false-positives in both a glioblastoma multiforme (GBM) and a breast cancer dataset. PARADIGM identified consistent pathway-level activities for subsets of the GBM patients that are overlooked when genes are considered in isolation. Further, grouping GBM patients based on their significant pathway perturbations divides them into clinically-relevant subgroups having significantly different survival outcomes. These findings suggest that therapeutics might be chosen that target genes at critical points in the commonly perturbed pathway(s) of a group of patients. Availability: Source code available at http://sbenz.github.com/Paradigm


PT09                                                                               Sunday, July 11: 2:30 p.m. - 2:55 p.m.
Recognition of Beta-Structural Motifs Using Hidden Markov Models Trained with Simulated Evolution
Room: 302
Subject: Protein Structure and Function

Author(s):
Lenore Cowen Tufts Univ, Medford, MA
Anoop Kumar Tufts Univ, Medford, MA

Presenting Author: Anoop Kumar

Motivation: One of the most successful methods to date for recognizing protein sequences that are evolutionarily related has been profile Hidden Markov Models. However, these models do not capture pairwise statistical preferences of residues that are hydrogen bonded in beta sheets. We thus explore methods for incorporating pairwise dependencies into these models. Results: We consider the remote homology detection problem for beta-structural motifs. In particular, we ask if a statistical model trained on members of only one family in a SCOP beta-structural superfamily can recognize members of other families in that superfamily. We show that HMMs trained with our pairwise model of simulated evolution achieve a median improvement of nearly 5% in AUC for beta-structural motif recognition as compared to ordinary HMMs. Availability: All datasets and HMMs are available at: http://bcb.cs.tufts.edu/pairwise


PT10                                                                               Sunday, July 11: 2:30 p.m. - 2:55 p.m.
Efficient identification of identical-by-descent status in pedigrees with many untyped individuals
Room: 304
Subject: Population Genomics

Author(s):
Xin Li Case Western Reserve Univ, Cleveland, OH
Xiaolin Yin Case Western Reserve Univ, Cleveland, OH
Jing Li Case Western Reserve Univ, Cleveland, OH

Presenting Author: Xin Li

Motivation: Inference of identical-by-descent (IBD) probabilities is key to family-based linkage analysis. Using high-density single nucleotide polymorphism (SNP) markers, one can almost always infer the haplotype configuration of each member of a family, provided all individuals are typed. Consequently, the IBD status can be obtained directly from haplotype configurations. However, in reality, many family members are not typed for practical reasons. The problem of IBD/haplotype inference is much harder when untyped individuals are treated as missing. Results: We present a novel hidden Markov model (HMM) approach to infer the IBD status in a pedigree with many untyped members using high-density SNP markers. We introduce the concept of the inheritance generating function, defined for any pair of alleles in a descent graph based on a pedigree structure. We derive a recursive formula for efficient calculation of the inheritance generating function. By aggregating all possible inheritance patterns via an explicit representation of the number and lengths of all possible paths between two alleles, the inheritance generating function provides a convenient way to theoretically derive the transition probabilities of the HMM. We further extend the basic HMM to incorporate population linkage disequilibrium (LD). Pedigree-wise IBD sharing can be constructed based on pair-wise IBD relationships. Compared to traditional approaches for linkage analysis, our new model can efficiently infer IBD status without enumerating all possible genotypes and transmission patterns of untyped members in a family. Our approach can be reliably applied to large pedigrees with many untyped members, and the inferred IBD status can be used for non-parametric genome-wide linkage analysis. Availability: The algorithm is implemented in Matlab and is freely available upon request.


PT11                                                                               Sunday, July 11: 3:00 p.m. - 3:25 p.m.
Markov Dynamic Models for Long-Timescale Protein Motion
Room: 302
Subject: Protein Structure and Function

Author(s):
Tsung-Han Chiang National University of Singapore, Singapore
David Hsu National University of Singapore, Singapore

Presenting Author: Tsung-Han Chiang

Molecular dynamics simulation is a well-established method for studying protein motion at the atomic scale. However, it is computationally intensive and generates massive amounts of data, the sheer size of which often becomes an obstacle to biological insights. One way of addressing the dual challenges of computation efficiency and data analysis is to construct simplified models of protein motion at long timescales, as many important kinetic and dynamic properties of proteins ultimately depend on such motions. In this direction, we propose to use Markov models with hidden states, in which the Markovian states represent potentially overlapping probabilistic distributions over a protein's conformation space. We also propose to evaluate the quality of a model by its ability to predict long-timescale protein motions. Our method was tested on 2-D synthetic energy landscapes and two extensively studied proteins, alanine dipeptide and villin. One interesting finding is that although a widely accepted model of alanine dipeptide contains 6 states, a simpler model with only 3 states is equally good for predicting the long-timescale motions. This finding highlights the importance of a principled criterion for evaluating model quality. We also used the constructed Markov models to estimate important kinetic and dynamic quantities for protein folding, in particular, mean first-passage time. The results are consistent with available experimental measurements.
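
The mean first-passage time mentioned above has a standard closed form for an ordinary discrete-time Markov chain, sketched below. This is the textbook calculation on a fully observed chain, not the authors' hidden-state model; the transition matrix layout (a row-stochastic list of lists), the unit of "steps", and the function name are assumptions for illustration.

```python
def mean_first_passage_times(P, target):
    """Expected number of steps to first reach `target` from each state.

    P is a row-stochastic transition matrix given as a list of lists.
    Solves (I - Q) t = 1, where Q is P restricted to non-target states.
    """
    states = [s for s in range(len(P)) if s != target]
    n = len(states)
    # augmented matrix [I - Q | 1]
    A = [[(1.0 if i == j else 0.0) - P[si][sj] for j, sj in enumerate(states)]
         + [1.0]
         for i, si in enumerate(states)]
    for col in range(n):  # Gauss-Jordan elimination with partial pivoting
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("singular system: target may be unreachable "
                             "from some state")
        A[col], A[pivot] = A[pivot], A[col]
        pivot_value = A[col][col]
        A[col] = [x / pivot_value for x in A[col]]
        for r in range(n):
            if r != col:
                factor = A[r][col]
                A[r] = [x - factor * y for x, y in zip(A[r], A[col])]
    times = {target: 0.0}
    times.update({s: A[i][n] for i, s in enumerate(states)})
    return times
```

For a folding model, `target` would be the state identified with the folded conformation, and the step count would be multiplied by the model's time step to obtain a physical time.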


PT12                                                                               Sunday, July 11: 3:00 p.m. - 3:25 p.m.
Efficient Genome Ancestry Inference in Complex Pedigrees with Inbreeding
Room: 304
Subject: Population Genomics

Author(s):
Eric Yi Liu Univ of NC, Chapel Hill, NC
Qi Zhang Univ of WA, Seattle, WA
Leonard McMillan Univ of NC, Chapel Hill, NC
Fernando Pardo Manuel de Villena Univ of NC, Chapel Hill, NC
Wei Wang Univ of NC, Chapel Hill, NC

Presenting Author: Eric Yi Liu

High-density SNP data of model animal resources provides opportunities for fine-resolution genetic variation studies. These genetic resources are generated through a variety of breeding schemes that involve multiple generations of matings derived from a set of founder animals. In this paper we investigate the problem of inferring the most probable ancestry of resulting genotypes, given a set of founder genotypes. Due to computational difficulty, existing methods either handle only small pedigree data or disregard the pedigree structure. We present an accurate and efficient method that can accept large and complex pedigrees in inferring genome ancestry. Our method builds a Hidden Markov model leveraging the observation that large pedigrees often contain repetitive sub-structures whose ancestry probabilities can be derived accurately without explicit modeling of every generation. As an exemplar of our method, we show how recombination events over many generations of inbreeding may be modeled accurately with a pair of quaternary indicators. The ancestry inference is accurate and fast, independent of the number of generations, for model animal resources such as the Collaborative Cross (CC). Experiments on both simulated and real CC data demonstrate that our method offers comparable accuracy to methods that build an explicit model of the entire pedigree, but much better scalability with respect to the pedigree size.


PT13                                                                               Sunday, July 11: 3:30 p.m. - 3:55 p.m.
A fast mathematical programming procedure for simultaneous fitting of assembly components into cryo-EM density maps
Room: 302
Subject: Protein Structure and Function

Author(s):
Frank Alber USC, Los Angeles, CA
Shihua Zhang USC, Los Angeles, CA
Daven Vasishtan Univ of London, London, UK
Min Xu USC, Los Angeles, CA
Maya Topf Univ of London, London, UK

Presenting Author: Shihua Zhang

Motivation: Single-particle cryo-electron microscopy (cryoEM) produces density maps of macromolecular assemblies at intermediate to low resolution (~4-30 Å). By fitting high-resolution structures of assembly components into these maps, pseudo-atomic models can be obtained. Optimizing the quality-of-fit of all components simultaneously is challenging due to the large search space, which makes the exhaustive search over all possible component configurations computationally infeasible. Results: We developed an efficient mathematical programming algorithm that simultaneously fits all component structures into an assembly density map. The fitting is formulated as a point set matching problem involving several point sets that represent component and assembly densities at a reduced complexity level. In contrast to other point matching algorithms, our algorithm is able to match multiple point sets simultaneously and not only based on their geometrical equivalence, but also based on the similarity of the density in the immediate point neighborhood. In addition, we present an efficient refinement method based on the Iterated Closest Point (ICP) registration algorithm. The method generates an assembly configuration in a few seconds. This efficiency allows the generation of an ensemble of candidate solutions that can be assessed by an independent scoring function. We benchmarked the method using simulated density maps of 11 protein assemblies at 20 Å, and an experimental cryoEM map at 23.5 Å resolution. Our method was able to generate assembly structures with RMS errors <6.5 Å, which have been further reduced to <1.8 Å by the local refinement procedure.


PT14                                                                               Sunday, July 11: 3:30 p.m. - 3:55 p.m.
Multi-Population GWA Mapping via Multi-Task Regularized Regression
Room: 304
Subject: Population Genomics

Author(s):
Kriti Puniyani Carnegie Mellon Univ, Pittsburgh, PA
Seyoung Kim Carnegie Mellon Univ, Pittsburgh, PA
Eric P. Xing Carnegie Mellon Univ, Pittsburgh, PA

Presenting Author: Kriti Puniyani

Population heterogeneity through admixing of different founder populations can produce spurious associations in genome-wide association studies that are linked to the population structure rather than the phenotype. Since samples from the same population generally co-evolve, different populations may or may not share the same genetic underpinnings for the seemingly common phenotype. Our goal is to develop a unified framework for detecting causal genetic markers through a joint association analysis of multiple populations. Based on a multi-task regression principle, we present a multi-population group lasso algorithm using L_1/L_2-regularized regression for joint association analysis of multiple populations that are stratified either via population survey or computational estimation. Our algorithm combines information from genetic markers across populations to identify causal markers. It also implicitly accounts for correlations between the genetic markers, thus enabling better control over false positive rates. Joint analysis across populations enables the detection of weak associations common to all populations with greater power than in a separate analysis of each population. At the same time, the regression-based framework allows causal alleles that are unique to a subset of the populations to be correctly identified. We demonstrate the effectiveness of our method on HapMap-simulated and lactase persistence datasets, where we significantly outperform state-of-the-art methods, with greater power for detecting weak associations and fewer spurious associations.
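
For readers unfamiliar with L_1/L_2 regularization, the generic multi-task group lasso objective has the form below. This is the textbook formulation, written under assumed notation (K populations, J markers, genotype matrix X^{(k)} and phenotype vector y^{(k)} for population k, regression coefficient \beta_j^{(k)} for marker j in population k, tuning parameter \lambda); the paper's exact objective may differ.

```latex
\min_{\beta^{(1)},\dots,\beta^{(K)}}
  \sum_{k=1}^{K} \frac{1}{2}
    \bigl\lVert y^{(k)} - X^{(k)} \beta^{(k)} \bigr\rVert_2^2
  \;+\; \lambda \sum_{j=1}^{J}
    \sqrt{\sum_{k=1}^{K} \bigl(\beta_j^{(k)}\bigr)^2}
```

The penalty takes an L_2 norm over populations for each marker and then an L_1 sum over markers, so a marker tends to be selected or dropped for all populations together while its effect size can still vary between them.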


PT15                                                                               Sunday, July 11: 4:00 p.m. - 4:25 p.m.
Open MS/MS Spectral Library Search to Identify Unanticipated Post-Translational Modifications and Increase Spectral Identification Rate
Room: 302
Subject: Other

Author(s):
Ding Ye Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
Yan Fu Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
Rui-Xiang Sun Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
Hai-Peng Wang Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
Zuo-Fei Yuan Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
Hao Chi Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
Si-Min He Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China

Presenting Author: Ding Ye

Identification of post-translationally modified proteins has become one of the central issues of current proteomics. Spectral library search is a new and promising computational approach to mass spectrometry-based protein identification. However, its potential in identification of unanticipated post-translational modifications has rarely been explored. The existing spectral library search tools are designed to match the query spectrum to the reference library spectra with the same peptide mass. Thus, spectra of peptides with unanticipated modifications cannot be identified. In this paper, we present an open spectral library search tool, named pMatch. It extends the existing library search algorithms in at least three aspects to support the identification of unanticipated modifications. First, the spectra in the library are optimized with the full peptide sequence information to better tolerate the peptide fragmentation pattern variations caused by some modification(s). Second, a new scoring system is devised, which uses charge-dependent mass shifts for peak matching and combines a probability-based model with the general spectral dot-product for scoring. Third, a target-decoy strategy is used for false discovery rate control. To demonstrate the effectiveness of pMatch, a library search experiment was conducted on a public dataset with over 40,000 spectra in comparison with SpectraST, the most popular library search engine. Additional validations were done on four published datasets including over 150,000 spectra. The results showed that pMatch can effectively identify unanticipated modifications and significantly increase the spectral identification rate.
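
The "general spectral dot-product" referred to above can be sketched as a normalized dot product over binned peak lists. This covers only that generic component; it does not implement pMatch's charge-dependent mass shifts or its probability-based model. The peak representation as (m/z, intensity) pairs, the 1.0 bin width, and the function names are assumptions for illustration.

```python
import math

def bin_spectrum(peaks, bin_width=1.0):
    """Sum peak intensities into m/z bins of the given width."""
    binned = {}
    for mz, intensity in peaks:
        key = int(mz // bin_width)
        binned[key] = binned.get(key, 0.0) + intensity
    return binned

def spectral_dot_product(query, reference, bin_width=1.0):
    """Normalized dot product (cosine similarity) of two peak lists."""
    q = bin_spectrum(query, bin_width)
    r = bin_spectrum(reference, bin_width)
    dot = sum(v * r.get(k, 0.0) for k, v in q.items())
    norm = (math.sqrt(sum(v * v for v in q.values()))
            * math.sqrt(sum(v * v for v in r.values())))
    return dot / norm if norm > 0 else 0.0
```

Because a modification shifts the fragment peaks that carry it, a plain dot product between a modified query and its unmodified library spectrum drops, which is the limitation that motivates shift-aware peak matching in an open search.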


PT16                                                                               Sunday, July 11: 4:00 p.m. - 4:25 p.m.
Estimating genomewide IBD-sharing from SNP data via an efficient hidden Markov model of LD with application to gene mapping
Room: 304
Subject: Population Genomics

Author(s):
Sivan Bercovici Technion, Haifa, Israel
Dan Geiger Technion, Haifa, Israel
Christopher Meek Microsoft Research, Redmond, Wa
Ydo Wexler Microsoft Research, Redmond, Wa

Presenting Author: Sivan Bercovici

We develop a factorial HMM based algorithm for computing genomewide IBD-sharing. The algorithm accepts as input SNP data of measured individuals and estimates the probability of IBD at each locus for every pair of individuals. For two g-degree relatives, when g>2, the computation yields a precision of IBD tagging at least 10% higher than previous methods for the same level of recall (>95%). Our algorithm uses a first order Markovian model for the LD process and employs a reduction of the state space of the inheritance vector from being exponential in g to quadratic. The higher accuracy along with the reduced time complexity marks our method as a feasible means for IBD mapping in practical scenarios. A software implementation, called IBDmap, is freely available.


PT17                                                                               Monday, July 12: 10:45 a.m. - 11:10 a.m.
TRStalker: an Efficient Heuristic for Finding Fuzzy Tandem Repeats
Room: 302
Subject: Sequence Analysis

Author(s):
Marco Pellegrini Institute for Informatics and Telematics of C.N.R., Pisa, Italy
M. Elena Renda Institute for Informatics and Telematics of C.N.R., Pisa, Italy
Alessio Vecchio University of Pisa, Pisa, Italy

Presenting Author: Marco Pellegrini

Genomes in higher eukaryotic organisms contain a substantial amount of (almost) repeated sequences. Tandem Repeats (TRs) constitute a large class of repetitive sequences that originate via phenomena such as replication slippage and are characterized by close spatial contiguity. They play an important role in several molecular regulatory mechanisms, and also in several diseases (e.g. in the group of trinucleotide repeat disorders). While current methods are rather effective for tandem repeats with a low or medium level of divergence, the problem of detecting TRs with higher divergence (fuzzy TRs) is still open. The detection of fuzzy TRs is propaedeutic to enriching our view of their role in regulatory mechanisms and diseases. Fuzzy TRs are also important as tools to shed light on the evolutionary history of the genome, where higher divergence correlates with more remote duplication events. We have developed an algorithm (christened TRStalker) with the aim of efficiently detecting Tandem Repeats that are hard to detect because of their inherent fuzziness, due to high levels of base substitutions, insertions and deletions. To attain this goal we developed heuristics to solve a Steiner version of the problem for which the fuzziness is measured with respect to a motif string not necessarily present in the input string. This problem is akin to the "generalized median string" problem, which is known to be NP-hard. Experiments with both synthetic and biological sequences demonstrate that our method performs better than the current state of the art for fuzzy TRs and that the fuzzy TRs of the type we detect are indeed present in important biological sequences.


PT18                                                                               Monday, July 12: 10:45 a.m. - 11:10 a.m.
Analyzing circadian expression data by harmonic regression based on autoregressive spectral estimation
Room: 304
Subject: Gene Regulation and Transcriptomics

Author(s):
Rendong Yang China Agricultural Univ, Beijing, China
Zhen Su China Agricultural Univ, Beijing, China

Presenting Author: Rendong Yang

Motivation: Circadian rhythms are prevalent in most organisms. Identification of circadian regulated genes is a crucial step in discovering underlying pathways and processes that are clock controlled. Such genes are largely detected by searching for periodic patterns in microarray data. However, temporal gene expression profiles are usually short time-series with low sampling frequency and heavy noise contamination. This makes circadian rhythmic analysis of temporal microarray data a very challenging problem. Results: We propose an algorithm named ARSER which combines time domain and frequency domain analysis for extracting and characterizing rhythmic expression profiles from temporal microarray data. ARSER employs autoregressive spectral estimation to predict an expression profile's periodicity from the frequency spectrum and then models the rhythmic patterns by fitting a harmonic regression model to the time series. ARSER describes the rhythmic patterns by four parameters: period, phase, amplitude and mean level, and measures the multiple testing significance by false discovery rate q-value. Tested on well-defined periodic and non-periodic short time-series data, we found ARSER superior to two existing widely-used methods, COSOPT and Fisher's G-test, in identifying sinusoidal and non-sinusoidal periodic patterns in short, noisy and non-stationary time series. Finally, we applied ARSER to analyze Arabidopsis microarray data and identified a novel set of previously undetected non-sinusoidal periodic transcripts which may lead to new insights about the molecular mechanisms regulated by the circadian rhythm. Availability: ARSER is implemented in Python and R. All source code is available from http://bioinformatics.cau.edu.cn/ARSER
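
The harmonic regression step mentioned in the abstract can be illustrated with a minimal sketch. It is not the ARSER code: it assumes the period is already known (ARSER estimates it by autoregressive spectral estimation, which is omitted here), fits a single harmonic by ordinary least squares, and uses hypothetical function names (solve_linear, fit_harmonic) chosen for illustration only.

```python
import math


def solve_linear(a, b):
    """Solve the small dense system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def fit_harmonic(times, values, period):
    """Least-squares fit of y(t) = mean + amplitude * cos(2*pi*t/period - phase).

    The model is linearised as mean + a*cos(wt) + b*sin(wt), with
    a = amplitude*cos(phase) and b = amplitude*sin(phase).
    """
    omega = 2.0 * math.pi / period
    design = [[1.0, math.cos(omega * t), math.sin(omega * t)] for t in times]
    # Normal equations (X^T X) beta = X^T y for the three coefficients.
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(3)] for i in range(3)]
    xty = [sum(row[i] * y for row, y in zip(design, values)) for i in range(3)]
    mean, a, b = solve_linear(xtx, xty)
    return {"mean": mean, "amplitude": math.hypot(a, b), "phase": math.atan2(b, a)}
```

Given samples taken every 4 hours over two days and an assumed 24-hour period, fit_harmonic returns three of the four parameters listed in the abstract (phase, amplitude and mean level); the period is supplied by the caller.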


PT19                                                                               Monday, July 12: 11:15 a.m. - 11:40 a.m.
Model-Based Detection of Alternative Splicing Signals
Room: 302
Subject: Sequence Analysis

Author(s):
Yoseph Barash Univ of Toronto, Toronto, Canada
Benjamin J. Blencowe Univ of Toronto, Toronto, Canada
Brendan J. Frey Univ of Toronto, Toronto, Canada

Presenting Author: Yoseph Barash

Transcripts from approximately 95% of human multi-exon genes are subject to alternative splicing (AS). The growing interest in AS is propelled by its prominent contribution to transcriptome and proteome complexity and because of the role of aberrant AS in numerous diseases. Recent technological advances enable thousands of exons to be simultaneously profiled across diverse cell types and cellular conditions, but require accurate identification of condition-specific splicing changes. It is necessary to accurately identify such splicing changes to elucidate the underlying regulatory programs or link the splicing changes to specific diseases. We present a probabilistic model tailored for high-throughput AS data, where observed isoform levels are explained as combinations of condition-specific AS signals. According to our formulation, given an AS dataset our tasks are to detect common signals in the data and identify the exons relevant to each signal. Our model can incorporate prior knowledge about underlying AS signals, measurement quality and gene expression level effects. Using a large-scale multi-tissue AS dataset, we demonstrate the advantage of our method over standard alternative approaches. In addition, we describe newly-found tissue-specific AS signals which were verified experimentally, and discuss associated regulatory features.


PT20                                                                               Monday, July 12: 11:15 a.m. - 11:40 a.m.
Multivariate multi-way analysis of multi-source data
Room: 304
Subject: Other

Author(s):
Ilkka Huopaniemi Aalto Univ, Aalto, Finland
Tommi Suvitaival Aalto Univ, Aalto, Finland
Janne Nikkilä Aalto Univ & Univ of Helsinki, Aalto & Helsinki, Finland
Matej Oresic VTT Technical Research Center, Finland
Samuel Kaski Aalto Univ, Aalto, Finland

Presenting Author: Ilkka Huopaniemi

Motivation: Multi-way ANOVA-type methods are the default tool for the analysis of data with multiple covariates. These tools have been generalized to the multivariate analysis of high-throughput biological datasets, where the main challenge is the problem of small sample-size and high dimensionality. However, the existing multi-way analysis methods are not designed for experiments where data is obtained from multiple sources. Common examples of such settings include integrated analysis of metabolic and gene expression profiles, or metabolic profiles from several tissues in our case, in a controlled multi-way experimental setup where disease status, medical treatment, gender and time-series are usual covariates. Results: We extend the applicability area of multivariate, multi-way ANOVA-type methods to multi-source cases by introducing a novel Bayesian model. The method is capable of finding covariate-related dependencies between the sources. It assumes the measurements consist of groups of similarly behaving variables, and estimates the multivariate covariate effects and their interaction effects for the discovered groups of variables. In particular, the method partitions the effects to those shared between the sources and to source-specific ones. The method is specifically designed for datasets with small sample-sizes and high dimensionality. We apply the method to a lipidomics data set from a lung cancer study with two-way experimental setup, where measurements from several tissues with mostly distinct lipids have been taken. The method is also directly applicable to gene expression and proteomics.


PT21                                                                               Monday, July 12: 11:45 a.m. - 12:10 p.m.
A dynamic Bayesian network for identifying protein binding footprints from single molecule based sequencing data
Room: 302
Subject: Sequence Analysis

Author(s):
Xiaoyu Chen Univ of WA, Seattle, WA
Michael Hoffman Univ of WA, Seattle, WA
Jeff Bilmes Univ of WA, Seattle, WA
Jay Hesselberth Univ of WA, Seattle, WA
William Noble Univ of WA, Seattle, WA

Presenting Author: Xiaoyu Chen

Motivation: A global map of transcription factor binding sites is critical to understanding gene regulation and genome function. DNaseI digestion of chromatin coupled with massively parallel sequencing (digital genomic footprinting) enables the identification of protein binding footprints with high resolution on a genome-wide scale. However, accurately inferring the locations of these footprints remains a challenging computational problem. Results: We present a dynamic Bayesian network-based approach for the identification and assignment of statistical confidence estimates to protein binding footprints from digital genomic footprinting data. The method, DBFP, allows footprints to be identified in a probabilistic framework and outperforms our previously described algorithm in terms of precision at a fixed recall. Applied to a digital footprinting data set from S. cerevisiae, DBFP identifies 4,679 statistically significant footprints within intergenic regions. These footprints are mainly located near transcription start sites and are strongly enriched for known transcription factor binding sites. Footprints containing no known motif are preferentially located proximal to other footprints, consistent with cooperative binding of these footprints. DBFP also identifies a set of statistically significant footprints in the yeast coding regions. Many of these footprints coincide with the boundaries of antisense transcripts, and the most significant footprints are enriched for binding sites of the chromatin-associated factors Abf1 and Rap1.


PT22                                                                               Monday, July 12: 11:45 a.m. - 12:10 p.m.
Discovering Transcriptional Modules by Bayesian Data Integration
Room: 304
Subject: Gene Regulation and Transcriptomics

Author(s):
David Wild Univ of Warwick, Coventry, UK
Richard Savage Univ of Warwick, Coventry, UK
Zoubin Ghahramani Univ of Cambridge, Cambridge, UK
Jim Griffen Univ of Kent, Canterbury, Kent, UK
Bernard de la Cruz San Diego, CA

Presenting Author: Richard Savage

Motivation: We present a method for directly inferring transcriptional modules by integrating gene expression and transcription factor binding (ChIP-chip) data. Our model extends a hierarchical Dirichlet Process (infinite mixture) model to allow data fusion on a gene-by-gene basis. This encodes the intuition that co-expression and co-regulation are not necessarily equivalent and hence we do not expect all genes to group similarly in both data sets. In particular, it allows us to identify the subset of genes that share the same structure of transcriptional modules in both data sets. Results: We find that by working on a gene-by-gene basis, our model is able to extract clusters with greater functional coherence than existing methods. By combining gene expression and transcription factor binding (ChIP-chip) data in this way, we are better able to determine the groups of genes that are most likely to represent underlying transcriptional modules.


PT23                                                                               Monday, July 12: 12:15 p.m. - 12:40 p.m.
A statistical method for the detection of variants from next-generation resequencing of DNA pools
Room: 302
Subject: Sequence Analysis

Author(s):
Vikas Bansal Scripps Translational Science Institute, La Jolla, CA

Presenting Author: Vikas Bansal

Next-generation sequencing technologies have enabled the sequencing of several human genomes in their entirety. However, the routine resequencing of complete genomes remains infeasible. The massive capacity of next-generation sequencers can be harnessed for sequencing specific genomic regions in hundreds to thousands of individuals. Sequencing-based association studies are currently limited by the low level of multiplexing offered by sequencing platforms. Pooled sequencing represents a cost-effective approach for studying rare variants in large populations. To utilize the power of DNA pooling, it is important to accurately identify sequence variants from pooled sequencing data. Detection of rare variants from pooled sequencing represents a different challenge than detection of variants from individual sequencing. We describe a novel statistical method, CRISP (Comprehensive Read analysis for Identification of SNPs from Pooled sequencing), that is able to identify both rare and common variants by using two approaches: (a) comparing the distribution of allele counts across multiple pools using contingency tables and (b) evaluating the probability of observing multiple non-reference base-calls due to sequencing errors alone. Information about the distribution of reads between the forward and reverse strands and the size of the pools is also incorporated within this framework to filter out false variants. Validation of CRISP on two separate pooled sequencing datasets generated using the Illumina GA demonstrates that it can detect 80-85% of SNPs identified using individual sequencing while achieving a low false positive rate (3-5%). Comparison to previous methods for pooled SNP detection demonstrates the significantly lower false positive and false negative rates for CRISP. Implementation of this method is available at http://polymorphism.scripps.edu/~vbansal/CRISP/
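
Approach (a) in the abstract, comparing allele counts across pools with a contingency table, can be illustrated roughly as follows. The abstract does not say which test CRISP applies to the table, so this sketch substitutes the Pearson chi-square statistic on a pools-by-allele table as an assumed stand-in; the function names are hypothetical and no p-value is computed.

```python
def chi_square_statistic(pools):
    """Pearson chi-square statistic for a (number of pools) x 2 table.

    pools is a list of (reference_count, alternate_count) pairs, one per pool.
    A large value suggests the alternate allele is unevenly distributed
    across pools, as expected for a real variant carried by few individuals.
    """
    ref_total = sum(ref for ref, _ in pools)
    alt_total = sum(alt for _, alt in pools)
    grand_total = ref_total + alt_total
    if ref_total == 0 or alt_total == 0:
        return 0.0  # only one allele observed: nothing to compare
    statistic = 0.0
    for ref, alt in pools:
        depth = ref + alt
        if depth == 0:
            continue  # pool with no coverage contributes nothing
        for observed, column_total in ((ref, ref_total), (alt, alt_total)):
            expected = depth * column_total / grand_total
            statistic += (observed - expected) ** 2 / expected
    return statistic


def degrees_of_freedom(pools):
    """Degrees of freedom of the table: covered pools minus one."""
    return sum(1 for ref, alt in pools if ref + alt > 0) - 1
```

For three pools of depth 100 where only one pool shows 10 alternate reads, the statistic is far from zero, whereas pools with identical allele proportions give zero. Sequencing-error modelling and strand information, which the abstract also describes, are outside the scope of this sketch.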


PT24                                                                               Monday, July 12: 12:15 p.m. - 12:40 p.m.
Inferring Combinatorial Association Logic Networks in Multimodal Genome-Wide Screens
Room: 304
Subject: Gene Regulation and Transcriptomics

Author(s):
Jeroen de Ridder Delft Univ of Technology, Delft, The Netherlands
Alice Gerrits Univ of Groningen, Groningen, The Netherlands
Jan Bot Delft Univ of Technology, Delft, The Netherlands
Gerald de Haan Univ of Groningen, Groningen, The Netherlands
Marcel Reinders Delft Univ of Technology, Delft, The Netherlands
Lodewyk Wessels Netherlands Cancer Institute, Amsterdam, The Netherlands

Presenting Author: Jeroen de Ridder

Motivation: We propose an efficient method to infer combinatorial association logic networks from multiple genome-wide measurements from the same sample. We demonstrate our method on a genetical genomics dataset, in which we search for Boolean combinations of multiple genetic loci that associate with transcript levels. Results: Our method provably finds the global solution and is very efficient with runtimes of up to four orders of magnitude faster than exhaustive search. This enables permutation procedures for determining accurate false positive rates and allows selection of the most parsimonious model. When applied to transcript levels measured in myeloid cells from 24 genotyped recombinant inbred strains, we discovered that 9 gene clusters are putatively modulated by a logical combination of trait loci rather than a single locus. A literature survey supports and further elucidates one of these findings. Due to our approach, optimal solutions for multi-locus logic models and accurate estimates of the associated false discovery rates become feasible. Our algorithm therefore offers a valuable alternative to approaches employing complex, albeit sub-optimal optimization strategies to identify complex models. Availability: The code of the prototype implementation will be made available on the website of the authors. Contact: m.j.t.reinders@tudelft.nl or l.wessels@nki.nl


PT25                                                                               Monday, July 12: 2:30 p.m. - 2:55 p.m.
Optimal Algorithms for Haplotype Assembly From Whole-Genome Sequence Data
Room: 302
Subject: Population Genomics

Author(s):
Dan He UCLA, Los Angeles, CA
Arthur Choi UCLA, Los Angeles, CA
Knot Pipatsrisawat UCLA, Los Angeles, CA
Adnan Darwiche UCLA, Los Angeles, CA
Eleazar Eskin UCLA, Los Angeles, CA

Presenting Author: Dan He

Haplotype Inference is an important step for many types of analyses of human variation. Traditional approaches to obtain haplotypes involve collecting genotype information from a population of individuals and then applying a haplotype inference algorithm. The development of high-throughput sequencing technologies allows for an alternative strategy to obtain haplotypes by combining sequence fragments. The problem of "haplotype assembly" is the challenging problem of assembling the two haplotypes for a chromosome given the collection of such fragments, or reads. Errors in reads significantly increase the difficulty of the problem and it has been shown that the problem is NP-hard even for reads of size 2. Existing greedy and stochastic algorithms are not guaranteed to find the optimal solutions for the haplotype assembly problem. In this paper, we propose a dynamic programming algorithm, which is able to assemble the haplotypes optimally with time complexity O(m * 2^k * n), where m is the number of reads, k is the length of the longest read and n is the total number of SNPs in the haplotypes. We also convert the haplotype assembly problem to a MaxSAT problem, which can be solved optimally for cases where k is large. Taking advantage of the efficiency of our algorithm, we perform simulation experiments demonstrating that the assembly of haplotypes using reads of length typical of current sequencing technologies is not practical. However, we demonstrate that the combination of this approach and the traditional haplotype phasing approaches can lead to construction of haplotypes containing both common and rare variants.


PT26                                                                               Monday, July 12: 2:30 p.m. - 2:55 p.m.
Modularity and Directionality in Genetic Interaction Maps
Room: 304
Subject: Protein Interactions and Molecular Networks

Author(s):
Ariel Jaimovich The Hebrew Univ of Jerusalem, Jerusalem, Israel
Ruty Rinott The Hebrew University of Jerusalem
Maya Schuldiner Weizmann Institute, Rehovot, Israel
Hanah Margalit The Hebrew Univ of Jerusalem, Jerusalem, Israel
Nir Friedman The Hebrew Univ of Jerusalem, Jerusalem, Israel

Presenting Author: Ruty Rinott

Motivation: Genetic interactions between genes reflect functional relationships caused by a wide range of molecular mechanisms. Large-scale genetic interaction assays lead to a wealth of information about the functional relations between genes. However, the vast number of observed interactions, along with experimental noise, makes the interpretation of such assays a major challenge. Results: Here, we introduce a computational approach to organize genetic interactions and show that the bulk of observed interactions can be organized in a hierarchy of modules. Revealing this organization enables insights into the function of cellular machineries and highlights global properties of interaction maps. To gain further insight into the nature of these interactions, we integrated data from genetic screens under a wide range of conditions to reveal that more than a third of observed aggravating (i.e., synthetic sick/lethal) interactions are unidirectional, where one gene can buffer the effects of perturbing another gene but not vice versa. Furthermore, most modules of genes that have multiple aggravating interactions were found to be involved in such unidirectional interactions. We demonstrate that the identification of external stimuli that mimic the effect of specific gene knockouts provides insights into the role of individual modules in maintaining cellular integrity. Availability: We designed a freely accessible web tool that includes all our findings, and is specifically intended to allow effective browsing of our results (http://compbio.cs.huji.ac.il/GIAnalysis).


PT27                                                                               Monday, July 12: 3:00 p.m. - 3:25 p.m.
Next Generation VariationHunter: Combinatorial Algorithms for Transposon Insertion Discovery
Room: 302
Subject: Sequence Analysis

Author(s):
Fereydoun Hormozdiari Simon Fraser Univ, Burnaby, Canada
Iman Hajirasouliha Simon Fraser Univ, Burnaby, Canada
Phuong Dao Simon Fraser Univ, Burnaby, Canada
Faraz Hach Simon Fraser Univ, Burnaby, Canada
Deniz Yurokoglu Simon Fraser Univ, Burnaby, Canada
Can Alkan Univ of WA, Seattle, WA
Evan E. Eichler Univ of WA, Seattle, WA
S. Cenk Sahinalp Simon Fraser Univ, Burnaby, Canada

Presenting Author: Fereydoun Hormozdiari

Recent years have witnessed increased research activity on the detection of structural variants (SVs) and their association with human disease. The advent of next-generation sequencing technologies makes it possible to extend the scope of structural variation studies to a point previously unimaginable, as exemplified by the 1000 Genomes Project. Although various computational methods have been described for the detection of SVs, no such algorithm is yet fully capable of discovering transposon insertions, a very important class of SVs for the study of human evolution and diseases. In this paper, for the first time, we provide a complete and novel formulation to discover the types and loci of transposons inserted into genomes sequenced by high-throughput sequencing technologies. In addition, we present "conflict resolution" improvements to our earlier combinatorial SV detection algorithm (VariationHunter) by taking the diploid nature of the human genome into consideration. We test our algorithms with simulated data from the Venter genome (HuRef), and are able to discover > 85% of transposon insertion events with precision of > 90%. We also demonstrate that our conflict resolution algorithm (denoted as VariationHunter-CR) outperforms the current state of the art (such as the original VariationHunter, BreakDancer and MoDil algorithms) when tested on the genome of the Yoruba African individual (NA18507).


PT28                                                                               Monday, July 12: 3:00 p.m. - 3:25 p.m.
Drug-target interaction prediction from chemical, genomic and pharmacological data in an integrated framework
Room: 304
Subject: Protein Interactions and Molecular Networks

Author(s):
Yoshihiro Yamanishi Mines Paris Tech, Fontainebleau Cedex, France
Masaaki Kotera Kyoto Univ, Kyoto, Japan
Minoru Kanehisa Kyoto Univ, Kyoto, Japan
Susumu Goto Kyoto Univ, Kyoto, Japan

Presenting Author: Yoshihiro Yamanishi

In silico prediction of drug-target interactions from heterogeneous biological data is critical in the search for drugs and therapeutic targets for known diseases such as cancers. There is therefore a strong incentive to develop new methods capable of detecting these potential drug-target interactions efficiently. In this paper, we investigate the relationship between the chemical space, the pharmacological space, and the topology of drug-target interactions networks, and show that drug-target interactions are more correlated with pharmacological effect similarity than with chemical structure similarity. We then develop a new method to predict unknown drug-target interactions from chemical, genomic, and pharmacological data on a large scale. The proposed method consists of two steps: 1) prediction of pharmacological effects from chemical structures of given compounds, and 2) inference of unknown drug-target interactions based on the pharmacological effect similarity in the framework of supervised bipartite graph inference. The originality of the proposed method lies in the prediction of potential pharmacological similarity for any drug candidate compounds and in the integration of chemical, genomic, and pharmacological data in a unified framework. In the results, we make predictions for four classes of important drug-target interactions involving enzymes, ion channels, GPCRs, and nuclear receptors. Our comprehensively predicted drug-target interaction networks enable us to suggest many potential drug-target interactions and to increase research productivity toward genomic drug discovery.


PT29                                                                               Monday, July 12: 3:30 p.m. - 3:55 p.m.
Efficient construction of an assembly string graph using the FM-index
Room: 302
Subject: Sequence Analysis

Author(s):
Jared Simpson Wellcome Trust Sanger Institute, Hinxton, UK
Richard Durbin Wellcome Trust Sanger Institute, Hinxton, UK

Presenting Author: Jared Simpson

Motivation: Sequence assembly is a difficult problem whose importance has grown again recently as the cost of sequencing has dramatically dropped. Most new sequence assembly software has started by building a de Bruijn graph, avoiding the overlap-based methods used previously because of the computational cost and complexity of these with very large numbers of short reads. Here we show how to use suffix array-based methods that have formed the basis of recent very fast sequence mapping algorithms to find overlaps and generate assembly string graphs asymptotically faster than previously described algorithms. Results: Standard overlap assembly methods have time complexity O(N^2), where N is the sum of the lengths of the reads. We use the Ferragina-Manzini index (FM-index) derived from the Burrows-Wheeler transform to find overlaps of length at least tau amongst a set of reads. As well as an approach that finds all overlaps then implements transitive reduction to produce a string graph, we show how to output directly only the irreducible overlaps, significantly shrinking memory requirements and reducing compute time to O(N). Overlap-based assembly methods naturally handle mixed length read sets, including capillary reads or long reads promised by third generation sequencing technologies. The algorithms we present here pave the way for overlap-based assembly approaches to be developed that scale to whole vertebrate genome de novo assembly.
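
The basic FM-index operation that such methods build on is backward search, which counts the occurrences of a pattern by extending it one character at a time from right to left. The sketch below illustrates only that primitive, not the overlap-finding or string graph construction described in the abstract. It builds the index naively by sorting suffixes (suitable for toy inputs only), assumes the text does not contain the sentinel "$", and uses hypothetical function names.

```python
def build_fm_index(text):
    """Build a toy FM-index: first-column offsets and occurrence tables of the BWT."""
    text += "$"  # sentinel, assumed absent from the text and smaller than every symbol
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    # BWT: the character preceding each sorted suffix (wraps to "$" for suffix 0).
    bwt = "".join(text[i - 1] for i in suffix_array)
    alphabet = sorted(set(text))
    # first[c]: number of characters in the text strictly smaller than c.
    first, total = {}, 0
    for c in alphabet:
        first[c] = total
        total += text.count(c)
    # occ[c][i]: number of occurrences of c in bwt[:i].
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (1 if ch == c else 0)
    return first, occ, len(bwt)


def count_occurrences(index, pattern):
    """Count occurrences of pattern using backward search over the index."""
    first, occ, n = index
    lo, hi = 0, n  # half-open interval of suffix array rows matching the current suffix
    for ch in reversed(pattern):
        if ch not in first:
            return 0
        lo = first[ch] + occ[ch][lo]
        hi = first[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo
```

Each step of the loop costs constant time given the tables, so a query takes time proportional to the pattern length regardless of the text size. A practical index would sample the occurrence tables and build the transform without sorting full suffixes.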


PT30                                                                               Monday, July 12: 3:30 p.m. - 3:55 p.m.
Integrating Quantitative Proteomics and Metabolomics with a Genome-scale Metabolic Network Model
Room: 304
Subject: Protein Interactions and Molecular Networks

Author(s):
Keren Yizhak Tel Aviv Univ, Tel Aviv, Israel
Tomer Benyamini Tel Aviv Univ, Tel Aviv, Israel
Wolfram Liebermeister Humboldt Univ, Berlin, Germany
Eytan Ruppin Tel Aviv Univ, Tel Aviv, Israel
Tomer Schlomi Technion, Haifa, Israel

Presenting Author: Keren Yizhak

The availability of modern sequencing techniques has enabled a rapid increase in the number of reconstructed metabolic networks. Using these models as a platform for the analysis of high-throughput 'omics' data, such as transcriptomic, proteomic and metabolomic data, can provide valuable insight into conditional changes in the metabolic activity of an organism. While transcriptomics and proteomics provide important insights into the hierarchical regulation of metabolic flux, metabolomics may provide information on metabolic regulation, denoting the effect of metabolite concentrations on actual enzyme activity. Here we introduce a method, Integrative Omics-Metabolic Analysis (IOMA), that quantitatively integrates proteomic and metabolomic data with genome-scale metabolic models, allowing the prediction of metabolic flux distributions. The method is formulated as a quadratic programming (QP) problem that seeks a steady-state flux distribution in which flux through reactions with measured proteomic and metabolomic data is as consistent as possible with kinetically-derived flux estimations. IOMA is shown to successfully predict the metabolic state of human erythrocytes (compared to kinetic model simulations), showing a significant advantage over the commonly used methods FBA and MOMA. Furthermore, IOMA is shown to correctly predict metabolic fluxes in Escherichia coli under different gene knockouts for which both metabolomic and proteomic data are available, achieving higher prediction accuracy than the extant methods. Considering that there are currently no experimental approaches that enable high-throughput flux measurements, while high-throughput metabolomic and proteomic approaches are becoming widely available, we expect IOMA to contribute significantly to future research on cellular metabolism.


PT31                                                                               Monday, July 12: 4:00 p.m. - 4:25 p.m.
VARiD: A Variation Detection Framework for Colorspace and Letterspace platforms
Room: 302
Subject: Sequence Analysis

Author(s):
Adrian V. Dalca MIT, Cambridge, MA
Stephen Rumble Stanford Univ, Stanford, CA
Samuel Levy Scripps Institute, La Jolla, CA
Michael Brudno Univ of Toronto, Toronto, Canada

Presenting Author: Adrian V. Dalca

Motivation: High Throughput Sequencing (HTS) technologies are transforming the study of genomic variation. The various HTS technologies have different sequencing biases and error rates, and while most HTS technologies sequence the residues of the genome directly, generating base calls for each position, the Applied Biosystems SOLiD platform generates dibase-coded (color-space) sequences. While combining data from the various platforms should increase the accuracy of variation detection, to date there are only a few tools that can identify variants from color-space data, and none that can analyze color-space and regular (letter-space) data together. Results: We present VARiD, a probabilistic method for variation detection from both letter-space and color-space reads simultaneously. VARiD is based on a Hidden Markov Model (HMM), and uses the Forward-Backward algorithm to accurately identify heterozygous, homozygous, and tri-allelic SNPs, as well as microindels. Our analysis shows that VARiD performs better than the AB SOLiD toolset at detecting variants from color-space data alone, and improves the calls dramatically when letter-space and color-space reads are combined. Availability: The toolset is freely available at http://compbio.cs.utoronto.ca/varid. Contact: varid@cs.toronto.edu
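
To make the difference between letter-space and color-space concrete, the following sketch shows the dibase encoding itself; it is unrelated to the VARiD HMM. In the SOLiD scheme each color summarizes a pair of adjacent bases, and with the conventional 2-bit codes A=0, C=1, G=2, T=3 the color equals the XOR of the two codes. The function names are hypothetical.

```python
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
CODE_BASE = "ACGT"


def encode_colors(sequence):
    """Translate a letter-space sequence into colors, one per adjacent base pair."""
    return [BASE_CODE[x] ^ BASE_CODE[y] for x, y in zip(sequence, sequence[1:])]


def decode_colors(first_base, colors):
    """Recover the letter-space sequence from its first base and the color string."""
    bases = [first_base]
    for color in colors:
        # XOR is its own inverse, so the next base is previous XOR color.
        bases.append(CODE_BASE[BASE_CODE[bases[-1]] ^ color])
    return "".join(bases)
```

With this encoding, "ATGGC" becomes [3, 1, 0, 3]. A single-base substitution changes two adjacent colors (for example "ATCGC" gives [3, 2, 3, 3]), whereas a single miscalled color corrupts every base after it on decoding, which is one reason color-space reads need dedicated variant detection.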


PT32                                                                               Monday, July 12: 4:00 p.m. - 4:25 p.m.
PathText: A Text Mining Integrator for Biological Pathway Visualizations
Room: 304
Subject: Text Mining

Author(s):
Brian Kemper Univ of Tokyo, Tokyo, Japan
Takuya Matsuzaki Univ of Tokyo, Tokyo, Japan
Yukiko Matsuoka Systems Biology Institute, Tokyo, Japan
Yoshimasa Tsuruoka Japan Advanced Institute of Science and Technology, Ishikawa, Japan
Hiroaki Kitano Systems Biology Institute, Tokyo, Japan
Sophia Ananiadou National Centre for Text Mining, Manchester, UK
Jun'ichi Tsujii Univ of Tokyo, Tokyo, Japan

Presenting Author: Sophia Ananiadou

Metabolic and signaling pathways are an increasingly important part of organizing knowledge in systems biology. They serve to integrate collective interpretations of facts scattered throughout literature. Biologists construct a pathway by reading a large number of articles and interpreting them as a consistent network, but most of the models constructed currently lack direct links to those articles. Biologists who want to check the original articles have to spend substantial amounts of time to collect relevant articles and identify the sections relevant to the pathway. Furthermore, with the scientific literature expanding by several thousand papers per week, keeping a model relevant requires a continuous curation effort. In this paper, we present a system designed to integrate a pathway visualizer, text mining systems and annotation tools into a seamless environment. This will enable biologists to freely move between parts of a pathway and relevant sections of articles, as well as identify relevant papers from large text bases. The system, PathText, is developed by SBI (Systems Biology Institute), NaCTeM (National Centre for Text Mining) and the University of Tokyo, and is being used by groups of biologists from these locations.


PT33                                                                               Tuesday, July 13: 10:45 a.m. - 11:10 a.m.
Assessing the functional coherence of gene sets with metrics based on the Gene Ontology graph
Room: 302
Subject: Databases and Ontologies

Author(s):
Adam Richards Medical University of South Carolina, Charleston, SC
Brian Muller Medical University of South Carolina, Charleston, SC
Matthew Shotwell Medical University of South Carolina, Charleston, SC
L. Ashley Cowart Medical University of South Carolina, Charleston, SC
Baerbel Rohrer Medical University of South Carolina, Charleston, SC
Xinghua Lu Medical University of South Carolina, Charleston, SC

Presenting Author: Adam Richards

The results of initial analyses for many high-throughput technologies commonly take the form of gene or protein sets, and one of the ensuing tasks is to evaluate the functional coherence of these sets. The study of gene set function most commonly makes use of controlled vocabulary in the form of ontology annotations. For a given gene set the statistical significance of observing these annotations or 'enrichment' may be tested using a number of methods. Instead of testing for significance of individual terms, this study is concerned with the task of assessing the global functional coherence of gene sets, for which novel metrics and statistical methods have been devised. The metrics of this study are based on the topological properties of graphs comprised of genes and their Gene Ontology annotations. A novel aspect of these methods is that both the enrichment of annotations and the relationships among annotations are considered when determining the significance of functional coherence. We applied our methods to perform analyses on an existing database and on microarray experimental results. Here we demonstrated that our approach is highly discriminative in terms of differentiating coherent gene sets from random ones and that it provides biologically sensible evaluations in microarray analysis. We further used examples to show the utility of graph visualization as a tool for studying the functional coherence of gene sets.


PT34                                                                               Tuesday, July 13: 10:45 a.m. - 11:10 a.m.
A Spectral Graph Theoretic Approach to Quantification and Calibration of Collective Morphological Differences in Cell Images
Room: 304
Subject: Bioimaging

Author(s):
Yu-Shi Lin Institute of Information Science, Academia Sinica, Taiwan
Chung-Chih Lin National Yang-Ming University, Taipei, Taiwan
Yuh-Show Tsai Chung Yuan Christian University, Jhongli, Taiwan
Tien-Chuan Ku Chung Yuan Christian University, Jhongli, Taiwan
Yi-Hung Huang Institute of Information Science, Academia Sinica, Taiwan
Chun-Nan Hsu USC, Los Angeles, CA

Presenting Author: Chun-Nan Hsu

High-throughput image-based assay technologies can rapidly produce a large number of cell images for drug screening, but data analysis is still a major bottleneck that limits their utility. Quantifying a wide variety of morphological differences observed in cell images under different drug influences is still a challenging task because the result can be highly sensitive to sampling and noise. We propose a graph-based approach to cell image analysis. We define graph transition energy to quantify morphological differences between image sets. A spectral graph theoretic regularization is applied to transform the feature space based on training examples of extremely different images to calibrate the quantification. Calibration is essential for a practical quantification method because we need to measure the confidence of the quantification. We applied our method to quantify the degree of partial fragmentation of mitochondria in fluorescent cell images. We show that with transformation, the quantification can be more accurate and sensitive than that without transformation. We also show that our method outperforms competing methods, including neighborhood component analysis (NCA) and the multi-variate drug profiling method by Loo et al. We illustrate its utility with a study of Annonaceous acetogenins, a family of compounds with drug potential. Our result reveals that squamoncin induces more large fragmented mitochondria than muricin A.


PT35                                                                               Tuesday, July 13: 11:15 a.m. - 11:40 a.m.
Semi-automated ontology generation within OBO-Edit
Room: 302
Subject: Databases and Ontologies

Author(s):
Thomas Wächter University of Technology, Dresden, Germany
Michael Schroeder University of Technology, Dresden, Germany

Presenting Author: Thomas Wächter

Motivation: Ontologies and taxonomies have proven highly beneficial for biocuration. The OBO Foundry alone lists over 90 ontologies, mainly built with OBO-Edit. Creating and maintaining such ontologies is a labour-intensive, difficult, manual process. Automating parts of it is of great importance for the further development of ontologies and for biocuration. Results: We developed the Dresden Ontology Generator for Directed Acyclic Graphs (DOG4DAG), a system which supports the creation and extension of OBO ontologies by semi-automatically generating terms, definitions, and parent-child relations from text in PubMed, the web, and PDF repositories. DOG4DAG is seamlessly integrated with OBO-Edit. It generates terms by identifying statistically significant noun-phrases in text. For definitions and parent-child relations it employs pattern-based web searches. We systematically evaluate each generation step using manually validated benchmarks. The term generation leads to high quality terms also found in manually created ontologies. Up to 78% of definitions are valid and up to 54% of child-ancestor relations can be retrieved. No other validated system exists that achieves comparable results. By combining the prediction of high quality terms, definitions and parent-child relations with the ontology editor OBO-Edit we contribute a thoroughly validated tool for all OBO ontology engineers. Availability: DOG4DAG is available within OBO-Edit 2.1 at http://www.oboedit.org. (Download version obo-edit2.1beta2) Contact: thomas.waechter@biotec.tu-dresden.de Supplementary information: Available online at http://www.biotec.tu-dresden.de/~waechter/DOG4DAG/


PT36                                                                               Tuesday, July 13: 11:15 a.m. - 11:40 a.m.
As-rigid-as-possible mosaicking and serial section registration of large ssTEM datasets
Room: 304
Subject: Bioimaging

Author(s):
Stephan Saalfeld Max Planck Institute of Molecular Cell Bio & Genetics, Dresden, Germany
Pavel Tomancak Max Planck Institute of Molecular Cell Bio & Genetics, Dresden, Germany
Albert Cardona Institute of Neuroinformatics, Zurich, Switzerland
Volker Hartenstein UCLA, Los Angeles, CA

Presenting Author: Stephan Saalfeld

Motivation: Tiled serial section Transmission Electron Microscopy (ssTEM) is increasingly used to describe high-resolution anatomy of large biological specimens. In particular in neurobiology, TEM is indispensable for analysis of synaptic connectivity in the brain. Registration of ssTEM image mosaics has to recover the three-dimensional continuity and geometrical properties of the specimen in the presence of various distortions that are applied to the tissue during sectioning, staining and imaging. These include staining artifacts, mechanical deformation, missing sections and the fact that structures may appear dissimilar in consecutive sections. Results: We developed a fully automatic, non-rigid but as-rigid-as-possible registration method for large tiled serial section microscopy stacks. We use the Scale Invariant Feature Transform to identify corresponding landmarks within and across sections and globally optimize the pose of all tiles in terms of least square displacement of these landmark correspondences. We evaluate the precision of the approach using an artificially generated dataset designed to mimic the properties of TEM data. We demonstrate the performance of our method by registering an ssTEM dataset of the first instar larval brain of Drosophila melanogaster consisting of 6,885 images. Availability: This method is implemented as part of the open source software TrakEM2 and distributed through the Fiji project.


PT37                                                                               Tuesday, July 13: 11:45 a.m. - 12:10 p.m.
Using Semantic Web Rules To Reason On An Ontology Of Pseudogenes
Room: 302
Subject: Databases and Ontologies

Author(s):
Matt Holford Yale University, New Haven, CT
Ekta Khurana Yale University, New Haven, CT
Kei-hoi Cheung Yale University, New Haven, CT
Mark Gerstein Yale University, New Haven, CT

Presenting Author: Matt Holford

Recent years have seen the development of a wide range of biomedical ontologies. Notable among these is the Sequence Ontology (SO), offering a rich hierarchy of terms and relationships which can be used to annotate genomic data. Well-designed formal ontologies allow data to be reasoned upon in a consistent and logically sound way and can lead to the discovery of new relationships. The Semantic Web Rule Language (SWRL) augments the capabilities of a reasoner by allowing the creation of conditional rules. To date, however, formal reasoning, especially the use of SWRL rules, has not been widely exploited in work with biomedical ontologies. We have used the Sequence Ontology to annotate human pseudogenes, extending the existing framework to incorporate additional attributes we wished to describe. We then created a series of logical rules using SWRL to answer research questions and to annotate our pseudogenes appropriately. Finally, we were left with a knowledge base which could be queried to discover information about human pseudogene evolution. The populated Pseudogene Ontology is available for download at: http://bioinformatics.med.yale.edu/owl/pseudo.owl. Additionally, a SPARQL endpoint from which to query the dataset can be accessed at: http://bioinformatics.med.yale.edu/pseudosparql.


PT38                                                                               Tuesday, July 13: 11:45 a.m. - 12:10 p.m.
A Soft Kinetic Data Structure for Lesion Border Detection
Room: 304
Subject: Bioimaging

Author(s):
Sinan Kochara Univ of Central AR, Conway, AR
Mutlu Mete Texas A&M Univ, College Station, TX
Kemal Aydin Univ of AR, Pine Bluff, AR
Vincent Yip Univ of Central AR, Conway, AR
Brendan Lee Univ of Central AR, Conway, AR

Presenting Author: Sinan Kochara

Motivation: Medical imaging and image processing techniques, ranging from microscopic to macroscopic, have become one of the main components of diagnostic procedures to assist dermatologists in their medical decision-making processes. Computer-aided segmentation and border detection on dermoscopic images is one of the core components of diagnostic procedures and therapeutic interventions for skin cancer. Automated assessment tools for dermoscopic images have become an important research field mainly because of inter- and intra-observer variations in human interpretations. In this study, a novel approach, graph spanner, for automatic border detection in dermoscopic images is proposed. In this approach, a proximity graph representation of dermoscopic images is presented in order to detect regions and borders in skin lesions. Results: The graph spanner approach is examined on a set of 100 dermoscopic images whose borders, manually drawn by a dermatologist, are used as the ground truth. Error rates (false positives and false negatives) along with true positives and true negatives are quantified by digitally comparing results with the borders manually determined by a dermatologist. The results show that the highest precision and recall rates obtained in determining lesion boundaries are 100%. However, assessment accuracy averages 97.72% and the mean border error is 2.28% for the whole dataset.
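
The evaluation described above can be sketched as a pixel-wise comparison of a predicted lesion mask with a manually drawn one. The sketch below uses the textbook definitions of precision, recall and accuracy; the abstract does not spell out its formulas, so treating the border error as the fraction of misclassified pixels (1 - accuracy, consistent with 97.72% + 2.28% = 100%) is an assumption, and the function name and toy masks are hypothetical.

```python
def border_metrics(predicted, truth):
    """Compare two binary masks (lists of rows of 0/1) pixel by pixel.

    Assumes both masks have the same shape and that at least one pixel is
    predicted as lesion and at least one pixel is truly lesion.
    """
    tp = fp = fn = tn = 0
    for pred_row, true_row in zip(predicted, truth):
        for p, t in zip(pred_row, true_row):
            if p and t:
                tp += 1          # lesion pixel correctly detected
            elif p:
                fp += 1          # background marked as lesion
            elif t:
                fn += 1          # lesion pixel missed
            else:
                tn += 1          # background correctly left out
    total = tp + fp + fn + tn
    return {
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "accuracy": (tp + tn) / total,
        "error": (fp + fn) / total,   # ASSUMED definition: 1 - accuracy
    }


# Hypothetical 2x3 masks, not data from the study.
predicted_mask = [[0, 1, 1],
                  [0, 1, 0]]
manual_mask = [[0, 1, 0],
               [0, 1, 1]]
metrics = border_metrics(predicted_mask, manual_mask)
```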


PT39                                                                               Tuesday, July 13: 12:15 p.m. - 12:40 p.m.
Complex Event Extraction at PubMed Scale
Room: 302
Subject: Text Mining

Author(s):
Jari Bjorne Univ of Turku, Turku, Finland
Filip Ginter Univ of Turku, Turku, Finland
Sampo Pyysalo Univ of Tokyo, Tokyo, Japan
Jun'ichi Tsujii Univ of Tokyo, Tokyo, Japan
Tapio Salakoski Univ of Turku, Turku, Finland

Presenting Author: Jari Bjorne

Motivation: There has recently been a notable shift in biomedical information extraction from relation models toward the more expressive event model, facilitated by the maturation of basic tools for biomedical text analysis and the availability of manually annotated resources. The event model allows detailed representation of complex natural language statements and can support a number of advanced text mining applications ranging from semantic search to pathway extraction. A recent collaborative evaluation demonstrated the potential of event extraction systems, yet there have so far been no studies of the generalization ability of the systems nor the feasibility of large-scale extraction. Results: This study considers event-based information extraction at PubMed scale. We introduce a system combining state-of-the-art methods for domain parsing, named entity recognition and event extraction, and test the system on a representative 1% sample of all PubMed citations. We present the first evaluation of the generalization performance of event extraction systems to this scale and show that despite its computational complexity, event extraction from the entire PubMed is feasible. We further illustrate the value of the extraction approach through a number of analyses of the extracted information. Availability: The system and data will be available under an open-source license from http://bionlp.utu.fi/ upon publication. Contact: jari.bjorne@utu.fi


PT40                                                                               Tuesday, July 13: 12:15 p.m. - 12:40 p.m.
Reconstruction of the Neuromuscular Junction Connectome
Room: 304
Subject: Bioimaging

Author(s):
Ranga Srinivasan The Methodist Hospital Research Institute, Houston, TX
Stephen Wong The Methodist Hospital Research Institute, Houston, TX
Qing Li Univ of Houston, Houston, TX
Xiaobo Zhou The Methodist Hospital Research Institute, Houston, TX
Ju Lu Stanford Univ, Stanford, CA
Jeff Lichtman Harvard Univ, Cambridge, MA

Presenting Author: Ranga Srinivasan

Unraveling the structure and behavior of the brain and central nervous system (CNS) has always been a major goal of neuroscience. Understanding the wiring diagrams of the neuromuscular junction connectomes (full connectivity of nervous system neuronal components) is a starting point for this, as it helps in the study of the organizational and developmental properties of the mammalian CNS. The phenomenon of synapse elimination during developmental stages of the neuronal circuitry is such an example. Due to the organizational specificity of the axons in the connectomes, it becomes important to label and extract individual axons for morphological analysis. Features such as axonal trajectories, their branching patterns, geometric information, the spatial relations of groups of axons, etc. are of great interest to neurobiologists in the study of wiring diagrams. However, due to the complexity of the spatial structure of the axons, automatically tracking and reconstructing them from microscopy images in 3D is an unresolved problem. In this paper, AXONTRACKER-3D, a 3D axon tracking and labeling tool, is built to obtain quantitative information by reconstruction of the axonal structures in the entire innervation field. The ease of use along with accuracy of results makes AXONTRACKER-3D an attractive tool to obtain valuable quantitative information from axon datasets.


PT41                                                                               Tuesday, July 13: 2:15 p.m. - 2:40 p.m.
Phylogenetic Networks Do not Need to Be Complex: Using Fewer Reticulations to Represent Conflicting Clusters
Room: 302
Subject: Evolution and Comparative Genomics

Author(s):
Steven Kelk Centrum voor Wiskunde en Informatica (CWI), Amsterdam, The Netherlands
Leo van Iersel Univ of Canterbury, Christchurch, New Zealand
Regula Rupp Tubingen Univ, Tubingen, Germany
Daniel Huson Tubingen Univ, Tubingen, Germany

Presenting Author: Steven Kelk

Phylogenetic trees are widely used to display estimates of how groups of species evolved. Each phylogenetic tree can be seen as a collection of clusters, subgroups of the species that evolved from a common ancestor. When phylogenetic trees are obtained for several data sets (e.g. for different genes), their clusters are often contradictory. Consequently, the set of all clusters of such data sets cannot be combined into a single phylogenetic tree. Phylogenetic networks are a generalization of phylogenetic trees that can be used to display more complex evolutionary histories, including reticulate events such as hybridizations, recombinations and horizontal gene transfers. Here we present the new Cass algorithm that can combine any set of clusters into a phylogenetic network. We show that the networks constructed by Cass are usually simpler than networks constructed by other available methods. Moreover, we show that Cass is guaranteed to produce a network with at most two reticulations per biconnected component, whenever such a network exists. We have implemented Cass and integrated it in the freely available Dendroscope software.
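
The notion of contradicting clusters can be made concrete with a short check. Two clusters fit on the same rooted tree exactly when they are disjoint or one contains the other; any other pair is in conflict and calls for a network. The sketch below only detects such conflicts and is not the Cass algorithm; the function names and the toy gene trees are hypothetical.

```python
from itertools import combinations


def compatible(a, b):
    """Two clusters fit on one rooted tree if they are disjoint or nested."""
    return a.isdisjoint(b) or a <= b or b <= a


def conflicting_pairs(clusters):
    """Return every pair of clusters that cannot share a single tree."""
    return [(a, b) for a, b in combinations(clusters, 2) if not compatible(a, b)]


# Hypothetical clusters from two gene trees over the species a, b, c.
gene_tree_1 = [frozenset("ab"), frozenset("abc")]   # tree ((a,b),c)
gene_tree_2 = [frozenset("bc"), frozenset("abc")]   # tree (a,(b,c))

all_clusters = set(gene_tree_1) | set(gene_tree_2)
conflicts = conflicting_pairs(all_clusters)
# {a,b} and {b,c} overlap without nesting, so no single tree displays both.
```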


PT42                                                                               Tuesday, July 13: 2:15 p.m. - 2:40 p.m.
Automated tracking and analysis of centrosomes in early Caenorhabditis elegans embryos
Room: 304
Subject: Bioimaging

Author(s):
Steffen Jaensch Max Planck Institute of Molecular Cell Bio & Genetics, Dresden, Germany
Markus Decker Max Planck Institute of Molecular Cell Bio & Genetics, Dresden, Germany
Anthony A. Hyman Max Planck Institute of Molecular Cell Bio & Genetics, Dresden, Germany
Eugene W. Myers Howard Hughes Medical Institute, Ashburn, VA

Presenting Author: Steffen Jaensch

Motivation: The centrosome is a dynamic structure in animal cells that serves as a microtubule organizing center during mitosis and also regulates cell-cycle progression and sets polarity cues. Automated and reliable tracking of centrosomes is essential for genetic screens that study the process of centrosome assembly and maturation in the nematode Caenorhabditis elegans. Results: We have developed a fully automatic system for tracking and measuring fluorescently labeled centrosomes in 3D timelapse images of early C. elegans embryos. Using a spinning disc microscope, we monitor the centrosome cycle in living embryos from the 1- up to the 16-cell stage at imaging intervals between 30 and 50 seconds. After establishing the centrosome trajectories with a novel method involving two layers of inference, we also automatically detect the nuclear envelope breakdown in each cell division and recognize the identities of the centrosomes based on the invariant cell lineage of C. elegans. To date we have tracked centrosomes in over 500 wild type and mutant embryos with almost no manual correction required.


PT43                                                                               Tuesday, July 13: 2:45 p.m. - 3:10 p.m.
SUPERTRIPLETS: A triplet-based supertree approach to phylogenomics
Room: 302
Subject: Evolution and Comparative Genomics

Author(s):
Emmanuel Douzery Institut des Sciences de l'Evolution de Montpellier, Montpellier Cedex, France
Vincent Ranwez Institut des Sciences de l'Evolution de Montpellier, Montpellier Cedex, France
Alexis Criscuolo Institut Pasteur de Paris, Paris, France

Presenting Author: Emmanuel Douzery

Motivation: Phylogenetic tree-building methods use molecular data to infer biodiversity evolutionary patterns. A recurrent problem is to reconcile the various phylogenies built from different genomic sequences into a single one. This task is generally conducted by a two-step approach whereby a binary representation of the initial trees is first inferred and then a Maximum Parsimony (MP) analysis is performed on it. This binary representation uses a decomposition of all source trees that is usually based on clades, but that can also be based on triplets or quartets. The relative performances of these representations have been discussed but are difficult to assess since both are limited to relatively small datasets. Results: This article focuses on the triplet-based representation of source trees. We first recall how, using this representation, the parsimony analysis is related to the median tree notion. We then introduce SUPERTRIPLETS, a new algorithm that is specially designed to optimize this alternative formulation of the MP criterion. The method avoids several practical limitations of the triplet-based binary matrix representation, making it useful for dealing with large datasets. When the correct resolution of every triplet appears more often than the incorrect ones in source trees, SUPERTRIPLETS is guaranteed to reconstruct the correct phylogeny. Both simulations and a case study on mammalian phylogenomics confirm the advantages of this approach. In both cases, SUPERTRIPLETS tends to propose less resolved but more reliable supertrees than those inferred using MATRIX REPRESENTATION WITH PARSIMONY. Availability: JAVA source code and executable are available upon request.
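
As an illustration of the triplet-based decomposition mentioned above, the sketch below lists the rooted triplets ab|c displayed by a tree (a and b are closer to each other than either is to c) and counts how often each resolution occurs across source trees. It shows only the decomposition step and is not the SUPERTRIPLETS algorithm; the nested-tuple tree encoding, the function names and the toy trees are assumptions made for the example.

```python
from collections import Counter
from itertools import combinations


def leaves(tree):
    """Leaves of a rooted tree written as nested tuples with string leaves."""
    if isinstance(tree, str):
        return [tree]
    return [leaf for child in tree for leaf in leaves(child)]


def triplets(tree):
    """Set of (frozenset({a, b}), c) pairs, each meaning the triplet ab|c."""
    if isinstance(tree, str):
        return set()
    found = set()
    child_leaves = [leaves(child) for child in tree]
    for i, inner in enumerate(child_leaves):
        # Leaves hanging from the other children of this node.
        outer = [x for j, other in enumerate(child_leaves) if j != i for x in other]
        for a, b in combinations(inner, 2):
            for c in outer:
                found.add((frozenset((a, b)), c))
    for child in tree:
        found |= triplets(child)
    return found


# Hypothetical source trees over four taxa.
tree_1 = ((("human", "chimp"), "mouse"), "dog")
tree_2 = ((("human", "mouse"), "chimp"), "dog")

# How many source trees support each triplet resolution.
support = Counter()
for source in (tree_1, tree_2):
    support.update(triplets(source))
shared = triplets(tree_1) & triplets(tree_2)
```

Here the two trees agree on the three triplets that place dog outside, and disagree on how human, chimp and mouse are resolved; with more source trees, the support counts indicate which resolution is in the majority.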


PT44                                                                               Tuesday, July 13: 2:45 p.m. - 3:10 p.m.
SPEX^2 : Automated Concise Extraction of Spatial Gene Expression Patterns from Fly Embryo ISH Images
Room: 304
Subject: Bioimaging

Author(s):
Kriti Puniyani Carnegie Mellon Univ, Pittsburgh, PA
Christos Faloutsos Carnegie Mellon Univ, Pittsburgh, PA
Eric P. Xing Carnegie Mellon Univ, Pittsburgh, PA

Presenting Author: Kriti Puniyani

Microarray profiling of mRNA abundance is often ill suited for temporal-spatial analysis of gene expressions in multicellular organisms such as Drosophila. Recent progress in image-based genome-scale profiling of whole body mRNA patterns via in situ hybridization (ISH) calls for development of accurate and automatic image analysis systems to facilitate efficient mining of complex temporal-spatial mRNA patterns, which will be essential for functional genomics and network inference in higher organisms. We present SPEX^2, an automatic system for embryonic ISH image processing, which can extract, transform, compare, classify and cluster spatial gene expression patterns in Drosophila embryos. Our pipeline for gene expression pattern extraction outputs the precise spatial locations and strengths of the gene expression. We performed experiments on the largest publicly available collection of Drosophila ISH images, and show that our method achieves excellent performance in automatic image annotation, and also finds clusters that are significantly enriched, both for GO functional annotations, and for annotation terms from a controlled vocabulary used by human curators to describe these images.


PT45                                                                               Tuesday, July 13: 3:15 p.m. - 3:40 p.m.
Time and memory efficient likelihood-based tree searches on gappy phylogenomic alignments
Room: 302
Subject: Evolution and Comparative Genomics

Author(s):
Alexandros Stamatakis Technical Univ of Munich, Munich, Germany
Nikolaos Alachiotis Technical Univ of Munich, Munich, Germany

Presenting Author: Nikolaos Alachiotis

The current molecular data explosion poses new challenges for large-scale phylogenomic analyses. A property that characterizes phylogenomic datasets is that they tend to be gappy, i.e., can contain taxa with (many and disparate) missing genes. In current phylogenomic analyses alignment gappyness frequently exceeds 90%. We present and implement a generally applicable mechanism that allows for reducing memory footprints of likelihood-based phylogenomic analyses proportional to the gappyness in the alignment. We also introduce a set of algorithmic rules to efficiently conduct subtree pruning and re-grafting (SPR) moves using this mechanism. On a large phylogenomic DNA dataset with 2,177 taxa, 68 genes, and a gappyness of 90% we achieve a memory footprint reduction from 9GB down to 1GB, an 11-fold speedup for optimizing Maximum Likelihood model parameters, and accelerate the SPR tree search phase by a factor of 16. Thus, our approach can be deployed to improve efficiency for the two most important resources, CPU time and memory, by up to one order of magnitude. Availability: Current open-source version of RAxML v7.2.5 available at http://wwwkramer.in.tum.de/exelixis/software.html.


PT46                                                                               Tuesday, July 13: 3:15 p.m. - 3:40 p.m.
Automatic Reconstruction of 3D Neuron Structures Using a Graph-Augmented Deformable Model
Room: 304
Subject: Bioimaging

Author(s):
Hanchuan Peng Howard Hughes Medical Institute, Ashburn, VA
Zongcai Ruan Howard Hughes Medical Institute, Ashburn, VA
Deniz Atasoy Howard Hughes Medical Institute, Ashburn, VA
Scott Sternson Howard Hughes Medical Institute, Ashburn, VA

Presenting Author: Hanchuan Peng

Motivation: Digital reconstruction of 3D neuron structures is an important step toward reverse engineering the wiring and functions of a brain. However, despite a number of existing studies, this task is still challenging, especially when a 3D microscopic image has low signal-to-noise ratio and discontinued segments of neurite patterns. Results: We developed a Graph-augmented Deformable model (GD) to reconstruct (trace) the 3D structure of a neuron when it has a broken structure and/or fuzzy boundary. We formulated a variational problem using the geodesic shortest path, which is defined as a combination of Euclidean distance, exponent of inverse intensity of pixels along the path, and closeness to local centers of image intensity distribution. We solved it in two steps. We first used a shortest path graph algorithm to guarantee that we find the global optimal solution of this step. Then we optimized a discrete deformable curve model to achieve visually more satisfactory reconstructions. Within our framework, it is also easy to define an optional prior curve that reflects the domain knowledge of a user. We investigated the performance of our method using a number of challenging 3D neuronal image datasets of different model organisms including fruit fly, C. elegans, and mouse. In our experiments the GD method outperformed several comparison methods significantly in reconstruction accuracy, consistency, robustness, and speed. We further used GD in two real applications, namely cataloging neurite morphology of GAL4 patterns of fruit fly to build a 3D "standard" digital neurite atlas, and estimating the synaptic bouton density along axons for a mouse brain.
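
The first of the two steps, a globally optimal shortest path over the image graph, can be sketched with Dijkstra's algorithm. The example below is a simplified 2D, 4-connected illustration rather than the authors' 3D formulation: the step cost exp(lam * (1 - intensity / peak)) is an assumed stand-in for the "exponent of inverse intensity" term, the closeness-to-center term is omitted, and the function name, the parameter lam and the toy image are hypothetical.

```python
import heapq
import math


def trace_path(image, start, goal, lam=3.0):
    """Dijkstra shortest path between two pixels of a 2D intensity grid.

    Entering a pixel costs exp(lam * (1 - intensity / peak)), so bright
    pixels are cheap and dim pixels are expensive (ASSUMED weighting).
    """
    rows, cols = len(image), len(image[0])
    peak = max(max(row) for row in image) or 1

    def cost(r, c):
        return math.exp(lam * (1.0 - image[r][c] / peak))

    dist = {start: 0.0}
    prev = {}
    heap = [(0.0, start)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == goal:
            break
        if d > dist.get(node, math.inf):
            continue  # stale queue entry
        r, c = node
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols:
                nd = d + cost(nr, nc)
                if nd < dist.get((nr, nc), math.inf):
                    dist[(nr, nc)] = nd
                    prev[(nr, nc)] = node
                    heapq.heappush(heap, (nd, (nr, nc)))

    # Walk back from the goal to recover the traced path.
    path = [goal]
    while path[-1] != start:
        path.append(prev[path[-1]])
    return path[::-1]


# Hypothetical image: a bright L-shaped neurite with one dim gap at (2, 2).
image = [
    [9, 1, 1, 1],
    [9, 1, 1, 1],
    [9, 9, 2, 9],
]
path = trace_path(image, (0, 0), (2, 3))
```

In this toy image the cheapest route still follows the bright structure and crosses the dim gap, which is the behaviour one would want when a neurite appears broken in the image.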


PT47                                                                               Tuesday, July 13: 3:45 p.m. - 4:10 p.m.
Close Lower and Upper Bounds for the Minimum Reticulate Network of Multiple Phylogenetic Trees
Room: 302
Subject: Evolution and Comparative Genomics

Author(s):
Yufeng Wu Univ of CT, Storrs, CT

Presenting Author: Yufeng Wu

A reticulate network is a model for displaying and quantifying the effects of complex reticulate processes on the evolutionary history of species undergoing reticulate evolution. A central computational problem on reticulate networks is: given a set of phylogenetic trees (each for some region of the genomes), reconstruct the most parsimonious reticulate network (called the minimum reticulate network) that combines the topological information contained in the given trees. This problem is well known to be NP-hard. Thus, existing approaches for this problem either work with only two input trees or make simplifying topological assumptions. We present novel results on the minimum reticulate network problem. Unlike existing approaches, we address the fully general problem: there is no restriction on the number of trees that are input, and there is no restriction on the form of the allowed reticulate network. We present lower and upper bounds on the minimum number of reticulation events in the minimum reticulate network (and infer an approximately parsimonious reticulate network). A program called PIRN implements these methods and also outputs a graphical representation of the inferred network. Empirical results on simulated and biological data show that our methods are practical for a wide range of data. More importantly, the lower and upper bounds match for many datasets (especially when the number of trees is small or the reticulation level is low), and this allows us to solve the minimum reticulate network problem exactly for these datasets.


PT48                                                                               Tuesday, July 13: 3:45 p.m. - 4:10 p.m.
Quantifying the distribution of probes between subcellular locations using unsupervised pattern unmixing
Room: 304
Subject: Bioimaging

Author(s):
Luis Pedro Coelho Carnegie Mellon Univ, Pittsburgh, PA
Tao Peng Carnegie Mellon Univ, Pittsburgh, PA
Robert F. Murphy Carnegie Mellon Univ, Pittsburgh, PA

Presenting Author: Tao Peng

Motivation: Proteins exhibit complex subcellular distributions, which may include localizing in more than one organelle and varying in location depending on the cell physiology. Estimating the amount of protein distributed in each subcellular location is essential for quantitative understanding and modeling of protein dynamics and how they affect cell behaviors. We have previously described automated methods using fluorescent microscope images to determine the fractions of protein fluorescence in various subcellular locations when the basic locations in which a protein can be present are known. As this set of basic locations may be unknown (especially for studies on a proteome-wide scale), we here describe unsupervised methods to identify the fundamental patterns from images of mixed patterns and estimate the fractional composition of them. Methods: We developed two approaches to the problem, both based on identifying types of objects present in images and representing patterns by frequencies of those object types. One is a basis pursuit method (which is based on a linear mixture model), and the other is based on latent Dirichlet allocation (LDA). For testing both approaches, we used images previously acquired for testing supervised unmixing methods. These images were of cells labeled with various combinations of two organelle-specific probes that had the same fluorescent properties to simulate mixed patterns of subcellular location. Results: We achieved 0.80 and 0.91 correlation between estimated and underlying fractions of the two probes (fundamental patterns) with basis pursuit and LDA approaches respectively, indicating that our methods can unmix the complex subcellular distribution with reasonably high accuracy.
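
The linear mixture model underlying pattern unmixing can be illustrated with a deliberately simplified case: if the object-type frequency vectors of two fundamental patterns were already known, the mixing fraction of a mixed image would follow from a one-dimensional least-squares fit. This corresponds to the easier, supervised setting and is neither the basis pursuit nor the LDA method of the abstract; the function name and all frequency values are hypothetical.

```python
def mixing_fraction(mixed, pattern_a, pattern_b):
    """Least-squares alpha for mixed ~ alpha * pattern_a + (1 - alpha) * pattern_b.

    Each argument is a list of object-type frequencies of equal length.
    Assumes the two patterns differ; the result is clamped to [0, 1].
    """
    diff = [a - b for a, b in zip(pattern_a, pattern_b)]
    resid = [m - b for m, b in zip(mixed, pattern_b)]
    alpha = sum(r * d for r, d in zip(resid, diff)) / sum(d * d for d in diff)
    return min(1.0, max(0.0, alpha))


# Hypothetical object-type frequencies for two pure patterns.
pattern_a = [0.7, 0.2, 0.1]
pattern_b = [0.1, 0.3, 0.6]

# A mixed pattern built with 25% of pattern A and 75% of pattern B.
mixed = [0.25 * a + 0.75 * b for a, b in zip(pattern_a, pattern_b)]
alpha = mixing_fraction(mixed, pattern_a, pattern_b)
```

In the unsupervised setting the pure patterns themselves are unknown and have to be estimated from the mixed images as well, which is what makes the problem described in the abstract harder.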

