Leading Professional Society for Computational Biology and Bioinformatics
Connecting, Training, Empowering, Worldwide

#ISMB2016

Accepted Posters

Attention Conference Presenters - please review the Speaker Information Page available here.

If you need assistance please contact submissions@iscb.org and provide your poster title or submission ID.

Category L - 'Protein Structure and Function Prediction and Analysis'

L01 - Kinase Similarity Analysis
  • Qiong Cheng, University of Miami, China
  • Stephan Schurer, University of Miami, United States

Short Abstract: Kinases play critical roles in the regulation of dynamic biological systems, including cancer cell growth, proliferation and survival. With emerging high-throughput screening technologies, large small-molecule libraries have been profiled against panels of kinases. However, kinase inhibitors may not selectively differentiate kinases, since a large number of protein kinases share a common cofactor and a similar three-dimensional structure of the catalytic site. We are interested in investigating the relationships among kinases across diverse spaces, with the goal of identifying the potential linear or non-linear expression of ligand-based chemical, pharmacogenomic, functional, and disease space by integrating large datasets from multiple sources. Our analysis started from LINCS KINOMEscan data, which is a “benchmark” kinase target competitive binding bioassay. However, as KINOMEscan covers only a small number of compounds, we curated and integrated one of the largest kinase inhibitor datasets and developed diverse learning models to investigate this space further. We assessed kinase-to-kinase similarity through different measurements and quantified pairwise associations and predictability among those measurements. Further, we integrate this analysis with LINCS L1000 gene expression data to study the determinants underlying mechanism of action.

L02 - Tech Startups of the Genome: De novo genes arise frequently, try to be useful, and occasionally succeed and survive
  • Amir Karger, Research Computing Group, Harvard Medical School, United States of America
  • Victor Luria, Department of Systems Biology, Harvard Medical School, United States of America
  • Anne O'Donnell-Luria, Broad Institute of MIT and Harvard, United States of America
  • John Wes Cain, Department of Mathematics, Harvard University, United States of America
  • Rafik Neme, Department of Evolutionary Genetics, Max Planck Institute for Evolutionary Biology, Germany
  • Bradley Olson, Division of Biology, United States of America
  • Marc Kirschner, Department of Systems Biology, Harvard Medical School, United States of America

Short Abstract: Most new protein-coding genes originate from old genes by duplication and domain shuffling. It was previously assumed that intergenic DNA could not yield long enough protein products through random mutations. Yet de novo protein-coding genes - derived from intergenic DNA - were recently found in multiple species. These genes are of particular interest as they alone can invent novel protein structures.
We asked how often de novo genes appear, how many exist in any genome and what proteins they make. We built a mathematical model incorporating gene dimensions and genome dynamic processes (mutation, recombination, selection). It predicts that de novo genes can easily be created and that at any time many young de novo genes exist, most being lost quickly. We identified thousands of de novo genes by phylostratigraphy in five genomes and analyzed their biophysical properties using structural bioinformatics. We found that, compared to ancient proteins, de novo proteins are shorter, more disordered, promiscuous (interacting with more proteins and DNA), vulnerable to proteases, and less prone to aggregation. Moreover, de novo proteins lack Pfam domains and may be structurally novel.
Frequent gene creation and a reduced tendency towards aggregation (which is toxic) provide a steady-state population of young de novo genes in the genome. This, along with de novo proteins’ propensity to interact, increases the chance that some will use their novel structures (and possibly novel functionalities) to integrate into existing genetic networks and survive for a long evolutionary time.

L03 - CoDNaS 2.0: A comprehensive database of protein conformational diversity in the native state
  • Alexander Monzon, Universidad Nacional de Quilmes, Argentina
  • Cristian Oscar Rohr, Laboratorio de Genómica Biomédica y Evolución, Argentina
  • Maria Silvina Fornasari, Universidad Nacional de Quilmes, Argentina
  • Gustavo Parisi, Universidad Nacional de Quilmes, Argentina

Short Abstract: CoDNaS (conformational diversity of the native state) is a protein conformational diversity database [http://ufq.unq.edu.ar/codnas]. It is a redundant collection of different structures obtained for the same protein. These structures differ in their crystallization conditions, such as presence of ligands, pH, post-translational modifications, mutations, change in oligomeric state, and so on. Conformational diversity describes structural differences between conformers that define the native state of proteins. It is a key concept to understand protein function and biological processes related to protein functions.
CoDNaS offers a well-curated database that is experimentally driven, thoroughly linked, and annotated. CoDNaS facilitates the extraction of key information on small structural differences based on protein movements. CoDNaS enables users to easily relate the degree of conformational diversity with physical, chemical, and biological properties derived from experiments on protein structure and biological characteristics. The new version of CoDNaS covers approximately 70% of all available protein structures and includes 263014 conformers that correspond to 17714 different protein chains, with more than 11 million conformer comparisons available to download. New tools have been added that run sequence searches, display structural flexibility profiles, and allow users to browse the database for different structural classes. These tools facilitate the exploration of protein conformational diversity and its role in protein function.

L04 - A Deep Neural Network Method for Predicting Mitochondria-Localized Proteins in Plants
  • Ning Zhang, Informatics Institute, University of Missouri, United States
  • R. Shyma Prasad Rao, Biostatistics and Bioinformatics Division, Yenepoya Research Center, Yenepoya University, India
  • Fernanda Salvato, Institute of Biology, State University of Campinas, Brazil
  • Jesper F. Havelund, Department of Molecular Biology and Genetics, Aarhus University, Denmark
  • Ian Max Møller, Department of Molecular Biology and Genetics, Aarhus University, Denmark
  • Jay J. Thelen, Department of Biochemistry, University of Missouri, United States
  • Dong Xu, Department of Computer Science, University of Missouri, United States

Short Abstract: Targeting and translocation of proteins to the appropriate subcellular compartments is crucial for cell organization and function. Newly synthesized proteins are transported to mitochondria with the assistance of targeting sequences, which are complex, containing either an N-terminal presequence or a multitude of internal signals to target this organelle. Compared with experimental approaches, computational predictions provide an efficient and cost-effective way to infer subcellular localization for any given protein. However, it is still challenging to predict plant mitochondrial localized proteins accurately due to various limitations, and the performance of current tools is unsatisfactory. We present a novel computational approach for large-scale prediction of plant mitochondrial proteins. We collected protein subcellular localization data in plants from databases and literature, and extracted different types of features from the training data, including amino acid composition, protein sequence profile, and gene co-expression information. We then trained deep neural networks for predicting plant mitochondrial proteins. Testing on a non-redundant dataset of potato mitochondrial and Swiss-Prot proteins, our method achieves considerable improvements over existing tools in predicting mitochondria-localized proteins in plants.
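
Of the feature types listed above, amino acid composition is the simplest to illustrate. The following is a minimal sketch of how such a feature vector could be computed, not the authors' pipeline; the function name and the toy sequence are hypothetical, and the sequence profile and co-expression features are not shown.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one-letter codes


def amino_acid_composition(sequence):
    """Return the fraction of each standard amino acid, in AMINO_ACIDS order."""
    seq = sequence.upper()
    # Non-standard symbols (e.g. X) are ignored rather than counted.
    counts = Counter(residue for residue in seq if residue in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]


# Hypothetical toy sequence, used only for illustration.
features = amino_acid_composition("MAAKL")
```

The result is a fixed-length vector of 20 fractions that sum to 1, which is the kind of input a neural network classifier can consume regardless of protein length.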

L05 - Scalable Analysis and Visualization of Large 3D Macromolecular Complexes
  • Anthony Bradley, RCSB Protein Data Bank, San Diego Supercomputer Center, UC San Diego, United States
  • Alexander Rose, San Diego Supercomputer Center, UC San Diego, United States
  • Yana Valasatava, San Diego Supercomputer Center, UC San Diego, United States
  • Jose Duarte, RCSB Protein Data Bank, San Diego Supercomputer Center, UC San Diego, United States
  • Andreas Prlic, RCSB Protein Data Bank, San Diego Supercomputer Center, UC San Diego, United States
  • Peter Rose, RCSB Protein Data Bank, San Diego Supercomputer Center, UC San Diego, United States

Short Abstract: Advances in experimental techniques have led to an explosion in both the number and size of 3D macromolecular structures. Existing text-based file formats for macromolecular data are slow to parse, are not easily extensible and do not contain certain key data (e.g., all bonding information). For these reasons we have developed the Macromolecular Transmission Format (MMTF) (http://mmtf.rcsb.org/). MMTF has three core benefits over existing file formats. First, through bespoke compression methods, the entire Protein Data Bank (PDB) archive can be stored in MMTF in less than 7GB. Second, MMTF data are stored in binary format making parsing an order of magnitude faster than existing text-based formats. Third, MMTF is user friendly, extensible and contains information not found in current formats. In this work we show that MMTF enables high-performance and scalable structural analysis of the PDB archive.

A second application of MMTF is the efficient multi-scale visualization and analysis of large molecular complexes on the web. We have tackled this problem by introducing MMTF to reduce network transfer and parsing time, and by developing NGL (https://github.com/arose/ngl), a highly memory-efficient and scalable WebGL-based viewer. MMTF offers over 75% compression over the standard mmCIF format, is over an order of magnitude faster to parse, and contains additional information (e.g., DSSP secondary structure). NGL renders molecular complexes with millions of atoms interactively on desktop computers and smartphones alike, making it a tool of choice for web-based molecular visualization in research and education.

This project was supported by NIH under award number U01 CA198942.
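
The abstract does not detail its compression methods. As a generic illustration of why integer columns in structure files compress well, the sketch below applies delta encoding followed by run-length encoding to a list of consecutive atom serial numbers. The function names are hypothetical and this is not MMTF's actual codec.

```python
def delta_encode(values):
    """Store the first value, then the difference between each pair of neighbours."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def run_length_encode(values):
    """Collapse runs of equal values into [value, count] pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs


def decode(runs):
    """Invert run_length_encode applied after delta_encode."""
    deltas = [v for v, count in runs for _ in range(count)]
    values, running = [], 0
    for d in deltas:
        running += d
        values.append(running)
    return values


# Toy data: atom serial numbers 1..1000, as found in many structure files.
serials = list(range(1, 1001))
encoded = run_length_encode(delta_encode(serials))
```

Here the 1000 serial numbers collapse to a single `[value, count]` pair, and `decode(encoded)` restores the original list.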

L06 - VarQ: Predicting the effects of SNPs based on protein structure
  • Leandro Radusky, Universidad de Buenos Aires, Argentina
  • Javier Delgado, CRG, Spain
  • Marcelo Marti, University of Buenos Aires, Argentina
  • Luis Serrano, CRG, Spain
  • Adrian Turjanski, University of Buenos Aires, Argentina
  • Christina Kiel, CRG, Spain

Short Abstract: Predicting the detailed effects of variants on the structure and function of a protein is a critical problem in drug discovery and personalized medicine. These variants can influence drug selection and provoke a variety of effects on protein behaviour. A portion of variants will result in subtle phenotypic effects that are challenging to predict. VarQ is a bioinformatic tool developed to give the user a description of the relevant properties of each variation. It consists of a calculation pipeline and a web server. The resulting information makes it possible to predict the likely effects of variations based on knowledge of previously studied SNPs. A key aspect of this work is the mining of variants. There are several databases populated with variants from different sources: clinical trials, sequencing information, etc. The tool can extract known variants from these databases and allows the user to specify novel variations to be analyzed. Some properties computed by the tool are: location of the variant (core of the protein, surface of the protein, region labeled as interfacing with another protein, binding site), energetic impact of the mutation on protein stability or on the stability of the protein-protein complex, the conservation of the involved residue, the mobility of the position, etc.
A validation was performed using a data-set of 14 RASopathy- and cancer-related proteins, with ~1200 known variants, testing the results against the known effects reported in the existing literature, with promising results.

L07 - Structure of Double Knot Toxin as a model for ion channels modulators
  • Romina V. Sepulveda, Center for Bioinformatics and Integrative Biology (CBIB), Universidad Andres Bello, Chile
  • Melissa Alegría-Arcos, Centro Interdisciplinario de Neurociencias de Valparaíso (CINV), Facultad de Ciencias, Universidad De Valparaíso. Center for Bioinformatics and Integrative Biology (CBIB), Facultad de Ciencias Biologi, Chile
  • Daniel Aguayo, Center for Bioinformatics and Integrative Biology, Universidad Andres Bello, Chile
  • Ignacio Diaz-Franulic, Fraunhofer Chile Research, Chile
  • Fernando Gonzalez-Nilo, Centro Interdisciplinario de Neurociencias de Valparaíso (CINV), Facultad de Ciencias , Universidad De Valparaíso. Center for Bioinformatics and Integrative Biology (CBIB), Facultad de Ciencias Biolog, Chile

Short Abstract: During the course of evolution, animals and plants have developed an extensive repertoire of toxins as a defensive mechanism against predators or competitors. Most toxins modify ion channel function, by preventing ion fluxes across the pore or by promoting its opening. Many peptide toxins display a conserved structural motif called Inhibitor Cystine Knot (ICK), which has been proposed to improve peptide thermal stability.
The transient receptor potential vanilloid 1 (TRPV1) is a cation-selective ion channel expressed in primary sensory neurons. TRPV1 is a pain receptor modulated by multiple stimuli such as high temperature (>42 °C), low pH (< 6), irritant compounds and peptide toxins. Because of this, and from a biomedical point of view, TRPV1 is a very attractive target for pain relief therapies.
The Double Knot toxin (DkTx) from the spider Ornithoctonus huwena is a peptide toxin that interacts with the extracellular surface of TRPV1, thus promoting channel opening.
Given the importance of finding new modulators of TRPV1 and the availability of structural data for DkTx and TRPV1, we analyzed the sequence variability among peptides containing the ICK motif using a Sequence Similarity Network (SSN) and implemented a multiple mutation protocol on the DkTx structure based on homology modeling and molecular dynamics simulations. Our main findings are: the classification of peptides with unknown function as toxins, their functional association with specific ion channel subsets and the structural characterization of new potential toxins derived from DkTx.

R.V.S. is funded by CONICYT PCHA/Doctorado Nacional 2013-21130631. FDG thanks Fondecyt 1131003, ACT-1107. The authors acknowledge ICM-Economía P09-022-F.



L08 - iCn3D, a Web-based 3D Viewer for Biomolecular Structures
  • Jiyao Wang, National Institutes of Health, United States of America
  • Philippe Youkharibache, National Institutes of Health, United States of America
  • Dachuan Zhang, National Institutes of Health, United States of America
  • Christopher Lanczycki, National Institutes of Health, United States of America
  • Lewis Geer, National Institutes of Health, United States of America
  • Renata Geer, National Institutes of Health, United States of America
  • Aron Marchler-Bauer, National Institutes of Health, United States of America
  • Tom Madej, National Institutes of Health, United States of America
  • Yanli Wang, National Institutes of Health, United States of America
  • Stephen Bryant, National Institutes of Health, United States of America

Short Abstract: With the widespread availability of powerful hardware and mobile computing platforms, the visualization of biomolecular 3D structure is no longer restricted to stand-alone 3D viewer applications and can also be achieved via web-based technologies such as canvas and WebGL. iCn3D (I see in 3D) is a novel structure viewer employing these technologies using the JavaScript libraries Three.js and jQuery. It is based in part on the 3D viewers GLmol, iview, and 3Dmol. Similar to typical stand-alone 3D viewer applications, iCn3D has a powerful user interface and is rich in features. Most importantly, it is designed to provide tight interactions between molecular sequence or sequence alignment views and the 3D view. User selections are synchronized between the two displays and form the basis for customized displays of molecular surfaces and highlighting of substructures. Users can, for example, add customized substructure labels, measure distances, save the current state, and go back and forth between different states, among many other features. For very large structures, iCn3D initially presents a simplified 3D view, with the option to choose a subset to be rendered in full detail.
Example URL: www.ncbi.nlm.nih.gov/Structure/icn3d/full.html?mmdbid=5CCB&showseq=1.

L09 - AraPPISite: a database of fine-grained protein-protein interaction site annotations for Arabidopsis thaliana
  • Hong Li, China Agricultural University, China
  • Shiping Yang, China Agricultural University, China
  • Ziding Zhang, China Agricultural University, China

Short Abstract: Knowledge about protein interaction sites provides detailed information on protein-protein interactions (PPIs). To date, nearly twenty thousand PPIs from Arabidopsis thaliana have been identified. Nevertheless, interaction site information is largely missing from previously published PPI databases. Here we establish AraPPISite, a database that presents fine-grained interaction details for A. thaliana PPIs. First, the experimentally determined 3D structures of 27 A. thaliana PPIs are collected from the Protein Data Bank, and the predicted 3D structures of 3,023 A. thaliana PPIs are modeled using two well-established template-based docking methods. For each experimental/predicted complex structure, AraPPISite not only provides an interactive user interface for browsing interaction sites, but also lists detailed evolutionary and physicochemical properties of these sites. Second, AraPPISite assigns domain-domain interactions or domain-motif interactions to 4,286 PPIs whose 3D structures cannot be modeled. In this case, users can easily query protein interaction regions at the sequence level. AraPPISite is a free and user-friendly database, which does not require user registration or any configuration on local machines. We anticipate that AraPPISite can serve as a helpful resource for users with less experience in structural biology or protein bioinformatics to probe the details of PPIs, and thus accelerate studies of plant genetics and functional genomics. AraPPISite is available at http://systbio.cau.edu.cn/arappisite/index.html.

L10 - Protein Classification using Specific Domain Architectures
  • Zhouxi Wang, National Center for Biotechnology Information, United States of America
  • Gabi Marchler, National Center for Biotechnology Information, United States of America
  • Myra Derbyshire, National Center for Biotechnology Information, United States of America
  • Noreen Gonzales, National Center for Biotechnology Information, United States of America
  • Aron Marchler-Bauer, National Center for Biotechnology Information, United States of America

Short Abstract: Advances in genomic sequencing technology have drastically increased the amount of available sequence data, escalating the need for rapid annotation of genes and protein models. Recently, the Conserved Domain Database curation team has been developing an in-house procedure, the SPecific ARChitecture Labeling Engine (SPARCLE), to study the extent to which protein domain architecture can be utilized to define groups of proteins with similarities in molecular function and to derive corresponding functional characterization. So far, about 3,000 common domain architectures from bacteria have been labelled, and SPARCLE will be made available to the public as a searchable resource. Currently, SPARCLE only considers best-scoring or top-ranked domain hits and is also hampered by imperfect domain annotation. To overcome some of these limitations, we propose an alternative computational procedure for defining clusters of functionally similar proteins that utilizes pre-computed domain annotation from each available source database (COGs, TIGRFAMs, Pfam, and NCBI-curated annotations) for grouping protein sequences, instead of the terse domain annotation currently employed by SPARCLE. This approach provides tunable fine-grained separation of domain architectures, and has been tested on multiple domain architecture families and several genomic datasets. The quality of the resulting classifications has been examined by curators and validated via analysis of the consistency and uniqueness of clusters. We will also discuss the limitations uncovered to date, and hope that this study will identify suitable approaches for rapid, sustainable and increasingly accurate functional labeling of protein models predicted from genomic sequences.

L11 - Natural language processing in text mining for protein docking
  • Varsha D. Badal, The University of Kansas, United States of America
  • Petras J. Kundrotas, The University of Kansas, United States of America
  • Ilya A. Vakser, The University of Kansas, United States of America

Short Abstract: High-throughput sequencing has become rapid and inexpensive, providing a vast amount of protein and DNA sequences for many genomes. The next challenge for biology is to use this information to gain fundamental insights into biomolecular mechanisms. One important direction towards this goal is structural reconstruction of entire interactomes/biological pathways, with subsequent mapping of genetic variants/mutations onto the corresponding structures. Due to inherent limitations of experimental techniques, most structures of protein-protein interactions (PPI) have to be computationally modeled (docked). Protein docking pipelines produce a large number of putative docking models. Identification of near-native models among them is a serious challenge. At the same time, a rapidly growing amount of publicly available information from biomedical research provides constraints on the binding mode, which can be essential for the docking. Recently, we have shown the potential of basic text mining (TM) for protein docking (Badal VD, Kundrotas PJ, Vakser IA, PLoS Comput Biol, 2015, 11: e1004630). Here we present an extension of the TM tool, which utilizes natural language processing (NLP) to analyze residue-containing sentences and their surroundings in the retrieved PubMed abstracts. To generate sentence dependency trees, we used the Stanford parser, and used inverse distances between PPI-relevant keywords and residues mentioned in the abstracts to discriminate the non-interface residues. We tested WordNet, dictionary look-up and deep parsing NLP approaches. The procedure was benchmarked on 579 X-ray bound structures of binary protein complexes and validated in docking of unbound protein structures from the DOCKGROUND resource (http://dockground.compbio.ku.edu).
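
The inverse-distance idea can be sketched on a toy example. The code below assumes a dependency tree is already available as a list of token pairs (here written by hand rather than produced by a parser) and scores a residue mention by summing the inverse tree distances to a set of keywords. The scoring formula, the function names and the example sentence are all illustrative assumptions, not the authors' implementation.

```python
from collections import deque


def tree_distance(edges, source, target):
    """Number of edges on the path between two tokens, or None if unconnected."""
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    queue, seen = deque([(source, 0)]), {source}
    while queue:  # breadth-first search over the undirected tree
        node, dist = queue.popleft()
        if node == target:
            return dist
        for nxt in neighbours.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def inverse_distance_score(edges, residue, keywords):
    """Sum of 1/distance from a residue token to each reachable keyword token."""
    score = 0.0
    for keyword in keywords:
        d = tree_distance(edges, residue, keyword)
        if d:  # skips unreachable keywords (None) and the residue itself (0)
            score += 1.0 / d
    return score


# Hand-written dependency edges for the toy sentence
# "Arg45 binds the partner interface".
toy_edges = [("binds", "Arg45"), ("binds", "interface"),
             ("interface", "the"), ("interface", "partner")]
score = inverse_distance_score(toy_edges, "Arg45", {"binds", "interface"})
```

Under this toy scoring, residues mentioned close to interaction-related words receive higher scores, and a threshold on the score could then separate likely interface residues from the rest.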

L12 - An EM Algorithm for Binding Energy Estimation Using HT-SELEX Data
  • Shuxiang Ruan, Washington University in St. Louis, United States of America
  • S. Joshua Swamidass, Washington University in St. Louis, United States of America
  • Gary Stormo, Washington University in St. Louis, United States of America

Short Abstract: The interaction between transcription factors and DNA plays an important role in gene expression regulation. In this study, we developed an expectation-maximization (EM) algorithm, called EMSEL, for extracting binding motifs from high-throughput SELEX (HT-SELEX) data. EMSEL builds on a comprehensive biophysical model of protein-DNA interactions and is capable of estimating the confidence intervals of the parameters in the model. We compared the binding motifs generated by EMSEL with those estimated by other algorithms using both HT-SELEX and ChIP-seq data. The results demonstrate that the EMSEL motifs generate significantly better predictions of the in vitro data and their predictions of the in vivo data are comparable to the other motifs based on the criterion of the area under the ROC curve (AUC). The ChIP-seq test results, together with the fact that many of the non-EMSEL motifs have very high information content, highlight the limitations of the AUC criterion, which is purely rank-based and fails to take account of the relative binding affinities of ChIP-seq peaks.
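
The remark that AUC is purely rank-based can be checked with a small example. The sketch below computes AUC as the fraction of positive/negative pairs that are ordered correctly, then compares two hypothetical motif models whose scores differ greatly in magnitude but rank the sequences identically. All scores are made-up toy values.

```python
def auc(positive_scores, negative_scores):
    """Fraction of (positive, negative) pairs where the positive scores higher.

    Ties count as half a correctly ordered pair.
    """
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


# Model A: predicted scores for bound and unbound sequences (toy values).
auc_a = auc([9.0, 8.5, 2.0], [3.0, 1.0])

# Model B: same ordering of sequences, very different relative magnitudes.
auc_b = auc([1000.0, 4.0, 2.0], [3.0, 1.0])
```

Both models give the same AUC because only the ordering of scores matters, so a criterion of this kind cannot reward a model for estimating relative binding affinities more accurately.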

L13 - Assessing Impact of Taxonomic Diversity and Sequence Conservation-based Filters on Mirrortree Method
  • Erdem Turk, Mugla Sitki Kocman University, Turkey
  • Baris Suzek, Mugla Sitki Kocman University, Turkey

Short Abstract: Mirrortree is a computational method to predict protein-protein interactions. The basis of the method is coevolution; interacting proteins evolve together and tend to have similar phylogenetic trees. Hence, similarity of phylogenetic trees can be used to answer whether two proteins are interacting with each other or not.
Our goal in this study is to assess the impact of two factors on the Mirrortree method’s prediction of domain-domain interactions: taxonomic diversity and the application of conservation-based filters to multiple sequence alignments (MSAs). For this, we first downloaded PFAM full alignments for domain pairs using a benchmark set previously used for similar experiments (e.g. Relative Co-evolution of Domain Pairs). This resulted in 1,222 PFAM domain pairs. We then randomly picked unique taxa that are common between interacting domains, using different thresholds (ranging from 10 to 50 taxa). For each domain pair and taxa threshold, we computed similarity matrices and correlation coefficients as per the Mirrortree method. The computations were also repeated after removal of less-conserved regions from the MSAs.
We found that, as taxonomic diversity increases, the number of correctly predicted domain-domain interactions decreases, from ~70% to ~41%. On the other hand, removing less-conserved regions from the MSAs, although it improves computation time, does not have a significant impact on the predictions.
In conclusion, while the computation time needed for the Mirrortree method could be improved through application of a sequence conservation-based filter with no prediction performance tradeoff, the taxonomic diversity should be carefully parameterized for optimal performance.
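
The correlation step of a Mirrortree-style comparison can be sketched as follows. The code assumes two pairwise matrices whose rows and columns are ordered by the same taxa, and computes the Pearson correlation between their upper triangles. The toy matrices and function names are hypothetical, and building the matrices from alignments is not shown.

```python
import math


def upper_triangle(matrix):
    """Flatten the entries above the diagonal of a square matrix."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sd_x * sd_y)


def mirrortree_score(matrix_a, matrix_b):
    """Correlate two pairwise matrices defined over the same ordered taxa."""
    return pearson(upper_triangle(matrix_a), upper_triangle(matrix_b))


# Toy matrices for three shared taxa; domain B's values are double domain A's.
domain_a = [[0, 1, 2], [1, 0, 3], [2, 3, 0]]
domain_b = [[0, 2, 4], [2, 0, 6], [4, 6, 0]]
score = mirrortree_score(domain_a, domain_b)
```

A score near 1 suggests the two domains have evolved in parallel across the chosen taxa. With n taxa there are n(n-1)/2 matrix entries to correlate, so the number of taxa selected directly changes how much data enters the coefficient.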

L14 - Conserved Domain Database (CDD): Improving Functional Annotation of Protein Sequences, using SPARCLE, the SPecific ARChitecture Labeling Engine
  • Myra Derbyshire, Computational Biology Branch, National Library of Medicine, National Institutes of Health, United States of America
  • Farideh Chitsaz, Computational Biology Branch, National Library of Medicine, National Institutes of Health, United States of America
  • Noreen Gonzales, Computational Biology Branch, National Library of Medicine, National Institutes of Health, United States of America
  • Marc Gwadz, Computational Biology Branch, National Library of Medicine, National Institutes of Health, United States of America
  • Lianyi Han, Computational Biology Branch, National Library of Medicine, National Institutes of Health, United States of America
  • Jane He, Computational Biology Branch, National Library of Medicine, National Institutes of Health, United States of America
  • Christopher Lanczycki, Computational Biology Branch, National Library of Medicine, National Institutes of Health, United States of America
  • Fu Lu, Computational Biology Branch, National Library of Medicine, National Institutes of Health, United States of America
  • Shennan Lu, Computational Biology Branch, National Library of Medicine, National Institutes of Health, United States of America
  • Gabriele Marchler, Computational Biology Branch, National Library of Medicine, National Institutes of Health, United States of America
  • James Song, Computational Biology Branch, National Library of Medicine, National Institutes of Health, United States of America
  • Narmada Thanki, Computational Biology Branch, National Library of Medicine, National Institutes of Health, United States of America
  • Zhouxi Wang, Computational Biology Branch, National Library of Medicine, National Institutes of Health, United States of America
  • Roxanne Yamashita, Computational Biology Branch, National Library of Medicine, National Institutes of Health, United States of America
  • Chanjuan Zheng, Computational Biology Branch, National Library of Medicine, National Institutes of Health, United States of America
  • Stephen Bryant, Computational Biology Branch, National Library of Medicine, National Institutes of Health, United States of America
  • Lewis Geer, Computational Biology Branch, National Library of Medicine, National Institutes of Health, United States of America
  • Aron Marchler-Bauer, Computational Biology Branch, National Library of Medicine, National Institutes of Health, United States of America

Short Abstract: Domains are distinct functional and/or structural units of a protein. CDD is a collection of annotated multiple sequence alignment models for domains and full-length proteins that are available as position-specific score matrices (PSSMs) for the fast identification of conserved domain footprints in protein sequences via RPS-BLAST. The CDD resource includes NCBI-curated domains, which utilize 3D-structure information to improve model accuracy and provide insights into sequence/structure/function relationships, as well as domain models imported from external source databases such as Pfam and TIGRFAMs. CDD is a redundant collection, and many NCBI-curated domain models reflect specific subfamilies of domains conserved in molecular evolution. Domain architecture (DA) is the sequential order of conserved domains in a protein. Here we use Specific Domain Architecture (SDA), the sequence of models annotating a protein, to group proteins that may have similar molecular and/or cellular functions. Using the curation interface SPARCLE (‘SPecific ARChitecture Labeling Engine’), curators assign names and functional labels (brief descriptions) to SDAs based on the sets of proteins they represent. Focusing initially on bacterial sequences, we have labeled almost 3,000 common SDAs, which cover a significant fraction of bacterial sequences. Importantly, curators record evidence to support their assignments, including representative sequences, conserved domain models, PubMed articles, E.C. numbers, 3D structure records and gene IDs. Labels are reviewed and given final approval before publishing. Labeled architectures, with supporting evidence, will be made available to the public as a searchable resource. This work was supported by the Intramural Research Program of the National Library of Medicine, NIH.
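
Grouping proteins by the ordered sequence of models that annotate them can be sketched in a few lines. The example below assumes domain hits are given as (start position, model name) pairs per protein; the protein IDs, model names and function name are hypothetical placeholders, and this is an illustration of the grouping idea rather than the SPARCLE implementation.

```python
from collections import defaultdict


def group_by_architecture(domain_hits):
    """Group protein IDs by the N- to C-terminal order of their domain models.

    domain_hits maps a protein ID to a list of (start_position, model_name).
    """
    groups = defaultdict(list)
    for protein, hits in domain_hits.items():
        # Sorting by start position gives the sequential order of the models.
        architecture = tuple(model for _, model in sorted(hits))
        groups[architecture].append(protein)
    return dict(groups)


# Hypothetical hits: p1 and p2 share an architecture, p3 has the reverse order.
hits = {
    "p1": [(120, "modelB"), (5, "modelA")],
    "p2": [(3, "modelA"), (90, "modelB")],
    "p3": [(10, "modelB"), (200, "modelA")],
}
groups = group_by_architecture(hits)
```

Because the key is an ordered tuple, proteins with the same models in a different order fall into different groups, which matches the definition of domain architecture as a sequential order.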

L15 - Comprehensive analysis on the evolutionary conservation and structural features of buried polar residues in protein structures.
  • Matsuyuki Shirota, Tohoku University, Japan

Short Abstract: Polar residues are usually exposed on the protein surface, but a small fraction of them are buried in the protein interior. These buried polar residues make intra-molecular hydrogen bonds and play important roles in protein structure and function. In this report, I performed a comprehensive survey of the buried polar residues, which are defined as Ser, Thr, Asn, Asp, Gln, Glu, His, Arg and Lys residues having zero accessible surface area, in the non-redundant protein structures from the Protein Data Bank, focusing on patterns of hydrogen-bond interactions and evolutionary conservation. Compared with surface-exposed ones, the side chains of buried polar residues hydrogen bond to residues distant along the sequence. The interaction partners of the buried polar side chains are dominated by loop residues, and side chain interactions between helices and between sheets are widely observed. In homologous proteins, the buried polar residues were more strongly conserved than the buried non-polar residues, in that a change of side-chain conformation by one methylene group is less tolerated between Asn and Gln or Asp and Glu than similar side-chain changes between the aliphatic residues Val, Ile and Leu. When buried polar residues are replaced by non-polar ones in homologous structures, their hydrogen bond partners also change to non-polar ones. These results indicate the structural specificity and evolutionary importance of the buried polar residues and provide important knowledge for a better understanding of protein structures.
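
The definition of a buried polar residue given above translates directly into a filter. The sketch below assumes per-residue accessible surface area (ASA) values have already been computed by an external tool and are supplied as (residue name, residue number, ASA) tuples; the input layout, function name and toy values are assumptions for illustration.

```python
# The nine residue types named in the abstract, as three-letter codes.
POLAR_TYPES = {"SER", "THR", "ASN", "ASP", "GLN", "GLU", "HIS", "ARG", "LYS"}


def buried_polar_residues(residues):
    """Return (name, number) for polar residues with zero accessible surface area.

    residues is an iterable of (residue_name, residue_number, asa) tuples.
    """
    return [
        (name.upper(), number)
        for name, number, asa in residues
        if name.upper() in POLAR_TYPES and asa == 0.0
    ]


# Toy per-residue ASA values in square angstroms (made up for illustration).
toy_residues = [
    ("SER", 12, 0.0),   # polar and buried: selected
    ("LEU", 13, 0.0),   # buried but non-polar: skipped
    ("ASP", 14, 35.2),  # polar but exposed: skipped
    ("his", 15, 0.0),   # polar and buried, lower-case input: selected
]
buried = buried_polar_residues(toy_residues)
```

In practice a small tolerance instead of an exact zero might be preferable, depending on how the ASA program rounds its output.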

L16 - Utilizing computational chemistry to characterize the functions of Structural Genomics proteins
  • Caitlyn Mills, Northeastern University, United States of America
  • Ramya Parasuram, Northeastern University, United States of America
  • Penny Beuning, Northeastern University, United States of America
  • Mary Jo Ondrechen, Northeastern University, United States of America

Short Abstract: The Protein Structure Initiative resulted in nearly 13,700 Structural Genomics (SG) protein structures deposited in the PDB, but connecting structural information with function proved to be more difficult than originally anticipated. As a result, many of these SG proteins are of unknown biochemical function or have putative functional assignments that are often incorrect. The accumulated structural information from the SG project constitutes a tremendous contribution to structural biology and genomics. However, the addition of more reliable functional predictions for SG proteins would add substantial value to this information. Our approach is based on local structure matching at the computationally predicted active site. First, Partial Order Optimum Likelihood (POOL) uses computed electrostatic and chemical properties to predict the residues in a protein structure that are important for catalysis. Next, Structurally Aligned Local Sites of Activity (SALSA) uses proteins of known function within a given superfamily, with their POOL predictions, to develop unique, spatially-localized consensus signatures for each functional family. We then compare the POOL-predicted residues for each SG protein to the consensus signatures by aligning the residues and scoring the alignment. This score is used to determine the best functional assignment for the SG proteins. This presentation focuses on the Crotonase and 6-Hairpin Glycosidase superfamilies and shows that their misannotation rates are high. In some instances, we provide better functional annotations for the SG proteins and have acquired experimental data supporting our predictions. The goal is to provide a validated approach to functional annotation for wider application by the community.

L17 - Accurate Prediction of Metal Binding Sites in Proteins
  • Frazier Baker, University of Cincinnati, United States of America
  • Alexey Porollo, Children's Hospital Medical Center, United States of America

Short Abstract: Metal ions regulate the folding and function of many proteins. Identification of metal binding sites can help with protein structure prediction and characterization of protein function. To fulfill the need for reliable sequence-based annotation of metal binding proteins, we have developed a new machine learning-based model for metal-binding site prediction. The model is based on coevolution information derived from multiple sequence alignment (MSA). Three amino acid covariance metrics were evaluated: Chi-squared, Mutual Information, and Pearson correlation. All metrics were adjusted for phylogeny bias in the MSA. Features are based on the cumulative properties derived from the most covariant residues for each potential metal binding residue (CDEHNQST). The feature space includes the average of individual conservation scores and the composition of co-varying amino acids. The training set is compiled from metal binding proteins taken from the Metal MACiE database. 1000 datasets with ratios of 1:1 and 2:1 between negative and positive classes were generated. Two machine learning algorithms, C4.5 decision tree and Random Forest (RF), were used to build prediction models. Each model was evaluated using 10-fold cross-validation (CV). The best performing model (23 features, RF, 100 trees) yielded a Matthews correlation coefficient of 0.67 with an overall accuracy of 87.5%, both averaged over 1000 runs of 10-fold CV on the 2:1 ratio dataset. The coevolution-based model with group-based features is superior to other existing models using features derived from individual residues.
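
For readers unfamiliar with the Matthews correlation coefficient reported above, the following is a minimal sketch of how it is computed from a binary confusion matrix. The function name and the toy counts are illustrative assumptions and are not taken from the poster.

```python
import math


def matthews_corrcoef(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts.

    tp/tn/fp/fn are true positives, true negatives, false positives and
    false negatives. Returns 0.0 when any marginal sum is zero, a common
    convention for the otherwise undefined case.
    """
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return numerator / denominator if denominator else 0.0


# Toy counts (hypothetical, not the poster's data).
mcc = matthews_corrcoef(tp=40, tn=80, fp=20, fn=10)
print(round(mcc, 3))
```

Unlike plain accuracy, this measure stays informative when the negative class outnumbers the positive class, as in a 2:1 dataset.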

L18 - A structural view of signaling through the Toll-like receptor pathway and its implications to inflammation and cancer crosstalk
  • Emine Guven-Maiorov, Cancer and Inflammation Program, Leidos Biomedical Research, Inc. Frederick National Laboratory for Cancer Research, National Cancer Institute, Frederick, MD 21702, USA, United States of America
  • Ozlem Keskin, Koc University, Turkey
  • Attila Gursoy, Koc University, Turkey
  • Carter Vanwaes, Clinical Genomic Unit, Head and Neck Surgery Branch, National Institute on Deafness and Communication Disorders, NIH, Bethesda, MD 20892, USA, United States of America
  • Zhong Chen, Clinical Genomic Unit, Head and Neck Surgery Branch, National Institute on Deafness and Communication Disorders, NIH, Bethesda, MD 20892, USA, United States of America
  • Chung-Jung Tsai, Cancer and Inflammation Program, Leidos Biomedical Research, Inc. Frederick National Laboratory for Cancer Research, National Cancer Institute, Frederick, MD 21702, USA, United States of America
  • Ruth Nussinov, Leidos Biomedical Research, NCI, NIH, United States of America

Short Abstract: Although inflammation is crucial for defense against pathogens, if not finely tuned it can also contribute to all phases of tumorigenesis. The TLR pathway plays a central role in inflammation and cancer crosstalk, and construction of the structural pathway provides insights into its mechanism of action in the tumor microenvironment. We constructed the structural TLR pathway, and the architectures that we obtained (i) provide the structural basis for TLR clustering upon stimulation and assembly of key signaling complexes; (ii) demonstrate that almost all downstream parallel pathways are competitive; (iii) show that TIR domain-containing negative regulators (BCAP, SIGIRR, and ST2) interfere with TIR domain signalosome formation; (iv) show that major deubiquitinases (A20, CYLD, and DUBA) prevent association of TRAF6 and TRAF3 with their partners, in addition to removing K63-linked ubiquitin chains that serve as docking platforms for downstream effectors; and (v) illuminate mechanisms of oncogenic mutations. Missense mutations that fall on interfaces and nonsense/frameshift mutations that result in truncated negative regulators disrupt the interactions with their targets, thereby enabling constitutive activation of NF-kB and contributing to chronic inflammation, autoimmune diseases and oncogenesis.

L19 - The Protein Topology Graph Library web server
  • Tim Schäfer, Institute of Computer Science, Department of Molecular Bioinformatics, Johann Wolfgang Goethe-University Frankfurt, Robert-Mayer-Strasse 11–15, 60325 Frankfurt am Main, Germany
  • Ina Koch, Johann Wolfgang Goethe University Frankfurt am Main, Institute of Computer Science, Molecular Bioinformatics, Germany

Short Abstract: The huge number of 3D protein structures available in databases like the PDB requires tools for automated analysis and intuitive visualization of protein structures. Here, we present a new version of the Protein Topology Graph Library (PTGL) web server. The PTGL is a database that uses a graph model to describe proteins. The graphs are based on 3D atom data from the PDB and the SSE assignments of the DSSP algorithm. The new version of the PTGL supports both protein graphs and amino acid graphs, and can now model protein complexes. In protein graphs, the vertices represent secondary structure elements (SSEs) or ligands, and the edges represent their spatial contacts. In amino acid graphs, residues are modeled instead. The PTGL allows for motif search in the graphs, and supports different visualizations.
We rewrote the PTGL from scratch and implemented many new features, including ligand support, an automated update procedure, and an application programming interface (API) which allows for the integration into other software or services. The new PTGL is an updated tool for the analysis of protein topology that supports large-scale investigations. The resulting graph files can be analyzed in standard software. Here, we also present an investigation of the properties of the new amino acid graphs.

L20 - Computational Approaches to Decipher Composition and Regulation of Complexes by Large-scale Analysis of Mass Spectrometry (MS) Data
  • Morteza Chalabi Hajkarim, Department of Mathematics & Computer Science, University of Southern Denmark, Denmark
  • Fabio Vandin, Department of Information Engineering, University of Padova, Italy
  • Veit Schwämmle, Department of Biochemistry and Molecular Biology, University of Southern Denmark, Denmark

Short Abstract: Detailed knowledge about protein complexes is necessary to understand almost all biochemical, signaling and functional processes in the cell. Competitive binding, protein interactions and post-translational modifications control complex activity and coordinate their assembly. Distinction of stable core members from selectively aggregating proteins is necessary to identify complex subunits controlling behavior and composition.

Complex composition has been studied by various protein-protein interaction network-based approaches such as graph clustering and community detection. However, these network-based approaches suffer from the incompleteness of the interactome, even in widely studied organisms such as humans and mice, and therefore many protein complexes are still not identified or well-characterized.

We studied a large data collection containing protein expression profiles measured by mass spectrometry. Using these data, we extracted detailed information about protein complex composition across more than 50 different tissues and cell lines. We designed a novel statistical score discriminating core members from labile components. This score tests the significance of complex composition and concordant behavior in different tissues. Our statistical approach reveals control mechanisms in various protein complexes, and has the potential to accurately predict novel protein complexes as well as to integrate further data from various omics platforms.

L21 - Understanding Sequence-Structure-Function Relationships of Orphan GPCRs through NCBI’s Conserved Domain Database (CDD).
  • James Song, National Institutes of Health, United States of America
  • Noreen Gonzales, National Institutes of Health, United States of America
  • Roxanne Yamashita, National Institutes of Health, United States of America
  • Stephen Bryant, National Institutes of Health, United States of America
  • Aron Marchler-Bauer, National Institutes of Health, United States of America

Short Abstract: CDD is a resource for protein classification and functional annotation, comprising a collection of annotated multiple sequence alignment (MSA) models that represent ancient conserved protein domains, basic units of protein function and evolution. These MSAs are also available as position-specific score matrices (PSSMs) for the rapid identification of conserved domain footprints via RPS-BLAST. CDD imports well-known collections (Pfam, COGs etc.) and supplements them with manually-curated domain models that are organized into family hierarchies. Curators use protein 3D-structure information to refine models and provide insights into sequence-structure-function relationships. We present an annotated hierarchical classification of the seven-transmembrane G-protein coupled receptors (7TM GPCRs), a prominent family of drug therapeutic targets with more than 140 human orphan GPCRs whose endogenous ligands are unknown. With the increasing availability of 3D-structures of diverse 7TM receptors, we recently built a comprehensive comparative evolutionary classification of the highly divergent GPCRs. Orphan GPCR subfamilies, which contain uncharacterized protein sequences, often with poor sequence conservation, have been assigned putative functions with predicted ligand-binding sites and/or the location of 7TM helices annotated by inference from the molecular and physiological functions of known related GPCR proteins with available 3D-structure, from phylogenetic relationships, and/or based on the available literature. We hope that the classification, together with NCBI’s software tools, will aid researchers in the discovery of molecular targets for drug development by providing insights regarding as-yet-unidentified molecular interactions and functional mechanisms. This work was supported by the Intramural Research Program of the National Library of Medicine, NIH.

L22 - Comparative computational analysis of phylloplanin proteins present in different plant species
  • Joanna Burr, University of Tampa, United States of America
  • Dr Padmanabhan Mahadevan, University of Tampa, United States of America

Short Abstract: This investigation aims to determine the evolutionary lineage of, and variation among, phylloplanins present in different plant species. Phylloplanins are highly hydrophobic, basic proteins secreted on the leaf surface (phylloplane) to inhibit spore germination and leaf infection by pathogens. Proteins annotated as phylloplanins were used to search the GenBank protein database. Phylogenetic trees were constructed from these BLAST results. Protein domains were identified in these phylloplanin proteins using the Pfam, CDD, and InterPro databases. The Pollen_Ole_e_I family consists of a number of secreted plant pollen proteins, of approximately 145 residues, whose function has not yet been determined. This analysis enabled us to gain better insight into the evolution of phylloplanins and the similarities of these proteins found in different plant species.

L23 - Are Protein Models Accurate Enough for Identifying Residue Specific Functional Features?
  • Paul Depietro, The Commonwealth Medical College, United States of America
  • Emily Holzman, The Commonwealth Medical College, United States of America
  • Juergen Haas, University of Basel & SIB Swiss Institute of Bioinformatics, Switzerland
  • Torsten Schwede, University of Basel & SIB Swiss Institute of Bioinformatics, Switzerland
  • William Mclaughlin, The Commonwealth Medical College, United States of America

Short Abstract: Experimental 3-dimensional structures are currently known only for a fraction of all known protein sequences. However, for proteins having primary sequences sufficiently similar to those with a known structure, homology modeling techniques offer means to predict their three-dimensional structures. One interesting question in this context is whether these modeling techniques are sufficiently accurate to support the identification of residue-specific functional features from protein model coordinates. We developed a way to evaluate the accuracy of the different modeling techniques based on the matches between functional site predictions in the determined structures and the modeled counterparts using the FEATURE function prediction program by Bagley and Altman. We utilized the collaborative efforts of the Protein Model Portal and Continuous Automated Model Evaluation (CAMEO) to obtain a set of structural models generated through various modeling techniques. Each modeling technique was thereby analyzed with regard to its ability to accurately reconstitute the local microenvironment corresponding to a particular small-molecule binding site or enzyme active site evaluated by FEATURE. Sensitivity and specificity measures were calculated on a per-residue basis, which enabled the detection of local differences in the modeled versus the experimentally determined reference structures. Accuracy of the modeling techniques was assessed for subsets of the data reflecting the difficulty of modeling the target protein.

Financial support was provided in part by the NIGMS [grant number 5U01 GM093324-02] and pilot project grant at The Commonwealth Medical College.
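
The per-residue sensitivity and specificity mentioned in this abstract can be illustrated with a short sketch. It assumes that functional-site predictions are available as sets of residue indices for the reference structure and for the model. The function name, the input representation, and the example indices are hypothetical and do not reflect the FEATURE program's actual output format.

```python
def sensitivity_specificity(reference_sites, model_sites, n_residues):
    """Per-residue sensitivity and specificity of model-derived site calls.

    reference_sites: residue indices flagged in the experimental structure.
    model_sites: residue indices flagged in the modeled structure.
    n_residues: total number of residues compared.
    """
    reference = set(reference_sites)
    model = set(model_sites)
    tp = len(reference & model)        # flagged in both
    fn = len(reference - model)        # missed by the model
    fp = len(model - reference)        # flagged only in the model
    tn = n_residues - tp - fn - fp     # flagged in neither
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Hypothetical residue indices for a 100-residue protein.
sens, spec = sensitivity_specificity({10, 11, 12, 40}, {10, 11, 41}, 100)
print(sens, spec)
```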

L24 - Assessing accuracies of segmentation-based methods for RNA secondary structure prediction
  • Gerardo Cardenas, University of Texas at El Paso, United States
  • Ming-Ying Leung, University of Texas at El Paso, United States

Short Abstract: RNA secondary structure prediction has become an important area of interest in biology and medicine because it helps in understanding many biological processes and in designing RNA-based therapies to treat various diseases such as cancers and AIDS. Different thermodynamics-based computational algorithms for RNA structure prediction exist, and have been used to help understand disease mechanisms and design treatments. However, most of the computational tools that can predict complex pseudoknot structures have a sequence length limitation of a few hundred nucleotide bases due to their high demands on computer resources. Yet, many RNA molecules, such as those making up viral genomes, are thousands of bases long. To overcome the sequence length limitation, a segmentation approach was previously proposed to cut a long RNA into shorter chunks at strategic positions that conserve inversion patterns in the nucleotide sequence, predict the structure of each chunk independently by existing programs like pknotsRG and RNAstructure, and then combine the results to build the final prediction of the entire RNA. In the present study, we investigated whether the prediction accuracy of the segmentation approach could be improved by capturing possible structures formed between two neighboring chunks that would be missed by the previous single-chunk method. Using 136 sequences with known structures obtained from Rfam, we compared the overall prediction accuracies of these segmentation-based methods. When the chunk size was 90 bases or more, the single- and two-chunk methods were found to be not statistically different in prediction accuracy.

L25 - ProQ3: Improved model quality assessments using Rosetta energy terms
  • Karolis Uziela, Stockholm University, Sweden
  • Nanjiang Shu, Bioinformatics Infrastructure for Life Science (BILS), Sweden
  • Björn Wallner, Linköping University, Sweden
  • Arne Elofsson, Stockholm University, Sweden

Short Abstract: Motivation: Assessing the quality of a protein model, i.e. estimating how close it is to its native structure, using no other information than the structure of the model has been shown to be useful for structure prediction. The state-of-the-art method, ProQ2, is based on a machine learning approach that uses a number of features calculated from a protein model. Here, we examine if these features can be exchanged with energy terms calculated from Rosetta and if a combination of these terms can improve the quality assessment.
Results: When using the full atom energy function from Rosetta in ProQRosFA, the QA is on par with our previous state-of-the-art method, ProQ2. The method based on the low-resolution centroid scoring function, ProQRosCen, performs almost as well, and the combination of all three methods, ProQ2, ProQRosFA and ProQRosCen, into ProQ3 shows superior performance over ProQ2.
Availability: ProQ3 is freely available as a webserver: http://proq3.bioinfo.se/

L26 - How to more accurately compute protein residue contacts?
  • Pedro Martins, UFMG, Brazil
  • Vinícius Mayrink, UFMG, Brazil
  • Sabrina Silveira, UFV, Brazil
  • Carlos Silveira, UNIFEI, Brazil
  • Leonardo Lima, Universidade Federal de São João Del Rei, Brazil
  • Raquel Melo-Minardi, Universidade Federal de Minas Gerais, Brazil

Short Abstract: Computing contacts in proteins is important to several types of studies, from Bioinformatics to Structural Biology. An accurate computation of contacts is essential to the correctness and reliability of applications involving folding prediction, protein structure prediction, structural quality assessment, contact network analysis, thermodynamic stability prediction, protein-protein and protein-ligand interactions, docking and so forth. In this work, we built a large database of contacts using about 45,000 PDB files to compare three paradigms for contact prospection at the atomic level: distance-based only, distance and geometric-based (occlusion free) and distance and angulation-based.
The main contribution of this paper is a critical evaluation of the different paradigms that may be used to compute contacts between protein atoms. We focused on protein-protein interfaces and analysed four types of contacts namely hydrogen bonds, aromatic stackings, hydrophobic and ionic (attractive) interactions. We scanned for possible contacts in the range from 0 to 7 Å. Our data showed the importance of a geometric approach to filter out spurious occluded contacts after about 3.5 Å for aromatic stackings, hydrophobic and ionic interactions. For hydrogen bonds the angulation criteria presented more reliable results at every distance in the studied interval.
We provide the database with all computed contacts and the source codes used to populate the database.
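
As an illustration of the simplest of the three paradigms, the distance-based one, the sketch below lists all atom pairs within a cutoff. It is not the authors' code. The atom representation, the function name, and the default cutoff (set to the 7 Å upper bound of the scanned range) are assumptions, and the occlusion and angle filters discussed in the abstract are deliberately left out.

```python
import math
from itertools import combinations


def distance_contacts(atoms, cutoff=7.0):
    """Return (id_a, id_b, distance) for atom pairs within `cutoff` angstroms.

    atoms: list of (atom_id, (x, y, z)) tuples. This brute-force loop is
    quadratic in the number of atoms; a real pipeline would likely use a
    spatial index such as a k-d tree or a cell grid.
    """
    contacts = []
    for (id_a, xyz_a), (id_b, xyz_b) in combinations(atoms, 2):
        d = math.dist(xyz_a, xyz_b)  # Euclidean distance (Python 3.8+)
        if d <= cutoff:
            contacts.append((id_a, id_b, d))
    return contacts


# Hypothetical coordinates for three atoms.
atoms = [
    ("A:CA", (0.0, 0.0, 0.0)),
    ("B:CA", (3.0, 4.0, 0.0)),
    ("B:CB", (10.0, 0.0, 0.0)),
]
print(distance_contacts(atoms))
```

A distance-only criterion accepts every pair inside the cutoff, including pairs with another atom lying between them, which is the kind of spurious contact that a geometric filter is meant to remove.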

L27 - RCSB Protein Data Bank: Views of structural biology for basic and applied research
  • Peter W Rose, RCSB Protein Data Bank, UC San Diego, United States
  • Chunxiao Bi, RCSB Protein Data Bank, UC San Diego, United States
  • Cole H Christie, RCSB Protein Data Bank, UC San Diego, United States
  • Jose M Duarte, RCSB Protein Data Bank, UC San Diego, United States
  • Zukang Feng, RCSB Protein Data Bank, Rutgers The State University of New Jersey, United States
  • Rachel Kramer Green, RCSB Protein Data Bank, Rutgers The State University of New Jersey, United States
  • Tara Kalro, RCSB Protein Data Bank, UC San Diego, United States
  • Andreas Prlic, RCSB Protein Data Bank, UC San Diego, United States
  • Chris Randle, RCSB Protein Data Bank, UC San Diego, United States
  • John D Westbrook, RCSB Protein Data Bank, Rutgers The State University of New Jersey, United States
  • Jesse Woo, RCSB Protein Data Bank, UC San Diego, United States
  • Huanwang Yang, RCSB Protein Data Bank, Rutgers The State University of New Jersey, United States
  • Jasmine Young, RCSB Protein Data Bank, Rutgers The State University of New Jersey, United States
  • Christine Zardecki, RCSB Protein Data Bank, Rutgers The State University of New Jersey, United States
  • Helen M Berman, RCSB Protein Data Bank, Rutgers The State University of New Jersey, United States
  • Stephen K Burley, RCSB Protein Data Bank, Rutgers The State University of New Jersey, United States

Short Abstract: The RCSB Protein Data Bank (RCSB PDB, http://www.rcsb.org) provides rich structural views of biological systems to enable breakthroughs in scientific inquiry, medicine, drug discovery, technology, and education. The website offers multiple tools for structure query, analysis, and visualization.

Users can perform simple searches from the top search bar (e.g., ID, name, sequence, ligand) or build complex combinations of search parameters using Advanced Search. Information from DrugBank is integrated with PDB data to facilitate searches for drugs and drug targets. Other classification systems are used to organize PDB structures in hierarchical trees for browsing and searching (e.g., Membrane Protein Annotation, Gene Ontology, Enzyme Classification).

Visualization features include Protein Feature View, a graphic comparison of a PDB sequence with UniProt and other annotations; Gene View, a tool that illustrates the correspondences between the human genome and 3D structure; and 3D visualization of electron density maps for bound ligands.

The RCSB PDB is funded by a grant (DBI-1338415) from the National Science Foundation, the National Institutes of Health, and the US Department of Energy. RCSB PDB is a member of the Worldwide Protein Data Bank (http://wwpdb.org).

L28 - Tertiary Structural Propensities Reveal Basic Sequence-Structure Relationships in Proteins
  • Fan Zheng, Dartmouth College, United States of America
  • Jian Zhang, Dartmouth College, United States of America
  • Gevorg Grigoryan, Dartmouth College, United States of America

Short Abstract: The Protein Data Bank (PDB) is a key source of the general principles that have shaped our understanding of protein structure. Most of the existing statistical generalizations of protein structures are made for secondary structures, which are often too generic to satisfy many specific design goals, or for protein domains, for which the PDB distribution is highly biased by evolution or human sampling and is thus not physically meaningful. To fill this gap, we proposed local tertiary motifs (TERMs) as a new fundamental level of structural unit. TERMs are combinations of non-continuous small secondary fragments connected by inter-residue contacts. We hypothesized that the PDB contains valuable quantitative information on the level of TERMs. We studied the propensities of TERMs within their corresponding ensembles, i.e. geometrically similar structural fragments from completely unrelated proteins. The TERM propensities are physically meaningful in many contexts. By breaking a protein structure into its constituent TERMs, we can evaluate the accuracy of structure-prediction models and identify poorly predicted regions via a metric we named “structure score”, which captures the sequence-structure relationships in TERMs. Also, querying TERMs affected by point mutations enables straightforward prediction of mutational free energies. Our performance exceeds or is comparable to that of state-of-the-art methods. Our results suggest that the data in the PDB are now sufficient to enable the quantification of complex structural features, such as those associated with entire TERMs. This should present opportunities for advances in computational structural biology techniques, including structure prediction and design.

L29 - Measuring Distance between Protein Functions on Gene Ontology with a New Metric
  • Ruiyu Yang, Indiana University Bloomington, United States of America
  • Yuxiang Jiang, Indiana University Bloomington, United States of America
  • Matthew W. Hahn, Indiana University, United States of America
  • Elizabeth A. Housworth, Indiana University, United States of America
  • Predrag Radivojac, Indiana University, United States of America

Short Abstract: We propose new metrics on sets, ontologies and functions that can be used in various stages of probabilistic modeling, including exploratory data analysis, learning, inference, and result interpretation. The completeness of the proposed metric spaces has also been proved. These new metric functions unify and generalize some of the popular metrics on sets and functions, such as the Jaccard and bag distances on sets and the Marczewski-Steinhaus distance on functions. As a special case and direct application of this new metric, information-theoretic metrics are then introduced on directed acyclic graphs (such as Gene Ontology) drawn independently according to a fixed probability distribution, and we show how they can be used to calculate similarity between class labels for objects with hierarchical output spaces (e.g., protein function). Finally, we provide evidence that the proposed metrics are useful by clustering species based solely on functional annotations available for subsets of their genes. The functional trees resemble evolutionary trees obtained by the phylogenetic analysis of their genomes.
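
The Jaccard distance on sets, one of the classical metrics that the abstract says the new family generalizes, can be sketched as follows. This shows only the standard Jaccard distance, not the new metric proposed in the poster. The function name and the toy term labels are illustrative assumptions.

```python
def jaccard_distance(a, b):
    """Jaccard distance between two finite sets: 1 - |A ∩ B| / |A ∪ B|.

    Two empty sets are treated as identical (distance 0.0) by convention.
    """
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


# Hypothetical annotation sets for two proteins (labels are made up).
protein_1 = {"term_a", "term_b", "term_c"}
protein_2 = {"term_b", "term_c", "term_d"}
print(jaccard_distance(protein_1, protein_2))
```

The distance is 0 for identical sets and 1 for disjoint non-empty sets, and it satisfies the triangle inequality, so it is a true metric.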

L30 - Critical Assessment of Function Annotation: Moving Forward with the “State-of-the-art”
  • Yuxiang Jiang, Indiana University Bloomington, United States of America
  • Iddo Friedberg, Iowa State University, United States of America
  • Predrag Radivojac, Indiana University, United States of America

Short Abstract: The increasing volume and variety of genotypic and phenotypic data is a major defining characteristic of modern biomedical sciences. Assigning functions to biological macromolecules, especially proteins, turns out to be one of the major challenges in understanding life on a molecular level. While molecular experiments provide the most reliable annotation of proteins, their relatively low throughput and restricted purview have led to an increasing role for computational function prediction. However, properly assessing methods for protein function prediction and tracking progress in the field remain challenging as well.

Here we report the results of the second Critical Assessment of Functional Annotation (CAFA), a timed challenge to assess computational methods that automatically assign protein function. One hundred twenty-six methods from 56 research groups were evaluated for their ability to predict biological functions using the Gene Ontology and gene-disease associations using the Human Phenotype Ontology on a set of 3,681 proteins from 18 species. CAFA2 featured significantly expanded analysis compared with CAFA1, with regard to data set size, variety, and assessment metrics. According to our assessment, the top performing methods in CAFA2 outperformed the best methods from CAFA1, demonstrating that computational function prediction is improving. The assessment also revealed that the definition of top performing algorithms is ontology specific, that different performance metrics can be used to probe the nature of accurate predictions, and the relative diversity of predictions in the biological process and human phenotype ontologies.

L31 - Structural characterization of antibody Next Generation Sequencing data
  • Jaroslaw Nowak, University of Oxford, United Kingdom of Great Britain and Northern Ireland
  • Terry Baker, UCB Pharma Ltd, United Kingdom of Great Britain and Northern Ireland
  • Guy Georges, Roche Diagnostics GmbH, Germany
  • Stefan Klostermann, Roche Diagnostics GmbH, Germany
  • Jiye Shi, UCB Pharma Ltd, United Kingdom of Great Britain and Northern Ireland
  • Bojana Popovic, MedImmune, United Kingdom of Great Britain and Northern Ireland
  • Charlotte Deane, University of Oxford, United Kingdom of Great Britain and Northern Ireland

Short Abstract: Antibodies are proteins produced by the immune system to act upon immunogenic molecules, known as antigens. Human antibodies are composed of two chains – Heavy and Light. The antigen binding site of a typical antibody is made up of six loops known as Complementarity-Determining Regions (CDRs). Three of those loops are located on the Heavy chain (H1-H3) and three on the Light chain (L1-L3). Five out of six CDRs (L1, L2, L3, H1 and H2) form only a small number of discrete conformations called canonical classes.
Our results show that all CDR types have structurally similar loops of different lengths. Based on these findings, we have created length-independent canonical classes for the non-H3 CDRs. Using these length-variable classes we have predicted canonical class membership of the CDRs from a Next-Generation Sequencing (NGS) dataset of human IgM antibodies containing over 10,000,000 Light chain sequences and over 5,000,000 Heavy chain sequences. We find that due to differences in CDR length distribution between available structural data and the sequence data our length-independent approach classifies more sequences into classes than the standard, length-dependent approach. Using statistical and machine learning methods, we have also clustered the CDR sequences to investigate how well we can reconstruct the canonical classes from sequence data alone.
Overall, our analysis is one of the most comprehensive attempts at quantifying the range of CDR structural variability in the naïve human antibody repertoire.

L32 - A sequence order-independent clique-matching approach for the comparison of protein binding sites
  • Fernando Gutiérrez, Pontificia Universidad Católica de Chile, Chile
  • Andreas Schüller, Pontificia Universidad Católica de Chile, Chile

Short Abstract: Predicting the macromolecular targets of small molecule compounds is important for drug discovery in order to flag off-targets, identify new targets of known drugs (drug repositioning) and deorphanize ligands without known targets. Here, we present a new method for target prediction based on the three-dimensional comparison of protein-ligand binding sites (“pockets”). Pockets are represented as clouds of atoms, namely the Cartesian coordinates of Cα and hydrogen bond donor/acceptor atoms of residues lining the protein cavities. These pockets are then compared by a sequence order-independent clique-matching algorithm. Finally, a pocket similarity score is calculated based on the number of aligned atoms and their root-mean-square distance. We devised a benchmark based on 201 high-resolution protein-ligand complexes with known binding affinity, employed our method to retrieve related pockets, and analyzed the results by means of receiver operating characteristic (ROC) analysis. Best results were obtained for the structurally conserved binding sites of serine proteases (area under the ROC curve, AUC=0.99) and metalloproteases (AUC=0.95), while we obtained good results for the structurally diverse binding sites of glycosylases (AUC=0.79). We compared our method with the published algorithm APoc and demonstrate that our method performs favorably. In summary, we present a fast, sequence order-independent clique-matching approach for the comparison of protein pockets with straightforward application in small molecule target prediction.
Acknowledgments: FONDECYT 1161798.
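
The abstract does not give implementation details, so the following is only a minimal sketch of how sequence order-independent clique matching between two pockets can work in general, not the authors' method. It assumes each pocket is a list of (atom type, coordinates) entries, pairs atoms of the same type into a correspondence graph, connects two pairings when the intra-pocket distances agree within a tolerance, and extracts a maximum clique with Bron–Kerbosch. The names `correspondence_graph`, `max_clique`, the tolerance `tol`, and the toy coordinates are all hypothetical.

```python
from itertools import combinations
from math import dist


def correspondence_graph(pocket_a, pocket_b, tol=1.0):
    """Build the correspondence graph of two pockets.

    Each pocket is a list of (atom_type, (x, y, z)) tuples (assumed format).
    A node (i, j) pairs atom i of pocket A with atom j of pocket B of the same type.
    Two nodes are linked when the A-side and B-side distances agree within `tol`.
    """
    nodes = [(i, j)
             for i, (type_a, _) in enumerate(pocket_a)
             for j, (type_b, _) in enumerate(pocket_b)
             if type_a == type_b]
    adj = {node: set() for node in nodes}
    for (i, j), (k, l) in combinations(nodes, 2):
        if i == k or j == l:
            continue  # an atom may be matched only once
        d_a = dist(pocket_a[i][1], pocket_a[k][1])
        d_b = dist(pocket_b[j][1], pocket_b[l][1])
        if abs(d_a - d_b) <= tol:
            adj[(i, j)].add((k, l))
            adj[(k, l)].add((i, j))
    return adj


def max_clique(adj):
    """Return one maximum clique (Bron-Kerbosch with pivoting) as a sorted list."""
    best = []

    def expand(r, p, x):
        nonlocal best
        if not p and not x:
            if len(r) > len(best):
                best = list(r)
            return
        pivot = max(p | x, key=lambda v: len(adj[v] & p))
        for v in list(p - adj[pivot]):
            expand(r + [v], p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand([], set(adj), set())
    return sorted(best)


# Toy example: pocket B is pocket A translated, listed in reverse order,
# with one extra acceptor atom far away.
pocket_a = [("CA", (0, 0, 0)), ("DON", (4, 0, 0)), ("ACC", (0, 5, 0)), ("CA", (0, 0, 7))]
pocket_b = [("CA", (10, 10, 17)), ("ACC", (10, 15, 10)), ("DON", (14, 10, 10)),
            ("CA", (10, 10, 10)), ("ACC", (30, 30, 30))]
matches = max_clique(correspondence_graph(pocket_a, pocket_b))
print(len(matches), matches)  # 4 [(0, 3), (1, 2), (2, 1), (3, 0)]
```

Because only inter-atomic distances are compared, the match does not depend on the order in which atoms or residues are listed. A similarity score along the lines described in the abstract would then combine the clique size with a distance deviation over the matched atoms; the exact formula is not stated in the abstract and is left out here.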

L33 - Protein Frustratometer 2: a tool to localize energetic frustration in protein molecules, now with electrostatics
  • Rodrigo Gonzalo Parra, Protein Physiology Laboratory, Dep de Quimica Biologica, Facultad de Ciencias Exactas y Naturales, UBA-CONICET-IQUIBICEN, Buenos Aires, Argentina
  • Nicholas P. Schafer, Interdisciplinary Nanoscience Center, Department of Molecular Biology and Genetics, Aarhus University, DK-8000 Aarhus, Denmark
  • Leandro Radusky, Structural Bioinformatics Group, Dep de Quimica Biologica, Facultad de Ciencias Exactas y Naturales, UBA-CONICET-IQUIBICEN, Buenos Aires, Argentina
  • Min-Yeh Tsai, Center for Theoretical Biological Physics and Department of Chemistry, Rice University, Houston, Texas, United States of America
  • A. Brenda Guszovsky, Protein Physiology Laboratory, Dep de Quimica Biologica, Facultad de Ciencias Exactas y Naturales, UBA-CONICET-IQUIBICEN, Buenos Aires, Argentina
  • Peter G. Wolynes, Center for Theoretical Biological Physics and Department of Chemistry, Rice University, Houston, Texas, United States of America
  • Diego U. Ferreiro, FCEyN-Univ Buenos Aires, Argentina

Short Abstract: Natural protein molecules are highly evolved systems. Spontaneous folding of individual proteins and recognition between polypeptides leading to well-defined structural ensembles are fundamental concepts in the biology of macromolecules, the specificity of which is explained by the “principle of minimal frustration”. This insight has led to multiple developments in the understanding of protein folding and function. The minimal frustration principle does not rule out that some energetic frustration may be present in a folded protein. Moreover, such frustration may not be a random occurrence but an evolved characteristic that facilitates motion of the protein around its native basin and binding to appropriate partners, and it is thought to be fundamental to protein function. We have developed theoretical methods for spatially localizing and quantifying the energetic frustration present in native proteins. These have proven useful in the study of binding interfaces, allosteric transitions, aggregation, ligand binding, and conformational dynamics, and have been related to evolutionary patterns and disease-related polymorphisms.

The new Protein Frustratometer server is based on the associative memory, water mediated, structure and energy model (AWSEM). AWSEM provides a transferable, coarse-grained, non-additive force field that is able to predict the native structures of many proteins and protein complexes from sequence information. Recently, electrostatic forces have been included in the AWSEM suite and have been shown to play a role in modulating the asperities of the folding and binding landscapes.

Along with a significant speed-up of the calculations, the new server makes it possible to analyze the local frustration that arises from electrostatic interactions.

L34 - Molecular Dynamics Simulations of Glycosylated HIV-1 gp120 Trimers in the Context of Viral Coreceptor Tropism
  • Natasha Wood, University of Cape Town, South Africa
  • Simon Travers, SANBI, University of the Western Cape, South Africa

Short Abstract: The HIV-1 envelope surface protein is covered with N-linked glycosylation sites, and glycans (carbohydrates) contribute more than half of its molecular weight. The glycans render the protein surface mostly undetectable to the host immune system, but specific glycans have been identified that form part of viral epitopes for broadly cross-neutralising antibodies, and glycans have also been implicated in chemokine receptor (coreceptor) tropism. Despite their abundance and importance, the extent to which N-linked glycans influence antibody and coreceptor binding, which is essential for productive infection, is not well documented. Using molecular simulation techniques, we have previously shown that the dynamics of gp120 are substantially affected by the presence of glycans at specific N-linked glycosylation sites.

Here, we use CCR5- and CXCR4-tropic viral sequences to explore the effect of the glycan distribution on HIV-1 coreceptor tropism. We have modelled six Env trimer structures using three pairs of phenotyped (using Trofile®) gp160 sequences, representing subtype A, C and D infections. Oligomannose (Man9) glycans were attached to N-linked glycosylation sites of each structure and we used AMBER to produce molecular dynamics simulations of the modelled structures. The preliminary results reveal the degree to which the glycan composition and density around key regions of HIV-1 gp120 and gp41 impact the tropism-associated dynamics of the protein.

These results present a unique view on how the glycan-protein, as well as the glycan-glycan, interactions of the HIV-1 envelope trimer may modulate the infectivity and immunogenicity of the virion.

L35 - Decomposition of protein structures into structural domains based on hydrophobic core detection
  • Mohammad Taheri, Shahid Beheshti University, Iran (Islamic Republic of)
  • Elnaz Saberi Ansari, Institute for Research in Fundamental Sciences (IPM), Iran (Islamic Republic of)
  • Changiz Eslahchi, Shahid Beheshti University, Iran (Islamic Republic of)

Short Abstract: Automatic decomposition of protein structures into structural domains has been widely examined. To date, various clustering algorithms have been presented for protein decomposition. A main challenge in some of these algorithms is defining the "stopping criteria", namely the condition under which the algorithm should finish. Prior knowledge of the number of domains in a protein structure can play a key role in the stopping criteria. Some domain assignment algorithms use knowledge of the number of domains, and such algorithms generally show better accuracy than algorithms that do not use this kind of knowledge. We introduce a new algorithm for the domain assignment problem based on dividing the problem into two sub-problems: (1) Obtaining and clustering the core residues of a protein structure such that every cluster represents the hydrophobic core of a domain. Under the assumption that every domain has a hydrophobic core, a one-to-one correspondence between clusters and domains is expected. By solving this sub-problem, we transform the domain assignment problem from a clustering problem into a classification problem. Moreover, as the number of domains is obtained implicitly in this step, this information can be used with algorithms that take the number of domains as prior information. (2) Classification of the non-core residues. To evaluate the algorithm, we compare its results with those of several state-of-the-art algorithms on benchmark datasets. The evaluation shows that our new algorithm achieves accuracy competitive with the best algorithms presented to date.
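
The two sub-problems can be illustrated with a small sketch. It is not the authors' algorithm: the abstract does not say how core residues are detected, how they are clustered, or which classifier is used. The sketch assumes that a core/non-core flag is already available per residue, clusters core residues by single linkage with a hypothetical distance threshold `link`, and classifies each non-core residue to the nearest core centroid. The function names and toy coordinates are also hypothetical.

```python
from math import dist


def cluster_core(core_coords, link=6.0):
    """Step 1 (assumed variant): single-linkage clustering of core residues.

    Core residues closer than `link` (in Å) end up in the same cluster.
    Returns one label per core residue and the number of clusters.
    """
    labels = [-1] * len(core_coords)
    n_clusters = 0
    for start in range(len(core_coords)):
        if labels[start] != -1:
            continue
        labels[start] = n_clusters
        stack = [start]
        while stack:
            i = stack.pop()
            for j, other in enumerate(core_coords):
                if labels[j] == -1 and dist(core_coords[i], other) <= link:
                    labels[j] = n_clusters
                    stack.append(j)
        n_clusters += 1
    return labels, n_clusters


def assign_domains(coords, is_core, link=6.0):
    """Step 2 (assumed variant): give every non-core residue the label of the nearest core centroid."""
    core_idx = [i for i, flag in enumerate(is_core) if flag]
    if not core_idx:
        raise ValueError("at least one core residue is required")
    labels, k = cluster_core([coords[i] for i in core_idx], link)

    # Centroid of each hydrophobic-core cluster.
    centroids = []
    for d in range(k):
        members = [coords[i] for i, lab in zip(core_idx, labels) if lab == d]
        centroids.append(tuple(sum(axis) / len(members) for axis in zip(*members)))

    domain = [None] * len(coords)
    for i, lab in zip(core_idx, labels):
        domain[i] = lab
    for i, point in enumerate(coords):
        if domain[i] is None:
            domain[i] = min(range(k), key=lambda d: dist(point, centroids[d]))
    return domain, k


# Toy structure: two well-separated cores, each with one nearby non-core residue.
coords = [(0, 0, 0), (3, 0, 0), (5, 2, 0), (30, 0, 0), (33, 0, 0), (28, 1, 0)]
is_core = [True, True, False, True, True, False]
print(assign_domains(coords, is_core))  # ([0, 0, 0, 1, 1, 1], 2)
```

Note how the number of domains falls out of the first step as the number of core clusters, so the second step needs no stopping criterion.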

L36 - Changes to Dynamics upon Oligomerization Identify Key Functional Protein Sites
  • Sambit Mishra, Iowa State University, United States
  • Kannan Sankar, Iowa State University, United States
  • Robert Jernigan, Iowa State University, United States

Short Abstract: Oligomerization is the assembly of protein subunits to form a complex functional biological macromolecule, an oligomer. It is one of the fundamental means through which nature equips proteins with the ability to perform complex functions and attain greater stability. Oligomers can exist either as an assemblage of identical protein subunits, termed homooligomers, or as a mosaic of heterogeneous subunits, termed heterooligomers. In this study, we investigate the dynamic effect of oligomerization and its functional significance on a set of 145 diverse homooligomeric proteins. We employ an elastic network model to inspect the change in residue fluctuations upon oligomerization and then couple it with a residue conservation score to understand the functional significance of regions with altered dynamics. The study reveals the importance of sites whose fluctuations are dampened after oligomerization. These sites can be located either in the interface or in the non-interface regions of the oligomeric assembly and can harbor key functional residues. A case study on bovine glutamate dehydrogenase further confirms that these residues can serve as orthosteric ligand binding sites. This study introduces a novel approach for identifying functional residues in oligomeric proteins, which can be further investigated as potential drug targets.
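
As an illustration of the fluctuation comparison, the sketch below uses a Gaussian network model (GNM), one common elastic network model. The abstract does not say which variant was used, so this choice is an assumption, as are the function names, the 7.3 Å contact cutoff, and the toy coordinates. In a GNM, relative residue fluctuations are proportional to the diagonal of the pseudo-inverse of the Kirchhoff (contact) matrix; the change on oligomerization is the difference between the values computed for a subunit inside the assembly and for the isolated subunit.

```python
import numpy as np


def gnm_fluctuations(coords, cutoff=7.3):
    """Relative mean-square fluctuations from a Gaussian network model.

    `coords` holds one position per residue (e.g. C-alpha atoms), in Å.
    The cutoff is an assumed, commonly used order of magnitude.
    """
    coords = np.asarray(coords, dtype=float)
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    contact = (dist <= cutoff) & ~np.eye(len(coords), dtype=bool)

    # Kirchhoff matrix: -1 for residues in contact, contact count on the diagonal.
    kirchhoff = -contact.astype(float)
    np.fill_diagonal(kirchhoff, contact.sum(axis=1))

    # Fluctuations are proportional to the diagonal of the pseudo-inverse.
    return np.diag(np.linalg.pinv(kirchhoff)).copy()


def fluctuation_change(monomer, oligomer, cutoff=7.3):
    """Fluctuation of the first subunit in the oligomer minus that of the free monomer.

    Assumes the first len(monomer) rows of `oligomer` are the same residues
    as `monomer`. Negative values mark residues dampened by oligomerization.
    """
    n = len(monomer)
    return gnm_fluctuations(oligomer, cutoff)[:n] - gnm_fluctuations(monomer, cutoff)


# Toy example: a 5-residue straight chain, and a "dimer" in which a second
# identical chain is placed end-to-end so that residue 4 sits at the interface.
monomer = [[i * 3.8, 0.0, 0.0] for i in range(5)]
dimer = [[i * 3.8, 0.0, 0.0] for i in range(10)]
delta = fluctuation_change(monomer, dimer)
print(np.round(delta, 2))  # only the interface residue (index 4) is negative
```

In an analysis like the one described, the residues with negative change would then be cross-referenced with a conservation score; that step is not sketched here.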

L37 - Integrative mapping of metabolic pathways
  • Sara Calhoun, University of California San Francisco, United States of America
  • Magdalena Korczynska, University of California San Francisco, United States of America
  • Daniel Wichelecki, University of Illinois Urbana-Champaign, United States of America
  • Brian San Francisco, University of Illinois Urbana-Champaign, United States of America
  • Dmitry Rodionov, Sanford-Burnham Medical Research Institute, United States of America
  • Nawar Al-Obaidi, Albert Einstein College of Medicine, United States of America
  • Matthew O'Meara, University of California San Francisco, United States of America
  • Steven Almo, Albert Einstein College of Medicine, United States of America
  • Andrei Osterman, Sanford-Burnham Medical Research Institute, United States of America
  • John Gerlt, University of Illinois Urbana-Champaign, United States of America
  • Matthew Jacobson, University of California San Francisco, United States of America
  • Brian Shoichet, University of California San Francisco, United States of America
  • Andrej Sali, University of California San Francisco, United States of America

Short Abstract: The function of a protein is often defined by describing its interacting partners (e.g., substrate and product for an enzyme) and its context in a larger molecular network (e.g., a metabolic pathway). The functions of most proteins are not known, but can be determined by experimental and computational approaches, such as ligand screening. Here, we introduce an integrative pathway mapping approach that identifies enzymes and ligands in a pathway as well as their order, given a set of candidate members and at least one member. Inspired by integrative structural modeling, the goal is achieved by finding those pathway models that satisfy structural and network restraints implied by data from a variety of different methods, such as virtual screening, cheminformatics, genomic context analysis, and ligand binding experiments. We demonstrate the method by identifying a novel L-gulonate degradation pathway in Haemophilus influenzae Rd KW20. The predicted pathway was validated by X-ray crystallography, in vitro assays, genetic analyses, and metabolomics. Additional applications for predicting bacterial sugar metabolic pathways and networks are also being pursued. These applications demonstrate the potential of our approach to contribute to the discovery of metabolic pathways and functional annotation of proteins.

L38 - Structure-based prediction of homeodomain binding specificity using homology models and an integrative energy function
  • Alvin Farrel, University of North Carolina at Charlotte, United States of America
  • Jun-Tao Guo, University of North Carolina at Charlotte, United States of America

Short Abstract: Transcription factors (TFs) are essential to the regulation of gene expression through binding to specific target DNA sites. Structure-based methods for studying TF-DNA interactions can help us annotate TF-binding sites (TFBS) at genome scale, better understand the effects of mutations in transcription factors and target sites, and facilitate structure-based drug design. Structure-based TFBS prediction algorithms require high-resolution TF-DNA complex structures. Despite advances in structure determination methods, solving the structures of protein-DNA complexes remains a difficult task, and there are a limited number of TF-DNA complex structures in the Protein Data Bank (PDB). Therefore, there is a need for modeling protein-DNA complex structures to extend the applicability of structure-based TF-binding site prediction. Here we describe a method of generating TF-DNA complex models by combining TF homology models and DNA structures from homologous complex templates to increase the coverage of TF-DNA conformations. A number of TF-DNA interface features are used to determine the top complex models. The top models are used for structure-based transcription factor binding site prediction using an integrative energy function. The integrative energy function combines a residue-level statistical potential with two atomic terms: hydrogen bond energy between protein residues and DNA bases, and electrostatic energy between aromatic residues and DNA bases involved in π stacking interactions. The results on homeodomains show that our approach improves model selection and consequently TFBS prediction accuracy.

