Category L - 'Protein Structure and Function Prediction and Analysis'
Short Abstract: Kinases play critical roles in the regulation of dynamic biological systems, including cancer cell growth, proliferation and survival. With emerging high-throughput screening technologies, large small-molecule libraries have been profiled against panels of kinases. However, kinase inhibitors may not discriminate selectively between kinases, since a large number of protein kinases share a common cofactor and a similar three-dimensional structure of the catalytic site. We are interested in investigating the relationships among kinases across diverse spaces, with the goal of identifying potential linear or non-linear relationships between the ligand-based chemical, pharmacogenomic, functional, and disease spaces by integrating large datasets from multiple sources. Our analysis started from LINCS KINOMEscan data, which is a “benchmark” kinase target competitive binding bioassay. However, as KINOMEscan only covers a small number of compounds, we curated and integrated one of the largest kinase inhibitor datasets and developed diverse learning models to investigate this space further. We assessed kinase-to-kinase similarity through different measurements and quantified pairwise associations and predictability among those measurements. Further, we integrated this analysis with LINCS L1000 gene expression data to study the determinants underlying mechanism of action.
Short Abstract: Most new protein-coding genes originate from old genes by duplication and domain shuffling. It was previously assumed that intergenic DNA could not yield long enough protein products through random mutations. Yet de novo protein-coding genes - derived from intergenic DNA - were recently found in multiple species. These genes are of particular interest as they alone can invent novel protein structures.
We asked how often de novo genes appear, how many exist in any genome and what proteins they make. We built a mathematical model incorporating gene dimensions and genome dynamic processes (mutation, recombination, selection). It predicts that de novo genes can easily be created and that at any time many young de novo genes exist, most being lost quickly. We identified thousands of de novo genes by phylostratigraphy in five genomes and analyzed their biophysical properties using structural bioinformatics. We found that, compared to ancient proteins, de novo proteins are shorter, more disordered, promiscuous (interacting with more proteins and DNA), vulnerable to proteases, and less prone to aggregation. Moreover, de novo proteins lack Pfam domains and may be structurally novel.
Frequent gene creation and a reduced tendency towards aggregation (which is toxic) provide a steady-state population of young de novo genes in the genome. This, along with de novo proteins’ propensity to interact, increases the chance that some will use their novel structures (and possibly novel functionalities) to integrate into existing genetic networks and survive for a long evolutionary time.
Short Abstract: CoDNaS (conformational diversity of the native state) is a protein conformational diversity database [http://ufq.unq.edu.ar/codnas]. It is a redundant collection of different structures obtained for the same protein. These structures differ in their crystallization conditions, such as presence of ligands, pH, post-translational modifications, mutations, change in oligomeric state, and so on. Conformational diversity describes structural differences between conformers that define the native state of proteins. It is a key concept to understand protein function and biological processes related to protein functions.
CoDNaS offers a well-curated database that is experimentally driven, thoroughly linked, and annotated. CoDNaS facilitates the extraction of key information on small structural differences based on protein movements. CoDNaS enables users to easily relate the degree of conformational diversity with physical, chemical, and biological properties derived from experiments on protein structure and biological characteristics. The new version of CoDNaS covers approximately 70% of all available protein structures: it includes 263,014 conformers corresponding to 17,714 different protein chains, with more than 11 million conformer comparisons available to download. New tools have been added that run sequence searches, display structural flexibility profiles, and allow users to browse the database for different structural classes. These tools facilitate the exploration of protein conformational diversity and its role in protein function.
Short Abstract: Targeting and translocation of proteins to the appropriate subcellular compartments is crucial for cell organization and function. Newly synthesized proteins are transported to mitochondria with the assistance of targeting sequences, which are complex, containing either an N-terminal presequence or a multitude of internal signals to target this organelle. Compared with experimental approaches, computational predictions provide an efficient and cost-effective way to infer subcellular localization for any given protein. However, it is still challenging to predict plant mitochondrial localized proteins accurately due to various limitations, and the performance of current tools is unsatisfactory. We present a novel computational approach for large-scale prediction of plant mitochondrial proteins. We collected protein subcellular localization data in plants from databases and literature, and extracted different types of features from the training data, including amino acid composition, protein sequence profile, and gene co-expression information. We then trained deep neural networks for predicting plant mitochondrial proteins. Testing on a non-redundant dataset of potato mitochondrial and Swiss-Prot proteins, our method achieves considerable improvements over existing tools in predicting mitochondria-localized proteins in plants.
Short Abstract: Advances in experimental techniques have led to an explosion in both the number and size of 3D macromolecular structures. Existing text-based file formats for macromolecular data are slow to parse, are not easily extensible and do not contain certain key data (e.g., all bonding information). For these reasons we have developed the Macromolecular Transmission Format (MMTF) (http://mmtf.rcsb.org/). MMTF has three core benefits over existing file formats. First, through bespoke compression methods, the entire Protein Data Bank (PDB) archive can be stored in MMTF in less than 7GB. Second, MMTF data are stored in binary format making parsing an order of magnitude faster than existing text-based formats. Third, MMTF is user friendly, extensible and contains information not found in current formats. In this work we show that MMTF enables high-performance and scalable structural analysis of the PDB archive.
A second application of MMTF is the efficient multi-scale visualization and analysis of large molecular complexes on the web. We have tackled this problem by introducing MMTF to reduce network transfer and parsing time, and by developing NGL (https://github.com/arose/ngl), a highly memory-efficient and scalable WebGL-based viewer. MMTF offers over 75% compression over the standard mmCIF format, is over an order of magnitude faster to parse, and contains additional information (e.g., DSSP secondary structure). NGL renders molecular complexes with millions of atoms interactively on desktop computers and smartphones alike, making it a tool of choice for web-based molecular visualization in research and education.
This project was supported by NIH under award number U01 CA198942.
Short Abstract: Predicting the detailed effects of variants on the structure and function of a protein is a critical problem in drug discovery and personalized medicine. Variants can influence drug selection and provoke a variety of effects on protein behaviour. A portion of variants will produce subtle phenotypic effects that are challenging to predict. VarQ is a bioinformatics tool that gives the user a description of the relevant properties of each variation. It consists of a calculation pipeline and a web server. The resulting information makes it possible to predict the likely effects of variations based on knowledge of previously studied SNPs. A key aspect of this work is the mining of variants. Several databases are populated with variants from different sources: clinical trials, sequencing information, etc. The tool can extract known variants from these databases and also allows the user to specify novel variations to be analyzed. Properties computed by the tool include: the location of the variant (core of the protein, surface of the protein, region labeled as interfacing with another protein, binding site), the energetic impact of the mutation on protein stability or on the stability of the protein-protein complex, the conservation of the residue involved, the mobility of the position, etc.
A validation was performed using a dataset of 14 RASopathy- and cancer-related proteins with ~1200 known variants, comparing the results against the known effects reported in the literature, with promising results.
Short Abstract: During the course of evolution, animals and plants have developed an extensive repertoire of toxins as a defensive mechanism against predators or competitors. Most toxins modify ion channel function, either by preventing ion fluxes across the pore or by promoting its opening. Many peptide toxins display a conserved structural motif called the Inhibitor Cystine Knot (ICK), which has been proposed to improve peptide thermal stability.
The transient receptor potential vanilloid 1 (TRPV1) is a cation-selective ion channel expressed in primary sensory neurons. TRPV1 is a pain receptor modulated by multiple stimuli such as high temperature (>42 °C), low pH (< 6), irritant compounds and peptide toxins. Because of the latter, and from a biomedical point of view, TRPV1 is a very attractive target for pain relief therapies.
The Double Knot toxin (DkTx) from the spider Ornithoctonus huwena is a peptide toxin that interacts with the extracellular surface of TRPV1, thus promoting channel opening.
Given the importance of finding new modulators of TRPV1 and the availability of structural data for DkTx and TRPV1, we analyzed the sequence variability among peptides containing the ICK motif using a Sequence Similarity Network (SSN) and implemented a multiple mutation protocol on the DkTx structure based on homology modeling and molecular dynamics simulations. Our main findings are: the classification of peptides with unknown function as toxins, their functional association with specific ion channel subsets and the structural characterization of new potential toxins derived from DkTx.
R.V.S. is funded by CONICYT PCHA/Doctorado Nacional 2013-21130631. FDG thanks Fondecyt 1131003, ACT-1107. The authors acknowledge ICM-Economía P09-022-F.
Short Abstract: Knowledge about protein interaction sites provides detailed information on protein-protein interactions (PPIs). To date, nearly twenty thousand PPIs from Arabidopsis thaliana have been identified. Nevertheless, interaction site information has been largely missing from previously published PPI databases. Here we establish AraPPISite, a database that presents fine-grained interaction details for A. thaliana PPIs. First, the experimentally determined 3D structures of 27 A. thaliana PPIs were collected from the Protein Data Bank and the 3D structures of 3,023 A. thaliana PPIs were modeled using two well-established template-based docking methods. For each experimental/predicted complex structure, AraPPISite not only provides an interactive user interface for browsing interaction sites, but also lists detailed evolutionary and physicochemical properties of these sites. Second, AraPPISite assigns domain-domain interactions or domain-motif interactions to 4,286 PPIs whose 3D structures cannot be modeled. In this case, users can easily query protein interaction regions at the sequence level. AraPPISite is a free and user-friendly database, which does not require user registration or any configuration on local machines. We anticipate that AraPPISite can serve as a helpful resource for users with less experience in structural biology or protein bioinformatics to probe the details of PPIs, and thus accelerate studies of plant genetics and functional genomics. AraPPISite is available at http://systbio.cau.edu.cn/arappisite/index.html.
Short Abstract: Advances in genomic sequencing technology have drastically increased the amount of available sequence data, escalating the need for rapid annotation of genes and protein models. Recently, the Conserved Domain Database curation team has been developing an in-house procedure, the SPecific ARChitecture Labeling Engine (SPARCLE), to study the extent to which protein domain architecture can be used to define groups of proteins with similarities in molecular function and to derive corresponding functional characterization. So far, about 3,000 common domain architectures from bacteria have been labeled, and SPARCLE will be made available to the public as a searchable resource. Currently, SPARCLE only considers best-scoring or top-ranked domain hits and is also hampered by imperfect domain annotation. To overcome some of these limitations, we propose an alternative computational procedure for defining clusters of functionally similar proteins that uses pre-computed domain annotation from each available source database (COGs, TIGRFAMs, Pfam, and NCBI-curated annotations) for grouping protein sequences, instead of the terse domain annotation currently employed by SPARCLE. This approach provides tunable fine-grained separation of domain architectures, and has been tested on multiple domain architecture families and several genomic datasets. The quality of the resulting classifications has been examined by curators and validated via analysis of the consistency and uniqueness of clusters. We will also discuss the limitations uncovered to date, and hope that this study will identify suitable approaches for rapid, sustainable, and increasingly accurate functional labeling of protein models predicted from genomic sequences.
Short Abstract: High-throughput sequencing has become rapid and inexpensive, providing a vast amount of protein and DNA sequences for many genomes. The next challenge for biology is to use this information to gain fundamental insights into biomolecular mechanisms. One important direction towards this goal is structural reconstruction of entire interactomes/biological pathways, with subsequent mapping of genetic variants/mutations onto the corresponding structures. Due to inherent limitations of experimental techniques, most structures of protein-protein interactions (PPI) have to be computationally modeled (docked). Protein docking pipelines produce a large number of putative docking models. Identification of near-native models among them is a serious challenge. At the same time, a rapidly growing amount of publicly available information from biomedical research provides constraints on the binding mode, which can be essential for the docking. Recently, we have shown the potential of basic text mining (TM) for protein docking (Badal VD, Kundrotas PJ, Vakser IA, PLoS Comput Biol, 2015, 11: e1004630). Here we present an extension of the TM tool, which uses natural language processing (NLP) to analyze residue-containing sentences and their surroundings in the retrieved PubMed abstracts. To generate sentence dependency trees, we used the Stanford parser, and we used inverse distances between PPI-relevant keywords and residues mentioned in the abstracts to discriminate the non-interface residues. We tested WordNet, dictionary look-up and deep parsing NLP approaches. The procedure was benchmarked on 579 X-ray bound structures of binary protein complexes and validated in docking of unbound protein structures from the DOCKGROUND resource (http://dockground.compbio.ku.edu).
Short Abstract: The interaction between transcription factors and DNA plays an important role in gene expression regulation. In this study, we developed an expectation-maximization (EM) algorithm, called EMSEL, for extracting binding motifs from high-throughput SELEX (HT-SELEX) data. EMSEL builds on a comprehensive biophysical model of protein-DNA interactions and is capable of estimating the confidence intervals of the parameters in the model. We compared the binding motifs generated by EMSEL with those estimated by other algorithms using both HT-SELEX and ChIP-seq data. The results demonstrate that the EMSEL motifs generate significantly better predictions of the in vitro data and their predictions of the in vivo data are comparable to the other motifs based on the criterion of the area under the ROC curve (AUC). The ChIP-seq test results, together with the fact that many of the non-EMSEL motifs have very high information content, highlight the limitations of the AUC criterion, which is purely rank-based and fails to take account of the relative binding affinities of ChIP-seq peaks.
Short Abstract: Mirrortree is a computational method to predict protein-protein interactions. The basis of the method is coevolution; interacting proteins evolve together and tend to have similar phylogenetic trees. Hence, similarity of phylogenetic trees can be used to answer whether two proteins are interacting with each other or not.
Our goal in this study is to assess the impact of two factors on the Mirrortree method’s prediction of domain-domain interactions: taxonomic diversity and the application of conservation-based filters to multiple sequence alignments (MSAs). For this, we first downloaded PFAM full alignments for domain pairs using a benchmark set previously used for similar experiments (e.g. Relative Co-evolution of Domain Pairs). This resulted in 1,222 PFAM domain pairs. We then randomly picked unique taxa that are common between interacting domains, using different thresholds (ranging from 10 to 50 taxa). For each domain pair and taxa threshold, we computed similarity matrices and correlation coefficients as per the Mirrortree method. The computations were also repeated after removal of less-conserved regions from the MSAs.
We found that, as taxonomic diversity increases, the proportion of correctly predicted domain-domain interactions decreases from ~70% to ~41%. On the other hand, removing less-conserved regions from the MSAs, although it improves computation time, does not have a significant impact on the predictions.
In conclusion, while the computation time needed for the Mirrortree method could be improved by applying a sequence conservation-based filter with no tradeoff in prediction performance, the taxonomic diversity should be carefully parameterized for optimal performance.
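The core scoring step of a Mirrortree-style comparison can be sketched as follows. This is a minimal illustration, not the code used in the study: it assumes the two distance matrices have already been computed over the same taxa in the same order, and the function names (`upper_triangle`, `pearson`, `mirrortree_score`) are hypothetical.

```python
import math
from itertools import combinations


def upper_triangle(matrix):
    """Return the off-diagonal upper-triangle values of a square matrix."""
    size = len(matrix)
    return [matrix[i][j] for i, j in combinations(range(size), 2)]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant distances")
    return cov / math.sqrt(var_x * var_y)


def mirrortree_score(dist_a, dist_b):
    """Correlate two inter-taxon distance matrices built over the same taxa.

    Assumption: row/column k of both matrices refers to the same taxon.
    """
    if len(dist_a) != len(dist_b):
        raise ValueError("matrices must cover the same number of taxa")
    return pearson(upper_triangle(dist_a), upper_triangle(dist_b))


# Toy example with three shared taxa (values are made up for illustration).
domain_a = [[0, 1, 2], [1, 0, 3], [2, 3, 0]]
domain_b = [[0, 2, 4], [2, 0, 6], [4, 6, 0]]
score = mirrortree_score(domain_a, domain_b)
```

A score close to 1 indicates that the two families have similar evolutionary distance patterns across the shared taxa, which the method treats as evidence of interaction. The number of shared taxa, which the abstract varies from 10 to 50, determines the size of the matrices passed in.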
Short Abstract: Domains are distinct functional and/or structural units of a protein. CDD is a collection of annotated multiple sequence alignment models for domains and full-length proteins, which are available as position-specific score matrices (PSSMs) for the fast identification of conserved domain footprints in protein sequences via RPS-BLAST. The CDD resource includes NCBI-curated domains, which use 3D-structure information to improve model accuracy and provide insights into sequence/structure/function relationships, as well as domain models imported from external source databases such as Pfam and TIGRFAMs. CDD is a redundant collection, and many NCBI-curated domain models reflect specific subfamilies of domains conserved in molecular evolution. Domain architecture (DA) is the sequential order of conserved domains in a protein. Here we use Specific Domain Architecture (SDA), the sequence of models annotating a protein, to group proteins that may have similar molecular and/or cellular functions. Using the curation interface SPARCLE (‘SPecific ARChitecture Labeling Engine’), curators assign names and functional labels (brief descriptions) to SDAs based on the sets of proteins they represent. Focusing initially on bacterial sequences, we have labeled almost 3,000 common SDAs, which cover a significant fraction of bacterial sequences. Importantly, curators record evidence to support their assignments, including representative sequences, conserved domain models, PubMed articles, E.C. numbers, 3D structure records and gene IDs. Labels are reviewed and given final approval before publishing. Labeled architectures, with supporting evidence, will be made available to the public as a searchable resource. This work was supported by the Intramural Research Program of the National Library of Medicine, NIH.
Short Abstract: Polar residues are usually exposed on the protein surface, but a small fraction of them are buried in the protein interior. These buried polar residues make intra-molecular hydrogen bonds and play important roles in protein structure and function. In this report, I performed a comprehensive survey of the buried polar residues, defined as Ser, Thr, Asn, Asp, Gln, Glu, His, Arg and Lys residues having zero accessible surface area, in the non-redundant protein structures from the Protein Data Bank, focusing on patterns of hydrogen-bond interactions and evolutionary conservation. Compared with surface-exposed ones, the side chains of buried polar residues form hydrogen bonds with residues that are distant along the sequence. The interaction partners of the buried polar side chains are dominated by loop residues, and side-chain interactions between helices and between sheets are widely observed. In homologous proteins, the buried polar residues were more strongly conserved than the buried non-polar residues, in that a change in side-chain length of one methylene group is less tolerated between Asn and Gln or Asp and Glu than similar side-chain changes between the aliphatic residues Val, Ile and Leu. When buried polar residues are replaced by non-polar ones in homologous structures, their hydrogen bond partners also change to non-polar ones. These results indicate the structural specificity and evolutionary importance of the buried polar residues and provide important knowledge for a better understanding of protein structures.
Short Abstract: The Protein Structure Initiative resulted in nearly 13,700 Structural Genomics (SG) protein structures deposited in the PDB, but connecting structural information with function proved to be more difficult than originally anticipated. As a result, many of these SG proteins are of unknown biochemical function or have putative functional assignments that are often incorrect. The accumulated structural information from the SG project constitutes a tremendous contribution to structural biology and genomics. However, the addition of more reliable functional predictions for SG proteins would add substantial value to this information. Our approach is based on local structure matching at the computationally predicted active site. First, Partial Order Optimum Likelihood (POOL) uses computed electrostatic and chemical properties to predict the residues in a protein structure that are important for catalysis. Next, Structurally Aligned Local Sites of Activity (SALSA) uses proteins of known function within a given superfamily, with their POOL predictions, to develop unique, spatially-localized consensus signatures for each functional family. We then compare the POOL-predicted residues for each SG protein to the consensus signatures by aligning the residues and scoring the alignment. This score is used to determine the best functional assignment for the SG proteins. This presentation focuses on the Crotonase and 6-Hairpin Glycosidase superfamilies and shows that their misannotation rates are high. In some instances, we provide better functional annotations for the SG proteins and have acquired experimental data supporting our predictions. The goal is to provide a validated approach to functional annotation for wider application by the community.
Short Abstract: Metal ions regulate the folding and function of many proteins. Identification of metal binding sites can help with protein structure prediction and characterization of protein function. To fulfill the need for reliable sequence-based annotation of metal binding proteins, we have developed a new machine learning-based model for metal-binding site prediction. The model is based on coevolution information derived from multiple sequence alignments (MSAs). Three amino acid covariance metrics were evaluated: Chi-squared, Mutual Information, and Pearson correlation. All metrics were adjusted for phylogeny bias in the MSA. Features are based on the cumulative properties derived from the most covariant residues for each potential metal binding residue (CDEHNQST). The feature space includes the average of individual conservation scores and the composition of co-varying amino acids. The training set is compiled from metal binding proteins taken from the Metal MACiE database. 1000 datasets with ratios of 1:1 and 2:1 between negative and positive classes, respectively, were generated. Two machine learning algorithms, C4.5 decision tree and Random Forest (RF), were used to build prediction models. Each model was evaluated using 10-fold cross-validation (CV). The best performing model (23 features, RF, 100 trees) yielded a Matthews correlation coefficient of 0.67 with an overall accuracy of 87.5%, averaged over 1000 runs of 10-fold CV on the 2:1 ratio dataset. The coevolution-based model with group-based features is superior to other existing models using features derived from individual residues.
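As an illustration of one of the covariance metrics named above, the following sketch computes the mutual information between two alignment columns. It is a simplified example rather than the authors' implementation: it applies no phylogeny-bias adjustment, treats every character (including gaps) as an ordinary symbol, and the function name `mutual_information` is hypothetical.

```python
import math
from collections import Counter


def mutual_information(col_i, col_j):
    """Mutual information (in bits) between two MSA columns.

    Each column is a string or list with one residue per sequence.
    Assumption: no correction for phylogenetic bias is applied here.
    """
    if len(col_i) != len(col_j) or len(col_i) == 0:
        raise ValueError("columns must be non-empty and of equal length")
    n = len(col_i)
    count_i = Counter(col_i)
    count_j = Counter(col_j)
    count_ij = Counter(zip(col_i, col_j))
    mi = 0.0
    for (a, b), joint_count in count_ij.items():
        p_ab = joint_count / n
        p_a = count_i[a] / n
        p_b = count_j[b] / n
        mi += p_ab * math.log2(p_ab / (p_a * p_b))
    return mi


# Toy columns: the first pair co-varies perfectly, the second is independent.
covarying = mutual_information("CCHH", "DDEE")
independent = mutual_information("CCHH", "DEDE")
```

Columns that change together across sequences score high, while columns that vary independently score near zero. In a pipeline like the one described, such scores would be used to select the most covariant partners of each candidate residue before features are derived.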
Short Abstract: Although inflammation is crucial for defense against pathogens, if not finely tuned it can also contribute to all phases of tumorigenesis. The TLR pathway plays a central role in the crosstalk between inflammation and cancer, and construction of the structural pathway provides insights into its mechanism of action in the tumor microenvironment. We constructed the structural TLR pathway, and the architectures that we obtained (i) provide the structural basis for TLR clustering upon stimulation and assembly of key signaling complexes; (ii) demonstrate that almost all downstream parallel pathways are competitive; (iii) show that TIR domain-containing negative regulators (BCAP, SIGIRR, and ST2) interfere with TIR domain signalosome formation; (iv) show that major deubiquitinases (A20, CYLD, and DUBA) prevent association of TRAF6 and TRAF3 with their partners, in addition to removing K63-linked ubiquitin chains that serve as a docking platform for downstream effectors; and (v) illuminate mechanisms of oncogenic mutations. Missense mutations that fall on interfaces and nonsense/frameshift mutations that result in truncated negative regulators disrupt the interactions with their targets, thereby enabling constitutive activation of NF-kB and contributing to chronic inflammation, autoimmune diseases and oncogenesis.
Short Abstract: The huge number of 3D protein structures available in databases like the PDB requires tools for automated analysis and intuitive visualization of protein structures. Here, we present a new version of the Protein Topology Graph Library (PTGL) web server. The PTGL is a database that uses a graph model to describe proteins. The graphs are based on 3D atom data from the PDB and the SSE assignments of the DSSP algorithm. The new version of the PTGL supports both protein graphs and amino acid graphs, and can now model protein complexes. In protein graphs, the vertices represent secondary structure elements (SSEs) or ligands, and the edges represent their spatial contacts. In amino acid graphs, residues are modeled instead. The PTGL allows for motif search in the graphs, and supports different visualizations.
We rewrote the PTGL from scratch and implemented many new features, including ligand support, an automated update procedure, and an application programming interface (API) which allows for the integration into other software or services. The new PTGL is an updated tool for the analysis of protein topology that supports large-scale investigations. The resulting graph files can be analyzed in standard software. Here, we also present an investigation of the properties of the new amino acid graphs.
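The protein graph idea described above can be sketched in a few lines. This is a toy illustration and not the PTGL data model or its API: the single-letter vertex labels ("H" for helix, "E" for strand, "L" for ligand), the contact list, and both function names are assumptions made for the example.

```python
def build_protein_graph(sse_types, contacts):
    """Build an undirected graph as an adjacency mapping.

    sse_types: list of vertex labels, one per SSE or ligand, in sequence order.
    contacts:  iterable of (index_a, index_b) pairs that are in spatial contact.
    """
    graph = {index: set() for index in range(len(sse_types))}
    for a, b in contacts:
        if a == b or a not in graph or b not in graph:
            raise ValueError(f"invalid contact: {(a, b)}")
        graph[a].add(b)
        graph[b].add(a)
    return graph


def contacts_between(sse_types, graph, type_a, type_b):
    """List contacting vertex pairs whose labels match the two given types."""
    wanted = {type_a, type_b}
    found = []
    for a, neighbours in graph.items():
        for b in neighbours:
            if a < b and {sse_types[a], sse_types[b]} == wanted:
                found.append((a, b))
    return sorted(found)


# Toy protein: helix, two strands, and a ligand (hypothetical contacts).
sse_types = ["H", "E", "E", "L"]
graph = build_protein_graph(sse_types, [(0, 1), (1, 2), (2, 3)])
strand_pairs = contacts_between(sse_types, graph, "E", "E")
ligand_contacts = contacts_between(sse_types, graph, "E", "L")
```

Searching for labelled contact pairs is the simplest possible form of motif search; real topology motifs involve larger subgraph patterns and usually also the relative orientation of the contacting elements.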
Short Abstract: Detailed knowledge about protein complexes is necessary to understand almost all biochemical, signaling and functional processes in the cell. Competitive binding, protein interactions and post-translational modifications control complex activity and coordinate their assembly. Distinction of stable core members from selectively aggregating proteins is necessary to identify complex subunits controlling behavior and composition.
Complex composition has been studied by various protein-protein interaction network-based approaches such as graph clustering and community detection. However, these network-based approaches suffer from the incompleteness of the interactome even in widely studied organisms such as humans and mice and therefore many protein complexes are still not identified or well-characterized.
We studied a large data collection containing protein expression profiles measured by mass spectrometry. Using these data, we extracted detailed information about protein complex composition across more than 50 different tissues and cell lines. We designed a novel statistical score that discriminates core members from labile components. This score tests the significance of complex composition and concordant behavior in different tissues. Our statistical approach reveals control mechanisms in various protein complexes, and has the potential to accurately predict novel protein complexes as well as to integrate further data from various omics platforms.
Short Abstract: CDD is a resource for protein classification and functional annotation, comprising a collection of annotated multiple sequence alignment (MSA) models that represent ancient conserved protein domains, basic units of protein function and evolution. These MSAs are also available as position-specific score matrices (PSSMs) for the rapid identification of conserved domain footprints via RPS-BLAST. CDD imports well-known collections (Pfam, COGs etc.) and supplements them with manually curated domain models that are organized into family hierarchies. Curators use protein 3D-structure information to refine models and provide insights into sequence-structure-function relationships. We present an annotated hierarchical classification of the seven-transmembrane G-protein coupled receptors (7TM GPCRs), a prominent family of therapeutic drug targets with more than 140 human orphan GPCRs whose endogenous ligands are unknown. With the increasing availability of 3D-structures of diverse 7TM receptors, we recently built a comprehensive comparative evolutionary classification of the highly divergent GPCRs. Orphan GPCR subfamilies, which contain uncharacterized protein sequences, often with poor sequence conservation, have been assigned putative functions with predicted ligand-binding sites and/or the location of 7TM helices annotated by inference from the molecular and physiological functions of known related GPCR proteins with available 3D-structure, from phylogenetic relationships, and/or based on the available literature. We hope that the classification, together with NCBI’s software tools, will aid researchers in the discovery of molecular targets for drug development by providing insights regarding as-yet-unidentified molecular interactions and functional mechanisms. This work was supported by the Intramural Research Program of the National Library of Medicine, NIH.
Short Abstract: This investigation aims to determine the evolutionary lineage of, and variation among, phylloplanins present in different plant species. Phylloplanins are highly hydrophobic, basic proteins secreted on the leaf surface (phylloplane) to inhibit spore germination and leaf infection by pathogens. Proteins annotated as phylloplanins were used to search the GenBank protein database. Phylogenetic trees were constructed from these BLAST results. Protein domains were identified in these phylloplanin proteins using the Pfam, CDD, and InterPro databases. The Pollen_Ole_e_I family consists of a number of secreted plant pollen proteins, of approximately 145 residues, whose function has not yet been determined. This analysis enabled us to gain better insight into the evolution of phylloplanins and the similarities of these proteins found in different plant species.
Short Abstract: Experimental three-dimensional structures are currently known for only a fraction of all known protein sequences. However, for proteins whose primary sequences are sufficiently similar to those with a known structure, homology modeling techniques offer a means to predict their three-dimensional structures. One interesting question in this context is whether these modeling techniques are sufficiently accurate to support the identification of residue-specific functional features from protein model coordinates. We developed a way to evaluate the accuracy of the different modeling techniques based on the matches between functional site predictions in the determined structures and the modeled counterparts, using the FEATURE function prediction program by Bagley and Altman. We utilized the collaborative efforts of the Protein Model Portal and Continuous Automated Model Evaluation (CAMEO) to obtain a set of structural models generated through various modeling techniques. Each modeling technique was thereby analyzed with regard to its ability to accurately reconstitute the local microenvironment corresponding to a particular small-molecule binding site or enzyme active site, as evaluated by FEATURE. Sensitivity and specificity measures were calculated on a per-residue basis, which enabled the detection of local differences between the modeled and the experimentally determined reference structures. Accuracy of the modeling techniques was assessed for subsets of the data reflecting the difficulty of modeling the target protein.
Financial support was provided in part by the NIGMS [grant number 5U01 GM093324-02] and pilot project grant at The Commonwealth Medical College.
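The per-residue sensitivity and specificity mentioned in the abstract above can be illustrated with a minimal sketch. It assumes that functional-site predictions on the experimental structure and on the model are each available as a set of residue numbers; the function name, the residue numbering, and the example sets are hypothetical and are not taken from FEATURE or from the study.

```python
def per_residue_metrics(reference_sites, model_sites, all_residues):
    """Sensitivity and specificity of model predictions against the
    predictions made on the experimentally determined structure.

    All three arguments are iterables of residue identifiers (hypothetical
    input format, chosen for illustration only).
    """
    reference_sites = set(reference_sites)
    model_sites = set(model_sites)
    all_residues = set(all_residues)

    tp = len(reference_sites & model_sites)                 # site in both
    fn = len(reference_sites - model_sites)                 # missed in the model
    fp = len(model_sites - reference_sites)                 # spurious in the model
    tn = len(all_residues - reference_sites - model_sites)  # site in neither

    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Toy example: a 10-residue protein with made-up site assignments.
residues = range(1, 11)
sens, spec = per_residue_metrics({2, 3, 5, 8}, {2, 3, 6}, residues)
print(sens, spec)
```

Here two of the four reference residues are recovered and one of the six non-site residues is wrongly flagged, so the sensitivity is 2/4 and the specificity is 5/6.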
Short Abstract: RNA secondary structure prediction has become an important area of interest in biology and medicine because it helps in understanding many biological processes and in designing RNA-based therapies to treat various diseases such as cancers and AIDS. Different thermodynamics-based computational algorithms for RNA structure prediction exist, and have been used to help understand disease mechanisms and design treatments. However, most of the computational tools that can predict complex pseudoknot structures have a sequence length limitation of a few hundred nucleotide bases due to their high demands on computer resources. Yet, many RNA molecules, such as those making up viral genomes, are thousands of bases long. To overcome the sequence length limitation, a segmentation approach was previously proposed to cut a long RNA into shorter chunks at strategic positions that conserve inversion patterns in the nucleotide sequence, predict each single chunk independently with existing programs like pknotsRG and RNAstructure, and then combine the results to build the final prediction of the entire RNA. In the present study, we investigated whether the prediction accuracy of the segmentation approach could be improved by capturing possible structures formed between two neighboring chunks that would be missed by the previous single-chunk method. Using 136 sequences with known structures obtained from Rfam, we compared the overall prediction accuracies of these segmentation-based methods. When the chunk size was 90 bases or more, the single- and two-chunk methods were found to be not statistically different in prediction accuracy.
Short Abstract: Motivation: Assessing the quality of a protein model, i.e. estimating how close it is to its native structure, using no information other than the structure of the model has been shown to be useful for structure prediction. The state-of-the-art method, ProQ2, is based on a machine learning approach that uses a number of features calculated from a protein model. Here, we examine whether these features can be exchanged with energy terms calculated from Rosetta and whether a combination of these terms can improve the quality assessment.
Results: When using the full-atom energy function from Rosetta in ProQRosFA, the quality assessment (QA) is on par with our previous state-of-the-art method, ProQ2. The method based on the low-resolution centroid scoring function, ProQRosCen, performs almost as well, and the combination of all three methods, ProQ2, ProQRosFA and ProQRosCen, into ProQ3 shows superior performance over ProQ2.
Availability: ProQ3 is freely available as a webserver: http://proq3.bioinfo.se/
Short Abstract: Computing contacts in proteins is important to several types of studies, from bioinformatics to structural biology. An accurate computation of contacts is essential to the correctness and reliability of applications involving folding prediction, protein structure prediction, structural quality assessment, contact network analysis, thermodynamic stability prediction, protein-protein and protein-ligand interactions, docking and so forth. In this work, we built a large database of contacts using about 45,000 PDB files to compare three paradigms for detecting contacts at the atomic level: distance-based only, distance and geometry-based (occlusion free), and distance and angle-based.
The main contribution of this paper is a critical evaluation of the different paradigms that may be used to compute contacts between protein atoms. We focused on protein-protein interfaces and analysed four types of contacts, namely hydrogen bonds, aromatic stacking, hydrophobic interactions and ionic (attractive) interactions. We scanned for possible contacts in the range from 0 to 7 Å. Our data showed the importance of a geometric approach to filter out spurious occluded contacts beyond about 3.5 Å for aromatic stacking, hydrophobic and ionic interactions. For hydrogen bonds, the angle criteria gave more reliable results at every distance in the studied interval.
We provide the database with all computed contacts and the source code used to populate the database.
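As an illustration of the simplest of the three paradigms above (distance-based only), the following minimal sketch lists all atom pairs between two chains that lie within a cutoff. The atom labels, the coordinates, and the function name are hypothetical; the occlusion and angle filters discussed in the abstract are not implemented here, and this is not the authors' code.

```python
import math


def distance_contacts(chain_a, chain_b, cutoff=7.0):
    """Return (atom_a, atom_b, distance) for every inter-chain atom pair
    whose Euclidean distance is at most `cutoff` (in Å).

    Each chain is a list of (label, (x, y, z)) tuples. The default cutoff of
    7 Å mirrors the upper bound of the range scanned in the abstract.
    """
    contacts = []
    for label_a, xyz_a in chain_a:
        for label_b, xyz_b in chain_b:
            d = math.dist(xyz_a, xyz_b)
            if d <= cutoff:
                contacts.append((label_a, label_b, d))
    return contacts


# Toy interface with made-up atoms and coordinates.
chain_a = [("A:LYS12:NZ", (0.0, 0.0, 0.0))]
chain_b = [("B:ASP45:OD1", (3.0, 0.0, 0.0)),
           ("B:LEU50:CD1", (0.0, 8.0, 0.0))]
contacts = distance_contacts(chain_a, chain_b)
print(contacts)
```

A pure distance criterion like this accepts any pair inside the cutoff, including pairs with another atom lying between them, which is the kind of spurious occluded contact the geometric paradigm is meant to remove.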
Short Abstract: The RCSB Protein Data Bank (RCSB PDB, http://www.rcsb.org) provides rich structural views of biological systems to enable breakthroughs in scientific inquiry, medicine, drug discovery, technology, and education. The website offers multiple tools for structure query, analysis, and visualization.
Users can perform simple searches from the top search bar (e.g., ID, name, sequence, ligand) or build complex combinations of search parameters using Advanced Search. Information from DrugBank is integrated with PDB data to facilitate searches for drugs and drug targets. Other classification systems are used to organize PDB structures in hierarchical trees for browsing and searching (e.g., Membrane Protein Annotation, Gene Ontology, Enzyme Classification).
Visualization features include Protein Feature View, a graphic comparison of a PDB sequence with UniProt and other annotations; Gene View, a tool that illustrates the correspondences between the human genome and 3D structure; and 3D visualization of electron density maps for bound ligands.
The RCSB PDB is funded by a grant (DBI-1338415) from the National Science Foundation, the National Institutes of Health, and the US Department of Energy. RCSB PDB is a member of the Worldwide Protein Data Bank (http://wwpdb.org).
Short Abstract: The Protein Data Bank (PDB) is a key source of the general principles that have shaped our understanding of protein structure. Most of the existing statistical generalizations of protein structures are made either for secondary structures, which are often too generic to satisfy many specific design goals, or for protein domains, for which the PDB distribution is highly biased by evolution or human sampling and is thus not physically meaningful. To fill this gap, we proposed local tertiary motifs (TERMs) as a new fundamental level of structural unit. TERMs are combinations of non-continuous small secondary fragments connected by inter-residue contacts. We hypothesized that the PDB contains valuable quantitative information at the level of TERMs. We studied the propensities of TERMs within their corresponding ensembles, i.e. geometrically similar structural fragments from completely unrelated proteins. The TERM propensities are physically meaningful in many contexts. By breaking a protein structure into its constituent TERMs, we can evaluate the accuracy of structure-prediction models and identify poorly predicted regions, via a metric we named “structure score” that captures the sequence-structure relationships in TERMs. Also, querying TERMs affected by point mutations enables straightforward prediction of mutational free energies. Our performance exceeds or is comparable to that of state-of-the-art methods. Our results suggest that the data in the PDB are now sufficient to enable the quantification of complex structural features, such as those associated with entire TERMs. This should present opportunities for advances in computational structural biology techniques, including structure prediction and design.
Short Abstract: We propose new metrics on sets, ontologies and functions that can be used in various stages of probabilistic modeling, including exploratory data analysis, learning, inference, and result interpretation. The completeness of the proposed metric spaces has also been proved. These new metric functions unify and generalize some of the popular metrics on sets and functions, such as the Jaccard and bag distances on sets and the Marczewski-Steinhaus distance on functions. As a special case and direct application of this new metric, we then introduce information-theoretic metrics on directed acyclic graphs (such as the Gene Ontology) drawn independently according to a fixed probability distribution, and show how they can be used to calculate similarity between class labels for objects with hierarchical output spaces (e.g., protein function). Finally, we provide evidence that the proposed metrics are useful by clustering species based solely on functional annotations available for subsets of their genes. The functional trees resemble evolutionary trees obtained by the phylogenetic analysis of their genomes.
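For orientation, the classic Jaccard distance on sets, one of the existing metrics that the abstract above says is generalized, can be sketched as follows. This shows only the well-known baseline, not the newly proposed metrics; the term labels are placeholders rather than real ontology identifiers.

```python
def jaccard_distance(a, b):
    """Classic Jaccard distance: 1 - |A ∩ B| / |A ∪ B|.

    Two empty sets are treated as identical (distance 0) by convention.
    """
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


# Placeholder annotation sets for three hypothetical proteins.
p1 = {"t1", "t2", "t3"}
p2 = {"t2", "t3", "t4"}
p3 = {"t3", "t4", "t5"}

d12 = jaccard_distance(p1, p2)
d23 = jaccard_distance(p2, p3)
d13 = jaccard_distance(p1, p3)
print(d12, d23, d13)
```

Because the Jaccard distance is a metric, it satisfies the triangle inequality, so `d13` cannot exceed `d12 + d23` in the example.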
Short Abstract: The increasing volume and variety of genotypic and phenotypic data is a major defining characteristic of modern biomedical sciences. Assigning functions to biological macromolecules, especially proteins, turns out to be one of the major challenges in understanding life at the molecular level. While molecular experiments provide the most reliable annotation of proteins, their relatively low throughput and restricted purview have led to an increasing role for computational function prediction. However, properly assessing methods for protein function prediction and tracking progress in the field remain challenging as well.
Here we report the results of the second Critical Assessment of Functional Annotation (CAFA), a timed challenge to assess computational methods that automatically assign protein function. One hundred twenty-six methods from 56 research groups were evaluated for their ability to predict biological functions using the Gene Ontology and gene-disease associations using the Human Phenotype Ontology on a set of 3,681 proteins from 18 species. CAFA2 featured significantly expanded analysis compared with CAFA1, with regard to data set size, variety, and assessment metrics. According to our assessment, the top performing methods in CAFA2 outperformed the best methods from CAFA1, demonstrating that computational function prediction is improving. The assessment also revealed that the definition of top performing algorithms is ontology specific, that different performance metrics can be used to probe the nature of accurate predictions, and that the diversity of predictions differs between the biological process and human phenotype ontologies.
Short Abstract: Antibodies are proteins produced by the immune system to act upon immunogenic molecules, known as antigens. Human antibodies are composed of two chains: Heavy and Light. The antigen binding site of a typical antibody is made up of six loops known as Complementarity-Determining Regions (CDRs). Three of those loops are located on the Heavy chain (H1-H3) and three on the Light chain (L1-L3). Five out of six CDRs (L1, L2, L3, H1 and H2) form only a small number of discrete conformations called canonical classes.
Our results show that all CDR types have structurally similar loops of different lengths. Based on these findings, we have created length-independent canonical classes for the non-H3 CDRs. Using these length-variable classes, we have predicted canonical class membership of the CDRs from a Next-Generation Sequencing (NGS) dataset of human IgM antibodies containing over 10,000,000 Light chain sequences and over 5,000,000 Heavy chain sequences. We find that, due to differences in CDR length distribution between the available structural data and the sequence data, our length-independent approach classifies more sequences into classes than the standard, length-dependent approach. Using statistical and machine learning methods, we have also clustered the CDR sequences to investigate how well we can reconstruct the canonical classes from sequence data alone.
Overall, our analysis is one of the most comprehensive attempts at quantifying the range of CDR structural variability in the naïve human antibody repertoire.
Short Abstract: Predicting the macromolecular targets of small molecule compounds is important for drug discovery in order to flag off-targets, identify new targets of known drugs (drug repositioning) and deorphanize ligands without known targets. Here, we present a new method for target prediction based on the three-dimensional comparison of protein-ligand binding sites (“pockets”). Pockets are represented as clouds of atoms, given by the Cartesian coordinates of the Cα and hydrogen bond donor/acceptor atoms of residues lining the protein cavities. These pockets are then compared by a sequence order-independent clique-matching algorithm. Finally, a pocket similarity score is calculated based on the number of aligned atoms and their root-mean-square distance. We devised a benchmark based on 201 high-resolution protein-ligand complexes with known binding affinity, employed our method to retrieve related pockets, and analyzed the results by means of receiver operating characteristic (ROC) analysis. The best results were obtained for the structurally conserved binding sites of serine proteases (area under the ROC curve, AUC=0.99) and metalloproteases (AUC=0.95), while we obtained good results for the structurally diverse binding sites of glycosylases (AUC=0.79). We compared our method with the published algorithm APoc and demonstrate that our method performs favorably. In summary, we present a fast, sequence order-independent clique-matching approach for the comparison of protein pockets with straightforward application in small molecule target prediction.
Acknowledgments: FONDECYT 1161798.
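The AUC values reported in the abstract above summarize how well a similarity score ranks related pockets ahead of unrelated ones. As a minimal sketch, the AUC can be computed as the fraction of (related, unrelated) pairs that are ranked correctly, with ties counted as half. The scores and labels below are invented for illustration and have no connection to the benchmark of 201 complexes.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    `scores` are similarity scores (higher means more similar) and `labels`
    are truthy for related pockets, falsy for unrelated ones.
    """
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # related pocket ranked above unrelated one
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(positives) * len(negatives))


# Made-up retrieval result: two related and two unrelated pockets.
example_scores = [0.9, 0.8, 0.4, 0.3]
example_labels = [1, 0, 1, 0]
auc = roc_auc(example_scores, example_labels)
print(auc)
```

In the toy example three of the four related/unrelated pairs are ordered correctly, giving an AUC of 0.75; a perfect ranking gives 1.0 and a random one about 0.5.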
Short Abstract: Natural protein molecules are highly evolved systems. Spontaneous folding of individual proteins and recognition between polypeptides leading to well-defined structural ensembles are fundamental concepts in the biology of macromolecules, the specificity of which is explained by the “principle of minimal frustration”. This insight has led to multiple developments in the understanding of protein folding and function. The minimal frustration principle does not rule out that some energetic frustration may be present in a folded protein. Moreover, such frustration may not be a random occurrence but an evolved characteristic that facilitates motion of the protein around its native basin and binding to appropriate partners, and it is thought to be fundamental to protein function. We have developed theoretical methods for spatially localizing and quantifying the energetic frustration present in native proteins. These have proven useful in the study of binding interfaces, allosteric transitions, aggregation and ligand binding, and conformational dynamics, and have been related to evolutionary patterns and disease-related polymorphisms.
The new Protein Frustratometer server is based on the associative memory, water mediated, structure and energy model (AWSEM). AWSEM provides a transferable, coarse-grained, non-additive force field that is able to predict the native structures of many proteins and protein complexes from sequence information. Recently, electrostatic forces have been included in the AWSEM suite and have been shown to play a role in modulating the asperities of the folding and binding landscapes.
Along with a significant speed-up of the calculations, this new server makes it possible to analyze the local frustration that arises from electrostatic interactions.
Short Abstract: The HIV-1 envelope surface protein is covered with N-linked glycosylation sites, and glycans (carbohydrates) contribute more than half of its molecular weight. The glycans render the protein surface mostly undetectable to the host immune system, but specific glycans have been identified that form part of viral epitopes for broadly cross-neutralising antibodies, and glycans have also been implicated in chemokine receptor (coreceptor) tropism. Despite their abundance and importance, the extent to which N-linked glycans influence antibody and coreceptor binding, which is essential for productive infection, is not well documented. Using molecular simulation techniques, we have previously shown that the dynamics of gp120 is substantially affected by the presence of glycans at specific N-linked glycosylation sites.
Here, we use CCR5- and CXCR4-tropic viral sequences to explore the effect of the glycan distribution on HIV-1 coreceptor tropism. We have modelled six Env trimer structures using three pairs of phenotyped (using Trofile®) gp160 sequences, representing subtypes A, C and D infections. Oligomannose (Man9) glycans were attached to the N-linked glycosylation sites of each structure, and we used AMBER to run molecular dynamics simulations of the modelled structures. The preliminary results reveal the degree to which the glycan composition and density around key regions of HIV-1 gp120 and gp41 impact the tropism-associated dynamics of the protein.
These results present a unique view on how the glycan-protein, as well as the glycan-glycan, interactions of the HIV-1 envelope trimer may modulate the infectivity and immunogenicity of the virion.
Short Abstract: Automatic decomposition of protein structures into structural domains has been widely examined. To date, various clustering algorithms have been presented for protein decomposition. A main challenge in some of these algorithms is defining the "stopping criteria", namely determining the condition under which the algorithm should finish. Prior knowledge of the number of domains in a protein structure can play a key role in the stopping criteria. Some domain assignment algorithms use knowledge of the number of domains. Generally, such algorithms show better accuracy than algorithms that do not use this kind of knowledge. We introduce a new algorithm for the domain assignment problem based on dividing the problem into two sub-problems: 1-Obtaining and clustering core residues of a protein structure such that every cluster represents the hydrophobic core of a domain. Under the assumption that every domain has a hydrophobic core, a one-to-one correspondence between clusters and domains is expected. By solving this sub-problem, we transform the domain assignment problem from a clustering problem into a classification problem. Moreover, as the number of domains is obtained implicitly in this step, we can use this information with algorithms in which the number of domains is used as prior information. 2-Classification of non-core residues. To evaluate this algorithm, we compare its results with those of several state-of-the-art algorithms on some benchmark datasets. The evaluation shows that our new algorithm achieves accuracy competitive with the best algorithms presented to date.
Short Abstract: Oligomerization is the assembly of protein subunits to form a complex functional biological macromolecule, an oligomer. It is one of the fundamental means through which nature equips proteins with the ability to perform complex functions and attain greater stability. Oligomers can exist either as an assemblage of identical blocks of proteins, termed homooligomers, or as a mosaic of heterogeneous subunits, termed heterooligomers. In this study, we investigate the dynamic effect of oligomerization and its functional significance on a set of 145 diverse homooligomeric proteins. We employ an elastic network model to inspect the change in residue fluctuations upon oligomerization and then couple it with a residue conservation score to understand the functional significance of regions with altered dynamics. The study reveals the importance of sites with dampened fluctuations after oligomerization. These sites can be located either in the interface or in the non-interface regions of the oligomeric assembly and can harbor key functional residues. A case study on bovine glutamate dehydrogenase further confirms that these residues can serve as orthosteric ligand binding sites. This study introduces a novel approach for identifying functional residues in oligomeric proteins, which can be further investigated as potential drug targets.
Short Abstract: The function of a protein is often defined by describing its interacting partners (e.g., substrate and product for an enzyme) and its context in a larger molecular network (e.g., a metabolic pathway). The functions of most proteins are not known, but can be determined by experimental and computational approaches, such as ligand screening. Here, we introduce an integrative pathway mapping approach that identifies enzymes and ligands in a pathway as well as their order, given a set of candidate members and at least one member. Inspired by integrative structural modeling, the goal is achieved by finding those pathway models that satisfy structural and network restraints implied by data from a variety of different methods, such as virtual screening, cheminformatics, genomic context analysis, and ligand binding experiments. We demonstrate the method by identifying a novel L-gulonate degradation pathway in Haemophilus influenzae Rd KW20. The predicted pathway was validated by X-ray crystallography, in vitro assays, genetic analyses, and metabolomics. Additional applications for predicting bacterial sugar metabolic pathways and networks are also being pursued. These applications demonstrate the potential of our approach to contribute to the discovery of metabolic pathways and functional annotation of proteins.
Short Abstract: Transcription factors (TFs) are essential to the regulation of gene expression through binding to specific target DNA sites. Structure-based methods for studying TF-DNA interactions can help us annotate TF-binding sites (TFBS) at genome scale, better understand the effects of mutations in transcription factors and target sites, and facilitate structure-based drug design. Structure-based TFBS prediction algorithms require high-resolution TF-DNA complex structures. Despite advances in structure determination methods, solving the structures of protein-DNA complexes remains a difficult task, and there are a limited number of TF-DNA complex structures in the Protein Data Bank (PDB). Therefore, there is a need for modeling protein-DNA complex structures to extend the applicability of structure-based TF-binding site prediction. Here we describe a method of generating TF-DNA complex models by combining TF homology models and DNA structures from homologous complex templates to increase the coverage of TF-DNA conformations. A number of TF-DNA interface features are used to determine the top complex models. The top models are used for structure-based transcription factor binding site prediction using an integrative energy function. The integrative energy function combines a residue-level statistical potential with two atomic terms: hydrogen bond energy between protein residues and DNA bases, and electrostatic energy between aromatic residues and DNA bases involved in π stacking interactions. The results on homeodomains show that our approach improves model selection and consequently TFBS prediction accuracy.