Leading Professional Society for Computational Biology and Bioinformatics
Connecting, Training, Empowering, Worldwide


Accepted Posters

If you need assistance please contact submissions@iscb.org and provide your poster title or submission ID.


Track: Function

Session B-121: A structured-output method for protein function prediction
COSI: Function
  • Jovana Kovacevic, Faculty of Mathematics, Belgrade University, Serbia and School of Informatics and Computing, Indiana University, United States
  • Predrag Radivojac, School of Informatics and Computing, Indiana University, United States

Short Abstract: Representing protein function as a subgraph of the Gene Ontology (GO) graph leads to a formulation of protein function prediction as a structural classification problem. Although many computational strategies have been applied in this field, the potential of structural support vector machines (SSVMs) as optimization engines remains unexplored. We therefore developed a sequence-based structured-output predictor of protein function, utilizing one of the most principled machine learning frameworks. Adapting SSVMs to this problem is non-trivial, as it requires the development of an optimization algorithm that maximizes an objective function over the vast set of all possible consistent subgraphs of the ontology. The proposed method showed performance competitive with state-of-the-art predictors, especially for the species-specific models.
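
The notion of a consistent subgraph can be illustrated with a small sketch. It assumes a toy ontology given as a mapping from each term to its direct parents (the term names are hypothetical, not real GO identifiers) and treats a set of terms as consistent when it contains every ancestor of each of its terms. It does not implement the SSVM optimization itself.

```python
from typing import Dict, Iterable, Set

# Hypothetical toy ontology: each term maps to its direct parents.
TOY_ONTOLOGY: Dict[str, Set[str]] = {
    "root": set(),
    "binding": {"root"},
    "catalysis": {"root"},
    "dna_binding": {"binding"},
    "kinase": {"catalysis", "binding"},
}


def ancestors(term: str, parents: Dict[str, Set[str]]) -> Set[str]:
    """Return all ancestors of a term, excluding the term itself."""
    seen: Set[str] = set()
    stack = list(parents[term])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


def is_consistent(terms: Iterable[str], parents: Dict[str, Set[str]]) -> bool:
    """A term set is consistent if it includes every ancestor of its terms."""
    term_set = set(terms)
    return all(ancestors(t, parents) <= term_set for t in term_set)


def make_consistent(terms: Iterable[str], parents: Dict[str, Set[str]]) -> Set[str]:
    """Close a term set under the ancestor relation."""
    closed = set(terms)
    for t in list(closed):
        closed |= ancestors(t, parents)
    return closed
```

A predictor that searches only over consistent term sets never outputs a specific term without its more general parents; `make_consistent` shows one simple way to repair an inconsistent prediction after the fact.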

Session B-122: Crowdsourcing Protein Family Database Curation
COSI: Function
  • Matt Jeffryes, EMBL-EBI, United Kingdom
  • Alex Bateman, EMBL-EBI, United Kingdom

Session B-123: Frameshift aware Hidden Markov Model for protein-DNA sequence annotation
COSI: Function
  • Genevieve R. Krause, University of Montana, United States
  • Walt Shands, UC Santa Cruz, United States
  • Travis J. Wheeler, University of Montana, United States

Short Abstract: Accurate annotation of biological sequences is fundamental to modern molecular biology. The errors inevitable in current sequencing technologies and mutations leading to pseudogenes can impair the annotation of protein-coding DNA by producing frameshifts. To address this issue, we have developed a prototype implementation of a frameshift-aware Hidden Markov Model. This implementation provides greater sensitivity than other tools for annotating protein-coding DNA, such as tBlastn and Exonerate.

Session B-124: Computational Functional Annotation using Hierarchical Orthologous Groups in OMA
COSI: Function
  • Alex Warwick Vesztrocy, University College London, United Kingdom
  • Adrian Altenhoff, ETH Zurich, Switzerland
  • Christophe Dessimoz, University College London, United Kingdom

Session B-125: DIANA-TarBase: a fundamental asset for functional characterization of microRNA targets.
COSI: Function
  • Dimitra Karagouni, University of Thessaly, Greece
  • Spiros Tastsoglou, University of Thessaly, Greece
  • Ioannis Vlachos, University of Thessaly, Greece
  • Maria Paraskevopoulou, University of Thessaly, Greece
  • Thanasis Vergoulis, ‘Athena’ Research and Innovation Center, Greece
  • Theodore Dalamagas, ‘Athena’ Research and Innovation Center, Greece
  • Artemis Hatzigeorgiou, University of Thessaly, Greece

Short Abstract: microRNAs (miRNAs) are short non-coding RNAs (ncRNAs) present in animals, plants and viruses. Their role in translational repression/mRNA decay was the first to be discovered and is the most extensively studied to date. Each miRNA may target a multitude of genes, rendering miRNAs key regulators in most physiological/pathological conditions. miRNAs have also been found to interact with other ncRNAs (long non-coding RNAs, lncRNAs), remarkably broadening the scope of non-coding RNA research, and they have also been studied from a host-pathogen perspective, to reveal interactions between viral miRNAs (vmiRNAs) and host mRNAs and lncRNAs. DIANA-TarBase is the oldest and largest manually curated database housing experimentally supported miRNA:gene interactions. Its latest version, DIANA-TarBase v7.0, contains more than half a million entries collected and processed from published experiments on 356 different cell types from 24 species, including thousands of vmiRNA:host-mRNA interactions. Its interface offers extensive filtering options, such as regulation type, applied experiment and species, while results are enhanced with interaction metadata to aid result interpretation. Numerous top-quality interactions from high-throughput and specific techniques are being meticulously curated. Specific emphasis is put on CLIP experiment analysis. TarBase, together with LncBase v2.0 for experimental/predicted miRNA:lncRNA interactions, formed the foundations to develop DIANA-miRPath v3.0 for complex exploratory investigations and, more recently, DIANA-mirExTra v2.0, a tool dedicated to NGS analysis and to the identification of sample-specific miRNAs/TFs of importance. Resources from DIANA-Lab are all freely accessible at http://www.microrna.gr and are visited by more than 7,000 users per month.

Session B-126: A Domain-Based Machine Learning Approach for Function Prediction using CATH FunFams
COSI: Function
  • Jonathan G. Lees, UCL, United Kingdom
  • Sayoni Das, UCL, United Kingdom
  • Christine A. Orengo, UCL, United Kingdom

Session B-127: InterProScan: Protein sequence analysis and classification
COSI: Function
  • Gift Nuka, EMBL-EBI, United Kingdom
  • InterPro Team, EMBL-EBI, United Kingdom

Short Abstract: InterPro (http://www.ebi.ac.uk/interpro/) is a freely available resource that is used to classify sequences into protein families and to predict the presence of important domains and sites. InterProScan (https://www.ebi.ac.uk/interpro/interproscan.html) is the underlying software application that allows both protein and nucleic acid sequences to be scanned against InterPro's predictive models (profile hidden Markov models, profiles, position-specific scoring matrices and regular expressions), which are provided by the resource’s member databases. Here, we present recent developments which include support for specific functional inferences provided by new per-residue annotations and prediction of intrinsically disordered regions.

Session B-128: Predicting Protein Function Directly from STRING Network Topology using Deep Learning Techniques
COSI: Function
  • Cen Wan, University College London, United Kingdom
  • Domenico Cozzetto, University College London, United Kingdom
  • Rui Fa, University College London, United Kingdom
  • David Jones, University College London, United Kingdom

Session B-129: A Secondary Cut-off Threshold for Improved HMMERCTTER Protein Superfamily Classification
COSI: Function
  • Agustin Amalfitano, Universidad Nacional de Mar del Plata, Argentina
  • Marcel Brun, Universidad Nacional de Mar del Plata, Argentina
  • Nicolas Stocchi, IIB-CONICET-UNMdP, Argentina
  • Juan Manuel Veron, Universidad Nacional de Mar del Plata, Argentina
  • Arjen Ten Have, Universidad Nacional de Mar del Plata, Argentina
  • Miguel Benavente, Universidad Nacional de Mar del Plata, Argentina

Short Abstract: Background: Pfam and TIGRFAM are HMMER profile databases for function assignation of complete proteomes. They use trusted thresholds rather than HMMER's sensitive thresholds to increase specificity, resulting in reduced sensitivity. The HMMER Cut-off Threshold Tool (HMMERCTTER) clusters a superfamily's training sequences into monophyletic clusters with 100% Precision & Recall (P&R). These are used to classify new sequences while keeping 100% P&R. Classification is iterated, and in each step the profile is updated by including the accepted sequences. Unfortunately, for certain complex or divergent superfamilies this results in poor coverage. Results: In order to increase HMMERCTTER coverage, we developed a less restrictive cut-off for classification of sequences in the “twilight zone” of similarity. We redefined the classification stage as a three-step method. In the first, fully adaptive step, sequences are processed one by one. Those that classify for only one group are added to that group, updating both profile and threshold, only if 100% P&R is maintained. In the second, semi-adaptive step, added sequences modify only the group threshold, still checking for 100% P&R. In the third, optional step, the remaining unclassified sequences are classified based on their scores, without changing the profile and threshold of the groups, thus achieving 100% coverage at the cost of reducing P&R. Conclusions and perspectives: This method is expected to considerably improve coverage. Results will be compared with the previous method on high-fidelity datasets.

Session B-130: Neural Networks and Random Forests for Protein Ontology Prediction
COSI: Function
  • Jari Björne, University of Turku, Finland
  • Kai Hakala, University of Turku, Finland
  • Farrokh Mehryary, University of Turku, Finland
  • Hans Moen, University of Turku, Finland
  • Martti Tolvanen, University of Turku, Finland
  • Tapio Salakoski, University of Turku, Finland
  • Suwisa Kaewphan, Turku Centre for Computer Science (TUCS), University of Turku, Finland
  • Filip Ginter, University of Turku, Finland

Session B-131: Investigating machine learning methods to characterise origins of DNA replication in kinetoplastid genomes
COSI: Function
  • Samantha Campbell, University of Glasgow, United Kingdom
  • Catarina Marques, University of Dundee, United Kingdom
  • Richard Burchmore, University of Glasgow, United Kingdom
  • Richard McCulloch, University of Glasgow, United Kingdom
  • Nicholas Dickens, Florida Atlantic University, United States

Short Abstract: DNA replication is an essential cellular process in all organisms, and much is known about the replication machinery; however, outside of yeast, little is known about the sequence-specific features that denote sites of replication initiation. Recent studies in two kinetoplastid parasites, Leishmania sp. and Trypanosoma brucei, may help improve our understanding of origins of replication. All genes in these parasites are expressed in polycistronic arrays, with transcription initiation and termination occurring at a small number of array boundaries, termed strand switch regions (SSRs). We used next-generation sequencing to map DNA replication along chromosomes and predict that origins co-localise with SSRs, though only a subset of SSRs in both genomes act as origins, for reasons that remain unclear. Implemented in Python using scikit-learn, our support vector machine (SVM) successfully classifies origins with >90% accuracy in both the L. major and L. mexicana genomes. DNA sequence reads mapping to this region are broken down into tiled k-mers, and normalised counts are used as features. Our results confirm the presence of DNA features specific to sites of DNA replication in Leishmania. We are currently refining the coordinates of SSRs and working to optimise the k-size and feature selection process. We will use the results of these analyses to predict replication origins in uncharacterised Leishmania genomes and identify conserved features in closely related Trypanosome genera, where origins appear to have a distinct relationship with SSRs.
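
The k-mer feature construction described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function name `kmer_features`, the default `k=3`, and the reading of "tiled" as a sliding window with a configurable `step` are all assumptions.

```python
from collections import Counter
from itertools import product
from typing import List


def kmer_features(sequence: str, k: int = 3, step: int = 1) -> List[float]:
    """Return normalised k-mer counts over the alphabet ACGT.

    Features are ordered lexicographically (AAA, AAC, ...). Windows that
    contain any other character (for example N) are ignored.
    """
    sequence = sequence.upper()
    vocabulary = ["".join(p) for p in product("ACGT", repeat=k)]
    # Count every window of length k, moving `step` positions at a time.
    counts = Counter(
        sequence[i:i + k] for i in range(0, len(sequence) - k + 1, step)
    )
    total = sum(counts[kmer] for kmer in vocabulary)
    if total == 0:
        return [0.0] * len(vocabulary)
    return [counts[kmer] / total for kmer in vocabulary]
```

Vectors of this form could then be passed to a classifier such as scikit-learn's `sklearn.svm.SVC`; the kernel, the value of k and the feature selection are left open here, as they are in the abstract.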

Session B-132: Predicting Novel Abnormal Circadian Phenotypes in Mouse
COSI: Function
  • John Williams, MRC Harwell Institute, United Kingdom
  • Siddharth Sethi, MRC Harwell Institute, United Kingdom
  • Kenneth Condon, MRC Harwell Institute, United Kingdom
  • Patrick Nolan, MRC Harwell Institute, United Kingdom
  • Michelle Simon, MRC Harwell Institute, United Kingdom
  • Georgios Gkoutos, University of Birmingham, United Kingdom
  • Ann-Marie Mallon, MRC Harwell Institute, United Kingdom

Session B-133: Prediction of Enzymatic Properties of Protein Sequences Based on the EC Nomenclature
COSI: Function
  • Alperen Dalkiran, METU, Turkey
  • Ahmet Süreyya Rifaioğlu, Middle East Technical University, Turkey
  • Tunca Dogan, EMBL-EBI, United Kingdom
  • Maria Martin, EMBL-EBI, United Kingdom
  • Volkan Atalay, Middle East Technical University, Turkey
  • Rengul Atalay, METU, Turkey

Short Abstract: Computational methods have been proposed in the last two decades to predict the attributes of gene products in an automated manner. One basic problem in this field is predicting whether a protein is an enzyme or not. Here we present a new method to predict whether a protein sequence is an enzyme or a non-enzyme by constructing six classifiers, each corresponding to one of the six main EC classes, with a combinatorial machine learning approach. The idea is that if all six classifiers give low prediction scores for a given input protein sequence, it can be labelled as a non-enzyme, whereas if the target protein receives a prediction score higher than the class-specific score threshold, it is predicted to be an enzyme with the corresponding basic enzymatic function. Our system combines three independent classifiers, SPMap, Blast-kNN and Pepstats-SVM, similar to the method developed previously by our research group for protein function prediction, GOPred (Saraç et al., 2010). The scores of these three methods are combined into a weighted mean score for each EC class. Finally, predictions are generated according to the weighted mean score given to each of the six EC classifiers. We obtained overall recall measures of 0.91 and 0.85 from the validation of the positive (i.e. the specific enzymatic function) and the negative (i.e. non-enzyme) data sets, respectively.
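
The decision scheme described above can be sketched in a few lines. The weights and thresholds below are placeholders chosen for illustration, not values from the poster, and the tie-breaking choice (pick the highest-scoring class when several exceed their thresholds) is an assumption.

```python
from typing import Dict, Tuple

EC_CLASSES = ["EC1", "EC2", "EC3", "EC4", "EC5", "EC6"]

# Hypothetical weights for the three base classifiers (not from the source).
WEIGHTS: Dict[str, float] = {"SPMap": 0.4, "Blast-kNN": 0.4, "Pepstats-SVM": 0.2}
# Hypothetical class-specific score thresholds (not from the source).
THRESHOLDS: Dict[str, float] = {ec: 0.5 for ec in EC_CLASSES}


def weighted_mean(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted mean of the base-classifier scores for one EC class."""
    total_weight = sum(weights[name] for name in scores)
    return sum(scores[name] * weights[name] for name in scores) / total_weight


def predict_ec(
    per_class_scores: Dict[str, Dict[str, float]],
    weights: Dict[str, float] = WEIGHTS,
    thresholds: Dict[str, float] = THRESHOLDS,
) -> Tuple[str, Dict[str, float]]:
    """Return the predicted label and the combined score of every EC class."""
    combined = {
        ec: weighted_mean(scores, weights)
        for ec, scores in per_class_scores.items()
    }
    passing = {ec: s for ec, s in combined.items() if s > thresholds[ec]}
    if not passing:
        # All six classifiers scored at or below threshold: non-enzyme.
        return "non-enzyme", combined
    # Assumption: when several classes pass, report the highest-scoring one.
    return max(passing, key=passing.get), combined
```

In practice the weights and thresholds would presumably be fitted on validation data for each EC class; the sketch only shows how the combined scores turn into an enzyme or non-enzyme call.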

Session B-134: Multi-label Prediction of Human Protein Function using Deep Neural Networks
COSI: Function
  • Rui Fa, The Francis Crick Institute, United Kingdom
  • Domenico Cozzetto, The Francis Crick Institute, United Kingdom
  • Cen Wan, The Francis Crick Institute, United Kingdom
  • David Jones, The Francis Crick Institute, United Kingdom

Session B-135: Prediction of human mitochondrial proteins from various resources
COSI: Function
  • Katsuhiko Murakami, Tokyo University of Technology, Japan

Short Abstract: Subcellular localization of proteins is important information for determining the fundamental function of unknown proteins. Many human proteins are still not firmly characterized as mitochondrial. Although much effort has been made to develop several databases of mitochondrial proteins, their lists of mitochondrial protein members differ considerably from one another. It is therefore important to integrate the various sources of information, examine their consistency, and provide a measure of reliability. We examined the predictability of human mitochondrial proteins using information from several resources, including experimental evidence as well as scores from sequence-based prediction tools. To examine predictability, we utilized a nonlinear supervised machine learning algorithm, RandomForest regression, which often achieves good predictive performance. Furthermore, we evaluated how much each feature (explanatory variable for regression) contributes to accurate prediction. The analysis showed that integrative features calculated from several single features contribute more to discrimination than experimental evidence. Finally, we provide a comprehensive measure of reliability that each protein is functional in mitochondria, using an integrative dataset from several mitochondrial protein databases.

Session B-136: Enhancing the Biological Relevance of Machine Learning Classifiers for Reverse Vaccinology
COSI: Function
  • Ashley Heinson, University of Southampton, United Kingdom

Session B-137: Metabolic pathway assignment based on phylogenetic profiling
COSI: Function
  • Sandra Weißenborn, Max Planck Institute of Molecular Plant Physiology, Germany
  • Dirk Walther, Max Planck Institute of Molecular Plant Physiology, Germany

Short Abstract: The assignment of gene function is a crucial step following the complete sequencing of a genome. In order to improve functional gene annotation in plants, we applied phylogenetic profiling [PMT+99] with the particular goal of identifying as yet unassigned secondary metabolic pathway genes as well as predicting the presence of secondary metabolic pathways in newly sequenced species. While primary metabolites and associated pathways are indispensable and, thus, occur in essentially all plants, most secondary metabolites and their biosynthesis pathways have evolved only in a subset of plant species. Thus, an involvement of genes in a common secondary pathway ought to be reflected by a common presence/absence pattern across different plant species. We calculated phylogenetic profiles for 42,014 metabolic pathway enzymes with KEGG enzyme identifiers from 24 plant species based on sequence and pathway annotation data from KEGG and Ensembl Plants. For the required step of gene family assignment, we included data of all 39 species available at the Ensembl Plants database and established gene families using a network-based approach. For a subset of known metabolic pathways, we were indeed able to show that the phylogenetic profiles of their enzymes cluster together significantly more often than randomly expected. In our tests for pathway assignments of enzymes with known function, the best results were achieved in the categories carbon metabolism and biosynthesis of amino acids, i.e. primary metabolism pathways. A successful application to secondary pathway assignments currently appears severely hampered by the paucity of functionally characterized secondary metabolism pathway genes in a broader set of plant species. Nonetheless, our results show that phylogenetic profiling has the potential to improve protein function prediction of unknown enzymes and may contribute to the identification and annotation of plant secondary metabolic pathways.
References [PMT+99] Matteo Pellegrini, Edward M. Marcotte, Michael J. Thompson, David Eisenberg, and Todd O. Yeates. Assigning protein functions by comparative genome analysis: Protein phylogenetic profiles. PNAS, 96(8):4285-4288, 1999.
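
The presence/absence idea behind phylogenetic profiling can be shown with a small sketch. The species labels and the use of the Jaccard index as the similarity measure are assumptions made for illustration; the abstract does not state which measure or clustering method was used.

```python
from typing import List, Sequence, Set


def build_profile(family_species: Set[str], all_species: Sequence[str]) -> List[int]:
    """Binary profile: 1 if the gene family occurs in a species, else 0."""
    return [1 if species in family_species else 0 for species in all_species]


def jaccard(profile_a: Sequence[int], profile_b: Sequence[int]) -> float:
    """Jaccard similarity of two binary presence/absence profiles."""
    both = sum(1 for a, b in zip(profile_a, profile_b) if a and b)
    either = sum(1 for a, b in zip(profile_a, profile_b) if a or b)
    return both / either if either else 0.0
```

Two enzymes whose families are present in exactly the same species have a similarity of 1.0 under this measure, which is the kind of shared pattern the approach looks for among enzymes of a common secondary pathway.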

Session B-138: Contribution of features based on sequence, predicted PPIs and GO similarities to the prediction of gene-HPO associations
COSI: Function
  • Branislava Gemović, Center for Multidisciplinary Research, Institute of Nuclear Sciences Vinča, University of Belgrade, Serbia
  • Radoslav Davidović, Center for Multidisciplinary Research, Institute of Nuclear Sciences Vinča, University of Belgrade, Serbia
  • Neven Šumonja, Center for Multidisciplinary Research, Institute of Nuclear Sciences Vinča, University of Belgrade, Serbia
  • Nevena Veljković, Center for Multidisciplinary Research, Institute of Nuclear Sciences Vinča, University of Belgrade, Serbia
  • Vladimir Perović, Center for Multidisciplinary Research, Institute of Nuclear Sciences Vinča, University of Belgrade, Serbia

Session B-139: Computational and experimental validation of amyloid databases
COSI: Function
  • Michal Burdukiewicz, University of Wrocław, Poland
  • Stefan Rödiger, BTU Cottbus - Senftenberg, Germany
  • Marlena Gąsior-Głogowska, Wroclaw University of Technology, Department of Biomedical Engineering and Instrumentation, Poland
  • Paweł Mackiewicz, University of Wroclaw, Poland
  • Malgorzata Kotulska, Wroclaw University of Technology, Department of Biomedical Engineering and Instrumentation, Poland

Short Abstract: Amyloids are self-aggregating proteins that participate in neurodegenerative disorders, such as Alzheimer's or Parkinson's disease. The computational prediction of amyloids is a great challenge because regions responsible for amyloidogenicity cannot be described by specific amino acid motifs, but rather by residues with specific physicochemical properties. We therefore created AmyloGram, software suitable for the detection of amyloids. The reduced amino acid alphabet, based on the physicochemical properties of amino acids, allows AmyloGram to handle the diversity of amyloid proteins. We found eight peptides that were annotated in the AmyLoad database as non-amyloidogenic but assessed by AmyloGram as having a high probability of amyloidogenicity. We analyzed them using Fourier transform infrared spectroscopy and found that seven of these eight peptides are in fact amyloidogenic. For three of the seven amyloidogenic peptides, our experimental results were also confirmed independently by other studies. Computational prediction using another amyloid-predicting tool, PASTA 2.0, revealed amyloidogenic properties in only two of the seven amyloids that were confirmed experimentally. Our data indicate that AmyloGram is able to detect amyloid proteins that were falsely assigned as non-amyloidogenic. Furthermore, it is able to find more false non-amyloids than other software while maintaining the same specificity (in both cases thresholds were adjusted to ensure a specificity of 0.95). Nevertheless, the candidate peptides should be tested in a wider experimental setting because amyloidogenicity can depend on various conditions. AmyloGram is available as a web-server (www.smorfland.uni.wroc.pl/amylogram/).

Session B-140: Automating Genomic Context Analysis with a Probabilistic Model of Protein Function and Relatedness
COSI: Function
  • Jeffrey Yunes, University of California, San Francisco, United States
  • Patricia Babbitt, University of California, San Francisco, United States

Session B-141: Manual annotation of a subset of the CAFA3 target set
COSI: Function
  • George Georghiou, EMBL European Bioinformatics Institute, United Kingdom

Short Abstract: The Critical Assessment of protein Function Annotation algorithms (CAFA) challenge is a large-scale assessment whose purpose is to evaluate new computational methods that are capable of predicting Human Phenotype Ontology and Gene Ontology (GO) terms for proteins, based on their sequence or structure. For the latest CAFA challenge (CAFA3), the UniProt Gene Ontology Annotation (GOA) team has contributed and curated two target sets: the first consists of intrinsically disordered proteins, and the second consists of moonlighting proteins. Participants in CAFA3 predicted the biological processes, molecular functions and cellular components of the target sets using GO. To create our target sets, we used data from the DisProt database for intrinsically disordered proteins (http://www.disprot.org/) and the MoonProt database of moonlighting proteins (http://www.moonlightingproteins.org/) to generate a potential list of proteins for CAFA. We identified 627 proteins as potentially being intrinsically disordered and 306 proteins as potential moonlighting proteins. For each candidate protein we found appropriate literature to provide experimental evidence of whether it was intrinsically disordered or had a secondary function. In total, we found more than 1100 papers associated with both target sets that needed to be read, evaluated, and, if suitable, curated using the GO. After evaluation, the intrinsically disordered dataset comprised 472 proteins and the moonlighting dataset comprised 156 proteins. In total, 766 papers were curated, which resulted in the creation of 6981 new GO annotations to be used as a benchmark for evaluating the predictions submitted by CAFA3 contestants. These annotations for the target sets are publicly accessible through the QuickGO website (http://www.ebi.ac.uk/QuickGO/).

Session B-142: JustOrthologs: A Fast, Accurate, and User-Friendly Ortholog Identification Algorithm
COSI: Function
  • Justin Miller, Brigham Young University, United States
  • Brandon Pickett, Brigham Young University, United States
  • Perry Ridge, Brigham Young University, United States

Session B-143: UniRule - A Unified Rule-Based System to Annotate Unreviewed Protein Entries in UniProtKB.
COSI: Function
  • Alexandre Renaux, EMBL-EBI, United Kingdom
  • Maria Martin, EMBL-EBI, United Kingdom
  • UniProt Consortium, EMBL-EBI/PIR/SIB, United Kingdom

Short Abstract: UniProt provides a comprehensive protein resource to the scientific community, most notably through the UniProt Knowledgebase (UniProtKB). Within UniProtKB, the reviewed section (UniProtKB/Swiss-Prot) contains high-quality, manually curated, richly annotated protein records. In contrast, the unreviewed section (UniProtKB/TrEMBL), which constitutes more than 99% of UniProtKB, is annotated with automatic annotation systems developed within the Consortium and data imports in collaboration with other databases. The use of rule-based annotation is necessary because there is no experimental data available for the majority of the unreviewed protein sequences. UniRule is a rule-based annotation system leveraging the expert-curated data in UniProtKB/Swiss-Prot. It unifies existing manual rule-based systems (HAMAP, PIR, and RuleBase rules) into one system which manages, applies, and evaluates all rules. Currently, UniRule contains over 4,500 applied rules, which provide annotation for approximately 29% of unreviewed entries. Rules are a formalized way of expressing an association between conditions, which have to be met, and annotations, which are propagated. InterPro signatures, predictive models for the functional classification of protein sequences, and taxonomic constraints are used as conditions. As a result, UniRule enriches the functional annotation of proteins with nomenclatures, catalytic activities, Gene Ontology terms and sequence features such as transmembrane domains. A key feature of the UniRule curation tool is a statistical quality control system allowing curators to evaluate their rules against the reviewed entries, to make sure rules are as accurate as possible. A dedicated space on the uniprot.org website has been created to allow users to view and explore UniRule.
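
The idea of a rule as an association between conditions and propagated annotations can be sketched as a small data structure. Everything named here (the `Rule` and `Entry` classes, their fields, and the identifiers used in any example data) is hypothetical and does not reflect the actual UniRule format or software.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, Iterable, List, Set, Tuple


@dataclass(frozen=True)
class Rule:
    """Hypothetical rule: conditions that must hold, annotations to propagate."""
    rule_id: str
    required_signatures: FrozenSet[str]  # e.g. InterPro-style signature IDs
    required_taxon: str                  # taxonomic constraint
    annotations: Tuple[str, ...]


@dataclass
class Entry:
    """Hypothetical unreviewed protein entry."""
    accession: str
    signatures: Set[str]
    lineage: List[str]
    annotations: List[str] = field(default_factory=list)


def rule_matches(rule: Rule, entry: Entry) -> bool:
    """All required signatures are present and the taxon is in the lineage."""
    return (
        rule.required_signatures <= entry.signatures
        and rule.required_taxon in entry.lineage
    )


def apply_rules(entry: Entry, rules: Iterable[Rule]) -> Entry:
    """Propagate the annotations of every matching rule, without duplicates."""
    for rule in rules:
        if rule_matches(rule, entry):
            for annotation in rule.annotations:
                if annotation not in entry.annotations:
                    entry.annotations.append(annotation)
    return entry
```

Evaluating such rules against reviewed entries, as the quality control system described above does, would amount to comparing the propagated annotations with the curated ones for entries that satisfy the conditions.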

Session B-144: Region-specific Function Prediction: automatically inferring function labels for protein regions
COSI: Function
  • Da Chen Emily Koo, New York University, United States
  • Noah Youngs, NYU & Simons Foundation, United States
  • Richard Bonneau, NYU, United States

Session B-145: Extending Hidden Markov Models to allow conditioning on previous observations
COSI: Function
  • Margarita C. Theodoropoulou, Department of Computer Science and Biomedical Informatics, University of Thessaly, Greece
  • Ioannis A. Tamposis, Department of Computer Science and Biomedical Informatics, University of Thessaly, Greece
  • Konstantinos Tsirigos, Department of Computer Science and Biomedical Informatics, University of Thessaly, Greece
  • Pantelis Bagos, Department of Computer Science and Biomedical Informatics, University of Thessaly, Greece

Short Abstract: Hidden Markov Models (HMMs) are probabilistic models widely used in computational molecular biology. However, the Markovian assumption regarding transition probabilities, which dictates that the observed symbol depends only on the current state, may not be sufficient for some biological problems. In order to overcome the limitations of the first-order HMM, a number of extensions have been proposed in the literature to incorporate past information in HMMs, conditioning either on the hidden states, or on the observations, or both. Here, we present a simple extension of the standard HMM in which the current observed symbol (amino acid residue) depends both on the current state and on a series of previously observed symbols. The major advantage of the method is the simplicity of its implementation, which is achieved by properly transforming the observation sequence using an extended alphabet. Thus, it can utilize all the available algorithms for the training or decoding of HMMs. We investigated the use of several encoding schemes and performed tests on a number of important biological problems previously studied by our team (prediction of transmembrane proteins and prediction of signal peptides). The evaluation shows that, when enough data are available, performance increases by 1.5%-8% and that existing prediction methods can be improved using this approach. The methods for which the improvement was significant (PRED-TMBB2, PRED-TAT and HMM-TM) are available as web-servers freely accessible to academic users at www.compgen.org/tools/.
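
One way to picture the observation transformation is the sketch below, which rewrites each residue as a composite symbol made of the preceding residues plus the current one. This particular encoding, the function name `extend_alphabet` and the padding character are assumptions for illustration; the abstract mentions several encoding schemes without specifying them.

```python
from typing import List


def extend_alphabet(sequence: str, order: int = 1, pad: str = "^") -> List[str]:
    """Encode each position as its `order` previous symbols plus itself.

    Positions near the start are padded with `pad`, so the output has the
    same length as the input and can be fed to a standard HMM whose
    emission alphabet consists of the composite symbols.
    """
    padded = pad * order + sequence
    return [padded[i:i + order + 1] for i in range(len(sequence))]
```

Because the transformed sequence is still an ordinary symbol sequence, standard training and decoding routines can be reused unchanged; the cost is an emission alphabet that grows exponentially with `order`.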

Session B-146: Coloring protein darkness
COSI: Function
  • Damiano Piovesan, BioComputing UP - University of Padova, Italy
  • Marco Necci, University of Padua, Italy
  • Silvio Tosatto, University of Padua, Italy

Session B-147: Viterbi Training of Hidden Markov Models for Labeled Sequences
COSI: Function
  • Margarita C. Theodoropoulou, Department of Computer Science and Biomedical Informatics, University of Thessaly, Greece
  • Ioannis Mintsopoulos, Department of Computer Science and Biomedical Informatics, University of Thessaly, Greece
  • Pantelis Bagos, University of Thessaly, Greece

Short Abstract: Hidden Markov Models (HMMs) are probabilistic models that have been successfully applied to various tasks in molecular biology. There are two types of problems that must be solved while building an HMM: estimating the model parameters (transition and emission probabilities) from the observed sequence, and finding the hidden sequence of states given the observed sequence (decoding). Traditionally, the parameters of an HMM are optimized according to the Maximum Likelihood criterion, mainly using the efficient Baum-Welch algorithm, and the decoding is performed using the standard Viterbi algorithm or the N-best algorithm. A different approach to parameter estimation is Viterbi Training (VT). VT estimates the parameters using only the most probable hidden state sequence produced by the Viterbi algorithm, rather than maximizing the likelihood of the observed data. Usually, for complex biological problems, we use labeled sequences with the so-called Class HMMs approach. Here, we present the development of a VT algorithm for labeled sequences. In order to evaluate the VT algorithm, we performed tests on a number of important biological problems previously studied by our team (prediction of transmembrane proteins and prediction of signal peptides). In all five biological problems, using the VT algorithm we managed to considerably reduce both the number of iterations needed and the total time spent on training, with a negligible (if any) decrease in the model’s efficiency. In the era of big data, efficient yet fast algorithms are urgently needed, and we believe that the new algorithm will be useful in this respect.
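
For orientation, standard Viterbi Training on unlabeled sequences can be sketched as below. This is not the labeled-sequence (Class HMM) variant presented in the poster; the function names, the pseudocount smoothing and the stopping rule (stop when the decoded paths no longer change) are assumptions.

```python
import math
from typing import Dict, List, Sequence

Dist = Dict[str, float]


def _log(p: float) -> float:
    return math.log(p) if p > 0 else float("-inf")


def viterbi(obs: str, states: Sequence[str], start: Dist,
            trans: Dict[str, Dist], emit: Dict[str, Dist]) -> List[str]:
    """Most probable state path for a non-empty sequence (log-space)."""
    score = [{s: _log(start[s]) + _log(emit[s][obs[0]]) for s in states}]
    back = []
    for symbol in obs[1:]:
        prev, column, pointers = score[-1], {}, {}
        for s in states:
            best = max(states, key=lambda r: prev[r] + _log(trans[r][s]))
            column[s] = prev[best] + _log(trans[best][s]) + _log(emit[s][symbol])
            pointers[s] = best
        score.append(column)
        back.append(pointers)
    path = [max(states, key=lambda s: score[-1][s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


def _normalise(counts: Dict[str, int], keys: Sequence[str], pseudo: float) -> Dist:
    total = sum(counts.get(k, 0) + pseudo for k in keys)
    return {k: (counts.get(k, 0) + pseudo) / total for k in keys}


def viterbi_training(sequences, states, symbols, start, trans, emit,
                     max_iter: int = 50, pseudo: float = 1.0):
    """Alternate Viterbi decoding and count-based re-estimation.

    `pseudo` must be positive so that unused states keep valid distributions.
    """
    previous_paths = None
    for _ in range(max_iter):
        paths = [viterbi(seq, states, start, trans, emit) for seq in sequences]
        if paths == previous_paths:
            break  # decoded paths are stable, so the counts would not change
        previous_paths = paths
        start_c: Dict[str, int] = {}
        trans_c = {s: {} for s in states}
        emit_c = {s: {} for s in states}
        for seq, path in zip(sequences, paths):
            start_c[path[0]] = start_c.get(path[0], 0) + 1
            for state, symbol in zip(path, seq):
                emit_c[state][symbol] = emit_c[state].get(symbol, 0) + 1
            for a, b in zip(path, path[1:]):
                trans_c[a][b] = trans_c[a].get(b, 0) + 1
        start = _normalise(start_c, states, pseudo)
        trans = {s: _normalise(trans_c[s], states, pseudo) for s in states}
        emit = {s: _normalise(emit_c[s], symbols, pseudo) for s in states}
    return start, trans, emit
```

Each iteration needs only one Viterbi pass per sequence and simple counting, which is where the speed advantage over Baum-Welch style re-estimation comes from.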

Session B-148: Automatic Generation of Functional Annotation Rules Using Inferred GO-Domain Associations
COSI: Function
  • Sabeur Aridhi, University of Lorraine, LORIA, Campus Scientifique, BP 239, 54506 Vandoeuvre-lès-Nancy, France
  • Seyed Ziaeddin Alborzi, INRIA Nancy Grand-Est, France
  • Marie-Dominique Devignes, LORIA-CNRS, France
  • David W. Ritchie, INRIA, France
  • Rabie Saidi, EMBL-EBI, United Kingdom
  • Maria J. Martin, EMBL-EBI, United Kingdom
  • Alexandre Renaux, EMBL-EBI, United Kingdom

Session B-149: SwissProtCluster: The New Protein Superfamily Database for Reliable Function Assignation by HMMERCTTER
COSI: Function
  • Nicolas Stocchi, IIB-CONICET-UNMdP, Argentina
  • Agustin Amalfitano, FI-CONICET-UNMdP, Argentina
  • Marcel Brun, FI-UNMdP, Argentina
  • Arjen ten Have, IIB-CONICET-UNMdP, Argentina

Short Abstract: Background: HMMER databases, like Pfam, are used for sequence function assignation. They use trusted cut-offs to obtain specificity at the cost of reduced sensitivity. The HMMER Cut-off Threshold Tool (HMMERCTTER) consists of HMMERCTTER_Clust, which identifies monophyletic clusters with 100% precision and recall (P&R), i.e. clusters that identify all cluster-sequences with higher scores than non-cluster-sequences. HMMERCTTER_Class then classifies target-sequences using the identified clusters. HMMERCTTER_Class can also use any sequence clustering that contains only 100% P&R clusters. Therefore, we developed a 100% P&R HMMER-cluster database based on UniProtKB-SwissProt, providing a reliable tool for function assignation of complete proteomes. Here we report the construction of the single-domain database. Results: Single-domain sequences were grouped based on family annotation codes and tested for 100% P&R. SwissProtCluster_1D.v1 contains 4143 groups of at least four sequences, totaling 69518 sequences, as well as 5871 ungrouped sequences. 3853 groups show 100% P&R; the remaining 290 groups were scrutinized by a script that removes outliers until the group reaches 100% P&R. Ungrouped sequences were clustered into new groups using a combination of CD-Hit and HMMERCTTER. SwissProtCluster_1D.v2 covers 86% of the UniProtKB-SwissProt single-domain sequence space. Sequences from small groups (n<4) were clustered using homologs from UniProt_RP15. Conclusions: UniProtKB-SwissProt contains sequences with incorrect family codes and protein families that are described by more than a single protein family code. Performance will be tested by a comparison with Pfam using UniProt_RP75 as benchmark.
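
The 100% P&R criterion defined above (every cluster sequence scores higher than every non-cluster sequence) reduces to a simple comparison, sketched here. The function names and the choice of the midpoint as cut-off are assumptions for illustration and are not taken from HMMERCTTER.

```python
from typing import Optional, Sequence


def has_perfect_pr(member_scores: Sequence[float],
                   non_member_scores: Sequence[float]) -> bool:
    """True if every member scores strictly higher than every non-member."""
    if not member_scores:
        return False
    if not non_member_scores:
        return True
    return min(member_scores) > max(non_member_scores)


def cutoff_threshold(member_scores: Sequence[float],
                     non_member_scores: Sequence[float]) -> Optional[float]:
    """Return a cut-off separating members from non-members, or None.

    Assumption: the midpoint between the lowest member score and the
    highest non-member score is used as the cut-off.
    """
    if not has_perfect_pr(member_scores, non_member_scores):
        return None
    if not non_member_scores:
        return min(member_scores)
    return (min(member_scores) + max(non_member_scores)) / 2
```

A group that fails this check is a candidate for the outlier removal described above: dropping the lowest-scoring members one at a time until the check passes is one conceivable approach, though the abstract does not say how the script selects outliers.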

Session B-150: Label-Space Dimensionality Reduction and a Similarity-Based Representation for Protein Function Prediction
COSI: Function
  • Stavros Makrodimitris, TU Delft, Netherlands
  • Roeland van Ham, TU Delft, Netherlands
  • Marcel Reinders, Delft University of Technology, Netherlands
Session B-151: Barcoding functional enrichment of gene expression neighbours reflects diversification of APOBEC3 cytidine deaminases in cancers.
COSI: Function
  • Joseph Ng, Randall Division of Cell and Molecular Biophysics, King's College London, United Kingdom
  • Michael Malim, Department of Infectious Diseases, Division of Immunology, Infection & Inflammatory Disease, King's College London, United Kingdom
  • Franca Fraternali, Randall Division of Cell and Molecular Biophysics, King's College London, United Kingdom

Short Abstract: Apolipoprotein B mRNA-editing enzyme catalytic polypeptide-like 3 (APOBEC3) is a family of cytidine deaminases capable of mutating cytosine (C) to uridine (U) on single-stranded DNA and countering retroviral replication. However, analyses of clinical sequencing data discovered that some APOBEC3 enzymes, APOBEC3B particularly, were associated with a distinct mutational signature on the somatic genome in some cancer types. Little is known about the mechanism of this mutagenic process. We seek to understand the functional context in which different APOBEC3 genes are expressed. Here we present a comprehensive analysis of RNA-seq data of 8,951 tumours from The Cancer Genome Atlas (TCGA) covering 25 cancer types, as well as that of cancer cell-lines and normal samples matching their sites-of-origin. Expression profiles of APOBEC3B correlated with a different set of genes from those that correlated with the other APOBEC3s. An extensive functional annotation characterised distinct barcodes of biological functions for such expression neighbours of each APOBEC3 gene: cell cycle and DNA repair processes were only enriched in APOBEC3B expression neighbours, while other APOBEC3 members were more related to immune processes and specific T-cell populations. This analysis reflected the diversification of this family of enzymes to act in both retroviral defence and oncogenesis, and highlighted a molecular signature characteristic of APOBEC3B upregulation in cancers.

Session B-152: Phylogenetic-based gene function prediction in the Gene Ontology Consortium
COSI: Function
  • Huaiyu Mi, University of Southern California, United States
  • Pascale Gaudet, Swiss Institute of Bioinformatics, Switzerland
  • Marc Feuermann, Swiss Institute of Bioinformatics, Switzerland
  • Anushya Muruganujan, University of Southern California, United States
  • Suzanna E. Lewis, Lawrence Berkeley National Laboratory, United States
  • Paul D. Thomas, University of Southern California, United States
Session B-153: Expanding the Critical Assessment of Function Annotation with Experimental Data and Biocuration
COSI: Function
  • Naihui Zhou, Iowa State University, United States
  • Yuxiang Jiang, Indiana University Bloomington, United States
  • Timothy Bergquist, University of Washington, United States
  • Kimberley A. Lewis, Dartmouth College, United States
  • Alex W. Crocker, Dartmouth College, United States
  • Deborah A. Hogan, Dartmouth College, United States
  • Maria J. Martin, European Bioinformatics Institute, United Kingdom
  • Claire O'Donovan, European Bioinformatics Institute, United Kingdom
  • Sean Mooney, University of Washington, United States
  • Casey Greene, University of Pennsylvania, United States
  • Predrag Radivojac, Indiana University, United States
  • Iddo Friedberg, Iowa State University, United States

Short Abstract: The increasing volume and variety of genotypic and phenotypic data is a major defining characteristic of modern biomedical sciences. Assigning functions to biological macromolecules, especially proteins, has turned out to be one of the major challenges in understanding life on a molecular level. While molecular experiments provide the most reliable annotation of proteins, their relatively low throughput and restricted purview have led to an increasing role for computational function prediction. However, properly assessing methods for protein function prediction and tracking progress in the field remain challenging as well. The Critical Assessment of Functional Annotation (CAFA) is a timed challenge to assess computational methods that automatically assign protein function. Here we report the preliminary results of the third CAFA challenge, and outline some additions that have taken place since the second CAFA. One hundred and forty-seven models from 66 research groups were received and are currently being evaluated for accuracy in predicting protein function in 27 target species. These functions are described by the Gene Ontology (GO) and the Human Phenotype Ontology (HPO). Comparisons between top-performing methods in CAFA1 and CAFA2 showed significant improvement in prediction accuracy, demonstrating the general improvement of automatic protein function prediction algorithms. We expect to see more improvement in CAFA3. CAFA3 features expanded protein sets for predictions. We are using experimental whole genome screens to generate ground truths for select functions in Drosophila melanogaster, Pseudomonas aeruginosa and Candida albicans. Additionally, we released sets of moonlighting proteins, to further challenge function prediction methods.

Session B-154: Gene annotation bias impedes biomedical research
COSI: Function
  • Winston Haynes, Stanford University, United States
  • Purvesh Khatri, Stanford University, United States
Session B-155: The landscape of microbial phenotypic traits and associated genes
COSI: Function
  • Maria Brbic, Rudjer Boskovic Institute, Croatia
  • Matija Piskorec, Rudjer Boskovic Institute, Croatia
  • Vedrana Vidulin, Institute Jozef Stefan, Slovenia
  • Anita Krisko, Mediterranean Institute for Life Sciences, Croatia
  • Tomislav Smuc, Rudjer Boskovic Institute, Croatia
  • Fran Supek, Institute for Research in Biomedicine, Spain

Short Abstract: Bacteria and Archaea display a variety of phenotypic traits and can adapt to diverse ecological niches. However, systematic annotation of prokaryotic phenotypes is lacking. We have therefore developed ProTraits, a resource containing ∼545 000 novel phenotype inferences, spanning 424 traits assigned to 3046 bacterial and archaeal species. These annotations were assigned by a computational pipeline that associates microbes with phenotypes by text-mining the scientific literature and the broader World Wide Web, while also being able to define novel concepts from unstructured text. Moreover, the ProTraits pipeline assigns phenotypes by drawing extensively on comparative genomics, capturing patterns in gene repertoires, codon usage biases, proteome composition and co-occurrence in metagenomes. Notably, we find that gene synteny is highly predictive of many phenotypes, and highlight examples of gene neighborhoods associated with spore-forming ability. A global analysis of trait interrelatedness outlined clusters in the microbial phenotype network, suggesting common genetic underpinnings. Our extended set of phenotype annotations allows detection of 57 088 high confidence gene-trait links, which recover many known associations involving sporulation, flagella, catalase activity, aerobicity, photosynthesis and other traits. Over 99% of the commonly occurring gene families are involved in genetic interactions conditional on at least one phenotype, suggesting that epistasis has a major role in shaping microbial gene content.

Session B-156: Predicting protein functions from sequence using a neuro-symbolic deep learning model
COSI: Function
  • Maxat Kulmanov, King Abdullah University of Science and Technology, Saudi Arabia
  • Mohammed Asif Khan, King Abdullah University of Science and Technology, Saudi Arabia
  • Robert Hoehndorf, King Abdullah University of Science and Technology, Saudi Arabia
Session B-157: Thinking outside the informatics box: Computed chemical properties for protein function annotation
COSI: Function
  • Caitlyn L. Mills, Northeastern University, United States
  • Lydia A. Ruffner, Northeastern University, United States
  • Mary Jo Ondrechen, Northeastern University, United States
  • Penny J. Beuning, Northeastern University, United States

Short Abstract: There are now over 14,000 Structural Genomics (SG) protein structures deposited in the Protein Data Bank (PDB) and most of these are of unknown or uncertain biochemical function. Reliable computational methods for the prediction of the function of protein structures are an important current need. Typically, functions are assigned using informatics-based approaches. The annotation of protein function by automated means has led to high rates of misannotations in some databases [1]. Here we present a complementary and powerful approach based on computed chemical properties of the individual residues in a protein structure. Partial Order Optimum Likelihood (POOL) is used to predict the residues in the query protein structure that are important for catalysis. Typically these include the residues in the first layer that make direct contact with the substrate molecule(s) and also some residues in the second and third layers that play supporting roles in the catalytic process [2]. Then, for proteins of known biochemical function, Structurally Aligned Local Sites of Activity (SALSA) [3] places the POOL-predicted residues into local structural alignments to establish chemical signatures – local arrays of active residues that are common to proteins of the same function. The POOL-predicted residues of the query (SG) protein are then aligned with the local chemical signatures for the different functional types. These alignments, each SG protein against each functional family, are scored in order to predict the most likely function of the SG proteins. Results are reported for the SG members of the Ribulose Phosphate Binding Barrel (RPBB), Clp-Crotonase, and Haloacid Dehalogenase superfamilies. While we find the SG proteins in the RPBB superfamily to be well annotated, we predict very high annotation error rates (about 75%) in the Clp-Crotonase superfamily.
Of particular interest are cases of predicted misannotation, where our prediction differs from that of the assigned function. Experimental testing of our predictions is performed by direct biochemical assays. Our annotations are shown to be correct for the cases that have been tested to date. Acknowledgments: This work has the support of the National Science Foundation under grant number CHE-1305655, MathWorks, Inc. and a PhRMA Foundation Fellowship awarded to CLM. References [1] A.M. Schnoes, S.D. Brown, I. Dodevski, and P.C. Babbitt, PLoS Comp Biol 5(12), e1000605 (2009). [2] H.R. Brodkin, N.A. DeLateur, S. Somarowthu, C.L. Mills, W.R. Novak, P.J. Beuning, D. Ringe, and M.J. Ondrechen, Protein Sci 24, 762-778 (2015). [3] Z. Wang, P. Yin, J. Lee, R. Parasuram, S. Somarowthu, and M.J. Ondrechen, BMC Bioinformatics 14(Suppl 3), S13 (2013).

Session B-158: Reasoning on Gene Ontology Networks Predicts Novel Protein Annotations
COSI: Function
  • Ilya Novikov, Baylor College of Medicine, United States
  • Angela Wilkins, Baylor College of Medicine, United States
  • Olivier Lichtarge, Baylor College of Medicine, United States
Session B-159: Artificial Dilution Series: A General Framework for Benchmarking Classifier Evaluation Metrics
COSI: Function
  • Petri Toronen, University of Helsinki, Finland
  • Ilja Pljusnin, University of Helsinki, Finland
  • Liisa Holm, University of Helsinki, Finland

Short Abstract: The comparison of competing Gene Ontology (GO) predictions requires the selection of appropriate evaluation metrics. Unfortunately, there is no consensus as to what is the most appropriate metric and, therefore, the metrics used tend to vary between publications. The selection of evaluation metrics for GO prediction has challenges: some metrics have difficulties related to highly unbalanced classes and others fail to account for the complex correlation structure between GO classes. We have developed a framework for testing the performance of classifier evaluation metrics called Artificial Dilution Series (ADS). ADS takes a GO annotated set of proteins and generates artificial prediction sets with a defined level of noise by permuting labels in the original data set. Next, the evaluation metric under test is applied to assess the quality of the artificial predictions. This procedure is repeated many times with different noise levels, creating a series of diluted versions of the original data set. Finally, we assess the performance of different evaluation metrics by their ability to separate different noise levels from one another. To complement ADS, we perform additional tests based on False Positive Data (FPD). Such data sets lack the original positive signal. Instead, they represent corner cases that cause some evaluation metrics to give unreasonably high scores. For a good evaluation metric, the FPD results should match its ADS results at low signal. We tested several GO prediction evaluation metrics using ADS and FPD and find clear differences in their performance.
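
The dilution idea described in this abstract can be sketched in a few lines. This is an illustrative toy, not the authors' ADS implementation: the permutation scheme (shuffling whole annotation sets among a randomly chosen fraction of proteins), the Jaccard-based metric and all names are assumptions made for the example.

```python
import random


def dilute(annotations, noise, rng):
    """Return an artificial prediction set with a given noise level.

    A fraction `noise` of the proteins is chosen at random and their
    annotation sets are permuted among themselves; all other proteins
    keep their true annotations. (Assumed scheme, for illustration.)
    """
    proteins = sorted(annotations)
    n_noisy = round(noise * len(proteins))
    chosen = rng.sample(proteins, n_noisy)
    donors = chosen[:]
    rng.shuffle(donors)
    predictions = dict(annotations)
    for target, donor in zip(chosen, donors):
        predictions[target] = annotations[donor]
    return predictions


def mean_jaccard(truth, predictions):
    """Toy evaluation metric: mean Jaccard index over all proteins."""
    total = 0.0
    for protein, true_terms in truth.items():
        predicted = predictions[protein]
        union = true_terms | predicted
        total += len(true_terms & predicted) / len(union) if union else 1.0
    return total / len(truth)


# Placeholder annotations; the term names are not real GO identifiers.
truth = {f"P{i}": {f"term_{i % 4}", f"term_{i % 7}"} for i in range(40)}
rng = random.Random(0)
for noise in (0.0, 0.25, 0.5, 0.75, 1.0):
    score = mean_jaccard(truth, dilute(truth, noise, rng))
    print(f"noise={noise:.2f}  mean Jaccard={score:.3f}")
```

A metric that behaves well under this scheme should yield scores that fall as the noise level rises and that separate neighbouring noise levels clearly across repeated runs.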

Session B-160: Elucidating the Function Space of Proteins Defined by Ontologies
COSI: Function
  • Predrag Radivojac, Indiana University, United States
  • Shawn Peng, Indiana University Bloomington, United States
  • Yuxiang Jiang, Indiana University Bloomington, United States
Session B-161: FunFOLDQ: a fast automated method for the prediction of ligand binding site residues and Gene Ontology terms
COSI: Function
  • Danielle Brackenridge, University of Reading, United Kingdom
  • Daniel Barry Roche, Institut de Biologie Computationnelle, Université de Montpellier and Centre de Recherche en Biologie cellulaire de Montpellier, France
  • Liam McGuffin, University of Reading, United Kingdom

Short Abstract: Protein ligand binding site prediction methods aim to predict, from amino acid sequence, protein-ligand interactions, putative ligands and ligand binding site residues using either sequence information, structural information or a combination of both. In silico characterisation of protein-ligand interactions has become extremely important to help determine a protein's function, as in vivo based functional elucidation is unable to keep pace with the current growth of sequence databases. Additionally, in vitro biochemical functional elucidation is time consuming, costly and may not be feasible for large scale analysis, such as drug discovery. Thus, in silico prediction of protein-ligand interactions needs to be utilized to aid in functional elucidation. Hence, we developed a structurally informed functional annotation pipeline, called FunFOLDQ, which predicts in silico protein-ligand interactions and Gene Ontology terms. FunFOLDQ, along with its previous implementations, has been ranked amongst the top methods in previous Critical Assessment of Techniques for Protein Structure Prediction (CASP) competitions, and ranked 2nd for prediction of “Holo” binding sites in the recent CASP12 competition. We also recently competed in the Critical Assessment of protein Function Annotation 3 (CAFA3) challenge. We will present our new methodology and benchmarking results. FunFOLDQ can be used to improve the functional annotation of protein domains and protein dark matter, as well as the study of protein-ligand interactions in areas such as rational drug design.

Session B-162: DeepLoc: Prediction of protein subcellular localization using deep learning
COSI: Function
  • José Juan Almagro Armenteros, Department of Bio and Health Informatics, Technical University of Denmark, Denmark
  • Casper Kaae Sønderby, The Bioinformatics Centre, Department of Biology, University of Copenhagen, Denmark
  • Søren Kaae Sønderby, The Bioinformatics Centre, Department of Biology, University of Copenhagen, Denmark
  • Henrik Nielsen, Department of Bio and Health Informatics, Technical University of Denmark, Denmark
  • Ole Winther, DTU Compute, Technical University of Denmark, Denmark

Short Abstract: Motivation: The prediction of eukaryotic protein subcellular localization is a well-studied topic in bioinformatics, due to its relevance in proteomics research. Many machine learning methods have been successfully applied to this task, achieving considerable performance. However, most methods depend on homology and knowledge database information to perform the prediction. In this project, we have used recurrent neural networks, which are able to process sequential data of varying length, to predict protein subcellular localization relying only on the sequence information. Furthermore, the model was trained on a protein dataset extracted from one of the latest UniProt releases, where the criteria for regarding protein annotations as experimental have been made much stricter. Results: The developed model is based on a convolutional long short-term memory (LSTM) neural network with an attention mechanism, which selectively focuses on important regions of the proteins. We demonstrate that our model achieves a good accuracy (0.7797) and outperforms current state-of-the-art algorithms. Moreover, we extend our model to predict if the proteins are membrane-bound or soluble, in combination with the subcellular localization, with a high accuracy (0.9234). Availability: The method is available as a web server at http://www.cbs.dtu.dk/services/DeepLoc-1.0

Session B-163: GO FEAT: fast functional annotation web tool for genomic and transcriptomic data
COSI: Function
  • Fabricio Araujo, UFPA, Brazil
  • Yan Pantoja, UFPA, Brazil
  • Ailton Sousa, UFPA, Brazil
  • Luis Guimarães, UFPA, Brazil
  • Rommel Ramos, UFPA, Brazil

Short Abstract: Giving biological meaning to genomic and transcriptomic data based on sequence annotation is laborious and time consuming, especially considering the amount of data generated by high-throughput technologies. The biological analysis is often given by functional annotation through the Gene Ontology (GO) database, which is widely used as a dictionary of gene function terms. GO terms are divided into three ontologies: cellular component, molecular function and biological process. To date, there are over 40,000 biological concepts registered at GO and they are based on experiments reported in over 100,000 peer-reviewed scientific papers. To simplify this process, some tools are available, such as Blast2GO, AmiGO, GOrilla, REVIGO, QuickGO, NaviGO. However, these tools have limitations: a) not all are freely available; b) installation, configuration and command line are complex; c) lack of visual interface; d) sequence number limitation for analysis; e) difficulty to share results between team members. Thus, we present GO FEAT, a free, online, user friendly platform for functional annotation and enrichment of genomic and transcriptomic data based on homology search analysis. GO FEAT overcomes the limitations of current tools by allowing users to generate reports, tables, GO charts and graphs that help the user with downstream analysis of data. In addition, GO FEAT allows users to export the results in different output formats. GO FEAT is freely available for use at http://computationalbiology.ufpa.br/gofeat/.

Session B-164: Efficient inference of orthologs in large eukaryotic pan-genomes
COSI: Function
  • Siavash Sheikhizadeh Anari, Wageningen University, Netherlands
  • Dick De Ridder, Wageningen University, Netherlands
  • Sandra Smit, Wageningen University, Netherlands
  • Eric Schranz, Wageningen University, Netherlands

Short Abstract: Homologous proteins/genes (homologs) have a common origin and might be inherited either from a single gene through a speciation event (orthologs) or from duplicated ones (paralogs). Identification of orthologs is fundamental to functional genomics, comparative genomics and phylogenomics research. Existing orthology inference approaches mostly rely on similarity scores between all pairs of proteins. Calculating pair-wise similarity scores through all-against-all comparisons quickly becomes a large computational burden as the number of genomes increases. As novel genomes are continually being generated by large-scale genomic projects, there is a need for more efficient orthology detection methods. We propose an efficient method for detecting orthology in large eukaryotic pan-genomes as an extension to PanTools, our pan-genomic data representation. We first find all pairs of intersecting proteins, i.e. pairs that share a certain fraction of amino acid k-mers, significantly shrinking the search space. Then, we calculate a normalized Smith-Waterman similarity score for each pair of intersecting proteins. Pairs with a similarity score greater than a pre-specified threshold will be connected through an edge in a so-called homology graph. Finally, we pass each connected component of the homology graph to MCL, a sensitive clustering algorithm, to be possibly broken into several orthology groups. Our results on large eukaryotic datasets demonstrate a significant efficiency gain compared to the well-known BLAST-based method, Orthofinder, with comparable accuracy.
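
The first and third steps of the pipeline described in this abstract (k-mer based pre-filtering and extraction of connected components from the homology graph) can be sketched as follows. This is a simplified illustration and not the PanTools implementation: the shared k-mer fraction is used directly as the edge criterion in place of the normalized Smith-Waterman score, the MCL step is omitted, and the function names, parameter defaults and sequences are invented for the example.

```python
from collections import defaultdict
from itertools import combinations


def kmers(seq, k):
    """Set of all overlapping k-mers in an amino acid sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def intersecting_pairs(proteins, k, min_shared):
    """Find protein pairs sharing at least `min_shared` of their k-mers.

    An inverted index (k-mer -> proteins) means only proteins with at
    least one k-mer in common are ever compared, which is what avoids
    the all-against-all comparison.
    """
    sets = {name: kmers(seq, k) for name, seq in proteins.items()}
    index = defaultdict(set)
    for name, ks in sets.items():
        for km in ks:
            index[km].add(name)
    shared = defaultdict(int)
    for names in index.values():
        for a, b in combinations(sorted(names), 2):
            shared[(a, b)] += 1
    return [(a, b) for (a, b), n in shared.items()
            if n / min(len(sets[a]), len(sets[b])) >= min_shared]


def connected_components(nodes, edges):
    """Connected components of an undirected graph via union-find."""
    parent = {n: n for n in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in edges:
        parent[find(a)] = find(b)
    groups = defaultdict(set)
    for n in nodes:
        groups[find(n)].add(n)
    return sorted(groups.values(), key=sorted)


# Invented toy sequences: a and b differ in one residue, c is unrelated.
proteins = {"a": "MKTAYIAKQR", "b": "MKTAYIAKQK", "c": "GGSSPLWWND"}
edges = intersecting_pairs(proteins, k=3, min_shared=0.5)
print(connected_components(proteins, edges))
```

In the method described above, each resulting component would then be scored with Smith-Waterman and passed to MCL, which may split it into several orthology groups.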

Session B-165: Functional Prediction Stability in UniProtKB/TrEMBL and its Implication in a Context of Evolving Data
COSI: Function
  • Maryam Abdollahyan, Queen Mary University of London, United Kingdom
  • Rabie Saidi, UniProt, European Bioinformatics Institute, Cambridge, United Kingdom
  • Maria Martin, EMBL-EBI, United Kingdom
  • Taha Boulehmi, University of Tübingen, Germany
Session B-166: A knowledge-based T2-statistic to perform pathway analysis for quantitative proteomic data
COSI: Function
  • En-Yu Lai, Bioinformatics Program, Taiwan International Graduate Program, Institute of Information Science, Academia Sinica; Institute of Biomedical Informatics, National Yang Ming University, Taiwan
  • Yi-Hau Chen, Institute of Statistical Science, Academia Sinica, Taiwan
  • Kun-Pin Wu, Institute of Biomedical Informatics, National Yang Ming University, Taiwan

Short Abstract: Approaches to identify significant pathways from high-throughput quantitative data have been developed in recent years. Still, the analysis of proteomic data remains difficult because of limited sample size. This limitation also leads to the common practice of using a competitive null, which fundamentally treats genes or proteins as independent units. The independence assumption ignores the associations among biomolecules with similar functions or cellular localization, as well as the interactions among them manifested as changes in expression ratios. Consequently, these methods often underestimate the associations among biomolecules and cause false positives in practice. Some studies incorporate the sample covariance matrix into the calculation to address this issue. However, sample covariance may not be a precise estimate if the sample size is very limited, which is usually the case for the data produced by mass spectrometry. In this study, we introduce a multivariate test under a self-contained null to perform pathway analysis for quantitative proteomic data. The covariance matrix used in the test statistic is constructed from the confidence scores retrieved from the STRING database or the HitPredict database. We also design an integrating procedure to retain pathways of sufficient evidence as a pathway group. The performance of the proposed T2-statistic is demonstrated using five published experimental datasets: the T-cell activation, the cAMP/PKA signaling, the myoblast differentiation, and the effect of dasatinib on the BCR-ABL pathway are proteomic datasets produced by mass spectrometry; and the protective effect of myocilin via the MAPK signaling pathway is a gene expression dataset of limited sample size. Compared with other popular statistics, the proposed T2-statistic yields more accurate descriptions in agreement with the discussion of the original publication.
We implemented the T2-statistic into an R package T2GA, which is available at https://github.com/roqe/T2GA.
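
The general shape of a T2-type statistic with an externally supplied covariance matrix can be sketched as below. This is not the T2GA code: it shows only the textbook one-sample form T² = n · m̄ᵀ Σ⁻¹ m̄ under a zero-mean null, and how the authors convert STRING or HitPredict confidence scores into Σ is not described in the abstract, so the matrix here is simply a hypothetical input.

```python
def solve(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with pivoting."""
    n = len(rhs)
    a = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n + 1):
                a[r][c] -= factor * a[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(a[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (a[i][n] - tail) / a[i][i]
    return x


def t2_statistic(samples, covariance):
    """One-sample T2 against a zero mean, with a given covariance.

    samples    -- list of replicates, each a list of log expression
                  ratios for the proteins of one pathway
    covariance -- knowledge-based covariance matrix for those proteins
                  (assumed given; its construction is not modelled here)
    """
    n = len(samples)
    p = len(samples[0])
    mean = [sum(row[j] for row in samples) / n for j in range(p)]
    weighted = solve(covariance, mean)  # Sigma^{-1} * mean
    return n * sum(m * w for m, w in zip(mean, weighted))


# Two replicates, two proteins; 0.5 stands in for a knowledge-based score.
samples = [[1.0, 2.0], [3.0, 2.0]]
print(t2_statistic(samples, [[1.0, 0.0], [0.0, 1.0]]))
print(t2_statistic(samples, [[1.0, 0.5], [0.5, 1.0]]))
```

With the identity matrix the proteins are treated as independent; a positive off-diagonal entry lowers the statistic when two associated proteins change in the same direction, which illustrates why ignoring such associations can inflate significance.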

Session B-167: Sugar Type Discrimination Method for Protein O-glycosylation Sites in Mammalian Proteins
COSI: Function
  • Kenji Etchuya, Meiji University, Japan
  • Yuri Mukai, Meiji University, Japan

Short Abstract: Glycosylation, one of the protein post-translational modifications, is known to contribute to protein folding, functions and enzyme activities. In O-glycosylation, various sugar chains are attached to motif residues (usually Ser or Thr) by glycosyltransferases in the Golgi body. O-glycosylations were thought to occur in the Golgi body; however, some of them have been confirmed in the nucleus, the endoplasmic reticulum and the cytosol. These sugar types promote different biological functions and play different roles in living cells. Many prediction tools for protein O-glycosylation have been developed and published on the web. These methods can predict major types of O-glycosylation sites using protein primary sequences and secondary structures. Sugar type discrimination that covers more sugar types will be useful to clarify the correlation between each sugar type and its biological function. In this study, to develop a sugar type discrimination method, amino acid residues around the O-glycosylation sites were analyzed. Mammalian protein data with O-glycosylation annotations were extracted from UniProtKB/Swiss-Prot 2016_07. The amino acid propensities around O-glycosylation sites were compared between sugar types and applied to the sugar type discrimination method. A PSSM was calculated from the amino acid propensities. The accuracy of the sugar type discrimination method between sugar types was over 80% in a 5-fold cross-validation test. The characteristics of each sugar type around O-glycosylation sites and the details of this method are presented in this study.

Session B-168: High-throughput Protein Functional Prediction by Data Science Approach
COSI: Function
  • Yi-Wei Liu, Department of Computer Science, National Chengchi University, Taiwan
  • Wen-Hung Liao, Department of Computer Science, National Chengchi University, Taiwan
  • Jia-Ming Chang, Department of Computer Science, National Chengchi University, Taiwan
Session B-169: A Self-training Approach for Functional Annotation of UniProtKB Proteins
COSI: Function
  • Maryam Abdollahyan, Queen Mary University of London, United Kingdom
  • Rabie Saidi, European Molecular Biology Laboratory, European Bioinformatics Institute, United Kingdom
  • Fabrizio Smeraldi, Queen Mary University of London, United Kingdom
  • Maria-Jesus Martin, European Molecular Biology Laboratory, European Bioinformatics Institute, United Kingdom
Session B-170: A motif approach for mining conserved gene orders and its application for microbial genomes
COSI: Function
  • Michal Ziv-Ukelson, Ben Gurion University of the Negev, Israel
  • Dina Svetlitsky, Ben Gurion University of the Negev, Israel
  • Tal Dagan, Institute of General Microbiology, Christian-Albrechts University Kiel, Germany
  • Vered Chalifa-Caspi, Bioinformatics Core Facility, National Institute for Biotechnology in the Negev, Israel

Short Abstract: Motivation: Clusters of genes that are conserved across multiple genomes provide important clues as to gene transcriptional regulation and function. The recent availability of completely sequenced genomes calls for gene-based pattern discovery on a large scale. Previous computational work addressed general non-collinear gene cluster discovery, allowing for gene order shuffling between occurrences of the cluster in different genomes. Here we seek ordered gene clusters, i.e. groups of genes whose order is conserved across a wide range of prokaryotic taxa. We term these clusters Ordered Gene Motifs (OGMs). Results: We present a novel methodology for the discovery, ranking and taxonomic distribution analysis of OGMs. Our approach – OGMFinder – is based on an efficient reference-based motif discovery algorithm, adapted to scale up to a large alphabet of all orthologous gene families. We apply OGMFinder to 1,485 bacterial chromosomes and 468 plasmid genome annotations. Our analysis yields a catalogue of 15,281 chromosomal and 517 plasmid-encoded conserved OGMs that are ranked according to a probabilistic ranking score. Plasmid-encoded OGMs are, on average, shorter than chromosomal OGMs. Furthermore, the distribution of OGMs in chromosomes has a clear taxonomic signal, while plasmid-encoded OGMs may be shared by bacteria having a similar lifestyle or residing in a similar habitat.

Session B-504: Neighborhood-Based Label Propagation in Large Protein Graphs
COSI: Function
  • Sabeur Aridhi, University of Lorraine/LORIA/INRIA, France
  • Seyed Ziaeddin Alborzi, LORIA/INRIA, France
  • Malika Smaïl-Tabbone, University of Lorraine/LORIA, France
  • Marie-Dominique Devignes, LORIA/INRIA, France
  • David Ritchie, LORIA/INRIA, France

Short Abstract: Understanding protein function is one of the keys to understanding life at the molecular level. It is also important in several scenarios including human disease and drug discovery. In this age of rapid and affordable biological sequencing, the number of sequences accumulating in databases is rising at an increasing rate. This presents many challenges for biologists and computer scientists alike. In order to make sense of this huge quantity of data, these sequences should be annotated with functional properties. UniProtKB consists of two components: i) the UniProtKB/Swiss-Prot database containing protein sequences with reliable information manually reviewed by expert bio-curators and ii) the UniProtKB/TrEMBL database that is used for storing and processing the unknown sequences. Hence, for all proteins the sequence is available along with a few further attributes such as the taxon and some structural domains. Pairwise similarity can be defined and computed on proteins based on such attributes. Other important attributes, such as function and cellular localization, while present for proteins in Swiss-Prot, are often missing for proteins in TrEMBL. The enormous number of protein sequences now in TrEMBL calls for rapid procedures to annotate them automatically. In this work, we present DistNBLP, a novel Distributed Neighborhood-Based Label Propagation approach for large-scale annotation of proteins. The functional annotations of reviewed proteins are used to predict those of non-reviewed proteins using label propagation on a graph representation of the protein database. DistNBLP is built on top of the "akka" toolkit for building resilient distributed message-driven applications.
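
The core idea of neighborhood-based label propagation described in this abstract can be illustrated with a single-machine sketch. This is not DistNBLP: the distributed, message-driven part is left out entirely, and the weighting rule (each unreviewed protein scores a label by the similarity-weighted share of its reviewed neighbours that carry it), the function name and the data are assumptions for the example.

```python
from collections import defaultdict


def propagate_labels(similarity, labels):
    """Score labels for unlabelled nodes from their labelled neighbours.

    similarity -- dict: node -> {neighbour: similarity weight}
    labels     -- dict: reviewed node -> set of functional labels
    Returns dict: unreviewed node -> {label: score in [0, 1]}.
    """
    scores = {}
    for node, neighbours in similarity.items():
        if node in labels:
            continue  # reviewed proteins keep their curated labels
        votes = defaultdict(float)
        total = 0.0
        for neighbour, weight in neighbours.items():
            if neighbour in labels:
                total += weight
                for label in labels[neighbour]:
                    votes[label] += weight
        # Normalise by the total weight of labelled neighbours.
        scores[node] = ({lab: v / total for lab, v in votes.items()}
                        if total else {})
    return scores


# Hypothetical graph: q1 and q2 are unreviewed, r1 and r2 are reviewed.
similarity = {
    "q1": {"r1": 0.75, "r2": 0.25},
    "q2": {"q1": 0.5},
    "r1": {"q1": 0.75},
    "r2": {"q1": 0.25},
}
labels = {"r1": {"kinase"}, "r2": {"kinase", "membrane"}}
print(propagate_labels(similarity, labels))
```

In a large protein graph the same update would be applied per node with only local neighbourhood information, which is what makes the approach amenable to a message-passing implementation.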

Session B-507: Bio-M-lom: Deep multi-label, multi-ontology multi-model network architecture for biological sequence annotation
COSI: Function
  • Ricardo Corral, Instituto de Fisiología Celular, Mexico
  • Ismael Fernández Martínez, Mexico
  • Gabriel Del Rio Guerra, Mexico

Short Abstract: We propose Bio-M-lom, a general deep neural network architecture that simultaneously handles multiple ontologies and knowledge transfer between multiple learning models for extreme multi-label learning on biological sequences.

Session B-508: Evolution of chemoreceptor genes in solitary Hymenoptera
COSI: Function
  • George F. Obiero, Max Planck Institute for Chemical Ecology, Germany
  • Thomas Pauli, Department of Evolutionary Biology and Ecology, Ludwig Albert University of Freiburg, Germany
  • Bill S. Hansson, Max Planck Institute for Chemical Ecology, Germany
  • Oliver Niehuis, Department of Evolutionary Biology and Ecology, Ludwig Albert University of Freiburg, Germany
  • Ewald Große-Wilde, Max Planck Institute for Chemical Ecology, Germany

Short Abstract: Hymenoptera play important ecological roles: they are pollinators, scavengers, parasitoids, predators and invasive pests of food crops and forests. Importantly, the order exhibits transitions in lifestyle, for example from solitary to eusocial and from entomophagous to phytophagous. Hymenoptera rely heavily on chemosensory detection of food sources, nests, mates and oviposition sites. Eusocial Hymenoptera underwent an unusual expansion of their chemosensory receptor families, and there is evidence that this expansion is connected to their eusocial lifestyle. However, testing this hypothesis requires analyzing non-eusocial hymenopteran species, and only very few such species, most of them distantly related to eusocial lineages, have been scrutinized. Can an expansion of chemoreceptor repertoires be explained by the emergence of eusociality? We aim to address this question in solitary hymenopteran wasps using comparative genomics and transcriptomics of chemoreceptor genes from a select number of solitary sister species: Psenulus fuscipennis, Ampulex compressa and Cerceris arenaria. The focus of our analysis will be the chemoreceptor gene families: odorant receptors (ORs), ionotropic receptors (IRs) and gustatory receptors (GRs). Preliminary results indicate that C. arenaria possesses >300 ORs, a larger number than any eusocial bee. It also has clearly distinct species-specific gene family expansions. The OR-coding genes are mostly organized in clusters containing 2-18 OR genes. The 9-exon OR clade, supposedly connected to eusociality, contains genes similar to those of bees. Overall, our preliminary results indicate that the expansion of the OR family predates the emergence of eusociality in ants and bees and was likely driven by alternative selection pressures.

Session B-510: PANNZER 2: Annotate a complete proteome in minutes!
COSI: Function
  • Alan Medlar, University of Helsinki, Finland
  • Petri Toronen, University of Helsinki, Finland
  • Elaine Zosa, University of Helsinki, Finland
  • Liisa Holm, University of Helsinki, Finland

Short Abstract: As high-throughput sequencing has become increasingly efficient, downstream analysis has become a bottleneck in genome sequencing projects. Annotation of the organism’s proteome is a critical and time-consuming process. We present PANNZER 2, an interactive web server and standalone program for protein function prediction. It uses SANS-parallel, a high-throughput sequence search program that is thousands of times faster than BLAST, to identify the homologous sequences used in prediction. It is therefore capable of analyzing tens of thousands of proteins interactively. Results are displayed as an HTML table summarizing both description text and GO predictions, including a statistical estimate of the reliability of each prediction. The results page additionally links to the complete list of SANS hits, allowing users to understand how results were derived. PANNZER 2 provides four alternative scoring functions for GO prediction to highlight which predictions are more robust than others. The scoring functions include PANNZER, BLAST2GO and ARGOT2-like scores. PANNZER 2 formed the backbone of our CAFA 3 entry, an ensemble method combining multiple data sources, including sequence similarity, biomedical literature, inter-ontology annotation correlations and protein-protein interaction data.

