Leading Professional Society for Computational Biology and Bioinformatics
Connecting, Training, Empowering, Worldwide

Posters

View Posters By Category

Session A: (July 22 and July 23)
Session B: (July 24 and July 25)

Presentation Schedule for July 22, 6:00 pm – 8:00 pm

Presentation Schedule for July 23, 6:00 pm – 8:00 pm

Presentation Schedule for July 24, 6:00 pm – 8:00 pm

Session A Poster Set-up and Dismantle
Session A Posters set up: Monday, July 22 between 7:30 am - 10:00 am
Session A Posters should be removed at 8:00 pm, Tuesday, July 23.

Session B Poster Set-up and Dismantle
Session B Posters set up: Wednesday, July 24 between 7:30 am - 10:00 am
Session B Posters should be removed at 2:00 pm, Thursday, July 25.

C-01: GO FEAT2: an online tool for functional annotation and comparative analysis of metagenomes
COSI: Function COSI
  • Fabricio Araujo, UFPA, Brazil
  • Yan Pantoja, Federal University of Pará, Brazil
  • Ailton Sousa, UFPA, Brazil
  • Dener Maues, UFPA, Brazil
  • Artur Silva, Federal University of Pará, Brazil
  • Rommel Ramos, Federal University of Pará, Brazil

Short Abstract: Terrestrial biomass consists mainly of microorganisms, which can be found almost anywhere on Earth. It is estimated that the number of species of microorganisms on Earth is around one trillion, and only 1% to 2% of them are viable for in vitro isolation and cultivation. The analysis of microorganisms has benefited greatly from technological advances in Next Generation Sequencing (NGS), which have made sequencing more accurate, cheaper and faster. Additionally, NGS platforms allow the sequencing of non-cultivable microorganisms, so that they can be studied in their natural environments. This can provide answers about these microorganisms, especially about the functions of genes in the community. However, the large volume of data obtained after sequencing is difficult to analyze. Bioinformatics programs are essential in metagenomic analyses, but they require strong computational knowledge, which can be a limitation for users without a bioinformatics background, especially in analyses that require several tools. Thus, we propose the development of a simple and intuitive online tool for the analysis of metagenomic data, which characterizes the sequences belonging to samples and performs comparative analysis integrated with biological databases such as NCBI, EBI, InterPro and Gene Ontology.

C-02: OutCyte: a novel tool for predicting unconventional protein secretion
COSI: Function COSI
  • Linlin Zhao, Heinrich-Heine-University Dusseldorf, Germany
  • Gereon Poschmann, Heinrich-Heine-University Dusseldorf, Germany
  • Daniel Waldera-Lupa, Heinrich-Heine-University Dusseldorf, Germany
  • Nima Rafiee, Heinrich-Heine-University Dusseldorf, Germany
  • Markus Kollmann, Heinrich-Heine-University Dusseldorf, Germany
  • Kai Stühler, Heinrich-Heine-University Dusseldorf, Germany

Short Abstract: The prediction of protein localization, such as the extracellular space, from high-throughput data is essential for functional downstream inferences. Most secreted proteins go through the conventional endoplasmic reticulum (ER) – Golgi pathway with the guidance of a signal peptide. However, some proteins have been found to reach the extracellular space following unconventional pathways. Reliable prediction of unconventional protein secretion (UPS) is increasingly in demand. Here, we present OutCyte, a fast and accurate tool for the prediction of UPS, which has been built upon experimentally proven proteins. It produces predictions in two steps: proteins with N-terminal signals are first accurately filtered out, then proteins without an N-terminal signal are classified as UPS or intracellular proteins by traits directly generated from their amino acid sequences. As there is a growing body of experimental data, we believe that OutCyte can serve as a significant step towards a better understanding of the different secretory pathways.

C-03: A proteome-wide resource for eukaryotic peripheral membrane proteins
COSI: Function COSI
  • Katerina Nastou, Section of Cell Biology and Biophysics, Department of Biology, National and Kapodistrian University of Athens, Greece
  • Georgios Tsaousis, Section of Cell Biology and Biophysics, Department of Biology, National and Kapodistrian University of Athens, Greece
  • Stavros Hamodrakas, Section of Cell Biology and Biophysics, Department of Biology, National and Kapodistrian University of Athens, Greece
  • Vassiliki Iconomidou, Section of Cell Biology and Biophysics, Department of Biology, National and Kapodistrian University of Athens, Greece

Short Abstract: Membranes are a critical cellular component, upon which most cell functions are based. They are associated with a plethora of proteins that can be classified as either integral or peripheral, based on the nature of their interactions. Peripheral membrane proteins are an understudied but nevertheless very important protein group, considering the large number of cellular functions in which they are involved. PerMemDB represents the first effort to accumulate and categorize this heterogeneous group of membrane proteins in a dedicated database. Data sources include UniProt and MBPpred. Proteins with subcellular location “Peripheral membrane protein” and interaction partners of transmembrane proteins were isolated from UniProt, while peripheral membrane proteins interacting directly with membrane lipids were identified with the use of MBPpred. In its current version, PerMemDB contains 241173 peripheral membrane proteins from 1216 organisms. PerMemDB entries are supplemented with detailed annotation and linked to many biological databases, in order to provide an overview of each protein. Moreover, entries collected with the use of MBPpred have additional information regarding the characteristic domains that allow them to interact with membranes. To our knowledge, PerMemDB is currently the largest repository for eukaryotic peripheral membrane proteins. The PerMemDB web interface is available at http://bioinformatics.biol.uoa.gr/permemdb.

C-04: BUSCA for the annotation of protein subcellular localization
COSI: Function COSI
  • Castrense Savojardo, Biocomputing Group, University of Bologna, Italy
  • Piero Fariselli, Department of Medical Sciences, University of Torino, Torino, Italy
  • Martelli Luigi, University of Bologna - Biocomputing Group, Italy
  • Rita Casadio, University of Bologna - Biocomputing Group & National Research Council - Institute of Biomembranes and Bioenergetics, Italy

Short Abstract: In-silico prediction of protein subcellular localization is a key step of large-scale functional annotation projects aiming at understanding the role of each protein within the cellular complexity. We recently developed the Bologna Unified Subcellular Component Annotator (BUSCA) (http://busca.biocomp.unibo.it), a web server integrating different computational tools developed by our group and devised to predict protein subcellular localization. In particular, BUSCA combines: i) tools for the detection of signal and transit peptides (DeepSig and TPpred3), GPI-anchors (PredGPI) and transmembrane domains (ENSEMBLE3.0 and BetAware); ii) tools for assessing subcellular localization of both globular and membrane proteins (BaCelLo, MemLoci and SChloro). The different tools are organized into five pipelines for predicting subcellular localization in plants, animals, fungi, Gram-positive and Gram-negative bacteria. Overall, BUSCA discriminates sixteen different compartments in plants, nine in animals and fungi, four in Gram-negative and three in Gram-positive bacteria. Moreover, BUSCA provides residue-level feature annotation including membrane-spanning segments, signal and transit peptide cleavage sites and GPI-anchors. In benchmarks performed on CAFA2 and CAFA3 targets, BUSCA reported performance comparable to or even better than other methods: F1 measures are 0.49 and 0.59 in the CAFA2 and CAFA3 experiments, respectively. We propose BUSCA as an accurate resource for the annotation of protein subcellular localization.

C-05: Metric Learning on Expression Data for Gene Function Prediction
COSI: Function COSI
  • Stavros Makrodimitris, Delft University of Technology, Netherlands
  • Roeland van Ham, Delft University of Technology, Netherlands
  • Marcel Reinders, TU Delft and Leiden University Medical Center, Netherlands

Short Abstract: Motivation: Co-expression of two genes across different conditions is indicative of their involvement in the same biological process. However, using RNA-Seq datasets with many experimental conditions from diverse sources introduces batch effects and other artefacts that might obscure the real co-expression signal. Moreover, only a subset of experimental conditions is expected to be relevant for finding genes related to a particular Gene Ontology (GO) term. Therefore, we hypothesize that when the purpose is to find similarly functioning genes, the co-expression of genes should not be determined on all samples but only on those samples informative for the GO term of interest. Results: We developed MLC (Metric Learning for Co-expression), a fast algorithm that assigns a GO-term-specific weight to each expression sample. The goal is to obtain a weighted co-expression measure that is more suitable than the unweighted Pearson correlation for applying Guilt-By-Association-based function predictions. More specifically, if two genes are annotated with a given GO term, MLC tries to maximize their weighted co-expression, and if one of them is not annotated with that term, the weighted co-expression is minimized. Our experiments on publicly available Arabidopsis thaliana RNA-Seq data demonstrate that MLC outperforms standard Pearson correlation in term-centric ROC AUC.
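
The weighted co-expression measure mentioned in the abstract above can be illustrated with a weighted Pearson correlation. This is only a minimal sketch: the function name `weighted_pearson` and the hand-picked weights are assumptions for illustration, and the way MLC actually learns its GO-term-specific weights is not reproduced here.

```python
import math

def weighted_pearson(x, y, w):
    """Pearson correlation of x and y with non-negative per-sample weights w.

    Illustrative only: in MLC the weights would be learned per GO term;
    here they are supplied by hand.
    """
    total = sum(w)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    # Weighted means of the two expression profiles.
    mx = sum(wi * xi for wi, xi in zip(w, x)) / total
    my = sum(wi * yi for wi, yi in zip(w, y)) / total
    # Weighted covariance and variances.
    cov = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y)) / total
    vx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x)) / total
    vy = sum(wi * (yi - my) ** 2 for wi, yi in zip(w, y)) / total
    if vx == 0 or vy == 0:
        return 0.0  # correlation is undefined for a constant profile
    return cov / math.sqrt(vx * vy)

# Toy profiles: the two genes agree on the first three samples only.
gene_a = [1.0, 2.0, 3.0, 10.0]
gene_b = [2.0, 4.0, 6.0, -5.0]
unweighted = weighted_pearson(gene_a, gene_b, [1, 1, 1, 1])
downweighted = weighted_pearson(gene_a, gene_b, [1, 1, 1, 0])
```

Setting the weight of the fourth sample to zero recovers the strong positive co-expression that the unweighted correlation hides, which is the intuition behind weighting samples per GO term.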

C-06: Geometric Deep Learning Methods for Large-Scale Structure-Based Protein Function Prediction
COSI: Function COSI
  • Douglas Renfrew, Flatiron Institute, United States
  • Vladimir Gligorijević, Center for Computational Biology, Flatiron Institute, Simons Foundation, New York, NY, USA, United States
  • Richard Bonneau, Center for Data Science, New York University, New York, NY, USA, United States

Short Abstract: We present a novel geometric deep learning method based on Graph Convolutional Neural Networks (GCNNs) for predicting functions from experimental and Rosetta-predicted protein structures. As opposed to CNNs, which have been the state-of-the-art in predicting protein function from sequence, we show that GCNNs are better at extracting features from proteins and predicting their functions by taking into account the graph-based structure of their amino acid residues represented by contact maps. Our method uses three graph convolutional layers followed by two fully connected layers to learn a complex structure-to-function relationship by first automatically extracting features from PDB protein structures and then mapping them to Gene Ontology functional classes. We show that the GCNN-based method learns generalizable aspects of protein structure by robustly predicting functions of proteins with < 30% sequence identity to the training set. We also demonstrate its comparable performance in predicting protein functions from ~300 Rosetta-predicted lowest energy structures compared to predictions from their native structures. Our model advances the state-of-the-art on a wide range of functions and removes the need for manual feature engineering from protein structures; it can make predictions that go beyond homology-based transfer and can reliably be used in annotating proteins with predicted structures.
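
To make the idea of convolving residue features over a contact map concrete, here is a minimal sketch of a single graph convolutional layer in plain Python. The abstract does not state which propagation rule the authors use; the symmetric normalization with self-loops below is a common choice and is an assumption, as are all function names and the toy inputs.

```python
import math

def normalize_adjacency(a):
    """Return D^-1/2 (A + I) D^-1/2 for a square contact map A (assumed rule)."""
    n = len(a)
    a_hat = [[a[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]

def matmul(p, q):
    """Plain nested-list matrix product."""
    return [[sum(p[i][k] * q[k][j] for k in range(len(q))) for j in range(len(q[0]))]
            for i in range(len(p))]

def graph_conv(a, h, w):
    """One layer: ReLU(normalized_adjacency @ H @ W)."""
    z = matmul(matmul(normalize_adjacency(a), h), w)
    return [[max(0.0, v) for v in row] for row in z]

# Toy chain of three residues: residue 1 is in contact with residues 0 and 2.
contact_map = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]   # hypothetical per-residue features
weights = [[1.0, 0.0], [0.0, 1.0]]                # identity weights for readability
out = graph_conv(contact_map, features, weights)
```

Each output row mixes a residue's own features with those of its contacts, so stacking several such layers lets information travel along the structure graph.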

C-07: EnzymeMiner: Web Server for Automated Mining and Annotation of Soluble Enzymes in Genomic Databases
COSI: Function COSI
  • Jiří Hon, Brno University of Technology, Czechia
  • Simeon Borko, Brno University of Technology, Czechia
  • Martin Marušiak, Brno University of Technology, Czechia
  • Tomáš Martínek, Brno University of Technology, Czechia
  • David Bednář, Loschmidt Laboratories, Czechia
  • Jiri Damborsky, Loschmidt Laboratories, Masaryk University; International Clinical Research Center, Czechia

Short Abstract: Millions of protein sequences are being discovered at an incredible pace, representing an inexhaustible source of biocatalysts [1]. Traditional biochemical characterization techniques are time-demanding, cost-ineffective, and low-throughput. To address these limitations, we have developed the EnzymeMiner web server for automated and periodic in silico screening of diverse family members. EnzymeMiner helps to effectively prioritize and select novel putative enzyme sequences by providing sequence similarity network visualization, active site and Pfam annotations, observations from the BioProject database and prediction of protein solubility using the SoluProt method [2]. EnzymeMiner provides a highly interactive and easy-to-use web interface enabling well-informed selection of promising soluble enzymes for further experimental characterization. The only required input is an EC number. Using EnzymeMiner, a number of novel haloalkane dehalogenases with potential practical uses have been identified, characterized, and made available to the community in industry and academia [1]. Further application of EnzymeMiner to other enzyme families will expand our knowledge of protein evolution and will lead to the discovery of novel biocatalysts. [1] Vanacek et al. 2018, Exploration of Enzyme Diversity by Integrating Bioinformatics with Expression Analysis and Biochemical Characterization. ACS Catalysis 8: 2402–2412 [2] Hon et al. 2019, SoluProt: Prediction of Protein Solubility. Bioinformatics (in preparation)

C-08: Computing the Language of Life: NLP Approaches to Feature Extraction for Protein Family Classification
COSI: Function COSI
  • Ananthan Nambiar, Reed College, United States
  • Mark Hopkins, Reed College, United States
  • Anna Ritz, Reed College, United States

Short Abstract: Biologists have long viewed both deoxyribonucleic acid (DNA) and proteins as languages that make up life, and predicting protein function from sequence information is an active area of research. We studied the effectiveness of several natural language processing methods (text embedding and convolutional neural networks), including existing methods such as ProtVec, DeepFam and seq2vec, on protein family classification. Using two datasets from SwissProt and Clusters of Orthologous Groups (COGs), we found that low-level features of protein sequences have the potential to contribute to powerful classifiers. Based on this analysis, we evaluated an additional, simpler approach based on term frequency and logistic regression. This simpler method obtained an average accuracy of 0.98 on the binary classification task of determining whether a protein belongs to a particular family, which is competitive with the state-of-the-art approaches. This suggests that, at least in the binary case, protein family classification may be a simpler problem than expected.
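
The term-frequency features described in the abstract above might look like the following sketch. Treating overlapping k-mers as the "terms" is an assumption (the abstract does not say how sequences are tokenized), and the function name is hypothetical; the resulting vectors would then be fed to a logistic regression classifier, which is omitted here.

```python
from collections import Counter

def kmer_term_frequencies(sequence, k=3):
    """Relative frequency of each overlapping k-mer in a protein sequence.

    Assumption: k-mers play the role of "words"; k=3 is an arbitrary default.
    """
    kmers = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    if not kmers:
        return {}  # sequence shorter than k
    counts = Counter(kmers)
    return {kmer: n / len(kmers) for kmer, n in counts.items()}

# Toy sequence, not a real protein.
tf = kmer_term_frequencies("MKVMKV", k=3)
```

For the toy sequence, the 3-mer "MKV" occurs in two of the four windows, so its term frequency is 0.5.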

C-09: Computational identification of prion-like RNA-binding proteins that form liquid phase-separated condensates
COSI: Function COSI
  • Gabriele Orlando, Katholieke Universiteit Leuven, Italy
  • Daniele Raimondi, Katholieke Universiteit Leuven, Belgium
  • Francesco Tabaro, Institute of Biosciences and Medical Technology, Finland
  • Francesco Codicè, University of Bologna, Italy
  • Wim Vranken, Vrije Universiteit Brussel, Belgium
  • Yves Moreau, KU Leuven, ESAT, Stadius, Belgium

Short Abstract: Eukaryotic cells contain different membrane-delimited compartments, which are crucial for the biochemical reactions necessary to sustain cell life. Recent studies showed that cells can also trigger the formation of membraneless organelles composed of phase-separated proteins to respond to various stimuli. These condensates provide new ways to control the reactions, and phase-separation proteins (PSPs) are thus revolutionising how cellular organization is conceived. The small number of experimentally validated proteins, and the difficulty in discovering them, remain bottlenecks in PSP research. Here we present PSPer, the first in-silico screening tool for prion-like RNA-binding PSPs. We show that it can prioritize PSPs among proteins containing similar RNA-binding domains, intrinsically disordered regions (IDRs) and prions. PSPer is thus suitable to screen proteomes, identifying the most likely PSPs for further experimental investigation. Moreover, its predictions are fully interpretable in the sense that it assigns specific functional regions to the predicted proteins, providing valuable information for experimental investigation of targeted mutations on these regions. Finally, we show that it can estimate the ability of artificially designed proteins to form condensates (r=-0.87), thus providing an in-silico screening tool for protein design experiments.

C-10: Unravelling the Dynamicity of POU2F1 based on Evolutionary Conservation and Structure Network Analysis
COSI: Function COSI
  • Sagnik Sen, Jadavpur University, India
  • Ashmita Dey, Jadavpur University, India
  • Ujjwal Maulik, Jadavpur University, India

Short Abstract: POU2F1, an octamer domain protein, is characterized by the presence of a conserved bipartite DNA-binding domain. The dynamic nature of POU2F1 is expressed in versatile biological events, viz. neurogenesis, immunity, protein synthesis, etc. Evidence shows that, as a transcription factor, POU2F1 contributes significantly to the epithelial-to-mesenchymal transition, which leads to metastasis during carcinogenic progression. However, POU2F1 has not been identified as a potential drug target. To unravel the structural dynamicity of POU2F1 based on evolutionary traits, a two-fold experiment is performed. First, Shannon Entropy (SE) is computed from sequence conservation to demonstrate the evolutionary structural trait of the family. Thereafter, co-varying patches are identified in sequence space using Direct Coupling Analysis (DCA). To identify the sequence variation, sequence-based experiments are performed on the evolutionary sequence space and the human sequence space individually. The results reveal that, while human sequence conservation is associated with disorder, evolutionary space conservation signifies order. At the next level, the detected co-varying patches are mapped onto the structure network of POU2F1 and reveal structural facets that can explain the corresponding functional dynamicity. Finally, the structural malleability is mapped onto its functional duality by pathway analysis, which clarifies the moonlighting property of POU2F1.
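
The Shannon Entropy step mentioned in the abstract above can be sketched as a per-column entropy over a multiple sequence alignment. This is an illustrative assumption about the setup: the function names are hypothetical, the toy alignment is invented, and gap characters (if present) would simply be counted as one more symbol.

```python
import math
from collections import Counter

def column_entropy(column):
    """Shannon entropy (in bits) of the residues observed in one alignment column."""
    counts = Counter(column)
    n = len(column)
    return sum(-(c / n) * math.log2(c / n) for c in counts.values())

def alignment_entropies(alignment):
    """Entropy of every column of an alignment given as equal-length strings."""
    return [column_entropy(col) for col in zip(*alignment)]

# Toy alignment of four sequences, three columns each.
toy_alignment = ["ACD", "ACE", "AGF", "AGH"]
entropies = alignment_entropies(toy_alignment)
```

A fully conserved column has entropy 0, while a column in which every sequence carries a different residue reaches the maximum for that number of sequences (2 bits for four sequences).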

C-11: CATH-Gardener: a luigi-based python pipeline to generate Functional Families (FunFams) for “mega” superfamilies in CATH
COSI: Function COSI
  • Nicola Bordin, University College London, United Kingdom
  • Sayoni Das, University College London, United Kingdom
  • Christine Orengo, University College London, United Kingdom
  • Ian Sillitoe, UCL, United Kingdom

Short Abstract: Background: The CATH/Gene3D database includes up-to-date information on evolutionary relationships and classification of protein domains into structural Superfamilies. Functional Families (FunFams) subdivide these superfamilies further in order to provide clusters and alignments of protein domains that perform closely related functions. FunFams have performed well in function prediction studies (CAFA) and have helped to provide insights both on functional sites and effects of variants on structure and disease. The upcoming release of the CATH-Gene3D database (v4.3) has seen the emergence of “mega” superfamilies, containing millions of protein domains. This has provided major computational challenges for functional classifications. Method: To address these issues, we created Gardener, a novel Luigi-based Python pipeline package for clustering massive protein datasets into CATH FunFams. This pipeline starts with an initial partitioning of domains by Multi-Domain Architecture (MDA) then iteratively applies tree-building/tree-cutting algorithms (GeMMA/FunFHMMER). Gardener is built with HPC integration and batch processing visualization. Results: The Gardener method has resulted in a general increase in FunFam coherence, a reduction in over-splitting of these functional families and a more flexible and robust infrastructure to generate FunFams data. This expanded and comprehensive set of FunFams will enable better coverage of predicted functional annotations in CATH-Gene3D, including functional site predictions.

C-12: Updates on the Critical Assessment of Function Annotation Challenge Series
COSI: Function COSI
  • Predrag Radivojac, Northeastern University, United States
  • Naihui Zhou, Iowa State University, United States
  • Yuxiang Jiang, Indiana University Bloomington, United States
  • Timothy Bergquist, University of Washington, United States
  • Kimberley Lewis, Geisel School of Medicine at Dartmouth, United States
  • Alex W. Crocker, Geisel School of Medicine at Dartmouth, United States
  • George Georghiou, EMBL European Bioinformatics Institute, United Kingdom
  • Balint Z. Kacsoh, Geisel School of Medicine at Dartmouth, United States
  • Giovanni Bosco, Dartmouth College, United States
  • Deborah Hogan, Geisel School of Medicine at Dartmouth, United States
  • Claire O'Donovan, EBI, United Kingdom
  • Sandra Orchard, EMBL-EBI, United Kingdom
  • Sean Mooney, University of Washington, United States
  • Casey Greene, University of Pennsylvania, United States
  • Iddo Friedberg, Iowa State University, United States
  • Maria Jesus Martin, EMBL-EBI, United Kingdom

Short Abstract: The third Critical Assessment of Function Annotation challenge (CAFA3) released its prediction targets in September 2016, and preliminary results were announced in July 2017. CAFA3 featured a term-centric track where predictors were asked to associate a large set of genes (the complete genomes of Candida albicans and Pseudomonas aeruginosa) with a limited set of functions. By collaborating with experimental biologists, we were able to use unpublished whole-genome screen results to evaluate these predictions. To specifically address this question, we hosted an additional challenge, CAFA-Pi, that is dedicated to evaluating term-centric predictions. We will discuss the latest CAFA3 and CAFA-Pi analyses, including comparisons with CAFA1 and CAFA2 results, methods, diversity analyses, and CAFA-Pi results analysis. We also used CAFA to discover new genes associated with long-term memory in Drosophila melanogaster. Having conducted three challenges to date, we provide a historical review of the progress made in CAFA, and discuss the difficulties and opportunities in the field of function prediction and the role of an ongoing community challenge in the progress made in the field.

C-13: Gene Prioritization Using Scalable Matrix Completion with Multi-Omics Side Information and Deep Learning
COSI: Function COSI
  • Pooya Zakeri, Aix-Marseille Univ, Inserm, CNRS, MMG, Marseille, France
  • Jaak Simm, Katholieke Universiteit Leuven, Belgium
  • Adam Arany, Katholieke Universiteit Leuven, Belgium
  • Anaïs Baudot, Aix Marseille Univ, INSERM, MMG, Marseille Medical Genetics, France
  • Yves Moreau, KU Leuven, ESAT, Stadius, Belgium

Short Abstract: To tackle the gene prioritization task more effectively, we recently developed GeneHound [1], grounded in the concept of matrix completion. It proposed a generalization of Bayesian probabilistic matrix factorization that allows it to work with genomic and phenotypic information simultaneously. Here, we are extending GeneHound to manage the factorization of a wide range of biological data models such as tensors and multiple relations between biological concepts. This extended version of GeneHound can be used to develop a network-guided framework to integrate several omics data. We discuss the advantages and limitations of this approach versus the state-of-the-art network-guided framework. The latter, the random-walk-with-restart on multiplex-heterogeneous networks [2], also explores several layers of interactions between genes and diseases. We also further develop GeneHound by allowing the incorporation of heterogeneous -omics data as side information into the matrix factorization process, using deep learning techniques. Such incorporation of -omics data can be seen as a non-linear integration of side information into the factorization process of an incompletely filled gene-disease matrix. This appealing extension enables us to handle gene prioritization more effectively for diseases with very few known associated genes or for genes that have not yet been extensively characterized. [1] doi.org/10.1093/bioinformatics/bty289 [2] doi.org/10.1093/bioinformatics/bty637

C-14: The evolutionary signal in metagenome phyletic profiles predicts many gene functions
COSI: Function COSI
  • Vedrana Vidulin, Jožef Stefan Institute, Slovenia
  • Tomislav Šmuc, Ruđer Bošković Institute, Croatia
  • Sašo Džeroski, Jožef Stefan Institute, Slovenia
  • Fran Supek, Institute for Research in Biomedicine (IRB Barcelona), Spain

Short Abstract: Many methods have been proposed that infer gene function from comparative analyses of whole-genome sequences, phyletic profiles (PP) being among the top-performing. A PP encodes the pattern of presence/absence of a gene family's members across genomes. Motivated by the accuracy of PP and the abundance of metagenomic data, we propose a method to construct metagenome phyletic profiles (MPP), which reflect relative abundances of gene families across metagenomes. We derive MPP from 5049 metagenomes covering seven types of environments and compare against PP constructed from 2071 bacterial/archaeal genomes. Both data sets cover 3536 COGs annotated with 3358 Gene Ontology (GO) terms. Classification models are constructed using the CLUS-HMC algorithm, which accounts for the GO hierarchy. We find that MPP can predict hundreds of GO terms that could not be predicted by standard PP. In particular, MPP but not PP can predict 29% of the 664 GO terms that have at least one inference at a stringent threshold of FDR<10% (by either PP or MPP). We also find that the accuracy of MPP does not necessarily increase with more data, but it does benefit from sampling from diverse environments. Simulation studies indicate that full accuracy is reached with only 500 metagenomes (out of the tested 5049) if maximum diversity is enforced.
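
The difference between a standard phyletic profile and a metagenome phyletic profile, as described in the abstract above, can be sketched as follows. The function names, the toy data, and the normalization (dividing by the total count in each metagenome) are assumptions for illustration; the abstract does not specify how relative abundance is computed.

```python
def phyletic_profile(gene_family, genomes):
    """PP: binary presence/absence of a gene family across genomes.

    Each genome is given as the set of gene families it contains.
    """
    return [1 if gene_family in families else 0 for families in genomes]

def metagenome_profile(gene_family, metagenomes):
    """MPP: relative abundance of a gene family across metagenomes.

    Each metagenome is given as a dict of gene family -> count.
    Assumption: abundance is normalized by the metagenome's total count.
    """
    profile = []
    for counts in metagenomes:
        total = sum(counts.values())
        profile.append(counts.get(gene_family, 0) / total if total else 0.0)
    return profile

# Toy inputs with made-up counts.
genomes = [{"COG0001", "COG0002"}, {"COG0002"}]
metagenomes = [{"COG0001": 30, "COG0002": 70}, {"COG0002": 10}]
pp = phyletic_profile("COG0001", genomes)
mpp = metagenome_profile("COG0001", metagenomes)
```

Either profile vector can then serve as the feature vector for a gene family in a downstream classifier.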

C-15: Not all protein modification sites are created equal: Insights from a human-specific substitution matrix
COSI: Function COSI
  • Tair Shauli, The Hebrew University of Jerusalem, Israel
  • Nadav Brandes, The Hebrew University of Jerusalem, Israel
  • Michal Linial, The Hebrew University of Jerusalem, Israel

Short Abstract: The proteomic implication of human genetic variation is fundamental to our understanding of protein function. Such variation is commonly encapsulated in amino-acid (AA) substitution matrices (AASUMs), which are pivotal to the study of proteomics and molecular evolution. However, contemporary AASUMs, such as BLOSUM, are unlikely to reflect human-specific variation. In this study, we present a set of human-centric substitution matrices, at codon and AA resolutions. Derived from an exhaustive list of >8M variants collected from >60K healthy human individuals, our AASUMs substantially deviate from the BLOSUM matrices. Additionally, in contrast to inter-taxa derived matrices, our novel matrices convey directional information. For example, we find that substitutions into Ser and Leu prevail over most substitutions of these AAs into others. All 400 possible AA substitutions are directly calculated using a probabilistic approach based on the frequencies of human genetic variants. We investigate how the presence of major PTMs leads to different patterns of AA substitutions. We show that, with respect to the general human model, substitutions in genomic sites that are associated with phospho-Thr/Ser are more dynamic and interchangeable, while sites of phospho-Tyr are far more robust. Our matrix provides a strong baseline for studying human protein function in health and disease.
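
To show what "directional" means for a substitution matrix, the sketch below row-normalizes counts of observed (reference, alternative) amino-acid substitutions into conditional probabilities. This is a simplified assumption, not the authors' probabilistic approach (which is based on variant frequencies and is not detailed in the abstract); the function name and the toy counts are invented.

```python
from collections import defaultdict

def substitution_probabilities(variant_counts):
    """Row-normalize counts of (ref, alt) substitutions into P(alt | ref)."""
    row_totals = defaultdict(int)
    for (ref, _alt), n in variant_counts.items():
        row_totals[ref] += n
    return {(ref, alt): n / row_totals[ref] for (ref, alt), n in variant_counts.items()}

# Made-up counts for a handful of substitutions.
toy_counts = {("P", "S"): 30, ("P", "L"): 10, ("S", "P"): 5, ("S", "L"): 15}
probs = substitution_probabilities(toy_counts)
```

Because each row is normalized separately, the entry for Pro to Ser need not equal the entry for Ser to Pro, unlike in a symmetric matrix such as BLOSUM.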

C-16: Identification of putative transcription factors explain Leishmania spp. disease types
COSI: Function COSI
  • J. Eduardo Martinez, Universidad Mayor, Chile
  • Cristian Molina, Lircaytech, Chile
  • Bruno Correia, UFPB, Brazil
  • Vinicius Maracaja-Coutinho, Advanced Center for Chronic Diseases - ACCDiS, Facultad de Ciencias Químicas y Farmacéuticas Universidad de Chile, Chile
  • Alberto J.M. Martin, Universidad Mayor, Chile

Short Abstract: The approximately one million new cases of Leishmaniasis reported annually have a devastating impact on global health. Our ability to fight the many different strains of Leishmania spp. parasites, and the different disease forms they cause, is limited by our understanding of their underlying genetics. Thus, here we investigate the structural and functional differences of genomes across 26 strains of Leishmania spp. parasites by performing a pangenomic analysis. In particular, we focus on identifying the function of regulatory elements and their relationship with different disease forms. Our results show an open pangenome composed of 16046 genes. Regulatory proteins were identified in all strains, but transcription factors differed among strains depending on the associated disease form. These results could improve our understanding of how differences in the pathogenesis of particular Leishmania strains are linked to differences in their genetics.

C-17: Commonalities of differential expression across thousands of conditions
COSI: Function COSI
  • Megan Crow, Cold Spring Harbor Laboratory, United States
  • Nathaniel Lim, The University of British Columbia, Canada
  • Paul Pavlidis, The University of British Columbia, Canada

Short Abstract: The task of prioritizing genes for further investigation from long “hit lists” obtained from differential expression (DE) studies remains an unresolved problem. We posit that understanding the properties of gene differential expression across many conditions can assist in this task. Recently, Crow et al. (2019) showed that human gene differential expression is surprisingly predictable, due to some genes being frequently differentially expressed in diverse conditions. With this, we can generate a “prior” that ranks genes by their propensity to be DE. Here we extend this work in several ways. First, we show that the DE prior is replicated in a vastly expanded corpus from Gemma (gemma.msl.ubc.ca). We also show that the DE prior from mouse data is highly similar to the human one. Separately we show that the variance in baseline gene expression levels across experiments is also correlated with the likelihood of being differentially expressed due to specific manipulations. These findings confirm that the DE prior of Crow et al. is a highly general phenomenon. Finally, we investigated additional properties of genes which are rarely differentially expressed. These results collectively move us towards increased interpretability of differential expression results in genomics studies.
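
One simple way to picture a "prior" that ranks genes by their propensity to be differentially expressed is to count, for each gene, the fraction of conditions in which it appears on the hit list. The sketch below does exactly that; it is an assumption made for illustration and not necessarily how the DE prior of Crow et al. is computed, and the function name and toy hit lists are hypothetical.

```python
def de_prior(de_hits_per_condition):
    """Rank genes by the fraction of conditions in which they were called DE.

    Input: one collection of gene symbols per condition (the DE "hit list").
    Output: (gene, fraction) pairs, most frequently DE first; ties by name.
    """
    n = len(de_hits_per_condition)
    counts = {}
    for hits in de_hits_per_condition:
        for gene in set(hits):  # count each gene at most once per condition
            counts[gene] = counts.get(gene, 0) + 1
    ranked = ((gene, c / n) for gene, c in counts.items())
    return sorted(ranked, key=lambda item: (-item[1], item[0]))

# Toy hit lists from four hypothetical conditions.
conditions = [{"FOS", "JUN"}, {"FOS"}, {"FOS", "ACTB"}, {"JUN"}]
prior = de_prior(conditions)
```

A gene near the top of such a ranking is DE in many unrelated conditions, so its appearance on a new hit list carries less condition-specific information.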

C-18: Deep mutational scanning and statistical inference to understand and redesign serine-proteases specificity
COSI: Function COSI
  • Valentin Senlis, CIRB, Collège de France, France
  • Dany Chauvin, Biozentrum, University of Basel, Switzerland
  • Anton Zadorin, ESPCI Paris, France
  • Olivier Rivoire, CIRB, Collège de France, France
  • Clément Nizak, ESPCI Paris, France

Short Abstract: How are the multiple functions of an enzyme encoded within its sequence? Theoretical studies based on physical modeling, structural information and/or protein sequences attempt to predict mutational effects and properties of the protein sequence-to-function landscape. Yet, due to the complexity of the molecular processes by which enzymes specifically recognize substrates and catalyze reactions, accurate prediction of enzyme core functional features from theoretical approaches alone is difficult. Hence, we propose to experimentally decipher the link between enzyme sequences and differences in specificity by extensively charting the sequence-to-function landscapes of serine-protease enzymes, for which the “protein sector” hypothesis was formulated. To do so, we developed and are using a quantitative deep mutational scanning approach that couples high-throughput measurements of enzyme mutant phenotypes using droplet microfluidics with next generation sequencing of mutant sequences. Preliminary results show that our experimental setup is able to gather valuable data about the enzymatic activity of trypsin variants. Subsequently, statistical inference and other machine learning methods will be used to process these experimental data and generate in silico novel sequences with tailor-made activities and substrate specificities.

C-19: Searching for new N-terminal acetyltransferases
COSI: Function COSI
  • Bojan Krtenic, University of Bergen, Norway

Short Abstract: N-terminal acetylation, one of the most common protein modifications, is catalyzed by a family of enzymes called N-terminal acetyltransferases (NATs). NATs belong to a rich and highly diverse, but structurally highly conserved, superfamily of acetyltransferases. The exact number of NAT family members is still not known. In fact, we have very little knowledge about the acetyltransferase superfamily as a whole. Because there is no well-defined classification of acetyltransferases, including NATs, and since the entire acetyltransferase classification we currently have is based on the type of substrate they acetylate, rather than on their true evolutionary relationships, we have very little predictive power in our search for new NATs. To solve the problem of poor classification, we created sequence similarity networks (SSNs) and searched for isofunctional clusters representing distinct enzyme groups in the entire acetyltransferase sequence space. Using SSNs in combination with sequence motif analyses, phylogeny and structure analyses, we managed to predict several new NAT members, of which two have been tested and their N-terminal acetylation activity confirmed experimentally. Our work changes the way we see the NAT family, and gives us significant predictive power for future studies of not only the NATs but of the entire acetyltransferase superfamily.

C-20: A Comprehensive FXR Signaling Atlas Derived from Pooled ChIP-seq Data
COSI: Function COSI
  • Emilian Jungwirth, Division of Gastroenterology and Hepatology, Medical University Graz, Austria
  • Katrin Panzitt, Division of Gastroenterology and Hepatology, Medical University Graz, Austria
  • Hanns-Ulrich Marschall, Department of Molecular and Clinical Medicine, University of Gothenburg and Sahlgrenska University Hospital, Sweden
  • Martin Wagner, Division of Gastroenterology and Hepatology, Medical University Graz, Austria
  • Gerhard Thallinger, Institute of Computational Biotechnology, Graz University of Technology, Austria

Short Abstract: Introduction: Chromatin immunoprecipitation sequencing (ChIP-seq) is a method to identify genome-wide transcription factor (TF) binding sites. The TF FXR is a nuclear receptor that controls gene regulation of different metabolic pathways in the liver. Our aim is to standardize and combine all publicly available FXR-ChIP-seq data sets to create a global FXR signaling atlas. Methods and Results: Public FXR data sets were available for mouse, rat and primary human hepatocytes in different treatment conditions. In addition, we also had access to our own FXR-ChIP-seq data set from human liver tissue. Standardized (re-)analysis shows that the data sets are surprisingly heterogeneous concerning baseline quality criteria and often lack sufficient read coverage. We generated a combined mouse FXR-ChIP-seq data set by pooling mapped reads of the available four mouse data sets to gain a higher sequencing depth. The combined data set allowed us to recover more peaks, potentially regulated genes and functional pathways. However, several peaks present in individual mouse samples were not called anymore in the pooled data set. Conclusion: Published single FXR ChIP-seq data sets do not cover the full spectrum of FXR signaling. Combining different data sets and creating a “FXR super-signaling atlas” enhances understanding of FXR signaling capacities.

C-21: How does my drug target function in health and disease?
COSI: Function COSI
  • Vivek Poddar, EMBL-EBI, United Kingdom
  • Robert D. Finn, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), United Kingdom
  • Gustavo A Salazar, InterPro, EMBL-EBI, United Kingdom
  • Maria Jesus Martin, EMBL-EBI, United Kingdom
  • Ian Dunham, EMBL-EBI | Open Targets, United Kingdom
  • David Hulcoop, GSK | Open Targets, United Kingdom
  • Juan Antonio Vizcaino, EMBL-EBI, United Kingdom
  • Sameer Velankar, EMBL-EBI, United Kingdom
  • Pablo Porras, EMBL-EBI, United Kingdom
  • Andrew Leach, EMBL-EBI, United Kingdom
  • Henning Hermjakob, EMBL-EBI, United Kingdom
  • Michaela Spitzer, EMBL-EBI | Open Targets, United Kingdom
  • Anjali Shrivastava, EMBL-EBI, United Kingdom
  • Sandra Orchard, EMBL-EBI, United Kingdom
  • Juan Felipe Mosquera, EMBL-EBI, United Kingdom
  • Elaine McAuley, EMBL-EBI | Open Targets, United Kingdom
  • Andrew Jarnuczak, EMBL-EBI, United Kingdom
  • Andrew Hercules, EMBL-EBI, United Kingdom
  • Anna Gaulton, EMBL-EBI, United Kingdom
  • Leyla Jael García Castro, EMBL-EBI, United Kingdom
  • Adam Faulconbridge, EMBL-EBI | Open Targets, United Kingdom
  • Antonio Fabregat, EMBL-EBI, United Kingdom
  • Noemi Del Toro Ayllon, EMBL-EBI, United Kingdom
  • Miguel Carmona, EMBL-EBI | Open Targets, United Kingdom
  • John Berrisford, EMBL-EBI, United Kingdom
  • Denise Carvalho-Silva, EMBL-EBI | Open Targets, United Kingdom

Short Abstract: The mechanistic details relating to protein function are scattered across multiple resources, making a unified mapping across these data difficult and time-consuming. We have set out to design and build an integrative layer to map the diverse protein function data available in EMBL-EBI resources onto a reference protein entity/sequence. By capturing functional mechanisms and annotating genetic variants and post-translational modifications in protein sequences, we aim to facilitate the understanding of the role of potential drug targets in disease and the role of pathogenic mutations in this process. We have identified opportunities for additional mapping among protein resources at EMBL-EBI and the need to enhance data coverage in targeted disease areas. We are currently collecting use case scenarios and creating mockups for this information to be available on the Open Targets Platform. Data will be provided in an intuitive and easy-to-visualise manner to help drug discovery scientists in the systematic prioritisation of drug targets. This approach will be designed to be extensible and scalable to enable the inclusion of additional protein function resources to encompass new data from Open Targets and elsewhere.

C-22: A Graph-based Method for Functional Annotation of InterPro Signatures
COSI: Function COSI
  • Bishnu Sarker, University of Lorraine and INRIA, France
  • Marie-Dominique Devignes, CNRS, France
  • Sabeur Aridhi, University of Lorraine and LORIA, France

Short Abstract: Proteins are important components of all biological systems. Understanding protein function is one of the keys to understanding life at the molecular level, and is central to understanding disease processes and guiding drug discovery efforts. Proteins perform many essential biological functions that are often carried out by distinct “domains”. Domains are natural building blocks of proteins and highly conserved regions within a multiple sequence alignment. Functional annotation of domains is important to understand the function of proteins as a whole. Manual domain annotation is expensive as well as time-consuming. We present here a graph-based approach to automatically annotate protein domains with the corresponding Enzyme Commission (EC) numbers. In this work, we considered the InterPro database as it integrates signatures from 14 member databases: CATH-Gene3D, the Conserved Domains Database (CDD), HAMAP, PANTHER, Pfam, PIRSF, PRINTS, ProDom, PROSITE Patterns, PROSITE Profiles, SMART, the Structure–Function Linkage Database (SFLD), SUPERFAMILY and TIGRFAMs. The preliminary results show that the proposed method is a promising way of annotating domains with EC numbers.

C-23: ProfileView: a pipeline based on multiple probabilistic models resolving the functional organization of the cryptochrome/photolyase protein family
COSI: Function COSI
  • Riccardo Vicedomini, Sorbonne Université, UPMC-Univ. P6, CNRS, IBPS, Laboratory of Computational and Quantitative Biology - UMR 7238, France
  • Alessandra Carbone, Sorbonne Université, France

Short Abstract: Sequence functional classification became a bottleneck for understanding the large amount of protein sequences accumulating in our databases due to the recent advances in high-throughput sequencing. The diversity of homologous sequences often displays various functional activities we need to unravel for a fundamental understanding of living organisms and biotechnological applications. Computational tools classifying sequences by function would possibly help sequence screening towards the design of accurate functional testing experiments and the discovery of new functions. ProfileView is a novel computational pipeline designed to functionally classify sets of homologous protein sequences. It considers multiple probabilistic profiles for the same protein domain and exploits them to evaluate each sequence in the ensemble. The similarity of profile hits, capturing similar functional patterns, will determine which sequences cluster together on the ProfileView multidimensional space of sequences defined by the profile models. As a proof of concept, we apply ProfileView to the important class of photoactive proteins known to present a large variety of functions. Known functional characterizations confirm the soundness of the functional organization obtained with our approach, while laboratory experiments and modelization confirmed the interest of a new uncharacterized functional group we identified as a photoreceptor, paving the way to new possible analyses.

C-24: Antimicrobial Peptides’ Big Secrets
COSI: Function COSI
  • Kevin Luo, National Taiwan Ocean University, Taiwan
  • Ling-Yi Shih, National Taiwan Ocean University, Taiwan
  • Kuan Y. Chang, National Taiwan Ocean University, Taiwan

Short Abstract: It has been unclear for which antimicrobial activities of antimicrobial peptides (AMPs) a given physicochemical property matters most. We thus examined the relationships between antimicrobial activities and two major physicochemical properties of AMPs, amphipathicity and net charge, using A Database of Anti-Microbial peptides (ADAM). This large AMP collection reveals that (I) AMPs with anti-gram-negative bacterial activities exhibited the strongest propensity toward high amphipathicity and net positive charge, which could only be demonstrated by large data, (II) AMPs with anti-gram-positive bacterial activities could be one of a kind among those AMPs whose activities were significantly associated with amphipathicity and net charge, and (III) with respect to multipotent activities, the higher the amphipathicity, the greater the proportion of AMPs possessing antibacterial and antifungal activities. These novel findings could be useful for identifying potent and therapeutic AMPs computationally.

C-25: Exploring the potential of reverse-docking for protein functional prediction
COSI: Function COSI
  • Eugenia Polverini, Department of Mathematical, Physical and Computer Sciences, University of Parma, Italy
  • Riccardo Percudani, University of Parma, Italy
  • Marco Malatesta, University of Parma, Italy

Short Abstract: Reverse-docking is a powerful tool that allows a set of multiple receptors to be scanned with one ligand. Reverse-docking is normally used to discover new protein targets using known compounds, but could in principle be useful to identify new enzymes or receptors when the substrates or ligands are known. Key aspects of this technique are the conditions under which the receptor structures were determined and, as in standard docking, knowledge of the position of active sites to limit the search space. As a proof-of-principle we applied this technique to the identification of the human gene encoding 3-hydroxy-N6-trimethyllysine aldolase (HTMLA) in carnitine biosynthesis through an automated procedure of reverse-docking. This procedure involves screening of the HTML compound (ligand) on the whole set of 56 human PLP-dependent enzymes (receptors), since the reaction requires pyridoxal phosphate as cofactor. We optimize the method by simulating the identification of ligands of PLP-dependent enzymes with known specificity. Results were evaluated through the comparison of binding energies and visual analysis of receptor-ligand complexes. This analysis provides a ranked list of candidates that will be tested in vitro for HTMLA activity. This procedure could have a general application to the molecular identification of enzymes involved in known metabolic reactions.

C-26: Enhanced enzyme annotation in UniProtKB using Rhea
COSI: Function COSI
  • Anne Morgat, SIB Swiss Institute of Bioinformatics, Switzerland
  • Elisabeth Coudert, Swiss-Prot Group, SIB Swiss Institute of Bioinformatics, Switzerland
  • Kristian B. Axelsen, Swiss-Prot Group, SIB Swiss Institute of Bioinformatics, Switzerland
  • Teresa B. Neto, Swiss-Prot Group, SIB Swiss Institute of Bioinformatics, Switzerland
  • Elisabeth Gasteiger, Swiss-Prot Group, SIB Swiss Institute of Bioinformatics, Switzerland
  • Edouard de Castro, Swiss-Prot Group, SIB Swiss Institute of Bioinformatics, Switzerland
  • Monica Pozzato, Swiss-Prot Group, SIB Swiss Institute of Bioinformatics, Switzerland
  • Sylvain Poux, Swiss-Prot Group, SIB Swiss Institute of Bioinformatics, Switzerland
  • Nicole Redaschi, Swiss-Prot Group, SIB Swiss Institute of Bioinformatics, Switzerland
  • Alan Bridge, Swiss-Prot Group, SIB Swiss Institute of Bioinformatics, Switzerland
  • The UniProt Consortium, Swiss-Prot Group, SIB Swiss Institute of Bioinformatics, Switzerland
  • Jerven Bolleman, Swiss Institute of Bioinformatics, Switzerland
  • Thierry Lombardot, SIB Swiss Institute of Bioinformatics, Switzerland
  • Sebastien Gehant, SIB Swiss Institute of Bioinformatics, Switzerland
  • Parit Bansal, Swiss Institute of Bioinformatics, Switzerland
  • Delphine Baratin, SIB Swiss Institute of Bioinformatics, Switzerland

Short Abstract: The UniProt Knowledgebase (http://www.uniprot.org) is a reference resource of protein sequences and functional annotation. More than 45% of expert curated UniProtKB/Swiss-Prot entries are enzymes, which were traditionally annotated using vocabularies such as the hierarchical enzyme classification of the Nomenclature Committee of the International Union of Biochemistry and Molecular Biology (NC-IUBMB) and the Gene Ontology. Here we describe our work on the enhancement of enzyme annotation in UniProtKB using Rhea (https://www.rhea-db.org/). Rhea is a comprehensive expert-curated knowledgebase of biochemical reactions that uses the ChEBI ontology to describe reaction participants, their chemical structures, and chemical transformations – a computationally tractable description of reaction chemistry. The adoption of Rhea as a reference vocabulary for enzyme annotation in UniProtKB improves the consistency and precision of enzyme annotation. It allows UniProt users to search, browse, and mine enzyme data in new ways, combining approaches from the fields of cheminformatics and bioinformatics. We will describe some use cases in this presentation.

C-27: Assessing Gene Ontology Automated Function Prediction Using Curated Ancestral Annotations
COSI: Function COSI
  • Alex Warwick Vesztrocy, University College London, United Kingdom

Short Abstract: With the ever-increasing number and diversity of sequenced species, the challenge to characterise genes with functional information is ever more important. In most species, this characterisation almost entirely relies on automated electronic methods. As such, it is critical to benchmark the various methods. The CAFA series of community experiments provide the most comprehensive benchmark, with a time-delayed analysis leveraging newly curated experimentally supported annotations. However, the definition of a false positive in CAFA has not fully accounted for the Open World Assumption (OWA), leading to systematic underestimation of precision. The main reason for this limitation is the relative paucity of negative experimental annotations. Here, we introduce a new, OWA-compliant benchmark based on a balanced test set of positive and negative annotations. The negative annotations are derived from expert-curated annotations of protein families on phylogenetic trees. This approach results in a large increase in the average information content (IC) of negative annotations. We tested the benchmark on the naïve and BLAST baseline methods as well as two orthology-based methods. This new benchmark could complement existing ones in future CAFA experiments.

C-28: sparql.uniprot.org: moving past search for research
COSI: Function COSI
  • Elisabeth Coudert, Swiss-Prot Group, SIB Swiss Institute of Bioinformatics, Switzerland
  • Nicole Redaschi, Swiss-Prot Group, SIB Swiss Institute of Bioinformatics, Switzerland
  • Alan Bridge, Swiss-Prot Group, SIB Swiss Institute of Bioinformatics, Switzerland
  • Jerven Bolleman, Swiss Institute of Bioinformatics, Switzerland
  • Thierry Lombardot, SIB Swiss Institute of Bioinformatics, Switzerland
  • Sebastien Gehant, SIB Swiss Institute of Bioinformatics, Switzerland
  • Parit Bansal, Swiss Institute of Bioinformatics, Switzerland
  • Edouard de Castro, Swiss Institute of Bioinformatics, Switzerland
  • Delphine Baratin, SIB Swiss Institute of Bioinformatics, Switzerland
  • Beatrice Cuche, SIB Swiss Institute of Bioinformatics, Switzerland
  • Andrea Auchincloss, SIB Swiss Institute of Bioinformatics, Switzerland
  • Chantal Hulo, SIB Swiss Institute of Bioinformatics, Switzerland
  • Patrick Masson, SIB Swiss Institute of Bioinformatics, Switzerland
  • Ivo Pedruzzi, SIB Swiss Institute of Bioinformatics, Switzerland
  • Catherine Rivoire, SIB Swiss Institute of Bioinformatics, Switzerland

Short Abstract: The UniProt Knowledgebase has been available on the web for decades and is one of the most widely used life science resources. We explain how a semantic web approach using RDF and SPARQL has enabled new uses of the UniProt data at a reduced cost to the community. We will also describe how we are combining our text-search-driven REST API with our SPARQL endpoint. Specifically, we will describe how to do chemistry- and cheminformatics-based analytics beyond basic search. We also describe how UniProt annotation can be leveraged by external genome annotation pipelines by encoding HAMAP annotation rules in SPARQL for execution using off-the-shelf open source solutions like Virtuoso and Apache Jena or commercial equivalents.

C-29: eggNOG-mapper v2: Fast functional annotation of genomes and metagenomes based on precomputed orthology predictions from 5090 reference organisms
COSI: Function COSI
  • Carlos P. Cantalapiedra, Centro de Biotecnología y Genómica de Plantas (CBGP) UPM-INIA, Madrid, Spain, Spain
  • Jaime Huerta-Cepas, Centro de Biotecnología y Genómica de Plantas (CBGP) UPM-INIA, Spain

Short Abstract: Orthology assignment is ideally suited for functional inference. However, because predicting orthology is computationally intensive at large scale, less precise homology-based functional transfer is still the default for (meta-)genome annotation. We, therefore, developed eggNOG-mapper (http://eggnog-mapper.embl.de), a tool for functional annotation of large sets of sequences based on fast orthology assignments using precomputed clusters and phylogenies from the eggNOG database. Using Gene Ontology benchmarking, we showed that orthology filters applied to homology-based results i) reduce the rate of false positive assignments, ii) increase the ratio of experimentally validated terms recovered over all the terms assigned per protein, and iii) predict more GO terms per protein. To validate eggNOG-mapper as part of a metagenomics analysis pipeline, we used a set of simulated metagenomes to assess the accuracy of functional assignments for predicted unigenes, obtaining better accuracy and computing time performance than homology-based methods. Here, we present recent improvements in eggNOG-mapper, which include i) a major update of the underlying database of fine-grained orthology records now based on 4M phylogenetic trees, covering 5090 reference genomes and 2502 viruses ii) the inclusion of more sources of functional annotations, and iii) a much-improved online service based on cloud computing.

C-30: Protein Function Annotation Using Deep Learning based Word Embedding
COSI: Function COSI
  • Bishnu Sarker, University of Lorraine and INRIA, France
  • Sabeur Aridhi, University of Lorraine and LORIA, France
  • David W. Ritchie, INRIA, France

Short Abstract: Thanks to recent developments in genomic sequencing technologies, the number of protein sequences in public databases is growing enormously. In order to exploit this huge quantity of data more fully, protein sequences need to be annotated with functional properties such as Enzyme Commission (EC) numbers and Gene Ontology terms. The UniProt Knowledgebase (UniProtKB) is currently the largest and most comprehensive resource for protein sequence and annotation data. The January 2019 release of UniProtKB contains around 140 million protein sequences. However, only about half a million of these (UniProtKB/Swiss-Prot) have been reviewed and functionally annotated by expert curators using data extracted from the literature and computational analyses. To reduce the gap between the annotated and unannotated protein sequences, it is essential to develop accurate automatic protein function annotation techniques. Here, we propose a deep learning based automatic protein function annotation technique. The proposed method depends only on the amino acid sequences. We have used FastText, Facebook's natural language modeling tool, to train a supervised sequence classification model. The preliminary experiment shows a promising result in protein sequence annotation using Enzyme Commission (EC) numbers.

C-31: Predicting gene function through analysis of concordant gene transitions in eukaryote evolution.
COSI: Function COSI
  • Riccardo Percudani, University of Parma, Italy
  • Marco Malatesta, University of Parma, Italy

Short Abstract: Analysis of gene coevolution through phylogenetic profiles can potentially predict gene function by establishing functional associations between characterized and uncharacterized genes. The application of this method to eukaryotes has been limited so far by insufficient genome availability and the use of phylogeny-unaware metrics. Here we describe a metric for the identification of gene coevolutionary relationships that overcomes some limitations of the currently implemented methods of phylogenetic profiling. Our method is based on the enumeration of concordant transitions in pairs of binary vectors describing the presence/absence of orthologous genes in complete genomes. Gene vectors are ordered according to species phylogeny, so that state transitions (1→0 or 0→1) correspond to simultaneous evolutionary events of gene loss or gain. We apply this metric to the analysis of 1271 eukaryotic genomes and 60675 orthologous gene clusters as defined in OrthoDB. Unsupervised clustering of significant pairwise associations with MCL revealed protein networks relating genes with known and unknown function. Our results provide insights into the evolutionary dynamics of the eukaryotic cell and reveal novel components of protein complexes and signaling pathways. We derive testable functional predictions for uncharacterized human genes and provide evidence for their possible involvement in known genetic diseases.
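The counting step behind such a metric can be sketched as follows. This is an illustrative reading of "concordant transitions", assuming that each profile is already ordered by species phylogeny and that a transition is concordant when both genes change state in the same direction between the same two adjacent species. The abstract does not say how counts are turned into a significance score, so that part is omitted; all names are hypothetical.

```python
from typing import List, Sequence, Tuple


def transitions(profile: Sequence[int]) -> List[int]:
    """Signed state changes between adjacent species: +1 gain (0→1), -1 loss (1→0), 0 none."""
    return [after - before for before, after in zip(profile, profile[1:])]


def count_transitions(p: Sequence[int], q: Sequence[int]) -> Tuple[int, int]:
    """Return (concordant, discordant) counts of simultaneous transitions in two profiles."""
    if len(p) != len(q):
        raise ValueError("profiles must cover the same ordered set of genomes")
    concordant = discordant = 0
    for tp, tq in zip(transitions(p), transitions(q)):
        if tp != 0 and tq != 0:  # both genes change state at this position
            if tp == tq:
                concordant += 1
            else:
                discordant += 1
    return concordant, discordant


# Hypothetical presence/absence profiles over seven phylogeny-ordered genomes.
gene_a = [1, 1, 0, 0, 1, 1, 0]
gene_b = [1, 1, 0, 0, 1, 0, 0]  # shares one loss and one gain with gene_a
gene_d = [0, 0, 1, 1, 0, 0, 1]  # always changes in the opposite direction
```

Here `count_transitions(gene_a, gene_b)` gives two concordant and no discordant transitions, while `gene_a` and `gene_d` share three discordant ones.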

C-32: UniRule - A semi-automatic pipeline for functional annotation
COSI: Function COSI
  • Vishal Joshi, EMBL-EBI, United Kingdom
  • Rabie Saidi, EMBL-EBI, United Kingdom
  • Maria Jesus Martin, EMBL-EBI, United Kingdom

Short Abstract: As of April 2019 UniProtKB contains more than 147 million proteins, of which only about 0.4% are manually reviewed. To bridge this gap we have designed computational approaches and developed a stack of pipelines allowing us to propagate annotations to the unreviewed proteins. The UniRule system leverages expert curation for the automatic annotation of unreviewed proteins in the UniProt Knowledgebase. It consists of manually created rules that specify functional annotations and the conditions which must be satisfied for them to be applied, such as taxonomic restrictions, family signatures and sequence features. The UniRule system provides a web interface to facilitate the creation and maintenance of rules. These rules are evaluated by means of statistical measures including confidence, sensitivity, coverage and overlap. These measures are updated in real time whenever changes are made to a rule. A rule can be flagged and then reviewed by curators if it performs sub-optimally. UniRule rules are applied at each monthly release, keeping the propagated annotations up-to-date. On the UniProtKB website, information added by UniRule rules is clearly highlighted as such using evidence tags. These tags can also be used as search terms to specifically search for and/or filter out annotation added by a rule.
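A rule of this kind, with conditions that must all hold before an annotation is propagated, might be sketched as below. The data model is hypothetical and is not UniRule's actual rule format: the class names, fields, identifiers and the evidence-tag string are assumptions made for illustration only.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Protein:
    accession: str
    lineage: Set[str]      # taxa the source organism belongs to
    signatures: Set[str]   # family/domain signature identifiers that hit
    annotations: List[str] = field(default_factory=list)


@dataclass
class Rule:
    rule_id: str
    required_taxon: str            # taxonomic restriction
    required_signatures: Set[str]  # all must be present
    annotation: str

    def matches(self, protein: Protein) -> bool:
        return (self.required_taxon in protein.lineage
                and self.required_signatures <= protein.signatures)


def apply_rule(rule: Rule, proteins: List[Protein]) -> List[str]:
    """Annotate every matching protein and return the accessions touched."""
    annotated = []
    for protein in proteins:
        if rule.matches(protein):
            # Tag the annotation with its source rule, mimicking an evidence tag.
            protein.annotations.append(f"{rule.annotation} [evidence: {rule.rule_id}]")
            annotated.append(protein.accession)
    return annotated


# Hypothetical rule and proteins.
rule = Rule("RULE_0001", "Bacteria", {"SIG_A"}, "Catalytic activity: example")
proteins = [
    Protein("P_1", {"Bacteria"}, {"SIG_A", "SIG_B"}),
    Protein("P_2", {"Eukaryota"}, {"SIG_A"}),  # fails the taxonomic restriction
    Protein("P_3", {"Bacteria"}, {"SIG_B"}),   # lacks the required signature
]
touched = apply_rule(rule, proteins)
```

Keeping the rule identifier attached to each propagated annotation is what makes it possible to search for, or filter out, rule-derived information later.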

C-33: Graph-Regularized Autoencoders for Protein Feature Learning
COSI: Function COSI
  • Meet Barot, Center for Data Science, New York University, New York, NY, USA, United States
  • Vladimir Gligorijević, Center for Computational Biology, Flatiron Institute, Simons Foundation, New York, NY, USA, United States
  • Richard Bonneau, Center for Data Science, New York University, New York, NY, USA, United States

Short Abstract: In protein function prediction, as with many biological classification tasks, most samples are unlabeled or have missing information, while others have extra information that is helpful for learning relationships between samples and the labels. Recently, autoencoders were used in a method called deepNF to learn protein features useful for protein function prediction from protein interaction networks in a completely unsupervised fashion. In order to improve on this, we present a semi-supervised learning technique using autoencoders that incorporates a protein-protein similarity graph constructed using GO annotations. We use a regularization term that enforces features of samples to be similar if they are connected in the GO similarity graph, without needing all samples to be labelled. We train autoencoders on protein-protein interaction networks with this regularization term. We then, as in deepNF, train SVM classifiers using these learned features to predict GO terms. We test our model using a temporal holdout validation scheme and show that our method outperforms previous network-based methods on yeast and human STRING networks. Additionally, we show that this same technique can be used to incorporate various kinds of “privileged information” in protein representation learning.
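The role of such a regularization term can be sketched in plain Python. The abstract does not give the exact loss, so this assumes a common Laplacian-style form: the reconstruction error plus a weight `lam` times the sum of squared distances between the learned features of proteins connected in the GO similarity graph. All function and parameter names are hypothetical.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def graph_penalty(features: Sequence[Vector], edges: Sequence[Tuple[int, int]]) -> float:
    """Sum of squared feature distances over edges of the similarity graph."""
    return sum(squared_distance(features[i], features[j]) for i, j in edges)


def reconstruction_error(inputs: Sequence[Vector], outputs: Sequence[Vector]) -> float:
    """Squared error between autoencoder inputs and reconstructions."""
    return sum(squared_distance(x, y) for x, y in zip(inputs, outputs))


def total_loss(inputs: Sequence[Vector], outputs: Sequence[Vector],
               features: Sequence[Vector], edges: Sequence[Tuple[int, int]],
               lam: float) -> float:
    """Assumed objective: reconstruction error + lam * graph penalty."""
    return reconstruction_error(inputs, outputs) + lam * graph_penalty(features, edges)


# Three hypothetical proteins with 2-D learned features.
features: List[List[float]] = [[0.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
similar_pair = [(0, 1)]      # identical features: no penalty
dissimilar_pair = [(0, 2)]   # distant features: penalised
```

Because the penalty only sums over edges that exist, proteins without GO annotations simply contribute no edges, which is why not all samples need labels.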

C-34: Large-Scale Benchmarking of Protein Descriptors for Protein Ligand Prediction in Target-Based Modelling and Proteochemometrics
COSI: Function COSI
  • Ahmet Sureyya Rifaioglu, Middle East Technical University, Turkey
  • Heval Atas, Middle East Technical University, Turkey
  • Maria Jesus Martin, EMBL-EBI, United Kingdom
  • Rengül Atalay, Middle East Technical University, Turkey
  • Tunca Dogan, European Bioinformatics Institute, Turkey
  • Mehmet Volkan Atalay, Middle East Technical University, Turkey

Short Abstract: The identification of drug/compound-target interactions (DTIs) constitutes the basis of computational drug discovery/repurposing studies. To generate DTI prediction models using supervised machine learning techniques, input ligands and/or proteins are converted into quantitative feature vectors, using various types of molecular descriptors. Therefore, the selection of descriptor sets is crucial to generate predictive models with high performance. While there are many studies benchmarking compound descriptors, protein descriptor analysis studies are scarce. Here, we perform a large-scale benchmark analysis of various sequence-based protein descriptors using random forest and support vector machine algorithms, with target-based and proteochemometric modelling approaches. The proteochemometric approach is a relatively new paradigm that uses both compound and protein features for modelling, which is critical for identifying the druggability potential of proteins that have not been targeted before. This study will help to identify the protein feature types with better representation capabilities to be used both in DTI prediction and for other types of automated protein annotation. Large-scale DTI datasets prepared in this study by extensive filtering operations (to generate training and benchmark sets suitable for proteochemometric modelling) are expected to fill an important gap and serve the computational drug discovery and repurposing community.

C-35: Sequence to Function: Protein Function Prediction without experiments
COSI: Function COSI
  • Mateo Torres, Centre for Systems and Synthetic Biology, Department of Computer Science, Royal Holloway, University of London, United Kingdom
  • Alfonso E. Romero, Centre for Systems and Synthetic Biology, Department of Computer Science, Royal Holloway, University of London, United Kingdom
  • Haixuan Yang, School of Mathematics, Statistics and Applied Mathematics. National University of Ireland, Galway, Ireland
  • Alberto Paccanaro, Centre for Systems and Synthetic Biology, Department of Computer Science, Royal Holloway, University of London, United Kingdom

Short Abstract: The experimental characterisation of protein function is a lengthy and expensive process. With Next Generation Sequencing techniques, the gap between functionally annotated and unannotated proteins has dramatically increased, with less than 1% of proteins functionally characterised. Computational methods that go beyond sequence similarity typically require complementary information about proteins (e.g. protein-protein interactions or co-expression), which are obtained through extensive experimental efforts. The functional characterisation of newly sequenced organisms still represents an open challenge. In these cases, computational methods based on sequence similarity are the only option currently available. We propose S2F (Sequence to Function), a new framework for protein function prediction that is suitable for newly sequenced organisms. S2F exploits the “guilt-by-association” principle in networks to propagate potentially relevant functions between proteins. It starts by making a sensible prediction based on sequence similarity – the initial labels. Then, a protein-protein network is constructed by transferring protein-protein relations from organisms with known experimental evidence. Finally, the initial labels are propagated through the network using a novel label propagation model. We show the performance of our method in 12 Bacteria, where it outperforms state-of-the-art methods for newly sequenced organisms. S2F was listed among the top performers of the CAFA2 challenge.
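The general "guilt-by-association" step can be sketched with a generic label propagation scheme. This is not S2F's novel propagation model, which the abstract does not describe; it is a standard iteration in which each protein's score mixes its initial label with the average score of its network neighbours. The mixing weight `alpha`, the iteration count and all names are assumptions.

```python
from typing import List, Sequence


def propagate(adjacency: Sequence[Sequence[float]], initial: Sequence[float],
              alpha: float = 0.5, iterations: int = 50) -> List[float]:
    """Iterate scores = alpha * (row-normalised adjacency) @ scores + (1 - alpha) * initial."""
    n = len(adjacency)
    # Row-normalise so each protein averages over its neighbours.
    weights = []
    for row in adjacency:
        total = sum(row)
        weights.append([w / total if total else 0.0 for w in row])
    scores = list(initial)
    for _ in range(iterations):
        scores = [
            alpha * sum(weights[i][j] * scores[j] for j in range(n))
            + (1 - alpha) * initial[i]
            for i in range(n)
        ]
    return scores


# Hypothetical network: a chain 0 - 1 - 2 plus an isolated protein 3.
adjacency = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]
initial_labels = [1.0, 0.0, 0.0, 0.0]  # only protein 0 has a sequence-based label
scores = propagate(adjacency, initial_labels)
```

In this toy network the score decays with distance from the labelled protein, and the isolated protein receives nothing, which shows why the transferred network matters as much as the initial labels.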

C-36: CrowdGO: predicting protein functions using a wisdom of the crowd approach
COSI: Function COSI
  • Maarten Reijnders, University of Lausanne, Department of Ecology and Evolution, Switzerland

Short Abstract: Motivation: Predicting protein functions in high-throughput is notoriously challenging. Sequence similarity-based annotation methods often have a high specificity and sensitivity in the case of well-characterized orthologs, but have a harder time predicting functionality in the absence of such orthologs. Homology-independent methods do exist, but usually have a lower specificity, as other protein features are less informative than sequence conservation. Ideally, different prediction methods should be combined to achieve high specificity and sensitivity for as many proteins as possible. Methods: To achieve higher sensitivity and specificity in de novo protein function prediction, CrowdGO combines predictions from different sources, e.g. multiple homology-dependent and homology-independent protein function prediction methods. It uses Gene Ontology semantic similarity to correlate and compare various functional predictions and reassesses these predicted terms using a random forest. Results: CrowdGO is shown to significantly improve GO term prediction over singular methods (P ≤ 2.22e-16). Additionally, CrowdGO successfully changes false positive and false negative predictions from each singular method to true negative and true positive predictions, respectively. Given the significant increase in both sensitivity and specificity, CrowdGO would be a good addition to any omics study in need of high-throughput prediction of its functionome.

C-37: Deciphering Protein Functional Complexity With Alternative Splicing Evolution
COSI: Function COSI
  • Hugues Richard, Sorbonne Université - Laboratory of Computational and Quantitative Biology (LCQB, CNRS-SU), France
  • Elodie Laine, Sorbonne Université - Laboratory of Computational and Quantitative Biology (LCQB, CNRS-SU), France
  • Diego Zea, Sorbonne Université, France

Short Abstract: Alternative Splicing (AS) is an essential regulatory process by which multiple isoforms are produced from the same gene. It has the potential to greatly expand the protein repertoire and to contribute to functional diversity. Growing experimental evidence has shown that AS modulates various biochemical processes and that alternative isoforms of the same gene can accomplish different biological functions by interacting with different partners. We propose two methods, ThorAxe and PhyloSofS, to reconstruct AS evolution and to functionally annotate exons and transcripts. ThorAxe combines pairwise and multiple sequence alignments to establish transcript-aware homologous relationships between exons across species, while PhyloSofS reconstructs the evolutionary history of the set of transcripts and builds homology-based structural models of their protein isoforms. We analyse a set of a dozen genes whose AS has been shown to be functionally relevant, identifying non-trivial groups of homologous exons and dating the corresponding sets of transcripts. To our knowledge, we are the first to provide such a rich description of the complexity of AS-induced protein functional diversity.

C-38: New methods for searching for similar low complexity regions in proteins
COSI: Function COSI
  • Patryk Jarnot, Silesian University of Technology, Poland
  • Joanna Ziemska-Legięcka, University of Warsaw, Poland
  • Marcin Grynberg, Polish Academy of Science, Poland
  • Aleksandra Gruca, Silesian University of Technology, Poland

Short Abstract: About 14% of proteins contain low complexity regions (LCRs). These regions are characterized by low diversity of amino acid composition. They could play key roles in protein function and are relevant to protein structure. However, the statistical models implemented in state-of-the-art methods for finding similarities between protein sequences are not designed for the analysis of low complexity regions. We propose three different methods that are able to search for similar LCRs. The first creates graphs from LCRs, finds cycles in these graphs, and finally uses the cycles to create clusters of similar sequences. The second method calculates PSSMs based on similar LCRs and then uses these matrices to compare sequences. The third method, LCR-BLAST, is a modified version of BLAST. Each method can be used for different purposes. The graph method searches for similar sequences using repeats. The PSSM method is very sensitive and does not depend on the position of the similar part of the sequence. LCR-BLAST is able to search for LCRs that share a similar part of the sequence. Results obtained with the new methods are quantitatively and qualitatively compared with the standard BLAST approach.
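
As an illustration of what "low diversity of amino acid composition" means in practice, the sketch below flags sequence windows with low Shannon entropy. This is a common, generic criterion and is not one of the three methods proposed in the abstract. The window length and entropy threshold are arbitrary assumptions chosen for the example.

```python
import math
from collections import Counter


def shannon_entropy(segment):
    """Shannon entropy (in bits) of the residue composition of a segment."""
    counts = Counter(segment)
    n = len(segment)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def low_complexity_windows(sequence, window=12, threshold=2.0):
    """Return (start, end) pairs of windows whose entropy is below threshold.

    window and threshold are illustrative defaults, not published values.
    """
    hits = []
    for start in range(len(sequence) - window + 1):
        if shannon_entropy(sequence[start:start + window]) < threshold:
            hits.append((start, start + window))
    return hits


# A poly-glutamine stretch is flagged, a compositionally diverse one is not.
poly_q_hits = low_complexity_windows("QQQQQQQQQQQQ")
diverse_hits = low_complexity_windows("ACDEFGHIKLMN")
```

Overlapping hits would normally be merged into a single region before any similarity search; that step is omitted here for brevity.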

C-39: Site2Vec: Deep neural network based vector embeddings for protein-ligand binding sites
COSI: Function COSI
  • Arnab Bhadra, Indian Institute of Technology Tirupati, India
  • Kalidas Yeturu, Indian Institute of Technology Tirupati, India

Short Abstract: We report the development of a novel method for reference-frame-invariant vector embedding of the 3D structure of a protein-ligand binding site. Protein-ligand binding sites represent hot spots of interaction between a protein and a ligand at the subtlest molecular level of living systems. Determining similarities between binding sites has been a major focus over the last decade in the fields of structural bioinformatics and computational drug discovery. One of the primary approaches is to compute a vector descriptor for a given binding site, which can then be used for clustering, classification and other machine learning processes. However, current methods are limited mainly by two issues: (i) sensitivity to the frame of reference and (ii) the curse of dimensionality in the case of 3D structural fingerprints. In this context, we have developed a novel algorithm that derives a reference-frame-invariant descriptor of a binding site from pairwise atomic distances and then normalizes the descriptor to a fixed-dimensional vector using deep neural networks. The algorithm is fast on a standard computer configuration and yields accuracies above 80-90% when tested on data sets of known similar and dissimilar protein-ligand binding site associations.
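
The reason pairwise atomic distances give a reference-frame-invariant descriptor is that distances do not change under rotation or translation. The sketch below demonstrates this with a simple distance histogram. It is not the Site2Vec algorithm: the fixed-length vector is produced here by plain binning rather than by a deep neural network, and the bin width, number of bins, and toy coordinates are assumptions made for the example.

```python
import math
from itertools import combinations


def distance_histogram(coords, bin_width=2.0, n_bins=10):
    """Histogram of all pairwise atomic distances (fixed-length descriptor)."""
    hist = [0] * n_bins
    for a, b in combinations(coords, 2):
        d = math.dist(a, b)
        # Distances beyond the last bin edge are collected in the last bin.
        idx = min(int(d / bin_width), n_bins - 1)
        hist[idx] += 1
    return hist


def rotate_z(coords, angle):
    """Rotate 3D points about the z axis by angle (radians)."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y, s * x + c * y, z) for x, y, z in coords]


# Toy "binding site" of four atoms (coordinates in angstroms, made up).
site = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (0.0, 3.1, 0.0), (0.0, 0.0, 5.2)]

# The same site in a different frame of reference: rotated, then translated.
moved = [(x + 10.0, y - 4.0, z + 2.0) for x, y, z in rotate_z(site, 0.7)]

descriptor = distance_histogram(site)
descriptor_moved = distance_histogram(moved)
```

Both descriptors are identical, whereas the raw coordinate lists differ in every entry. A histogram discards a lot of geometric detail, which is one motivation for learning the mapping to a fixed-dimensional vector instead.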

C-40: Creating functional maps of protein sequences
COSI: Function COSI
  • Maximilian Miller, Rutgers University, United States
  • Daniel Vitale, George Washington University, United States
  • Burkhard Rost, Technical University of Munich, Germany
  • Yana Bromberg, Rutgers University, United States

Short Abstract: Evaluating the impact of non-synonymous genetic variants is essential for uncovering disease associations. Understanding the corresponding changes in sequence can further facilitate synthetic protein design and stability assessments. Despite continuous efforts in the field, little improvement in performance has been observed in recent years. One reason for this might be that most approaches exploit similar sets of gene/protein features for model development, e.g. sequence conservation. While high levels of conservation clearly highlight residues essential for protein activity, much of the in vivo observable variation is arguably weaker in its impact and thus requires evaluation at a higher level of resolution. We developed the function Neutral/Toggle/Rheostat predictor (funtrp) to classify protein sequence positions based on the expected range of mutational impacts: Neutral (mostly no/weak effects), Rheostat (range of effects; i.e. functional tuning), or Toggle (mostly strong effects). Three conclusions of our work are most salient: (i) position types do not correlate strongly with familiar protein features such as conservation or protein disorder; (ii) position type distribution varies across different enzyme classes; (iii) position types reflect experimentally derived functional effects, improving performance of existing variant effect predictors. This suggests that future predictors would greatly benefit from incorporating funtrp functional maps as an additional feature.

C-41: NetGO: Improving Large-scale Protein Function Prediction with Massive Network Information
COSI: Function COSI
  • Fengzhu Sun, Department of Biological Sciences, University of Southern California, United States
  • Xiaodi Huang, Charles Sturt University, Australia
  • Hiroshi Mamitsuka, Kyoto University, Japan
  • Shanfeng Zhu, Fudan University, China
  • Ronghui You, Fudan University, China
  • Shuwei Yao, Fudan University, China
  • Yi Xiong, Shanghai Jiao Tong University, China

Short Abstract: Automated function prediction (AFP) of proteins is of great significance in biology. AFP can be regarded as a large-scale multi-label classification problem in which a protein can be associated with multiple Gene Ontology (GO) terms as its labels. Building on GOLabeler, our state-of-the-art method in the third Critical Assessment of Functional Annotation (CAFA3), in this paper we propose NetGO, a web server that further improves the performance of large-scale AFP by incorporating massive protein-protein network information. Specifically, the advantages of NetGO in using network information are threefold: 1) NetGO relies on a powerful learning to rank (LTR) framework from machine learning to effectively integrate both sequence and network information of proteins; 2) NetGO uses the massive network information of all species in STRING (rather than only some specific species); and 3) NetGO can still use network information to annotate a protein by homology transfer, even if the protein is not contained in STRING. Separating training and testing data with the same time-delayed settings as CAFA, we comprehensively examined the performance of NetGO. Experimental results clearly demonstrate that NetGO significantly outperforms GOLabeler. The NetGO web server is freely available at http://issubmission.sjtu.edu.cn/netgo/.

C-42: Probabilistic variable-length segmentation of protein sequences for discriminative motif discovery (DiMotif) and sequence embedding (ProtVecX)
COSI: Function COSI
  • Alice C. McHardy, Computational Biology of Infection Research, Helmholtz Centre for Infection Research, Germany
  • Ehsaneddin Asgari, University of California Berkeley, Computational Biology of Infection Research, Helmholtz Centre for Infection Research, United States
  • Mohammad R.K. Mofrad, University of California Berkeley, Molecular Biophysics and Integrated Bioimaging, Lawrence Berkeley National Lab, United States

Short Abstract: We present peptide-pair encoding (PPE), a general-purpose probabilistic segmentation of protein sequences into commonly occurring variable-length sub-sequences. The idea of PPE segmentation is inspired by the byte-pair encoding (BPE) text-compression algorithm. We modify this algorithm by adding a sampling framework that allows for multiple ways of segmenting a sequence. PPE segmentation steps can be learned over a large set of protein sequences and then used as the input to any downstream machine learning task. In particular, here, we introduce this representation through protein motif discovery and protein sequence embedding. (i) DiMotif: we present DiMotif as an alignment-free discriminative motif discovery method and evaluate the method for finding protein motifs in three different settings. In general, DiMotif obtained high recall scores while achieving an F1 score comparable to other methods in the discovery of experimentally verified motifs. In the classification-based evaluation, the extracted motifs could reliably detect integrins, integrin-binding proteins, and biofilm formation-related proteins on a reserved set of sequences with high F1 scores. (ii) ProtVecX: we extend k-mer based protein vector (ProtVec) embedding to variable-length protein embedding using PPE sub-sequences. We show that the new method of embedding can marginally outperform ProtVec in protein function prediction tasks.
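
The byte-pair encoding idea that PPE builds on can be sketched as follows: repeatedly merge the most frequent adjacent pair of symbols, then reuse the learned merges to segment new sequences. This is a minimal deterministic version only. It omits the sampling framework that the abstract describes as the PPE modification, and the function names, tie-breaking rule, and toy sequences are assumptions made for illustration.

```python
from collections import Counter


def apply_merge(tokens, pair):
    """Replace every non-overlapping occurrence of pair with one merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def learn_merges(sequences, num_merges):
    """Learn an ordered list of merges from a set of protein sequences."""
    corpus = [list(seq) for seq in sequences]
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for tokens in corpus:
            pair_counts.update(zip(tokens, tokens[1:]))
        if not pair_counts:
            break
        # Most frequent pair; ties are broken by comparing the pairs themselves
        # so that the result is deterministic.
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        corpus = [apply_merge(tokens, best) for tokens in corpus]
    return merges


def segment(sequence, merges):
    """Segment a sequence by replaying the learned merges in order."""
    tokens = list(sequence)
    for pair in merges:
        tokens = apply_merge(tokens, pair)
    return tokens


# Toy training set in which the tripeptide RGD recurs.
merges = learn_merges(["RGDRGD", "RGDK"], num_merges=2)
tokens = segment("KRGDA", merges)
```

With these toy sequences the two learned merges build up the recurring `RGD` unit, so the new sequence is split into `K`, `RGD`, `A`. The resulting variable-length tokens are the kind of units that could then feed a downstream motif discovery or embedding step.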