Leading Professional Society for Computational Biology and Bioinformatics
Connecting, Training, Empowering, Worldwide


Function COSI


Schedule subject to change
Monday, July 13th
10:40 AM-11:40 AM
Function COSI Keynote: DNA methylases – computation, experiments and new biology
Format: Live-stream

  • Richard J. Roberts
12:00 PM-12:20 PM
Update on Protein Functional Annotation in UniProt in 2020
Format: Pre-recorded with live Q&A

  • María Martin, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), United Kingdom
  • Rabie Saidi, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), United Kingdom
  • Hermann Zellner, EMBL, United Kingdom

Presentation Overview:

With the increasing volume of data generated by sequencing projects, researchers need reliable systems to provide the functional annotation of proteins. UniProtKB uses two functional annotation systems, UniRule and ARBA (Association-Rule-Based Annotator), to automatically annotate UniProtKB/TrEMBL in an efficient and scalable manner with a high degree of accuracy. These systems use protein signatures and taxonomy classifications to infer the biochemical features and biological functions of proteins. This knowledge is expressed in the form of rules: sets of IF-THEN statements coming from expert curation (UniRule [1]) or generated by machine learning (ARBA [2]). Rules are applied at each release, keeping the propagated annotations up to date. On the UniProtKB website, information added by annotation rules is clearly highlighted as such using evidence tags. These tags can also be used as keywords to search for or filter out annotation added by a rule.
The protein function community could also benefit from those rules, as some sequences may not yet be available in public databases or could be present in highly redundant proteomes absent from UniProtKB. This has been made possible via UniFIRE (The UniProt Functional annotation Inference Rule Engine), an engine to execute rules in the URML (UniProt Rule Markup Language) format.
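
As a rough illustration of the IF-THEN rule idea described above, the following minimal Python sketch applies rules whose conditions are a set of protein signatures and a taxon in the protein's lineage. The `Rule` class, the `annotate` function, and the signature names and annotation text are all hypothetical; this is not the URML format or the UniFIRE interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """Hypothetical IF-THEN rule with signature and taxonomy conditions."""
    required_signatures: frozenset  # all must be present on the protein
    required_taxon: str             # must appear in the protein's lineage
    annotation: str                 # annotation propagated when the rule fires


def annotate(signatures, lineage, rules):
    """Return the annotations of every rule whose conditions are satisfied."""
    sigs = set(signatures)
    taxa = set(lineage)
    return [
        r.annotation
        for r in rules
        if r.required_signatures <= sigs and r.required_taxon in taxa
    ]


# Made-up rule and proteins, for illustration only.
rules = [Rule(frozenset({"SIG_A"}), "Bacteria", "example catalytic activity")]
hits = annotate({"SIG_A", "SIG_B"}, ["cellular organisms", "Bacteria"], rules)
misses = annotate({"SIG_A"}, ["cellular organisms", "Eukaryota"], rules)
```

Here the first protein satisfies both conditions and receives the annotation, while the second fails the taxonomy condition and receives none.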

12:20 PM-12:40 PM
Pruning the Protein Jungle: recent developments in the CATH-Gardener function analysis and prediction pipeline.
Format: Pre-recorded with live Q&A

  • I Sillitoe, Institute of Structural and Molecular Biology, University College London, United Kingdom
  • Christine Orengo, Institute of Structural and Molecular Biology, University College London, United Kingdom
  • Clemens Rauer, University College London, United Kingdom
  • Nicola Bordin, University College London, United Kingdom
  • Hm Scholes, Institute of Structural and Molecular Biology, University College London, United Kingdom

Presentation Overview:

The CATH-Gene3D database includes evolutionary relationships between protein domains, and classifies domains into superfamilies. Functional Families (FunFams) subdivide superfamilies further to provide clusters and alignments of protein domains that all perform closely related functions. FunFams have performed well in function prediction assessments (CAFA) and provide insights both on functional sites and effects of variants on structure and disease.

FunFams are created with Gardener: a novel pipeline for clustering massive protein datasets. Domains are first partitioned by Multi-Domain Architecture (MDA) then tree-building/tree-cutting algorithms (GeMMA/FunFHMMER) are applied iteratively. To deal with huge numbers of sequences from metagenome projects, we have employed a novel random sampling approach that reduces search space, whilst retaining functional diversity required to build informative alignments.

For CAFA4, we implemented a generalised Fast and Frugal Tree algorithm to filter and combine search results from a variety of sources. Sources contributing to these results included: FunFams built from CATH and Pfam domains, a predictor based on autoencoding of network information and a Machine Learning (ML) approach trained to recognise patterns in FunFam sequence matches.

Benchmarks suggest a significant improvement in accuracy for the most recent FunFams, and in-silico validation on fission yeast suggests that network data considerably enhances function prediction.

2:00 PM-2:20 PM
Proceedings Presentation: Benchmarking Gene Ontology Function Predictions Using Negative Annotations
Format: Pre-recorded with live Q&A

  • Christophe Dessimoz, University of Lausanne, Switzerland
  • Alex Warwick Vesztrocy, University of Lausanne, Switzerland

Presentation Overview:

With the ever-increasing number and diversity of sequenced species, the challenge to characterise genes with functional information is even more important. In most species, this characterisation almost entirely relies on automated electronic methods. As such, it is critical to benchmark the various methods. The CAFA series of community experiments provide the most comprehensive benchmark, with a time-delayed analysis leveraging newly curated experimentally supported annotations. However, the definition of a false positive in CAFA has not fully accounted for the Open World Assumption (OWA), leading to a systematic underestimation of precision. The main reason for this limitation is the relative paucity of negative experimental annotations. This paper introduces a new, OWA-compliant, benchmark based on a balanced test set of positive and negative annotations. The negative annotations are derived from expert-curated annotations of protein families on phylogenetic trees. This approach results in a large increase in the average information content (IC) of negative annotations. The benchmark has been tested using the naïve and BLAST baseline methods, as well as two orthology-based methods. This new benchmark could complement existing ones in future CAFA experiments.
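
To make the effect of the Open World Assumption on precision concrete, here is a minimal Python sketch. It assumes that a prediction counts as a false positive only when it matches an explicit negative annotation, and that information content is the negative base-2 logarithm of a term's annotation frequency. The function names and the placeholder term labels are hypothetical, and the paper's exact metric may differ.

```python
import math


def information_content(term_frequency):
    """IC of a term, given the fraction of annotated proteins that carry it."""
    return -math.log2(term_frequency)


def closed_world_precision(predicted, positives):
    """Every prediction without a positive annotation counts as wrong."""
    return len(predicted & positives) / len(predicted) if predicted else None


def owa_precision(predicted, positives, negatives):
    """Only predictions with a known truth value (positive or negative) count."""
    tp = len(predicted & positives)
    fp = len(predicted & negatives)
    return tp / (tp + fp) if tp + fp else None


# Placeholder term labels; term_3 has no annotation either way.
predicted = {"term_1", "term_2", "term_3"}
positives = {"term_1"}
negatives = {"term_2"}
cw = closed_world_precision(predicted, positives)
ow = owa_precision(predicted, positives, negatives)
```

In this toy case the closed-world precision is 1/3, whereas ignoring the unverified prediction gives 1/2, which shows how treating unknowns as errors can underestimate precision.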

2:20 PM-2:40 PM
Proceedings Presentation: The Ortholog Conjecture Revisited: the Value of Orthologs and Paralogs in Function Prediction
Format: Pre-recorded with live Q&A

  • Predrag Radivojac, Northeastern University, United States
  • Moses Stamboulian, Indiana University, United States
  • Rafael Guerrero, Indiana University, United States
  • Matthew Hahn, Indiana University, United States

Presentation Overview:

The computational prediction of gene function is a key step in making full use of newly sequenced genomes. Function is generally predicted by transferring annotations from homologous genes or proteins for which experimental evidence exists. The "ortholog conjecture" proposes that orthologous genes should be preferred when making such predictions, as they evolve functions more slowly than paralogous genes. Previous research has provided little support for the ortholog conjecture, though the incomplete nature of the data casts doubt on the conclusions. Here we use experimental annotations from over 40,000 proteins, drawn from over 80,000 publications, to revisit the ortholog conjecture in two pairs of species: (i) Homo sapiens and Mus musculus and (ii) Saccharomyces cerevisiae and Schizosaccharomyces pombe. By making a distinction between questions about the evolution of function versus questions about the prediction of function, we find strong evidence against the ortholog conjecture in the context of function prediction, though questions about the evolution of function remain difficult to address. In both pairs of species, we quantify the amount of data that must be ignored if paralogs are discarded, as well as the resulting loss in prediction accuracy. Taken as a whole, our results support the view that the types of homologs used for function transfer are largely irrelevant to the task of function prediction. Aiming to maximize the amount of data used for this task, regardless of whether it comes from orthologs or paralogs, is most likely to lead to higher prediction accuracy.

2:40 PM-3:00 PM
Proceedings Presentation: Discovery of multi-operon colinear syntenic blocks in microbial genomes
Format: Pre-recorded with live Q&A

  • Dina Svetlitsky, Ben Gurion University of the Negev, Israel
  • Tal Dagan, Kiel University, Germany
  • Michal Ziv-Ukelson, Ben Gurion University of the Negev, Israel

Presentation Overview:

An important task in comparative genomics is to detect functional units by analyzing gene-context patterns. Colinear syntenic blocks (CSBs) are groups of genes that are consistently encoded in the same neighborhood and in the same order across a wide range of taxa. Such colinear syntenic blocks are likely essential for the regulation of gene expression in prokaryotes. Recent results indicate that colinearity can be conserved across multiple operons, thus motivating the discovery of multi-operon CSBs. This computational task raises scalability challenges in large datasets.

We propose an efficient algorithm for the discovery of cross-strand multi-operon CSBs in large genomic datasets. The proposed algorithm uses match-point arithmetic, which is scalable for large datasets of microbial genomes in terms of running time and space requirements. The algorithm is implemented and incorporated into a tool with a graphical user interface, denoted CSBFinder-S. We applied CSBFinder-S to data mine 1,485 prokaryotic genomes and analyzed the identified cross-strand CSBs. Our results indicate that most of the syntenic blocks are exclusively colinear. Additional results indicate that transcriptional regulation by overlapping transcriptional genes is abundant in bacteria. We demonstrate the utility of CSBFinder-S to identify common function of the gene-pair PulEF in multiple contexts, including Type 2 Secretion System, Type 4 Pilus System, and DNA uptake.
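
The notion of a colinear syntenic block can be illustrated with a deliberately naive Python sketch: it collects every ordered window of `k` consecutive genes and keeps those found in at least `min_genomes` genomes. This is not the match-point arithmetic used by CSBFinder-S, it ignores strands and operon boundaries, and the function name, parameters and toy gene labels are hypothetical.

```python
from collections import defaultdict


def conserved_windows(genomes, k, min_genomes):
    """Return ordered gene k-tuples present in at least min_genomes genomes."""
    support = defaultdict(set)
    for name, genes in genomes.items():
        for i in range(len(genes) - k + 1):
            # Record which genome contains this exact ordered window.
            support[tuple(genes[i:i + k])].add(name)
    return {w: sorted(s) for w, s in support.items() if len(s) >= min_genomes}


# Toy genomes written as ordered lists of gene-family labels.
genomes = {
    "g1": ["a", "b", "c", "d"],
    "g2": ["x", "b", "c", "d"],
    "g3": ["b", "c", "y", "d"],
}
blocks = conserved_windows(genomes, k=2, min_genomes=3)
```

Only the window `("b", "c")` keeps its order in all three toy genomes; `("c", "d")` is interrupted in `g3` and is therefore excluded at this threshold.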

3:20 PM-4:00 PM
Function COSI Keynote: Saving Time at the Bench and in the Field: Predicting Gene Function and Phenotype in Crops
Format: Pre-recorded with live Q&A

  • Carolyn Lawrence-Dill, Iowa State University, United States
4:00 PM-4:20 PM
DeepPheno: Predicting single gene loss-of-function phenotypes using an ontology-aware hierarchical classifier
Format: Pre-recorded with live Q&A

  • Maxat Kulmanov, King Abdullah University of Science and Technology, Saudi Arabia
  • Robert Hoehndorf, King Abdullah University of Science and Technology, Saudi Arabia

Presentation Overview:

Predicting the phenotypes resulting from molecular perturbations is one of the key challenges in genetics. Both forward and reverse genetic screens are employed to identify the molecular mechanisms underlying phenotypes and disease, and these have resulted in a large number of genotype-phenotype associations being available for humans and model organisms. Combined with recent advances in machine learning, it may now be possible to predict human phenotypes resulting from particular molecular aberrations.

We developed DeepPheno, a neural network-based hierarchical multi-class multi-label classification method for predicting the phenotypes resulting from complete loss-of-function in single genes. DeepPheno uses the functional annotations of gene products to predict the phenotypes resulting from a loss-of-function; additionally, we employ a two-step procedure in which we predict these functions first and then predict phenotypes. Prediction of phenotypes is ontology-based and we propose a novel ontology-based classifier suitable for very large hierarchical classification tasks. These methods allow us to predict phenotypes associated with any known protein-coding gene. We evaluate our approach using evaluation metrics established by the CAFA challenge and compare with top performing CAFA2 methods as well as several state of the art phenotype prediction approaches, demonstrating the improvement of DeepPheno over the state of the art.

4:20 PM-4:40 PM
HPOLabeler: Improving Prediction of Human Protein-Phenotype Associations by Learning to Rank
Format: Pre-recorded with live Q&A

  • Lizhi Liu, Fudan University, China
  • Xiaodi Huang, School of Mathematics and Computing, Charles Sturt University, Australia
  • Hiroshi Mamitsuka, Kyoto University / Aalto University, Japan
  • Shanfeng Zhu, Fudan University, China

Presentation Overview:

Annotating human proteins with abnormal phenotypes has become an important topic. As of Nov. 2019, fewer than 4,000 proteins had been annotated with Human Phenotype Ontology (HPO) terms. A computational approach for accurately predicting protein-HPO associations is therefore important, yet no method outperformed a simple naive approach in CAFA2 (second Critical Assessment of Functional Annotation, 2013-14). We present HPOLabeler, which can use a wide variety of evidence, such as protein-protein interaction (PPI) networks, Gene Ontology (GO), InterPro, trigram frequency and HPO term frequency, in the framework of learning to rank (LTR). Given an input protein, LTR outputs a ranked list of HPO terms from a series of scores assigned to the candidate HPO terms by component learning models (logistic regression, nearest neighbor and a naive method), which are trained on the multiple sources of evidence. We evaluate HPOLabeler extensively in two main experiments, cross-validation and temporal validation, in which HPOLabeler significantly outperformed all component models and competing methods, including the current state-of-the-art method. We further found that 1) PPI is the most informative of the diverse data sources for prediction, and 2) the low prediction performance in temporal validation might be caused by incomplete annotation of new proteins.

5:00 PM-5:20 PM
Enzymes, Moonlighting Enzymes, Pseudoenzymes: Similar in Sequence, Different in Function
Format: Pre-recorded with live Q&A

  • Constance Jeffery, University of Illinois at Chicago, United States

Presentation Overview:

The function of a newly sequenced protein is often estimated by sequence alignment with the sequences of proteins with known functions. However, members of a protein superfamily can share significant amino acid sequence identity but vary in the reaction catalyzed and/or the substrate used. In addition, a protein superfamily can include moonlighting proteins, which have two or more functions, and pseudoenzymes, which have a three-dimensional fold that resembles a conventional catalytically active enzyme, but do not have catalytic activity. I will discuss several examples of protein families that contain enzymes with noncanonical catalytic functions, pseudoenzymes, and/or moonlighting proteins. Pseudoenzymes and moonlighting proteins are widespread in the evolutionary tree and are found in many protein families, and they are often very similar in sequence and structure to their monofunctional and catalytically active counterparts. A greater understanding is needed to clarify when similarities and differences in amino acid sequences and structures correspond to similarities and differences in biochemical functions and cellular roles. This information can help improve programs that identify protein functions from sequence or structure and assist in more accurate annotation of sequence and structural databases, as well as in our understanding of the broad diversity of protein functions.

5:20 PM-5:40 PM
eCAMI: simultaneous classification and motif identification for enzyme annotation
Format: Pre-recorded with live Q&A

  • Yanbin Yin, Department of Food Science and Technology, Nebraska Food for Health Center, University of Nebraska-Lincoln, United States

Presentation Overview:

Carbohydrate-active enzymes (CAZymes) are extremely important to bioenergy, human gut microbiome, and plant pathogen research and industry. We developed a new amino acid k-mer based CAZyme classification, motif identification, and genome annotation tool using a bipartite network algorithm. Using this tool, we classified 390 CAZyme families into thousands of subfamilies, each with distinguishing k-mer peptides. These k-mers represented the characteristic motifs of each subfamily, and thus were further used to annotate new genomes for CAZymes. This idea was also generalized to extract characteristic k-mer peptides for all the Swiss-Prot enzymes classified by EC (Enzyme Commission) numbers and applied to enzyme EC prediction.

This new tool was implemented as a Python package named eCAMI. Benchmark analysis of eCAMI against the state-of-the-art tools on CAZyme and enzyme EC datasets found that: (i) eCAMI has the best performance in terms of accuracy and memory use for CAZyme and enzyme EC classification and annotation; (ii) the k-mer based tools (including PPR-Hotpep, CUPP, eCAMI) perform better than homology-based tools and deep-learning tools in enzyme EC prediction. Lastly, we confirmed that the k-mer based tools have the unique ability to identify the characteristic k-mer peptides in the predicted enzymes.

The paper is published at https://doi.org/10.1093/bioinformatics/btz908.
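
The idea of distinguishing k-mer peptides can be sketched in a few lines of Python. The sketch assumes that a k-mer is "distinguishing" when it occurs in one family's sequences and in no other family; eCAMI itself uses a bipartite network algorithm, which is not reproduced here. The function names and toy sequences are hypothetical.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def distinguishing_kmers(families, k):
    """Map each family to the k-mers seen in it and in no other family."""
    per_family = {
        name: set().union(*(kmers(s, k) for s in seqs))
        for name, seqs in families.items()
    }
    result = {}
    for name, own in per_family.items():
        others = set().union(
            *(v for n, v in per_family.items() if n != name)
        )
        result[name] = own - others
    return result


# Toy peptide sequences, for illustration only.
families = {"fam1": ["ACDE"], "fam2": ["CDEF"]}
motifs = distinguishing_kmers(families, k=3)
```

The shared k-mer `CDE` is discarded, leaving `ACD` for the first toy family and `DEF` for the second.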

5:40 PM-6:00 PM
Learning sequence, structure and network features for protein function prediction
Format: Pre-recorded with live Q&A

  • Vladimir Gligorijevic, Flatiron Institute, Simons Foundation, United States
  • Daniel Berenberg, Flatiron Institute, Simons Foundation, United States
  • Richard Bonneau, New York University, United States
  • Meet Barot, Center for Data Science, New York University, United States
  • James Morton, Flatiron Institute, Simons Foundation, United States

Presentation Overview:

Recently, self-supervised and unsupervised representation learning approaches have shown tremendous potential in exploring the huge volume of protein data in many databases and learning features indicative of protein function. Building upon our previous works on sequence, structure, and network data, we propose to systematically investigate the contribution of these features on classification performance on individual GO terms, study their complementarity, and build an integrative model.

We compute sequence-based features using our LSTM language model pre-trained on ~10 million protein sequences; structure-based features from contact maps using a Graph Autoencoder pre-trained on ~30k domain structures from CATH; and network-based features using our deepNF model pre-trained on 6 different networks from STRING. Then we train a separate neural network (NN) for predicting GO term probabilities on each individual feature set.

We will show results from an experiment we conducted on ~18,000 human proteins using their sequences, 3D structures retrieved from PDB and SWISS-MODEL and PPI networks. We will also present the performance results obtained by training a multi-modal NN with all three feature sets for multiple organisms. We show that different modalities contribute to different GO terms, and show that the model integrating information from all sources outperforms the individual models.

Tuesday, July 14th
10:40 AM-11:20 AM
Function COSI Keynote: Gene function prediction using unsupervised biological network integration.
Format: Live-stream

  • Gary Bader, University of Toronto, Canada

Presentation Overview:

Biological networks have the power to map cellular function, but only when unified to overcome their individual limitations such as bias and noise. Unsupervised network integration addresses this, automatically weighting input information to obtain an accurate, unified result. However, existing unsupervised network integration methods do not adequately scale to the number of nodes and networks present in genome-scale data and do not handle frequently encountered data characteristics (e.g. partial network overlap). To address this, we have developed an unsupervised deep learning-based network integration algorithm that incorporates recent advances in reasoning over unstructured data – namely the Graph Convolutional Network (GCN) – that can effectively learn dependencies between physical, co-expression and genetic interaction network topologies. Our method, BIONIC (Biological Network Integration using Convolutions), produces high quality gene and protein features which capture and unify information across many diverse functional interaction networks. BIONIC learns features which contain substantially more functional information compared to existing approaches, linking genes and proteins that share co-complex, pathway and bioprocess relationships.

11:20 AM-11:40 AM
Cross-species functional prediction by global network alignment
Format: Pre-recorded with live Q&A

  • Wayne Hayes, UCI, United States

Presentation Overview:

We report the first successful cross-species prediction of protein function based solely on topology-driven global network alignment. Using SANA (the Simulated Annealing Network Aligner), we pair the proteins of one species with those of another solely by maximizing the number of aligned edges between the networks. We find that SANA’s confidence, called NAF, in each individual pair of aligned proteins correlates with their functional similarity. We then apply SANA to BioGRID 3.0 networks from April 2010, and use GO data from the same month to transfer GO annotations from better-annotated proteins to lesser-annotated ones. We validate the predictions on a recent GO release and find an AUPR of up to 0.4 depending on the predicting GO evidence code, even when restricting predictions to proteins that have no observed sequence or homology relationship. Finally, we apply the same method to recent BioGRID PPI networks of mouse and human, and predict novel cilia-related GO terms in human proteins based on their confident alignment with cilia-annotated mouse proteins; the most confident predictions have literature validation rates above 80%. We propose topology-based alignment of PPI networks as a novel source for prediction of protein function that is independent of sequence or structural information.
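
The quantity being maximized, the number of aligned edges, can be written down directly. The Python sketch below only scores a given one-to-one mapping between two undirected networks; it does not implement SANA's simulated annealing search or the NAF confidence score. The function name and toy node labels are hypothetical.

```python
def aligned_edges(edges_a, edges_b, mapping):
    """Count edges of network A whose mapped endpoints form an edge in B."""
    b = {frozenset(e) for e in edges_b}  # undirected edges of network B
    count = 0
    for u, v in edges_a:
        if u in mapping and v in mapping:
            if frozenset((mapping[u], mapping[v])) in b:
                count += 1
    return count


# Toy networks: A is a path a1-a2-a3, B is a star centred on b1.
edges_a = [("a1", "a2"), ("a2", "a3")]
edges_b = [("b1", "b2"), ("b1", "b3")]
poor = aligned_edges(edges_a, edges_b, {"a1": "b1", "a2": "b2", "a3": "b3"})
good = aligned_edges(edges_a, edges_b, {"a1": "b2", "a2": "b1", "a3": "b3"})
```

Mapping the middle node of the path onto the centre of the star aligns both edges, while the other mapping aligns only one; a search procedure would prefer the former.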

12:00 PM-12:20 PM
CAFA4 Overview
Format: Live-stream

  • Predrag Radivojac, Northeastern University, United States
12:20 PM-12:40 PM
CAFA4 Overview
Format: Live-stream

  • Damiano Piovesan, Department of Biomedical Sciences, University of Padova, Italy
2:00 PM-2:20 PM
CAFA4 Talk 1
Format: Live-stream

  • Shanfeng Zhu, Fudan University, China
2:20 PM-2:40 PM
CAFA4 Talk 2
Format: Live-stream

  • Christine Orengo, Institute of Structural and Molecular Biology, University College London, United Kingdom
2:40 PM-3:00 PM
CAFA4 Talk 3
Format: Live-stream

  • Hongryul Ahn
3:20 PM-3:40 PM
Unsupervised protein embeddings outperform hand-crafted sequence and structure features at predicting molecular function
Format: Pre-recorded with live Q&A

  • Amelia Villegas-Morcillo, University of Granada, Spain
  • Stavros Makrodimitris, Delft University of Technology, Netherlands
  • Roeland C.H.J. van Ham, Delft University of Technology, Netherlands
  • Angel M. Gomez, University of Granada, Spain
  • Victoria Sanchez, University of Granada, Spain
  • Marcel Reinders, Delft University of Technology, Netherlands

Presentation Overview:

Protein function prediction is a difficult bioinformatics problem. Many recent methods use deep neural networks to learn complex sequence representations and predict function from these. Deep supervised models require large amounts of labeled training data, which are not available for this task. However, a very large number of protein sequences without functional labels is available. In this work we applied an existing deep sequence model that had been pre-trained in an unsupervised setting to the supervised task of protein function prediction. We found that this complex feature representation is effective for this task, outperforming hand-crafted features such as one-hot encoding of amino acids, k-mer counts, secondary structure and backbone angles. Also, it partly negates the need for deep prediction models, as a two-layer perceptron was enough to achieve state-of-the-art performance in the third Critical Assessment of Functional Annotation benchmark. We also show that combining this sequence representation with protein 3D structure information does not lead to performance improvement, hinting that three-dimensional structure is also potentially learned during the unsupervised pre-training.

3:40 PM-4:00 PM
Embeddings allow GO annotation transfer beyond homology
Format: Pre-recorded with live Q&A

  • Maria Littmann, Department of Informatics, Technical University of Munich, Germany
  • Michael Heinzinger, (TUM) Technical University of Munich, Germany
  • Burkhard Rost, Rostlab, Germany

Presentation Overview:

Understanding protein function is crucial for molecular and medical biology; nevertheless, Gene Ontology (GO) annotations have been manually confirmed for fewer than 0.5% of all known protein sequences. Computational methods bridge this sequence-function gap, but the best prediction methods need evolutionary information to predict function. Here, we propose a new method that predicts GO terms through annotation transfer without using sequence similarity. Instead, the method uses SeqVec embeddings to transfer annotations between proteins through proximity in embedding space. SeqVec’s data-driven feature extraction transferred knowledge from large unlabeled databases to smaller but labeled datasets (transfer learning). Replicating the conditions of CAFA3, our method reached an Fmax of 50%, 59%, and 65% for BPO, MFO, and CCO, respectively. This was numerically higher than all methods that had actually participated in CAFA3 for BPO and CCO, and scored second for MFO. When the lookup dataset was restricted to proteins with less than 20% pairwise sequence identity to the targets, performance dropped clearly (Fmax BPO 38%, MFO 46%, CCO 56%) but continued to clearly outperform simple homology-based inference. Thereby, the new method may help in annotating novel proteins not belonging to large families.
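
Annotation transfer through proximity in embedding space can be sketched as a nearest-neighbor lookup. The Python sketch below assumes Euclidean distance and a single nearest neighbor, neither of which is stated in the abstract; the function names, toy two-dimensional vectors and placeholder term labels are hypothetical, and real SeqVec embeddings are much higher-dimensional.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def transfer_annotations(query, lookup):
    """Return the closest lookup protein and its annotation set.

    lookup maps a protein id to a pair (embedding, set of terms).
    """
    nearest = min(lookup, key=lambda pid: euclidean(query, lookup[pid][0]))
    return nearest, lookup[nearest][1]


# Toy embeddings and placeholder term labels.
lookup = {
    "p1": ([0.0, 0.0], {"term_x"}),
    "p2": ([1.0, 1.0], {"term_y"}),
}
nearest, terms = transfer_annotations([0.9, 0.8], lookup)
```

The query vector lies closest to `p2`, so it inherits that protein's annotations regardless of any sequence similarity between the two.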

4:00 PM-4:20 PM
ContactPFP: Protein Function Prediction Using Predicted Contact Information
Format: Pre-recorded with live Q&A

  • Yuki Kagaya, Tohoku University, Japan
  • Daisuke Kihara, Purdue University, United States
  • Aashish Jain, Purdue University, United States
  • Sean Flannery, Purdue University, United States

Presentation Overview:

Protein function prediction is an important task in bioinformatics. Although many current function prediction methods rely heavily on sequence-based information, the three-dimensional (3D) structure of proteins is very useful for identifying the evolutionary relationships of proteins, from which functional similarity can be inferred. Here, we developed a novel protein function prediction method called ContactPFP, which uses predicted residue-residue contact maps to identify the structural similarity of proteins. ContactPFP showed comparable performance to existing sequence-based methods.

4:20 PM-4:40 PM
Fine-tuning of Language Model-Based Representation for Protein Functional Annotation
Format: Pre-recorded with live Q&A

  • Ehsaneddin Asgari, Helmholtz Center for Infection Research, Germany
  • Andrew Dickson, University of California, Berkeley, United States
  • Meisam Ahmadi, Iran University of Science and Technology (IUST), Iran
  • Mohammad Khodabakhsh, University of Michigan, United States
  • Alice McHardy, Helmholtz Centre for Infection Research, Germany
  • Mohammad R.K. Mofrad, University of California, Berkeley, United States

Presentation Overview:

We present our approach to fine-tuning language model-based representations for the prediction of (i) gene-ontology (GO), (ii) human-phenotype-ontology (HPO), and (iii) disorder-ontology (DO) terms, as subtasks of CAFA 4. Recently, transfer learning has shown significant improvements in many machine learning problems, particularly when annotated data are scarce. Being both self-supervised and sufficiently general makes neural language modeling an ideal candidate for transfer learning on sequential data. The trained language modeling network can subsequently be fine-tuned for any particular task, even when only a limited number of annotations are available. In the CAFA 4 subtasks, we use language model-based transfer learning in the following steps: (i) we train a language model-based representation of protein sequences on a large collection of protein sequences (UniRef50); (ii) we fine-tune the obtained model for the supervised task of neural GO prediction, which has relatively more training instances than HPO and DO; (iii) we fine-tune the model already tuned for GO prediction a second time, this time for the prediction of HPO and DO. To improve the predictions, we use an ensemble of different fine-tuning paths from language modeling to the supervised annotation prediction of interest.

5:00 PM-5:20 PM
INGA protein function prediction for the dark proteome
Format: Pre-recorded with live Q&A

  • Damiano Piovesan, Department of Biomedical Sciences, University of Padova, Italy
  • Silvio Tosatto, Department of Biomedical Sciences, University of Padova, Italy

Presentation Overview:

Our current knowledge of complex biological systems is stored in a computable form through the Gene Ontology (GO), which provides a comprehensive description of gene function. Prediction of GO terms from sequence remains, however, a challenging task, which is particularly critical for novel genomes. Here we present INGA version 3.0, a new version of the INGA software for protein function prediction. INGA exploits homology, domain architecture and information from the ‘dark proteome’, i.e. all those features that are difficult to detect by looking at sequence conservation. In the new version, which was implemented for CAFA4, we also included low complexity, coiled coil and homorepeat information. Dark features are used to enrich domain architectures, increasing the specificity of related functions. They also allow the characterization of proteins that do not match any known domain. INGA was ranked in the top ten methods in CAFA2 and second for the majority of CAFA3 challenges. The new algorithm can process entire genomes in a few hours, or even less when additional input files are provided. The INGA web server, databases and benchmarking are available from URL: https://inga.bio.unipd.it/.

5:20 PM-5:40 PM
CaoLab: Protein function and disorder prediction from sequence based on RNN
Format: Pre-recorded with live Q&A

  • Kyle Hippe, Pacific Lutheran University, United States
  • Sola Gbenro, Pacific Lutheran University, United States
  • Renzhi Cao, Pacific Lutheran University, United States

Presentation Overview:

Much progress has been made in machine learning and natural language processing. Here we introduce the CaoLab server, which participated in the latest CAFA4 experiment. We used natural language processing and machine learning techniques to tackle the protein function prediction and disorder prediction problems. ProLanGO2 predicts protein function from protein sequence, and ProLanDO predicts protein disorder from protein sequence. The latest version of the UniProt database (as of 12/12/2019) is used to extract the top 2000 most frequent k-mers (k from 3 to 7) to build a fragment sequence database (FSD). The ProLanDO method uses the DO database provided by CAFA4 (https://www.disprot.org/), with each sequence filtered with FSD, and a character-level RNN model is trained to classify the DO term. The ProLanGO2 method is an updated version of ProLanGO, published in 2017, and uses the latest version of the UniProt database filtered by FSD. An Encoder-Decoder network, a model consisting of two RNNs (encoder and decoder), is trained on the dataset, and the top 100 best performing models are used to select ensemble models as the final model for protein function prediction.
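
Extracting the most frequent k-mers, as described for the fragment sequence database, might look like the following Python sketch. It assumes that k-mers of all lengths are ranked together by raw count; the abstract does not say whether the top 2000 are taken overall or per value of k. The function name and toy sequences are hypothetical.

```python
from collections import Counter


def top_kmers(sequences, k_values, n):
    """Return the n most frequent k-mers over all sequences and k values."""
    counts = Counter()
    for seq in sequences:
        for k in k_values:
            # Count every overlapping window of length k in this sequence.
            counts.update(seq[i:i + k] for i in range(len(seq) - k + 1))
    return [kmer for kmer, _ in counts.most_common(n)]


# Toy sequences; the abstract's setting would be k_values=range(3, 8), n=2000.
fragments = top_kmers(["AAAB", "AAAC"], k_values=[3], n=1)
```

In the toy input, `AAA` occurs in both sequences and is therefore the single most frequent 3-mer.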

5:40 PM-6:00 PM
Protein function inference via curated aggregate co-expression networks.
Format: Pre-recorded with live Q&A

  • John Lee, Cold Spring Harbor Laboratory, United States
  • Jesse Gillis, Cold Spring Harbor Laboratory, United States
  • John Hover, Cold Spring Harbor Laboratory, United States

Presentation Overview:

We explore the specific utility of co-expression data for the CAFA challenge. Our lab has assembled a set of highly-curated aggregate co-expression networks derived from 895 datasets (~39 517 samples) from 14 species, providing a well-powered source for inference.
We tested performance via the standard CAFA approach of calculating Fmax per protein. Our “direct expression” method simply finds the top co-expressed genes of benchmark genes, and transfers functions where available. Our “orthoexpression” extends species coverage by first looking up top homologs to the benchmarks, finding coexpressed genes from those, and then transferring functions.
We found evidence that expression data contains information that can be used to improve functional annotation. Both our methods performed well relative to the pure sequence-based (phmmer) baseline, particularly for BP. To quantify potential performance improvements we ask how well we could do if we knew in advance which method is better. Plots of predictions against one another show substantial room for improvement. We find that while aggregated coexpression data can successfully be used to improve protein function prediction, integration with sequence-based methods is a major area for potential progress in the field.
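
For readers unfamiliar with the Fmax measure mentioned above, here is a minimal Python sketch of the protein-centric definition commonly used in CAFA: at each score threshold, precision is averaged over proteins with at least one prediction and recall over all benchmark proteins, and Fmax is the best resulting F-measure. The function name, input layout and toy data are hypothetical, and the sketch assumes every benchmark protein has at least one true term.

```python
def fmax(predictions, truth, thresholds):
    """Protein-centric Fmax.

    predictions: {protein: {term: score}}
    truth: {protein: non-empty set of true terms}
    """
    best = 0.0
    for t in thresholds:
        precisions, recalls = [], []
        for protein, true_terms in truth.items():
            scores = predictions.get(protein, {})
            predicted = {term for term, s in scores.items() if s >= t}
            hits = len(predicted & true_terms)
            if predicted:  # precision only over proteins with predictions
                precisions.append(hits / len(predicted))
            recalls.append(hits / len(true_terms))
        if precisions:
            p = sum(precisions) / len(precisions)
            r = sum(recalls) / len(recalls)
            if p + r > 0:
                best = max(best, 2 * p * r / (p + r))
    return best


# Toy predictions with placeholder term labels.
predictions = {"p1": {"a": 0.9, "b": 0.4}, "p2": {"a": 0.6}}
truth = {"p1": {"a"}, "p2": {"b"}}
score = fmax(predictions, truth, thresholds=[0.5, 0.9])
```

At threshold 0.5 both precision and recall are 0.5; at 0.9 only `p1` has a prediction, giving precision 1.0 and recall 0.5, so the maximum F-measure is 2/3.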