All times listed are in UTC
- Ronghui You, Fudan University, China
- Shuwei Yao, Fudan University, China
- Hiroshi Mamitsuka, Kyoto University / Aalto University, Japan
- Shanfeng Zhu, Fudan University, China
Presentation Overview:
Motivation: Automated function prediction (AFP) of proteins is a large-scale multi-label classification problem. Two limitations of most network-based methods for AFP are 1) a single model must be trained for each species and 2) protein sequence information is totally ignored. These limitations cause weaker performance than sequence-based methods. Thus, the challenge is how to develop a powerful network-based method for AFP to overcome these limitations.
Results: We propose DeepGraphGO, an end-to-end, multispecies graph neural network-based method for AFP, which makes the most of both protein sequence and high-order protein network information. Our multispecies strategy allows a single model to be trained for all species, which gives it a larger number of training samples than existing methods. Extensive experiments with a large-scale dataset show that DeepGraphGO significantly outperforms a number of competing state-of-the-art methods, including DeepGOPlus and three representative network-based methods: GeneMANIA, deepNF and clusDCA. We further confirm the effectiveness of our multispecies strategy and the advantage of DeepGraphGO on so-called difficult proteins. Finally, we integrate DeepGraphGO into the state-of-the-art ensemble method, NetGO, as a component and achieve a further performance improvement.
- Davide Baldazzi, University of Bologna - Biocomputing Group, Italy
- Castrense Savojardo, University of Bologna - Biocomputing Group, Italy
- Pier Luigi Martelli, University of Bologna - Biocomputing Group, Italy
- Rita Casadio, University of Bologna - Biocomputing Group, Italy
Presentation Overview:
We benchmark BENZ, a method recently introduced for four-level Enzyme Commission (EC) number annotation, on the human reference proteome. The analysis of false positives (enzymes that the system predicts from the pool of sequences without EC annotation) reveals 5741 new enzymes. Our annotations are independently validated by a method relying on GOtoEC and Pfam/InterPro mapping. Results support the relevance of BENZ as a reliable method for EC number annotation. We introduce an independent validation of the false positive predictions, which allows new enzymes to be found in the pool of UniProt non-enzyme sequences. The accuracy (96.84%), wrong prediction (2.06%), unpredicted (3.57%) and false positive (2.69%) values are good and similar to those obtained in previous benchmarks on different data sets, as already reported (NAR, 2021 Web Server Issue, in press). We fail on 679 enzyme sequences and gain 5741 new monofunctional enzymes distributed among the seven enzymatic classes (EC1: 691; EC2: 2332; EC3: 1884; EC4: 199; EC5: 301; EC6: 172 and EC7: 162).
- Jordan Sicherman, Graduate Program in Bioinformatics, Canada
- Nathaniel Lim, Graduate Program in Genome Science and Technology, Canada
- Paul Pavlidis, Michael Smith Laboratories - Department of Psychiatry, Canada
Presentation Overview:
A persistent challenge in genomics is the interpretation of “hit lists” of genes, leading to the almost universal application of methods such as Gene Ontology (GO) enrichment. However, these and similar approaches based on gene annotations leave much to be desired, and they are often used as a “sanity check” rather than a way to make new discoveries. To offer a complementary perspective with one potential way forward, we developed and evaluated an algorithm that helps put gene sets into biological context by performing large-scale mining on patterns of differential expression. Here, we present our work, which mines over 10,000 transcriptomic datasets in a process we term “condition enrichment”. We show that it successfully predicts the top biological conditions most highly associated with a set of query genes in simulated data from their differential expression properties. We further show our algorithm’s real-world utility by recovering known gene-condition relationships from previously published sex-specific, tissue-specific and pathway-associated genes. Overall, this method of condition enrichment offers an effective way of discovering gene function in terms of the biological conditions that genes tend to associate with. We have developed and will soon release a web app to make this more widely accessible.
- Paul Pavlidis, Michael Smith Laboratories and Department of Psychiatry, University of British Columbia, Canada
- Qinkai Wu, Graduate Program in Bioinformatics, University of British Columbia, Canada
Presentation Overview:
Coexpression analysis has been widely used for gene function prediction, based on the principle of guilt by association. Most studies use transcriptomic data obtained from bulk tissues, where the expression level of genes reflects the contribution of multiple cell types. In previous work we documented how variability of cellular composition impacts coexpression analysis. However, the connection between the predictability of gene functions, coexpression networks and cell type profiles has not been studied. We hypothesized that one reason bulk-data-derived coexpression networks contain signals relevant to function prediction is that they contain information about genes’ expression profiles across cell types. Focusing on human brain datasets, we applied several approaches to test this hypothesis, including creating simulated bulk datasets from single-nucleus data and bulk data deconvolution. We find that much predictive power can be attributed to cell type proportion variation. Consequently, a more explicit and interpretable function prediction can be made directly using expression patterns across cell types, which not only yields similar results but also clearly reveals the association between the functional terms and specific cell types. These findings have important implications for coexpression analysis and function prediction.
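The idea of a simulated bulk dataset can be illustrated with a minimal sketch. It assumes a simple linear mixing model in which a bulk sample is the proportion-weighted sum of per-cell-type expression profiles; the function name, the data layout and the mixing model itself are assumptions made for illustration, not the authors' pipeline.

```python
from typing import Dict, List


def simulate_bulk_sample(
    cell_type_profiles: Dict[str, List[float]],
    proportions: Dict[str, float],
) -> List[float]:
    """Mix per-cell-type expression profiles into one simulated bulk sample.

    Hypothetical helper: assumes bulk expression is a linear,
    proportion-weighted sum of cell type profiles (one value per gene).
    """
    n_genes = len(next(iter(cell_type_profiles.values())))
    bulk = [0.0] * n_genes
    for cell_type, profile in cell_type_profiles.items():
        # Cell types missing from `proportions` contribute nothing.
        weight = proportions.get(cell_type, 0.0)
        for gene_index, expression in enumerate(profile):
            bulk[gene_index] += weight * expression
    return bulk


# Toy example: two genes, each expressed in only one cell type.
profiles = {"neuron": [10.0, 0.0], "astrocyte": [0.0, 4.0]}
sample = simulate_bulk_sample(profiles, {"neuron": 0.5, "astrocyte": 0.5})
```

Under this model, varying only the proportions across samples is enough to make genes from the same cell type rise and fall together, which is the kind of signal the abstract attributes to cell type proportion variation.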
- Damiano Piovesan, University of Padova, Italy
- Silvio C. E. Tosatto, University of Padova, Italy
Presentation Overview:
Intrinsically disordered proteins, defying the traditional protein structure–function paradigm, are a challenge to study experimentally. Because a large part of our knowledge rests on computational predictions, it is crucial that their accuracy is high. The Critical Assessment of protein Intrinsic Disorder prediction (CAID) experiment was established as a community-based blind test to determine the state of the art in prediction of intrinsically disordered regions and the subset of residues involved in binding. A total of 43 methods were evaluated on a dataset of 646 proteins from DisProt. The best methods use deep learning techniques and notably outperform physicochemical methods. The top disorder predictor has Fmax = 0.483 on the full dataset and Fmax = 0.792 following filtering out of bona fide structured regions. Disordered binding regions remain hard to predict, with Fmax = 0.231. Interestingly, computing times among methods can vary by up to four orders of magnitude.
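For readers unfamiliar with the Fmax values quoted above, the following is a minimal sketch of how such a score can be computed: the F1 score is evaluated at a range of score thresholds and the maximum is kept. The function name, the evenly spaced threshold grid and the pooling of all residues into one list are assumptions for illustration; the abstract does not describe CAID's exact evaluation protocol.

```python
from typing import Sequence


def f_max(scores: Sequence[float], labels: Sequence[int], steps: int = 100) -> float:
    """Return the maximum F1 over evenly spaced thresholds in [0, 1].

    `scores` are per-residue disorder scores and `labels` are 1 (disordered)
    or 0 (ordered). Pooling residues like this is a simplifying assumption.
    """
    best = 0.0
    for i in range(steps + 1):
        threshold = i / steps
        tp = fp = fn = 0
        for score, label in zip(scores, labels):
            predicted = score >= threshold
            if predicted and label == 1:
                tp += 1
            elif predicted and label == 0:
                fp += 1
            elif not predicted and label == 1:
                fn += 1
        if tp == 0:
            # Precision or recall is zero or undefined; F1 cannot beat `best`.
            continue
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        best = max(best, 2 * precision * recall / (precision + recall))
    return best
```

Because the threshold is chosen after the fact, Fmax reports the best achievable trade-off between precision and recall for a predictor rather than its performance at a fixed cutoff.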
- Jeffrey Yunes, Yunes Foundation for Research on Aging, Portsmouth, NH, United States
- Chengxin Zhang, Department of Molecular, Cellular, and Developmental Biology, Yale University, New Haven, CT 06511, United States
- Petri Törönen, Institute of Biotechnology, University of Helsinki, Finland
Presentation Overview:
CAFA is a community-wide challenge for automated prediction of protein functions (AFP). AFP has several characteristics that complicate its evaluation, such as incomplete gold standards, imbalanced classes, and structured outputs. After a decade of participation, we have observed ways in which CAFA deviates from the task of experimental characterization of protein function. Firstly, the proteins, GO terms, and function annotations are filtered to a distribution that is biased towards a few model organisms, especially mammals, and towards certain types of terms and annotations. Secondly, some of the main assessment metrics do not account for imbalanced classes and incomplete data via information content-based metrics. Thirdly, CAFA’s current “blast” baseline is oversimplified and performs worse than other BLAST-based methods. After collecting suggestions through a community-wide virtual town hall, we tentatively propose the following changes to ongoing and future CAFA challenges. First, we propose to evaluate on all proteins with new experimental annotations, instead of limiting the task to the Swiss-Prot subset with only certain experimental terms and annotations. Second, CAFA should prioritize information content-based metrics, to account for imbalanced classes and an incomplete gold standard. Third, blast baselines should be re-implemented to reflect best practices for such an analysis.
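To make the notion of an information content-based metric concrete, here is a minimal sketch in which each term is weighted by how rarely it is annotated, so that missing a rare, specific term costs more than missing a common one. The frequency-based definition of information content and all function names are simplifying assumptions for illustration, not CAFA's official scoring code.

```python
import math
from typing import Dict, Set, Tuple


def information_content(term_frequency: Dict[str, float]) -> Dict[str, float]:
    """Weight each term by -log2 of its annotation frequency (an assumption)."""
    return {term: -math.log2(freq) for term, freq in term_frequency.items()}


def weighted_precision_recall(
    predicted: Set[str], true: Set[str], ic: Dict[str, float]
) -> Tuple[float, float]:
    """Precision and recall where every term counts in proportion to its weight."""
    overlap = sum(ic[term] for term in predicted & true)
    predicted_total = sum(ic[term] for term in predicted)
    true_total = sum(ic[term] for term in true)
    precision = overlap / predicted_total if predicted_total else 0.0
    recall = overlap / true_total if true_total else 0.0
    return precision, recall


# Toy frequencies: the root term annotates every protein and carries no weight.
ic = information_content({"root": 1.0, "common": 0.5, "rare": 0.125})
precision, recall = weighted_precision_recall(
    {"root", "common"}, {"root", "common", "rare"}, ic
)
```

In this toy case an unweighted recall would be 2/3, while the weighted recall is lower because the only missed term is the rare one.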
- Predrag Radivojac, Iddo Friedberg, Mark Wass
Presentation Overview:
Introduction of the Critical Assessment of Function Annotation (CAFA) to newcomers; input from the CAFA community on the challenge; and the next steps of CAFA
- Sanne Abeln, Vrije Universiteit Amsterdam, Netherlands
- Katharina Waury, Vrije Universiteit Amsterdam, Netherlands
- Dea Gogishvili, Vrije Universiteit Amsterdam, Netherlands
Presentation Overview:
Extracellular vesicles (EVs) comprise a heterogeneous group of membranous structures that are secreted by cells into the extracellular space, and are of major importance for cell-to-cell communication. EVs are promising biomarker candidates as they have been implicated in several pathologies. Results from EV isolation studies have been collected in databases, providing a valuable resource for the exploration of associated proteins. In this study, we investigated whether we could predict EV association from amino acid (AA) sequence, and what the dominant protein features are for EV association. Sequence-based features were used to build a random forest classifier to predict EV association. The model achieved a surprisingly high AUC of 0.82, which increases further to 0.87 when incorporating post-translational modification (PTM) annotations. Based on the feature analysis, EV proteins appear to be large, soluble and stable, with a low isoelectric point (IP) compared to non-EV proteins. Importantly, de novo discovered motifs and known domains, including RAS profile and WW domains, are positively correlated with EV packaging. EV-associated proteins are rich in aspartic and glutamic acid and undergo various PTMs, of which lipidation emerges as particularly important: the presence of palmitoylation sites was the most predictive feature found.
- Sandra Orchard, EMBL-EBI
- Rıza Özçelik, Boğaziçi University, Turkey
- Hakime Öztürk, Boğaziçi University, Turkey
- Arzucan Ozgur, Boğaziçi University, Turkey
- Elif Ozkirimli, Boğaziçi University, Turkey
Presentation Overview:
Identification of high-affinity drug-target interactions is a major research question in drug discovery. Proteins are generally represented by their structures or sequences. However, structures are available only for a small subset of biomolecules and sequence similarity is not always correlated with functional similarity. We propose ChemBoost, a chemical-language-based approach for affinity prediction using SMILES syntax. We hypothesize that SMILES is a codified language and ligands are documents composed of chemical words. These documents can be used to learn chemical word vectors that represent words in similar contexts with similar vectors. In ChemBoost, the ligands are represented via chemical word embeddings, while the proteins are represented through sequence-based features and/or chemical words of their ligands. ChemBoost is able to predict drug-target binding affinity as well as or better than state-of-the-art machine learning systems. When powered with ligand-centric representations, ChemBoost is more robust to changes in protein sequence similarity and successfully captures the interactions between a protein and a ligand, even if the protein has low sequence similarity to the known targets of the ligand.
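The "ligands as documents of chemical words" idea can be sketched as follows. The abstract does not say how chemical words are identified, so this example assumes the simplest possible scheme, overlapping fixed-length substrings of the SMILES string; the function name and the window length are hypothetical.

```python
from typing import List


def chemical_words(smiles: str, k: int = 3) -> List[str]:
    """Split a SMILES string into overlapping k-character 'words'.

    Assumption for illustration: a sliding window over raw characters.
    Other segmentations (for example, learned subword vocabularies) would
    fit the same "document of words" view.
    """
    if len(smiles) < k:
        # Short strings become a single word; empty input yields no words.
        return [smiles] if smiles else []
    return [smiles[i:i + k] for i in range(len(smiles) - k + 1)]


# Acetic acid written in SMILES, treated as a short "document".
words = chemical_words("CC(=O)O", k=3)
```

A list of words like this is the kind of input a word-embedding method expects, so ligands that share substructure patterns end up with overlapping vocabularies.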
- Rıza Özçelik, Boğaziçi University, Turkey
- Alperen Bağ, Boğaziçi University, Turkey
- Berk Atıl, Boğaziçi University, Turkey
- Elif Ozkirimli, Boğaziçi University, Turkey
- Arzucan Ozgur, Boğaziçi University, Turkey
Presentation Overview:
Computational prediction of high-affinity drug-target pairs is an interesting problem in drug discovery research, especially for novel drugs and targets. Machine learning models are effective at predicting the interactions between known biochemicals, but their generalization to novel drugs and targets remains an open question. Here, we introduce DebiasedDTA, the first model training strategy that is specifically designed for novel drug-target affinity (DTA) prediction and applicable to all optimization-based models. DebiasedDTA ensembles an existing DTA model with a weak learner that aims to quantify dataset biases. The output of the weak learner is then used during the training of the DTA model to improve generalizability. The experiments showed that the proposed approach can indeed quantify the bias embedded in the training sets and that DTA models can use this information to boost their performance on both known and novel molecules.
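One way the output of a weak learner could feed into training is sketched below. It assumes that training pairs the weak learner already predicts well are the ones most explained by dataset bias, and therefore down-weights them in the loss of the main DTA model. The abstract does not specify the weighting scheme, so the squared-error weighting and the function name are assumptions for illustration only.

```python
from typing import List, Sequence


def debiasing_weights(
    true_affinities: Sequence[float], weak_predictions: Sequence[float]
) -> List[float]:
    """Turn weak-learner errors into per-sample training weights summing to 1.

    Hypothetical scheme: a sample's weight is proportional to the weak
    learner's squared error on it, so "easy" (bias-explained) samples
    contribute less to the main model's loss.
    """
    errors = [(y - p) ** 2 for y, p in zip(true_affinities, weak_predictions)]
    if not errors:
        return []
    total = sum(errors)
    if total == 0:
        # The weak learner is perfect everywhere: fall back to uniform weights.
        return [1.0 / len(errors)] * len(errors)
    return [error / total for error in errors]


# The second pair is poorly explained by the weak learner and gets more weight.
weights = debiasing_weights([6.0, 8.0], [5.0, 5.0])
```

These weights would then multiply the per-sample loss of any optimization-based DTA model, which is consistent with the strategy being described as model-agnostic.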
- Yue Cao, Texas A&M University, United States
- Yang Shen, Texas A&M University, United States
Presentation Overview:
Facing the increasing gap between high-throughput sequence data and limited functional insights, computational protein function annotation provides a high-throughput alternative to experimental approaches. However, current methods can have limited applicability when they rely on protein data besides sequences, or lack generalizability to novel sequences, species and functions. To overcome the aforementioned barriers in applicability and generalizability, we propose a novel deep learning model using only sequence information, named Transformer-based protein function Annotation through joint sequence–Label Embedding (TALE). For generalizability to novel sequences, we use self-attention-based transformers to capture global patterns in sequences. For generalizability to unseen or rarely seen functions, we embed protein function labels together with inputs/features in a joint latent space. Combining TALE and a sequence similarity-based method, TALE+ outperformed competing methods when only sequence input is available. It even outperformed a state-of-the-art method using network information besides sequence, in two of the three gene ontologies. Furthermore, TALE and TALE+ showed superior generalizability to proteins of low similarity, new species, or rarely annotated functions compared to training data, revealing deep insights into the protein sequence-function relationship. Ablation studies elucidated contributions of algorithmic components toward the accuracy and the generalizability; a GO term-centric analysis was also provided.
- Gabriela Merino, IBB-CONICET-UNER, Argentina
- Diego Milone, sinc(i)-CONICET-UNL, Argentina
- Maria Martin, EBI-EMBL, United Kingdom
- Georgina Stegmayer, sinc(i)-CONICET-UNL, Argentina
- Rabie Saidi, EBI-EMBL, United Kingdom
Presentation Overview:
Manual curation based on experimental evidence is a precise strategy for function annotation but extremely expensive and time-consuming. Hence it cannot cope with the exponential growth of data. Although computational methods for function prediction are being constantly developed, their performance is still subject to improvement, especially for no-knowledge (NK) proteins.
We propose a novel end-to-end deep learning model for predicting Gene Ontology (GO) terms by integrating multiple features from the sequence and taxon of proteins. Our model was trained and evaluated as in the CAFA3 challenge. For training, NK proteins were augmented using CAFA3 training proteins with no changes or added annotations up to 02/2017 (challenge deadline). For evaluation, CAFA3 benchmark proteins were used, obtaining F-max scores of 0.34, 0.55, and 0.55 for biological process (BP), cellular component (CC), and molecular function (MF), respectively. These results place our model among the top 5 CAFA3 methods, with scores very competitive with those of the best competitors for BP and CC. It is also the second-best method when predicting MF. Our results suggest that deep learning integrating multi-source data and using data augmentation during training is a promising tool for function prediction.
- Meet Barot, Center for Data Science, New York University, United States
- Vladimir Gligorijevic, Center for Computational Biology, Flatiron Institute, United States
- Kyunghyun Cho, Center for Data Science, New York University, United States
- Richard Bonneau, Flatiron Institute, New York University, United States
Presentation Overview:
Despite the large number of protein function prediction methods developed, many proteins elude categorization into known functional groups. This motivates the need for ways of classifying proteins without labeled data, as many proteins likely perform functions which are not currently known. In this work, we aim to categorize protein sequences into functionally relevant groups without using labels. We use a semantic clustering algorithm similar to a method which has recently been shown to achieve state-of-the-art results on unsupervised image classification tasks. We evaluate our method on several datasets: 10- and 558-family datasets constructed from Pfam, and a dataset of lyases with EC numbers from BRENDA. We compare our protein class discovery method with a baseline of K-means clustering of PCA components of sequence embeddings and find that we outperform the baseline in all tasks. We are able to show that our model categorizes proteins into existing classes (after hiding those class labels) without any access to labels. Furthermore, these results point towards the possibility of discovering new functions from sequence information alone.
- Dustin Hancks, UT Southwestern Medical Center, United States
- Sruthi Chappidi, UT Southwestern Medical Center, United States
- Mahsa Sorouri, UT Southwestern Medical Center, United States
Presentation Overview:
Although genomics has led to an expansive set of predicted genes, functional annotation of gene products remains rate-limiting. To drive the discovery of gene functions, we exploit evolutionary signatures of host-pathogen conflict. In addition to revealing host defense mechanisms, studies of infected cells have led to the definition of fundamental cellular processes and key master regulators (e.g. SRC, P53). Here, we report VIROLOG - an integrative informatics and experimental framework based on gene expression, subcellular localization, and genomic scars of conflict like rapid evolution and viral homologs. The merit of our strategy is illustrated by the identification of a vertebrate specific MItochondrial STress Response (MISTR) circuit. MISTR is executed by related electron transport chain factors and regulated by ultraconserved miRNAs induced by stress signals such as cytokines and hypoxia. With VIROLOG, we are defining new battlefronts in mitochondria as highlighted by hundreds of viral-encoded factors that may target this organelle during infection to drive viral replication. Collectively, VIROLOG continues the rich history of using viral systems to drive biological discovery by exploiting a combination of classic evolutionary and molecular signatures paired with experimental analysis to characterize mechanisms.
- Yanbin Yin, University of Nebraska - Lincoln, United States
- Catie Ausland, Northern Illinois University, United States
- Jinfang Zheng, University of Nebraska - Lincoln, United States
Presentation Overview:
PULs (polysaccharide utilization loci) are discrete gene clusters of CAZymes (Carbohydrate Active EnZymes) and other genes that work together to digest and utilize carbohydrate substrates. While PULs have been extensively characterized in Bacteroidetes, there exist PULs from other bacterial phyla, as well as archaea and metagenomes, that remain to be catalogued in a database for efficient retrieval. We have developed an online database dbCAN-PUL (http://bcb.unl.edu/dbCAN_PUL/) to display experimentally verified CAZyme-containing PULs from literature with pertinent metadata, sequences, and annotation. Compared to other online CAZyme and PUL resources, dbCAN-PUL has the following new features: (i) batch download of PUL data by target substrate, species/genome, genus, or experimental characterization method; (ii) annotation for each PUL that displays associated metadata such as substrate(s), experimental characterization method(s) and protein sequence information; (iii) links to external annotation pages for CAZymes (CAZy), transporters (UniProt) and other genes; (iv) display of homologous gene clusters in GenBank sequences via the integrated MultiGeneBlast tool; and (v) an integrated BLASTX service available for users to query their sequences against PUL proteins in dbCAN-PUL. With these features, dbCAN-PUL will be an important repository for CAZyme and PUL research, complementing our other web servers and databases (dbCAN2, dbCAN-seq). We have further shown that PULs targeting the same or similar substrates tend to have similar gene composition (i.e., protein family/domain combinations). Therefore, the PUL-substrate associations in dbCAN-PUL can be used to classify computer-predicted CAZyme gene clusters (CGCs) into substrate groups (e.g., xylan, pectin, starch, etc.).
This will allow the prediction of the glycan substrates of CGCs given sequenced microbiome samples and contribute to addressing two fundamental personalized nutrition questions: (i) Is a gut microbe able to use a specific type of glycan? (ii) Can a person carrying certain gut microbes respond to an individualized diet?
Paper published at https://doi.org/10.1093/nar/gkaa742
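The classification of predicted gene clusters by gene composition, described in the abstract above, can be sketched as a nearest-neighbour lookup. This example assumes each cluster is summarized as a set of protein family names and compares sets with Jaccard similarity; the function names, the similarity measure and the toy reference entries are assumptions for illustration, not the method or data used by dbCAN-PUL.

```python
from typing import List, Set, Tuple


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Shared families divided by all families seen in either cluster."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def predict_substrate(
    cgc_families: Set[str], reference_puls: List[Tuple[str, Set[str]]]
) -> Tuple[str, float]:
    """Return the substrate of the most similar reference PUL and its score."""
    best_substrate, best_score = "", 0.0
    for substrate, pul_families in reference_puls:
        score = jaccard(cgc_families, pul_families)
        if score > best_score:
            best_substrate, best_score = substrate, score
    return best_substrate, best_score


# Toy reference set (hypothetical entries, for illustration only).
references = [
    ("xylan", {"GH10", "GH43", "CE1"}),
    ("starch", {"GH13", "GH97"}),
]
substrate, score = predict_substrate({"GH10", "GH43"}, references)
```

A real classifier would likely need a minimum similarity cutoff so that clusters with no close reference PUL are left unassigned rather than forced into a substrate group.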
- Iddo Friedberg
- Monique Zahn, SIB Swiss Institute of Bioinformatics, Switzerland
- Paula Duek, University of Geneva and SIB Swiss Institute of Bioinformatics, Switzerland
- Camille Mary, University of Geneva, Switzerland
- Amos Bairoch, University of Geneva and SIB Swiss Institute of Bioinformatics, Switzerland
- Lydie Lane, University of Geneva and SIB Swiss Institute of Bioinformatics, Switzerland
Presentation Overview:
Research on the human proteome reached a milestone last year with >90% of predicted human proteins detected, according to the HUPO Human Proteome Project. Its neXt-CP50 project aims to characterize 50 of the 1273 proteins with evidence at the protein level (PE1) and no experimentally determined function. In order to support this project and the scientific community in its efforts to complete the human functional proteome, neXtProt has begun to host protein function predictions. The CC BY 4.0 license that applies to the data in neXtProt will also apply to these predictions to promote their reuse. The submitter(s) can remain anonymous or have their ORCID(s) linked to the prediction to give them credit. Predictions for 7 entries are now in neXtProt - for an example, see https://www.nextprot.org/entry/NX_Q6P2H8/function-predictions. These predictions were obtained in the frame of the Fonctionathon course for undergraduates given at the University of Geneva in 2020. We are calling on the community to propose functional predictions, resulting from manual analysis of the available data and literature, for the proteins with no known function. This approach will complement the Critical Assessment of protein Function Annotation algorithms (CAFA) experiment that uses computational methods to predict protein function.
- Lara Mangravite, Sage Bionetworks, USA
Presentation Overview:
Data for health research is all around us. In the last decade, we have moved from a paradigm where research data is collected in a research clinic to a paradigm where research data may stem from anywhere – including our visits to the doctor and our daily interactions with technology. These information streams offer tremendous opportunity to advance research in areas from public health to precision medicine. They can also be extremely intrusive – requiring us to evolve the ways in which we collect, manage, and analyze research data. As always, the translation of science into medicine requires robust and reproducible outcomes with clear actionable consequence. Here, we will discuss approaches to apply open science principles – transparency, reproducibility, and independent contribution – to meet the evolving needs of data-intensive biomedical research.
Lara Mangravite, PhD, is president of Sage Bionetworks, a non-profit research organization that focuses on open practices to advance biomedicine through data-driven science and digital research. Recognizing that all research is limited by restrictions placed on the distribution of information, Sage works closely with institutes, foundations, and research communities to redefine how complex biological data is gathered, shared and used. By improving information flow and research practices, Sage seeks to enable research outcomes of sufficient confidence to support translation. Dr. Mangravite obtained a PhD in pharmaceutical chemistry from the University of California, San Francisco, and completed a postdoctoral fellowship in cardiovascular pharmacogenomics at the Children’s Hospital Oakland Research Institute.