Function COSI

Schedule subject to change
All times listed are in CDT
Monday, July 11th
10:30-10:40
Introduction
Room: Madison B
Format: Live from venue

Moderator(s): Dukka KC

  • Kim Reynolds
  • Dukka KC
10:40-11:20
Keynote Presentation: Modern resources for the intrinsic disorder and disorder function prediction
Room: Madison B
Format: Live from venue

Moderator(s): Dukka KC

  • Lukasz Kurgan


Over the years, intrinsic disorder was shown to play key roles in a wide spectrum of cellular functi...

11:20-11:30
Protein embeddings and deep learning predict binding residues for various ligand classes
Room: Madison B
Format: Live-stream

Moderator(s): Dukka KC

  • Maria Littmann, Technical University of Munich, Germany
  • Michael Heinzinger, Department of Informatics, Technical University of Munich, Germany
  • Christian Dallago, Department of Informatics, Technical University of Munich, Germany
  • Konstantin Weissenow, Department of Informatics, Technical University of Munich, Germany
  • Burkhard Rost, Department of Informatics, Technical University of Munich, Germany


One important aspect of protein function is the binding of proteins to ligands, including small molecules, metal ions, and macromolecules such as DNA or RNA. Despite decades of experimental progress many binding sites remain obscure. We proposed bindEmbed21, a method predicting whether a protein residue binds to metal ions, nucleic acids, or small molecules. The Artificial Intelligence (AI)-based method exclusively uses embeddings from the Transformer-based protein Language Model (pLM) ProtT5 as input. Using only single sequences without creating multiple sequence alignments (MSAs), bindEmbed21DL outperformed MSA-based predictions. Combination with homology-based inference increased performance to F1=48±3% (95% CI) and MCC=0.46±0.04 when merging all three ligand classes into one. All results were confirmed by three independent data sets. Focusing on very reliably predicted residues could complement experimental evidence: For the 25% most strongly predicted binding residues, at least 73% were correctly predicted even when ignoring the problem of missing experimental annotations. The new method bindEmbed21 is fast, simple, and broadly applicable - neither using structure nor MSAs. Thereby, it found binding residues in over 42% of all human proteins not otherwise implied in binding and predicted about 6% of all residues as binding to metal ions, nucleic acids, or small molecules.
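
The F1 and MCC figures quoted above are standard summaries of a binary confusion matrix (here: binding vs. non-binding residues). As a minimal sketch of how such scores are computed, the snippet below uses hypothetical counts; the function name and the numbers are illustrative assumptions, not values or code from bindEmbed21.

```python
import math


def f1_and_mcc(tp, fp, tn, fn):
    """Return (F1, MCC) for binary per-residue predictions.

    tp/fp/tn/fn are counts of true/false positives/negatives.
    Undefined ratios (zero denominators) are reported as 0.0.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    pr_sum = precision + recall
    f1 = 2 * precision * recall / pr_sum if pr_sum else 0.0

    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return f1, mcc


# Hypothetical counts, for illustration only.
f1, mcc = f1_and_mcc(tp=40, fp=10, tn=930, fn=20)
print(f"F1={f1:.3f}  MCC={mcc:.3f}")
```

Because binding residues are rare, MCC is often reported alongside F1: it also accounts for true negatives and stays near zero for a predictor that labels everything as one class.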

11:30-11:40
AHRD - Protein Function Transfer and Evaluation
Room: Madison B
Format: Live-stream

Moderator(s): Dukka KC

  • Florian Boecker, Crop Bioinformatics Uni Bonn, Germany
  • Heiko Schoof, Crop Bioinformatics Uni Bonn, Germany


Genome-scale protein annotation can be performed by the transfer of functions from known proteins matched via sequence similarity. Errors can propagate when annotations falsely generated in this manner make their way into public databases and are used as basis for subsequent function transfers. Our program "Automatic assignment of Human Readable Descriptions" (AHRD) can overcome these pitfalls by avoiding previously transferred annotations. It emulates the decision process of a human curator to select a description and GO terms from sequence similarity search results. To assess the annotation performance of AHRD and competing methods accurately, we firstly generated an unbiased ground truth set of high quality protein annotations with minimal redundancy. It enables one to contrast annotation methods because it contains many proteins that are difficult to annotate. Secondly, we implemented an evaluation metric that uses the information content of GO terms to determine the semantic similarity of GO annotations. Using both, we are able to compare AHRD with some of its competitors and can show that it is able to predict descriptions and GO annotations at high quality while simultaneously maintaining a competitive coverage.
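
To illustrate what an evaluation metric based on the information content (IC) of GO terms can look like, here is a minimal sketch. It assumes IC(t) = -log2 of the term's annotation frequency and scores a prediction with an IC-weighted Jaccard overlap, which is one common formulation and not necessarily the metric implemented in AHRD. The term identifiers and counts are hypothetical, and term sets are assumed to already include ancestor terms.

```python
import math


def information_content(term_counts, total):
    """IC(t) = -log2(count(t) / total); rarer terms carry more information."""
    return {term: -math.log2(count / total) for term, count in term_counts.items()}


def ic_weighted_jaccard(predicted, truth, ic):
    """Sum of IC over shared terms divided by sum of IC over all terms."""
    union_ic = sum(ic[t] for t in predicted | truth)
    if union_ic == 0:
        return 0.0
    return sum(ic[t] for t in predicted & truth) / union_ic


# Hypothetical annotation counts in a reference set of 100 proteins.
counts = {"GO:root": 100, "GO:a": 50, "GO:b": 25, "GO:c": 5}
ic = information_content(counts, total=100)

predicted = {"GO:root", "GO:a", "GO:b"}
truth = {"GO:root", "GO:a", "GO:c"}
print(round(ic_weighted_jaccard(predicted, truth, ic), 3))
```

Weighting by IC means that agreeing only on broad, frequently used terms earns little credit, while missing a specific term is penalized more heavily.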

11:40-11:50
Co-evolution based machine-learning for predicting functional interactions between human genes
Room: Madison B
Format: Live-stream

Moderator(s): Dukka KC

  • Doron Stupp, Hebrew University, Israel
  • Marinka Zitnik, Harvard Medical School, United States
  • Yuval Tabach, Hebrew University, Israel


Over the next decade, more than a million eukaryotic species are expected to be fully sequenced. This has the potential to improve our understanding of genotype and phenotype crosstalk, gene function and interactions, and answer evolutionary questions. Here, we develop a machine-learning approach for utilizing phylogenetic profiles across 1154 eukaryotic species. This method integrates co-evolution across eukaryotic clades to predict functional interactions between human genes and the context for these interactions. We benchmark our approach showing a 14% performance increase (auROC) compared to previous methods. Using this approach, we predict functional annotations for less studied genes. We focus on DNA repair and verify that 9 of the top 50 predicted genes have been identified elsewhere, with others previously prioritized by high-throughput screens. Overall, our approach enables better annotation of function and functional interactions and facilitates the understanding of evolutionary processes underlying co-evolution. The manuscript is accompanied by a webserver available at: https://mlpp.cs.huji.ac.il.
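
A phylogenetic profile records, for each gene, in which species an ortholog is present. The sketch below shows only the basic co-evolution signal that such profiles carry, using Pearson correlation between binary presence/absence vectors. It is not the machine-learning model described in the abstract; the gene names and the six-species profiles are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length numeric vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant profile carries no co-evolution signal
    return cov / math.sqrt(var_x * var_y)


# Hypothetical profiles: 1 = ortholog present in that species, 0 = absent.
profiles = {
    "geneA": [1, 1, 0, 0, 1, 0],
    "geneB": [1, 1, 0, 0, 1, 0],  # gained and lost together with geneA
    "geneC": [0, 0, 1, 1, 0, 1],  # complementary pattern
    "geneD": [1, 0, 1, 0, 1, 0],  # weakly related pattern
}

for other in ("geneB", "geneC", "geneD"):
    score = pearson(profiles["geneA"], profiles[other])
    print(f"geneA vs {other}: {score:+.2f}")
```

Genes with highly correlated profiles are candidates for a shared function; computing such scores separately within individual clades is one way to obtain clade-specific features of the kind the abstract alludes to.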

11:50-12:00
Interpretable modeling of genotype-phenotype landscapes with state-of-the-art predictive accuracy
Room: Madison B
Format: Live from venue

Moderator(s): Dukka KC

  • Peter Tonner, National Institute of Standards and Technology, United States
  • Abe Pressman, National Institute of Standards and Technology, United States
  • David Ross, National Institute of Standards and Technology, United States


Large-scale measurements linking genetic background (genotype) to corresponding function (phenotype) have grown steadily in size and scale. From these measurements, our ability to predict novel relationships in the genotype-phenotype landscape (GPL) directly impacts diverse biological sciences. Currently, these predictions are often made with neural networks (NNs) due to their unsurpassed accuracy in out-of-sample extrapolation. But NNs cannot easily explain their predictions, which hinders insight into the underlying biophysical system measured by the GPL. Here, we present an interpretable model of GPLs, called LANTERN. LANTERN learns a representation of every genetic mutation in the form of a latent, low-dimensional vector. Predictions from LANTERN then combine these representations, and connect them to measured phenotypes through a smooth, nonlinear surface. LANTERN is therefore fully interpretable: every prediction from the model is easily explained through these simple components. These components also reveal how different biophysical mechanisms, including structural biochemistry and changes in free energy, influence different GPLs. Despite its simplicity, we also show that LANTERN equals or outperforms alternative models in predictive accuracy, including NNs. LANTERN thus provides state-of-the-art prediction on GPL data while also being fully interpretable.

12:00-12:10
Integrating multimodal data through interpretable heterogeneous ensembles
Room: Madison B
Format: Live from venue

Moderator(s): Dukka KC

  • Yan Chak Li, Icahn School of Medicine at Mount Sinai, United States
  • Linhua Wang, Baylor College of Medicine, United States
  • Jeffrey Law, National Renewable Energy Laboratory, United States
  • T. M. Murali, Virginia Tech, United States
  • Gaurav Pandey, Icahn School of Medicine at Mount Sinai, United States


Most biomedical data integration methods follow the early approach of aggregating multimodal datasets into a uniform representation like networks that can then be analyzed. This approach may diminish local information exclusive to each data modality. We propose the novel Ensemble Integration (EI) framework based on the late approach that can address this challenge. EI offers the flexibility of inferring effective local predictive models from individual modalities, before aggregating them into a global heterogeneous ensemble model. We also propose a novel interpretation method for EI. We tested EI for predicting GO term annotations from multimodal STRING data. Across 2,139 GO terms, EI performed significantly better than other data integration approaches and individual STRING modalities (FDR < 9.34 × 10^-14). For predicting COVID-19 mortality from multimodal electronic health record data, EI performed significantly better than the individual modalities (FDR < 0.0082), and slightly better than an XGBoost model. The best-performing EI model also revealed several features, such as age, minimum oxygen saturation and several laboratory tests, that have been confirmed as relevant to COVID-19 mortality in previous clinical studies. These results demonstrated the effectiveness of EI for biomedical data integration and predictive modeling, as well as its ability to reveal problem-relevant information from complex multimodal datasets.

12:10-12:20
Machine Learning to Uncover Microbial Function Indicators for Earth Biomes
Room: Madison B
Format: Live from venue

Moderator(s): Dukka KC

  • Marcin Joachimiak, Lawrence Berkeley National Laboratory, United States
  • Ziming Yang, Brookhaven National Laboratory, United States
  • Sean Jungbluth, Lawrence Berkeley National Laboratory, United States
  • Shane Cannon, Lawrence Berkeley National Laboratory, United States
  • Paramvir Dehal, Lawrence Berkeley National Laboratory, United States
  • Adam Arkin, Lawrence Berkeley National Laboratory, United States


Microbial life is a critical component of Earth biomes, and the taxonomic and functional information from environmental genomics can provide insights into microbial roles in the environment. However, comparing these data across metagenomes can be challenging, and abundance differences may not reflect important functional differences between environments. Here we aim to use machine learning to: 1) build and evaluate robust biome classification models using standardized data from ~32,000 metagenomes; and 2) identify important classification features and how they relate to our understanding of environments on Earth. We constructed a feature table associating metagenome sample environment labels with Gene Ontology (GO) term abundance profiles. Using top multiclass classification methods, we performed classification model-building experiments by varying data preparation and feature selection strategies. With a permutation analysis for feature importance from the top evaluated models, we extracted functions important for classification as well as function indicators for specific biomes. The important biome functions lead to better separation and grouping of samples according to their biome labels; furthermore, model predictions suggest sample labeling improvements. Our results provide a high-performance metagenome biome classification model and enable model interpretability to learn important biome indicator functions as well as biome and function relationships.

12:20-12:30
Prioritizing important regions of sequencing data for function prediction
Room: Madison B
Format: Live from venue

Moderator(s): Dukka KC

  • Mahdi Baghbanzadeh, Computational Biology Institute, Milken Institute School of Public Health, The George Washington University, United States
  • Tyson Dawson, Computational Biology Institute, Milken Institute School of Public Health, The George Washington University, United States
  • Ali Rahnavard, Computational Biology Institute, Milken Institute School of Public Health, The George Washington University, United States


Genomics data holds great promise in understanding the causes, characteristics, and potential interventions for diseases because of the specificity of DNA snapshots and mutations from individuals. There are many challenges with this data, including non-independent observations, large noise components, nonlinearity, collinearity, and high dimensionality which make them suitable for machine learning techniques to capture nonstructural patterns.
In recent years, these large datasets have made it possible for researchers to implement machine learning algorithms to study the genotype-phenotype association and develop models with high-performance metrics. In this study, we present a unified and generic approach that not only compares the metrics of multiple fitted models and reports the best-fitted model but also assists in identifying the most important positions (features) in the genomic sequence data that contribute mainly to identifying phenotypic traits or variants.
We have developed deepBreaks, a computational tool to identify important changes associated with the phenotype of interest using sequencing data. We validated deepBreaks in diverse applications, revealing new significant genotype-phenotype associations, experimentally validated associations, and a set of novel findings from related microbial strains, coronavirus families, SARS-CoV-2 strains, and opsins. deepBreaks is open-source software, and the Python implementation is available online at http://github.com/omicsEye/deepBrekas.

14:30-14:40
Snekmer: A scalable pipeline for protein sequence fingerprinting based on amino acid recoding
Room: Madison B
Format: Live from venue

Moderator(s): Predrag Radivojac

  • Christine Chang, Pacific Northwest National Laboratory, United States
  • William Nelson, Pacific Northwest National Laboratory, United States
  • Aaron Wright, Pacific Northwest National Laboratory, United States
  • Robert Egbert, Pacific Northwest National Laboratory, United States
  • Jason McDermott, Pacific Northwest National Laboratory, United States


Advances in sequencing and the subsequent explosion of available sequence data have motivated the need for faster, more sensitive methods for protein function annotation. Current methods for functional annotation, including sequence similarity assessment (e.g., via BLAST) and hidden Markov models (HMMs), scale poorly, suffer from reference or training set biases, and require time-consuming curation processes. Here, we introduce a method for rapid protein classification combining the kmer approach, in which proteins are represented as sets of peptide subsequences of length k, and amino acid recoding (AAR), which reduces the set of characters representing the 20 amino acids through similarity groupings. Proteins are thus simplified into vector representations capable of fingerprinting structural information for functional differentiation. To enable automated, high-throughput sequence analysis, we have developed Snekmer, a software tool which allows users to: (a) build classification models for protein families, (b) search a list of input protein sequences against pre-trained AAR-kmer family models and assign family probability scores, or (c) apply clustering algorithms for de novo protein family determination. We have evaluated Snekmer using an example set of 33 protein families, and demonstrate Snekmer’s utility in enabling rapid and accurate differentiation between families. Snekmer is available at http://github.com/PNNL-CompBio/Snekmer.
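
The combination of amino acid recoding and kmer counting can be sketched in a few lines. The four-group alphabet below (hydrophobic, polar, positive, negative) is a hypothetical recoding chosen for illustration, and the function names are assumptions; this is not Snekmer's actual alphabet or API.

```python
from collections import Counter

# Hypothetical reduced alphabet: 20 amino acids mapped onto 4 symbols.
GROUPS = {
    "h": "AVLIMFWC",  # hydrophobic
    "p": "STNQYGP",   # polar / small
    "+": "KRH",       # positively charged
    "-": "DE",        # negatively charged
}
RECODE = {aa: symbol for symbol, members in GROUPS.items() for aa in members}


def recode(sequence):
    """Replace each residue by its group symbol ('?' for unknown residues)."""
    return "".join(RECODE.get(aa, "?") for aa in sequence.upper())


def kmer_counts(sequence, k):
    """Count all overlapping kmers of length k in the recoded sequence."""
    reduced = recode(sequence)
    return Counter(reduced[i:i + k] for i in range(len(reduced) - k + 1))


print(recode("MKDEL"))
print(kmer_counts("MKDEL", k=2))
```

With a 4-letter alphabet there are only 4^k possible kmers instead of 20^k, so the resulting count vectors are far smaller and sequences that differ only by similar residues map to the same features.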

14:40-14:50
AlphaFold models can inform the prediction of functional LIR-motifs
Room: Madison B
Format: Live-stream

Moderator(s): Predrag Radivojac

  • Vasilis Promponas, Department of Biological Sciences, University of Cyprus, Cyprus
  • Agathangelos Chatzichristofi, Department of Biological Sciences, University of Cyprus, Cyprus
  • Vasileios Sagris, University of Cyprus, Cyprus
  • Aristos Pallaris, University of Cyprus, Cyprus
  • Marios Eftychiou, University of Cyprus, Cyprus
  • Ioanna Kalvari, University of Cyprus, Cyprus
  • Nicholas Price, University of Cyprus, Cyprus
  • Theodosios Theodosiou, University of Cyprus, Cyprus
  • Ioannis Iliopoulos, University of Crete, Greece
  • Ioannis Nezis, The University of Warwick, United Kingdom


In selective macroautophagy, selectivity is mainly achieved through receptor proteins that bind to the surface of members of the Atg8 protein family by anchoring a small linear peptide (AIM in plants and fungi, LIR-motif in mammals). After the discovery of the first few LIR-containing proteins (LIRCPs), several definitions of the LIR-motif have appeared in the literature, highlighting a short core consensus sequence described by the regular expression pattern [WFY]xx[VLI]; accumulating experimental evidence pinpoints the importance of regions flanking the core LIR-motif. Several reports in the literature note that short linear motif binders often reside in intrinsically disordered regions; however, there is no extensive evidence in the particular case of LIR-motifs, primarily due to the lack of relevant experimental data. In this work, we explore predicted 3D structures from the AlphaFold Database and the recently reported association of regions predicted with low pLDDT score with intrinsic disorder for studying LIR-motif containing proteins. We systematically characterize the “disorderliness” properties of LIR-motifs and their flanking regions and suggest that such properties can be used to develop successful predictors of LIR-motifs in combination with sequence-derived features.
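
The core pattern [WFY]xx[VLI], where x stands for any residue, translates directly into a regular expression. The sketch below scans a sequence for every occurrence, including overlapping ones; the function name and the two test sequences are illustrative assumptions. A hit only marks a candidate, since the abstract stresses that flanking regions and disorder also matter.

```python
import re

# Core LIR pattern [WFY]xx[VLI]. Wrapping it in a lookahead makes the regex
# engine report overlapping candidates as well.
LIR_CORE = re.compile(r"(?=([WFY][A-Z]{2}[VLI]))")


def find_lir_cores(sequence):
    """Return (1-based start position, matched 4-mer) for each core hit."""
    return [(m.start() + 1, m.group(1)) for m in LIR_CORE.finditer(sequence.upper())]


print(find_lir_cores("GDDDWTHLSS"))  # [(5, 'WTHL')]
print(find_lir_cores("FAYVLI"))      # [(1, 'FAYV'), (3, 'YVLI')]
```

Because the pattern is so short, it matches by chance in many proteins, which is why additional features such as predicted disorder are needed to separate functional motifs from spurious hits.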

14:50-15:00
Inferring MicroRNA Regulation from Inspecting the Proteome
Room: Madison B
Format: Live from venue

Moderator(s): Predrag Radivojac

  • Michal Linial, The Hebrew University of Jerusalem, Israel
  • Dan Ofer, The Hebrew University of Jerusalem, Israel


Post-transcriptional regulation in multicellular organisms is mediated by microRNAs (miRNAs). However, the mechanisms that determine if a gene is regulated by miRNAs are poorly understood. Previous works focused mostly on miRNA seed matches and features of the 3’-UTR of transcripts. These common computational approaches still yield poor, inconsistent results, flooded with false positives. In this work, we present a novel, automated machine learning (ML) framework that uses sequence as well as diverse proteome-derived functional annotations to train models on multiple organisms using experimentally validated data. We present insights from millions of features extracted and ranked from different modalities. We show high predictive performance per organism and in generalization across species. We provide a list of novel predictions for Danio rerio (zebrafish) and Arabidopsis thaliana (mouse-ear cress). We found that most membranous and disease-related proteins are regulated by miRNAs, but the G-protein coupled receptor (GPCR) family is an exception, being mostly unregulated by miRNAs. We further show that the evolutionary conservation among duplicated genes does not imply a coherence in miRNA regulation. We conclude that duplicated genes diverge in their tendency to be miRNA regulated. However, protein function is informative across species in predicting post-transcriptional miRNA regulation in living cells.

15:00-15:10
Network Based Analysis of Microbial Function Reveals Putative Bacterial Pathways
Room: Madison B
Format: Live from venue

Moderator(s): Predrag Radivojac

  • Henri Chung, Iowa State University, United States
  • Yannick Mahlich, Rutgers University, United States
  • Iddo Friedberg, Iowa State University, United States
  • Yana Bromberg, Rutgers University, United States


Even though bacterial functions have been studied for over 150 years, the vast majority of the bacterial world remains unexplored. Here, we incorporated a significantly larger number of proteins into our classification scheme for a more in-depth look into microbial molecular functionality. In addition, we developed a novel network-based method for identifying relationships between functions based on their phylogenetic profiles: the presence or absence in individual bacterial strains. Linked protein functions, e.g. members of the same pathway, are likely to have been preserved or eliminated in new species throughout their evolutionary history. For a set of known fusion functions, which were mapped to Enzyme Commission Classification (EC) numbers and corresponding known pathways from Kyoto Encyclopedia of Genes and Genomes (KEGG), we demonstrate a shorter average phylogenetic distance than that of functions from different pathways. We further show that clustering the network of functional relationships captures groups of functions which participate in shared pathways. We thus highlight previously unseen groupings of functions that could potentially encode unknown or unannotated pathways. We also note that a similar approach to pathway annotation can be applied to metagenomic data for identification of emergent functionality carried out across multiple organisms.

15:10-15:20
Discovering Proteins: Function to Name
Room: Madison B
Format: Live from venue

Moderator(s): Predrag Radivojac

  • Akshay Agarwal, IBM Research, United States
  • Sara Capponi, IBM Research, CCC, United States
  • Edward Seabolt, IBM, United States
  • Kristen Beck, IBM, United States
  • James Kaufman, IBM, United States


Currently, approximately half of all microbial proteins are tagged as putative or hypothetical proteins and lack functional annotation, which leads to a reduced understanding of biological function at the genome level and limits the classification of microorganisms, especially pathogens. Here, we developed an approach to perform functional annotation of hypothetical proteins from over 50 million named proteins and 27K functional codes (InterProScan domain codes). We train 3 separate models for performing functional annotations at domain, family, and superfamily levels using Kraken. Furthermore, we construct a functional space to visualize these proteins and perform biological validation of results, while also enabling the discovery of potentially new proteins and their function. Most interestingly, this high-dimensional functional space will facilitate the shift from genotype to phenotype for named proteins. Leveraging this space, we identify function-based clusters; if new clusters are formed due to improved annotation of hypothetical proteins, we will possibly uncover and understand evolutionary paths shared with known proteins. For our work, we use data from our Functional Genomics Platform, which has over 300K prokaryotic genomes, 75 million gene sequences, 55 million protein sequences, and over 260 million functional domains.

15:20-15:30
Integrated gene expression and variant analysis to investigate CVD genes with associated phenotypes among high-risk Heart Failure patients.
Room: Madison B
Format: Live from venue

Moderator(s): Predrag Radivojac

  • Zeeshan Ahmed, Institute for Health, Health Care Policy and Aging Research, Rutgers, The State University of New Jersey, United States


Cardiovascular disease (CVD) is a leading cause of premature mortality in the US and the world. CVD comprises several complex and mostly heritable conditions, which range from myocardial infarction to congenital heart disease. Here, we report our findings from an integrative analysis of gene expression, disease-causing gene variants, and associated phenotypes among CVD populations, with a focus on high-risk Heart Failure (HF) patients. Our in-depth gene expression analysis revealed differentially expressed genes associated with HF (41 genes) and other CVDs (23 genes). Furthermore, a variant analysis of whole-genome sequence data of CVD patients identified genes with altered expression (FLNA, CST3, LGALS3, and HBA1) that carry functional and nonfunctional mutations. Our study highlights the importance of an integrative approach that leverages gene expression, genetic mutations, and clinical data to prioritize key driver genes for complex diseases and improve personalized healthcare.

16:00-16:10
Tensor decomposition and principal component analysis based unsupervised feature extraction with optimized standard deviation applied to gene expression, DNA methylation and histone modification
Room: Madison B
Format: Live from venue

Moderator(s): Kim Reynolds

  • Y-H. Taguchi, 田口善弘, Japan
  • Ryo Ishibashi, Chuo University, Japan


Tensor decomposition and principal component analysis based unsupervised feature extraction were proposed almost ten years and five years ago, respectively. Although they have been successfully applied to a wide range of problems, they have some fundamental problems: the null hypothesis that the derived principal component and singular value vectors should obey a Gaussian distribution is not fully satisfied, and the number of selected features is too small to conclude that there are no false negatives. These problems have recently been mitigated by the introduction of standard deviation optimization. In this paper, we briefly describe this recent progress.

16:10-16:20
Protein functional prediction via multi-scale data integration
Room: Madison B
Format: Live from venue

Moderator(s): Kim Reynolds

  • Shawn Gu, University of Notre Dame, United States
  • Meng Jiang, University of Notre Dame, United States
  • Pietro Hiram Guzzi, UMG, Italy
  • Tijana Milenkovic, University of Notre Dame, United States


Prediction of proteins' functions is an important task in computational biology. One common category of approaches analyzes proteins' 3D structures, which have important implications for their functions, by modeling them as protein structure networks (PSNs); in a PSN, nodes are amino acids and edges join amino acids that are close in the crystal structure of the protein. Another common category of approaches analyzes a protein-protein interaction (PPI) network (PPIN), used to model the interactions between proteins, which are ultimately what carry out cellular functioning; in a PPIN, nodes are proteins and edges are PPIs. Importantly, network-based approaches, whether using PSNs or a PPIN, have been shown to be state-of-the-art for protein functional prediction. Going further, a multi-scale relationship is evident here: a node (protein) at the PPIN level is itself a network at the PSN level. As such, we integrate these two complementary scales into a network of networks (NoN) for the first time, incorporating the advantages of both PSN- and PPIN-based modeling. Then, we show that protein functional prediction via NoN-based data integration often outperforms and even complements single-scale PSN- or PPIN-based functional prediction. As such, NoN-based data integration is an important and exciting research direction.
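
The two network scales described above can be made concrete with a small sketch. It builds a protein structure network by joining residues whose coordinates lie within a distance cutoff, then nests one such network per protein inside a PPI network to form a network of networks. The 8 Å cutoff, the use of one point per residue, the toy coordinates, and the protein names are all assumptions for illustration, not the authors' settings.

```python
import math
from itertools import combinations


def build_psn(coords, cutoff=8.0):
    """Return the PSN edge set: pairs of residue indices within `cutoff`.

    `coords` holds one (x, y, z) point per residue, e.g. a C-alpha position.
    """
    return {
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(coords), 2)
        if math.dist(a, b) <= cutoff
    }


# Toy coordinates for two hypothetical proteins.
structures = {
    "P1": [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (20.0, 0.0, 0.0)],
    "P2": [(0.0, 0.0, 0.0), (0.0, 3.8, 0.0), (0.0, 30.0, 0.0)],
}

# Network of networks: PPIN edges between proteins, one PSN per PPIN node.
network_of_networks = {
    "ppi_edges": {("P1", "P2")},
    "psn": {name: build_psn(coords) for name, coords in structures.items()},
}

print(sorted(network_of_networks["psn"]["P1"]))  # [(0, 1), (0, 2), (1, 2)]
print(sorted(network_of_networks["psn"]["P2"]))  # [(0, 1)]
```

Features can then be computed at either level, for example residue-level graph statistics within each PSN and neighborhood statistics in the PPIN, and combined for function prediction.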

16:20-16:30
Paralog function annotation with ProfileView
Room: Madison B
Format: Live-stream

Moderator(s): Kim Reynolds

  • Edoardo Sarti, Inria Université Côte d'Azur, France
  • Théo Le Moigne, Sorbonne Université, France
  • Julien Henri, Sorbonne Université, France
  • Alessandra Carbone, Sorbonne Université, France


Annotating the function of paralogous sequences has always been very challenging both in small-scale, expert-guided assays and large-scale bioinformatics studies, where paralogs are the most important source of functional annotation errors. ProfileView is a novel computational method designed to functionally classify sets of homologous sequences. It constructs a library of probabilistic models accurately representing the functional variability of protein families, and extracts biologically interpretable information from the classification process. Although ProfileView has been tested with success on several non-isofunctional protein families, paralog function recognition poses a new, complex challenge that requires careful analysis. We have devised a pipeline that applies ProfileView to the problem of paralog recognition, and transfers the knowledge of small-scale expert-guided annotations to the entire protein family. We have tested it on the 11 proteins composing the Calvin-Benson cycle (CBC), and obtained fully consistent results on 8 of them and partially consistent results on 2 others. The knowledge about paralog function annotation in the CBC is now being employed for matching same-function paralog sequences to produce joint MSAs for protein-protein interaction studies.

16:30-16:40
The field of protein function prediction as viewed by different domain scientists
Room: Madison B
Format: Live from venue

Moderator(s): Kim Reynolds

  • Rashika Ramola, Northeastern University, United States
  • Iddo Friedberg, Iowa State University, United States
  • Predrag Radivojac, Northeastern University, United States


Experimental biologists, biocurators, and computational biologists all play a role in characterizing a protein’s function. The discovery of protein function in the laboratory by experimental scientists is the foundation of our knowledge about proteins. Experimental findings are compiled in knowledge-bases by biocurators to provide standardized, readily accessible, and computationally amenable information. Computational biologists train their methods using these data to predict protein function and guide subsequent experiments. To understand the state of affairs in this ecosystem, centered here around protein function prediction, we surveyed scientists from these three constituent communities. Most strikingly, we find that experimentalists rarely use modern prediction software, but when presented with predictions, report many to be surprising and useful. Ontologies appear to be highly valued by biocurators, less so by experimentalists and computational biologists, yet controlled vocabularies bridge the communities and simplify the prediction task. Additionally, many software tools are not readily accessible and the predictions presented to the users can be broad and uninformative. To meet both the social and technical challenges in the field, a more productive and meaningful interaction between members of the core communities is necessary.

16:40-16:50
ProteinBERT: a universal deep-learning model of protein sequence and function
Room: Madison B
Format: Live-stream

Moderator(s): Kim Reynolds

  • Dan Ofer, The Hebrew University of Jerusalem, Israel
  • Nadav Brandes, The Hebrew University of Jerusalem, Israel
  • Yam Peleg, Deep Trading, Israel
  • Nadav Rappoport, Ben-Gurion University of the Negev, Israel
  • Michal Linial, The Hebrew University of Jerusalem, Israel


Self-supervised deep language modeling has shown unprecedented success across natural language tasks, and has recently been repurposed to biological sequences. However, existing models and pretraining methods are designed and optimized for text analysis. We introduce ProteinBERT, a deep language model specifically designed for proteins. Our pretraining scheme combines language modeling with a novel task of Gene Ontology (GO) annotation prediction. We introduce novel architectural elements that make the model highly efficient and flexible to long sequences. The architecture of ProteinBERT consists of both local and global representations, allowing end-to-end processing of these types of inputs and outputs.
ProteinBERT obtains near state-of-the-art performance on multiple benchmarks covering diverse protein properties (including protein structure, post-translational modifications and biophysical attributes), despite using a far smaller and faster model than competing deep-learning methods. Overall, ProteinBERT provides an efficient framework for rapidly training protein predictors, even with limited labeled data.

16:50-17:00
RAPPPID: Deep, Regularised Protein-Protein Interaction Prediction that Generalises to Unseen Proteins
Room: Madison B
Format: Live from venue

Moderator(s): Kim Reynolds

  • Joseph Szymborski, McGill University, Canada
  • Amin Emad, McGill University, Canada


Presentation Overview:

Computational methods for the prediction of protein-protein interactions, while important tools for researchers, are plagued by challenges in generalising to unseen proteins. Datasets used for modelling protein-protein predictions are particularly predisposed to information leakage and sampling biases.

In this study, we introduce RAPPPID, a method for the Regularised Automatic Prediction of Protein-Protein Interactions using Deep Learning. RAPPPID is a twin AWD-LSTM network which employs multiple regularisation methods during training to learn generalised weights. Testing on stringent interaction datasets composed of proteins not seen during training, RAPPPID outperforms state-of-the-art methods. Further experiments show that RAPPPID’s performance holds regardless of the particular proteins in the testing set and that its performance is higher for biologically supported edges. This study demonstrates that careful attention to data leakage, together with the corresponding adjustments to architecture and training routine that RAPPPID represents, is important for creating protein-protein interaction models which generalise to unseen proteins. Additionally, as part of this study, we provide datasets corresponding to several data splits of various strictness, in order to facilitate assessment of PPI reconstruction methods by others in the future.

Code and datasets are freely available at https://github.com/jszym/rapppid .

A pre-print is available at https://doi.org/10.1101/2021.08.13.456309 .

17:00-17:10
The Expanding World of Metabolic Enzymes Moonlighting as RNA Binding Proteins
Room: Madison B
Format: Live from venue

Moderator(s): Kim Reynolds

  • Constance Jeffery, University of Illinois at Chicago, United States
  • Nicole Curtis, University of Illinois at Chicago, United States
  • Cesar Siete, University of Illinois at Chicago, United States
  • Victoria Ogunniyi, University of Illinois at Chicago, United States
  • Krupa Patel, University of Illinois at Chicago, United States


Presentation Overview:

RNA binding proteins play critical roles throughout the lifetime of RNA in processing of new transcripts, regulation of translation, and RNA stability. Recently, proteomics studies have identified dozens of enzymes in intermediary metabolism that bind to RNA but do not contain canonical RNA binding domains. Studies of a few “classic” examples such as aconitase have shown that combining catalytic and RNA binding functions in one protein can be a mechanism to sense the cell’s metabolic state through availability of the enzyme’s ligands and respond by regulating translation of specific transcripts. Conversely, RNA binding could regulate the enzyme’s catalytic activity, through blocking the active site, allosteric effects, acting as a scaffold, or sequestering enzymes. More information about the locations of the RNA binding sites will aid in predicting which other proteins also act as noncanonical RNA binding proteins. Information gained from studying enzymes in carbohydrate, amino acid, and lipid metabolism that act as noncanonical RNA binding proteins will increase our understanding of the coordination between central metabolic pathways and RNA metabolism. This information can be applied in the future to the design of novel proteins that regulate RNA translation, stability, and lifetime, as well as RNAs that regulate enzyme function.

17:10-17:20
Anti-CRISPR prediction by using Transformer Model
Room: Madison B
Format: Live-stream

Moderator(s): Kim Reynolds

  • Chan-Seok Jeong, Korea Institute of Science and Technology Information, South Korea


Presentation Overview:

Anti-CRISPRs, a family of proteins that hinder the CRISPR-Cas system of the prokaryotic immune system, have recently emerged as natural inhibitors that allow post-translational regulation of the CRISPR-Cas system in a range of applications. Although experimental strategies for discovering anti-CRISPRs have been developed, bioinformatic prediction may offer a more cost-effective screening strategy. However, algorithm development is hampered by a dearth of verified anti-CRISPR data and by poor sequence similarity. Here, we describe an approach for predicting anti-CRISPR proteins from amino acid sequences by fine-tuning a pre-trained Transformer model for a classification task. We predict the anti-CRISPR function of a given amino acid sequence by adding a classification layer to a Transformer model pre-trained on unlabeled amino acid sequences from Pfam. The resulting model is fine-tuned with further training on validated anti-CRISPR and putative non-CRISPR prophage data sets. We evaluate performance on independent data sets in comparison with conventional predictors. Unlike traditional predictors, which require additional feature calculations and pre-filtering procedures, the present method requires only amino acid sequences, making it well suited to genome-scale investigations.

17:20-17:40
Proceedings Presentation: DeepMHCII: A Novel Binding Core-Aware Deep Interaction Model for Accurate MHC II-peptide Binding Affinity Prediction
Room: Madison B
Format: Live-stream

Moderator(s): Kim Reynolds

  • Ronghui You, Fudan University, China
  • Wei Qu, Fudan University, China
  • Hiroshi Mamitsuka, Kyoto University / Aalto University, Japan
  • Shanfeng Zhu, Fudan University, China


Presentation Overview:

Motivation: Computationally predicting MHC-peptide binding affinity is an important problem in immunological bioinformatics. Recent cutting-edge deep learning-based methods for this problem are unable to achieve satisfactory performance for MHC class II molecules. This is because such methods generate the input by simply concatenating the two given sequences, (the estimated binding core of) a peptide and (the pseudo sequence of) an MHC class II molecule, ignoring biological knowledge behind the interactions of the two molecules. We thus propose a binding core-aware deep learning-based model, DeepMHCII, with a binding interaction convolution layer (BICL), which integrates all potential binding cores (in a given peptide) with the MHC pseudo (binding) sequence by modeling the interaction with multiple convolutional kernels.
Results: Extensive empirical experiments with four large-scale datasets demonstrate that DeepMHCII significantly outperformed four state-of-the-art methods under numerous settings, such as five-fold cross-validation, leave one molecule out, validation with independent testing sets, and binding core prediction. All these results, together with visualization of the predicted binding cores, indicate the effectiveness of our model, DeepMHCII, and the importance of properly modeling biological facts in deep learning for high predictive performance and efficient knowledge discovery.
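
As a rough illustration of what "all potential binding cores (in a given peptide)" means, the sketch below enumerates every contiguous window of a peptide. It is not the DeepMHCII implementation: the function name is hypothetical, the default core length of 9 residues is an assumption not stated in the abstract, and the peptide is an arbitrary example.

```python
def candidate_binding_cores(peptide: str, core_len: int = 9) -> list[tuple[int, str]]:
    """Return (start_offset, core) for every contiguous window of core_len residues.

    core_len=9 is an assumed default, not a value taken from the abstract.
    A peptide shorter than core_len yields an empty list.
    """
    if core_len <= 0:
        raise ValueError("core_len must be positive")
    return [
        (start, peptide[start:start + core_len])
        for start in range(len(peptide) - core_len + 1)
    ]


# Arbitrary 13-residue example peptide (hypothetical input).
cores = candidate_binding_cores("PKYVKQNTLKLAT")
for start, core in cores:
    print(start, core)
```

A model that is aware of binding cores would score each of these windows against the MHC pseudo sequence instead of committing to a single pre-estimated core.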

17:40-17:50
Ontological analysis of coronavirus associated human genes at the COVID-19 Disease Portal
Room: Madison B
Format: Live from venue

Moderator(s): Kim Reynolds

  • Shur-Jen Wang, Medical College of Wisconsin, United States
  • Jeff De Pons, Medical College of Wisconsin, United States
  • Wendy Demos, Medical College of Wisconsin, United States
  • G. Thomas Hayman, Medical College of Wisconsin, United States
  • Morgan Lee Hill, Medical College of Wisconsin, United States
  • Mary Kaldunski, Medical College of Wisconsin, United States
  • Stan Laulederkind, Medical College of Wisconsin, United States
  • Jennifer R. Smith, Rat Genome Database, Medical College of Wisconsin, United States
  • Monika Tutaj, Medical College of Wisconsin, United States
  • Mahima Vedi, Medical College of Wisconsin, United States
  • Melinda Dwinell, Medical College of Wisconsin, United States
  • Anne Kwitek, Medical College of Wisconsin, United States


Presentation Overview:

The Disease Portal at the Rat Genome Database (RGD) is a premier disease resource where disease-associated genome objects are integrated with their associated data in one place according to disease areas. In response to the SARS-CoV-2 pandemic, RGD developed a “COVID-19 Diseases Portal” (https://rgd.mcw.edu/rgdweb/portal/home.jsp?p=14). In the COVID-19 Portal, gene-disease associations are established by manual curation of PubMed literature. All the functional annotations and genome data associated with these disease genes are accessible from the portal. We performed analyses on the COVID-19 disease gene set using tools developed at RGD. The Disease Ontology term enrichment analysis showed that the COVID-19 disease gene set is highly enriched with coronavirus infectious disease and related diseases. Several less related disease areas, such as liver disease and rheumatic disease, were also highly enriched. Using the comparison heatmap, we showed that close to 60 percent of the COVID-19 genes were associated with nervous system disease and 40 percent were associated with gastrointestinal disease. Our analysis confirms the role of the immune system in COVID-19 pathogenesis, as shown by substantial enrichment of immune system related Gene Ontology terms. Using the integrated data sets in the RGD COVID-19 Portal will help elucidate mechanisms of COVID-19 and ultimately lead to prevention or treatments.
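
Ontology term enrichment of the kind described above is often assessed with a one-sided hypergeometric test. The sketch below shows that generic calculation only; the abstract does not say which statistic the RGD tools use, and the function name and the numbers in the example are hypothetical.

```python
from math import comb


def enrichment_p_value(population: int, annotated: int, sample: int, hits: int) -> float:
    """One-sided hypergeometric tail probability P(X >= hits).

    population: number of genes in the background set
    annotated:  background genes carrying the ontology term
    sample:     size of the gene set of interest
    hits:       genes in the set of interest carrying the term
    """
    upper = min(annotated, sample)
    total = comb(population, sample)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return tail / total


# Hypothetical numbers: 20,000 background genes, 400 annotated with a term,
# a 300-gene disease set of which 30 carry the term.
p = enrichment_p_value(20_000, 400, 300, 30)
print(p)
```

A small p-value suggests the term occurs in the gene set more often than expected by chance; in practice such values would also be corrected for testing many terms at once.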