Presentation Overview: Show
There are now several hundred million protein sequences and well over 100,000 protein structures, bu...
Presentation Overview: Show
There are four recent events that have set the stage for incredible opportunities for careers in pro...
Presentation Overview: Show
MicroRNAs (miRNAs) regulate gene expression and have recognized roles in numerous physiological processes and diseases, including cancer. A single nucleotide polymorphism (SNP) in the miR-125a gene leads to poor breast cancer prognosis by blocking the cleavage of the primary miRNA transcript (pri-miRNA). The SNP does not affect pri-miR-125a’s minimal energy structure, leading to the hypothesis that the lost signal must be dynamical. Leveraging high-throughput data on the maturation efficiency of 29 478 pri-miR-125a mutant sequences, we applied our sequence-sensitive elastic network ENCoM to study the full 3D conformational space of pri-miRNAs and its impact on their maturation efficiency. The model predicts maturation efficiency with high accuracy (predictive R-squared of 0.75) when both the ENCoM dynamical signature and the MC-Fold enthalpy of folding are combined, highlighting the synergy between these respectively entropy- and enthalpy-based methods. Looking at the patterns apparent from the model’s coefficients, we corroborate motifs previously identified as necessary for the cleavage of pri-miRNAs but also challenge established notions such as the necessity for a rigid hairpin structure. Our novel approach is fast enough to predict theoretical maturation efficiencies for millions of miRNA sequences, the extremes of which we are currently testing in the lab.
Presentation Overview: Show
Motivation: We propose a practical algorithm based on graph theory to identify CTCF-mediated chromatin loops that are linked in 3D space. Our method is based on finding certain graph structures, K6 minors, in graphs constructed from pairwise chromatin interaction data obtained from ChIA-PET experiments. We show that such graph structures, representing a particular arrangement of loops, mathematically necessitate linking if they co-occur in an individual cell. The presence of these linked structures can advance our understanding of the principles of spatial organization of the genome. Results: We apply our method to graphs created from in situ ChIA-PET data for the GM12878, H1ESC, HFFC6 and WTC11 cell lines, and from long-read ChIA-PET data. We look at these datasets as divided into CCDs – closely interconnected regions defined on the basis of CTCF loops. We find numerous candidate regions with minors, indicating the presence of links. The graph-theoretic characteristics of these linked regions, including betweenness and closeness centrality, differ from those of regions in which no minors were found, which supports their non-random nature. We provide two versions of the algorithm: one efficient enough to be applied to large datasets, and the other with greater detection capabilities.
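As an illustration of one of the graph characteristics mentioned above, the following minimal sketch computes closeness centrality on a small graph whose nodes stand for loop anchors and whose edges stand for pairwise interactions. The anchor names and the edge list are hypothetical, and the code does not attempt the K6 minor search itself.

```python
from collections import deque

def build_graph(loops):
    """Build an undirected adjacency map from (anchor_a, anchor_b) pairs."""
    graph = {}
    for a, b in loops:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph

def closeness_centrality(graph, source):
    """(number of reachable nodes) / (sum of shortest-path lengths to them)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:  # breadth-first search gives shortest paths in an unweighted graph
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    total = sum(dist.values())
    return (len(dist) - 1) / total if total else 0.0

# Hypothetical anchors and loops, for illustration only.
loops = [("A1", "A2"), ("A2", "A3"), ("A3", "A4"), ("A2", "A4")]
graph = build_graph(loops)
scores = {node: closeness_centrality(graph, node) for node in sorted(graph)}
print(scores)
```

In this toy graph the anchor A2 touches every other anchor directly, so it receives the highest closeness score.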
Presentation Overview: Show
Glycans play important roles in protein folding and cell-cell interactions – and, furthermore, glycosylation of protein antigens can dramatically impact immune responses. Previously, we developed an in silico tool, GLYCO (GLYcan COverage), to quantify the glycan shielding of protein surfaces. We applied it to determine the glycan-free surface of the SARS-CoV-2 NTD supersite and to correlate glycan coverage with antigen-antibody properties. Here we developed a user-friendly web server, GLYCO-2.0, and improved the computational speed by replacing the previous linear parametrization with a new analytical cylinder method with KD-trees when retrieving atom positions within the coordinate space. The use of these new methods increased computational speed by ~4-5 fold in single and multiprocessing settings. GLYCO-2.0 can estimate glycan shielding from a single coordinate file or from multiple frames derived from, for instance, molecular dynamics simulations or NMR spectroscopy, to account for the inherent flexibility of oligosaccharides. The server offers email notifications, allowing the retrieval of results within a week. Also, we showcased the applicability of GLYCO-2.0 by estimating the glycan shield development of influenza’s hemagglutinin proteins over time. Overall, quantification of glycans by GLYCO-2.0 provides a comprehensive understanding of glycan shielding of glycosylated proteins and contributes to glycoprotein-involved research such as vaccine design.
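To show why a KD-tree speeds up the retrieval of atom positions, here is a minimal pure-Python sketch of a 3D KD-tree with a radius query. It is a generic textbook construction, not the GLYCO-2.0 implementation; the coordinates, the 2.5 Å radius, and all function names are assumptions made for illustration.

```python
import math

def build_kdtree(points, depth=0):
    """Recursively split points on x, y, z in turn; returns nested dicts."""
    if not points:
        return None
    axis = depth % 3
    ordered = sorted(points, key=lambda p: p[axis])
    mid = len(ordered) // 2
    return {
        "point": ordered[mid],
        "axis": axis,
        "left": build_kdtree(ordered[:mid], depth + 1),
        "right": build_kdtree(ordered[mid + 1:], depth + 1),
    }

def query_radius(node, center, radius, found=None):
    """Collect every stored point within `radius` of `center`."""
    if found is None:
        found = []
    if node is None:
        return found
    point, axis = node["point"], node["axis"]
    if math.dist(point, center) <= radius:
        found.append(point)
    diff = center[axis] - point[axis]
    near, far = (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    query_radius(near, center, radius, found)
    # The far side can only contain hits if the sphere crosses the split plane.
    if abs(diff) <= radius:
        query_radius(far, center, radius, found)
    return found

# Hypothetical glycan atom coordinates (in Å) and a hypothetical surface point.
glycan_atoms = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (5.0, 5.0, 5.0),
                (0.0, 2.0, 0.0), (10.0, 0.0, 0.0)]
tree = build_kdtree(glycan_atoms)
nearby = sorted(query_radius(tree, (0.0, 0.0, 0.0), 2.5))
print(nearby)
```

The pruning step is what avoids comparing the query point against every atom, which matters when the same query is repeated over many surface points and many frames.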
Presentation Overview: Show
Human interferon-gamma (hIFNγ) is a crucial immunomodulating cytokine, which binds to a high-affinity cellular receptor hIFNγR1. The cytokine also binds to the glycosaminoglycans (GAGs) heparin and heparan sulfate (HS), which modulates its physico-chemical properties.
We report molecular dynamics studies of the interaction of hIFNγ and HS-derived oligosaccharides in two different scenarios – in the circulation, and at the cell-surface, when the cytokine forms a complex with its receptor.
HS oligosaccharides bind to the C-termini of free IFNγ with high affinity, forming very stable complexes due to the strong electrostatic attraction, and also interact with the positively charged solvent-exposed domains in the cytokine globule. This impedes further interaction of the cytokine with hIFNγR1.
On the other hand, GAGs, and HS in particular, may be crucial participants in the formation of the hIFNγ–hIFNγR complex at the cell surface. Our in silico results demonstrate that placing HS oligosaccharides between the two receptor units facilitates the formation of the cytokine–receptor complex by pulling down the hIFNγ globule via electrostatic attraction of its C-termini. Experiments performed on cell culture confirm that inhibition of the sulfation of HS proteoglycans by addition of NaClO3 to the cell medium leads to decreased hIFNγ activity.
Presentation Overview: Show
Cellular functions are governed by proteins. While some proteins work independently, most function by interacting with each other. It is crucially important to know the binding sites that facilitate the interactions. Experimental methods are costly and time-consuming; it is therefore essential to develop effective computational methods. We present PITHIA, a deep learning model for protein interaction site prediction that exploits several of the most powerful tools in bioinformatics: alignment, attention, and embedding. The recently introduced MSA-transformer, a language model that surpasses previous unsupervised methods by a wide margin, uses the power of attention to learn from millions of multiple sequence alignments. We use the contextual embeddings produced by the MSA-transformer as inputs to our program. The architecture of PITHIA is attention based as well, selected by a thorough comparison with multiple candidates. For meaningful comparison with existing programs, we update several widely used datasets with the most current protein binding site information and create a new one, which is the largest and most challenging to date. PITHIA greatly surpasses the competition on five datasets with respect to multiple measures, exceeding the closest competitor by up to 35% in terms of area under the precision-recall curve.
Presentation Overview: Show
Neoepitopes (neoantigens) are cancer-specific antigens and are significant candidates for therapeutic cancer vaccines. Epitopes bind the major histocompatibility complex (MHC), which is an immune receptor. Tumor neoepitopes induce an immune response to eliminate cancer cells. This immune activation depends on the affinity between antigen peptide and MHC ligand. Epitope-MHC binding assays are technically difficult, time-consuming, and expensive experiments. Therefore, prediction tools that estimate the affinity between antigen peptide and MHC ligand have been developed using computational approaches. However, the volume of available data is insufficient for predicting epitope-MHC binding, and the performance of these predictions is not yet adequate. Here, we propose a novel deep learning model that can predict epitope-MHC binding from a small amount of training data.
MTL4MHC2 has two multi-task Bi-LSTM models, which are the antigen peptides learning model and the MHC peptides learning model. Each multi-task model shares the learning parameters of MHC class I and II. MTL4MHC2 achieves an AUC-ROC score of 82.2%, outperforming state-of-the-art models.
We demonstrated the effectiveness of multi-task learning for improving prediction performance when little data is available. MTL4MHC2 can be applied to the development of novel cancer therapeutics such as cancer vaccines.
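For readers unfamiliar with the AUC-ROC metric quoted above, the sketch below computes it as the fraction of (binder, non-binder) pairs in which the binder receives the higher score, counting ties as one half. The labels and scores are invented for illustration and are not MTL4MHC2 outputs.

```python
def roc_auc(labels, scores):
    """AUC-ROC as the probability that a positive outranks a negative."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))

# Hypothetical binder labels (1 = binds) and predicted binding scores.
labels = [1, 0, 1, 0, 1]
scores = [0.9, 0.6, 0.7, 0.3, 0.5]
auc = roc_auc(labels, scores)
print(round(auc, 3))
```

Five of the six binder/non-binder pairs are ordered correctly here, so the value is 5/6. A score of 0.5 corresponds to random ranking and 1.0 to perfect ranking.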
Presentation Overview: Show
Catalytic, binding and metal-binding sites are important and conserved regions of proteins. Their identification can provide important information and insights into protein function. Several computational methods have been developed to identify binding sites based on both sequence and structural information. These have, however, presented limited performance, mostly relying on structural similarity, restricting their application to small binding sites, and not being capable of handling conservative mutations or identifying inter-domain sites.
Here we present the GASS platform, a family of methods for searching similar sites in proteins based on parallel genetic algorithms. GASS was previously successfully used to search for similar catalytic and binding sites, based on templates from the Mechanism Catalytic Site Atlas (M-CSA), correctly identifying more than 90% of the catalogued catalytic sites, ranking fourth among the 18 methods in the CASP 10 competition. GASS was also compared with 8 other state-of-the-art methods for detecting metal-binding sites, outperforming similar methods and achieving an MCC of up to 0.57 and detecting up to 96% of the metal-binding sites correctly.
The GASS platform (https://gassmetal.unifei.edu.br, http://gass.unifei.edu.br/) provides accurate and easy-to-use methods that can be adapted to searching for binding sites in proteins.
Presentation Overview: Show
To assign structural and functional annotations to the ever-increasing amount of sequenced proteins, the main approach relies on sequence-based homology search methods, e.g. BLAST or profile Hidden Markov Model methods, which rely on significant alignments of query sequences to annotated proteins or protein families. While powerful, these approaches do not take coevolution between residues into account. Taking advantage of recent advances in the field of contact prediction, we propose to represent proteins by Potts models, which model direct couplings between positions in addition to positional composition, and to compare proteins by aligning these models. Due to non-local dependencies, this problem is computationally hard.
We introduced PPalign, a program based on Integer Linear Programming, to compute the optimal pairwise alignment of Potts models representing proteins. The approach was assessed on reference pairwise sequence alignments with low sequence identity (3% to 20%). In these experiments, Potts models were aligned in reasonable time (1’37” on average), and PPalign yielded a better mean F1 score and, in some cases, found significantly better alignments than HHalign and independent-site PPalign.
These results show that pairwise couplings from protein Potts models can be used to improve the alignment of remotely related protein sequences in tractable time.
Presentation Overview: Show
We employ coarse-grained normal mode analysis to calculate dynamical signatures of different ligand/G-protein-coupled receptor (GPCR) complexes. Dynamical signatures show changes in flexibility of different parts of the structure upon ligand binding. As a first experiment, we docked a large set of ligands with known Emax for GTP-gammaS binding to crystal structures of the active mu (MOR) and kappa (KOR) opioid receptors, calculated the dynamical signature for each ligand and obtained predictors using multiple linear regression. We obtained Pearson’s correlations of R=0.46 and R=0.57 in a leave-one-out validation (a scenario where we present a totally new ligand to the system) and of R=0.8 and R=0.7 in an 80:20 validation (a best-case scenario where new molecules are like training set molecules), for MOR and KOR respectively. These results show that even with a limited training set, we can get a good estimate of the Emax of new drug candidates, therefore predicting their role as agonists, antagonists, or partial agonists computationally and potentially as part of high-throughput screening. Moreover, by analyzing the coefficients of these predictors, we see which regions of the receptor have the largest influence on its activation (highlighting helices 5, 6 and the binding site).
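The leave-one-out protocol described above can be sketched as follows. For brevity the example fits a one-variable least-squares line rather than the multiple linear regression used in the abstract, and the signature values and Emax values are made up; only the validation loop and the Pearson correlation are the point.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def pearson(a, b):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def leave_one_out_predictions(xs, ys):
    """Predict each point from a model trained on all the other points."""
    predictions = []
    for i in range(len(xs)):
        slope, intercept = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        predictions.append(slope * xs[i] + intercept)
    return predictions

# Hypothetical one-dimensional "signature" feature and Emax values.
signature = [0.10, 0.25, 0.40, 0.55, 0.70, 0.85]
emax = [12.0, 30.0, 41.0, 48.0, 66.0, 80.0]
predicted = leave_one_out_predictions(signature, emax)
r = pearson(predicted, emax)
print(round(r, 3))
```

Because every prediction comes from a model that never saw the held-out ligand, the resulting correlation is a more conservative estimate than one computed on the training data.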
Presentation Overview: Show
Discovering genome-wide chemical-protein interactions is instrumental for chemical genomics, drug discovery and precision medicine. However, more than 90% of gene families remain dark, i.e., their small molecular ligands are undiscovered. Existing approaches typically fail when the dark protein of interest differs from those with known ligands or structures. To address this challenge, we developed a deep learning framework PortalCG. PortalCG consists of three novel components: (i) end-to-end step-wise transfer learning in recognition of sequence-structure-function paradigm, (ii) out-of-cluster meta-learning in light of protein evolution for generalizing machine learning models to unstudied gene families, and (iii) stress model selection to facilitate model deployment in a real-world scenario. In rigorous benchmark experiments, PortalCG considerably outperformed state-of-the-art sequence- and structure-based techniques when applied to dark gene families. Experimental validations on 65 compounds supported the accuracy and robustness of PortalCG. Thus, PortalCG is a viable solution to the out-of-distribution (OOD) problem in exploring the dark protein functional space, and can be applied to a wide variety of scientific domains.
Presentation Overview: Show
Prediction and structural modeling of protein-protein interactions (PPIs) are essential for understanding biological processes. Most large-scale experimental and computational approaches that predict PPIs do not provide structural information. We present a novel approach, XLEC, combining cross-linking mass spectrometry (XL-MS) and evolutionary couplings (ECs) data for efficient proteome-wide prediction and modeling of PPIs. While ECs derived from multiple sequence alignments primarily yield information on direct contacts between proteins across the interface, XL-MS data preferentially captures longer-range interactions, hence these methods contain complementary information. XLEC integrates information from both approaches in a machine learning-based model and subsequent constraint-based modeling of the complex structure. We applied XLEC to data from murine mitochondrial proteomes and compared its performance to those of XL-MS and ECs separately. Our preliminary assessment suggests that XLEC outperforms XL-MS or ECs-based identification of PPIs (precision/recall: XLEC 76%/76%; XL-MS only: 71%/57%; ECs: 68%/57%). Furthermore, XLEC-based modeling of PPIs achieved excellent L-RMSD (<10 Å) for 20% of the benchmark dataset (XL-MS only: 2%; ECs only: 11%). Using XLEC, we generated around 500 de novo PPI models revealing novel insights into the mitochondrial interactome.
Presentation Overview: Show
With the recent advances in protein 3D structure prediction, protein interactions are becoming more central than ever before. Here, we address the problem of determining how proteins interact with one another. More specifically, we investigate the possibility of discriminating near-native protein complex conformations from incorrect ones by exploiting local environments around interfacial residues. Deep Local Analysis (DLA)-Ranker is a deep learning framework applying 3D convolutions to a set of locally oriented cubes representing the protein interface. It explicitly considers the local geometry of the interfacial residues along with their neighboring atoms and the regions of the interface with different solvent accessibility. We assessed its performance on three docking benchmarks made of half a million acceptable and incorrect conformations. We show that DLA-Ranker successfully identifies near-native conformations from ensembles generated by molecular docking. It surpasses or competes with other deep learning-based scoring functions. We also showcase its usefulness to discover alternative interfaces.
Presentation Overview: Show
In the context of the crowded cellular environment, one of the important challenges is to elucidate how proteins distinguish their native partners from a wide variety of non-interactors. The increasing availability of experimentally determined protein-protein complexes provides an opportunity to investigate preferences in protein-protein interactions. We systematically explored the shape complementarity of the interacting proteins using binary hetero complexes from the Protein Data Bank (PDB). The results showed that protein shape characteristics and the corresponding intermolecular energy landscape, sampled by a systematic docking protocol, can discriminate the non-interacting proteins. The number of minima on the energy landscape of known protein interactors, as well as the clustering patterns of the energy minima, are different from those of the non-native protein ligands. The findings provide an insight into fundamental properties of protein recognition. The results can be used to generate more adequate sets of protein-protein complexes for knowledge-based modeling.
Presentation Overview: Show
Computational methods to predict protein-protein interaction (PPI) typically segregate into sequence-based "bottom-up" methods that infer properties from the characteristics of the individual protein sequences, or global "top-down" methods that infer properties from the pattern of already known PPIs in the species of interest. However, a way to incorporate top-down insights into sequence-based bottom-up PPI prediction methods has been elusive. We thus introduce Topsy-Turvy, a method that newly synthesizes both views in a sequence-based, multi-scale, deep-learning model for PPI prediction. While Topsy-Turvy makes predictions using only sequence data, during the training phase it takes a transfer-learning approach by incorporating patterns from both global and molecular-level views of protein interaction. In a cross-species context, we show it achieves state-of-the-art performance, offering the ability to perform genome-scale, interpretable PPI prediction for non-model organisms with no existing experimental PPI data. In species with available experimental PPI data, we further present a Topsy-Turvy hybrid (TT-Hybrid) model which integrates Topsy-Turvy with a purely network-based model for link prediction that provides information about species-specific network rewiring. TT-Hybrid makes accurate predictions for both well- and sparsely-characterized proteins, outperforming both its constituent components as well as other state-of-the-art PPI prediction methods. Furthermore, running Topsy-Turvy and TT-Hybrid screens is feasible for whole genomes, and thus these methods scale to settings where other methods (e.g., AlphaFold-Multimer) might be infeasible. The generalizability, accuracy and genome-level scalability of Topsy-Turvy and TT-Hybrid unlock a more comprehensive map of protein interaction and organization in both model and non-model organisms.
Software availability: https://topsyturvy.csail.mit.edu
Presentation Overview: Show
Proteins mediate the critical processes of life and beautifully solve the challenges faced during th...
Presentation Overview: Show
AlphaFold2, an ML-based method developed by DeepMind, revolutionised the field of structural biology by predicting the 3D structure of proteins with an accuracy often comparable to experimental characterization. In a joint effort with EMBL-EBI, protein structures for 21 model organisms were made available. To exploit these, assigning modelled domains to their evolutionary families helps in understanding how genetic variations modify structure and ultimately function. The CATH database includes evolutionary relationships between protein domains and classifies them into superfamilies. We identify structural domains in AlphaFold2 models and classify them in CATH. While most domain assignments are obtainable by Hidden Markov Model-based methods, remote homologs are often elusive. We recently established CATHe, a supervised machine learning approach that exploits sequence embeddings from the ProtT5 PLM to detect remote homologs. Using CATHe and a new fast structural aligner, Foldseek, we established thresholds for confirming homology. Before structurally validating the assignments, small, disordered, non-globular domains or poorly packed domains were removed. 93% of domains passing these thresholds could be brought into CATH, with the remainder belonging to ~4200 putative novel families. Manual curation efforts on human domains from these novel families led to the identification of one new architecture and ~100 new folds.
Presentation Overview: Show
Structure comparison is fundamental for understanding proteins, specifically for studying their sequence and structural evolution and for guiding our efforts to predict their structures from their sequences of amino acids. Coordinate based structural alignment methods optimize the distances traversed by aligned residue pairs during the linear interpolation between two superimposed structures. Current alignment scores do not take into account if there is room for this morph, if it causes steric clashes or if it causes topological changes to the compared structures.
ProteinAlignmentObstruction finds steric clashes and self-intersections occurring during the linear interpolation between two aligned and superimposed structures. Self-intersections that can be avoided by re-folding at most M (user-defined) residues are called removable and the remaining self-intersections detect different threading or topology and are called essential.
We find examples of homologous protein pairs with distinct threading, and many pairs of distinctly classified folds that are easily morphed into each other, emphasizing the continuous nature of parts of protein fold space. I will present our new server, Steric and TOPological Model Hindrance, and examples of threading errors it finds in CASP14 models. There are many applications where the ability to detect whether structures are close in configuration space may prove important.
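The idea of checking the linear interpolation between two superimposed structures for steric clashes can be sketched as below. This simplified version only samples the morph at discrete steps and tests pairwise distances between residues that are not close in sequence; it does not detect self-intersections of the chain, and the cutoff, step count, and coordinates are assumptions chosen for illustration.

```python
import math

def interpolate(start, end, t):
    """Linear interpolation between two equally long coordinate lists."""
    return [tuple(a + t * (b - a) for a, b in zip(p, q)) for p, q in zip(start, end)]

def min_nonadjacent_distance(coords, min_separation=3):
    """Smallest distance between residues at least `min_separation` apart in sequence."""
    best = math.inf
    for i in range(len(coords)):
        for j in range(i + min_separation, len(coords)):
            best = min(best, math.dist(coords[i], coords[j]))
    return best

def morph_clashes(start, end, cutoff=3.0, steps=10):
    """Return the sampled t values at which the morph brings two residues too close."""
    clashes = []
    for k in range(steps + 1):
        t = k / steps
        if min_nonadjacent_distance(interpolate(start, end, t)) < cutoff:
            clashes.append(t)
    return clashes

# Toy example: residues 0 and 3 swap places, passing through each other mid-morph.
start = [(0.0, 0.0, 0.0), (0.0, 10.0, 0.0), (10.0, 10.0, 0.0), (10.0, 0.0, 0.0)]
end = [(10.0, 0.0, 0.0), (0.0, 10.0, 0.0), (10.0, 10.0, 0.0), (0.0, 0.0, 0.0)]
clash_times = morph_clashes(start, end)
print(clash_times)
```

Both end states are clash-free in this toy case, yet the straight-line morph between them is obstructed, which is the kind of situation a distance-based alignment score alone would not reveal.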
Presentation Overview: Show
Protein families evolve by the accumulation of sequence variations that translate into changes in the folding pathways and the structure and dynamics of the native state of their members. These changes are constrained by the features of the folding energy landscape as well as the cellular context where these proteins perform their molecular function.
Natural proteins fold by minimizing the energetics of those interactions that are present in their native states. Although the free energy is globally minimized, not all interactions that are present in the native state can be energetically optimized. These conflicting, frustrated, signals have been linked with different functional aspects such as protein-protein interactions, allosterism and catalytic activity.
Here we present FrustraEvo, a tool that measures local frustration conservation patterns within protein families as a proxy to define residues that are important either for stability or function and relate them to their sequence variability signatures. We additionally compare homologous protein families to understand how they have diversified their functional patterns from a common ancestral origin. We will showcase how FrustraEvo can shed light on the function of structurally characterized protein families as well as of poorly characterized ones, thanks to recent advances in structure prediction.
Presentation Overview: Show
Motivation: Alignments are correspondences between sequences. How reliable are alignments of amino-acid sequences of proteins, and what inferences about protein relationships can be drawn? Using techniques not previously applied to these questions, by weighting every possible sequence alignment by its posterior probability we derive a formal mathematical expectation, and develop an efficient algorithm for computation of the distance between alternative alignments allowing quantitative comparisons of sequence-based alignments with corresponding reference structure alignments.
Results: By analyzing the sequences and structures of one million protein domain pairs, we report the variation of the expected distance between sequence-based and structure-based alignments, as a function of (Markov time of) sequence divergence. Our results clearly demarcate the 'daylight', 'twilight' and 'midnight' zones for interpreting residue-residue correspondences from sequence information alone.
Presentation Overview: Show
For the past half-century, structural biologists relied on the notion that similar protein sequences give rise to similar structures and functions. While this assumption has driven research to explore certain parts of the protein universe, it disregards spaces that don't rely on this assumption. Here we explore areas of the protein universe where similar protein functions can be achieved by different sequences and different structures. We created an open access database of ~200,000 newly predicted structures for diverse protein sequences from 1,003 representative genomes across the microbial tree of life and annotated them functionally on a per-residue basis. Structure prediction is accomplished using the World Community Grid. To make our analysis more robust, we used two methods: Rosetta and DMPFold. Critical parts of our dataset were also verified with AlphaFold, which confirmed our previous predictions. The resulting database is complementary to the AlphaFold database with regard to domains of life as well as sequence diversity and sequence length. We annotate these models functionally and analyze the features of the resulting protein structure-function universe, including fold novelty and structure-function relationships. We identify 148 novel folds and describe examples where we map specific functions to structural motifs.
Presentation Overview: Show
Biochemical and biological functions of proteins are the product of both the overall fold of the polypeptide chain, and, typically, structure motifs made up of smaller numbers of amino acids constituting e.g. the DNA-binding His2/Cys2 Zinc Finger motif or the His-Asp-Ser catalytic triad in serine proteases. We present an efficient yet flexible structure motif search service, which allows users for the first time to search for arbitrary structure motifs across the entire PDB archive within seconds.
Our implementation is available as part of RCSB Protein Data Bank rcsb.org web portal and its search infrastructure. Users can extract motifs from the 3D view of a protein structure, execute their search, and visualize all alignments between the query motif and each retrieved hit using the Mol* 3D viewer. As a proof-of-concept, we demonstrate how our solution readily scales to millions of protein structures. It is capable of searching the entire PDB archive plus the AlphaFold Protein Structure Database within seconds. Efficient methods to navigate the 3D structure space have never been more in demand and the RCSB PDB structure motif search service is one of the tools that allow users to effectively confront the deluge of 3D biostructure data.
Presentation Overview: Show
Highly accurate structure prediction methods, such as AlphaFold2 and RoseTTAFold, are generating an avalanche of publicly available protein structures. Searching through these structures with current structural alignment tools is becoming the main bottleneck in their analysis. Here we propose Foldseek, a fast and sensitive protein structure alignment method for comparing large structure sets. Foldseek encodes structures as sequences over a 20-state 3Di alphabet. 3Di describes discretized tertiary residue-residue interactions, which is critical for reaching high sensitivities. Foldseek's novel local alignment stage combines structural and amino acid substitution scores to improve sensitivity without sacrificing speed. It reaches sensitivities similar to state-of-the-art structural aligners while being at least 20,000 times faster. The open-source Foldseek software is available at foldseek.com and a webserver at search.foldseek.com.
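The sentence about combining structural and amino acid substitution scores in a local alignment can be illustrated with a toy Smith-Waterman recursion in which each aligned position is scored from both a structural-state string and an amino acid string. This is not Foldseek's algorithm or its substitution matrices: the match and mismatch values, the weight, the linear gap penalty, and the lowercase state letters are all placeholders invented for this sketch.

```python
def combined_score(aa_a, aa_b, state_a, state_b, aa_weight=0.5):
    """Toy per-position score: structural state match plus weighted residue match."""
    aa_score = 2.0 if aa_a == aa_b else -1.0          # placeholder, not a real matrix
    state_score = 4.0 if state_a == state_b else -2.0  # placeholder, not a real matrix
    return state_score + aa_weight * aa_score

def local_alignment_score(aa1, states1, aa2, states2, gap=3.0):
    """Best Smith-Waterman local alignment score with a linear gap penalty."""
    best = 0.0
    previous = [0.0] * (len(aa2) + 1)
    for i in range(1, len(aa1) + 1):
        current = [0.0] * (len(aa2) + 1)
        for j in range(1, len(aa2) + 1):
            diagonal = previous[j - 1] + combined_score(
                aa1[i - 1], aa2[j - 1], states1[i - 1], states2[j - 1])
            # Local alignment: a cell never drops below zero.
            current[j] = max(0.0, diagonal, previous[j] - gap, current[j - 1] - gap)
            best = max(best, current[j])
        previous = current
    return best

# Hypothetical residue strings paired with hypothetical structural-state strings.
score_same = local_alignment_score("ACD", "vvd", "ACD", "vvd")
score_different = local_alignment_score("AAA", "ppp", "CCC", "qqq")
print(score_same, score_different)
```

Because both strings are ordinary sequences over a small alphabet, the comparison can reuse fast sequence-alignment machinery, which is the intuition behind encoding structure as letters.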
Presentation Overview: Show
The structure of proteins can help us understand the mechanism of diseases associated with missense mutations and help develop therapeutics. With improved deep learning techniques such as RoseTTAFold and AlphaFold, we can predict the structure of proteins even in the absence of structural homologues. We modelled and extracted the domains from 553 disease-associated human proteins without known protein structures or sequence homologues in the Protein Data Bank. Domains that could be assigned to CATH superfamilies had higher quality and lower RMSD between AlphaFold and RoseTTAFold models compared to those that could only be assigned to Pfam or neither. Using these models, we predicted ligand-binding sites, protein-protein interfaces, conserved residues, destabilising effects, and pathogenicity caused by missense mutations. We could explain 80% of these disease-associated mutations based on proximity to functional sites, structural destabilization, or pathogenicity. These mutations were more buried, more pathogenic, closer to predicted functional sites and had higher predicted ddG of mutation compared to polymorphisms. Using models from the two state-of-the-art techniques, and requiring multiple predictors to agree that the same mutation has an effect, provides higher confidence in our predictions. We explain 93 additional mutations based on RoseTTAFold models which could not be explained based solely on AlphaFold models.
Presentation Overview: Show
State-of-the-art structure prediction systems such as AlphaFold2 and RoseTTAFold recently achieved remarkable performance in full-atom prediction of protein 3D structure, essentially solving the protein structure prediction problem for most proteins. However, even this pinnacle of 50 years of research has shortcomings: all SOTA methods rely on correlated mutations captured by multiple sequence alignments (MSAs). The extensive database searches needed to create MSAs significantly increase the total time required to obtain structure predictions from these models. Irrespective of runtime, the quality of predicted 3D structures relies heavily on the size and diversity of sequences within the MSA. In addition, predictions based on MSAs are family-averaged and not protein-specific, rendering those models less sensitive to effects of mutations.
In this work we present a novel structure prediction system solely trained on sequence representations from protein language models. Our model predicts protein structure orders of magnitude faster than current state-of-the-art systems, allowing high-throughput structure mutation experiments which are computationally too expensive with existing systems. While the overall quality of predictions does not reach the level of AlphaFold2, we show that our system is considerably more sensitive to point mutations than family-averaged predictors.
Presentation Overview: Show
Several viral glycoproteins go through conformational changes that are fundamental to infection processes. The SARS-CoV-2 Spike protein is of particular importance during the current pandemic. This protein interacts with the human angiotensin-converting enzyme 2 (ACE2) receptor as part of the viral entry mechanism. To do so, the receptor-binding domain (RBD) of Spike needs to be in an open state conformation. Here we utilize coarse-grained Normal Mode Analyses to model the dynamics of SARS-CoV-2 Spike protein variants as well as the transition probabilities between open and closed conformations. We performed 17081 possible in silico single mutations of Spike to determine positions and mutations that may affect the occupancy of the conformational states. Based on that, we successfully predicted some of the main mutations that constitute the Alpha, Beta and Gamma variants. We also built a simplified model for binding evaluation, validated with experimental data on the binding between RBD mutants and ACE2. This model is now being applied to the evaluation of interfaces between conformational ensembles of Spike and antibody structures, with preliminary results offering a consensus among the various experimentally determined interfaces. The aim is to propose a method to evaluate mutants that integrates dynamics, binding, and immune escape.
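A systematic single-mutation scan of this kind can be enumerated with a few lines of code. The sketch below assumes that each position is replaced by each of the 19 other standard amino acids; under that assumption 17081 equals 19 × 899, which would correspond to 899 scanned positions, although the abstract does not state this breakdown. The four-residue sequence is a placeholder, not the Spike sequence.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

def single_mutants(sequence):
    """Yield every single substitution as wild-type, 1-based position, mutant."""
    for position, wild_type in enumerate(sequence, start=1):
        for residue in AMINO_ACIDS:
            if residue != wild_type:
                yield f"{wild_type}{position}{residue}"

# Placeholder sequence for illustration only.
mutants = list(single_mutants("MFVF"))
print(len(mutants), mutants[:3])

# Arithmetic check of the assumed breakdown of the number quoted in the abstract.
print(19 * 899)
```

Each generated label could then be passed to whatever model scores the effect of the mutation on the open and closed states.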
Presentation Overview: Show
The development of new vaccines and antibody therapeutics typically takes several years and requires over $1bn in investment. Accurate knowledge of the paratope (antibody binding site) can speed up and reduce the cost of this process by improving our understanding of antibody-antigen binding.
We present Paragraph, an open-source structure-based paratope prediction tool that outperforms current state-of-the-art tools using simpler feature vectors and no antigen information. Representing the antibody variable region as a graph, Paragraph uses equivariant graph neural network layers to predict the probability of each residue belonging to the paratope.
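As an illustration of what representing a structure as a graph can mean, the sketch below links residues whose representative atoms lie within a distance cutoff. It assumes one coordinate per residue and a hypothetical `cutoff` parameter; the function name, the cutoff value, and the toy coordinates are illustrative and are not Paragraph's actual construction. The equivariant network layers are not shown.

```python
import math


def build_residue_graph(coords, cutoff=10.0):
    """Return adjacency lists linking residues whose coordinates lie within cutoff.

    coords: list of (x, y, z) tuples, one per residue (assumed representation).
    cutoff: distance threshold in the same units as coords (assumed value).
    """
    neighbours = {i: [] for i in range(len(coords))}
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            if math.dist(coords[i], coords[j]) <= cutoff:
                # Edges are undirected, so record both directions.
                neighbours[i].append(j)
                neighbours[j].append(i)
    return neighbours


# Toy coordinates: three residues spaced 3.8 apart and one distant residue.
toy_coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (30.0, 0.0, 0.0)]
graph = build_residue_graph(toy_coords, cutoff=5.0)
print(graph)
```

A per-residue classifier would then attach features to each node and pass messages along these edges; the all-pairs loop here is quadratic and is only meant for small examples.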
Given the lack of readily available antibody crystal data, it is essential that structure-based prediction tools work on model structures. Accordingly, all of our results are reported on model structures.
In addition to improving paratope prediction accuracy, we also identify issues with currently used benchmark datasets and metrics. To overcome these, we develop a larger, cleaner dataset to be used in future efforts and suggest metrics well-suited to evaluating highly class-imbalanced problems.
Paragraph achieves a PR AUC of 0.725 on ABlooper model structures of our expanded dataset. Promisingly, Paragraph’s performance increases with model confidence, suggesting our accuracy may rise with future improvements to antibody structure prediction.
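For readers unfamiliar with precision-recall metrics, the sketch below computes average precision, one common estimator of the area under the precision-recall curve. The abstract does not say which estimator was used, so this choice is an assumption, and the labels and scores are made-up toy values. Tied scores are not given special treatment here.

```python
def average_precision(labels, scores):
    """Average precision: mean of the precision at the rank of each true positive."""
    total_positives = sum(labels)
    if total_positives == 0:
        raise ValueError("at least one positive label is required")
    # Rank residues from the highest to the lowest predicted score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank  # precision at this recall step
    return precision_sum / total_positives


# Toy, class-imbalanced example: 2 positives among 8 residues.
toy_labels = [1, 0, 1, 0, 0, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2, 0.1, 0.05]
ap = average_precision(toy_labels, toy_scores)

# Accuracy of a trivial classifier that predicts "not paratope" everywhere.
all_negative_accuracy = toy_labels.count(0) / len(toy_labels)
print(round(ap, 4), all_negative_accuracy)
```

In this toy case the trivial all-negative classifier already reaches 75% accuracy without finding a single positive, which is why ranking-based metrics that focus on the positive class tend to be more informative when positives are rare.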
Presentation Overview: Show
Linear B-cell epitopes are a class of antigenic determinants that can bind to B-cell receptors or to antibodies released by the adaptive immune system. Both classes of epitope, continuous (or linear) and discontinuous, are defined only by the detection and binding of the antigen by an antibody. Computational approaches offer a scalable and less expensive route to developing epitope-based vaccines and immunotherapies by identifying, from a protein sequence, which residues are more likely to be part of an epitope.
A variety of prediction methods have been developed over the years; however, their reliability for clinical applications remains questionable given their low-to-medium performance (Matthews correlation coefficients ranging from 0.32 to 0.62). Additionally, current machine learning models lack interpretability, limiting the biological insights that could otherwise be obtained. Here, we introduce CSM-epitopes, an interpretable machine learning method capable of accurately identifying linear B-cell epitopes by leveraging a new graph-based signature representation of protein sequences, built on our well-established CSM (Cutoff Scanning Matrix) algorithm.
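For reference, the Matthews correlation coefficient quoted above can be computed from the four confusion-matrix counts. The sketch below uses the standard formula; the labels and predictions are made-up toy values, and returning 0.0 when the denominator vanishes is a common convention adopted here as an assumption.

```python
import math


def matthews_corrcoef(labels, predictions):
    """Matthews correlation coefficient for binary labels and predictions (0 or 1)."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # convention when a row or column of the confusion matrix is empty
    return (tp * tn - fp * fn) / denominator


# Toy example: 3 epitope residues among 8, with one missed and one false alarm.
toy_labels = [1, 1, 1, 0, 0, 0, 0, 0]
toy_predictions = [1, 1, 0, 1, 0, 0, 0, 0]
mcc = matthews_corrcoef(toy_labels, toy_predictions)
print(round(mcc, 4))
```

The coefficient ranges from -1 to 1, with 0 corresponding to chance-level prediction, so values between 0.32 and 0.62 indicate a real but moderate correlation between predicted and true epitope residues.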
Presentation Overview: Show
SARS-CoV-2 infection manifests a range of clinical presentations, from mild illness to life-threatening disease. As a mediator of viral entry, ACE2 is an a priori candidate genetic risk factor. The affinity of SARS-CoV-2 Spike for ACE2 is a key parameter influencing host range and tropism, so we determined the affinities of several reported ACE2 population variants experimentally and predicted the effects of many more. We found ACE2 alleles that strongly inhibited binding to Spike and some with moderately increased affinity. Comparison with recent infectivity studies indicates that affinities in the range observed for these ACE2 variants can protect cells from infection, so some variants almost certainly confer resistance on carriers; this is now being tested with clinical data. We will also highlight the strengths and weaknesses of current-generation predictors, and present new results on the interplay between ACE2 variants and different SARS-CoV-2 strains.