3DSIG COSI

Schedule subject to change
All times listed are in CDT
Wednesday, July 13th
10:30-11:10
Keynote Presentation: Unprecedented Opportunities for Datamining of Protein Sequences and Structures
Room: EH
Format: Live from venue

Moderator(s): Rafael Najmanovich

  • Robert Jernigan


Presentation Overview:

There are now several hundred million protein sequences and well over 100,000 protein structures, bu...

11:10-11:30
Keynote Presentation: Unprecedented Opportunities for Careers in Protein Science
Room: EH
Format: Live from venue

Moderator(s): Rafael Najmanovich

  • Robert Jernigan


Presentation Overview:

There are four recent events that have set the stage for incredible opportunities for careers in pro...

11:30-11:50
Sequence-sensitive elastic network captures dynamical elements necessary for human microRNA maturation
Room: EH
Format: Live from venue

Moderator(s): Rafael Najmanovich

  • Olivier Mailhot, Université de Montréal, Canada
  • François Major, Université de Montréal, Canada
  • Rafael Najmanovich, Université de Montréal, Canada


Presentation Overview:

MicroRNAs (miRNAs) regulate gene expression and have recognized roles in numerous physiological processes and diseases, including cancer. A single nucleotide polymorphism (SNP) in the miR-125a gene leads to poor breast cancer prognosis by blocking the cleavage of the primary miRNA transcript (pri-miRNA). The SNP does not affect pri-miR-125a’s minimal energy structure, leading to the hypothesis that the lost signal must be dynamical. Leveraging high-throughput data on the maturation efficiency of 29 478 pri-miR-125a mutant sequences, we applied our sequence-sensitive elastic network ENCoM to study the full 3D conformational space of pri-miRNAs and its impact on their maturation efficiency. The model predicts maturation efficiency with high accuracy (predictive R-squared of 0.75) when both the ENCoM dynamical signature and the MC-Fold enthalpy of folding are combined, highlighting the synergy between these respectively entropy- and enthalpy-based methods. Looking at the patterns apparent from the model’s coefficients, we corroborate motifs previously identified as necessary for the cleavage of pri-miRNAs but also challenge established notions such as the necessity for a rigid hairpin structure. Our novel approach is fast enough to predict theoretical maturation efficiencies for millions of miRNA sequences, the extremes of which we are currently testing in the lab.

11:50-12:10
Intrinsic linking of chromatin in human cells
Room: EH
Format: Live from venue

Moderator(s): Rafael Najmanovich

  • Maciej Borodzik, Institute of Mathematics, University of Warsaw, ul. Banacha 2, 02-097 Warsaw, Poland
  • Michał Denkiewicz, Faculty of Mathematics and Information Science, Warsaw University of Technology, Warsaw, Poland
  • Krzysztof Spalinski, Faculty of Mathematics and Information Science, Warsaw University of Technology, Warsaw, Poland
  • Kamila Winnicka, Centre of New Technologies, University of Warsaw, ul. Banacha 2c, 02-097 Warsaw, Poland
  • Kaustav Sengupta, Centre of New Technologies, University of Warsaw, ul. Banacha 2c, 02-097 Warsaw, Poland
  • Marcin Pilipczuk, Institute of Informatics, University of Warsaw, ul. Banacha 2, 02-097 Warsaw, Poland
  • Michał Pilipczuk, Institute of Informatics, University of Warsaw, ul. Banacha 2, 02-097 Warsaw, Poland
  • Yijun Ruan, The Jackson Laboratory for Genomic Medicine, United States
  • Dariusz Plewczynski, Centre of New Technologies, University of Warsaw & Warsaw University of Technology, Warsaw, Poland


Presentation Overview:

Motivation: We propose a practical algorithm based on graph theory to identify CTCF-mediated chromatin loops that are linked in 3D space. Our method is based on finding certain graph structures, K6 minors, in graphs constructed from pairwise chromatin interaction data obtained from ChIA-PET experiments. We show that such graph structures, representing a particular arrangement of loops, mathematically necessitate linking if they co-occur in an individual cell. The presence of these linked structures can advance our understanding of the principles of spatial organization of the genome.
Results: We apply our method to graphs created from in situ ChIA-PET data for the GM12878, H1ESC, HFFC6 and WTC11 cell lines, and from long-read ChIA-PET data. We look at these datasets as divided into CCDs – closely interconnected regions defined on the basis of CTCF loops. We find numerous candidate regions with minors, indicating the presence of links. The graph-theoretic characteristics of these linked regions, including betweenness and closeness centrality, differ from those of regions in which no minors were found, which supports their non-random nature. We provide two versions of the algorithm: one efficient enough to be applied to large datasets, and the other with greater detection capabilities.

12:10-12:30
GLYCO-2.0: a web-based server to quantify glycan shielding of glycosylated proteins with improved data processing and computational speed
Room: EH
Format: Live from venue

Moderator(s): Rafael Najmanovich

  • Myungjin Lee, National Institutes of Health, United States
  • Mateo Reveiz, National Institutes of Health, United States
  • Reda Rawi, National Institutes of Health, United States
  • Peter Kwong, National Institutes of Health, United States


Presentation Overview:

Glycans play important roles in protein folding and cell-cell interactions, and glycosylation of protein antigens can dramatically impact immune responses. Previously, we developed an in silico tool, GLYCO (GLYcan COverage), to quantify the glycan shielding of protein surfaces. We applied it to determine the glycan-free surface of the SARS-CoV-2 NTD supersite and to correlate glycan coverage with antigen-antibody properties. Here we developed a user-friendly web server, GLYCO-2.0, and improved the computational speed by replacing the previous linear parametrization with a new analytical cylinder method with KD-trees when retrieving atom positions within the coordinate space. These new methods increased computational speed by ~4-5 fold in single and multiprocessing settings. GLYCO-2.0 can estimate glycan shielding from a single coordinate file or from multiple frames derived from, for instance, molecular dynamics simulations or NMR spectroscopy, to account for the inherent flexibility of oligosaccharides. The server offers email notifications, allowing the retrieval of results within a week. We also showcased the applicability of GLYCO-2.0 by estimating the development of the glycan shield of influenza hemagglutinin proteins over time. Overall, quantification of glycans by GLYCO-2.0 provides a comprehensive understanding of glycan shielding of glycosylated proteins and contributes to glycoprotein-related research such as vaccine design.

14:30-14:40
Interaction between hIFNγ and HS oligosaccharides
Room: EH
Format: Live-stream

Moderator(s): Ravinder Abrol

  • Elena Lilkova, Institute of Information and Communication Technologies, Bulgarian Academy of Sciences, Bulgaria
  • Elena Krachmarova, Institute of Molecular Biology “Roumen Tsanev”, Bulgarian Academy of Sciences, Bulgaria
  • Peicho Petkov, Faculty of Physics, Sofia University "St. Kliment Ohridski", Bulgaria
  • Nevena Ilieva, Institute of Information and Communication Technologies, Bulgarian Academy of Sciences, Bulgaria
  • Kristina Malinova, Institute of Molecular Biology “Roumen Tsanev”, Bulgarian Academy of Sciences, Bulgaria
  • Genoveva Nacheva, Institute of Molecular Biology “Roumen Tsanev”, Bulgarian Academy of Sciences, Bulgaria
  • Leandar Litov, Faculty of Physics, Sofia University "St. Kliment Ohridski", Bulgaria


Presentation Overview:

Human interferon-gamma (hIFNγ) is a crucial immunomodulating cytokine, which binds to a high-affinity cellular receptor hIFNγR1. The cytokine also binds to the glycosaminoglycans (GAGs) heparin and heparan sulfate (HS), which modulates its physico-chemical properties.
We report molecular dynamics studies of the interaction of hIFNγ and HS-derived oligosaccharides in two different scenarios – in the circulation, and at the cell-surface, when the cytokine forms a complex with its receptor.
HS oligosaccharides bind to the C-termini of free hIFNγ with high affinity, forming very stable complexes due to the strong electrostatic attraction, and also interact with the positively charged solvent-exposed domains in the cytokine globule. This impedes further interaction of the cytokine with hIFNγR1.
On the other hand, GAGs, and HS in particular, may be crucial participants in the formation of the hIFNγ–hIFNγR complex at the cell surface. Our in silico results demonstrate that placing HS oligosaccharides between the two receptor units facilitates the formation of the cytokine–receptor complex by pulling down the hIFNγ globule via electrostatic attraction of its C-termini. Experiments performed on cell culture confirm that inhibition of the sulfation of HS proteoglycans by addition of NaClO3 to the cell medium leads to decreased hIFNγ activity.

14:40-14:50
PITHIA: protein interaction site prediction using multiple sequence alignments and attention
Room: EH
Format: Live-stream

Moderator(s): Ravinder Abrol

  • Seyedmohsen Hosseini, University of Western Ontario, Canada
  • Lucian Ilie, University of Western Ontario, Canada


Presentation Overview:

Cellular functions are governed by proteins. While some proteins work independently, most function by interacting with each other. It is crucially important to know the binding sites that facilitate the interactions. Experimental methods are costly and time-consuming; therefore, it is essential to develop effective computational methods. We present PITHIA, a deep learning model for protein interaction site prediction that exploits several of the most powerful tools in bioinformatics: alignment, attention, and embedding. The recently introduced MSA-transformer, a language model that surpasses previous unsupervised methods by a wide margin, uses the power of attention to learn from millions of multiple sequence alignments. We use the contextual embeddings produced by the MSA-transformer as inputs to our program. The architecture of PITHIA is attention-based as well, and was selected by a thorough comparison with multiple candidates. For meaningful comparison with existing programs, we update several widely used datasets with the most current protein binding site information and create a new one, which is the largest and most challenging to date. PITHIA greatly surpasses the competition on five datasets with respect to multiple measures, exceeding the closest competitor by up to 35% in terms of area under the precision-recall curve.

15:00-15:10
MTL4MHC2: MHC class II binding prediction by using multi-task learning
Room: EH
Format: Live from venue

Moderator(s): Ravinder Abrol

  • Kazuhiro Ikkyu, University of Tsukuba, Riken, Japan
  • Itoshi Nikaido, University of Tsukuba, Riken, Tokyo Medical and Dental University, Japan


Presentation Overview:

Neoepitopes (neoantigens) are cancer-specific antigens and are significant candidates for therapeutic cancer vaccines. Epitopes bind the major histocompatibility complex (MHC), which is an immune receptor. Tumor neoepitopes induce an immune response to eliminate cancer cells. This immune activation depends on the affinity between the antigen peptide and the MHC. The epitope-MHC binding assay is a technically difficult, time-consuming, and expensive experiment. Therefore, computational tools that predict the affinity between antigen peptide and MHC have been developed. However, the volume of available data is insufficient for predicting epitope-MHC binding, and the performance of these predictions is not yet adequate. Here, we propose a novel deep learning model that can predict epitope-MHC binding from a small amount of training data.
MTL4MHC2 has two multi-task Bi-LSTM models: the antigen peptide learning model and the MHC peptide learning model. Each multi-task model shares the learning parameters of MHC class I and II. MTL4MHC2 achieves an AUC-ROC score of 82.2%, outperforming state-of-the-art models.
We demonstrated the effectiveness of multi-task learning for improving prediction performance from small amounts of data. MTL4MHC2 can be applied to developing novel cancer therapeutics such as cancer vaccines.

15:10-15:20
GASS platform: identifying active sites and binding sites on protein structures using parallel genetic algorithms
Room: EH
Format: Live-stream

Moderator(s): Ravinder Abrol

  • Vinícius Paiva, Universidade Federal de Viçosa - UFV, Brazil
  • Murillo Mendonça, Universidade Federal de Itajubá - UNIFEI, Brazil
  • Sabrina Silveira, Universidade Federal de Viçosa - UFV, Brazil
  • David Ascher, University of Melbourne, Australia
  • Douglas Pires, University of Melbourne, Australia
  • Sandro Izidoro, Universidade Federal de Itajubá - UNIFEI, Brazil


Presentation Overview:

Catalytic, binding and metal-binding sites are important and conserved regions of proteins. Their identification can provide important information and insights into protein function. Several computational methods have been developed to identify binding sites based on both sequence and structural information. These have, however, presented limited performance, mostly relying on structural similarity, restricting their application to small binding sites, and not being capable of handling conservative mutations or identifying inter-domain sites.

Here we present the GASS platform, a family of methods for searching similar sites in proteins based on parallel genetic algorithms. GASS was previously used successfully to search for similar catalytic and binding sites, based on templates from the Mechanism and Catalytic Site Atlas (M-CSA), correctly identifying more than 90% of the catalogued catalytic sites and ranking fourth among the 18 methods in the CASP10 competition. GASS was also compared with 8 other state-of-the-art methods for detecting metal-binding sites, outperforming similar methods, achieving an MCC of up to 0.57 and detecting up to 96% of the metal-binding sites correctly.

The GASS platform (https://gassmetal.unifei.edu.br, http://gass.unifei.edu.br/) provides accurate and easy-to-use methods that can be adapted to searching for binding sites in proteins.
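
For readers unfamiliar with the MCC (Matthews correlation coefficient) quoted in the abstract above, the following is a minimal sketch of how it is computed from confusion-matrix counts. The function name and the example counts are hypothetical and chosen only for illustration; they are not GASS results.

```python
import math


def matthews_corrcoef(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts.

    Returns 0.0 when any marginal total is zero (a common convention).
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical counts for a site predictor (illustration only, not GASS data).
example_mcc = matthews_corrcoef(tp=40, tn=900, fp=30, fn=30)
print(f"MCC = {example_mcc:.3f}")
```

The MCC ranges from -1 to 1 and stays informative when positives (site residues) are much rarer than negatives, which is typically the case for binding-site prediction.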

15:20-15:30
PPalign: optimal alignment of Potts models representing proteins with direct coupling information
Room: EH
Format: Live-stream

Moderator(s): Ravinder Abrol

  • Hugo Talibart, Institut de Systématique, Evolution, Biodiversité (ISYEB), MNHN, Sorbonne Université, EPHE, UA, CNRS, France
  • François Coste, Univ Rennes, Inria, CNRS, IRISA, Rennes, France
  • Mathilde Carpentier, Institut de Systématique, Evolution, Biodiversité (ISYEB), MNHN, Sorbonne Université, EPHE, UA, CNRS, France


Presentation Overview:

To assign structural and functional annotations to the ever-increasing number of sequenced proteins, the main approach relies on sequence-based homology search methods, e.g. BLAST or profile Hidden Markov Model methods, which rely on significant alignments of query sequences to annotated proteins or protein families. While powerful, these approaches do not take coevolution between residues into account. Taking advantage of recent advances in the field of contact prediction, we propose to represent proteins by Potts models, which model direct couplings between positions in addition to positional composition, and to compare proteins by aligning these models. Due to non-local dependencies, this problem is computationally hard.
We introduced PPalign, a program based on Integer Linear Programming, to compute the optimal pairwise alignment of Potts models representing proteins. The approach was assessed on reference pairwise sequence alignments with low sequence identity (3% to 20%). In this experiment, Potts models were aligned in reasonable time (1’37” on average), and PPalign yielded a better mean F1 score than HHalign and independent-site PPalign, and found significantly better alignments in some cases.
These results show that pairwise couplings from protein Potts models can be used to improve the alignment of remotely related protein sequences in tractable time.

16:00-16:20
Normal Mode Analysis Applied to GPCRs
Room: EH
Format: Live from venue

Moderator(s): Alexey Porollo

  • Gabriel Tiago Galdino, Université de Montréal, Canada
  • Olivier Mailhot, Université de Montréal, Canada
  • Rafael Najmanovich, Université de Montréal, Canada


Presentation Overview:

We employ coarse-grained normal mode analysis to calculate dynamical signatures of different ligand/G-protein-coupled receptor (GPCR) complexes. Dynamical signatures show changes in flexibility of different parts of the structure upon ligand binding. As a first experiment, we docked a large set of ligands with known Emax for GTP-gammaS binding to crystal structures of the active mu (MOR) and kappa (KOR) opioid receptors, calculated the dynamical signature for each ligand and obtained predictors using multiple linear regression. We obtained Pearson’s correlations of R=0.46 and R=0.57 in a leave-one-out validation (a scenario where we present a totally new ligand to the system) and of R=0.8 and R=0.7 in an 80:20 validation (a best-case scenario where new molecules are like training set molecules), for MOR and KOR respectively. These results show that, even with a limited training set, we can obtain a good estimate of the Emax of new drug candidates, thereby predicting their role as agonists, antagonists, or partial agonists computationally and potentially as part of high-throughput screening. Moreover, by analyzing the coefficients of these predictors, we see which regions of the receptor have the largest influence on its activation (highlighting helices 5, 6 and the binding site).
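
To make the leave-one-out validation mentioned above concrete, here is a minimal sketch of the general procedure: each ligand is held out in turn, a regression is fitted on the rest, and the Pearson correlation between held-out predictions and measured values is reported. As a simplifying assumption it uses a single made-up feature with ordinary least squares rather than the multiple linear regression over dynamical signatures described in the abstract. All names and numbers are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx


def leave_one_out_predictions(xs, ys):
    """Predict each sample from a model trained on all the other samples."""
    preds = []
    for i in range(len(xs)):
        slope, intercept = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        preds.append(slope * xs[i] + intercept)
    return preds


# Made-up toy data: one feature per ligand and a made-up Emax value.
features = [0.1, 0.4, 0.5, 0.9, 1.2, 1.5]
emax = [12.0, 30.0, 33.0, 61.0, 70.0, 95.0]
loo_r = pearson_r(emax, leave_one_out_predictions(features, emax))
print(f"leave-one-out Pearson R = {loo_r:.2f}")
```

An 80:20 validation would differ only in the split: one fifth of the ligands is held out at once instead of one ligand at a time.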

16:20-16:40
Structure-enhanced Deep Meta-learning Predicts Genome-Wide Uncharted Chemical-Protein Interactions
Room: EH
Format: Live from venue

Moderator(s): Alexey Porollo

  • Tian Cai, Hunter College, The City University of New York, United States
  • Lei Xie, Hunter College, The City University of New York, United States


Presentation Overview:

Discovering genome-wide chemical-protein interactions is instrumental for chemical genomics, drug discovery and precision medicine. However, more than 90% of gene families remain dark, i.e., their small-molecule ligands are undiscovered. Existing approaches typically fail when the dark protein of interest differs from those with known ligands or structures. To address this challenge, we developed a deep learning framework, PortalCG. PortalCG consists of three novel components: (i) end-to-end step-wise transfer learning in recognition of the sequence-structure-function paradigm, (ii) out-of-cluster meta-learning in light of protein evolution for generalizing machine learning models to unstudied gene families, and (iii) stress model selection to facilitate model deployment in a real-world scenario. In rigorous benchmark experiments, PortalCG considerably outperformed state-of-the-art sequence- and structure-based techniques when applied to dark gene families. Experimental validations on 65 compounds supported the accuracy and robustness of PortalCG. Thus, PortalCG is a viable solution to the out-of-distribution (OOD) problem in exploring the dark protein functional space, and can be applied to a wide variety of scientific domains.

16:40-17:00
XLEC - Large-scale prediction of protein-protein complex structures from sequence co-evolution and cross-linking data
Room: EH
Format: Live from venue

Moderator(s): Alexey Porollo

  • Hadeer Elhabashy, Max-Planck-Institut für Biologie Tübingen, Germany
  • Oliver Kohlbacher, University of Tübingen, Germany


Presentation Overview:

Prediction and structural modeling of protein-protein interactions (PPIs) are essential for understanding biological processes. Most large-scale experimental and computational approaches that predict PPIs do not provide structural information. We present a novel approach, XLEC, combining cross-linking mass spectrometry (XL-MS) and evolutionary couplings (ECs) data for efficient proteome-wide prediction and modeling of PPIs. While ECs derived from multiple sequence alignments primarily yield information on direct contacts between proteins across the interface, XL-MS data preferentially captures longer-range interactions, hence these methods contain complementary information. XLEC integrates information from both approaches in a machine learning-based model and subsequent constraint-based modeling of the complex structure. We applied XLEC to data from murine mitochondrial proteomes and compared its performance to those of XL-MS and ECs separately. Our preliminary assessment suggests that XLEC outperforms XL-MS or ECs-based identification of PPIs (precision/recall: XLEC 76%/76%; XL-MS only: 71%/57%; ECs: 68%/57%). Furthermore, XLEC-based modeling of PPIs achieved excellent L-RMSD (<10 Å) for 20% of the benchmark dataset (XL-MS only: 2%; ECs only: 11%). Using XLEC, we generated around 500 de novo PPI models revealing novel insights into the mitochondrial interactome.
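
The precision/recall pairs quoted above can be condensed into a single number per method with the F1 score (the harmonic mean of precision and recall). The abstract does not report F1, so the sketch below is only a derived summary of the stated percentages. The function and variable names are illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision/recall pairs as stated in the abstract, expressed as fractions.
reported = {
    "XLEC": (0.76, 0.76),
    "XL-MS only": (0.71, 0.57),
    "ECs only": (0.68, 0.57),
}

f1_by_method = {name: f1_score(p, r) for name, (p, r) in reported.items()}
for name, f1 in f1_by_method.items():
    print(f"{name}: F1 = {f1:.2f}")
```

With these inputs the derived F1 values are roughly 0.76 for XLEC, 0.63 for XL-MS only and 0.62 for ECs only.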

17:00-17:20
Deep Local Analysis evaluates protein docking conformations with locally oriented cubes
Room: EH
Format: Live-stream

Moderator(s): Alexey Porollo

  • Yasser Mohseni Behbahani, Sorbonne Université, France
  • Simon Crouzet, Sorbonne Université, France
  • Élodie Laine, Sorbonne Université, France
  • Alessandra Carbone, Sorbonne Université, France


Presentation Overview:

With the recent advances in protein 3D structure prediction, protein interactions are becoming more central than ever before. Here, we address the problem of determining how proteins interact with one another. More specifically, we investigate the possibility of discriminating near-native protein complex conformations from incorrect ones by exploiting local environments around interfacial residues. Deep Local Analysis (DLA)-Ranker is a deep learning framework applying 3D convolutions to a set of locally oriented cubes representing the protein interface. It explicitly considers the local geometry of the interfacial residues along with their neighboring atoms and the regions of the interface with different solvent accessibility. We assessed its performance on three docking benchmarks made of half a million acceptable and incorrect conformations. We show that DLA-Ranker successfully identifies near-native conformations from ensembles generated by molecular docking. It surpasses or competes with other deep learning-based scoring functions. We also showcase its usefulness to discover alternative interfaces.

17:20-17:40
Standing out in the crowd: Native protein partners are distinct from the non-native ones in protein-protein interactions
Room: EH
Format: Live from venue

Moderator(s): Alexey Porollo

  • Amar Singh, Computational Biology Program, The University of Kansas, Lawrence, Kansas 66045, United States
  • Petras J. Kundrotas, Computational Biology Program, The University of Kansas, Lawrence, Kansas 66045, United States
  • Ilya A. Vakser, Computational Biology Program, The University of Kansas, Lawrence, Kansas 66045, United States


Presentation Overview:

In the crowded cellular environment, an important challenge is to elucidate how proteins distinguish their native partners from a wide variety of non-interactors. The increasing availability of experimentally determined protein-protein complexes provides an opportunity to investigate preferences in protein-protein interactions. We systematically explored the shape complementarity of the interacting proteins using binary hetero complexes from the Protein Data Bank (PDB). The results showed that protein shape characteristics and the corresponding intermolecular energy landscape, sampled by a systematic docking protocol, can discriminate non-interacting proteins from native partners. The number of minima on the energy landscape of known protein interactors, as well as the clustering patterns of the energy minima, differ from those of the non-native protein ligands. The findings provide insight into fundamental properties of protein recognition. The results can be used to generate more adequate sets of protein-protein complexes for knowledge-based modeling.

17:40-18:00
Proceedings Presentation: Topsy-Turvy: integrating a global view into sequence-based PPI prediction
Room: EH
Format: Live from venue

Moderator(s): Alexey Porollo

  • Rohit Singh, Massachusetts Institute of Technology, United States
  • Kapil Devkota, Tufts University, United States
  • Samuel Sledzieski, Massachusetts Institute of Technology, United States
  • Bonnie Berger, Massachusetts Institute of Technology, United States
  • Lenore Cowen, Tufts University, United States


Presentation Overview:

Computational methods to predict protein-protein interaction (PPI) typically segregate into sequence-based "bottom-up" methods that infer properties from the characteristics of the individual protein sequences, or global "top-down" methods that infer properties from the pattern of already known PPIs in the species of interest. However, a way to incorporate top-down insights into sequence-based bottom-up PPI prediction methods has been elusive. We thus introduce Topsy-Turvy, a method that newly synthesizes both views in a sequence-based, multi-scale, deep-learning model for PPI prediction. While Topsy-Turvy makes predictions using only sequence data, during the training phase it takes a transfer-learning approach by incorporating patterns from both global and molecular-level views of protein interaction. In a cross-species context, we show it achieves state-of-the-art performance, offering the ability to perform genome-scale, interpretable PPI prediction for non-model organisms with no existing experimental PPI data. In species with available experimental PPI data, we further present a Topsy-Turvy hybrid (TT-Hybrid) model which integrates Topsy-Turvy with a purely network-based model for link prediction that provides information about species-specific network rewiring. TT-Hybrid makes accurate predictions for both well- and sparsely-characterized proteins, outperforming both its constituent components and other state-of-the-art PPI prediction methods. Furthermore, running Topsy-Turvy and TT-Hybrid screens is feasible for whole genomes, and thus these methods scale to settings where other methods (e.g., AlphaFold-Multimer) might be infeasible. The generalizability, accuracy and genome-level scalability of Topsy-Turvy and TT-Hybrid unlock a more comprehensive map of protein interaction and organization in both model and non-model organisms.

Software availability: https://topsyturvy.csail.mit.edu

Thursday, July 14th
10:15-11:15
Keynote Presentation: Protein design using deep learning
Room: EH
Format: Live-stream

Moderator(s): Douglas Pires

  • David Baker


Presentation Overview:

Proteins mediate the critical processes of life and beautifully solve the challenges faced during th...

11:15-11:35
AlphaFold2 reveals commonalities and novelties in protein structure space for 21 model organisms
Room: EH
Format: Live from venue

Moderator(s): Douglas Pires

  • Nicola Bordin, University College London, United Kingdom
  • Ian Sillitoe, University College London, United Kingdom
  • Vamsi Nallapareddy, University College London, United Kingdom
  • Clemens Rauer, University College London, United Kingdom
  • Su Datt Lam, Universiti Kebangsaan Malaysia, Malaysia
  • Vaishali Waman, University College London, United Kingdom
  • Neeladri Sen, University College London, United Kingdom
  • Michael Heinzinger, Technische Universität München, Germany
  • Maria Littmann, Technische Universität München, Germany
  • Stephanie Kim, Seoul National University, South Korea
  • Sameer Velankar, EMBL-EBI, United Kingdom
  • Martin Steinegger, Seoul National University, South Korea
  • Burkhard Rost, Technische Universität München, Germany
  • Christine Orengo, University College London, United Kingdom


Presentation Overview:

AlphaFold2, an ML-based method developed by DeepMind, revolutionised the field of structural biology by predicting the 3D structure of proteins with an accuracy often comparable to experimental characterization. In a joint effort with EMBL-EBI, protein structures for 21 model organisms were made available. To exploit these, assigning modelled domains to their evolutionary families helps in understanding how genetic variations modify structure and ultimately function. The CATH database includes evolutionary relationships between protein domains and classifies them into superfamilies. We identify structural domains in AlphaFold2 models and classify them in CATH. While most domain assignments are obtainable by Hidden Markov Model-based methods, remote homologs are often elusive. We recently established CATHe, a supervised machine learning approach that exploits sequence embeddings from the ProtT5 PLM to detect remote homologs. Using CATHe and a new fast structural aligner, Foldseek, we established thresholds for confirming homology. Before structurally validating the assignments, small, disordered, non-globular or poorly packed domains were removed. 93% of domains passing these thresholds could be brought into CATH, with the remainder belonging to ~4200 putative novel families. Manual curation efforts on human domains from these novel families led to the identification of one new architecture and ~100 new folds.

11:35-11:55
ProteinAlignmentObstruction – an algorithm for detecting and quantifying steric and topological obstructions to structural alignments of proteins
Room: EH
Format: Live-stream

Moderator(s): Douglas Pires

  • Peter Røgen, Technical University of Denmark, Denmark


Presentation Overview:

Structure comparison is fundamental for understanding proteins, specifically for studying their sequence and structural evolution and for guiding our efforts to predict their structures from their sequences of amino acids. Coordinate based structural alignment methods optimize the distances traversed by aligned residue pairs during the linear interpolation between two superimposed structures. Current alignment scores do not take into account if there is room for this morph, if it causes steric clashes or if it causes topological changes to the compared structures.

ProteinAlignmentObstruction finds steric clashes and self-intersections occurring during the linear interpolation between two aligned and superimposed structures. Self-intersections that can be avoided by re-folding at most M (user-defined) residues are called removable and the remaining self-intersections detect different threading or topology and are called essential.

We find examples of homologous protein pairs with distinct threading and many pairs of distinctly classified folds that are easily morphed into each other, emphasizing the continuous nature of parts of protein fold space. I will present our new server Steric and TOPological Model Hindrance and examples of threading errors it finds in CASP14 models. There are many applications where the ability to detect if structures are close in configuration space may prove important.
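
The idea of checking for steric clashes along the linear interpolation between two superimposed structures can be illustrated with a toy sketch. It is not the ProteinAlignmentObstruction algorithm: it only samples the morph at discrete time points and flags close pairs of non-neighbouring residues, and it does not detect self-intersections. The 3.0 Å threshold, the number of steps, the `skip` parameter and the coordinates are all assumptions made for illustration.

```python
import math


def interpolate(start, end, t):
    """Linearly interpolate between two equal-length lists of (x, y, z) points."""
    return [
        tuple((1 - t) * a + t * b for a, b in zip(p, q))
        for p, q in zip(start, end)
    ]


def find_clashes(start, end, min_dist=3.0, steps=20, skip=2):
    """Return (t, i, j) whenever residues i and j come closer than min_dist.

    Pairs fewer than skip + 1 positions apart along the chain are ignored,
    since chain neighbours are always close. Thresholds are assumed values.
    """
    clashes = []
    for k in range(steps + 1):
        t = k / steps
        frame = interpolate(start, end, t)
        for i in range(len(frame)):
            for j in range(i + skip + 1, len(frame)):
                if math.dist(frame[i], frame[j]) < min_dist:
                    clashes.append((t, i, j))
    return clashes


# Toy coordinates (not a real protein): the last point sweeps past the first.
structure_a = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (-6.0, 1.0, 0.0)]
structure_b = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (6.0, 1.0, 0.0)]

clashes = find_clashes(structure_a, structure_b)
print(f"{len(clashes)} clashing samples, first at t = {clashes[0][0]}")
```

In this toy case neither end structure contains a clash, yet the morph between them does. This is the kind of obstruction that a purely distance-based alignment score would not see.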

11:55-12:15
FrustraEvo: Assessing Protein Families Divergence In The Light Of Sequence and Energetic Constraints
Room: EH
Format: Live from venue

Moderator(s): Douglas Pires

  • Victoria Ruiz-Serra, Barcelona Supercomputing Center, Spain
  • Maria Freiberger, Protein Physiology Lab, Buenos Aires University, Argentina
  • Camila Pontes, Barcelona Supercomputing Center, Spain
  • Miguel Romero, Barcelona Supercomputing Center, Spain
  • Pablo Galaz-Davison, Institute for Biological and Medical Engineering, Pontificia Universidad Catolica de Chile, Chile
  • Cesar Ramirez-Sarmiento, Institute for Biological and Medical Engineering, Pontificia Universidad Catolica de Chile, Chile
  • Rodrigo Gonzalo Parra, Barcelona Supercomputing Center, Spain
  • Alfonso Valencia, Barcelona Supercomputing Center, Spain


Presentation Overview:

Protein families evolve by the accumulation of sequence variations that translate into changes in the folding pathways and the structure and dynamics of the native state of their members. These changes are constrained by the features of the folding energy landscape as well as the cellular context where these proteins perform their molecular function.

Natural proteins fold by minimizing the energetics of those interactions that are present in their native states. Although the free energy is globally minimized, not all interactions that are present in the native state can be energetically optimized. These conflicting, frustrated signals have been linked with different functional aspects such as protein-protein interactions, allosterism and catalytic activity.

Here we present FrustraEvo, a tool that measures local frustration conservation patterns within protein families as a proxy to define residues that are important either for stability or function and relate them to their sequence variability signatures. We additionally compare homologous protein families to understand how they have diversified their functional patterns from a common ancestral origin. We will showcase how FrustraEvo can shed light on the functional understanding of structurally characterized protein families as well as of poorly characterized ones, thanks to recent advances in structure prediction.

12:15-12:35
Proceedings Presentation: On the reliability and the limits of inference of amino acid sequence alignments
Room: EH
Format: Live-stream

Moderator(s): Douglas Pires

  • Dinithi Sumanaweera, Wellcome Sanger Institute, United Kingdom
  • Lloyd Allison, Monash University, Australia
  • Arun Konagurthu, Monash University, Australia
  • Sandun Rajapaksa, Monash University, Australia
  • Arthur Lesk, Pennsylvania State University, United States
  • Peter Stuckey, Monash University, Australia
  • Maria Garcia de la Banda, Monash University, Australia
  • David Abramson, University of Queensland, Australia


Presentation Overview:

Motivation: Alignments are correspondences between sequences. How reliable are alignments of amino-acid sequences of proteins, and what inferences about protein relationships can be drawn? Using techniques not previously applied to these questions, by weighting every possible sequence alignment by its posterior probability we derive a formal mathematical expectation, and develop an efficient algorithm for computation of the distance between alternative alignments allowing quantitative comparisons of sequence-based alignments with corresponding reference structure alignments.

Results: By analyzing the sequences and structures of one million protein domain pairs, we report the variation of the expected distance between sequence-based and structure-based alignments, as a function of (Markov time of) sequence divergence. Our results clearly demarcate the 'daylight', 'twilight' and 'midnight' zones for interpreting residue-residue correspondences from sequence information alone.

13:15-13:35
Sequence-structure-function relationships in the microbial protein universe
Room: EH
Format: Live from venue

Moderator(s): Lenore Cowen

  • Paweł Szczerbiak, Malopolska Centre of Biotechnology, Jagiellonian University, Krakow, Poland
  • Julia Koehler Leman, Center for Computational Biology, Flatiron Institute, Simons Foundation, New York, NY, United States
  • P. Douglas Renfrew, Center for Computational Biology, Flatiron Institute, Simons Foundation, New York, NY, United States
  • Vladimir Gligorijevic, Center for Computational Biology, Flatiron Institute, Simons Foundation, New York, NY, United States
  • Daniel Berenberg, Center for Computational Biology, Flatiron Institute, Simons Foundation, New York, NY, United States
  • Richard Bonneau, Center for Computational Biology, Flatiron Institute, Simons Foundation, New York, NY, United States
  • Tomasz Kosciolek, Malopolska Centre of Biotechnology, Jagiellonian University, Krakow, Poland


Presentation Overview:

For the past half-century, structural biologists have relied on the notion that similar protein sequences give rise to similar structures and functions. While this assumption has driven research to explore certain parts of the protein universe, it disregards regions where the assumption does not hold. Here we explore areas of the protein universe where similar protein functions can be achieved by different sequences and different structures. We created an open-access database of ~200,000 newly predicted structures for diverse protein sequences from 1,003 representative genomes across the microbial tree of life and annotated them functionally on a per-residue basis. Structure prediction was carried out on the World Community Grid. To make our analysis more robust, we used two methods: Rosetta and DMPFold. Critical parts of our dataset were also verified with AlphaFold, which confirmed our previous predictions. The resulting database is complementary to the AlphaFold database with regard to domains of life as well as sequence diversity and sequence length. We annotate these models functionally and analyze the features of the resulting protein structure-function universe, including fold novelty and structure-function relationships. We identify 148 novel folds and describe examples where we map specific functions to structural motifs.

13:35-13:55
Scaling to Millions of Structures: Real-Time Protein Structure Motif Searching in the RCSB PDB & AlphaFold Databases
Room: EH
Format: Live from venue

Moderator(s): Lenore Cowen

  • Sebastian Bittrich, RCSB Protein Data Bank, United States
  • Jose M. Duarte, RCSB Protein Data Bank, United States
  • Stephen K. Burley, RCSB Protein Data Bank, United States


Presentation Overview:

Biochemical and biological functions of proteins are the product of both the overall fold of the polypeptide chain and, typically, structure motifs made up of smaller numbers of amino acids, such as the DNA-binding His2/Cys2 zinc finger motif or the His-Asp-Ser catalytic triad in serine proteases. We present an efficient yet flexible structure motif search service, which for the first time allows users to search for arbitrary structure motifs across the entire PDB archive within seconds.

Our implementation is available as part of the RCSB Protein Data Bank (rcsb.org) web portal and its search infrastructure. Users can extract motifs from the 3D view of a protein structure, execute their search, and visualize all alignments between the query motif and each retrieved hit using the Mol* 3D viewer. As a proof of concept, we demonstrate how our solution readily scales to millions of protein structures. It is capable of searching the entire PDB archive plus the AlphaFold Protein Structure Database within seconds. Efficient methods to navigate the 3D structure space have never been more in demand, and the RCSB PDB structure motif search service is one of the tools that allow users to effectively confront the deluge of 3D biostructure data.

13:55-14:15
Foldseek: fast and accurate protein structure search
Room: EH
Format: Live-stream

Moderator(s): Lenore Cowen

  • Michel van Kempen, Max Planck Institute, Germany
  • Stephanie Kim, Seoul National University, South Korea
  • Charlotte Tumescheit, Seoul National University, South Korea
  • Milot Mirdita, Max Planck Institute, Germany
  • Johannes Söding, Max Planck Institute, Germany
  • Martin Steinegger, Seoul National University, South Korea
  • Cameron Gilchrist, Seoul National University, South Korea


Presentation Overview:

Highly accurate structure prediction methods, such as AlphaFold2 and RoseTTAFold, are generating an avalanche of publicly available protein structures. Searching through these structures with current structural alignment tools is becoming the main bottleneck in their analysis. Here we propose Foldseek, a fast and sensitive protein structure alignment method to compare large structure sets. Foldseek encodes structures as sequences over a 20-state 3Di alphabet. 3Di describes discretized tertiary residue-residue interactions, which is critical for reaching high sensitivities. Foldseek's novel local alignment stage combines structural and amino acid substitution scores to improve sensitivity without sacrificing speed. It reaches sensitivities similar to state-of-the-art structural aligners while being at least 20,000 times faster. The open-source Foldseek software is available at foldseek.com and a webserver at search.foldseek.com.
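
The idea of a local alignment that combines structural-state and amino acid scores can be sketched in Python as follows. This is a toy illustration, not Foldseek's implementation: the real method uses substitution matrices for the 3Di alphabet and for amino acids, whereas this sketch uses a plain match/mismatch score, arbitrary letters for structural states, a linear gap penalty, and hypothetical weights.

```python
def combined_score(aa_x, aa_y, st_x, st_y, aa_weight=1.0, st_weight=2.0):
    """Toy substitution score: weighted agreement of amino acid and structural state."""
    aa = 1.0 if aa_x == aa_y else -1.0
    st = 1.0 if st_x == st_y else -1.0
    return aa_weight * aa + st_weight * st

def local_alignment_score(aa_a, st_a, aa_b, st_b, gap=2.0):
    """Smith-Waterman score with a linear gap penalty.

    Each protein is given as two strings of equal length: its amino acid
    sequence and one structural-state letter per residue.
    """
    if len(aa_a) != len(st_a) or len(aa_b) != len(st_b):
        raise ValueError("each protein needs one structural state per residue")
    prev = [0.0] * (len(aa_b) + 1)
    best = 0.0
    for i in range(1, len(aa_a) + 1):
        cur = [0.0]
        for j in range(1, len(aa_b) + 1):
            diag = prev[j - 1] + combined_score(
                aa_a[i - 1], aa_b[j - 1], st_a[i - 1], st_b[j - 1]
            )
            # Local alignment: a cell never drops below zero.
            cell = max(0.0, diag, prev[j] - gap, cur[j - 1] - gap)
            cur.append(cell)
            best = max(best, cell)
        prev = cur
    return best
```

With these toy weights, two proteins that share structural states but no amino acids still obtain a positive score, which is the behavior the combined scoring is meant to illustrate.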

14:15-14:35
Characterizing and explaining impact of disease-associated mutations in proteins without known structures or structural homologues
Room: EH
Format: Live from venue

Moderator(s): Lenore Cowen

  • Neeladri Sen, Institute of Structural and Molecular Biology, University College London, United Kingdom
  • Ivan Anishchenko, Institute for Protein Design, University of Washington, United States
  • Nicola Bordin, Institute of Structural and Molecular Biology, University College London, United Kingdom
  • Ian Sillitoe, Institute of Structural and Molecular Biology, University College London, United Kingdom
  • Sameer Velankar, Protein Data Bank in Europe, European Molecular Biology Laboratory, European Bioinformatics Institute, United Kingdom
  • David Baker, Institute for Protein Design, University of Washington, United States
  • Christine Orengo, Institute of Structural and Molecular Biology, University College London, United Kingdom


Presentation Overview:

The structure of proteins can help understand the mechanism of diseases associated with missense mutations and help develop therapeutics. With improved deep learning techniques such as RoseTTAFold and AlphaFold, we can predict the structure of proteins even in the absence of structural homologues. We modelled and extracted the domains from 553 disease-associated human proteins without known protein structures or sequential homologues in the Protein Data Bank. Domains that could be assigned to CATH superfamilies had higher quality and lower RMSD between AlphaFold and RoseTTAFold models compared to those that could only be assigned to Pfam or neither. Using these models, we predicted ligand-binding sites, protein-protein interfaces, conserved residues, destabilising effects, and pathogenicity caused by missense mutations. We could explain 80% of these disease-associated mutations based on proximity to functional sites, structural destabilization, or pathogenicity. These mutations were more buried, more pathogenic, closer to predicted functional sites, and had higher predicted ddG of mutation compared to polymorphisms. Using models from the two state-of-the-art techniques, and having multiple predictors agree that the same mutation has an effect, provides higher confidence in our predictions. We explain 93 additional mutations based on RoseTTAFold models which could not be explained based solely on AlphaFold models.

14:35-14:55
Opening the door for ultra-fast in-silico structure mutation
Room: EH
Format: Live-stream

Moderator(s): Lenore Cowen

  • Konstantin Weissenow, Technical University of Munich, Germany
  • Michael Heinzinger, Technical University of Munich, Germany
  • Burkhard Rost, Technical University of Munich, Germany


Presentation Overview:

State-of-the-art structure prediction systems such as AlphaFold2 and RoseTTAFold recently achieved remarkable performance in full-atom prediction of protein 3D structure, essentially solving the protein structure prediction problem for most proteins. However, even this pinnacle of 50 years of research has shortcomings: all state-of-the-art (SOTA) methods rely on correlated mutations captured by multiple sequence alignments (MSAs). The extensive database searches needed to create MSAs significantly increase the total time required to obtain structure predictions from these models. Irrespective of runtime, the quality of predicted 3D structures relies heavily on the size and diversity of sequences within the MSA. In addition, predictions based on MSAs are family-averaged and not protein-specific, rendering those models less sensitive to the effects of mutations.

In this work we present a novel structure prediction system trained solely on sequence representations from protein language models. Our model predicts protein structure orders of magnitude faster than current state-of-the-art systems, allowing high-throughput structure mutation experiments that are computationally too expensive with existing systems. While the overall quality of predictions does not reach the level of AlphaFold2, we show that our system is considerably more sensitive to point mutations than family-averaged predictors.

14:55-15:15
Dynamical study of viral glycoproteins and evolutionary fitness simulation
Room: EH
Format: Live from venue

Moderator(s): Lenore Cowen

  • Natalia Fagundes Borges Teruel, UdeM: Université de Montréal, Canada
  • Olivier Mailhot, UdeM: Université de Montréal, Canada
  • Rafael Najmanovich, UdeM: Université de Montréal, Canada


Presentation Overview:

Several viral glycoproteins undergo conformational changes that are fundamental to infection. The SARS-CoV-2 Spike protein is of particular importance during the current pandemic. This protein interacts with the human angiotensin-converting enzyme 2 (ACE2) receptor as part of the viral entry mechanism. To do so, the receptor-binding domain (RBD) of Spike needs to be in an open-state conformation. Here we use coarse-grained Normal Mode Analysis to model the dynamics of SARS-CoV-2 Spike protein variants as well as the transition probabilities between open and closed conformations. We performed 17081 possible in silico single mutations of Spike to determine positions and mutations that may affect the occupancy of the conformational states. Based on that, we successfully predicted some of the main mutations that constitute the Alpha, Beta and Gamma variants. We also built a simplified model for binding evaluation, validated with experimental data on the binding between RBD mutants and ACE2. This model is now being applied to the evaluation of interfaces between conformational ensembles of Spike and antibody structures, with preliminary results offering a consensus among the various experimentally determined interfaces. Our goal is to propose a method to evaluate mutants that integrates dynamics, binding, and immune escape.

15:45-16:05
Paragraph - Antibody paratope prediction using Graph Neural Networks with minimal feature vectors
Room: EH
Format: Live from venue

Moderator(s): Yu (Brandon) Xia

  • Lewis Chinery, University of Oxford, United Kingdom
  • Charlotte Deane, University of Oxford, United Kingdom
  • Newton Wahome, GSK, United States
  • Iain Moal, GSK, United Kingdom


Presentation Overview:

The development of new vaccines and antibody therapeutics typically takes several years and requires over $1bn in investment. Accurate knowledge of the paratope (antibody binding site) can speed up and reduce the cost of this process by improving our understanding of antibody-antigen binding.

We present Paragraph, an open-source structure-based paratope prediction tool that outperforms current state-of-the-art tools using simpler feature vectors and no antigen information. Representing the antibody variable region as a graph, Paragraph uses equivariant graph neural network layers to predict the probability of each residue belonging to the paratope.

Given the lack of readily available antibody crystal data, it is essential that structure-based prediction tools work on model structures. As such, all our results are on models.

In addition to improving paratope prediction accuracy, we also identify issues with currently used benchmark datasets and metrics. To overcome these, we develop a larger, cleaner dataset to be used in future efforts and suggest metrics well-suited to evaluating highly class-imbalanced problems.

Paragraph achieves a PR AUC of 0.725 on ABlooper model structures of our expanded dataset. Promisingly, Paragraph’s performance increases with model confidence, suggesting our accuracy may rise with future improvements to antibody structure prediction.
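
For readers unfamiliar with the metric, the following Python sketch shows one common way to estimate the area under the precision-recall curve, namely average precision. The abstract does not state how the reported PR AUC was computed, so the choice of estimator and the function name are assumptions. Tied scores are not treated specially in this simple version.

```python
def average_precision(labels, scores):
    """Average precision: a step-wise estimate of the area under the PR curve.

    `labels` are 0/1 (1 = residue belongs to the positive class) and
    `scores` are predicted probabilities, in the same order.
    """
    positives = sum(labels)
    if positives == 0:
        raise ValueError("at least one positive label is required")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            total += hits / rank  # precision at the rank of each positive
    return total / positives
```

Unlike ROC AUC, the score expected from a random ranking is roughly the fraction of positives rather than 0.5, which is why precision-recall metrics are often preferred when positives are rare.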

16:05-16:25
CSM-epitope: linear B-cell epitope prediction using graph-based signatures and interpretable machine learning
Room: EH
Format: Live from venue

Moderator(s): Yu (Brandon) Xia

  • Bruna Moreira da Silva, The University of Melbourne, Australia
  • David Ascher, The University of Melbourne, Australia
  • Douglas Pires, The University of Melbourne, Australia


Presentation Overview:

Linear B-cell epitopes are a class of antigenic determinants that can bind to B-cell receptors or antibodies released by the adaptive immune system. Both classes of epitope, the continuous (or linear) and the discontinuous, exist only upon detection and binding of the antigen by an antibody. In a scalable and less expensive way, computational approaches aim to contribute to the development of epitope-based vaccines and immunotherapies by identifying, from a protein sequence, which residues are more likely to be part of an epitope.
A variety of prediction methods have been developed over the years; however, their reliability for clinical applications is still questionable given their medium to low performance (Matthews correlation coefficients ranging from 0.32 to 0.62). Additionally, current machine learning models lack interpretability, limiting the biological insights that could otherwise be obtained. Here, we introduce CSM-epitope, an interpretable machine learning method capable of accurately identifying linear B-cell epitopes by leveraging a new graph-based signature representation of protein sequences, based on our well-established CSM (Cutoff Scanning Matrix) algorithm.
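
As a reminder of how the Matthews correlation coefficient quoted above is defined, here is a minimal Python sketch that computes it from the four cells of a binary confusion matrix. The function name is illustrative, and returning 0.0 when the denominator vanishes is a common convention assumed here rather than something stated in the abstract.

```python
import math

def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient from binary confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # assumed convention when a row or column of the matrix is empty
    return (tp * tn - fp * fn) / denom
```

The coefficient ranges from -1 (total disagreement) through 0 (no better than chance) to 1 (perfect prediction), so the 0.32 to 0.62 range cited above sits well short of perfect agreement.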

16:25-16:45
Population missense variants in human ACE2 strongly affect binding to SARS-CoV-2 Spike: A case study in affinity predictions of interface variants
Room: EH
Format: Live from venue

Moderator(s): Yu (Brandon) Xia

  • Stuart A. MacGowan, University of Dundee, United Kingdom
  • Michael I. Barton, Sir William Dunn School of Pathology, University of Oxford, United Kingdom
  • Mikhail Kutuzov, Sir William Dunn School of Pathology, University of Oxford, United Kingdom
  • Omer Dushek, University of Oxford, United Kingdom
  • P. Anton van der Merwe, Sir William Dunn School of Pathology, University of Oxford, United Kingdom
  • Geoffrey J. Barton, University of Dundee, United Kingdom


Presentation Overview:

SARS-CoV-2 infection manifests a range of clinical presentations from mild illness to life-threatening disease. As a mediator of viral entry, ACE2 is an a priori candidate genetic risk factor. The affinity of SARS-CoV-2 Spike for ACE2 is a key parameter influencing host-range and tropism and so we determined the affinities of several reported ACE2 population variants experimentally and predicted the effects of many more. We found ACE2 alleles that strongly inhibited binding to Spike and some with moderately increased affinity. Comparison to recent infectivity studies indicates that the affinity ranges of ACE2 variants can protect cells from infection and so some almost certainly confer resistance to carriers; this is now being tested with clinical data. We will also highlight the strengths and weaknesses of current generation predictors, and present new results on the interplay between ACE2 variants and different SARS-CoV-2 strains.