The objective of our group is to predict aspects of protein function and structure from sequence. The wealth of evolutionary information available through comparing the whole bio-diversity of species makes such an ambitious goal achievable. Our particular niche is the combination of evolutionary information (EI) with machine learning (ML) and artificial intelligence (AI). Thirty years ago, the marriage of machine learning and evolutionary information (in the form of multiple sequence alignments) allowed a breakthrough in secondary structure prediction. The same principle has underlain all state-of-the-art predictions of protein structure and function since, and is also at the root of the program that broke through in protein structure prediction, namely AlphaFold2.
Over the last two years, it has become possible to deep-learn the language of life written in proteins through protein Language Models (pLMs). The information extracted by the pLMs is then transferred to supervised learning of protein predictions from annotations. I will present three new methods that use pLMs to predict protein structure (1D: secondary structure, membrane regions, and disorder; 2D: inter-residue distances/contacts; 3D: coordinates), protein function (sub-cellular location, binding residues, GO terms), and the effects of sequence variation. The pLM embeddings allow some applications to reach, and others to surpass, the state of the art without using evolutionary information.
Crucial in all of this is the understanding of the AI and the control of database bias. For both, computational biology could serve as a sandbox to prepare more sensitive applications of AI in society.
Motivation: The accuracy gap between predicted and experimental structures has been significantly reduced following the development of AlphaFold2 (AF2). However, for many targets, AF2 models still have room for improvement. In previous CASP experiments, highly computationally intensive MD simulation-based methods have been widely used to improve the accuracy of single 3D models.
Here, our ReFOLD pipeline was adapted to refine AF2 predictions while maintaining high model accuracy at a modest computational cost. Furthermore, the AF2 recycling process was utilised to improve 3D models by using them as custom template inputs for tertiary and quaternary structure predictions.
Results: According to the MolProbity score, 94% of the 3D models generated by ReFOLD were improved. AF2 recycling showed an improvement rate of 87.5% (using MSAs) and 81.25% (using single sequences) for monomeric AF2 models and 100% (MSA) and 97.8% (single sequence) for monomeric non-AF2 models, as measured by the average change in lDDT. By the same measure, the recycling of multimeric models showed an improvement rate of as much as 80% for AF2-Multimer (AF2M) models and 94% for non-AF2M models.
Availability: Refinement using AlphaFold2-Multimer recycling is available as part of the MultiFOLD docker package (https://hub.docker.com/r/mcguffin/multifold). The ReFOLD server is available at https://www.reading.ac.uk/bioinf/ReFOLD/
Immune receptor proteins play a key role in the immune system and have shown great promise as biotherapeutics. The structure of these proteins is critical for understanding their antigen-binding properties. Here, we present ImmuneBuilder, a set of deep learning models trained to accurately predict the structure of antibodies (ABodyBuilder2), nanobodies (NanoBodyBuilder2) and T-cell receptors (TCRBuilder2). We show that ImmuneBuilder generates structures with state-of-the-art accuracy while being far faster than AlphaFold2. For example, on a benchmark of 34 recently solved antibodies, ABodyBuilder2 predicts CDR-H3 loops with an RMSD of 2.81Å, a 0.09Å improvement over AlphaFold-Multimer, while being over a hundred times faster. Similar results are also achieved for nanobodies (NanoBodyBuilder2 predicts CDR-H3 loops with an average RMSD of 2.89Å, a 0.55Å improvement over AlphaFold2) and TCRs. By predicting an ensemble of structures, ImmuneBuilder also gives an error estimate for every residue in its final prediction. ImmuneBuilder is made freely available, both to download (https://github.com/oxpig/ImmuneBuilder) and to use via our webserver (http://opig.stats.ox.ac.uk/webapps/newsabdab/sabpred). We also make available structural models for ≈150 thousand non-redundant paired antibody sequences (https://zenodo.org/record/7258553).
AlphaFold2 produces impressively accurate predictions of protein structures. Most evaluations of the method to date have focused on model accuracy on individual protein domains, with relatively less attention paid to accuracy of inter-domain arrangements in multi-domain proteins. Here, we examine AlphaFold2’s performance in accurately predicting structures for multi-domain proteins. Using multi-domain models in the AlphaFold database with known experimental structures, we assess model accuracy in relation to template availability at the time of prediction. We also develop means to assess the accuracy of domain arrangements that are not represented in AlphaFold2’s training set. We find that although AlphaFold2 exhibits high performance overall, there is clear room for improvement in multi-domain structure prediction, particularly on longer proteins bearing specific domain interactions that were not observed during training. For some multi-domain targets on which poor performance is seen, we show how improved accuracy may be obtained using AlphaFold2 with non-standard running protocols.
Coiled-coil domains (CCDs) are found in proteins in all kingdoms of life. They perform a wide range of important cellular functions. Canonical CCDs consist of intertwined alpha helices containing heptad repeats (labeled abcdefg, the so-called registers) with constrained pairing. CCDs are classified according to the number and orientation of the α‐helices involved, i.e., by their oligomerization state. The importance of CCDs demands computational methods for predicting the presence and localization of CCDs, including registers, and their oligomerization state. Here we present CoCoNat (https://coconat.biocomp.unibo.it), a novel deep-learning-based method for predicting CCD regions, registers and oligomerization state. Our method, for the first time, adopts a sequence encoding based on two state-of-the-art protein Language Models (pLMs): ProtT5 and ESM1-b. The pLM embeddings are processed by a three-step architecture comprising a deep network, a conditional random field and a single-layer feed-forward network. We trained CoCoNat on a dataset comprising 2191 proteins containing CCDs and 9040 proteins without CCDs. When tested on a blind test set comprising 429 CCD and 278 non-CCD proteins, CoCoNat outperforms the current state of the art in residue-level and segment-level CCD detection, register annotation, and oligomerization state prediction.
3DSIG has come a long way since its start as a satellite meeting. Now as a COSI we look towards the future when 3DSIG will be a natural place to assemble and exchange ideas throughout the year. Come hear our plans, and give your suggestions.
The AlphaFold databases of 214M UniProt proteins and the ESMatlas catalog of nearly 700M metagenomic protein structures provide valuable resources to the community. However, their extensive sizes of 23TB and 15TB, respectively, exceed the capacity of standard workstations and pose a challenge even to well-equipped cluster environments.
To address this issue, we introduce Foldcomp, a novel compression algorithm that encodes the torsion and bond angles in a compact binary format, named FCZ. Foldcomp achieves up to 90% compression compared to float-encoded 3D coordinates, requiring only 13 bytes per residue. Reconstruction of original coordinates is accomplished by utilizing the NeRF algorithm with internal anchor points. By averaging bi-directional reconstructed coordinates, we were able to reduce reconstruction loss to ~0.08Å range. Our method is as fast as gzip, with 3ms and 6ms for compression and decompression, respectively.
Foldcomp is available as a command line interface and a Python API at https://foldcomp.foldseek.com. Additionally, Foldcomp has been augmented by community contributions, such as a PyMOL plugin and a dataset wrapper in Graphein. We provide compressed versions of the AlphaFold database (1.1TB), ESMatlas (1.8TB), SwissProt (2.9GB), and the recently released AlphaFold2 cluster representatives (2.2GB) at https://foldcomp.steineggerlab.workers.dev. Foldcomp is published at https://doi.org/10.1093/bioinformatics/btad153.
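The reconstruction step mentioned in this abstract relies on NeRF, which places each new atom from a bond length, a bond angle and a torsion angle relative to the three preceding atoms. The following is a minimal, generic sketch of that placement rule in plain Python; it is not Foldcomp's code, and the function names (place_atom, dihedral) and the tuple-based coordinate layout are assumptions made for illustration. Foldcomp's internal anchor points and bi-directional averaging are not reproduced here.

```python
import math

def _sub(u, v):
    return (u[0] - v[0], u[1] - v[1], u[2] - v[2])

def _cross(u, v):
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])

def _dot(u, v):
    return u[0] * v[0] + u[1] * v[1] + u[2] * v[2]

def _unit(u):
    n = math.sqrt(_dot(u, u))
    return (u[0] / n, u[1] / n, u[2] / n)

def place_atom(a, b, c, bond_length, bond_angle, torsion):
    """Place atom d after atoms a, b, c (hypothetical helper).

    bond_length is |c-d|, bond_angle is the angle b-c-d and torsion is the
    dihedral a-b-c-d, both in radians.
    """
    bc = _unit(_sub(c, b))                 # local x axis, along the b->c bond
    n = _unit(_cross(_sub(b, a), bc))      # normal of the a-b-c plane
    m = _cross(n, bc)                      # completes the orthonormal frame
    # Coordinates of d in the local frame centred on c.
    x = -bond_length * math.cos(bond_angle)
    y = bond_length * math.sin(bond_angle) * math.cos(torsion)
    z = bond_length * math.sin(bond_angle) * math.sin(torsion)
    return tuple(c[i] + x * bc[i] + y * m[i] + z * n[i] for i in range(3))

def dihedral(a, b, c, d):
    """Signed dihedral angle a-b-c-d in radians, used to check the round trip."""
    b1, b2, b3 = _sub(b, a), _sub(c, b), _sub(d, c)
    n1, n2 = _cross(b1, b2), _cross(b2, b3)
    return math.atan2(math.sqrt(_dot(b2, b2)) * _dot(b1, n2), _dot(n1, n2))

# Example: rebuild a fourth atom from made-up internal coordinates.
a, b, c = (0.3, 1.2, -0.4), (0.0, 0.0, 0.0), (1.5, 0.2, 0.1)
d = place_atom(a, b, c, bond_length=1.33, bond_angle=2.0, torsion=-1.1)
```

Because each atom depends on the previous three, rounding errors in the stored angles accumulate along the chain, which is consistent with the abstract's use of anchor points and averaging of reconstructions from both directions.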
We present and demonstrate the usage of ProteinShake, a new Python software package that supports deep learning model development on protein 3D structure data by harmonizing the fundamental steps of data processing and model evaluation. The package abstracts away large amounts of boilerplate processing code for downloading, annotating, parsing, filtering, and splitting protein 3D structure files. This allows for rapid creation of new datasets and benchmark tasks for biological applications. Alongside ProteinShake we host a database of pre-processed datasets and evaluation tasks for supervised and self-supervised learning. ProteinShake drastically simplifies access to protein structure data, enabling rapid prototyping and reproducible model evaluation. We intend to serve the growing community of machine learning researchers aiming to expand their models to challenging biological domains. ProteinShake seamlessly integrates with all common deep learning frameworks and converts protein structures to point clouds, graphs, and voxel grids. The package is available on PyPI and at borgwardtlab.github.io/proteinshake
AlphaFold's remarkable success in predicting protein structures with near-experimental accuracy has revolutionized the field of structural bioinformatics.
Despite its success, there is ongoing debate over whether AlphaFold's achievement is due to a better understanding of the physics of protein folding. Some researchers have raised skepticism, citing AlphaFold's prediction of side groups with a dangling orientation that suggests a bound zinc ion, despite no input of such information in the program. Others, however, suggest that good protein design facilitated by AlphaFold or better decoy ranking could imply a deeper understanding of the physics of protein folding.
In this study, we investigated AlphaFold's understanding of physics in two ways. First, we compared AlphaFold's ability to predict distant homologous structures to that of the threading algorithm MUSTER. We found that AlphaFold's correlation between predicted and actual structures was significantly worse than MUSTER, indicating a weaker understanding of protein folding physics. Second, we attempted to use AlphaFold metrics to predict the impact of single mutations on protein stability, but found no significant correlation, further supporting the idea that AlphaFold does not know the physics of protein folding.
It's now known that the function of non-coding RNAs is largely determined by their structure. In turn, the folding of the RNA structure is dictated by its sequence. The high structural complexity of non-coding RNAs can be the result of multiple recombination events during the RNA world period. Recently, examples of such natural recombinations preserving the structure and function were discovered for hammerhead ribozymes. However, the studies of RNA sequence rearrangements are highly limited by the computational complexity of the problem. Here, we present a program CoRToise (Computational RNA Topoisomerase) for finding potential backbone rearrangement sites in RNA 3D structures and performing sequence permutations. CoRToise limits the search for potential breakpoints to spatially close O3′-P atom pairs of residues distant in sequence and permutes the RNA sequence/structure fragments by reconnecting them exhaustively. Using the tool we found several cases of pseudoknotted RNA structures turning into nested structures after a series of rearrangements, and multiple reciprocal cases. CoRToise can be utilized to explore the landscape of permuted variants of a given structure and find the most stable variant possibly more functionally efficient. Such a tool will be useful in the analysis of the RNA sequence-structure-function relationships and in RNA design.
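The breakpoint search described in this abstract (spatially close O3′-P atom pairs from residues that are distant in sequence) can be illustrated with a short sketch. This is not CoRToise's implementation: the function name, the dictionary layout of a residue, and both thresholds (max_dist in Å, min_seq_sep in residues) are hypothetical values chosen only for the example.

```python
import math

def close_o3p_p_pairs(residues, max_dist=4.0, min_seq_sep=4):
    """Return (i, j, distance) for O3' of residue i lying near P of residue j.

    residues is a list of dicts with "O3'" and "P" coordinates (assumed layout).
    Pairs closer than min_seq_sep in sequence are skipped, so the native
    i -> i+1 backbone linkage is never reported.
    """
    pairs = []
    for i, res_i in enumerate(residues):
        for j, res_j in enumerate(residues):
            if abs(i - j) < min_seq_sep:
                continue
            dist = math.dist(res_i["O3'"], res_j["P"])
            if dist <= max_dist:
                pairs.append((i, j, dist))
    return pairs

# Toy chain: five residues along a line, plus a sixth whose P atom folds back
# next to the O3' atom of the first residue.
toy_chain = [{"P": (6.0 * k, 0.0, 0.0), "O3'": (6.0 * k + 4.0, 0.0, 0.0)} for k in range(5)]
toy_chain.append({"P": (4.0, 2.0, 0.0), "O3'": (4.0, 2.0, 30.0)})
candidate_sites = close_o3p_p_pairs(toy_chain)
```

Each reported pair marks a place where the backbone could in principle be reconnected; the exhaustive reconnection of fragments described in the abstract would then operate on these candidates.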
The spatial structure of non-coding RNAs is crucial in determining their functions. It is known that RNA structures are modular, which allows us to consider their structure as a composition of building blocks called tertiary motifs. The ability to recognize and search for similar motifs by superimposing structures on each other is one of the main tasks of structural biology. Here we present the ARTEM tool for sequence-, topology-, and annotation-independent superposition of two arbitrary RNA 3D structures. The algorithm used in ARTEM allows finding similar motifs in topologically different structural contexts and handles both local and long-range motifs (formed by loops and helices distant in sequence or from different chains). To demonstrate the capabilities of ARTEM we performed a search for RNA 3D motifs structurally similar to the D-loop/T-loop interaction motif from tRNA. We found D-loop/T-loop-like motifs in tRNA, Y RNA, viral tRNA-like UTR, Hatchet ribozyme, RNase P, and several riboswitches, as well as numerous matches in archaeal, bacterial, and eukaryotic rRNAs. We believe ARTEM will have a significant impact in the field of RNA structural studies, especially in the comparative analysis of RNA structures and RNA-containing complexes.
Motivation: The 3D structures of RNA play a critical role in understanding their functionalities. There exist several computational methods to study RNA 3D structures by identifying structural motifs and categorizing them into several motif families based on their structures. Although the number of such motif families is not limited, a few are well-studied. Among these structural motif families, there exist several families that are visually similar or very close in structure, even with different base interactions. Alternatively, some motif families share a set of base interactions but maintain variation in their 3D formations. These similarities among different motif families, if known, can provide a better insight into the RNA 3D structural motifs as well as their characteristic functions in cell biology.
Results: In this work, we propose a method, RNAMotifComp, that analyzes the instances of well-known structural motif families and establishes a relational graph among them. We have also designed a method to visualize the relational graph where the families are shown as nodes and their similarity information is represented as edges. We validated our discovered correlations of the motif families using RNAMotifContrast. Additionally, we used a basic Naïve Bayes classifier to show the importance of RNAMotifComp. The relational analysis explains the functional analogies of divergent motif families and illustrates the situations where the motifs of disparate families are predicted to be of the same family.
GPCRs transduce extracellular signals to intracellular pathways by coupling with heterotrimeric G-proteins categorized as Gs, Gi/o, Gq/11, and G12/13 based on their α-subunits. To understand the sequence-based coupling selectivity we created a new machine learning predictor, PRECOGx. It is based on protein language models that encode structural and functional information of protein sequences. The ESM1b protein embeddings of GPCRs are used as features. It predicts GPCR interactions with G proteins and β-arrestins. It outperformed its predecessor, PRECOG, in predicting GPCR-transducer couplings, and is also able to consider all GPCR classes. To explore the structural determinants of G-protein-coupling selectivity, we analyzed 362 available 3D structures of GPCR-G-protein complexes. Analysis of the residue contacts at the interfaces revealed a network of secondary structure elements that elucidated new and known structural features that determine coupling specificity. Through RMSD calculations focusing on the docking mode of the G-protein α-subunits with respect to the receptor, we show that Gs-GPCR complexes have more structural constraint and a smaller range of docking poses than Gi/o-GPCR complexes. Binding interface energy calculations showed that structural properties of the complexes contribute to higher stability of Gs compared to Gi/o complexes.
Structure-based drug design is powerful in discovering and optimising novel therapeutics and small molecule probes. There are still significant challenges in scaling the analysis of all available structures for a given protein target and using the resulting data efficiently and sensibly. Exscientia has developed several scalable automated bioinformatics workflows which incorporate successful methodologies (e.g. Fragment Hotspot Maps) and leverage the scale of our cloud-based infrastructure to support the development of novel, precision-engineered drugs. Here, we describe two applications of such tools.
First, our target tractability assessment process captures a global target profile drawn from all available structural data. Remarkably, our target tractability platform can scale to proteome-wide analysis in a few hours and for less than the cost of a mobile phone. Second, we introduce our work on structure-guided automatic generation of designs, an end-to-end pipeline to produce ready-for-synthesis hit-like molecules. We present our analysis across all kinome structures within the AlphaFold Database, showcasing early experimental validation on two kinase targets, DYRK1B and PKD1. Half of the synthesised hits were active, with our workflows finding at least one low-nanomolar hit per target. Lastly, we will highlight future areas of improvement such as incorporating conformational ensembles to account for protein flexibility.
Motivation: Deep learning-based molecule generation has become a new paradigm of de novo molecule design since it enables fast and directional exploration in the vast chemical space. However, it is still an open issue to generate molecules that bind to specific proteins with high binding affinities while having the desired drug-like physicochemical properties.
Results: To address these issues, we elaborate a novel framework for controllable protein-oriented molecule generation, named CProMG, which contains a 3-D protein embedding module, a dual-view protein encoder, a molecule embedding module, and a novel drug-like molecule decoder. Based on fusing the hierarchical views of proteins, it enhances the representation of protein binding pockets significantly by associating amino acid residues with their comprising atoms. Through jointly embedding molecule sequences, their drug-like properties, and binding affinities w.r.t. proteins, it autoregressively generates novel molecules having specific properties in a controllable manner by measuring the proximity of molecule tokens to protein residues and atoms. The comparison with state-of-the-art deep generative methods demonstrates the superiority of our CProMG. Furthermore, the progressive control of properties demonstrates the effectiveness of CProMG when controlling binding affinity and drug-like properties. The ablation studies then reveal how each of its crucial components contributes to the model, including hierarchical protein views, Laplacian position encoding, and property control. Last, a protein-specific case study illustrates the novelty of CProMG and its ability to capture crucial interactions between protein pockets and molecules. It is anticipated that this work can boost de novo molecule design.
Cellular functions are governed by molecular machines that assemble through protein-protein interactions. Their atomic details are critical to studying their molecular mechanisms. Today the structure of virtually all individual proteins is available from predictions using AlphaFold. However, these predictions are limited to individual chains and do not include interactions. In this talk I will describe our attempts to increase the structural coverage of protein-protein interactions. Today fewer than 5% of hundreds of thousands of human protein interactions have been structurally characterised. We show that combining predictions and experiments can orthogonally confirm higher-confidence models, and using AlphaFold2, we have built 3,137 high-confidence models, of which 1,371 have no homology to a known structure. We are exploring rapid methods to identify protein interaction networks. Finally, we show how the predicted binary complexes can be used to build much larger assemblies using a Monte Carlo tree search method.
In this study, we mined the PDB and created a structural library of 178,465 interfaces that mediate protein–protein/domain–domain interactions. Interfaces involving the same CATH fold(s) were clustered together. Our analysis of the library reveals similarities between chain–chain and domain–domain interactions. The library also illustrates how a single protein fold can interact with multiple folds using similar interfaces. Analyzing the data in the library reveals interesting aspects, such as how proteins belonging to folds that interact with many other folds also have a high number of Enzyme Commission terms. We constructed a statistical potential of pair preferences of amino acids across the interface for chain–chain and domain–domain interactions separately. The two potentials are quite similar, further lending credence to the notion that domain–domain interfaces could be used to study chain–chain interactions. We analyzed protein complexes modeled by AlphaFold2 and RoseTTAFold and noticed that some of the modes of interaction involve folds and interfaces that have not been observed in the PDB. The library includes predicted small molecule-binding sites at protein–protein interfaces. These interfaces containing small molecule-binding sites can be easily targeted to prevent the interaction and perhaps form a part of a therapeutic strategy.
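A statistical potential of amino acid pair preferences, as mentioned in this abstract, is commonly built as a log-odds score of observed versus expected contact frequencies (the inverse Boltzmann approach). The sketch below shows that generic construction only; the abstract does not specify the authors' reference state or normalisation, so the expected-frequency model, the function name pair_potential and the toy contact list are all assumptions.

```python
import math
from collections import Counter

def pair_potential(contacts):
    """Log-odds pair potential (in units of kT) from interface contacts.

    contacts is a list of (residue_type_1, residue_type_2) tuples, one per
    contact across the interface. Negative values mean the pair is seen more
    often than expected from the residue composition alone.
    """
    pair_counts = Counter(tuple(sorted(pair)) for pair in contacts)
    single_counts = Counter()
    for first, second in contacts:
        single_counts[first] += 1
        single_counts[second] += 1
    total_pairs = sum(pair_counts.values())
    total_singles = sum(single_counts.values())

    potential = {}
    for (first, second), count in pair_counts.items():
        observed = count / total_pairs
        freq_first = single_counts[first] / total_singles
        freq_second = single_counts[second] / total_singles
        # Unordered pairs of different types can arise in two ways.
        expected = freq_first * freq_second * (1 if first == second else 2)
        potential[(first, second)] = -math.log(observed / expected)
    return potential

# Toy data: salt-bridge-like K/E contacts are over-represented.
toy_contacts = [("K", "E"), ("E", "K"), ("L", "L"), ("K", "L")]
toy_potential = pair_potential(toy_contacts)
```

Computing one such table from chain–chain contacts and another from domain–domain contacts would allow the kind of comparison the abstract describes, for example by correlating the two sets of pair scores.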
The artificial intelligence-based structure prediction program AlphaFold-Multimer enabled structural modelling of protein complexes with unprecedented accuracy. However, high-throughput protein–protein interaction (PPI) screens and modelling of protein complexes using AlphaFold-Multimer are still challenging, not only because of its high demand for computing resources but also because of its relatively poorer performance on cross-species protein complexes, such as viral-host interaction complexes. I will first present AlphaPulldown, a Python package that streamlines PPI screens and high-throughput modelling of higher-order oligomers using AlphaFold-Multimer, and demonstrate some successful applications of AlphaPulldown in looking for possible PPIs. Then I will outline the latest updates of AlphaPulldown that aim to improve AlphaFold-Multimer’s performance in modelling viral-host protein complexes. Lastly, I will demonstrate other new features we have added to the package, which improve usability, speed, and result interpretation. These additional auxiliary features will accommodate a wider range of users’ needs.
Overall, our work improved the computing efficiency of running AlphaFold-Multimer and provides a convenient command-line interface, a variety of confidence scores and a graphical analysis interface. The recent development and extension of AlphaPulldown will help the community build models for more challenging protein complexes with adequate auxiliary tools.
Enzymes play a fundamental role in almost all biotechnological and biopharmaceutical processes. Despite all the efforts made to decipher the interplay between activity and stability, two key characteristics of enzymes, it is still difficult to understand their relationship as well as how evolution and environmental conditions shape it.
To further investigate this question, we collected six hundred enzymes with known structures and catalytic sites. Using the formalism of statistical potentials, we computed the contribution of each residue to the enzyme's folding free energy and studied its dependence on the residue distance from the closest catalytic site. We discovered an interesting pattern of stability in the catalytic region, consisting of an energetic compensation between the catalytic residues, which are usually stability weaknesses, and their neighboring residues, which are rather stability strengths. We also compared stability patterns in psychrophilic and thermophilic enzymes, and found more pronounced stability weaknesses in the catalytic sites of psychrophilic enzymes than in those of thermophilic enzymes. This work provides interesting information on the stability and activity properties of enzymes that could be exploited to improve enzyme design methods.
The quantification of biomolecular interactions is crucial to understand biological processes and to guide drug discovery and protein engineering. However, current methods for evaluating protein interfaces are complex and computationally expensive. This study introduces Surfaces, a simplified approach that uses a per-residue decomposition method prioritizing performance, with the SARS-CoV-2 Spike protein as a case study. Among existing computational approaches, methods that employ molecular dynamics (MD) simulations, such as free-energy perturbation (FEP) calculations, agree well with experimental measurements but are computationally demanding. In contrast, Surfaces, which uses a complementarity function (CF) based on atomic areas in contact, offers comparable performance with reduced computational cost, making large-scale applications feasible. Surfaces was applied to analyze a dataset of 738 structures of the Spike protein in complex with antibodies and mutations in complex with the receptor ACE2. The results of Surfaces provide insights into the contribution of individual residue-residue interactions to receptor binding and immune escape. In conclusion, Surfaces offers a simplified and effective approach for evaluating protein interfaces and understanding per-residue interaction contributions, making it a valuable tool for large-scale applications, including the study of viral glycoprotein evolution, particularly relevant in the ongoing SARS-CoV-2 pandemic.
High-quality computational structural models are now pre-computed and available for nearly every protein in UniProt. However, the best way to leverage these models to predict, in a high-throughput manner, which pairs of proteins interact is not immediately clear. The recent Foldseek method of van Kempen et al. encodes the structural information from the distances and angles along the protein backbone into a linear string of the same length as the protein sequence, using tokens from a 21-letter discretized structural alphabet. We show that when structural data is available, so that the Foldseek string can be produced efficiently, offering that string as an additional input to our recent Topsy-Turvy deep-learning method, which predicts PPIs across species solely from a pair of protein amino acid sequences, substantially improves performance. Thus our new method, TT3D, presents a way to reuse all the computational effort going into producing high-quality structural models from sequence, while remaining sufficiently lightweight that high-quality binary PPI predictions can be made genome-wide across all protein pairs.
Motivation: Proteins interact to form complexes to carry out essential biological functions. Computational methods such as AlphaFold-Multimer have been developed to predict the quaternary structures of protein complexes. An important yet largely unsolved challenge in protein complex structure prediction is to accurately estimate the quality of predicted protein complex structures without any knowledge of the corresponding native structures. Such estimations can then be used to select high-quality predicted complex structures to facilitate biomedical research such as protein function analysis and drug discovery.
Results: In this work, we introduce a new gated neighborhood-modulating graph transformer to predict the quality of 3D protein complex structures. It incorporates node and edge gates within a graph transformer framework to control information flow during graph message passing. We trained, evaluated and tested the method (called DProQA) on newly-curated protein complex datasets before the 15th Critical Assessment of Techniques for Protein Structure Prediction (CASP15) and then blindly tested it in the 2022 CASP15 experiment. The method was ranked 3rd among the single-model quality assessment methods in CASP15 in terms of the ranking loss of TM-score on 36 complex targets. The rigorous internal and external experiments demonstrate that DProQA is effective in ranking protein complex structures.
Availability: The source code, data, and pre-trained models are available at https://github.com/jianlin-cheng/DProQA
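The ranking loss used for the CASP15 evaluation above is usually understood as the gap between the true quality of the best available model and the true quality of the model that the method ranks first. The sketch below follows that common definition; it is an assumption about the metric rather than code from DProQA, and the function name and toy scores are made up.

```python
def ranking_loss(true_scores, predicted_scores):
    """Best true score in the pool minus the true score of the top-ranked model.

    true_scores holds e.g. the TM-score of each candidate model against the
    native structure; predicted_scores holds the quality estimates used for
    ranking. A loss of 0 means the best model was ranked first.
    """
    if len(true_scores) != len(predicted_scores) or not true_scores:
        raise ValueError("score lists must be non-empty and of equal length")
    top_ranked = max(range(len(predicted_scores)), key=predicted_scores.__getitem__)
    return max(true_scores) - true_scores[top_ranked]

# Toy target with three candidate models.
toy_true = [0.50, 0.90, 0.70]
toy_predicted = [0.20, 0.60, 0.80]
toy_loss = ranking_loss(toy_true, toy_predicted)
```

Averaging this quantity over all targets gives a per-method figure of the kind reported for the 36 complex targets; lower is better.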
Amyloid fibrils are aggregates consisting of many proteins of the same species. They are involved in many neurodegenerative diseases, such as Alzheimer's and Parkinson's disease. In this work, we investigate the role of the hydrophobic core and disordered flanks in amyloid fibril growth. We measure the growth rate of alpha-synuclein with and without disordered flanks, and compare our results to mathematical models to gain mechanistic insight. The experimental measurements show that alpha-synuclein grows significantly faster without disordered flanks, and the mathematical models suggest that this is due to secondary nucleation (initiation of growth of additional fibrils off the surface of the original fibril). Thus, the disordered flanks may play an important role in slowing down protein aggregation.
Additionally, we measure the interaction between alpha-synuclein and nanoPET, a type of nanoplastic. Nanoplastics are commonly found in nature through pollution, and in cosmetic products. Due to their high hydrophobicity, nanoplastics may facilitate amyloid fibril formation in a way similar to secondary nucleation. Here, we show that nanoPET interacts with the hydrophobic core of alpha-synuclein, and that the disordered flanks of alpha-synuclein prevent the formation of nanoPET clusters. Thus, these results indicate that nanoPET can enhance the formation of alpha-synuclein fibrils.
Understanding the effects of mutations on protein stability is crucial for variant interpretation and prioritisation, protein engineering, and biotechnology. Despite significant efforts, community assessments of predictive tools have highlighted ongoing limitations, including computational time, low predictive power, and biased predictions towards destabilising mutations. To fill this gap, we developed DDMut, a fast and accurate siamese network to predict changes in Gibbs Free Energy upon single and multiple point mutations, leveraging both forward and hypothetical reverse mutations to account for model anti-symmetry. Deep learning models were built by integrating graph-based representations of the localised 3D environment, with convolutional layers and transformer encoders. This combination better captured the distance patterns between atoms by extracting both short-range and long-range interactions. DDMut achieved Pearson's correlations of up to 0.70 (RMSE: 1.37 kcal/mol) on single point mutations, and 0.70 (RMSE: 1.84 kcal/mol) on double/triple mutations, outperforming most available methods across non-redundant blind test sets. Importantly, DDMut was highly scalable and demonstrated anti-symmetric performance on both destabilising and stabilising mutations. We believe DDMut will be a useful platform to better understand the functional consequences of mutations, and guide rational protein engineering. DDMut is freely available as a web server and API at https://biosig.lab.uq.edu.au/ddmut.
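Anti-symmetry in this context means that the predicted change in free energy for a mutation and for its hypothetical reverse should be equal in magnitude and opposite in sign. One common way to check this is to correlate forward and reverse predictions (ideal value -1) and to compute their mean bias (ideal value 0). The sketch below illustrates that check on made-up numbers; it is not part of DDMut or its API, and the function names are assumptions.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def antisymmetry_metrics(forward_ddg, reverse_ddg):
    """Return (correlation, bias) between forward and reverse predictions.

    A perfectly anti-symmetric predictor gives correlation -1 and bias 0,
    because each reverse value is the negative of its forward value.
    """
    correlation = pearson(forward_ddg, reverse_ddg)
    bias = sum(f + r for f, r in zip(forward_ddg, reverse_ddg)) / (2 * len(forward_ddg))
    return correlation, bias

# Made-up predictions (kcal/mol) for three mutations and their reverses.
toy_forward = [1.0, -0.5, 2.0]
toy_reverse = [-1.0, 0.5, -2.0]
toy_correlation, toy_bias = antisymmetry_metrics(toy_forward, toy_reverse)
```

A predictor biased towards destabilising mutations would show a non-zero bias and a correlation well above -1 in this test.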
Proteins, among them enzymes, have evolved naturally over billions of years. Yet, industrial, biotechnological, and medical applications often require enzymatic improvement. Protein design aims to achieve such enhancement in different aspects of enzyme performance. Engineering flexibility is one of the last frontiers in protein design: the effects of mutations on flexibility are not yet well understood. Cryptic relationships among distant residues (allostery) and the lack of a definition of a functional unit of flexibility are among the reasons for this difficulty. We recently engaged in this exciting field by showcasing and generalising the process of exchanging loops (dynamic structural elements) between homologous proteins, thereby transferring the dynamic behaviour from one protein to another. To this end, we designed LoopGrafter (https://loschmidt.chemi.muni.cz/loopgrafter/, doi: 10.1093/nar/gkac249), a web server that provides a step-by-step interactive procedure in which the user can successively identify loops in the input proteins, calculate their geometries, similarities and dynamics, and select loops to be transplanted. All chimeras derived from every possible recombination point are calculated, and 3D models are constructed and energetically evaluated for each of them. The results can be interactively visualised in a user-friendly graphical interface and downloaded for detailed structural analyses. The server has registered 3500 users and 1200 jobs.
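The idea of computing every chimera from all possible recombination points can be sketched as a simple enumeration. This is an illustration under stated assumptions, not LoopGrafter's implementation: it assumes two pre-aligned, gap-free sequences of equal length, and the function name, the `flank` parameter, and the toy sequences are all hypothetical.

```python
def enumerate_chimeras(scaffold, insert, loop_start, loop_end, flank):
    """Enumerate chimeras that graft insert[loop_start:loop_end] into scaffold.

    The left crossover i may sit up to `flank` residues before the loop and
    the right crossover j up to `flank` residues after it. Each chimera is
    scaffold[:i] + insert[i:j] + scaffold[j:]. Identical sequences produced by
    different crossover pairs are reported once, with the first pair found.
    """
    if len(scaffold) != len(insert):
        raise ValueError("sequences must be aligned and of equal length")
    n = len(scaffold)
    chimeras = {}
    for i in range(max(0, loop_start - flank), loop_start + 1):
        for j in range(loop_end, min(n, loop_end + flank) + 1):
            sequence = scaffold[:i] + insert[i:j] + scaffold[j:]
            chimeras.setdefault(sequence, (i, j))
    return chimeras


if __name__ == "__main__":
    # Toy sequences chosen so that every crossover gives a distinct chimera.
    result = enumerate_chimeras("AAAAAAAAAA", "BBBBBBBBBB",
                                loop_start=4, loop_end=6, flank=1)
    for sequence, crossover in result.items():
        print(crossover, sequence)
```

When the residues flanking the loop are identical in both proteins, different crossover pairs yield the same sequence, which is why the sketch deduplicates before any model building would take place.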
A large proportion of proteins are completely or partially embedded in the cell membrane. However, most algorithms assessing the changes in protein structure induced by amino acid substitutions are designed for globular proteins and do not consider the physico-chemical characteristics of the lipid bilayer. We present Missense3D-TM, a program specifically designed to provide a structure-based assessment of the impact of variants occurring in transmembrane regions.
A dataset of 3,346 missense variants (2,197 damaging and 1,149 neutral, in 746 proteins) and 772 3D structures was used for development. Close homologues between the training and testing datasets were removed, but a similar pathogenic-to-benign ratio was maintained in both sets.
On the testing set, Missense3D-TM outperformed the standard Missense3D algorithm for globular proteins: sensitivity 58% versus 35%, specificity 81% versus 89%, Matthews correlation coefficient (MCC) 0.37 versus 0.27, accuracy 66% versus 53% (p < 1×10⁻¹⁰, two-tailed McNemar’s test). By comparison, the predictor mCSM-membrane achieved 52% sensitivity, 81% specificity, MCC of 0.31 and 61% accuracy (p = 0.06).
Missense3D-TM will assist researchers seeking to understand why an engineered or naturally occurring amino acid substitution in a transmembrane protein might cause changes in protein folding. A web server implementing Missense3D-TM is available at http://missense3d.bc.ic.ac.uk/.
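For readers unfamiliar with the measures reported above, the sketch below shows how sensitivity, specificity, accuracy and MCC follow from a confusion matrix, and how a two-tailed McNemar's test compares two predictors on the same variants. The counts are hypothetical and are not the study's data. The McNemar function uses the chi-square approximation with continuity correction, which is an assumption; the abstract does not say which variant of the test was used.

```python
import math


def binary_metrics(tp, fn, tn, fp):
    """Sensitivity, specificity, accuracy and MCC from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return {"sensitivity": sensitivity, "specificity": specificity,
            "accuracy": accuracy, "mcc": mcc}


def mcnemar_p_value(only_a_correct, only_b_correct):
    """Two-tailed McNemar's test from the two discordant counts.

    Uses the chi-square approximation (1 degree of freedom) with continuity
    correction; the survival function of chi-square with 1 degree of freedom
    equals erfc(sqrt(statistic / 2)).
    """
    total = only_a_correct + only_b_correct
    if total == 0:
        return 1.0
    difference = max(abs(only_a_correct - only_b_correct) - 1, 0)
    statistic = difference ** 2 / total
    return math.erfc(math.sqrt(statistic / 2))


if __name__ == "__main__":
    # Hypothetical counts: 100 damaging and 100 neutral test variants.
    print(binary_metrics(tp=58, fn=42, tn=81, fp=19))
    # Hypothetical discordant pairs: variants only one predictor got right.
    print(mcnemar_p_value(only_a_correct=50, only_b_correct=10))
```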
The tendency of an amino acid to adopt certain configurations in folded proteins is treated here as a statistical estimation problem. We model the joint distribution of the observed mainchain and sidechain dihedral angles (<φ,ψ,χ1,χ2,...>) of any amino acid by a mixture of products of von Mises probability distributions. This mixture model maps any vector of dihedral angles to a point on a multi-dimensional torus. The continuous space it uses to specify the dihedral angles provides an alternative to the commonly used rotamer libraries. These rotamer libraries discretize the space of dihedral angles into coarse angular bins, and cluster combinations of sidechain dihedral angles (<χ1,χ2,...>) as a function of backbone <φ,ψ> conformations. A ‘good’ model is one that is both concise and explains (compresses) the observed data. Competing models can be compared directly, and in particular our model is shown to outperform the Dunbrack rotamer library in terms of model complexity (by three orders of magnitude) and fidelity (on average 20% more compression) when losslessly explaining the observed dihedral angle data across experimental resolutions of structures. Our method is unsupervised (with parameters estimated automatically) and uses information theory to determine the optimal complexity of the statistical model, thus avoiding under- and over-fitting, a common pitfall in model-selection problems. Our models are computationally inexpensive to sample from and are geared to support a number of downstream studies, including experimental structure refinement, de novo protein design, and protein structure prediction. We call our collection of mixture models PhiSiCal (φψχal). It is available for download from http://lcb.infotech.monash.edu.au/phisical.
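A mixture of products of von Mises distributions can be written down in a few lines, which also shows why such models are cheap to evaluate and sample from. The sketch below is not the PhiSiCal code or its fitted parameters: the component weights, mean angles and concentrations are invented for illustration, and the information-theoretic parameter estimation described in the abstract is not reproduced.

```python
import math
import random

TWO_PI = 2.0 * math.pi


def bessel_i0(x):
    """Modified Bessel function of the first kind, order 0 (power series)."""
    term, total, k = 1.0, 1.0, 0
    while term > 1e-16 * total:
        k += 1
        term *= (x / 2.0) ** 2 / (k * k)
        total += term
    return total


def von_mises_pdf(theta, mu, kappa):
    """Density of one von Mises distribution on the circle (radians)."""
    return math.exp(kappa * math.cos(theta - mu)) / (TWO_PI * bessel_i0(kappa))


def mixture_pdf(angles, components):
    """Density of a mixture whose components are products of von Mises terms.

    `components` is a list of (weight, means, concentrations), with one mean
    and one concentration per dihedral angle.
    """
    density = 0.0
    for weight, means, concentrations in components:
        product = weight
        for theta, mu, kappa in zip(angles, means, concentrations):
            product *= von_mises_pdf(theta, mu, kappa)
        density += product
    return density


def sample_mixture(components, rng=random):
    """Draw one angle vector: pick a component, then sample each angle."""
    weights = [weight for weight, _, _ in components]
    _, means, concentrations = rng.choices(components, weights=weights, k=1)[0]
    sample = []
    for mu, kappa in zip(means, concentrations):
        angle = rng.vonmisesvariate(mu % TWO_PI, kappa)  # value in [0, 2*pi)
        sample.append((angle + math.pi) % TWO_PI - math.pi)  # wrap to [-pi, pi)
    return sample


if __name__ == "__main__":
    # Hypothetical two-component model over (phi, psi, chi1), in radians.
    components = [
        (0.6, [math.radians(-63), math.radians(-43), math.radians(-65)], [12.0, 10.0, 8.0]),
        (0.4, [math.radians(-120), math.radians(130), math.radians(180)], [6.0, 5.0, 9.0]),
    ]
    point = [math.radians(-60), math.radians(-45), math.radians(-60)]
    print("density:", mixture_pdf(point, components))
    print("sample (degrees):", [round(math.degrees(a), 1) for a in sample_mixture(components)])
```

Because each component factorises over the angles, evaluating or sampling the mixture scales linearly with the number of dihedral angles and the number of components.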
Protein kinases are an essential family of signaling proteins and a prime target for drug discovery. These proteins are structurally dynamic and can adopt different conformational states, including active and inactive states. The positions of the C-helix and the DFG motif define these states, which maintain cellular homeostasis. Kinase inhibitors (KIs) exhibit different biophysical, biochemical, and pharmacophore properties depending on the specific kinase conformation they target. The recent success of AlphaFold2 (AF2) in accurately predicting protein structures from sequence inspired our investigation into the conformational landscape of protein kinases modeled by AF2. Our research demonstrated that AF2 can accurately model several kinase conformations across the kinome in a kinase-specific manner. However, it is challenging to direct AF2 to generate structures of kinases in specific conformational states. This lack of conformational coverage hinders the discovery of novel conformation-specific KIs, especially for kinases lacking experimental structures. We present a new methodology that uses ColabFold, an open-source protein modeling tool based on AF2, to model any kinase in the active conformation. Our findings create the opportunity to use AF2 to model any protein kinase in several pharmacologically relevant conformational states.
Humans have 438 catalytically competent protein kinase domains with the typical kinase fold, similar to the structure of PKA. Only 280 of these kinases are currently represented in the PDB. The active form of the kinase must satisfy requirements for binding ATP, magnesium, and substrate. From bioinformatics analysis of structures of 40 unique substrate-bound kinases, as well as many structures with bound ATP and phosphorylated activation loops, we derived criteria for the active form of protein kinases. These criteria include the conformation of the DFG motif (in dihedral angles) and the N-terminal domain salt bridge, required for binding ATP and magnesium. There are also novel requirements on the positions of the N- and C-terminal portions of the activation loop, which lead to the formation of a substrate-binding cleft. By these criteria, only 148 of the 438 kinase domains (32%) are present in the PDB in the active form. We used extensive sampling with AlphaFold2, using these active-state structures as templates together with shallow multiple sequence alignments, to make active-conformation models of all 438 human kinases. In addition, we used active models produced by AlphaFold2 as templates for modeling recalcitrant kinases ("diffusion templates"). Models of all 438 catalytically competent kinases in the active form are available at http://dunbrack.fccc.edu/kincore/active. They are suitable for interpreting mutations that lead to constitutive catalytic activity in cancer, and as templates for modeling substrate-kinase complexes and inhibitors that bind to the active state.