Directly porting model architectures and tasks from natural language processing and computer vision, such as word2vec, BERT, SimCLR, and variational autoencoders, has driven early progress in learning protein representations from unlabeled sequences. However, these models and tasks do not account for important differences between protein sequences and language or image datasets. For example, protein sequences are generated via evolution, sampling is biased towards proteins from humans and model organisms, and there is often side information, such as three-dimensional structures. I will discuss some important advances in protein representation learning that account for or exploit these differences.
11:35-11:50
Single Layers of Attention Suffice to Predict Protein Contacts
The established approach to unsupervised protein contact prediction estimates coevolving positions using undirected graphical models, training a Potts model on a multiple sequence alignment (MSA). Increasingly large Transformers are being pretrained on unlabeled, unaligned protein sequence databases but have shown mixed results on downstream tasks, including contact prediction. We argue that attention is a principled model of protein interactions, grounded in real properties of protein family data. We introduce an energy-based attention layer, factored attention, and show that it achieves performance comparable to Potts models while sharing parameters both within and across families. We contrast factored attention with the Transformer to show that the Transformer leverages hierarchical signal in protein family databases that our single-layer models do not capture. This raises the exciting possibility of developing powerful structured models of protein family databases.
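For readers who want the mechanics, below is a minimal PyTorch sketch of a single factored-attention layer in the spirit of the abstract: positional queries and keys factor the length-by-length interaction matrix per head, a per-head value matrix mixes amino-acid identities, and contact scores are read off the symmetrized, head-averaged attention. Layer sizes, initialization, and the reconstruction loss are illustrative assumptions, not the authors' exact model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FactoredAttention(nn.Module):
    """Single-layer, energy-based attention fit to one protein family."""
    def __init__(self, seq_len, n_heads=32, d_head=16, n_tokens=21):
        super().__init__()
        # Positional queries/keys factor the L x L interaction matrix per head.
        self.q = nn.Parameter(0.01 * torch.randn(n_heads, seq_len, d_head))
        self.k = nn.Parameter(0.01 * torch.randn(n_heads, seq_len, d_head))
        # Per-head value matrix mixes amino-acid identities.
        self.v = nn.Parameter(0.01 * torch.randn(n_heads, n_tokens, n_tokens))

    def forward(self, msa_onehot):                            # (B, L, A) one-hot
        attn = F.softmax(torch.einsum("hid,hjd->hij", self.q, self.k), dim=-1)
        vals = torch.einsum("bja,hac->bhjc", msa_onehot, self.v)
        return torch.einsum("hij,bhjc->bic", attn, vals)      # per-position token logits

    def contact_scores(self):
        attn = F.softmax(torch.einsum("hid,hjd->hij", self.q, self.k), dim=-1)
        attn = attn.mean(0)                                   # average over heads
        return 0.5 * (attn + attn.T)                          # symmetrized (L, L) scores

# Toy usage: fit the layer to reconstruct family sequences, then read off contacts.
model = FactoredAttention(seq_len=128)
msa = F.one_hot(torch.randint(0, 21, (64, 128)), num_classes=21).float()
loss = F.cross_entropy(model(msa).transpose(1, 2), msa.argmax(-1))
```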
11:50-11:55
Evotuning protocols for Transformer-based variant effect prediction on multi-domain proteins
Accurate prediction of variant effects has broad impact on protein engineering. Recent machine learning approaches toward this end are based on representation learning, often using large-scale, diverse datasets. However, it remains unclear how to effectively learn the intrinsic evolutionary properties of an engineering target protein, particularly when the protein is composed of multiple domains. Additionally, no optimal protocols have been established for incorporating such properties into Transformer-based variant effect predictors. In response, we propose evolutionary fine-tuning, or "evotuning", protocols that consider various combinations of homology search, fine-tuning, and sequence embedding strategies, without requiring a multiple sequence alignment. Exhaustive evaluations on diverse proteins indicate that models obtained with our protocols perform significantly better than previous methods. Visualizations of attention maps suggest that evotuning can incorporate structural information without direct supervision, possibly leading to better prediction accuracy.
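A hedged sketch of the core evotuning step: continuing masked-language-model training of a pretrained protein Transformer on unaligned homologs of the target. The Rostlab/prot_bert checkpoint, the toy homolog list, and the simplified masking (which does not exclude special tokens) are assumptions for illustration, not the authors' exact protocol.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# A public protein LM checkpoint, used here purely for illustration.
tok = AutoTokenizer.from_pretrained("Rostlab/prot_bert")
model = AutoModelForMaskedLM.from_pretrained("Rostlab/prot_bert")
opt = torch.optim.AdamW(model.parameters(), lr=1e-5)

# Stand-ins for unaligned homologs returned by a homology search (e.g. jackhmmer).
homologs = ["MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ",
            "MKSAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"]

model.train()
for epoch in range(3):
    for seq in homologs:
        batch = tok(" ".join(seq), return_tensors="pt")  # ProtBert expects spaced residues
        labels = batch["input_ids"].clone()
        mask = torch.rand_like(labels.float()) < 0.15    # simplified 15% masking
        batch["input_ids"][mask] = tok.mask_token_id
        labels[~mask] = -100                             # loss only on masked positions
        loss = model(**batch, labels=labels).loss
        loss.backward()
        opt.step()
        opt.zero_grad()
```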
11:55-12:00
ProteinBERT: A universal deep-learning model of protein sequence and function
Format: Live-stream
Moderator(s): Christian
Yam Peleg
Nadav Rappoport
Nadav Brandes, The Hebrew University of Jerusalem, Israel
Dan Ofer, The Hebrew University of Jerusalem, Israel
Michal Linial, The Hebrew University of Jerusalem, Israel
12:00-12:05
Graph attention network based representation learning for cancer drug response prediction and interpretation
Format: Pre-recorded with live Q&A
Moderator(s): Christian
Dionizije Fa, Ruđer Bošković Institute, Croatia
Frank Supek, Institute for Research in Biomedicine, Spain
We present a state-of-the-art multimodal deep learning model for cancer drug response prediction based on pharmacogenomic data. We featurize cell lines as protein-protein interaction graphs. Graph attention networks then allow us to identify potentially plausible biological interactions in these graphs by examining the attention coefficients.
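As an illustration of where interpretable attention coefficients come from, here is a standard (Veličković-style) graph attention layer in PyTorch; the node features and toy adjacency are assumptions, not the authors' exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GATLayer(nn.Module):
    def __init__(self, d_in, d_out):
        super().__init__()
        self.W = nn.Linear(d_in, d_out, bias=False)
        self.a = nn.Parameter(0.01 * torch.randn(2 * d_out))

    def forward(self, x, adj):  # x: (N, d_in) protein features, adj: (N, N) with self-loops
        h = self.W(x)
        N = h.size(0)
        pair = torch.cat([h.unsqueeze(1).expand(N, N, -1),
                          h.unsqueeze(0).expand(N, N, -1)], dim=-1)
        e = F.leaky_relu(pair @ self.a, 0.2)           # raw pairwise scores
        e = e.masked_fill(adj == 0, float("-inf"))     # attend only along PPI edges
        alpha = torch.softmax(e, dim=-1)               # the coefficients to interpret
        return alpha @ h, alpha

# Toy usage: three proteins with two node features each (e.g. expression, mutation).
adj = torch.tensor([[1., 1., 0.], [1., 1., 1.], [0., 1., 1.]])
out, alpha = GATLayer(2, 16)(torch.randn(3, 2), adj)
```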
12:05-12:20
Architectures and training procedures
Format: Live-stream
Moderator(s): Christian
Alex Rives
12:40-12:55
AI-driven engineering of the immune system
Format: Pre-recorded with live Q&A
Moderator(s): Christian
Maria Rodriguez Martinez
12:55-13:00
HydrAMP: a deep generative model for antimicrobial peptide discovery
Format: Live-stream
Moderator(s): Christian
Paulina Szymczak, Faculty of Mathematics, Informatics and Mechanics of the University of Warsaw, Poland
Marcin Możejko, Faculty of Mathematics, Informatics and Mechanics of the University of Warsaw, Poland
Tomasz Grzegorzek, Faculty of Mathematics, Informatics and Mechanics of the University of Warsaw, Poland
Marta Bauer, Medical University of Gdańsk, Poland
Wojciech Kamysz, Medical University of Gdańsk, Poland
Damian Neubauer, Medical University of Gdańsk, Poland
Michał Michalski, The Centre of New Technologies, University of Warsaw, Poland
Piotr Setny, The Centre of New Technologies, University of Warsaw, Poland
Jacek Sroka, Faculty of Mathematics, Informatics and Mechanics of the University of Warsaw, Poland
Ewa Szczurek, Faculty of Mathematics, Informatics and Mechanics of the University of Warsaw, Poland
The development of resistance to conventional antibiotics in pathogenic bacteria poses a global health hazard. Antimicrobial peptides (AMPs) are an emerging group of compounds with the potential to become a new generation of antibiotics. Deep learning methods are widely used by wet-laboratory researchers to screen for the most promising candidates. We propose HydrAMP, a generative model based on a semi-supervised variational autoencoder that can generate new AMPs and perform analogue discovery. Novel features of our approach include non-iterative training, parameter-regulated model creativity, and improvement of existing AMPs. We introduce multiple refinements to latent space modelling that allow us to sample novel AMPs despite data scarcity. The peptides generated by HydrAMP are similar to known AMPs in terms of physicochemical properties. We have experimentally obtained and verified a new, more active analogue of Pexiganan, demonstrating that HydrAMP can find potent analogues of existing peptides. The learnt representation enables fast and efficient discovery of peptides with desired biological activity.
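A minimal sketch of the model family HydrAMP builds on, a conditional variational autoencoder over peptides; the dimensions, the two-bit condition vector, and the unweighted KL term are illustrative assumptions. Analogue discovery would then amount to encoding a known peptide such as Pexiganan, perturbing the latent code, and decoding under the desired activity condition.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PeptideCVAE(nn.Module):
    def __init__(self, max_len=25, n_aa=21, n_cond=2, d_z=32, d_h=128):
        super().__init__()
        d_x = max_len * n_aa
        self.enc = nn.Sequential(nn.Linear(d_x + n_cond, d_h), nn.ReLU())
        self.mu = nn.Linear(d_h, d_z)
        self.logvar = nn.Linear(d_h, d_z)
        self.dec = nn.Sequential(nn.Linear(d_z + n_cond, d_h), nn.ReLU(),
                                 nn.Linear(d_h, d_x))
        self.max_len, self.n_aa = max_len, n_aa

    def forward(self, x_onehot, cond):  # x: (B, L, A), cond: e.g. [is_AMP, is_active]
        h = self.enc(torch.cat([x_onehot.flatten(1), cond], dim=-1))
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization trick
        logits = self.dec(torch.cat([z, cond], dim=-1)).view(-1, self.max_len, self.n_aa)
        recon = F.cross_entropy(logits.transpose(1, 2), x_onehot.argmax(-1))
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return recon + kl                # ELBO, up to weighting of the KL term
```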
13:00-13:05
Random Walk-based Matrix Factorization of a Multilayer Network for Protein Function Prediction
Format: Pre-recorded with live Q&A
Moderator(s): Christian
Surabhi Jagtap, CentraleSupelec; IFP Energies nouvelles, France
Cellular systems of organisms are composed of multiple interacting entities that control cellular processes at multiple levels through tightly regulated molecular networks. In recent years, the advent of high-throughput experimental methods has led to a growing number of large-scale molecular and functional interaction networks, such as gene co-expression, protein-protein interaction (PPI), genetic interaction, and metabolic networks. These networks are rich sources of information that can be used to infer the functional annotations of genes or proteins. Extracting relevant biological information from their topologies is essential to understanding the functioning of the cell and its building blocks (proteins). It is therefore necessary to obtain an informative representation of proteins and their proximity that is not fully captured by features extracted directly from single input networks. Here, we propose BraneMF, a random walk-based matrix factorization of a multi-layer network for protein function prediction.
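To give a flavor of the approach, here is a DeepWalk-style sketch that factorizes random-walk statistics pooled across network layers; the pooling scheme, log transform, and SVD readout are assumptions for illustration, not the exact BraneMF objective.

```python
import numpy as np

def multilayer_walk_embeddings(adjs, window=3, dim=64):
    """adjs: list of (N, N) adjacency matrices, one per network layer."""
    N = adjs[0].shape[0]
    M = np.zeros((N, N))
    for A in adjs:
        P = A / np.maximum(A.sum(axis=1, keepdims=True), 1)  # row-stochastic walk matrix
        Pk = np.eye(N)
        for _ in range(window):            # accumulate k-step visit probabilities
            Pk = Pk @ P
            M += Pk
    M = np.log1p(M / (len(adjs) * window))                   # PMI-like transform
    U, S, _ = np.linalg.svd(M)
    return U[:, :dim] * np.sqrt(S[:dim])                     # protein embeddings

# The embeddings can then feed a one-vs-rest classifier over GO terms.
```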
13:05-13:10
Light Attention Predicts Protein Location from the Language of Life
Format: Live-stream
Moderator(s): Christian
Hannes Stärk, Department of Informatics, Technical University of Munich, Germany
Christian Dallago, Department of Informatics, Technical University of Munich, Germany
Michael Heinzinger, Department of Informatics, Technical University of Munich, Germany
Burkhard Rost, Department of Informatics, Technical University of Munich, Germany
Although knowing where a protein functions in a cell is important for characterizing biological processes, this information remains unavailable for most known proteins. Machine learning narrows the gap through predictions from expertly chosen input features that leverage evolutionary information, which is resource-expensive to generate. We showcase how embeddings from protein language models enable competitive localization predictions without relying on evolutionary information. Our lightweight deep neural network architecture uses a softmax-weighted aggregation mechanism with linear complexity in sequence length, referred to as light attention (LA). The method significantly outperformed the state of the art for ten localization classes, by about eight percentage points (Q10). The novel models are available as a web service and as a stand-alone application at embed.protein.properties.
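The abstract describes the aggregation mechanism concretely enough to sketch: the simplified PyTorch module below computes softmax-weighted and max-pooled summaries of per-residue embeddings in O(L). The convolutional feature extractors and the single linear head are simplifications of the published architecture, and the embedding dimension is an assumption.

```python
import torch
import torch.nn as nn

class LightAttention(nn.Module):
    def __init__(self, d_emb=1024, n_classes=10, kernel=9):
        super().__init__()
        self.values = nn.Conv1d(d_emb, d_emb, kernel, padding=kernel // 2)
        self.weights = nn.Conv1d(d_emb, d_emb, kernel, padding=kernel // 2)
        self.head = nn.Linear(2 * d_emb, n_classes)

    def forward(self, x):                # x: (B, d_emb, L) per-residue pLM embeddings
        v = self.values(x)
        a = torch.softmax(self.weights(x), dim=-1)  # attention over positions, O(L)
        pooled = torch.cat([(a * v).sum(dim=-1),    # attention-weighted average
                            v.max(dim=-1).values],  # plus max pooling
                           dim=-1)
        return self.head(pooled)                    # logits over 10 localization classes

logits = LightAttention()(torch.randn(2, 1024, 300))
```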
13:10-13:15
Guided Generative Protein Design using Regularized Transformers
Format: Pre-recorded with live Q&A
Moderator(s): Christian
Egbert Castro
Abhinav Godavarthi
Julian Rubinfien
Smita Krishnaswamy, Yale University, United States
13:15-13:30
Geometric and Topological Approaches to Representation Learning in Biomedical Data
High-throughput, high-dimensional data has become ubiquitous in the biomedical, health, and social sciences as a result of breakthroughs in measurement technologies and data collection. While these large datasets containing millions of observations of cells, people, or brain voxels hold great potential for understanding the generative state space of the data, as well as drivers of differentiation, disease, and progression, they also pose new challenges in terms of noise, missing data, measurement artifacts, and the so-called "curse of dimensionality." In this talk, I will cover geometric and topological approaches to understanding the shape and structure of data. First, we show how diffusion geometry and deep learning can be used to obtain useful representations of the data that enable denoising (MAGIC) and dimensionality reduction (PHATE). Next, we show how to learn dynamics from static snapshot data using a manifold-regularized, neural ODE-based optimal transport (TrajectoryNet). Finally, we cover a novel approach that combines diffusion geometry with topology to extract multi-granular features from the data (Diffusion Condensation and Multiscale PHATE) to assist in differential and predictive analysis. On the flip side, we also create a manifold geometry from topological descriptors and show its applications to neuroscience. Together, these constitute a complete framework for exploratory and unsupervised analysis of big biomedical data.
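As one concrete instance of the diffusion-geometry theme, here is a MAGIC-flavored denoising sketch: build a Markov diffusion operator from the data's own affinity graph and apply a few diffusion steps. The bandwidth choice and parameters are illustrative assumptions, not the published algorithm's exact settings.

```python
import numpy as np

def diffusion_denoise(X, k=10, t=3):
    """Denoise data by diffusing it over its own kNN affinity graph."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # pairwise squared distances
    sigma = np.sort(d2, axis=1)[:, k]                    # adaptive bandwidth: k-th neighbor
    K = np.exp(-d2 / (sigma[:, None] + 1e-12))
    K = 0.5 * (K + K.T)                                  # symmetrize affinities
    P = K / K.sum(axis=1, keepdims=True)                 # Markov diffusion operator
    return np.linalg.matrix_power(P, t) @ X              # t-step smoothing along the manifold

X_denoised = diffusion_denoise(np.random.rand(200, 50))
```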
13:30-13:45
ChemBERTa: Self-supervised pretraining for molecular property prediction
The design of a robust transfer learning method for molecules has been a longstanding challenge. In this work, we explore NLP-style pretraining to learn a "chemical language" model on a large corpus of SMILES strings. Our results suggest that it is possible to learn meaningful chemical context in an unsupervised fashion; paired with recent results from others on language modeling for DNA, they further suggest that NLP methods provide a robust basis for building an understanding of biomolecules.
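A short usage sketch: querying a public ChemBERTa checkpoint (seyonec/ChemBERTa-zinc-base-v1 on the Hugging Face hub) to complete a masked SMILES string. The checkpoint name and example molecule are for illustration.

```python
from transformers import pipeline

# Ask the pretrained chemical language model to fill in a masked SMILES token.
fill = pipeline("fill-mask", model="seyonec/ChemBERTa-zinc-base-v1")
for pred in fill("c1ccccc1<mask>")[:3]:   # benzene ring with one masked token
    print(pred["token_str"], round(pred["score"], 3))
```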
13:45-14:00
Short DNA sequence embeddings uncover metagenome function
Format: Live-stream
Moderator(s): Christian
Yana Bromberg
14:20-14:35
Decoding language of life written in protein sequences
Over the last two years, it has become possible to learn the language of life written in protein sequences by mimicking the tools developed to understand natural language (NLP), most importantly Transformers. The information extracted by such protein language models (pLMs), referred to as embeddings, is transferred to serve as input for supervised protein prediction from experimental annotations. For the prediction of protein secondary structure in 1D, inter-residue distances in 2D, and structure in 3D, as well as sub-cellular location, such methods now at least reach the top methods without using any evolutionary information from multiple sequence alignments (MSAs), thereby substantially reducing the cost of every future prediction.
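A minimal sketch of the transfer recipe described above: freeze a pLM, take its per-residue embeddings, and train a small supervised head, here for 3-state secondary structure. The embedding dimension and head architecture are assumptions, not any specific published predictor.

```python
import torch
import torch.nn as nn

class SecondaryStructureHead(nn.Module):
    """Small supervised head on top of frozen per-residue pLM embeddings."""
    def __init__(self, d_emb=1024, n_states=3):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_emb, 128), nn.ReLU(),
                                 nn.Linear(128, n_states))

    def forward(self, emb):         # emb: (B, L, d_emb) from a frozen pLM
        return self.net(emb)        # per-residue 3-state logits

emb = torch.randn(2, 50, 1024)      # stand-in for pLM embeddings
labels = torch.randint(0, 3, (2, 50))
logits = SecondaryStructureHead()(emb)
loss = nn.functional.cross_entropy(logits.flatten(0, 1), labels.flatten())
```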
14:35-14:40
Efficient Design of Optimized AAV Capsids using Multi-property Machine Learning Models Trained across Cells, Organs and Species
While next-generation high-throughput assays let us learn how capsid sequence changes affect capsid functionality, measuring and optimizing capsid properties in the most therapeutically relevant models, such as non-human primates (NHPs), remains challenging. The rate of transduction in target organs is lower than ideal, and most of the sequence space is non-functional. To overcome these challenges, we investigated to what extent multi-property machine learning models (MPMs) can improve the efficiency of designing high-performing AAV capsids. We apply our method to a previously designed library of 156,858 sequence variants derived from a natural AAV capsid serotype and measure their properties as delivery vectors. MPMs provide a coherent framework for connecting information from experiments across cell lines, organs, and species to the most relevant outcomes in NHP studies, thereby reducing the high resource and ethical burdens of NHP experimentation. Additionally, MPMs help overcome data sparsity in traits that are hard to measure, improving model accuracy and providing a more reliable interpretation of experimental results. With further refinement, MPMs will enable the design of highly optimized AAV capsids that open new frontiers in delivery, toward realizing the full potential of gene therapy.
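A hedged sketch of what a multi-property model can look like: a shared trunk over capsid sequence features with one regression head per assay, trained with a masked loss so that sparsely measured traits still contribute. The task names and architecture are hypothetical.

```python
import torch
import torch.nn as nn

class MultiPropertyModel(nn.Module):
    """Shared trunk over capsid features with one regression head per assay."""
    def __init__(self, d_in, d_h=256,
                 tasks=("cell_line", "mouse_liver", "nhp_transduction")):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(d_in, d_h), nn.ReLU())
        self.heads = nn.ModuleDict({t: nn.Linear(d_h, 1) for t in tasks})

    def forward(self, x):
        h = self.trunk(x)
        return {t: head(h).squeeze(-1) for t, head in self.heads.items()}

def masked_mse(preds, targets, observed):
    """Sparse labels: each variant contributes only to assays it was measured in."""
    losses = [((preds[t] - targets[t])[observed[t]] ** 2).mean()
              for t in preds if observed[t].any()]
    return torch.stack(losses).mean()
```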
14:40-14:45
Multimodal data visualization and denoising with integrated diffusion