Schedule subject to change
All times listed are in EDT
Saturday, July 13th
10:40-11:20
Invited Presentation: Disruption of ClpX reverses antifungal resistance
Confirmed Presenter: Jennifer Geddes-McAlister, University of Guelph, Canada

Room: 525
Format: In Person

Moderator(s): Timo Sachsenberg


Authors List:

  • Jennifer Geddes-McAlister, University of Guelph, Canada
  • Michael Woods, University of Guelph, Canada

Presentation Overview:

Fungal disease impacts the lives of almost a billion people across the globe. The opportunistic human fungal pathogen Cryptococcus neoformans causes cryptococcal meningitis in immunocompromised individuals, with high fatality rates owing to limited treatment options. Moreover, the emergence of azole-resistant isolates in the clinic following prolonged treatment regimes, environmental fungicide exposure, and fungal evolution threatens the outcome of current therapeutic options, endangering the survival of infected individuals. By quantitatively characterizing the proteomes of fluconazole-susceptible and -resistant C. neoformans strains using state-of-the-art tandem mass spectrometry, we defined ClpX, an ATP-dependent unfoldase, as a target to overcome resistance. We discovered that disruption of ClpX through deletion or inhibition re-introduces fluconazole susceptibility into the resistant strains, rendering treatment effective once again. We further explored the mechanism of resistance and identified an interruption to heme biosynthesis and ergosterol production associated with ClpX. Our results contribute to the understanding of novel mechanisms driving fluconazole resistance and provide support for targeting proteins as a therapeutic strategy to combat resistance.

11:20-11:40
Perception and reality of FDR control, data completeness and quantitative precision in (single-cell proteomics) DIA data
Confirmed Presenter: Martin Frejno, MSAID, Germany

Room: 525
Format: In Person

Moderator(s): Timo Sachsenberg


Authors List:

  • Martin Frejno, MSAID, Germany
  • Michelle Berger, MSAID, Germany
  • Johanna Tueshaus, Technical University of Munich, Germany
  • Alexander Hogrebe, MSAID, Germany
  • Florian Seefried, MSAID, Germany
  • Bernhard Kuster, Technical University of Munich, Germany
  • Daniel Zolg, MSAID, Germany
  • Mathias Wilhelm, Technical University of Munich, Germany

Presentation Overview:

Introduction

Recently, single cell proteomics (SCP) moved away from TMT labelling and DDA to label-free experiments and DIA due to its reportedly higher sensitivity and data completeness. Here, we developed a tool that enables routine quality control of peptide-centric DIA results. Application to published single cell DIA data showed that data completeness was only 45% instead of 98%.

Methods

Publicly available single-cell and bulk DIA data were downloaded from PRIDE and searched library-free with DIA-NN 1.8.1, Spectronaut 18 and Chimerys 2.0 using default settings against normal or entrapment databases without imputation. Data analysis was performed in R. The most useful plots and filtering options were incorporated into a Shiny application, the source code of which is available on GitHub.

Preliminary data

On bulk data, Spectronaut and Chimerys detected the same number of precursors at 1% precursor false discovery rate (FDR), while DIA-NN identified 17% more. In entrapment experiments, Chimerys showed well-controlled run-specific precursor FDR, while DIA-NN and Spectronaut lost a substantial amount of their identifications at 1% empirical FDR. When accurately controlling FDR, the results from all tools are comparable.

On SCP data, Spectronaut outperformed both DIA-NN and Chimerys when multiple raw files were analyzed together. However, fragment-level XIC peak areas showed a tri-modal distribution. For low-intensity fragments they were between 0 and 1, although inspection of the corresponding raw data showed no signal at all. Excluding these fragments reduced data completeness from 98% to 45%.

Our results highlight the importance of closely inspecting search engine results instead of solely relying on FDR control.
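
As a rough illustration of the completeness figures quoted above, the following Python sketch computes the fraction of reported quantities in a precursor-by-run matrix, before and after discarding values at or below an intensity floor. The toy matrix, the floor of 1.0, and the function name are hypothetical; this is not the authors' Shiny application or any search engine's output format.

```python
from typing import List, Optional


def data_completeness(matrix: List[List[Optional[float]]], min_value: float = 0.0) -> float:
    """Return the fraction of cells holding a value strictly above min_value."""
    total = sum(len(row) for row in matrix)
    if total == 0:
        return 0.0
    present = sum(
        1 for row in matrix for value in row
        if value is not None and value > min_value
    )
    return present / total


# Hypothetical toy matrix: rows are precursors, columns are single-cell runs.
toy_quantities = [
    [1200.0, 0.4, 980.0, 0.7],
    [0.2, 0.9, 450.0, None],
]

print(data_completeness(toy_quantities))       # counts every reported value
print(data_completeness(toy_quantities, 1.0))  # ignores values at or below 1.0
```

In this toy case the apparent completeness drops once the near-zero values are excluded, which mirrors the kind of check the abstract describes.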

A novel supervised learning algorithm for real-time collision energy selection to optimize peptide fragmentation in mass spectrometry
Confirmed Presenter: Mathieu Lavallée-Adam, University of Ottawa, Canada

Room: 525
Format: In Person

Moderator(s): Timo Sachsenberg


Authors List:

  • Yun-En Chung, University of Ottawa, Canada
  • Matthew Willetts, Bruker Daltonics Inc., Germany
  • Jens Decker, Bruker Daltonics Inc., Germany
  • Nagarjuna Nagaraj, Bruker Daltonics Inc., Germany
  • Jonathan R. Krieger, Bruker Ltd., Canada
  • Tharan Srikumar, Bruker Ltd., Canada
  • Mathieu Lavallée-Adam, University of Ottawa, Canada

Presentation Overview:

Mass spectrometry is the most popular technique to characterize proteins in complex biological samples. The ability to identify peptides, proteins and their post-translational modifications using mass spectrometry is directly linked to the fragmentation level of peptide ions. Typically, a well-fragmented peptide ion generates data that facilitates sequence identification. Peptide ion properties, such as mass-to-charge ratio (m/z), charge state, and ion mobility coefficient are related to the level of collision energy required for optimal fragmentation. Nonetheless, most mass spectrometers do not make use of all these pieces of information when attempting to determine the optimal collision energy for a given peptide fragmentation, leaving many peptides suboptimally fragmented and unidentifiable.

Herein, we designed an artificial neural network that predicts the relative fragmentation of a given peptide ion using its properties when a certain level of collision energy is applied. This network is then used to determine in real-time, during mass spectrometry analysis, the optimal collision energy for a given peptide ion. Our novel algorithm accurately predicts relative fragmentation (r² = 0.72) in the proteomics analysis of commercial human cell lysates with a Bruker timsTOF Pro mass spectrometer. Furthermore, using our software to determine the optimal collision energy for peptide ions increased the number of peptide identifications by 15%. It also improved post-translational modification characterization by identifying 12% more modified peptides when applied to human cell lysate samples enriched for phosphorylated peptides. By optimizing fragmentation, our method improves proteomics characterization and therefore provides a better understanding of biological processes in samples analyzed by mass spectrometry.
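
The selection step described above can be sketched as a search over candidate collision energies using a fragmentation predictor. In this minimal Python sketch the predictor is a hypothetical stand-in (a simple formula peaking at an assumed optimum), not the authors' neural network, and the ion fields and energy grid are illustrative assumptions.

```python
from typing import Callable, Dict, Iterable


def select_collision_energy(
    predict: Callable[[Dict[str, float], float], float],
    ion: Dict[str, float],
    candidate_energies: Iterable[float],
) -> float:
    """Return the candidate energy with the highest predicted fragmentation."""
    return max(candidate_energies, key=lambda energy: predict(ion, energy))


def toy_predictor(ion: Dict[str, float], energy: float) -> float:
    """Hypothetical predictor: score peaks at an assumed optimum energy."""
    assumed_optimum = 20.0 + 0.02 * ion["mz"] - 2.0 * ion["charge"]
    return -(energy - assumed_optimum) ** 2


ion = {"mz": 500.0, "charge": 2.0}
best_energy = select_collision_energy(toy_predictor, ion, range(20, 41, 2))
print(best_energy)
```

A trained model would replace `toy_predictor`; the selection logic stays the same as long as the model maps ion properties and an energy to a fragmentation score.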

11:40-12:00
Proceedings Presentation: A learned score function improves the power of mass spectrometry database search
Confirmed Presenter: Varun Ananth, University of Washington, United States

Room: 525
Format: In Person

Moderator(s): Timo Sachsenberg


Authors List:

  • Varun Ananth, University of Washington, United States
  • Justin Sanders, University of Washington, United States
  • Melih Yilmaz, University of Washington, United States
  • Bo Wen, University of Washington, United States
  • Sewoong Oh, University of Washington, United States
  • William Stafford Noble, University of Washington, United States

Presentation Overview:

One of the core problems in the analysis of protein tandem mass spectrometry data is the peptide assignment problem: determining, for each observed spectrum, the peptide sequence that was responsible for generating the spectrum. Two primary classes of methods are used to solve this problem: database search and de novo peptide sequencing. State-of-the-art methods for de novo sequencing employ machine learning methods, whereas most database search engines use hand-designed score functions to evaluate the quality of a match between an observed spectrum and a candidate peptide from the database. We hypothesize that machine learning models for de novo sequencing implicitly learn a score function that captures the relationship between peptides and spectra, and thus may be re-purposed as a score function for database search. Because this score function is trained from massive amounts of mass spectrometry data, it could potentially outperform existing, hand-designed database search tools. To test this hypothesis, we re-engineered Casanovo, which has been shown to provide state-of-the-art de novo sequencing capabilities, to assign scores to given peptide-spectrum pairs. We then evaluated the statistical power of this Casanovo score function, dubbed Casanovo-DB, to detect peptides on a benchmark of three mass spectrometry runs from three different species. Our results show that, at a 1% peptide-level false discovery rate threshold, Casanovo-DB outperforms existing hand-designed score functions by 35% to 88%. In addition, we show that re-scoring with the Percolator post-processor benefits Casanovo-DB more than other score functions, further increasing the number of detected peptides.
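
To make the 1% false discovery rate threshold concrete, here is a minimal Python sketch that counts accepted target matches using a target-decoy estimate. The abstract does not state how FDR was estimated, so the target-decoy approach, the (score, is_decoy) input format, and the function name are assumptions; this is not the Casanovo-DB or Percolator implementation.

```python
from typing import List, Tuple


def accepted_at_fdr(matches: List[Tuple[float, bool]], threshold: float = 0.01) -> int:
    """Count target matches whose q-value is at or below the threshold.

    matches holds (score, is_decoy) pairs; higher scores are better.
    """
    ranked = sorted(matches, key=lambda match: match[0], reverse=True)
    fdrs = []
    targets = decoys = 0
    for _, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        # Estimated FDR among everything scoring at least this well.
        fdrs.append(decoys / max(targets, 1))
    # q-value: the lowest FDR at which a match would still be accepted.
    q_values = [0.0] * len(fdrs)
    lowest = float("inf")
    for index in range(len(fdrs) - 1, -1, -1):
        lowest = min(lowest, fdrs[index])
        q_values[index] = lowest
    return sum(
        1 for (_, is_decoy), q in zip(ranked, q_values)
        if not is_decoy and q <= threshold
    )


toy_matches = [(10.0, False), (9.0, False), (8.0, False), (7.0, False),
               (6.0, False), (5.0, True), (4.0, False)]
print(accepted_at_fdr(toy_matches, 0.01))
```

Comparing two score functions then amounts to running the same count on each score's ranked list at a fixed threshold.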

12:00-12:20
Invited Presentation: Multi-Omic Data Workflows for Drug Discovery and Development
Confirmed Presenter: Matthew Glover, AstraZeneca, United States

Room: 525
Format: In Person

Moderator(s): Timo Sachsenberg


Authors List:

  • Matthew Glover, AstraZeneca, United States
  • Benjamin Pullman, AstraZeneca, United States
  • Lukasz Zawada, AstraZeneca, United Kingdom
  • Rafal Klimek, AstraZeneca, United Kingdom
  • Jana Zecha, AstraZeneca, United States
  • Sean O’Dell, AstraZeneca, United Kingdom
  • Stewart MacArthur, AstraZeneca, United Kingdom
  • Sebastian Wasilewski, AstraZeneca, United Kingdom
  • Sonja Hess, AstraZeneca, United States

Presentation Overview:

The Centre for Genomics Research (CGR) at AstraZeneca aims to identify and validate novel targets and deliver insights into disease biology by using integrated genomics, proteomics, metabolomics, and lipidomics data. Our cross-functional team has leveraged expertise in mass spectrometry (MS), bioinformatics, and systems engineering to deliver platforms capable of analyzing MS-based -omics data at the near petabyte data scale. This presentation will provide an in-depth look into practical challenges and solutions developed to support proteomics, metabolomics, and lipidomics data analysis and management. We will describe our MS and informatics capabilities and how they are designed to handle the acquisition and analysis of large-scale data from >10K proteomes and metabolomes per year in a high-performance computing environment. Central to this is our focus on end-to-end data throughput and reusability, while handling the complexities of data harmonization, storage, and computational demand. In addition, we will highlight several key aspects of this end-to-end omics platform including stringent data quality control measures and streamlined metadata capture to enable automation, reuse, visualization, and interpretation across a diverse array of experiments. Collectively, the topics covered will provide key insight on the importance of robust experimental and computational MS workflows for advancing drug discovery and development.

14:20-15:00
Invited Presentation: A unified LC-MS metabolomics framework for multi-omics and systems biology
Confirmed Presenter: Jianguo Xia

Room: 525
Format: In Person

Moderator(s): Timo Sachsenberg


Authors List:

  • Jianguo Xia

Presentation Overview:

Metabolites are key mediators of host-environment interactions. Global or untargeted metabolomics based on liquid chromatography-mass spectrometry (LC-MS) can provide rich information on host genetics, metabolism, microbiome composition, and environmental exposures. However, processing, annotation, and analysis of LC-MS and MS/MS metabolomics data within the context of other omics remain a major bottleneck. In this talk, I will share our recent progress on developing algorithms, platforms and resources to enable comprehensive metabolomics, microbiomics and multi-omics data analysis, and showcase the main features using two case studies on exposomics and diabetes.

15:00-15:40
waveome: characterizing temporal dynamics of metabolites in longitudinal studies
Confirmed Presenter: Ali Rahnavard, The George Washington University, United States

Room: 525
Format: In Person

Moderator(s): Timo Sachsenberg


Authors List:

  • Allen Ross, The George Washington University, United States
  • Ali Reza Taheriyoun, The George Washington University, United States
  • Jason Lloyd-Price, Google, United States
  • Ali Rahnavard, The George Washington University, United States

Presentation Overview:

Longitudinal studies, clinical trials, and omics measurements are reshaping drug development by providing a comprehensive view of disease progression, treatment responses, and biological markers. This integration enhances drug discovery efficiency and enables advanced technologies like AI. However, challenges such as correlated subjects, data sparsity, high dimensionality, and limited samples hinder the analysis of longitudinal omics datasets. To overcome these challenges, we offer adaptable machine learning techniques with user-friendly software, including Gaussian processes for temporal modeling to uncover omics relationships. Longitudinal metabolomics, specific yet challenging due to noise and dimensionality, is explored using Gaussian processes applied to Crohn's disease metabolomics data.
Additionally, longitudinal data span various studies, including time-based omics. Decision-making in longitudinal omics presents high-dimensional challenges. Practical studies on Maternal-Infant omics, Inflammatory Bowel Disease, and the swan gut microbiome illustrate modeling intricacies, highlighting potential and challenges. Access our software at https://github.com/omicsEye/waveome.

AI-driven de novo structural candidate generation for mass spectra annotation
Confirmed Presenter: Margaret Martin, Tufts University, United States

Room: 525
Format: In Person

Moderator(s): Timo Sachsenberg


Authors List:

  • Margaret Martin, Tufts University, United States
  • Soha Hassoun, Tufts University, United States

Presentation Overview:

Despite the increase in reference library size and available annotation tools, the rate of assignment of molecular structures to mass spectra remains low. Importantly, because not all natural products are known or cataloged in databases, generative AI models can be used to predict structural candidates for spectra annotation. REINVENT4 is an AI framework that may be customized to facilitate de novo generation of molecular structures with desired properties, for applications such as drug discovery (Loeffler, 2024). Here, we use two features of this framework: the priors and the custom scoring functions for reinforcement learning. The priors are unbiased generators trained on large molecular datasets. We fine-tune the prior generator model using transfer learning with candidate molecules retrieved from PubChem based on the precursor mass. Using reinforcement learning, we apply custom scoring functions that guide the generation of relevant molecular candidates for a queried spectrum. When sampled, the REINVENT4 framework produces SMILES of de novo candidate molecules. We evaluate our method by applying it to annotated spectra in the CANOPUS dataset. For a sample of 30 spectra, our method proposes the true structure in 20% of cases, an increase over the recently published work, MS2Mol (Butler, 2023). This result represents an increase in structural identification of spectra representing previously unknown molecules and suggests that this method may elucidate the identities of previously unannotated spectra.

A consensus serum metabolome by large-scale data mining reveals major gaps in metabolomic measurements and modeling
Confirmed Presenter: Yuanye Chi, The Jackson Laboratory for Genomic Medicine, United States

Room: 525
Format: In Person

Moderator(s): Timo Sachsenberg


Authors List:

  • Yuanye Chi, The Jackson Laboratory for Genomic Medicine, United States
  • Joshua Mitchell, The Jackson Laboratory for Genomic Medicine, United States
  • Maheshwor Thapa, The Jackson Laboratory for Genomic Medicine, United States
  • Shujian Zheng, The Jackson Laboratory for Genomic Medicine, United States
  • Shuzhao Li, The Jackson Laboratory for Genomic Medicine, United States

Presentation Overview:

Blood is the most commonly analyzed specimen in biomedical applications, and a reference metabolome will be critical for effective annotation and for guiding scientific investigations. However, compiling such a reference is hindered by many technical challenges, despite the availability of large amounts of metabolomics data today. We have designed a series of data structures and tools, including asari (Nature Communications 14, 4113) and khipu (Analytical Chemistry 95, 6212), which enabled a first draft of a consensus serum metabolome assembled from large-scale public data. This assembly is based on about 77,000 metabolomes of human serum or plasma samples measured by Orbitrap mass spectrometers coupled to liquid chromatography. We first validated the approach by cross-laboratory comparison of the Checkmate dataset (1172 samples) vs our HZV029 dataset (1685 samples). Next, all 813 datasets from 110 studies were processed into feature tables with quality metrics from asari, and 49,184 consensus mass tracks were extracted by kernel density estimation. Preannotation (neutral masses and associated isotopologues and adducts) can be aligned across studies, thereby greatly improving feature annotation. About 25% of this consensus serum metabolome is covered by HMDB v5, 50% by PubChemLite and 5% by the current human genome-scale metabolic models, in a frequency-dependent manner. The results indicate significant gaps in the current databases and metabolic models. We will report both the tool development and scientific findings, and the resource will be freely available via a web service.
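
The idea of extracting consensus mass tracks by kernel density estimation can be sketched as follows: pool m/z values observed across datasets, estimate their density with a Gaussian kernel, and take the local maxima as consensus positions. The bandwidth, grid step, toy values, and function name below are hypothetical, and this plain-Python sketch is not the asari implementation.

```python
import math
from typing import List


def kde_peaks(mz_values: List[float], bandwidth: float = 0.002, step: float = 0.0005) -> List[float]:
    """Return grid positions where a Gaussian kernel density has a local maximum."""
    low = min(mz_values) - 3 * bandwidth
    high = max(mz_values) + 3 * bandwidth
    n_points = int(round((high - low) / step)) + 1
    grid = [low + i * step for i in range(n_points)]
    # Unnormalized density: normalization does not change peak positions.
    density = [
        sum(math.exp(-0.5 * ((g - mz) / bandwidth) ** 2) for mz in mz_values)
        for g in grid
    ]
    return [
        grid[i] for i in range(1, n_points - 1)
        if density[i] > density[i - 1] and density[i] >= density[i + 1]
    ]


# Hypothetical m/z values for two mass tracks pooled from several studies.
pooled_mz = [100.0000, 100.0005, 100.0010, 100.0500, 100.0505]
consensus = kde_peaks(pooled_mz)
print(consensus)
```

The bandwidth controls how close two tracks can be before they merge into one consensus position, so in practice it would be tied to the mass accuracy of the instruments.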

Transformers for MALDI-TOF MS-based antimicrobial drug recommendation
Confirmed Presenter: Gaetan De Waele, Ghent University, Belgium

Room: 525
Format: In Person

Moderator(s): Timo Sachsenberg


Authors List:

  • Gaetan De Waele, Ghent University, Belgium
  • Gerben Menschaert, Ghent University, Belgium
  • Willem Waegeman, Ghent University, Belgium

Presentation Overview:

Timely and effective use of antimicrobial drugs can improve patient outcomes, as well as help safeguard against resistance development. Matrix-assisted laser desorption/ionization time-of-flight mass spectrometry (MALDI-TOF MS) is currently routinely used in clinical diagnostics for rapid species identification. Mining additional data from said spectra in the form of antimicrobial resistance (AMR) profiles is, therefore, highly promising. Such AMR profiles could serve as a drop-in solution for drastically improving treatment efficiency, effectiveness, and costs. Stifled by a historical lack of open data, machine learning research towards models specifically adapted to MALDI-TOF MS remains in its infancy.

Here, we introduce Maldi Transformer, an adaptation of the state-of-the-art transformer architecture to the MALDI-TOF mass spectral domain. We propose the first self-supervised pre-training technique adapted to mass spectra. The technique is based on shuffling peaks across spectra, and pre-training the transformer as a peak discriminator.
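
A minimal sketch of how training examples for such a peak discriminator might be built: a fraction of peaks in one spectrum is replaced by peaks drawn from another spectrum, and each position is labeled as original (0) or swapped in (1). The peak representation as (m/z, intensity) tuples, the swap fraction, and the function name are assumptions for illustration, not details from the Maldi Transformer code.

```python
import random
from typing import List, Tuple

Peak = Tuple[float, float]  # (m/z, intensity), an assumed representation


def shuffle_peaks(
    spectrum: List[Peak], donor: List[Peak], fraction: float, rng: random.Random
) -> Tuple[List[Peak], List[int]]:
    """Swap a fraction of peaks for donor peaks; label swapped positions with 1."""
    n_swap = int(round(fraction * len(spectrum)))
    swap_positions = set(rng.sample(range(len(spectrum)), n_swap))
    peaks, labels = [], []
    for index, peak in enumerate(spectrum):
        if index in swap_positions:
            peaks.append(rng.choice(donor))
            labels.append(1)
        else:
            peaks.append(peak)
            labels.append(0)
    return peaks, labels


rng = random.Random(0)
spectrum_a = [(2000.0 + 100.0 * i, 1.0) for i in range(10)]
spectrum_b = [(2050.0 + 100.0 * i, 0.5) for i in range(10)]
mixed, labels = shuffle_peaks(spectrum_a, spectrum_b, 0.3, rng)
print(labels)
```

A model pre-trained on such pairs would be asked to predict `labels` from `mixed`, which requires no resistance annotations.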

We deploy the proposed method to predict AMR profiles for the whole repertoire of species and drugs encountered in clinical microbiology. The resulting model can be interpreted as a drug recommender system for infectious diseases. We find that our dual-branch method delivers considerably higher performance compared to previous approaches. In addition, experiments show that the models can be efficiently fine-tuned to data from other clinical laboratories. Maldi Transformer-based recommender systems can, hence, greatly extend the value of MALDI-TOF MS for clinical diagnostics.

15:40-16:00
Invited Presentation: The sky is the limit: a cloud-based proteomics platform for the masses
Confirmed Presenter: Martin Frejno, MSAID, Germany

Room: 525
Format: In Person

Moderator(s): Timo Sachsenberg


Authors List:

  • Daniel Zolg, MSAID, Germany
  • Markus Schneider, MSAID, Germany
  • Patroklos Samaras, MSAID, Germany
  • Samia Ben Fredj, MSAID, Germany
  • Florian Seefried, MSAID, Germany
  • Dulguun Bold, MSAID, Germany
  • Layla Eljagh, MSAID, Germany
  • Tobias Schmidt, MSAID, Germany
  • Siegfried Gessulat, MSAID, Germany
  • Martin Frejno, MSAID, Germany

Presentation Overview:

Background: Laboratories dealing with bottom-up proteomics data often encounter computational hurdles in the journey from raw data to conclusive insights. Challenges range from the absence of automated pipelines and disjointed local infrastructure for data storage to processing, systematic result management, and interpretation. The advent of fast-scanning instruments exacerbates these issues by overwhelming local infrastructure with a multitude of files and large raw data sizes. Here, we introduce a highly scalable, fully automatable, cloud-based proteomics platform designed to streamline the entire workflow.

Methods: Our cloud-native platform comprises microservices operated on AWS and orchestrated by Kubernetes. Users can access the platform through either a command line client or a browser-based interface, both interacting with an API governing all platform functionalities. Raw data undergoes processing using Chimerys 4 on an elastic compute cluster. Results are stored in a data lake and can be explored directly in the browser or downloaded. Metadata annotation facilitates navigation and contextualization of numerous files. Platform access is available through subscription or self-hosted deployment.

Results: We present a comprehensive, managed solution for proteomics data management, obviating the need for user-managed pipelines and infrastructure. The platform offers an intuitive web interface for collaborative data upload, management, and processing. File transfer occurs at speeds of up to 100 MB/s into scalable object storage. Raw data can be annotated with metadata via a searchable tag system, simplifying organization and retrieval. A scalable compute cluster enables simultaneous processing of DDA, DIA, and PRM data from thousands of files. The platform is algorithm-independent, currently supporting Chimerys 4 with plans for additional search engines. We demonstrate the scalability by processing multiple files without significant increase in processing time compared to single file processing. Processed data can be organized using the same tag system employed for raw data, with the processing overview providing immediate insight into key parameters for data quality assessment. A fast post-processing workflow combines individually searched raw files, facilitating longitudinal data acquisition and processing without overheads. Results can be accessed via API- or browser-based download, direct API access to the result data lake, browser-based data exploration, or a customizable visualization dashboard featuring common data analyses and visualizations.

Conclusions: This managed, automated proteomics data pipeline promises to streamline the journey from raw data to insights, particularly benefiting laboratories lacking the resources to develop and maintain in-house solutions.

16:40-17:00
Proceedings Presentation: SpecEncoder: Deep Metric Learning for Accurate Peptide Identification in Proteomics
Confirmed Presenter: Haixu Tang, Indiana University Bloomington, United States

Room: 525
Format: In Person

Moderator(s): Timo Sachsenberg


Authors List:

  • Kaiyuan Liu, Indiana University Bloomington, United States
  • Chenghua Tao, Indiana University Bloomington, United States
  • Yuzhen Ye, Indiana University Bloomington, United States
  • Haixu Tang, Indiana University Bloomington, United States

Presentation Overview:

Tandem mass spectrometry (MS/MS) is a crucial technology for large-scale proteomic analysis. Protein database search and spectral library search are commonly used for peptide identification from MS/MS spectra; however, both may face challenges due to experimental variations between replicated spectra and similar fragmentation patterns among distinct peptides. To address these challenges, we present SpecEncoder, a deep metric learning approach that transforms MS/MS spectra into robust and sensitive embedding vectors in a latent space. The SpecEncoder model can also embed predicted MS/MS spectra of peptides, enabling a hybrid search approach that combines spectral library and protein database searches for peptide identification. We evaluated SpecEncoder on three large human proteomics datasets, and the results showed a consistent improvement in peptide identification. For spectral library search, SpecEncoder identifies ~1-2% more unique peptides (and PSMs) than SpectraST. For protein database search, it identifies 6-15% more unique peptides than MSGF+ enhanced by Percolator. Furthermore, SpecEncoder identified 6-12% additional unique peptides when utilizing a combined library of experimental and predicted spectra. SpecEncoder can also identify more peptides when compared to deep-learning enhanced methods (MSFragger boosted by MSBooster). These results demonstrate SpecEncoder's potential to enhance peptide identification for proteomic data analyses.
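
Once spectra are mapped to embedding vectors, identification can be framed as a nearest-neighbor search in the latent space. The sketch below uses cosine similarity over a small in-memory library; the similarity measure, the toy three-dimensional embeddings, and the peptide names are assumptions, since the abstract does not specify how SpecEncoder compares embeddings.

```python
import math
from typing import Dict, List


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_match(query: List[float], library: Dict[str, List[float]]) -> str:
    """Return the library entry whose embedding is most similar to the query."""
    return max(library, key=lambda name: cosine_similarity(query, library[name]))


# Hypothetical embeddings; a real library could mix experimental and predicted spectra.
library = {
    "PEPTIDEK": [1.0, 0.0, 0.0],
    "ELVISLIVESK": [0.0, 1.0, 0.0],
}
print(best_match([0.9, 0.1, 0.0], library))
```

Because experimental and predicted spectra share one embedding space in this framing, the same search covers both library and database candidates.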

17:00-17:40
FLASHTagger: An open-source web application for ion type- and precursor mass-free protein identification in top-down mass spectrometry
Confirmed Presenter: Kyowon Jeong, Applied Bioinformatics, Department for Computer Science, University of Tübingen, Germany

Room: 525
Format: In Person

Moderator(s): Timo Sachsenberg


Authors List:

  • Kyowon Jeong, Applied Bioinformatics, Department for Computer Science, University of Tübingen, Germany
  • Wonhyeuk Jung, Department of Cell Biology, Yale School of Medicine, United States
  • Tom Müller, Applied Bioinformatics, Department for Computer Science, University of Tübingen, Germany
  • Jaywon Lee, Department of Cell Biology, Yale School of Medicine, United States
  • Aniruddha Panda, Department of Cell Biology, Yale School of Medicine, United States
  • Jared Shaw, Department of Chemistry, University of Nebraska-Lincoln, United States
  • Louise Buur, Bioinformatics Research Group, University of Applied Sciences Upper Austria, Austria
  • Viktoria Dorfer, Bioinformatics Research Group, University of Applied Sciences Upper Austria, Austria
  • Oliver Kohlbacher, Applied Bioinformatics, Department for Computer Science, University of Tübingen, Germany
  • Kallol Gupta, Department of Cell Biology, Yale School of Medicine, United States

Presentation Overview:

The growing capacity to detect proteins and protein complexes in MS poses computational challenges in identifying them through top-down MS (TDMS). While alternative fragmentation methods such as ECD and UVPD open up multiple fragmentation pathways, increasing sequence coverage, they also complicate the interpretation of fragment spectra. Recently developed protocols like complex-down MS often isolate protein complexes at once, yielding multiplexed fragment spectra. Together with the complex signal structure of TDMS spectra and frequent errors in deconvolution, they present challenges in correct precursor ion interpretation.
Addressing these issues, we present FLASHTagger, a high-sensitivity protein identification tool for TDMS platforms. Unlike most conventional database searches, FLASHTagger is de novo sequence tag-based and thus runs without specifying fragment ion type or assuming monomeric proteoform precursors. The tags enable rapid protein searches, with a protein-level false discovery rate (FDR) control. Benchmark tests performed with EChcD datasets from monoclonal antibody and E. coli membrane proteins showed that FLASHTagger can reliably identify individual target proteins from MS/MS scans of multimeric complexes. Analysis of the matched tags revealed various ion types, including internal ions, and well-known protein modifications.
The current implementation of FLASHTagger focuses on low-complexity datasets, but analysis of complex datasets will be made available in the near future as part of our new proteoform search engine. We anticipate that the precursor-independent feature of FLASHTagger will open the door to data-independent acquisition in top-down proteomics (TDP). FLASHTagger is deployed as part of the OpenMS web application FLASHViewer at https://abi-services.cs.uni-tuebingen.de/flashviewer/.

Imputation of cancer proteomics data with a deep model that learns jointly from many datasets
Confirmed Presenter: Lincoln Harris, University of Washington, United States

Room: 525
Format: In Person

Moderator(s): Timo Sachsenberg


Authors List:

  • Lincoln Harris, University of Washington, United States
  • William Noble, University of Washington, United States

Presentation Overview:

TMT proteomics suffers from excessive missing values, especially in the large-scale, multi-batch experimental setting. Imputation is an analytical solution to the missingness problem. Many methods exist for TMT proteomics imputation; however, few of them take advantage of deep neural networks, and none of them can learn jointly from multiple datasets. We introduce Lupine, a deep learning-based imputation tool that learns patterns of missingness across many mass spectrometry runs and experiments. We demonstrate that Lupine outperforms the current state-of-the-art and learns meaningful representations of experimental structure and protein physicochemical properties.

We first constructed a joint protein quantifications matrix consisting of mass spectrometry runs from 10 cancer cohorts from the Clinical Proteomic Tumor Analysis Consortium (CPTAC). These data were generated with a common experimental workflow and processing pipeline. We then developed a deep learning model that leverages matrix factorization to learn low-dimensional representations of proteins and mass spectrometry runs. This model, called Lupine, was trained on our joint protein quantifications matrix.
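
The matrix factorization idea behind this kind of imputation can be sketched in a few lines: learn one factor per protein (row) and one per run (column) from the observed entries only, then fill missing entries with the product of the matching factors. This is plain single-factor matrix factorization trained by stochastic gradient descent, not the Lupine deep learning model; the learning rate, epoch count, initial value, and toy matrix are all hypothetical.

```python
from typing import List, Optional, Tuple

Matrix = List[List[Optional[float]]]


def factorize(matrix: Matrix, init: float = 0.5, lr: float = 0.02,
              epochs: int = 2000) -> Tuple[List[float], List[float]]:
    """Fit one latent factor per row and per column using observed entries only."""
    row_factors = [init] * len(matrix)
    col_factors = [init] * len(matrix[0])
    observed = [
        (i, j, value)
        for i, row in enumerate(matrix)
        for j, value in enumerate(row)
        if value is not None
    ]
    for _ in range(epochs):
        for i, j, value in observed:
            r, c = row_factors[i], col_factors[j]
            error = value - r * c
            # Gradient step on the squared reconstruction error.
            row_factors[i] = r + lr * error * c
            col_factors[j] = c + lr * error * r
    return row_factors, col_factors


def impute(matrix: Matrix, row_factors: List[float], col_factors: List[float]) -> List[List[float]]:
    """Keep observed values and fill missing ones with the factor product."""
    return [
        [row_factors[i] * col_factors[j] if value is None else value
         for j, value in enumerate(row)]
        for i, row in enumerate(matrix)
    ]


# Hypothetical proteins-by-runs matrix with one missing quantity (true value 3.0).
toy = [[1.0, 0.5, 2.0], [2.0, 1.0, 4.0], [1.5, 0.75, None]]
rows, cols = factorize(toy)
filled = impute(toy, rows, cols)
print(round(filled[2][2], 2))
```

Training on a joint matrix spanning several cohorts means the per-protein factors are shared across datasets, which is the sense in which such a model can learn jointly.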

We show that Lupine outperforms DreamAI, an ensemble imputation method that represents the current state-of-the-art for TMT proteomics. For each of the 10 CPTAC cohorts, the mean squared error of Lupine’s imputed values is lower than DreamAI’s. We also show that Lupine learns a latent representation of proteins that captures missingness fraction and other protein physicochemical properties. Lupine increases the number of differentially expressed proteins between CPTAC cohorts and improves clustering accuracy.

In summary, Lupine is the only existing proteomics imputation method that can learn jointly from many datasets.

Proteogenomics analysis of human tissues using pangenomes
Confirmed Presenter: Husen M. Umer, Bioscience Core Laboratory, King Abdullah University of Science and Technology, Thuwal, 23955-6900, Saudi Arabia

Room: 525
Format: In Person

Moderator(s): Timo Sachsenberg


Authors List:

  • Dong Wang, Chongqing Key Laboratory of Big Data for Bio Intelligence, Chongqing University of Posts and Telecommunications, China
  • Robbin Bouwmeester, VIB-UGent Center for Medical Biotechnology, VIB, Ghent, Belgium
  • Aniel Sanchez, Section for Clinical Chemistry, Department of Translational Medicine, Lund University, Skane University Hospital Malmö, Sweden
  • Mingze Bai, Chongqing Key Laboratory of Big Data for Bio Intelligence, Chongqing University of Posts and Telecommunications, China
  • Husen M. Umer, Bioscience Core Laboratory, King Abdullah University of Science and Technology, Thuwal, 23955-6900, Saudi Arabia
  • Yasset Perez-Riverol, European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, United Kingdom

Presentation Overview: Show

The genomics landscape is evolving with the emergence of pangenomes, challenging the conventional single-reference genome model. The new human pangenome reference provides an extra dimension by incorporating variations observed in different human populations. However, the increasing use of pangenomes in human reference databases poses challenges for proteomics, which currently relies on UniProt canonical/isoform-based reference proteomes. Including more variant information in human proteomes, such as small and long open reading frames and pseudogenes, prompts the development of complex proteogenomics pipelines for analysis and validation. This study explores the advantages of pangenomes, particularly the human reference pangenome, for proteomics and large-scale proteogenomics studies. We reanalyzed two large human tissue datasets using the quantms workflow to identify novel peptides and variant proteins from the pangenome samples. Using three search engines (SAGE, COMET, and MSGF+) followed by Percolator, we analyzed 91,833,481 MS/MS spectra from more than 30 normal human tissues. We developed a robust deep-learning framework to validate the novel peptides based on DeepLC, MS2PIP and pyspectrumAI. The results yielded 170,142 novel peptide-spectrum matches, 4,991 novel peptide sequences, and 3,921 single amino acid variants, corresponding to 2,367 genes across five population groups, demonstrating the effectiveness of our proteogenomics approach using the recent pangenome references.

Optimising Thermal Proteome Profiling experimental design with GPMelt
Confirmed Presenter: Cecile Le Sueur, EMBL Heidelberg, Germany

Room: 525
Format: In Person

Moderator(s): Timo Sachsenberg


Authors List: Show

  • Cecile Le Sueur, EMBL Heidelberg, Germany
  • Pablo Rivera-Mejías, EMBL Heidelberg, Germany
  • Isabelle Becher, EMBL Heidelberg, Germany
  • Mikhail Savitski, EMBL Heidelberg, Germany
  • Magnus Rattray, The University of Manchester, United Kingdom

Presentation Overview: Show

Thermal proteome profiling (TPP) combines the cellular thermal shift assay with quantitative mass spectrometry to explore protein interactions and states proteome-wide. Temperature range TPP (TPP-TR) datasets consist of protein melting curves, which quantify non-denatured proteins across a temperature gradient. Thermal stability changes are statistically evaluated by comparing melting curves between conditions, such as drug treatment versus control. While TPP is powerful and versatile in uncovering new biology, its experimental cost, including consumables and mass spectrometry measurement time, limits its accessibility to well-funded researchers. Moreover, the sample requirements hinder its application to rare samples. We recently introduced GPMelt, a statistical framework for TPP-TR datasets based on hierarchical Gaussian process models, which robustly integrates replicate information and handles any melting curve shape. Here, we propose an enhanced GPMelt model together with an optimized low-cost and low-sample TPP-TR experimental design. By halving the consumables and mass spectrometry measurement time, this work broadens TPP-TR accessibility to a wider scientific community and opens its application to precious samples. Additionally, it establishes a smooth connection between TPP-TR and 2D-TPP protocols and analyses. 2D-TPP datasets compare protein thermal stability across a larger number of conditions, using a distinct sample multiplexing strategy that hinders melting curve reconstruction. Adapting this multiplexing strategy in combination with the enhanced GPMelt model strengthens 2D-TPP discoveries by retaining melting curve modeling, and hence key biological information. Collectively, extensions to the GPMelt model, combined with optimised experimental designs for both TPP-TR and 2D-TPP, can significantly increase TPP’s effectiveness and dissemination among scientists, paving the way for groundbreaking biological discoveries.
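
To make the idea of fitting a melting curve of arbitrary shape with a Gaussian process more concrete, here is a minimal sketch. It is a plain single-level GP regression with an RBF kernel, not the hierarchical model that GPMelt uses; the kernel choice, length scale, noise level, and the function names `rbf`, `solve`, and `gp_mean` are all assumptions for illustration.

```python
import math


def rbf(a, b, length=5.0, var=1.0):
    """Squared-exponential kernel between two temperatures."""
    return var * math.exp(-0.5 * ((a - b) / length) ** 2)


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x


def gp_mean(temps, fractions, query, noise=1e-2, length=5.0):
    """Posterior mean of a GP fitted to (temperature, soluble fraction) pairs."""
    mean = sum(fractions) / len(fractions)
    # Kernel matrix of the training temperatures plus observation noise.
    K = [
        [rbf(a, b, length) + (noise if i == j else 0.0)
         for j, b in enumerate(temps)]
        for i, a in enumerate(temps)
    ]
    alpha = solve(K, [f - mean for f in fractions])
    return [
        mean + sum(rbf(q, t, length) * a for t, a in zip(temps, alpha))
        for q in query
    ]
```

Because the GP makes no parametric assumption about the curve, the same code applies to sigmoidal and non-sigmoidal melting behaviour; comparing conditions or pooling replicates would require the hierarchical structure described in the abstract, which this sketch does not attempt.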

17:40-18:00
Proceedings Presentation: An algorithm for decoy-free false discovery rate estimation in XL-MS/MS
Confirmed Presenter: Shantanu Jain, Northeastern University, United States

Room: 525
Format: In Person

Moderator(s): Timo Sachsenberg


Authors List: Show

  • Yisu Peng, Northeastern University, United States
  • Shantanu Jain, Northeastern University, United States
  • Predrag Radivojac, Northeastern University, United States

Presentation Overview: Show

Motivation: Cross-linking tandem mass spectrometry (XL-MS/MS) proteomics is an established technique that determines distance constraints between residues within a protein or between interacting proteins, thus improving our understanding of protein structure and function under native cellular conditions. To aid biological discovery, it is essential that pairs of chemically linked peptides be accurately identified, a process that requires: (i) a database search, which creates a ranked list of candidate peptide pairs for each experimental spectrum, and (ii) false discovery rate (FDR) estimation, which determines the probability of false identification among the top-ranked peptide pairs for a given score threshold. Currently, the only available FDR estimation mechanism in XL-MS/MS is the target-decoy approach (TDA). However, despite its simplicity, TDA has both theoretical and practical drawbacks.
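
For context, the basic target-decoy estimate referred to above can be sketched as follows. This is the simplest textbook form (decoy hits divided by target hits above a score threshold); TDA variants used for cross-linked peptide pairs additionally distinguish target-decoy from decoy-decoy pairs, which this sketch does not model. The function name `tda_fdr` and the input layout are assumptions.

```python
def tda_fdr(scores, is_decoy, threshold):
    """Estimate FDR at a score threshold as (#decoys) / (#targets) at or above it."""
    targets = sum(
        1 for s, d in zip(scores, is_decoy) if s >= threshold and not d
    )
    decoys = sum(
        1 for s, d in zip(scores, is_decoy) if s >= threshold and d
    )
    if targets == 0:
        return 0.0  # nothing accepted, so no false discoveries to report
    return min(1.0, decoys / targets)
```

With scores `[10, 9, 8, 7, 6, 5]` and decoy flags `[False, False, False, True, False, True]`, a threshold of 6 accepts four targets and one decoy, giving an estimated FDR of 0.25.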

Results: We introduce a novel decoy-free approach (DFA) for FDR estimation in XL-MS/MS. Our approach relies on multi-sample mixtures of skew normal distributions, where the latent components correspond to the scores of correct peptide pairs (both peptides identified correctly), partially incorrect peptide pairs (one peptide identified correctly, the other incorrectly), and incorrect peptide pairs (both peptides identified incorrectly). To learn these components, we exploit the score distributions of first- and second-ranked peptide-spectrum matches (PSMs) for each experimental spectrum and subsequently estimate FDR using a novel expectation-maximization (EM) algorithm with constraints. We evaluate the method on ten datasets and provide evidence that the proposed DFA is theoretically sound and a viable alternative to TDA owing to its good performance in terms of accuracy, variance of estimation, and run time.
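
The general principle of mixture-based, decoy-free FDR estimation can be sketched in a much simplified form. The example below fits a two-component Gaussian mixture with ordinary EM and reads the FDR off the fitted tails. It deliberately departs from the method described above: it uses Gaussians rather than skew normals, two components rather than three, a single sample rather than first- and second-ranked PSMs, and no constraints. All function names are hypothetical.

```python
import math


def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))


def normal_tail(t, mu, sigma):
    """P(X >= t) for X ~ N(mu, sigma^2)."""
    return 0.5 * math.erfc((t - mu) / (sigma * math.sqrt(2)))


def fit_two_gaussians(scores, iters=200):
    """Plain EM for a mixture of an 'incorrect' and a 'correct' score component."""
    lo, hi = min(scores), max(scores)
    w = 0.5  # weight of the correct (higher-scoring) component
    mu0, mu1 = lo + 0.25 * (hi - lo), lo + 0.75 * (hi - lo)
    s0 = s1 = max((hi - lo) / 4, 1e-3)
    n = len(scores)
    for _ in range(iters):
        # E-step: posterior probability that each score is a correct match.
        resp = []
        for x in scores:
            a = (1 - w) * normal_pdf(x, mu0, s0)
            b = w * normal_pdf(x, mu1, s1)
            resp.append(b / (a + b) if a + b > 0 else 0.5)
        # M-step: re-estimate weight, means, and standard deviations.
        n1 = sum(resp)
        n0 = n - n1
        if n0 < 1e-9 or n1 < 1e-9:
            break
        w = n1 / n
        mu0 = sum((1 - r) * x for r, x in zip(resp, scores)) / n0
        mu1 = sum(r * x for r, x in zip(resp, scores)) / n1
        s0 = max(math.sqrt(sum((1 - r) * (x - mu0) ** 2
                               for r, x in zip(resp, scores)) / n0), 1e-3)
        s1 = max(math.sqrt(sum(r * (x - mu1) ** 2
                               for r, x in zip(resp, scores)) / n1), 1e-3)
    if mu0 > mu1:  # keep the "correct" component as the higher-scoring one
        mu0, mu1, s0, s1, w = mu1, mu0, s1, s0, 1 - w
    return {"w_correct": w, "mu_incorrect": mu0, "sd_incorrect": s0,
            "mu_correct": mu1, "sd_correct": s1}


def mixture_fdr(params, threshold):
    """Expected fraction of incorrect matches among scores >= threshold."""
    bad = (1 - params["w_correct"]) * normal_tail(
        threshold, params["mu_incorrect"], params["sd_incorrect"])
    good = params["w_correct"] * normal_tail(
        threshold, params["mu_correct"], params["sd_correct"])
    total = bad + good
    return bad / total if total > 0 else 0.0
```

No decoy database is needed here: the incorrect-match distribution is inferred from the observed scores themselves, which is the property that distinguishes this family of methods from TDA.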