CompMS COSI

Schedule subject to change
All times listed are in CDT
Monday, July 11th
10:30-11:10
Keynote Presentation: Machine Learning Methods for Proteomics
Room: KOPL
Format: Live from venue

  • Brian Searle, The Ohio State University, United States


Presentation Overview:

Shotgun proteomics using mass spectrometry is enabling a revolution in the study of large-scale systems biology. Data-independent acquisition (DIA) is a mass spectrometry technique that regularly samples co-fragmented ions produced from multiple peptides falling within a specified mass range. While comprehensive, this approach results in highly complex mass spectra requiring the reconstruction of which fragment ions were produced from which peptides. Here we discuss machine learning methods to make those assignments and assess our confidence in that data.

11:10-11:30
An AI-driven leap forward in peptide identification through deconvolution of chimeric spectra
Room: KOPL
Format: Live-stream

  • Martin Frejno, MSAID GmbH, Germany
  • Daniel P Zolg, MSAID GmbH, Germany
  • Tobias Schmidt, MSAID GmbH, Germany
  • Siegfried Gessulat, MSAID GmbH, Germany
  • Michael Graber, MSAID GmbH, Germany
  • Florian Seefried, MSAID GmbH, Germany
  • Magnus Rathke-Kuhnert, MSAID GmbH, Germany
  • Samia Ben Fredj, MSAID GmbH, Germany
  • Patroklos Samaras, MSAID GmbH, Germany
  • Kai Fritzemeier, Thermo Fisher Scientific (Bremen) GmbH, Germany
  • Frank Berg, Thermo Fisher Scientific (Bremen) GmbH, Germany
  • Waqas Nasir, Thermo Fisher Scientific (Bremen) GmbH, Bremen, Germany
  • David Horn, Thermo Fisher Scientific, United States
  • Bernard Delanghe, Thermo Fisher Scientific (Bremen) GmbH, Germany
  • Christoph Henrich, Thermo Fisher Scientific (Bremen) GmbH, Germany
  • Bernhard Kuster, Technical University of Munich, Germany
  • Mathias Wilhelm, Technical University of Munich, Germany


Presentation Overview:

Chimeric spectra are estimated to constitute >40% of DDA data, violating the assumption that one spectrum represents one peptide. Here, we describe a new intelligent search algorithm (CHIMERYS) that rethinks the analysis of tandem mass spectra from the ground up. It routinely doubles the number of peptide identifications and reaches identification rates of >80%.

Analyzing a HeLa tryptic digest (1 hour gradient) with our new algorithm identified 114k PSMs, 61k unique peptides and 7,300 unique protein groups at 1% FDR. This is a 3.5-, 2- and 1.5-fold increase compared to SequestHT, respectively, resulting on average in 2.5-fold more identified peptides per protein.

We successfully demonstrated the fidelity of our new algorithm in four experiments: I) entrapment searches focusing on FDR-estimation, II) dilution experiments focusing on expected ratio distributions, III) comparisons with multiple search engines focusing on the overlap of identifications, IV) simulation experiments focusing on the deconvolution of chimeric spectra.

Our new algorithm is compatible with older mass spectrometer generations, but benefits disproportionately from the increased sensitivity of recent instruments and from measurements using wider isolation windows. It substantially outperforms other search engines on data of different complexity such as body fluids and organisms from all kingdoms of life.

11:30-11:40
Detecting more peptides from bottom-up mass spectrometry data via peptide-level target-decoy competition
Room: KOPL
Format: Live from venue

  • Andy Lin, Pacific Northwest National Laboratory, United States
  • Temana Short, University of Sydney, Australia
  • William Noble, University of Washington, United States
  • Uri Keich, University of Sydney, Australia


Presentation Overview:

Analysis of the data produced by a proteomics experiment yields a set of discoveries, which can be summarized in terms of peptide-spectrum matches (PSMs), peptides, or proteins. A critical statistical task involves controlling the false discovery rate (FDR) among the accepted set of discoveries. This task is typically solved at the PSM level by using target-decoy competition (TDC), where a set of observed spectra is searched against a database containing real (target) and decoy peptides. Here, we investigate both PSM-level and peptide-level methods for FDR control, and we come to two conclusions. First, as previously noted by He et al. (arXiv, 2015), although the TDC procedure is provably correct under certain assumptions, we observe that in practice one of these key assumptions is violated. Hence, we empirically demonstrate that the PSM-level estimates offered by TDC are liberally biased. We therefore propose that researchers avoid summarizing their results at the PSM level and instead focus on peptide-level analysis. Second, we investigate three ways to perform peptide-level TDC. We show that the most commonly used method offers the lowest statistical power in practice. The most powerful method carries out competition first at the PSM level and then again at the peptide level.
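
The two-stage competition described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `PSM` tuple, the `decoy_of` mapping that pairs each target peptide with one decoy, the tie handling (ties go to the decoy), and the (decoys + 1) / targets estimate are all assumptions made for illustration.

```python
from collections import namedtuple

# Hypothetical record: one peptide-spectrum match with a search-engine score.
PSM = namedtuple("PSM", "spectrum peptide score")


def psm_level_competition(psms):
    """Stage 1: keep only the best-scoring match (target or decoy) per spectrum."""
    best = {}
    for psm in psms:
        current = best.get(psm.spectrum)
        if current is None or psm.score > current.score:
            best[psm.spectrum] = psm
    return list(best.values())


def peptide_level_competition(psms, decoy_of):
    """Stage 2: score each peptide by its best PSM, then let every target
    compete with its paired decoy. Returns (peptide, score, is_decoy) tuples."""
    missing = float("-inf")
    pep_score = {}
    for psm in psms:
        pep_score[psm.peptide] = max(psm.score, pep_score.get(psm.peptide, missing))

    winners = []
    for target, decoy in decoy_of.items():
        t = pep_score.get(target, missing)
        d = pep_score.get(decoy, missing)
        if t == d == missing:
            continue  # neither member of the pair was observed
        if t > d:
            winners.append((target, t, False))
        else:  # assumption: ties are resolved in favour of the decoy
            winners.append((decoy, d, True))
    return winners


def estimate_fdr(winners, threshold):
    """Estimate the FDR among target peptides scoring at or above threshold."""
    targets = sum(1 for _, s, is_decoy in winners if s >= threshold and not is_decoy)
    decoys = sum(1 for _, s, is_decoy in winners if s >= threshold and is_decoy)
    return min(1.0, (decoys + 1) / max(targets, 1))
```

A caller would run `psm_level_competition` on all matches, pass the survivors to `peptide_level_competition`, and then pick the lowest score threshold at which `estimate_fdr` stays below the desired level.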

11:40-12:00
De novo mass spectrometry peptide sequencing with a transformer model
Room: KOPL
Format: Live-stream

  • Melih Yilmaz, Paul G. Allen School of Computer Science and Engineering, University of Washington, United States
  • William Fondrie, Talus Bioscience, United States
  • Wout Bittremieux, Skaggs School of Pharmacy and Pharmaceutical Science, University of California San Diego, United States
  • Sewoong Oh, Paul G. Allen School of Computer Science and Engineering, University of Washington, United States
  • William Noble, Department of Genome Sciences, University of Washington, United States


Presentation Overview:

Tandem mass spectrometry is the only high-throughput method for analyzing the protein content of complex biological samples and is thus the primary technology driving the growth of the field of proteomics. A key outstanding challenge in this field involves identifying the sequence of amino acids (the peptide) responsible for generating each observed spectrum, without making use of prior knowledge in the form of a peptide sequence database. Although various machine learning methods have been developed to address this de novo sequencing problem, challenges that arise when modeling tandem mass spectra have led to complex models that combine multiple neural networks and post-processing steps. We propose a simple yet powerful method for de novo peptide sequencing, Casanovo, that uses a transformer framework to map directly from a sequence of observed peaks (a mass spectrum) to a sequence of amino acids (a peptide). Our experiments show that Casanovo achieves state-of-the-art performance on a benchmark dataset using a standard cross-species evaluation framework, which involves testing with out-of-distribution samples, i.e., spectra with never-before-seen peptide labels. Casanovo not only achieves superior performance but does so at a fraction of the model complexity and inference time required by other methods.

12:00-12:20
A genetic algorithm with deep learning-based guided mutations improves de novo peptide sequencing
Room: KOPL
Format: Live from venue

  • Daniela Klaproth-Andrade, Technical University of Munich, Germany
  • Johannes Hingerl, Technical University of Munich, Germany
  • Mathias Wilhelm, Technical University of Munich, Germany
  • Julien Gagneur, Technical University of Munich, Germany


Presentation Overview:

De novo peptide sequencing (DNPS), determining the peptide amino acid sequence from a tandem mass spectrum, could make proteomics amenable to applications including genotyping and metagenomics. However, DNPS is highly ambiguous, with state-of-the-art methods having poor recall at high precision. Here we propose three innovations that, when combined, improve DNPS but can also be used individually. First, we consider DNPS as a bin classification problem: whether a discretized m/z value (bin) of a spectrum belongs to a particular ion series. Second, we introduce an amino-acid-gapped convolution layer that is designed to connect distant bins to form consistent ion series. Third, we introduce a fitness function to evaluate how well a candidate peptide matches a given spectrum by training a model estimating the number of single amino acid edits to the correct peptide. Bin classification and the fitness function leverage the peptide-to-spectrum predictor Prosit. The bin classification model yielded high precision-recall of bin classes and the fitness function precisely evaluates any peptide-spectrum match. We combined the methods in a genetic algorithm. Initial results of the genetic algorithm on a human cell line dataset increased the recall by 19.3% at 90% precision and the overall recall improved by 35.6%.
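
The quantity that the fitness function is trained to estimate, the number of single amino acid edits between a candidate and the correct peptide, can be computed exactly when the correct peptide is known, for example to create training labels. The Python sketch below assumes that an edit means one substitution, insertion, or deletion (the standard Levenshtein distance); the abstract does not spell out the definition, and the function name is hypothetical.

```python
def edit_distance(candidate, reference):
    """Levenshtein distance between two peptide sequences given as strings
    of one-letter amino acid codes."""
    # prev[j] holds the distance between the processed prefix of `candidate`
    # and the first j residues of `reference`.
    prev = list(range(len(reference) + 1))
    for i, res_c in enumerate(candidate, start=1):
        cur = [i]
        for j, res_r in enumerate(reference, start=1):
            cur.append(min(
                prev[j] + 1,                      # delete a residue from candidate
                cur[j - 1] + 1,                   # insert a residue into candidate
                prev[j - 1] + (res_c != res_r),   # substitute (free if identical)
            ))
        prev = cur
    return prev[-1]
```

In a genetic algorithm, a learned estimate of this distance would serve as the fitness score, since the reference peptide is unknown at prediction time.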

12:20-12:30
Bacterial species identification using MALDI-TOF mass spectrometry and machine learning techniques: A large-scale benchmarking study
Room: KOPL
Format: Live from venue

  • Thomas Mortier, KERMIT, Department of Data Analysis and Mathematical Modelling, Ghent University, Belgium
  • Anneleen D. Wieme, BCCM/LMG Bacteria Collection, Laboratory of Microbiology, Ghent University, Belgium
  • Peter Vandamme, BCCM/LMG Bacteria Collection, Laboratory of Microbiology, Ghent University, Belgium
  • Willem Waegeman, KERMIT, Department of Data Analysis and Mathematical Modelling, Ghent University, Belgium


Presentation Overview:

Today, machine learning methods are commonly deployed for bacterial species identification using MALDI-TOF mass spectrometry data. However, most of the studies reported in the literature only consider very traditional machine learning methods on small datasets that contain a limited number of species. In this paper we present benchmarking results on an unprecedented scale for a wide range of machine learning methods, using datasets that contain almost 100,000 spectra and more than 1000 different species. The size and the diversity of the data allow us to compare three important identification scenarios that are often not distinguished in the literature, i.e., identification for novel biological replicates, novel strains and novel species that are not present in the training data. The results demonstrate that in all three scenarios acceptable identification rates are obtained, but the numbers are typically lower than those reported in studies with a more limited analysis. Using hierarchical classification methods, we also demonstrate that taxonomic information is in general not well preserved in MALDI-TOF mass spectrometry data. For the novel species scenario, we apply for the first time neural networks with Monte Carlo dropout, which have been shown to be successful in other domains, such as computer vision, for the detection of novel species.

14:30-15:10
Keynote Presentation: Advancing Metabolomics for Precision Medicine
Room: KOPL
Format: Live from venue

  • Gary Patti, Washington University in St. Louis, United States


Presentation Overview:

Studies of large, diverse, and longitudinal cohorts have the potential to identify subgroups within the population for whom strategies to prevent, diagnose, and treat diseases can be uniquely tailored. This experimental paradigm, often referred to as “precision medicine”, requires profiling thousands or even hundreds of thousands of subjects. The major challenge of studies with exceptionally large sample cohorts is the burden of collecting and processing high volumes of data. Although some technologies may currently be positioned to support such demands (e.g., genomics, wearable devices, etc.), applications of precision medicine in discovery metabolomics have been limited because standard software programs in the field are not designed to process thousands of data files simultaneously. This presentation will first provide a roadmap for the structure of mass spectrometry-based metabolomics data. On the basis of this complexity, a workflow will be introduced to support population-based metabolomics studies. As a proof-of-concept, a metabolomics analysis of COVID-19 patients will be discussed. Using temporal metabolic profiles and machine learning, our workflow enabled us to build a model for predicting which COVID-19 patients admitted to the hospital will go on to develop severe disease.

15:10-15:30
Proceedings Presentation: Deep kernel learning improves molecular fingerprint prediction from tandem mass spectra
Room: KOPL
Format: Live from venue

  • Kai Dührkop, Friedrich-Schiller-University Jena, Germany


Presentation Overview:

Untargeted metabolomics experiments rely on spectral libraries for structure annotation, but these libraries are vastly incomplete; in-silico methods search in structure databases, allowing us to overcome this limitation. The best-performing in-silico methods use machine learning to predict a molecular fingerprint from tandem mass spectra, then use the predicted fingerprint to search in a molecular structure database. Predicted molecular fingerprints are also of great interest for compound class annotation, de novo structure elucidation, and other tasks. So far, kernel support vector machines are the best tool for fingerprint prediction. However, they cannot be trained on all publicly available reference spectra because their training time scales cubically with the number of training examples. Here, we use the Nyström approximation to transform the kernel into a linear feature map. We evaluate two methods that use this feature map as input: a linear SVM and a deep neural network. For evaluation we use a cross-validated dataset of 156,017 compounds and three independent datasets with 1,734 compounds. We show that the combination of kernel method and deep neural network outperforms the kernel support vector machine, which is the current gold standard, as well as a deep neural network on tandem mass spectra on all evaluation datasets.
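
As a rough illustration of how a Nyström approximation turns a kernel into an explicit linear feature map, the following Python sketch uses NumPy. The RBF kernel, the `gamma` value, and the function names are assumptions chosen for the example; the abstract does not say which kernels on spectra are used, and this is not the authors' code.

```python
import numpy as np


def rbf_kernel(a, b, gamma=0.5):
    """Gaussian (RBF) kernel matrix between the rows of a and the rows of b."""
    sq_dist = (a ** 2).sum(axis=1)[:, None] + (b ** 2).sum(axis=1)[None, :] - 2.0 * a @ b.T
    return np.exp(-gamma * np.maximum(sq_dist, 0.0))


def nystroem_feature_map(x, landmarks, kernel=rbf_kernel, eps=1e-10):
    """Map each row of x to features phi(x) such that
    phi(x) @ phi(y) approximates kernel(x, y).

    Uses phi = K_nm @ U @ diag(1 / sqrt(lambda)), where K_mm = U diag(lambda) U^T
    is the kernel matrix among the m landmark points.
    """
    k_mm = kernel(landmarks, landmarks)
    k_nm = kernel(x, landmarks)
    eigvals, eigvecs = np.linalg.eigh(k_mm)
    keep = eigvals > eps  # drop numerically zero directions
    return k_nm @ (eigvecs[:, keep] / np.sqrt(eigvals[keep]))
```

The resulting matrix has one row per sample and at most one column per landmark, so it can be fed to a linear SVM or a neural network, and training cost then depends on the number of landmarks rather than on the full kernel matrix.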

16:00-16:40
Keynote Presentation: Novel CompMS Approaches for Top-down Proteomics toward Precision Medicine
Room: KOPL
Format: Live from venue

  • Ying Ge

16:40-17:00
DEIMoS: an open-source tool for processing high-dimensional mass spectrometry data
Room: KOPL
Format: Live-stream

  • Sean Colby, Pacific Northwest National Laboratory, United States
  • Christine Chang, Pacific Northwest National Laboratory, United States
  • Jessica Bade, Pacific Northwest National Laboratory, United States
  • Jamie Nunez, Pacific Northwest National Laboratory, United States
  • Madison Blumer, Pacific Northwest National Laboratory, United States
  • Daniel Orton, Pacific Northwest National Laboratory, United States
  • Kent Bloodsworth, Pacific Northwest National Laboratory, United States
  • Ernesto Nakayasu, Pacific Northwest National Laboratory, United States
  • Richard Smith, Pacific Northwest National Laboratory, United States
  • Yehia Ibrahim, Pacific Northwest National Laboratory, United States
  • Ryan Renslow, Pacific Northwest National Laboratory, United States
  • Thomas Metz, Pacific Northwest National Laboratory, United States


Presentation Overview:

We present DEIMoS: Data Extraction for Integrated Multidimensional Spectrometry, a Python application programming interface and command-line tool for high-dimensional mass spectrometry (MS) data analysis workflows that offers ease of development and access to efficient algorithmic implementations. Functionality includes feature detection, feature alignment, collision cross section (CCS) calibration, isotope detection, and MS/MS spectral deconvolution, with the output comprising detected features aligned across study samples and characterized by mass, CCS, tandem mass spectra, and isotopic signature. Notably, DEIMoS operates on N-dimensional data, largely agnostic to acquisition instrumentation; algorithm implementations simultaneously utilize all dimensions to (i) offer greater separation between features, thus improving detection sensitivity, (ii) increase alignment/feature matching confidence among datasets, and (iii) mitigate convolution artifacts in tandem mass spectra. We demonstrate DEIMoS with liquid chromatography–ion mobility spectrometry–tandem mass spectrometry (LC-IMS-MS/MS) data to illustrate the advantages of a multidimensional approach in each data processing step.

17:00-17:10
MASH Native: A Universal and Comprehensive Software for Native Mass Spectrometry
Room: KOPL
Format: Live from venue

  • Sean J. McIlwain, University of Wisconsin-Madison, WI 53719, United States
  • Eli J. Larson, University of Wisconsin-Madison, WI 53719, United States
  • Michael Marty, University of Arizona, Tucson, AZ 85719, United States
  • Kent Wenger, University of Wisconsin-Madison, WI 53719, United States
  • Harini Josyer, University of Wisconsin-Madison, WI 53719, United States
  • Jake Melby, University of Wisconsin-Madison, WI 53719, United States
  • Melissa R. Pergande, University of Wisconsin-Madison, WI 53719, United States
  • David Roberts, University of Wisconsin-Madison, WI 53719, United States
  • Kyndalanne Pike, University of Wisconsin-Madison, WI 53719, United States
  • Kyle A. Brown, University of Wisconsin-Madison, WI 53719, United States
  • Irene M. Ong, University of Wisconsin-Madison, WI 53719, United States
  • Ying Ge, University of Wisconsin-Madison, WI 53719, United States


Presentation Overview:

Native top-down mass spectrometry (MS)-based proteomics is a powerful method for the comprehensive characterization of proteoforms and intact protein complexes in their native state. However, practitioners of native MS are challenged by the complex datasets generated by native top-down MS experiments and a lack of software tools designed to cope with problems unique to native MS. Herein, we present MASH Native, a comprehensive software application for native top-down proteomics. MASH Native is a multithreaded Windows application implemented in C# using the .NET framework and provides various functionalities for native top-down MS data interpretation and processing through the incorporation of many deconvolution methods including UniDec, multiple searching algorithm support, spectral averaging, internal fragmentation searching and proteoform quantitation. Importantly, MASH Native is a freely available software package and can process datasets from various vendor formats while still retaining MASH Explorer’s capability to process denatured top-down proteomics data. With the support of multiple file formats, the integration of numerous analysis tools, and the additional navigation, validation, and manual search functionalities, MASH Native is a universal, comprehensive, user-friendly, and vital tool for analyzing any native or denatured top-down MS experimental data.

17:10-17:20
Functional Characterization of Co-phosphorylation Networks and Its Application In Cancer Subtyping
Room: KOPL
Format: Live from venue

  • Marzieh Ayati, University of Texas Rio Grande Valley, United States
  • Serhan Yilmaz, Case Western Reserve University, United States
  • Mark Chance, Case Western Reserve University, United States
  • Mehmet Koyutürk, Case Western Reserve University, United States


Presentation Overview:

Protein phosphorylation is a ubiquitous regulatory mechanism that plays a central role in cellular signaling. Phosphorylation is regulated by networks composed of kinases, phosphatases, and their substrates. Characterization of these networks is increasingly important in many biomedical applications, including identification of novel disease-specific drug targets, development of patient-specific therapeutics, and prediction of treatment outcomes. In this talk, we present a comprehensive investigation of the concept of “co-phosphorylation”, defined as the correlated phosphorylation of a pair of phosphosites across various biological states. We integrate nine publicly available MS-based phosphoproteomics datasets for various diseases and utilize functional data related to sequence, evolutionary histories, kinase annotations, protein interactions, and pathway annotations to investigate the functional relevance of co-phosphorylation. Our results show that co-phosphorylation can be used to predict with high precision the sites that are on the same pathway or that are targeted by the same kinase. We also present the application of co-phosphorylation in the context of unsupervised identification of subtype-specific modules in breast cancer. Our results show that integration of quantitative phosphorylation data with functional networks can provide mechanistic insights into the differences between the signaling mechanisms that drive breast cancer subtypes.
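
The definition of co-phosphorylation given above, correlated phosphorylation of a pair of sites across biological states, can be sketched in a few lines of Python. The use of Pearson correlation, the 0.9 cut-off, and the function and site names are assumptions for illustration; the abstract does not specify which correlation measure or threshold the authors use.

```python
import numpy as np


def co_phosphorylation(intensities):
    """Pairwise Pearson correlation between phosphosites.

    intensities: array of shape (n_sites, n_states), one row per phosphosite
    and one column per biological state (sample, condition, or time point).
    """
    return np.corrcoef(np.asarray(intensities, dtype=float))


def co_phosphorylated_pairs(corr, site_names, threshold=0.9):
    """List site pairs whose correlation is at least `threshold`."""
    pairs = []
    for i in range(len(site_names)):
        for j in range(i + 1, len(site_names)):
            if corr[i, j] >= threshold:
                pairs.append((site_names[i], site_names[j], float(corr[i, j])))
    return pairs
```

Pairs returned by `co_phosphorylated_pairs` would be the candidates to compare against kinase or pathway annotations.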

17:20-18:00
Keynote Presentation: Computer Vision – Unveiling the Hidden Proteome with Software
Room: KOPL
Format: Live from venue

  • Michael Shortreed


Presentation Overview:

Instruments for gathering proteomics data have gotten faster and more sensitive over the last several years. However, the nature of the data has changed little. Whether it's bottom-up or top-down, most proteomics data consist of measurements of the intact mass of peptides or proteoforms, followed by measurements of their fragment product ions. Peptides and proteoforms can be unmodified, post-translationally modified, truncated, or even contain amino acid or splice variants. Yet, the mass spectrometer treats them all the same. In truth, what has allowed us to characterize all these molecular forms is the software that processes the MS data once it has been gathered. In the early days of proteomics, only unmodified peptides and proteoforms could be confidently identified. Now we can identify a great variety of PTMs, sequence variants, truncation products and even complete proteoforms. Many laboratories, including our own, have contributed much to the proteomics community’s ability to look deeper into the proteome to reveal the exact chemical nature of each species. In this presentation I will provide a high-level overview of some of our contributions to the field, discuss open problems, and tell you where we are headed.