All times listed are in UTC
- Claire O'Donovan
Presentation Overview:
Metabolomics is the large-scale study of small molecules, commonly known as metabolites, within cells, biofluids, tissues or organisms. Collectively, these small molecules and their interactions within a biological system are known as the metabolome. It is considered to be the omics field that is closest to the phenotype, and it has been used for a long time in clinical settings for biomarker discovery and diagnosis, e.g. heel prick test and PKU. Now it is becoming an exciting possibility to understand the actual biological context of how these biomarkers function and/or are created by integrating metabolomics data with metabolic pathways and systems biology information, including genomics, proteomics and transcriptomics data. However, in order to do that, there are a number of challenges to address in the context of the metadata captured in each of these omics and its interoperability, both in single and multi-omics studies.
- Camille Roquencourt, CEA, LIST, Laboratoire Sciences des Donnees et de la Decision, Gif-sur-Yvette, France
- Stanislas Grassin Delyle, Hôpital Foch, Exhalomics, Département des maladies des voies respiratoires, Suresnes, France
- Etienne Thévenot, Département Médicaments et Technologies pour la Santé (DMTS), Université Paris-Saclay, CEA, Gif-sur-Yvette, France
Presentation Overview:
The analysis of Volatile Organic Compounds (VOCs) in exhaled breath is a non-invasive method for early diagnosis and therapeutic monitoring. Proton Transfer Reaction Time-Of-Flight Mass Spectrometry (PTR-TOF-MS) is of major interest for the real-time analysis of VOCs and the discovery of new biomarkers. However, there is currently a lack of methods and software tools for the processing of PTR-TOF-MS data from cohorts.
We therefore developed a suite of algorithms that process raw data from the patient acquisitions and build the table of feature intensities, through expiration and peak detection, quantification, alignment between samples, and missing value imputation. Notably, we developed an innovative 2D peak deconvolution model based on penalized splines signal regression, and a method to specifically select the VOCs from exhaled breath. The full workflow is implemented in the freely available ptairMS R/Bioconductor package.
Our approach was validated both on experimental data (mixture of VOCs at standardized concentrations) and simulations, which showed that the sensitivity of identifying VOCs from exhaled breath reached 99%. Furthermore, application to clinical data from ventilated patients resulted in the detection of four biomarkers of COVID-19 infection. Altogether, these results highlight the value of the ptairMS approach for biomarker discovery in exhaled breath.
- Martin Andre Hoffmann, Friedrich Schiller University Jena, Germany
- Louis-Félix Nothias, University of California, San Diego, United States
- Marcus Ludwig, Friedrich Schiller University Jena, Germany
- Markus Fleischauer, Friedrich Schiller University Jena, Germany
- Emily C. Gentry, University of California, San Diego, United States
- Michael Witting, Helmholtz-Zentrum München, Germany
- Pieter C. Dorrestein, University of California, San Diego, United States
- Kai Dührkop, Friedrich Schiller University Jena, Germany
- Sebastian Böcker, Friedrich Schiller University Jena, Germany
Presentation Overview:
The discovery and elucidation of novel metabolites and natural products is cost-, time- and labor-intensive. Untargeted metabolomics experiments rely on spectral libraries for structure annotation, but libraries are vastly incomplete. In silico methods search in substantially more comprehensive molecular structure databases, but cannot differentiate between correct and incorrect hits.
We introduce the COSMIC workflow that combines structure database generation, in silico annotation, and a confidence score consisting of kernel density p-value estimation and a Support Vector Machine with enforced directionality of features. In evaluation, COSMIC annotates a substantial number of hits at small FDR.
COSMIC allows us to expand structure annotation beyond the space of known molecules. To demonstrate this, 28,630 plausible bile acid conjugate structures were generated combinatorially and searched in a public mouse fecal metabolomics dataset. The top 12 novel bile acid conjugate structures were validated by manual interpretation; all but one annotation turned out to be correct. Two structures were further confirmed by preparing authentic standards and performing spike-in experiments.
We also annotated and manually evaluated 315 molecular structures in human samples currently absent from the Human Metabolome Database; we then applied COSMIC to 17,400 LC-MS/MS runs and annotated 1,715 structures with high confidence currently absent from spectral libraries.
- Eric Bach, Aalto University, Finland
- Simon Rogers, Department of Computing Science, University of Glasgow, United Kingdom
- John Williamson, University of Glasgow, United Kingdom
- Juho Rousu, Aalto University, Finland
Presentation Overview:
Identification of small molecules in a biological sample remains a major bottleneck in molecular biology, despite a decade of rapid development of computational approaches for predicting molecular structures using mass spectrometry (MS) data. Recently, there has been increasing interest in utilizing other information sources, such as liquid chromatography (LC) retention time (RT), to improve identifications solely based on MS information, such as precursor mass-per-charge and tandem mass spectrometry (MS2).
We put forward a probabilistic modelling framework to integrate MS and RT data of multiple features in an LC-MS experiment. We model the MS measurements and all pairwise retention order information as a Markov random field and use efficient approximate inference for scoring and ranking potential molecular structures. Our experiments show improved identification accuracy by combining MS2 data and retention orders using our approach, thereby outperforming state-of-the-art methods. Furthermore, we demonstrate the benefit of our model when only a subset of LC-MS features has MS2 measurements available besides MS1.
- Zeyuan Zuo, Computational Biology Department, Carnegie Mellon University, United States
- Liu Cao, Computational Biology Department, Carnegie Mellon University, United States
- Louis-Félix Nothias, Skaggs School of Pharmacy, University of California San Diego, La Jolla, CA, United States
- Hosein Mohimani, Computational Biology Department, Carnegie Mellon University, United States
Presentation Overview:
Untargeted tandem mass spectrometry experiments enable the profiling of metabolites in complex biological samples. The collected fragmentation spectra are the fingerprints of metabolites that are used for molecule identification and discovery. Two main mass spectrometry strategies exist for the collection of fragmentation spectra: data-dependent acquisition (DDA) and data-independent acquisition (DIA). In the DIA strategy, all the ions in predefined mass-to-charge ratio ranges are co-isolated and co-fragmented. While DIA comprehensively collects the fragment ions of all the precursors, the resulting highly multiplexed fragmentation spectra limit subsequent annotation. In contrast, in the DDA strategy, fragmentation spectra are collected specifically for the most abundant ions dynamically observed. While DDA results in less multiplexed fragmentation spectra, its coverage is limited. We introduce the MS2Planner workflow, an Iterative-Data Acquisition (ItDA) strategy that optimizes the number of high-quality fragmentation spectra over multiple experimental acquisitions using topological sorting. Our results show that MS2Planner is 62.5% more sensitive and 9.4% more specific compared to the existing acquisition techniques.
- Kyowon Jeong, University of Tübingen, Germany
- Maša Babović, University of Southern Denmark, Denmark
- Vladimir Gorshkov, University of Southern Denmark, Denmark
- Jihyung Kim, University of Tübingen, Germany
- Ole Jensen, University of Southern Denmark, Denmark
- Oliver Kohlbacher, University of Tübingen, Germany
Presentation Overview:
Top-down proteomics (TDP) is gaining great interest in biological, clinical, and medical sciences as the method of choice to study proteoforms. While significant improvements have been made on different aspects of TDP protocols, data-dependent acquisition (DDA) has been optimized for bottom-up proteomics, not for TDP. Dedicated acquisition methods thus have the potential to greatly improve TDP.
We present FLASHIda, an intelligent data acquisition method for TDP that ensures the selection of high-quality precursors of diverse proteoforms. FLASHIda interfaces with the Thermo Scientific iAPI, which provides MS1 full scans in real time. By instantly transforming the raw m/z-intensity spectrum into a mass-quality spectrum with FLASHDeconv and using a machine learning technique to assess the signal quality, FLASHIda implements Top-N high-quality precursor mass acquisition with a quality-based mass exclusion.
In benchmark tests with single runs of E. coli lysate on a 90-min gradient (nano-RPLC, Thermo Scientific Orbitrap Eclipse), FLASHIda almost doubled the unique proteoform count (~1,600) compared with standard acquisition (~820). Alternatively, FLASHIda runs on a drastically shorter gradient (30 min) yielded proteoform counts (~800) similar to those of standard DDA.
Since FLASHIda does not require any modification in experimental set-ups, it could be readily adopted for TDP study of complex samples to raise proteoform identification sensitivity.
- William Stafford Noble, University of Washington, United States
- Ayse B. Dincer, University of Washington, United States
- Sreeram Kannan, University of Washington, United States
Presentation Overview:
Tandem mass spectrometry (MS/MS) can be used to quantify thousands of peptides in a complex biological mixture. However, these quantitative measurements depend in part on the properties of the peptide sequence. We propose a machine learning approach to quantify and eliminate these peptide-specific artifacts. We model the observed peptide intensity as a composition of a peptide coefficient and an adjusted abundance. We then base our model on the key assumption that sibling peptides (i.e., peptides that co-occur in the same protein) should have equal abundances. Accordingly, we implement a neural network with a Siamese architecture to learn peptide coefficients from amino acid sequences, which is trained to minimize the distance between adjusted abundances of sibling peptides. We demonstrate that peptide coefficients are consistent across different MS/MS runs and that the coefficients inferred from one set of runs can generalize to other runs. We aim to extend our prediction model to new proteins and datasets to eliminate these peptide-specific effects, thereby yielding more accurate quantification.
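The sibling-peptide objective described above can be illustrated with a minimal sketch. It assumes the "composition" is multiplicative (intensity = coefficient × abundance, so additive in log space) and uses fixed, made-up coefficients instead of a neural network; the function names and toy numbers are hypothetical and are not taken from the authors' implementation.

```python
def adjusted_log_abundance(log_intensities, log_coefficient):
    """Remove the peptide-specific term, assuming
    log(intensity) = log(coefficient) + log(abundance)."""
    return [value - log_coefficient for value in log_intensities]


def sibling_loss(adjusted, sibling_pairs):
    """Mean squared distance between the adjusted abundances of
    sibling peptides, taken over all runs."""
    squared = [
        (a - b) ** 2
        for i, j in sibling_pairs
        for a, b in zip(adjusted[i], adjusted[j])
    ]
    return sum(squared) / len(squared)


# Toy data: 3 peptides measured in 2 runs (log intensities).
# Peptides 0 and 1 are assumed to come from the same protein.
log_intensity = [[2.5, 3.5], [1.0, 2.0], [5.2, 1.2]]
log_coefficient = [0.5, -1.0, 0.2]  # hypothetical peptide coefficients
sibling_pairs = [(0, 1)]

adjusted = [
    adjusted_log_abundance(row, coef)
    for row, coef in zip(log_intensity, log_coefficient)
]

loss_raw = sibling_loss(log_intensity, sibling_pairs)
loss_adjusted = sibling_loss(adjusted, sibling_pairs)
```

In this toy case the siblings disagree before adjustment and agree exactly afterwards; a trained model would instead predict the coefficients from the amino acid sequences so that this loss is minimized.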
- William Noble, University of Washington, United States
- Andy Lin, Pacific Northwest National Laboratory, United States
- Jeffrey Bilmes, University of Washington, United States
Presentation Overview:
Methods that measure the similarity of proteomics runs are useful because the resulting scores can be used for analyses such as classification, clustering, and embedding. However, measuring the similarity of a pair of proteomics runs is difficult because a number of factors can affect the data. Previously developed methods for measuring proteomics run similarity compare the sets of spectra generated by peptide fragments (MS2) from each run against each other. However, these methods fail to consider the different ways MS2 spectra are acquired, such as in data-dependent acquisition (DDA) and data-independent acquisition (DIA). We present a method, MS1Connect, that uses intact peptide spectra (MS1) to measure the similarity between proteomics runs. Since MS1 data is always collected the same way, MS1Connect can compare data collected by DDA, DIA, and targeted proteomics. MS1Connect frames scoring the similarity between a pair of runs as a maximum bipartite matching problem. In our setting, each of the two disjoint sets of vertices represents the set of MS1 features found in a run, and edges link MS1 features in different runs whose masses match within some tolerance. Our results show that MS1Connect scores can be used to compare proteomics runs across independent studies and laboratories.
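The bipartite formulation can be sketched as follows. This is a generic augmenting-path maximum matching on toy masses, not the MS1Connect implementation: the ppm tolerance, the function names, and the use of the raw matching size as the result are all assumptions made for illustration.

```python
def build_edges(masses_a, masses_b, tol_ppm=10.0):
    """For each feature mass in run A, list the indices of features in
    run B whose mass lies within the ppm tolerance (assumed value)."""
    edges = []
    for mass_a in masses_a:
        tolerance = mass_a * tol_ppm * 1e-6
        edges.append(
            [j for j, mass_b in enumerate(masses_b)
             if abs(mass_a - mass_b) <= tolerance]
        )
    return edges


def maximum_matching(edges, n_right):
    """Size of a maximum bipartite matching, found with augmenting paths."""
    match_right = [-1] * n_right  # match_right[j] = index in A matched to j

    def try_augment(u, visited):
        for v in edges[u]:
            if v in visited:
                continue
            visited.add(v)
            # Take v if it is free, or if its current partner can move.
            if match_right[v] == -1 or try_augment(match_right[v], visited):
                match_right[v] = u
                return True
        return False

    return sum(1 for u in range(len(edges)) if try_augment(u, set()))


# Toy MS1 feature masses for two runs.
run_a = [500.004, 500.000, 750.2, 1200.5]
run_b = [500.002, 500.008, 750.2, 900.1]

edges = build_edges(run_a, run_b)
matched = maximum_matching(edges, len(run_b))
```

Here the first feature of run A initially claims `500.002` and is then moved to `500.008` so that the second feature can also be matched, which is the case where a greedy assignment would undercount.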
- Yang Young Lu, University of Washington, United States
- Jeff Bilmes, University of Washington, United States
- Ricard Mias, University of Washington, United States
- Judit Villen, University of Washington, United States
- William Stafford Noble, University of Washington, United States
Presentation Overview:
Tandem mass spectrometry data acquired using data independent acquisition (DIA) is challenging to interpret because the data exhibits complex structure along both the mass-to-charge (m/z) and time axes. The most common approach to analyzing this type of data makes use of a library of previously observed DIA data patterns (a "spectral library"), but this approach is expensive because the libraries do not typically generalize well across laboratories. Here we propose DIAmeter, a search engine that detects peptides in DIA data using only a peptide sequence database. Unlike other library-free DIA analysis methods, DIAmeter supports data generated using both wide and narrow isolation windows, can readily detect peptides containing post-translational modifications, can analyze data from a variety of instrument platforms, and is capable of detecting peptides even in the absence of detectable signal in the survey (MS1) scan.
- Vadim Demichev
Presentation Overview:
In the past two years, mass spectrometry-based proteomics has taken a major leap in terms of speed. Novel fast workflows can measure hundreds of proteomes per day and thus facilitate robust and cost-effective large-scale experiments, from perturbation screens in cell culture to biomarker discovery studies. In this session, we will discuss how advanced data processing strategies combined with cutting edge mass spectrometer technologies enable proteomics at high throughput.
- Melanie Föll, Institute for Surgical Pathology, University Medical Center Freiburg, Germany
- Niko Pinter, Institute for Surgical Pathology, University Medical Center Freiburg, Germany
- Damian Glätzer, Faculty of Biology, University of Freiburg, Germany
- Matthias Fahrner, Institute for Surgical Pathology, University Medical Center Freiburg, Germany
- Klemens Fröhlich, Institute for Surgical Pathology, University Medical Center Freiburg, Germany
- James Johnson, Minnesota Supercomputing Institute, University of Minnesota, United States
- Björn Grüning, Department of Computer Science, University of Freiburg, Germany
- Bettina Warscheid, Faculty of Biology, University of Freiburg, Germany
- Friedel Drepper, Faculty of Biology, University of Freiburg, Germany
- Oliver Schilling, Institute for Surgical Pathology, University Medical Center Freiburg, Germany
Presentation Overview:
With the steadily expanding availability of instrumentation, quantitative proteomic experiments are increasingly widespread and complex, leading to increased demands for software tools and computational resources in the field of proteomic data science. This situation is particularly challenging for non-expert users, who may nevertheless require high-performance computing systems to tackle demanding aspects of proteomic data analysis. The open-source Galaxy framework provides a freely accessible, cloud-based environment with thousands of tools and adequate computational resources. Here, we have integrated two de facto standard tools into Galaxy, MaxQuant and MSstats (label-free and TMT), which cover a typical quantitative shotgun proteomics analysis workflow including protein identification, protein quantification and statistical modeling to find differentially abundant proteins. Via Galaxy, both tools can be run on public clouds through a web browser and Galaxy's graphical user interface, which requires no installation and no computational or programming skills. Furthermore, MaxQuant and MSstats can be applied in conjunction with other Galaxy tools and integrated into standardized, shareable workflows. To maximize accessibility, we have developed step-by-step training material for quantitative proteomic analyses.
- Lukasz Kozlowski, University of Warsaw, Poland
Presentation Overview:
The isoelectric point is the pH at which a particular molecule is electrically neutral due to the equilibrium of positive and negative charges. In proteins and peptides, this depends on the dissociation constant (pKa) of charged groups of seven amino acids and NH+ and COO- groups at polypeptide termini. Information regarding isoelectric point and pKa is extensively used in two-dimensional gel electrophoresis (2D-PAGE), capillary isoelectric focusing (cIEF), crystallisation, and mass spectrometry. Therefore, there is a strong need for the in silico prediction of isoelectric point and pKa values. In this work, I present Isoelectric Point Calculator 2.0 (IPC 2.0), a web server for the prediction of isoelectric points and pKa values using a mixture of deep learning and support vector regression models. The prediction accuracy (RMSD) of IPC 2.0 for proteins and peptides outperforms previous algorithms: 0.848 versus 0.868 and 0.222 versus 0.405, respectively. Moreover, the IPC 2.0 prediction of pKa using sequence information alone was better than the prediction from structure-based methods (0.576 vs. 0.826) and a few times faster. The IPC 2.0 webserver is freely available at www.ipc2-isoelectric-point.org
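For orientation, the classical way to estimate an isoelectric point from sequence is to sum Henderson-Hasselbalch charges over the seven charged residue types and the two termini, then search for the pH at which the net charge is zero. The sketch below shows that baseline only; it is not the IPC 2.0 model (which uses deep learning and support vector regression), and the pKa values are illustrative assumptions rather than values taken from IPC 2.0.

```python
# Illustrative pKa values (assumed for this sketch, not IPC 2.0 parameters).
POSITIVE_PKA = {"K": 10.8, "R": 12.5, "H": 6.5}
NEGATIVE_PKA = {"D": 3.9, "E": 4.1, "C": 8.5, "Y": 10.1}
N_TERM_PKA = 8.6
C_TERM_PKA = 3.6


def net_charge(sequence, ph):
    """Net charge of a peptide at a given pH (Henderson-Hasselbalch)."""
    charge = 1.0 / (1.0 + 10 ** (ph - N_TERM_PKA))   # N-terminus (+)
    charge -= 1.0 / (1.0 + 10 ** (C_TERM_PKA - ph))  # C-terminus (-)
    for residue in sequence:
        if residue in POSITIVE_PKA:
            charge += 1.0 / (1.0 + 10 ** (ph - POSITIVE_PKA[residue]))
        elif residue in NEGATIVE_PKA:
            charge -= 1.0 / (1.0 + 10 ** (NEGATIVE_PKA[residue] - ph))
    return charge


def isoelectric_point(sequence, tolerance=1e-4):
    """Bisection on pH in [0, 14]; net charge decreases as pH increases."""
    low, high = 0.0, 14.0
    while high - low > tolerance:
        mid = (low + high) / 2.0
        if net_charge(sequence, mid) > 0:
            low = mid   # still positively charged: pI lies at higher pH
        else:
            high = mid
    return (low + high) / 2.0


peptide = "ACDEFGHIKLMNPQRSTVWY"
pi_estimate = isoelectric_point(peptide)
```

Because the result depends entirely on the chosen pKa set, different published sets give different estimates for the same sequence, which is the gap that learned predictors aim to close.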
- Eugenia Voytik, Department of Proteomics and Signal Transduction, Max Planck Institute of Biochemistry, Martinsried, Germany
- Sander Willems, Department of Proteomics and Signal Transduction, Max Planck Institute of Biochemistry, Martinsried, Germany
- Matthias Mann, Department of Proteomics and Signal Transduction, Max Planck Institute of Biochemistry, Martinsried, Germany
- Patricia Skowronek, Department of Proteomics and Signal Transduction, Max Planck Institute of Biochemistry, Martinsried, Germany
- Maximilian Strauss, Department of Proteomics and Signal Transduction, Max Planck Institute of Biochemistry, Martinsried, Germany
Presentation Overview:
The efficient visualization of omics data is an important and challenging problem, as datasets continue to grow. In mass spectrometry (MS)-based proteomics, this is illustrated by the recent addition of an ion mobility dimension, which increases already large raw data many-fold. The GHz detector of the TimsTOF Pro mass spectrometer acquires millions or even billions of ion intensity values as a function of the chromatographic retention time, ion mobility, quadrupole mass-to-charge and TOF mass-to-charge. Unfortunately, accessing these vast unprocessed five-dimensional data is slow, hindering downstream analysis and visualization. Here we present a software solution that allows access to and visualization of billions of data points on sub-second time scales. Termed AlphaTims, it is freely available as an open-source Python package or stand-alone graphical user interface at https://github.com/MannLabs/alphatims. AlphaTims creates multiple indices that allow LC-TIMS-Q-TOF data to be interpreted as a sparse four-dimensional matrix. Even for datasets containing billions of data points, creating these indices takes less than a minute. Subsequent access to single data points or slices in arbitrary dimensions of the resulting sparse matrix generally takes only a few milliseconds. Ready data visualization by AlphaTims generates insights into MS-based proteomics data, including at the ultra-sensitive, single-cell level.
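The general idea of indexing flat detector data so that slices become cheap can be sketched with compressed index pointers along one dimension. This is a generic sparse-storage illustration under assumed names and toy values; it does not reproduce the actual data layout or API of AlphaTims.

```python
def build_index_pointers(frame_ids, n_frames):
    """Return pointers such that frame f owns the data points
    indptr[f]:indptr[f + 1]. Assumes frame_ids is sorted ascending."""
    indptr = [0] * (n_frames + 1)
    for frame in frame_ids:
        indptr[frame + 1] += 1          # count points per frame
    for frame in range(n_frames):
        indptr[frame + 1] += indptr[frame]  # cumulative sum
    return indptr


def slice_frames(indptr, values, start, stop):
    """All values recorded in frames start..stop-1, without scanning
    the whole array."""
    return values[indptr[start]:indptr[stop]]


# Toy data: one entry per detected ion, sorted by frame (retention time).
frame_ids = [0, 0, 2, 2, 2, 3]
intensities = [10, 25, 7, 3, 40, 12]

indptr = build_index_pointers(frame_ids, n_frames=4)
frame_two = slice_frames(indptr, intensities, 2, 3)
empty_frame = slice_frames(indptr, intensities, 1, 2)
```

Building the pointers is a single pass over the data, after which any contiguous range of frames is retrieved with two lookups; empty frames cost no storage beyond one pointer.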
- Jinghan Yang, CEMS, NCMIS, RCSDS, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing, China
- Zhiqiang Gao, Artificial Intelligence Research Center, Peng Cheng Laboratory, Shenzhen, China
- Cheng Chang, Beijing Institute of Lifeomics, Beijing, China
- Yan Fu, CEMS, NCMIS, RCSDS, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing, China
Presentation Overview:
In mass spectrometry-based shotgun proteomics, proteins are first digested by protease(s) into peptides, which are taken as the surrogates for subsequent identification and quantification. However, only a limited proportion of peptides are observed in a routine experiment. Thus, the prediction of proteotypic peptides, i.e., the detectable unique representatives of proteins, has become a crucial task. In our recent work (doi: 10.1021/acs.analchem.9b02520), peptide digestibility has served as a significant feature for accurate peptide detectability prediction. Unfortunately, there are still very few easy-to-use tools for digestibility and detectability prediction for various proteases.
Here, we present a web-based tool named ProteoPeptides for proteotypic peptide prediction for eight commonly-used proteases. In ProteoPeptides, either random forest (RF) or deep learning (DL) is used to predict digestibility or detectability (referred to as RF_Dig, RF_Det, DL_Dig and DL_Det, respectively). Protein sequences are the only input. The results are visualized in an intuitive and interactive way.
Nineteen public large-scale data sets were utilized to evaluate the performance of ProteoPeptides. The test AUCs of RF_Dig, RF_Det, DL_Dig, and DL_Det were 0.785~0.966, 0.715~0.929, 0.849~0.978, and 0.828~0.960, respectively. For each protease, DL_Det had a higher AUC compared to RF_Det, indicating an enhanced learning performance. ProteoPeptides can be accessed at http://fugroup.amss.ac.cn/ProteoPeptides/.
- Charlotte Adams, University of Antwerp, Belgium
- Kris Laukens, University of Antwerp, Belgium
- Wout Bittremieux, University of California San Diego and University of Antwerp, United States
Presentation Overview:
Unraveling virus–host protein–protein interactions (VH-PPIs) is crucial to understand how viruses hijack their hosts during infection. As a response to the current COVID-19 pandemic there have been multiple mass spectrometry (MS)-based studies to determine interactions between SARS-CoV-2 and its human host. However, because of search tool limitations, post-translational modifications (PTMs) have not been sufficiently considered during the data analysis of these studies.
To study the role of PTMs, we reprocessed MS-based SARS-CoV-2–human PPI data using open modification searching (OMS). This makes it possible to consider any type of modification, including biologically relevant PTMs that have never been studied in the VH-PPI context before. We identified 66% additional modified peptide-spectrum matches (PSMs) that were originally missed, demonstrating the power of OMS. These additional identifications enabled us to filter PPIs more accurately. We reproduced 49% of the 332 original PPIs and identified 211 novel PPIs, of which 47 have been reported in independent SARS-CoV-2 PPI studies. We investigated selected modifications in more detail and were able to detect novel phosphorylation, S-nitrosylation, and ubiquitination sites on SARS-CoV-2 proteins.
Our results demonstrate that by taking modified peptides into account we obtain novel insights into the molecular mechanics of SARS-CoV-2 infection.
- Maria Rodriguez Martinez, IBM Research Europe, Switzerland
- Joris Cadow, IBM Research Europe, Switzerland
- Matteo Manica, IBM Research Europe, Switzerland
- Roland Mathis, IBM Research Europe, Switzerland
- Tiannan Guo, Westlake University, China
- Ruedi Aebersold, ETH Zuerich, Switzerland
Presentation Overview:
In recent years, SWATH-MS has become the proteomic method of choice for data-independent acquisition, as it enables high proteome coverage, accuracy and reproducibility. However, data analysis is convoluted and requires prior information and expert curation. Furthermore, as quantification is limited to a small set of peptides, potentially important biological information may be discarded.
Here we demonstrate that deep learning can be used to learn discriminative features directly from raw MS data, thereby eliminating the need for elaborate data processing pipelines. Using transfer learning to overcome sample sparsity, we exploit a collection of publicly available deep learning models already trained for the task of natural image classification. These models are used to produce feature vectors from each mass spectrometry (MS) raw image, which are later used as input for a classifier trained to distinguish tumor from normal prostate biopsies. Although the deep learning models were originally trained for a completely different classification task and no additional fine-tuning is performed on them, we achieve a classification performance of 0.876 AUC.
We investigate different types of image preprocessing and encoding. We also investigate whether the inclusion of the secondary MS2 spectra improves the classification performance. Throughout all tested models, we use standard protein expression vectors as gold standards. Even with our naïve implementation, our results suggest that the application of deep learning and transfer learning techniques might pave the way to the broader usage of raw mass spectrometry data in real-time diagnosis.
- William Noble, University of Washington, United States
- William Fondrie, University of Washington, United States
- Elena Romero, University of Washington, United States
Presentation Overview:
Tandem mass spectra capture detailed structural information about the molecules that generated them, such as peptides and metabolites. Although the proteomics and metabolomics experiments that generate these mass spectra are prime candidates to benefit from revolutionary machine learning methods, a fundamental challenge remains in working directly with mass spectra: it is difficult to accurately represent a mass spectrum as a concise vector. Strategies that discretize the m/z axis have long been employed to solve this problem; however, the binning induces edge effects and the bin size must be chosen to balance the accuracy of the representation with its size and sparsity. Given these challenges, we propose an unsupervised deep learning method to represent mass spectra as compact vectors, while preserving a measure of spectral similarity. Specifically, we leverage the recent Transformer neural network architecture to embed mass spectra, without the need for prior discretization. Our Transformer is trained within a Siamese architecture to generate vector representations of mass spectra that can estimate a measure of spectral similarity. After training with more than 85 million mass spectra, we found that the Transformer was able to generate reliable representations with as few as 32 dimensions.
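The binning trade-off mentioned above can be made concrete with a small sketch of the conventional approach (not the proposed Transformer embedding). The m/z range, bin widths, and toy peaks are assumptions chosen to expose the edge effect.

```python
import math


def bin_spectrum(mz_values, intensities, min_mz=0.0, max_mz=2000.0,
                 bin_width=1.0):
    """Discretize the m/z axis into fixed-width bins and sum intensities."""
    n_bins = math.ceil((max_mz - min_mz) / bin_width)
    vector = [0.0] * n_bins
    for mz, intensity in zip(mz_values, intensities):
        if min_mz <= mz < max_mz:
            vector[int((mz - min_mz) / bin_width)] += intensity
    return vector


def cosine(u, v):
    """Cosine similarity between two binned spectra."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Two nearly identical spectra: the first peak differs by only 0.02 m/z,
# but the two values fall on opposite sides of the bin edge at 101.0.
spectrum_a = ([100.99, 250.30], [1.0, 1.0])
spectrum_b = ([101.01, 250.30], [1.0, 1.0])

narrow_a = bin_spectrum(*spectrum_a, bin_width=1.0)
narrow_b = bin_spectrum(*spectrum_b, bin_width=1.0)
wide_a = bin_spectrum(*spectrum_a, bin_width=2.0)
wide_b = bin_spectrum(*spectrum_b, bin_width=2.0)

similarity_narrow = cosine(narrow_a, narrow_b)
similarity_wide = cosine(wide_a, wide_b)
```

With 1.0-wide bins the similarity drops to about 0.5 even though the spectra are almost identical, while 2.0-wide bins recover a similarity of about 1.0 at the cost of resolution; both vectors are long and almost entirely zero, which is the size and sparsity problem a learned low-dimensional embedding is meant to avoid.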
- Edward Huttlin, Harvard Medical School, United States
Presentation Overview:
The proteome can be viewed as a community of thousands of proteins that assemble into complexes and signaling networks to support cellular function. Because these interactions are dynamic and organize the proteome according to shared function, the structure of this ‘social network’ within a cell can reveal functions of individual proteins and provide a systems-level view of cellular state. For the past several years, we have been using affinity-purification mass spectrometry (AP-MS) to systematically map physical protein interactions in human cells. By immuno-purifying more than 10,000 distinct human proteins (more than half the proteome) in HEK 293T cells, we have created BioPlex, the most comprehensive experimentally-derived model of the human interactome to date. In addition, by repeating these AP-MS experiments in additional cell lines, we are now producing multiple proteome-scale, cell-specific interaction networks to explore how entire interactomes remodel in different cellular contexts. Individually and in combination with other complementary biological data, these networks are powerful tools for discovery. In this talk I will provide an overview of the BioPlex project, emphasizing how computational approaches have provided functional insights for thousands of uncharacterized proteins and deepened our understanding of the systems-level organization of the proteome.