Leading Professional Society for Computational Biology and Bioinformatics
Connecting, Training, Empowering, Worldwide

CompMS COSI

Presentations

Schedule subject to change
Wednesday, July 15th
10:40 AM-11:00 AM
Community standard for reporting the experimental design in proteomics experiments: From samples to data files
Format: Pre-recorded with live Q&A

  • Johannes Griss, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, United Kingdom
  • Chengxin Dai, Chongqing University of Posts and Telecommunications, Chongqing, 400065, China
  • Jingwen Bai, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, United Kingdom
  • Anja Füllgrabe, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, United Kingdom
  • Xuefei Zhao, Chongqing University of Posts and Telecommunications, Chongqing, 400065, China
  • Nancy George, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, United Kingdom
  • Pablo Moreno, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, United Kingdom
  • David García-Seisdedos, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, United Kingdom
  • Chakradhar Bandla, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, United Kingdom
  • Julianus Pfeuffer, University of Tuebingen, Sand 14, 72076 Tuebingen, Germany
  • Mathias Walzer, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, United Kingdom
  • Melanie Christine Föll, Institute of Surgical Pathology, Medical Center – University of Freiburg, Germany
  • Alvis Brazma, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, United Kingdom
  • Irene Papatheodorou, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, United Kingdom
  • Juan Antonio Vizcaíno, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, United Kingdom
  • Timo Sachsenberg, University of Tuebingen, Sand 14, 72076 Tuebingen, Germany
  • Mingze Bai, Chongqing University of Posts and Telecommunications, Chongqing, 400065, China
  • Yasset Perez-Riverol, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, United Kingdom

Presentation Overview:

Sharing of proteomics data within the research community has been greatly facilitated by the development of open data standards such as mzIdentML, mzML, and mzTab. In addition, the ProteomeXchange consortium of proteomics resources has enabled the submission and dissemination of public MS proteomics data worldwide. However, several authors have pointed out the lack of suitable experiment-related metadata in public proteomics datasets, which makes data reuse by the community difficult. In particular, datasets contain limited sample information about the disease, tissue, cell type, and treatment, among others, and the relationship between the samples and the mass spectrometry files (RAW) is missing in most cases. Here, we present a community standard data model and file format to represent the experimental design for proteomics experiments. The proposed data model is based on the Sample and Data Relationship Format (SDRF), which originates from MAGE-TAB, the metadata file format developed by the transcriptomics community. The aim is to define a set of guidelines to support the annotation of the experimental design in proteomics datasets in the public domain, enabling future integration among multi-omics experiments.
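
To make the sample-to-file relationship concrete, here is a minimal sketch rather than the proposed specification itself. It assumes a tab-delimited table with one row per link between a sample and a data file. The column names, sample names, and file names are illustrative assumptions and do not come from the abstract.

```python
import csv
import io
from collections import defaultdict

# Hypothetical SDRF-like table: one row per (sample, data file) link.
# Column names and values are illustrative assumptions.
SDRF_TEXT = (
    "source name\tcharacteristics[organism part]\t"
    "characteristics[disease]\tcomment[data file]\n"
    "sample 1\tliver\tnormal\trun_01.raw\n"
    "sample 1\tliver\tnormal\trun_02.raw\n"
    "sample 2\tliver\tcarcinoma\trun_03.raw\n"
)


def files_by_sample(sdrf_text):
    """Map each sample name to the list of data files acquired from it."""
    reader = csv.DictReader(io.StringIO(sdrf_text), delimiter="\t")
    mapping = defaultdict(list)
    for row in reader:
        mapping[row["source name"]].append(row["comment[data file]"])
    return dict(mapping)


def samples_with(sdrf_text, column, value):
    """Return the sorted sample names whose annotation in `column` equals `value`."""
    reader = csv.DictReader(io.StringIO(sdrf_text), delimiter="\t")
    return sorted({row["source name"] for row in reader if row[column] == value})
```

With such a table, a reanalysis can look up which RAW files belong to a given sample, or select samples by an annotated property such as disease, without consulting the original publication.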

11:00 AM-11:20 AM
MassIVE.quant: a community resource of curated quantitative mass spectrometry-based proteomics datasets
Format: Pre-recorded with live Q&A

  • Meena Choi, Northeastern University, United States
  • Tom Dunkley, Hoffmann-La Roche Ltd, Switzerland
  • Olga Vitek, Northeastern University, United States
  • Jeremy Carver, University of California San Diego, United States
  • Cristina Chiva, Center for Genomics Regulation, Spain
  • Manuel Tzouros, Hoffmann-La Roche Ltd, Switzerland
  • Ting Huang, Northeastern University, United States
  • Tsung-Heng Tsai, Northeastern University, United States
  • Benjamin Pullman, University of California San Diego, United States
  • Oliver M. Bernhardt, Biognosys, Switzerland
  • Ruth Hüttenhain, University of California, San Francisco, United States
  • Guo Ci Teo, University of Michigan, United States
  • Yasset Perez-Riverol, European Bioinformatics Institute (EMBL-EBI), United Kingdom
  • Jan Muntel, Biognosys, Switzerland
  • Maik Müller, ETH, Switzerland
  • Sandra Goetze, ETH, Switzerland
  • Maria Pavlou, ETH, Switzerland
  • Erik Verschueren, University of California, San Francisco, United States
  • Bernd Wollscheid, ETH, Switzerland
  • Alexey Nesvizhskii, University of Michigan, United States
  • Lukas Reiter, Biognosys, Switzerland
  • Eduard Sabidó, Center for Genomics Regulation, Spain
  • Nuno Bandeira, University of California San Diego, United States

Presentation Overview:

We present MassIVE.quant (http://massive.ucsd.edu/ProteoSAFe/static/massive-quant.jsp), a tool-independent repository infrastructure and data resource for reproducible quantitative mass spectrometry-based biomedical research. MassIVE.quant extends MassIVE to enable large-scale deposition of heterogeneous experimental datasets and to facilitate a community-wide conversation about the necessary extent of experiment documentation and the benefits of its use. It supports various reproducibility scopes, such as the infrastructure to fully automate the workflow and to store and browse the intermediate results. First, MassIVE.quant supplements the raw experimental data with detailed annotations of the experimental design, analysis scripts, and results, which enable the quantitative interpretation of MS-based experiments and the online interactive exploration of the results. A branch structure makes it possible to view and even compare reanalyses of each experiment with various combinations of methods and tools. Second, the curated alternative workflows can be used for offline and online reanalyses of the data, starting from an intermediate output in MassIVE.quant. As an illustration, we present the first compilation of more than 40 proteomic datasets from benchmark controlled mixtures and biological investigations, interpreted with various data processing tools and analysis options. Extensive documentation for workflow and data submission, including a video tutorial, is available.

11:20 AM-11:30 AM
CoronaMassKB: an open-access platform for sharing of mass spectrometry data and reanalyses from SARS-CoV-2 and related species
Format: Pre-recorded with live Q&A

  • Jeremy Carver, University of California San Diego, United States
  • Benjamin Pullman, University of California San Diego, United States
  • Nuno Bandeira, University of California San Diego, United States
  • Julie Wertz, University of California San Diego, United States

Presentation Overview:

CoronaMassKB is an open-data community resource for sharing mass spectrometry data and (re)analysis results for all experiments pertinent to the global SARS-CoV-2 pandemic. The resource is designed for the rapid exchange of data and results among the global community of scientists working towards understanding the biology of SARS-CoV-2/COVID-19 and thus aims to accelerate the emergence of effective responses to this global pandemic. CoronaMassKB is currently based on reanalysis of >7 million spectra from SARS-CoV-2 datasets and >20 million spectra from related viruses (including SARS-CoV, MERS and H1N1) in public mass spectrometry data available at MassIVE. Searching for modified peptides nearly doubled the number of identifications derived from the same data and revealed >550 SARS-CoV-2 hypermodified peptides with 10+ distinct combinations of modifications, up to over 100 unique modification variants for a single peptide sequence. Overall, thousands of distinctly modified and unmodified peptide variants were identified for 25 SARS-CoV-2 protein products, with thousands of additional variants also identified for the various related viral species. Over 7,000 host proteins were also detected by tens of thousands of peptide variants across several types of experiments, including 339 variants mapped to regulatory regions of 92 drug-target proteins assessed to be interactors of SARS-CoV-2 proteins.

11:30 AM-11:40 AM
Phosphopedia 2.0, a modern targeted phosphoproteomics resource
Format: Pre-recorded with live Q&A

  • William S. Noble, University of Washington, United States
  • Anthony Valente, University of Washington Genome Sciences, United States
  • Judit Villen, University of Washington Genome Sciences, United States

Presentation Overview:

Global mass spectrometry methods are the workhorse of phosphopeptide identification and quantification. Yet, targeted mass spectrometry approaches achieve much higher quantification sensitivity and precision. To facilitate the development of targeted assays, we previously built Phosphopedia, a resource that compiled phosphopeptide identifications from nearly 1000 DDA mass spectrometry runs. However, this implementation was difficult to expand and so missed out on the wealth of continuously generated public phosphoproteomic data. Here, we present Phosphopedia 2.0, which extends our database and allows the generation of targeted assays for any phosphosite, regardless of previous observation.
This update was achieved through two important developments. First, in order to enable dynamic database expansion, we have automated data injection via the Snakemake workflow management system, enhancing the database with thousands of new DDA mass spectrometry runs. Second, we employ machine learning to model fundamental properties of phosphopeptides necessary for targeted assay development, enabling us to access phosphosites that have not been observed previously.
Through straightforward access to public data and phosphopeptide property prediction, we provide the ability to greatly enhance the development of targeted phosphoproteomic assays. In addition, our database and prediction tools can be used to build spectral libraries for the analysis of DIA phosphoproteomic experiments.

12:00 PM-12:20 PM
Proceedings Presentation: Deep multiple instance learning classifies subtissue locations in mass spectrometry images from tissue-level annotations
Format: Pre-recorded with live Q&A

  • Olga Vitek, Northeastern University, United States
  • Dan Guo, Northeastern University, United States
  • Melanie Christine Föll, University of Freiburg, Germany
  • Veronika Volkmann, University of Freiburg, Germany
  • Kathrin Enderle-Ammour, University of Freiburg, Germany
  • Peter Bronsert, University of Freiburg, Germany
  • Oliver Schilling, University of Freiburg, Germany

Presentation Overview:

Motivation: Mass spectrometry imaging (MSI) characterizes molecular composition of tissues at spatial resolution, and has a strong potential for distinguishing tissue types, or disease states. This can be achieved by supervised classification, which takes as input MSI spectra, and assigns class labels to subtissue locations. Unfortunately, developing such classifiers is hindered by the limited availability of training sets with subtissue labels as the ground truth. Subtissue labeling is prohibitively expensive, and only rough annotations of the entire tissues are typically available. Classifiers trained on data with approximate labels have sub-optimal performance.
Results: To alleviate this challenge, we contribute a semi-supervised approach, mi-CNN. mi-CNN implements multiple instance learning with a convolutional neural network (CNN). The multiple instance aspect enables weak supervision from tissue-level annotations when classifying subtissue locations. The convolutional architecture of the CNN captures contextual dependencies between the spectral features. Evaluations on simulated and experimental datasets demonstrated that mi-CNN improved subtissue classification compared to traditional classifiers. We propose mi-CNN as an important step towards accurate subtissue classification in MSI, enabling rapid distinction between tissue types and disease states.
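
The weak-supervision idea can be illustrated with a generic multiple instance learning sketch. This is not the mi-CNN architecture or its training procedure, which the abstract does not detail. It only shows the standard assumption that a bag (here, a tissue) is positive if at least one of its instances (here, spectra at subtissue locations) is positive. The scores, the threshold, and the function names are hypothetical.

```python
def bag_score(instance_scores):
    """Standard multiple-instance assumption: a bag is as positive as its most positive instance."""
    return max(instance_scores)


def predict(bags, threshold=0.5):
    """Return (bag_label, instance_labels) for each bag of per-spectrum scores."""
    results = []
    for scores in bags:
        bag_label = bag_score(scores) >= threshold
        # An instance is labeled positive only when its own score reaches the threshold.
        instance_labels = [s >= threshold for s in scores]
        results.append((bag_label, instance_labels))
    return results


# Hypothetical per-spectrum scores from some instance-level classifier.
tumor_tissue = [0.1, 0.2, 0.9, 0.8]   # only part of the tissue is tumor
normal_tissue = [0.1, 0.05, 0.3]
```

During training, only the bag-level prediction would be compared with the tissue-level annotation, so the instance-level classifier is never told which individual locations are positive.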

12:20 PM-12:30 PM
Mass spectrometry imaging in the age of reproducible medical science
Format: Pre-recorded with live Q&A

  • Melanie Christine Föll, Institute of Surgical Pathology, Medical Center – University of Freiburg, Germany
  • Oliver Schilling, Institute of Surgical Pathology, Medical Center – University of Freiburg, Germany
  • Lennart Moritz, Institute of Surgical Pathology, Medical Center – University of Freiburg, Germany
  • Thomas Wollmann, Biomedical Computer Vision Group, BioQuant, IPMB, Heidelberg University, Germany
  • Maren Nicole Stillger, Institute of Surgical Pathology, Medical Center – University of Freiburg, Germany
  • Niklas Vockert, Biomedical Computer Vision Group, BioQuant, IPMB, Heidelberg University, Germany
  • Martin Werner, Institute of Surgical Pathology, Medical Center – University of Freiburg, Germany
  • Peter Bronsert, Institute of Surgical Pathology, Medical Center – University of Freiburg, Germany
  • Karl Rohr, Biomedical Computer Vision Group, BioQuant, IPMB, Heidelberg University, Germany
  • Björn Andreas Grüning, University of Freiburg, Germany

Presentation Overview:

Mass spectrometry imaging (MSI) has great potential for a variety of clinical research areas including pharmacology, diagnostics and personalized medicine. MSI data analysis remains challenging due to the large and complex data generated by the measurement of hundreds of analytes in thousands of tissue locations. Reproducibility of published research is limited due to extensive use of proprietary software and in-house scripts. Existing open-source software that paves the way for reproducible data analysis has a steep learning curve for scientists without programming knowledge. Therefore, we have integrated 18 MSI tools into the Galaxy framework (https://usegalaxy.eu) to allow easily accessible data analysis with high levels of reproducibility and transparency. The tools are based on Cardinal, MALDIquant and scikit-image, enabling all major MSI analysis steps from quality control to image co-registration, preprocessing and statistical analysis. We successfully applied the MSI tools in combination with other proteomics and metabolomics Galaxy tools to analyze a publicly available N-linked glycan imaging dataset, as well as in-house peptide imaging cancer datasets. Furthermore, we created hands-on training material for use cases in proteomics and metabolomics and provide a Docker container for a fully functional analysis platform in a closed network situation, such as in clinical settings.

12:30 PM-12:40 PM
Combining Information from Crosslinks and Monolinks in the Modelling of Protein Structures
Format: Pre-recorded with live Q&A

  • Matthew Sinnott, Institute of Structural and Molecular Biology - Birkbeck College and UCL, United Kingdom
  • Sony Malhotra, Institute of Structural and Molecular Biology - Birkbeck College, United Kingdom
  • Mallur S Madhusudhan, IISER, Pune, India
  • Konstantinos Thalassinos, Institute of Structural and Molecular Biology - Birkbeck College and UCL, United Kingdom
  • Maya Topf, Institute of Structural and Molecular Biology - Birkbeck College, United Kingdom

Presentation Overview:

We have developed a computational approach for getting the most out of an XLMS dataset in the context of modelling protein structures by using both crosslink and monolink data. Monolinks are a by-product of a Chemical Crosslinking Mass Spectrometry (XLMS) experiment. They convey residue exposure information and are more abundant in an XLMS dataset than crosslinks, which makes them a useful source of structural information. However, they are rarely used in structural modelling. We have devised the Monolink Depth Score (MoDS), a scoring function for ranking protein structure models from monolink information. Using simulated and reprocessed experimental data from the Proteomic Identification Database, we compare the performance of MoDS to the Matched and Non-Accessible Crosslink (MNXL) score, which we have previously devised to score data from crosslinks. Our results show that MNXL only marginally outperforms MoDS, and that MoDS is an effective tool for scoring model structures. Furthermore, combining MoDS and MNXL into the Crosslink Monolink (XLMO) score improves performance above that of either MoDS or MNXL alone. To make our software easily accessible to the community, we created Crosslink Modelling Tools (XLM-Tools), a Python program for scoring protein structure models using crosslinks and monolinks.

2:00 PM-2:40 PM
CompMS Keynote: Proximity interactome for SARS-CoV-2. Why knowing your neighbor is key in pandemic times.
Format: Live-stream

  • Anne-Claude Gingras

Presentation Overview:

Compartmentalization is essential for all complex forms of life. In eukaryotic cells, membrane-bound organelles, as well as a multitude of protein- and nucleic acid-rich subcellular structures, maintain boundaries and serve as enrichment zones to promote and regulate protein function, including signaling events. Consistent with the critical importance of these boundaries, alterations in the machinery that mediates protein transport between these compartments have been implicated in a number of diverse diseases, and this machinery is harnessed by pathogens, including viruses.

Prompted by the implementation of in vivo biotinylation approaches such as BioID, we report here the systematic mapping of the composition of various subcellular structures, using as baits proteins (or protein fragments) that are well-characterized markers for a specified location. We defined how relationships between “prey” proteins detected through this approach can help in understanding the protein organization inside a cell, which is further facilitated by newly developed computational tools. We will discuss our map of a human cell containing major organelles and non-membrane-bound structures, and illustrate how this map can be leveraged to devise “compartment sensors” to explore biology. We next address the use of this type of strategy to reveal new insights into the biology of pathogens, focusing on the first proximity map for each of the SARS-CoV-2 proteins.

2:40 PM-2:50 PM
MSnbase, efficient and elegant R-based processing and visualisation of raw mass spectrometry data
Format: Pre-recorded with live Q&A

  • Laurent Gatto, Computational Biology Unit, de Duve Institute, UCLouvain, Avenue Hippocrate, 75, 1200 Brussels, Belgium
  • Sebastian Gibb, Department of Anaesthesiology and Intensive Care of the University Medicine Greifswald, 17475 Greifswald, Germany
  • Johannes Rainer, Institute for Biomedicine, Eurac Research, Affiliated Institute of the University of Lübeck, Bolzano, Italy

Presentation Overview:

We present version 2 of the MSnbase R/Bioconductor package. MSnbase provides infrastructure for the manipulation, processing and visualisation of mass spectrometry data. We focus on the new on-disk infrastructure, which allows the handling of large raw mass spectrometry experiments on commodity hardware, and illustrate how the package is used for elegant data processing, method development and visualisation.

2:50 PM-3:00 PM
Democratizing DIA analysis on public cloud infrastructures via Galaxy
Format: Pre-recorded with live Q&A

  • Björn Andreas Grüning, Bioinformatics Group, Department of Computer Science, University of Freiburg, Germany
  • Matthias Fahrner, Institute for Surgical Pathology, Faculty of Medicine, University of Freiburg, Germany
  • Melanie Christine Föll, Institute for Surgical Pathology, Faculty of Medicine, University of Freiburg, Germany
  • Oliver Schilling, Institute of Surgical Pathology, Medical Center – University of Freiburg, Germany

Presentation Overview:

Data independent acquisition (DIA) has become one of the most important approaches in global proteomic studies. DIA data provides detailed and in-depth insights into the molecular variety of biological systems. However, due to the high complexity and large data size, the data analysis remains challenging. Available open-source software requires different operating systems, programming skills, and large compute infrastructures. Thus, current open-source DIA data analysis is mainly accessible to bioinformatics-competent researchers with access to large computational resources, and it often lacks reproducibility and usability. Here we present a straightforward workflow containing all essential DIA analysis steps based on OpenSwath, pyprophet, diapysef and swath2stats, which can be applied and adapted by a large user community without the need for tool installations, special computing resources or programming skills. The all-in-one DIA workflow in Galaxy drastically increases the robustness, reproducibility and speed of the DIA data analysis due to parallel processing of multiple inputs using Galaxy's HPC and cloud infrastructure. Each tool is available as a Conda package and a BioContainer. However, a few steps in this workflow require up to 1 TB of memory, so we recommend using the workflow on the European Galaxy server (https://usegalaxy.eu), which can utilize worldwide HPC and cloud resources.

3:20 PM-3:40 PM
Isolation forests improve the capability to detect quality problems in mass spectrometry-based proteomics
Format: Pre-recorded with live Q&A

  • Olga Vitek, Northeastern University, United States
  • Akshay Kulkarni, Khoury College of Computer Science, Northeastern University, United States
  • Eralp Dogu, College of Science, Mugla Sitki Kocman University, Turkey
  • Roger Olivella, Proteomics Unit, Centre de Regulació Genòmica, Spain
  • Eduard Sabidó, Proteomics Unit, Centre de Regulació Genòmica, Spain

Presentation Overview:

Quality control (QC) of mass spectrometry-based proteomic experiments involves quantifying a standard mixture that includes a set of analytes, generating multiple metrics. The metrics are then used to evaluate the effects of technical variability on the quantification of the actual biological samples. Next, a reliable baseline data set is used to train statistical models or traditional statistical quality control methods to classify outlying runs. Although current technological improvements help conduct the initial steps of most QC workflows, many practitioners still lack baseline data and misclassify QC runs in real-time implementation.

Here, we present an unsupervised machine learning extension of MSstatsQC to detect deviations from optimal performance across multiple metrics. MSstatsQC implements unsupervised isolation-based trees to address the outlier detection problem. Our results show how tree-based methods help differentiate optimal and sub-optimal experiments when limited information is available about the optimal or suboptimal performance of the instrument. We also provide supporting information on the root causes of anomalous behavior per peptide, which can be used to design preventive actions. Our method is available in the MSstatsQC R/Bioconductor package and in the web-based graphical user interface MSstatsQCgui.
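
The intuition behind isolation-based trees can be sketched in a few lines. This is not the MSstatsQC implementation. It is a simplified illustration in which random axis-aligned splits are regenerated for each query point instead of being stored as trees, and in which the subsampling and path-length normalization of the full isolation forest algorithm are omitted. The QC metrics and their values are hypothetical. The point it shows is that an anomalous run is separated from the rest after fewer random splits than a typical run.

```python
import random


def path_length(point, data, rng, depth=0, max_depth=12):
    """Number of random axis-aligned splits needed to isolate `point` within `data`."""
    if len(data) <= 1 or depth >= max_depth:
        return depth
    dim = rng.randrange(len(point))
    lo = min(row[dim] for row in data)
    hi = max(row[dim] for row in data)
    if lo == hi:
        return depth
    split = rng.uniform(lo, hi)
    # Keep only the rows that fall on the same side of the split as `point`.
    same_side = [row for row in data if (row[dim] < split) == (point[dim] < split)]
    return path_length(point, same_side, rng, depth + 1, max_depth)


def mean_path_length(point, data, n_trees=200, seed=0):
    """Average isolation depth over many random partitions; shorter means more anomalous."""
    rng = random.Random(seed)
    return sum(path_length(point, data, rng) for _ in range(n_trees)) / n_trees


# Hypothetical QC metrics per run: (retention time in minutes, log peak area).
runs = [(20.0 + 0.1 * i, 15.0 + 0.05 * (i % 5)) for i in range(15)] + [(35.0, 9.0)]
```

Ranking runs by `mean_path_length` requires no labeled baseline of good and bad runs, which is what makes this family of methods attractive when such a baseline is unavailable.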

3:40 PM-4:00 PM
Focus on the spectra that matter by clustering of quantification data in shotgun proteomics
Format: Pre-recorded with live Q&A

  • Lukas Käll, KTH Stockholm, Sweden
  • Matthew The, KTH Stockholm, Sweden

Presentation Overview:

We propose a quantification-first approach for peptides in shotgun proteomics experiments that reverses the classical identification-first workflow. This prevents valuable information from being discarded prematurely in the identification stage and allows us to spend more effort on the identification process. We demonstrate that combining this with Bayesian protein quantification dramatically increases sensitivity on multiple engineered and clinical datasets.

Our method, Quandenser, applies unsupervised clustering at both the MS1 and MS2 levels, summarizing all analytes of interest without assigning identities. This eliminates the need to redo quantification for new search parameters or engines and reduces search time through data reduction. For one of the investigated datasets, using an open modification search with MODa, we assigned identities to 47% more consensus spectra while reducing search time from a week to well below a day. Furthermore, de novo searches on large MS2 spectrum clusters unveiled peptides and proteins not present in the database. Importantly, Quandenser addresses the false transfer problem by providing feature-feature match error rates using decoy features and a novel automated weighting scheme. We integrated these into our probabilistic protein quantification method, Triqler, which propagates error probabilities from feature to protein level and reduces the noise from false positives and missing values.

4:00 PM-4:20 PM
Selection of features with consistent profiles improves relative protein quantification in mass spectrometry experiments
Format: Pre-recorded with live Q&A

  • Tsung-Heng Tsai, Kent State University, United States
  • Meena Choi, Northeastern University, United States
  • Balazs Banfai, Roche Innovation Center Basel, Switzerland
  • Yansheng Liu, Yale University School of Medicine, United States
  • Brendan MacLean, University of Washington, United States
  • Tom Dunkley, Hoffmann-La Roche Ltd, Switzerland
  • Olga Vitek, Northeastern University, United States

Presentation Overview:

In bottom-up mass spectrometry-based proteomics, relative protein quantification is often achieved with data-dependent acquisition (DDA), data-independent acquisition (DIA), or selected reaction monitoring (SRM). These workflows quantify proteins by summarizing the abundances of all the spectral features of the protein (e.g., precursor ions, transitions or fragments) in a single value per protein per run. When abundances of some features are inconsistent with the overall protein profile (for technological reasons such as interferences, or for biological reasons such as post-translational modifications), the protein-level summaries and the downstream conclusions are undermined. We propose a statistical approach that automatically detects spectral features with such inconsistent patterns. The detected features can be separately investigated, and if necessary removed from the dataset. We evaluated the proposed approach on a series of benchmark controlled mixtures and biological investigations with DDA, DIA and SRM data acquisitions. The results demonstrated that it can facilitate and complement manual curation of the data. Moreover, it can improve the estimation accuracy, the sensitivity and specificity of detecting differentially abundant proteins, and the reproducibility of conclusions across different data processing tools. The approach is implemented as an option in the open-source R-based software MSstats. This work has been accepted for publication in Molecular & Cellular Proteomics.

4:20 PM-4:40 PM
Proceedings Presentation: MutCombinator: Identification of mutated peptides allowing combinatorial mutations using nucleotide-based graph search
Format: Pre-recorded with live Q&A

  • Seunghyuk Choi, Department of Computer Science, Hanyang University, South Korea
  • Eunok Paek, Department of Computer Science, Hanyang University, South Korea

Presentation Overview:

Proteogenomics has proven its utility by integrating genomics and proteomics. Typical approaches use data from next generation sequencing to infer proteins expressed. A sample-specific protein sequence database is often adopted to identify novel peptides from matched mass spectrometry-based proteomics; nevertheless, there is no software that can practically identify all possible forms of mutated peptides suggested by various genomic information sources. We propose MutCombinator, which enables us to practically identify mutated peptides from tandem mass spectra allowing combinatorial mutations during the database search. It uses an upgraded version of a variant graph, keeping track of frame information. The variant graph is indexed by nine nucleotides for fast access. Using MutCombinator, we could identify more mutated peptides than previous methods, because combinations of point mutations are considered, and also because it can be practically applied together with a large mutation database such as COSMIC. Furthermore, MutCombinator supports in-frame search for coding regions and three-frame search for noncoding regions.

5:00 PM-5:20 PM
MealTime-MS: A machine learning-guided real-time mass spectrometry analysis for protein identification and efficient dynamic exclusion
Format: Pre-recorded with live Q&A

  • Mathieu Lavallée-Adam, University of Ottawa, Canada
  • Zhibin Ning, University of Ottawa, Canada
  • Daniel Figeys, University of Ottawa, Canada
  • Yun-En Chung, University of Ottawa, Canada
  • Alexander R. Pelletier, University of Ottawa, Canada
  • Nora Wong, University of Ottawa, Canada

Presentation Overview:

While mass spectrometry-based proteomics can identify thousands of proteins in a biological sample, commonly used mass spectrometry data acquisition approaches suffer from a poor identification sensitivity of low abundance proteins. In a typical protein identification experiment, mass spectra are preferentially collected from proteins with higher abundance. The identification of these proteins is then performed after the completion of the experiment. Such an approach typically results in the redundant acquisition of mass spectra from proteins with high abundance, while very few are collected for low abundance proteins, which therefore remain unidentified. Hence, we propose a novel supervised learning-based algorithm (MealTime-MS) that identifies proteins in real-time as mass spectrometry data are acquired and prevents redundant data acquisition from already confidently identified proteins. Using in-silico simulations of a mass spectrometry analysis of a HEK293 cell lysate, we demonstrate that MealTime-MS successfully identifies 92.1% of the proteins normally detected in the experiment without any data exclusion, while using only 66.2% of the mass spectra. We also show that our approach outperforms a previously proposed method, and is sufficiently fast for real-time mass spectrometry analysis. Finally, MealTime-MS’ efficient usage of mass spectrometry resources will provide the tools for a more comprehensive characterization of proteomes.

5:20 PM-5:40 PM
Reducing false peptide-spectrum matches in peptide identification using spectrum clustering
Format: Live-stream

  • Lei Wang, Indiana University, United States
  • Sujun Li, Indiana University, United States
  • Haixu Tang, Indiana University, United States

Presentation Overview:

Database searching is widely used in peptide identification for massive tandem mass (MS/MS) spectrometric data, in which a target-decoy approach is often employed to estimate the false discovery rate (FDR) in peptide-spectrum matches (PSMs) above a score threshold. A small fraction (e.g., 1%) of PSMs reported by proteomic studies are false. Here, we propose to exploit the redundancy present in large spectral datasets to detect these false PSMs (referred to as PSM validation), based on dissimilar spectra in PSMs of the same peptide and similar spectra in PSMs of different peptides. We formalize the method as a probabilistic post-processing approach to improve peptide identification. Intuitively, for a pair of spectra matched to the same peptide, the higher their similarity, the more likely both matches are correct; in contrast, for a pair of spectra matched to different peptides, the lower their similarity, the more likely both matches are correct. We tested our method on several large-scale proteomics datasets and achieved a much lower FDR after eliminating the PSMs most likely to be false. Conversely, we identified more PSMs at the same FDR after eliminating the false PSMs.
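
The target-decoy estimate that the abstract takes as its starting point can be sketched as follows. This shows only the baseline FDR estimate, not the proposed spectrum-clustering validation. It assumes the common simple estimator (number of decoy PSMs divided by number of target PSMs at or above a score threshold), and the PSM scores are hypothetical.

```python
def estimate_fdr(psms, threshold):
    """Target-decoy FDR estimate: decoys / targets among PSMs scoring at or above `threshold`.

    `psms` is a list of (score, is_decoy) pairs.
    """
    targets = sum(1 for score, is_decoy in psms if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in psms if score >= threshold and is_decoy)
    return decoys / targets if targets else 0.0


def lowest_threshold_at_fdr(psms, max_fdr=0.01):
    """Lowest observed score whose estimated FDR does not exceed `max_fdr` (None if none qualifies)."""
    passing = [score for score, _ in psms if estimate_fdr(psms, score) <= max_fdr]
    return min(passing) if passing else None


# Hypothetical search results: (score, is_decoy).
psms = [(9.0, False), (8.5, False), (8.0, False), (7.0, True), (6.5, False), (6.0, False)]
```

Any post-processing step that removes false target PSMs above the threshold lowers the true error rate at that threshold, which in turn allows a more permissive threshold at the same FDR.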

5:40 PM-5:50 PM
Full-Spectrum Prediction of Peptides Tandem Mass Spectra using Deep Neural Network
Format: Pre-recorded with live Q&A

  • Lei Wang, Indiana University, United States
  • Sujun Li, Indiana University, United States
  • Haixu Tang, Indiana University, United States
  • Kaiyuan Liu, Indiana University, United States
  • Yuzhen Ye, Indiana University, United States

The ability to predict tandem mass (MS/MS) spectra from peptide sequences can significantly enhance our understanding of the peptide fragmentation process; however, current approaches for predicting high-energy collisional dissociation (HCD) spectra are limited to predicting the intensities of expected ions, i.e., the a/b/c/x/y/z ions and their neutral loss derivatives (the backbone ions). In practice, backbone ions account for less than 70% of total ion intensities, indicating that many intense ions are ignored. We therefore developed a deep learning approach that predicts complete spectra directly from peptide sequences. The model makes no assumptions about which kinds of ions to predict; instead, it predicts the intensities at all possible m/z values. Experiments show that the predicted 2+ and 3+ HCD spectra are highly similar to the experimental spectra, with average cosine similarities of 0.820 (+/- 0.088) and 0.786 (+/- 0.085), which are very close to the similarities between experimental replicates and much higher than those of the best-performing backbone-only models (~0.75 and ~0.70). Furthermore, we developed multi-task learning approaches for predicting electron transfer dissociation (ETD) spectra and HCD spectra of 1+ and 4+ precursors, for which training samples are scarce. Finally, spectra of peptides with common modifications (such as oxidation) can also be predicted by adding corresponding training samples.
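
The cosine similarity used to compare predicted and experimental spectra can be sketched as follows. This is a minimal illustration that assumes spectra are given as (m/z, intensity) peak lists and binned at a fixed width; the bin width, m/z range and peak values are hypothetical and may differ from the settings used in the presented work.

```python
import math


def bin_spectrum(peaks, bin_width=1.0, max_mz=2000.0):
    """Sum peak intensities into fixed-width m/z bins (assumed binning)."""
    n_bins = int(max_mz / bin_width)
    vector = [0.0] * n_bins
    for mz, intensity in peaks:
        index = int(mz / bin_width)
        if 0 <= index < n_bins:
            vector[index] += intensity
    return vector


def cosine_similarity(u, v):
    """Cosine of the angle between two equally long intensity vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty spectrum is similar to nothing
    return dot / (norm_u * norm_v)


# Hypothetical peak lists: same peaks, different absolute intensity scale.
predicted = [(175.1, 1.0), (304.2, 0.5)]
experimental = [(175.1, 2.0), (304.2, 1.0)]
score = cosine_similarity(bin_spectrum(predicted), bin_spectrum(experimental))
```

Because cosine similarity ignores overall scale, these two toy spectra score as identical even though their absolute intensities differ by a factor of two.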

5:50 PM-6:00 PM
Strategies for controlling false discovery rate when only a subset of peptides in a sample are of interest
Format: Pre-recorded with live Q&A

  • William S. Noble, University of Washington, United States
  • Andy Lin, University of Washington, United States
  • Deanna Plubell, University of Washington, United States
  • Uri Keich, University of Sydney, Australia

The false discovery rate (FDR) estimation process in proteomic analysis assumes that most peptides in a sample are relevant to downstream analysis. However, in settings where only a subset of peptides is of interest, variations of the estimation protocol are needed. Previous literature has described four FDR control methods (search-then-filter, subset search, group FDR, and all-sub) applicable to this setting. These methods fail to consider neighbor peptides, which are irrelevant peptides that resemble a relevant peptide in terms of precursor mass and peak locations. Neighbor peptides complicate FDR estimation because they may increase the number of incorrectly accepted peptide-spectrum matches (PSMs). We develop a new method, filter then subset-neighbor search (FSNS), that explicitly accounts for neighbor peptides. We empirically compare these five methods for estimating FDR when only a subset of peptides is relevant. First, we test whether each method properly controls the FDR. This analysis demonstrates that search-then-filter and all-sub fail to control the FDR. Next, using the remaining methods, we conduct a power analysis to compare performance. Our results indicate that subset search outperforms FSNS, which in turn outperforms group FDR. However, we find that the advantage of subset search over FSNS is due to an increase in the number of incorrect PSMs that match neighbor peptides.

Thursday, July 16th
10:40 AM-10:50 AM
An invitation to Computational Metabolomics
Format: Live-stream

  • Shuzhao Li, The Jackson Laboratory for Genomic Medicine, United States

Modern metabolomics produces rich data on biological metabolites and environmental chemicals (the exposome), and thus fills an important gap in our understanding of the interactions between genome and environment. The explosive growth of the field opens up tremendous scientific opportunities, but also poses many informatics challenges, which are briefly reviewed here.

10:50 AM-11:00 AM
The Journey of MetaboAnalyst
Format: Live-stream

  • Jeff Xia, McGill University, Canada

The growth of every omics field depends on the growth of its community of practitioners, and bioinformatics plays a key role in this process. Used by >100,000 researchers for >3 million analysis jobs per year, MetaboAnalyst, initially developed to support online statistics and chemoinformatics, has gradually become an important platform that engages and empowers the metabolomics community. The journey of MetaboAnalyst reflects the trends in expanding metabolomics research applications as well as the evolving landscape of big data analytics.

11:00 AM-11:20 AM
Classes for the masses: Systematic classification of unknowns using fragmentation spectra
Format: Pre-recorded with live Q&A

  • Marcus Ludwig, Friedrich-Schiller-University Jena, Germany
  • Louis-Félix Nothias, Skaggs School of Pharmacy and Pharmaceutical Sciences, University of California, San Diego, USA, United States
  • Kai Dührkop, Friedrich-Schiller-University Jena, Germany
  • Markus Fleischauer, Friedrich-Schiller-University Jena, Germany
  • Martin Hoffmann, Friedrich-Schiller-University Jena, Germany
  • Pieter C. Dorrestein, Department of Chemistry and Biochemistry, UCSD, United States
  • Sebastian Böcker, Friedrich-Schiller-University Jena, Germany
  • Juho Rousu, Aalto University, Finland

Metabolomics experiments can employ non-targeted tandem mass spectrometry to detect hundreds to thousands of molecules in a biological sample. Structural annotation of molecules is typically carried out by searching their fragmentation spectra in spectral libraries or, recently, in structure databases. Annotations are limited to structures present in the library or database employed, prohibiting a thorough utilization of the experimental data. Here, we propose to shift the current paradigm of annotation, presenting a systematic and comprehensive compound class annotation tool. CANOPUS uses a deep neural network to predict 1,270 ClassyFire compound classes from fragmentation spectra, and explicitly targets compounds where neither spectral nor structural reference data are available. CANOPUS even predicts classes for which no MS/MS training data are available, which is made possible by a two-step learning procedure.
We evaluated CANOPUS against four other computational approaches and found that it outperforms them.
We demonstrate the broad utility of CANOPUS by investigating the effect of microbial colonization on the digestive system of mice and by analyzing the chemodiversity of different Euphorbia plants; both studies revealed biological insights at the compound class level.

11:20 AM-11:40 AM
Pathway-Activity Likelihood Analysis and Metabolite Annotation for Untargeted Metabolomics using Probabilistic Modeling
Format: Pre-recorded with live Q&A

  • Ramtin Hosseini, Tufts University, United States
  • Neda Hassanpour, Tufts University, United States
  • Li-Ping Liu, Tufts University, United States
  • Soha Hassoun, Tufts University, United States

Despite computational advances, interpreting untargeted measurements and determining their biological roles remains a challenge. We present an inference-based approach, termed Probabilistic modeling for Untargeted Metabolomics Analysis (PUMA). Our approach captures the metabolomics measurements and the biological network of the sample under study in a generative model and uses stochastic sampling to compute posterior probability distributions. PUMA predicts the likelihood of pathways being active and then derives probabilistic annotations. Unlike prior pathway analysis tools that analyze differentially active pathways, PUMA defines a pathway as active if the likelihood that the pathway generated the observed measurements is above a particular (user-defined) threshold. Because “ground truth” metabolomics datasets, in which all measurements are annotated and pathway activities are known, are lacking, PUMA is validated on synthetic datasets designed to mimic cellular processes. On average, PUMA outperforms pathway enrichment analysis by 8%. When applied to case studies, PUMA annotation results were in agreement with those obtained using other tools that utilize additional information in the form of spectral signatures. Importantly, PUMA yields a substantial number of additional putative annotations beyond those found by spectral database lookups. For an experimentally validated 50-compound dataset, annotations using PUMA yielded 0.833 precision and 0.676 recall.
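
For readers less familiar with the precision and recall figures quoted above, the sketch below shows how such values are typically computed from a set of predicted annotations and a set of validated ones. The function name and the toy (feature, metabolite) pairs are hypothetical; they do not reproduce the PUMA evaluation.

```python
def precision_recall(predicted, validated):
    """Return (precision, recall) for two collections of annotations."""
    predicted, validated = set(predicted), set(validated)
    true_positives = len(predicted & validated)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(validated) if validated else 0.0
    return precision, recall


# Hypothetical (feature, metabolite) annotations, for illustration only.
predicted_annotations = {("f1", "citrate"), ("f2", "malate"),
                         ("f3", "lactate"), ("f4", "alanine")}
validated_annotations = {("f1", "citrate"), ("f2", "malate"),
                         ("f3", "lactate"), ("f5", "serine"),
                         ("f6", "glycine"), ("f7", "proline")}

precision, recall = precision_recall(predicted_annotations,
                                     validated_annotations)
```

In this toy case three of the four predictions are validated and three of the six validated annotations are recovered, so precision exceeds recall, the same direction as in the reported results.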

12:00 PM-12:40 PM
CompMS Keynote: In Silico Metabolomics - Tools to Illuminate the Dark Matter of the Metabolome
Format: Live-stream

  • David Wishart, University of Alberta, Canada

Metabolomics, as a field, struggles with the fact that it is only able to identify less than 5% of the identifiable features in high-resolution LC-MS metabolomic experiments. The unidentified portion of the metabolome is often called metabolomic "dark matter". In this presentation I will describe some of the software tools and data resources that we have developed to assist with the identification of this dark matter. These include BioTransformer (a tool for biologically feasible metabolite prediction), CFM-ID 4.0 (for accurate MS/MS spectral prediction) and NMR-Pred (for accurate NMR spectral prediction). Examples will be provided showing how these tools can be used to identify novel metabolite dark matter.

3:20 PM-3:40 PM
A call to create a true exposome database for untargeted chemical profiling
Format: Live-stream

  • Douglas Walker, Icahn School of Medicine at Mount Sinai, United States

Over a lifetime, humans experience thousands of chemical exposures. Obtaining a more complete estimate of environmental exposures across the lifespan (i.e., the exposome) would be a transformative research initiative. High-resolution mass spectrometry (HRMS) is a key platform for assessing the exposome, providing measurements of thousands of chemical signals in a single human sample. At present, untargeted assays still rely on metabolomic reference databases to identify chemicals from chromatograms. While lists of exogenous chemicals are available and can be used for annotation, many are specific to the form approved for commercial use and contain limited information on transformation products. Thus, current approaches greatly underestimate the presence of environmental chemicals in human populations. For example, out of 167 exposure biomarkers measured by the CHEAR targeted laboratories, only 21% to 79% of the compounds were present in commonly used databases. To address these needs, this presentation will highlight the limitations of metabolomic databases when annotating HRMS data for exposomics and identify strategies for improving our ability to detect exposure-related chemicals, including the use of environmental suspect-screening databases, the application of experimental and in silico predictions of chemical transformation products, and the development of exposome-wide chemical transformation pathways for use in pathway-enrichment analyses.

3:40 PM-3:50 PM
ZODIAC: database-independent molecular formula annotation using Gibbs sampling reveals novel small molecules
Format: Pre-recorded with live Q&A

  • Marcus Ludwig, Friedrich-Schiller-University Jena, Germany
  • Louis-Félix Nothias, Skaggs School of Pharmacy and Pharmaceutical Sciences, University of California, San Diego, USA, United States
  • Kai Dührkop, Friedrich-Schiller-University Jena, Germany
  • Irina Koester, Scripps Institution of Oceanography, University of California San Diego, United States
  • Markus Fleischauer, Friedrich-Schiller-University Jena, Germany
  • Martin Hoffmann, Friedrich-Schiller-University Jena, Germany
  • Daniel Petras, Scripps Institution of Oceanography, University of California San Diego, United States
  • Fernando Vargas, Skaggs School of Pharmacy and Pharmaceutical Sciences, University of California San Diego, United States
  • Mustafa Morsy, Department of Biological and Environmental Sciences, University of West Alabama, Livingston, United States
  • Lihini Aluwihare, Scripps Institution of Oceanography, University of California San Diego, United States
  • Pieter C. Dorrestein, Department of Chemistry and Biochemistry, UCSD, United States
  • Sebastian Böcker, Friedrich-Schiller-University Jena, Germany

Confident identification is one of the biggest challenges in the high-throughput analysis of small molecules with mass spectrometry. Even molecular formula annotation remains challenging, in particular for compounds with high mass and when performed de novo, that is, without any use of databases.
We present ZODIAC, a method for comprehensive de novo molecular formula annotation on complete datasets. It takes advantage of the fact that metabolites co-occur in a network of derivatives within the sample. ZODIAC uses the top molecular formula candidates provided by SIRIUS as input and re-ranks them using Bayesian statistics. ZODIAC derives likelihoods from SIRIUS scores and prior probabilities from similarities between candidates of different compounds. We apply Gibbs sampling to estimate the posterior probabilities. Datasets can have more than 10,000 variables and over a million pairwise similarities, so we apply extensive algorithm engineering to make the sampling swift in practice. On five diverse datasets we improve on SIRIUS, the current best-of-class tool. On one dataset we reduce the number of incorrect annotations 16-fold.
We show the advantage of de novo assignments over database search: ZODIAC annotates multiple compounds with novel molecular formulas absent from PubChem, one of the biggest structure databases; two of these have been validated.
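
The general idea of re-ranking candidates with Gibbs sampling can be sketched on a toy network. This is an illustrative model only, not ZODIAC's actual scoring: the likelihood values, the edge weights and the exponential form of the network prior are all assumptions made for the example.

```python
import math
import random


def gibbs_rerank(likelihood, similarity, sweeps=2000, burn_in=200, seed=0):
    """Estimate per-candidate posterior frequencies by Gibbs sampling.

    likelihood[i][c]: positive score of candidate c for compound i.
    similarity[(i, c, j, d)]: non-negative weight linking candidate c of
        compound i to candidate d of compound j (stored in both directions).
    """
    rng = random.Random(seed)
    n = len(likelihood)
    # Start from the best-scoring candidate of every compound.
    state = [max(range(len(row)), key=row.__getitem__) for row in likelihood]
    counts = [[0] * len(row) for row in likelihood]
    for sweep in range(sweeps):
        for i in range(n):
            weights = []
            for c, lik in enumerate(likelihood[i]):
                # Support from the candidates currently chosen elsewhere.
                support = sum(similarity.get((i, c, j, state[j]), 0.0)
                              for j in range(n) if j != i)
                weights.append(lik * math.exp(support))
            state[i] = rng.choices(range(len(weights)), weights=weights)[0]
        if sweep >= burn_in:
            for i in range(n):
                counts[i][state[i]] += 1
    kept = sweeps - burn_in
    return [[count / kept for count in row] for row in counts]


def symmetric(edges):
    """Store every (i, c, j, d) -> weight edge in both directions."""
    table = {}
    for (i, c, j, d), weight in edges.items():
        table[(i, c, j, d)] = weight
        table[(j, d, i, c)] = weight
    return table


# Hypothetical toy data: compound 0's second candidate scores slightly
# lower on its own but is linked to the top candidates of compounds 1 and 2.
toy_likelihood = [[0.55, 0.45], [0.9, 0.1], [0.9, 0.1]]
toy_similarity = symmetric({(0, 1, 1, 0): 2.0, (0, 1, 2, 0): 2.0})
posterior = gibbs_rerank(toy_likelihood, toy_similarity)
```

In this toy example the network support outweighs the small likelihood deficit, so the second candidate of compound 0 ends up with the higher estimated posterior. This mirrors the intuition that co-occurring derivatives reinforce each other's annotations.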

3:50 PM-4:00 PM
Spec2Vec: Improved mass spectral similarity scoring through learning of structural relationships
Format: Pre-recorded with live Q&A

  • Florian Huber, Netherlands eScience Center, Amsterdam, the Netherlands, Netherlands
  • Lars Ridder, Netherlands eScience Center, Amsterdam, the Netherlands, Netherlands
  • Simon Rogers, University of Glasgow, United Kingdom
  • Justin J.J. van der Hooft, Wageningen University, Netherlands

Highlights:

Spectral similarity is key for many metabolomics analyses.

Here, we introduce Spec2Vec, a novel spectral similarity score that is more scalable and more proportional to structural similarity of molecules than traditional scores.

The advantages of Spec2Vec are shown in library searching for both exact matches and analogues as well as in molecular networking.

4:00 PM-4:10 PM
LipidCreator workbench to probe the lipidomic landscape
Format: Pre-recorded with live Q&A

  • Dominik Kopczynski, Leibniz-Institut für Analytische Wissenschaften - ISAS - e.V., Germany
  • Robert Ahrends, Department of Analytical Chemistry, University of Vienna, Austria
  • Bing Peng, Division of Rheumatology, Department of Medicine Solna, Karolinska Institutet, Karolinska University Hospital, Sweden
  • Cristina Coman, Department of Analytical Chemistry, University of Vienna, Austria
  • Nils Hoffmann, Leibniz-Institut für Analytische Wissenschaften – ISAS – e.V., Otto-Hahn-Straße 6b, 44227 Dortmund, Germany, Germany

Lipids are embedded in biology: they form cells and organelles, mediate information flow, protect cells and tissues from a hostile environment, and serve as energy building blocks. Targeted lipidomics focuses on the reproducible, quantitative analysis of a subset of lipids of interest. To support it, we introduce LipidCreator, a lipid building block-based workbench and knowledge base for the automated generation of targeted lipidomics MS assays. Assay generation can be conducted with a GUI or at the command line. LipidCreator can calculate masses for lipid species and their fragment ions, covering over 60 common lipid classes and a lipid array of 1012 lipid molecules. Generating in silico spectral libraries requires the ability to determine the relative intensities of fragment ions at different defined collision energies. We therefore trained non-linear regression models on empirical data from standard measurements of lipid mediators on two different MS instrument types.
LipidCreator contains a visual inspection level for fragments and a lipid nomenclature translator supporting LIPID MAPS formulations. It can be integrated into KNIME and Galaxy workflows via its command line interface on Linux and Windows. LipidCreator can run standalone, and it can also be fully integrated into Skyline and its small molecule support.

4:10 PM-4:20 PM
The metaRbolomics Toolbox in Bioconductor and beyond
Format: Pre-recorded with live Q&A

  • Nils Hoffmann, Leibniz-Institut für Analytische Wissenschaften – ISAS – e.V., Otto-Hahn-Straße 6b, 44227 Dortmund, Germany, Germany
  • Laurent Gatto, Computational Biology Unit, de Duve Institute, UCLouvain, Avenue Hippocrate, 75, 1200 Brussels, Belgium, Belgium
  • Sebastian Gibb, Department of Anaesthesiology and Intensive Care of the University Medicine Greifswald, 17475 Greifswald, Germany, Germany
  • Johannes Rainer, Institute for Biomedicine, Eurac Research, Affiliated Institute of the University of Lübeck, Bolzano, Italy, Italy
  • Jan Stanstrup, Preventive and Clinical Nutrition, University of Copenhagen, Rolighedsvej 30, 1958 Frederiksberg C, Denmark, Denmark
  • Corey D. Broeckling, Colorado State University, Proteomics and Metabolomics Facility, United States
  • Rick Helmus, Institute for Biodiversity and Ecosystem Dynamics, University of Amsterdam, Amsterdam, the Netherlands, Netherlands
  • Ewy Mathé, Division of Preclinical Informatics, National Center for Advancing Translational Sciences, Rockville, MD 20850, United States
  • Thomas Naake, Max Planck Institute of Molecular Plant Physiology, Potsdam-Golm 14476, Germany, Germany
  • Luca Nicolotti, The Australian Wine Research Institute, Metabolomics Australia, PO Box 197, Adelaide, SA 5064, Australia, Australia
  • Kristian Peters, Leibniz Institute of Plant Biochemistry, Bioinformatics and Scientific Data, Weinberg 3, 06120 Halle, Germany, Germany
  • Reza Salek, The International Agency for Research on Cancer, 150 cours Albert Thomas, 69372 Lyon CEDEX 08, Lyon, France, France
  • Tobias Schulze, Helmholtz Centre for Environmental Research - UFZ, Department of Effect-Directed Analysis, 04318 Leipzig, Germany, Germany
  • Emma L Schymanski, Luxembourg Centre for Systems Biomedicine, University of Luxembourg, 6 avenue du Swing, L-4367 Belvaux, Luxembourg, Luxembourg
  • Michael A. Stravs, ETH Zürich, Institute of Molecular Systems Biology, Otto-Stern-Weg 3, 8093 Zürich, Switzerland, Switzerland
  • Etienne Thévenot, Medicines and Healthcare Technologies Dept. CEA, INRAE, Paris-Saclay University, MetaboHUB, Gif sur Yvette, France, France
  • Hendrik Treutler, Leibniz Institute of Plant Biochemistry, Bioinformatics and Scientific Data, Weinberg 3, 06120 Halle, Germany, Germany
  • Ralf Weber, Phenome Centre Birmingham and School of Biosciences, University of Birmingham, Birmingham, United Kingdom, United Kingdom
  • Egon Willighagen, BiGCaT, NUTRIM, Maastricht University, Netherlands
  • Michael Witting, Research Unit Analytical BioGeoChemistry, Helmholtz Zentrum München, Neuherberg, Germany, Germany
  • Steffen Neumann, Leibniz Institute of Plant Biochemistry, Bioinformatics and Scientific Data, Weinberg 3, 06120 Halle, Germany, Germany

Metabolomics aims to measure and characterise the complex composition of metabolites in a biological system using analytical techniques such as mass spectrometry and nuclear magnetic resonance spectroscopy. The scientific community has developed a wide range of open source software, providing freely available advanced processing and analysis approaches. The programming and statistics environment R has emerged as one of the most popular environments to process and analyse metabolomics datasets. A major benefit of such an environment is the possibility of connecting different tools into more complex workflows. We provide an extensive overview of existing packages in R for different steps in a typical computational metabolomics workflow, including data processing, biostatistics, metabolite annotation and identification, and biochemical network and pathway analysis. Multifunctional workflows, possible user interfaces and integration into workflow management systems are also covered. We also address the Findability, Accessibility, Interoperability and Reusability of software, and how these can be improved through repositories and semantic annotation. We now maintain the resource as an Open Source book with Continuous Integration as part of the RforMassSpectrometry initiative, which aims to provide efficient, thoroughly documented, tested and flexible R software for the analysis and interpretation of high throughput mass spectrometry data.

4:40 PM-5:00 PM
Using metabolic models to integrate microbiomes and metabolomes
Format: Pre-recorded with live Q&A

  • Cecilia Noecker

Microbes interact with each other and with eukaryotic hosts by producing and transforming diverse metabolites, with consequences for human and environmental health. Technologies for comprehensively profiling the composition and metabolism of microbial communities continue to improve, but the integration and interpretation of microbiome and metabolomics data remain challenging. One strategy to address this challenge is to take advantage of reference knowledge of the metabolic capacities of microbial taxa, encoded in metabolic models. In this talk, I will describe a conceptual framework and software tool (MIMOSA2) for using metabolic models to evaluate mechanistic links between microbes and metabolites in omics datasets.

5:00 PM-5:40 PM
CompMS Keynote: Democratizing Mass Spectrometry and Metabolomics
Format: Live-stream

  • Gary Siuzdak

Mass spectrometry can be categorized as an elite sport, often performed at the highest levels by those with the best and most expensive equipment. However, developments in databases, informatics and instrumentation are “democratizing” mass spectrometry’s utility, allowing even low-cost technologies to perform at levels that are sometimes even better than those of their more expensive counterparts. This presentation will discuss these database, informatics and instrumentation developments and how they are being applied with XCMS/METLIN in the context of enhanced In-Source fragmentation/Annotation (eISA).