Session A: (July 7 and July 8)
Session B: (July 9 and July 10)
Short Abstract: Identification of metabolites from tandem mass spectrometry data is commonly performed by matching the peak pattern against a spectral library of previously curated fragmentation spectra. Since this method discriminates based on peak intensity as well as mass, peptides are also increasingly identified in this manner. Existing methods are typically based on separately scoring each library spectrum against each query spectrum. However, chimeric spectra are common, with most spectra having at least some residual noise from co-fragmented species. We propose a new group-sparse formulation of Richardson-Lucy deconvolution that fits a non-negative mixture of relevant library spectra to a query spectrum simultaneously, hence accounting for chimeric spectra by design. For evaluation we applied our approach to data from a wide range of synthetic peptides [Zolg et al, Nature Methods 2017], both as 'pure' spectra and as artificially mixed chimeric spectra. Preliminary results against established software show that when matching against the entire NIST human peptide library we identify more of the pure spectra correctly (67%, compared to 62% for SpectraST and 59% for MSPepSearch). Moreover, on the chimeric spectra we report a correct peptide as the top hit 80% of the time, whereas SpectraST and MSPepSearch failed to report any true positives.
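As a rough illustration of the mixture-fitting idea in the abstract above, the following sketch applies the classical Richardson-Lucy multiplicative update to fit non-negative weights for a small set of library spectra. It is a minimal sketch under stated assumptions: spectra are pre-binned onto a shared m/z grid, the group-sparse penalty the authors describe is omitted, and the function name and toy spectra are hypothetical rather than taken from the authors' software.

```python
def richardson_lucy_mixture(library, query, n_iter=200, eps=1e-12):
    """Fit non-negative weights w so that sum_j w[j] * library[j] approximates query.

    library: list of binned library spectra (lists of intensities, same length).
    query:   binned query spectrum (list of intensities).
    This is the plain Richardson-Lucy update, without any sparsity penalty.
    """
    n_lib = len(library)
    n_bins = len(query)
    weights = [1.0] * n_lib  # strictly positive start keeps every weight non-negative
    totals = [sum(spectrum) for spectrum in library]
    for _ in range(n_iter):
        # Current model spectrum: weighted sum of all library spectra.
        model = [
            sum(weights[j] * library[j][i] for j in range(n_lib))
            for i in range(n_bins)
        ]
        # Ratio of observed to modelled intensity in every bin.
        ratio = [query[i] / max(model[i], eps) for i in range(n_bins)]
        # Multiplicative update: back-project the ratio onto each library spectrum.
        weights = [
            weights[j]
            * sum(library[j][i] * ratio[i] for i in range(n_bins))
            / max(totals[j], eps)
            for j in range(n_lib)
        ]
    return weights


# Toy example (made-up spectra): the query is a 70/30 mix of entries 0 and 2.
toy_library = [
    [0.5, 0.5, 0.0, 0.0, 0.0],
    [0.0, 0.5, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.5, 0.5],
]
toy_query = [0.35, 0.35, 0.0, 0.15, 0.15]
toy_weights = richardson_lucy_mixture(toy_library, toy_query)
```

Because all library spectra are fitted at once, a chimeric query is explained by several non-zero weights instead of one best-scoring match.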
Short Abstract: Metabolite identification is the bottleneck of metabolomics. Based on the notion that a pathway perturbation often results in the change of not only a single metabolite, but a group of metabolites, our mummichog software (mummichog.org) was designed to identify such pattern shifts in untargeted metabolomics data by comparing to known biochemical pathways/networks, prior to metabolite identification. This method occupies a unique space in metabolomics data analysis, and has been incorporated into the widely used XCMS Online and MetaboAnalyst. We will report here the recent updates in mummichog version 2, including the use of retention time in LC-MS data in the grouping of isotopes and adducts, and the redesign of the software architecture for web-based applications. A central part of the new data structure is the “empirical compound”, a computational unit for a tentative metabolite, which accommodates isotopes and adducts, ambiguity of annotation, and linking to additional information. This will allow connection to a cumulative database of metabolic models and disease associations. We will demonstrate the application of mummichog 2 on the Snyderome data and on a meta-analysis of several disease studies.
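The grouping of isotopes and adducts by retention time mentioned in the abstract above can be pictured with a small sketch. This is not mummichog's implementation: the function name, the tolerances, the greedy grouping strategy and the two mass offsets chosen (13C isotope spacing and the sodium-versus-proton adduct difference) are assumptions made for illustration, and the feature values are invented.

```python
# Known m/z differences relative to a base ion (assumed subset, positive mode).
OFFSETS = {
    "13C isotope": 1.003355,
    "[M+Na]+ vs [M+H]+": 21.981942,
}


def group_features(features, rt_tol=2.0, mz_tol=0.005):
    """Greedily group (mz, rt) features into tentative compounds.

    A feature joins a group when it co-elutes with the group's base feature
    (retention times within rt_tol) and its m/z difference to the base matches
    one of the known offsets within mz_tol.
    """
    ordered = sorted(features)  # ascending m/z, so the base is the lightest ion
    used = set()
    groups = []
    for i, (mz, rt) in enumerate(ordered):
        if i in used:
            continue
        used.add(i)
        group = [((mz, rt), "base")]
        for j in range(i + 1, len(ordered)):
            if j in used:
                continue
            mz_j, rt_j = ordered[j]
            if abs(rt_j - rt) > rt_tol:
                continue  # does not co-elute with the base feature
            for label, delta in OFFSETS.items():
                if abs((mz_j - mz) - delta) <= mz_tol:
                    group.append(((mz_j, rt_j), label))
                    used.add(j)
                    break
        groups.append(group)
    return groups


# Invented features: three co-eluting ions plus one unrelated late-eluting ion.
toy_features = [(181.0707, 35.0), (182.0741, 35.4), (203.0526, 34.8), (182.0741, 120.0)]
toy_groups = group_features(toy_features)
```

Each resulting group plays the role of one tentative metabolite, so downstream pathway analysis counts it once rather than once per ion.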
Short Abstract: Liquid Chromatography (LC) followed by tandem Mass Spectrometry (MS/MS) is one of the predominant methods for metabolite identification. In recent years, machine learning has started to transform the analysis of tandem mass spectra and the identification of small molecules. In contrast, LC data is rarely used to improve metabolite identification, despite numerous published methods for retention time prediction using machine learning. We present a machine learning method for predicting the retention order of molecules; that is, the order in which molecules elute from the LC column. Our method has important advantages over previous approaches: we show that retention order is much better conserved between instruments than retention time. As a result, our method can be trained using retention time measurements from other LC systems and configurations without tedious preprocessing, significantly increasing the amount of available training data. Our experiments demonstrate that retention order prediction is an effective way to learn the retention behavior of molecules from heterogeneous retention time data. Finally, we demonstrate how retention order prediction and MS/MS-based scores can be combined for more accurate metabolite identifications when analyzing a complete LC-MS/MS run.
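To make concrete why retention order can be shared across LC systems while retention time cannot, the sketch below converts each system's retention times into ordered molecule pairs and measures how often two systems agree on the order. It only illustrates the data representation, not the authors' learning method; the function names, molecule labels and retention times are hypothetical.

```python
from itertools import combinations


def order_pairs(retention_times):
    """Return the set of (earlier, later) molecule pairs for one LC system."""
    pairs = set()
    for a, b in combinations(sorted(retention_times), 2):
        if retention_times[a] < retention_times[b]:
            pairs.add((a, b))
        elif retention_times[b] < retention_times[a]:
            pairs.add((b, a))
        # Ties carry no order information and are skipped.
    return pairs


def order_agreement(system_1, system_2):
    """Fraction of pairs ordered in both systems that elute in the same order."""
    shared = set(system_1) & set(system_2)
    pairs_1 = order_pairs({m: system_1[m] for m in shared})
    pairs_2 = order_pairs({m: system_2[m] for m in shared})
    comparable = [p for p in pairs_1 if p in pairs_2 or (p[1], p[0]) in pairs_2]
    if not comparable:
        return None  # no pair is ordered in both systems
    return sum(p in pairs_2 for p in comparable) / len(comparable)


# Invented measurements: system B is slower overall but keeps the same order.
system_a = {"mol_1": 2.1, "mol_2": 3.4, "mol_3": 9.8}
system_b = {"mol_1": 5.0, "mol_2": 7.5, "mol_3": 21.0, "mol_4": 1.0}
agreement = order_agreement(system_a, system_b)
```

Pairs like these can be pooled across systems as training examples for a ranking model without first mapping retention times onto a common scale.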
Short Abstract: Researchers at different institutions commonly study similar biological questions and thus produce multiple datasets for the same biological condition across multiple organisms. We present MultiMat, a method that performs peptide-level differential expression analysis of multiple proteomic datasets simultaneously. MultiMat provides a single p-value and a single effect size estimate for the differences in protein abundances. A test statistic is computed as a sum of F-statistics produced for each individual dataset. A p-value is then estimated via permutation. Simultaneous utilization of all available peptides within proteins in multiple datasets increases statistical power to detect differences among conditions or treatments. In addition, in the MultiMat package, we build on our previous research and provide functionality for normalization, model-based imputation of missing peptide abundances and peptide-level differential protein expression analysis. MultiMat provides a flexible pipeline from raw peptide abundances to protein quantification for multiple as well as single datasets in bottom-up mass spectrometry-based proteomics studies. Here we show the improved detection of differentially expressed proteins on simulated as well as Alzheimer’s disease data. MultiMat is implemented as an R package and is currently undergoing review to become part of Bioconductor.
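The test described in the abstract above (a sum of per-dataset F-statistics with a permutation p-value) can be sketched as follows. This is a simplified illustration, not the MultiMat code: it uses one abundance vector per dataset instead of a peptide-level model, permutes sample labels independently within each dataset, and all function names and values are assumptions.

```python
import random


def f_statistic(values, labels):
    """One-way ANOVA F-statistic for values grouped by labels."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    n, k = len(values), len(groups)
    grand_mean = sum(values) / n
    ss_between = sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values()
    )
    ss_within = sum((v - sum(g) / len(g)) ** 2 for g in groups.values() for v in g)
    if ss_within == 0.0:
        # Degenerate case: no within-group variation at all.
        return float("inf") if ss_between > 0.0 else 0.0
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def combined_permutation_test(datasets, n_perm=2000, seed=0):
    """Sum F-statistics over datasets and estimate a p-value by permutation.

    datasets: list of (values, labels) tuples, one per dataset.
    Returns (observed_statistic, p_value).
    """
    rng = random.Random(seed)
    observed = sum(f_statistic(values, labels) for values, labels in datasets)
    exceed = 0
    for _ in range(n_perm):
        total = 0.0
        for values, labels in datasets:
            shuffled = list(labels)
            rng.shuffle(shuffled)  # break the link between labels and abundances
            total += f_statistic(values, shuffled)
        if total >= observed:
            exceed += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return observed, (exceed + 1) / (n_perm + 1)
```

Summing the statistics before permuting yields a single p-value across datasets, so consistent but individually modest effects can still reach significance.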
Short Abstract: Automated interpretation of tandem mass spectra (MS/MS) is often limited to searching in spectral libraries. Reference libraries are vastly incomplete, containing data for only a few thousand molecules; in contrast, databases with molecular structures are orders of magnitude larger. Manual interpretation of MS data is cumbersome and labor-intensive, as current MS technology can produce hundreds of thousands of MS/MS spectra per day on a single instrument. A “BLAST-like” computational tool that allows searching in structure databases is currently missing but is highly anticipated by the communities. In my talk, I will report on recent advances for our tools SIRIUS and CSI:FingerID. Both have become highly used in metabolomics and related communities, with more than a million compound queries processed, and users from 30 countries and six continents. Our tools have demonstrated outstanding performance in compound identification, with identification rates of up to 71.7 % on challenging metabolomics datasets. This was independently confirmed in a blind competition (CASMI 2017), where our approach performed 6-fold better than the runner-up. I will give a short introduction to the computational methods behind our tools, and report on our latest research on related topics.
Short Abstract: Neoepitope peptides are newly formed antigens presented by major histocompatibility complex class I (MHC-I) on cell surfaces. The cells presenting neoepitope peptides are recognized and subsequently killed by cytotoxic T-cells. Immunopeptidomic approaches aim to characterize the peptide repertoire (including neoepitopes) associated with the MHC-I molecules on the surface of tumor cells using proteomic technologies, providing critical information for designing effective immunotherapy strategies. We developed a novel constrained de novo sequencing algorithm to identify neoepitope peptides from tandem mass spectra acquired in immunopeptidomic analyses. Our method assigns prior probabilities to putative peptides according to position specific scoring matrices (PSSMs) representing the sequence preferences recognized by MHC-I molecules. We implemented a dynamic programming algorithm to determine the peptide sequences with an optimal posterior matching score for each given MS/MS spectrum. As in conventional de novo peptide sequencing, the dynamic programming algorithm allows an efficient search of the entire peptide sequence space. On an LC-MS/MS dataset, we demonstrated that the performance of our algorithm in detecting the neoepitope peptides bound by the HLA-C*0501 molecules was superior to that of database search approaches and existing general purpose de novo peptide sequencing algorithms.
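The combination of a PSSM prior with a dynamic program over peptide space, as outlined in the abstract above, might look roughly like the toy sketch below. It is heavily simplified and not the authors' algorithm: masses are integers, only prefix (b-type) ions are scored, the peptide length is fixed, and the function name, the three-letter alphabet and all scores are assumptions.

```python
from functools import lru_cache
from math import log


def best_constrained_peptide(length, target_mass, residue_masses, pssm, peak_scores):
    """Return (score, peptide) maximising PSSM log-prior plus prefix-ion evidence.

    residue_masses: dict residue -> integer mass (toy nominal masses).
    pssm:           list with one dict per position, residue -> log prior.
    peak_scores:    dict prefix mass -> log-likelihood reward for an observed peak.
    """
    residues = sorted(residue_masses)

    @lru_cache(maxsize=None)
    def solve(pos, prefix_mass):
        # Best completion of positions pos..length-1 given the mass used so far.
        if pos == length:
            ok = prefix_mass == target_mass
            return (0.0, "") if ok else (float("-inf"), "")
        best = (float("-inf"), "")
        for aa in residues:
            new_mass = prefix_mass + residue_masses[aa]
            if new_mass > target_mass:
                continue
            tail_score, tail = solve(pos + 1, new_mass)
            if tail_score == float("-inf"):
                continue  # no completion reaches the precursor mass
            # The full-length prefix is the precursor itself, not a fragment ion.
            ion_score = peak_scores.get(new_mass, 0.0) if pos + 1 < length else 0.0
            score = pssm[pos][aa] + ion_score + tail_score
            if score > best[0]:
                best = (score, aa + tail)
        return best

    return solve(0, 0)


# Toy run: one peak at prefix mass 158 fits both "AS..." and "SA...";
# the position-0 preference for A resolves the ambiguity.
toy_masses = {"G": 57, "A": 71, "S": 87}
flat = {aa: log(1 / 3) for aa in toy_masses}
toy_pssm = [{"A": log(0.6), "G": log(0.2), "S": log(0.2)}, flat, flat]
toy_result = best_constrained_peptide(3, 215, toy_masses, toy_pssm, {158: 2.0})
```

Memoising on (position, prefix mass) is what keeps the search over all sequences tractable, since many different prefixes share the same state.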
Short Abstract: Tandem mass-spectrometry has become a method of choice for high-throughput, quantitative analysis in proteomics. However, identification of the peptides in shotgun MS/MS is not a straightforward procedure. Most peptide-spectrum matching algorithms score the concordance between the experimental and the theoretical spectra of candidate peptides, by counting the number of theoretically possible fragment ions observed in the experimental spectra and comparing it to random expectation. However, the underlying assumption that each theoretical fragment is equally likely to be observed is inaccurate; rather, MS2 spectra often have few dominant fragments. Here, we present a novel method for predicting fragment ion intensities, based on a hidden Markov model with two key properties that greatly improve the efficiency of the training process. Using millions of MS/MS spectra generated in our lab, we investigated the overall reproducibility of the experimental spectra as well as the performance of our model in various settings. We found good overall correlation between predicted intensities and the experimental spectra. Furthermore, we propose that the real value of the model lies in identifying fragments that are unlikely to be intense for a given candidate peptide, rather than in using the actual predicted intensities, in order to leverage better statistics.
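The conventional scoring scheme that the abstract above criticises (counting matched theoretical fragments and comparing the count to random expectation) can be sketched as a binomial tail probability. This is a generic illustration rather than any specific search engine's score; the function names, the tolerance and the way the chance-match probability is estimated are assumptions.

```python
from math import comb


def matched_fragments(theoretical_mz, observed_mz, tol=0.02):
    """Count theoretical fragment m/z values with an observed peak within tol."""
    return sum(
        any(abs(t - o) <= tol for o in observed_mz) for t in theoretical_mz
    )


def binomial_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def match_p_value(theoretical_mz, observed_mz, mz_range, tol=0.02):
    """Return (matches, p-value) under a uniform random-match model.

    Assumes every theoretical fragment is equally likely to hit a peak by
    chance, with probability equal to the fraction of the m/z range covered
    by the tolerance windows around the observed peaks.
    """
    k = matched_fragments(theoretical_mz, observed_mz, tol)
    width = mz_range[1] - mz_range[0]
    p_random = min(1.0, len(observed_mz) * 2 * tol / width)
    return k, binomial_tail(len(theoretical_mz), k, p_random)
```

The equal-probability assumption is visible in the single `p_random` shared by all fragments; predicted intensities would let a score discount fragments that are unlikely to be observed at all.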
Short Abstract: Introduction: Current methodologies for protein-level statistical analysis in LC-MS proteomics workflows generally do not account for digestion variability and outliers at both peptide and measurement level. Methods: Running on an HTCondor cluster, our method implements a three-level hierarchical Bayesian model. For each protein a Poisson generalized linear model is constructed to estimate the unknown protein quantification pattern, the deviation of each peptide quantification pattern from the protein pattern, and the deviation of each feature quantification from its peptide quantification pattern. The protein pattern is inferred as the peptide pattern with the most consistent evidence in the data. We have also recently extended this model to realise proteoform-specific quantification, by estimating shared-peptide contributions through mixture modelling. Results: We have performed validation with a spike-in dataset, together with a production clinical study on control vs post-mortem human Alzheimer's brain, illustrating substantial benefits to differential expression testing specificity and the ability to robustly differentiate proteoforms for the first time. Conclusion: We have demonstrated that Bayesian modelling can both robustly improve the quality of fold-change estimates in an LC-MS proteomics experiment, and yield proteoform-level differential quantifications under the substantial biological variability inherent in clinical studies.
Short Abstract: Recent success in metabolite identification from MS/MS has been led by machine learning with two stages: mapping MS/MS spectra to fingerprint vectors and then retrieving candidates from the database. In the first stage, i.e. fingerprint prediction, spectrum peaks are features, and considering their interactions would be reasonable for more accurate identification of unknown metabolites. Existing approaches to fingerprint prediction are based only on individual peaks in the spectra, without explicitly considering the peak interactions. We propose two learning models that incorporate peak interactions for fingerprint prediction. First, we extend the state-of-the-art kernel learning method by developing kernels for peak interactions to combine with kernels for peaks through multiple kernel learning (MKL). Second, we formulate a sparse interaction model for metabolite peaks, which we call SIMPLE, which is computationally light and interpretable for fingerprint prediction. The formulation of SIMPLE is convex and guarantees a global optimum, for which we develop an alternating direction method of multipliers (ADMM) algorithm. Experiments using the MassBank dataset show that both models achieved prediction accuracy comparable to the current top-performing kernel method. Furthermore, SIMPLE clearly revealed individual peaks and peak interactions which contribute to enhancing the performance of fingerprint prediction.
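To show what "peak interactions" add on top of individual peaks, the sketch below scores a binned spectrum with a generic second-order model: a bias, a weight per peak, and a weight per pair of peaks. It is not the SIMPLE formulation or its ADMM solver, which the abstract does not spell out; the function names, the upper-triangular pair weights and the toy numbers are assumptions.

```python
def pair_features(x):
    """Pairwise products x[i] * x[j] for i < j, keyed by the bin index pair."""
    n = len(x)
    return {(i, j): x[i] * x[j] for i in range(n) for j in range(i + 1, n)}


def interaction_score(x, bias, peak_weights, pair_weights):
    """Score one fingerprint bit from peaks and peak pairs.

    x:            binned spectrum intensities.
    peak_weights: one weight per bin (individual-peak term).
    pair_weights: dict (i, j) -> weight with i < j; absent pairs count as zero,
                  which is how a sparse interaction model would be stored.
    """
    linear = sum(w * xi for w, xi in zip(peak_weights, x))
    products = pair_features(x)
    pairwise = sum(w * products[pair] for pair, w in pair_weights.items())
    return bias + linear + pairwise


# Toy spectrum with three bins; only the (0, 2) peak pair carries a weight.
toy_spectrum = [1.0, 0.0, 2.0]
toy_score = interaction_score(toy_spectrum, 0.5, [1.0, 1.0, -1.0], {(0, 2): 0.25})
```

Keeping the pair weights sparse means each non-zero entry can be read directly as "these two peaks together support this fingerprint bit", which is the kind of interpretability the abstract highlights.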
Short Abstract: EMBL-EBI has established the MetaboLights database as one of the most successful international metabolomics repositories. Recommended by a number of leading journals including Nature, PLOS and Metabolomics, MetaboLights hosts a wealth of cross-species, cross-technique, open access experimental research. The service's unique manual curation maintains quality, provides helpful support for users and ensures accessibility for secondary analysis of studies. With nearly 3000 species represented, the enriched reference layer displays information outlining the chemical and biological nature of numerous metabolites. Structural and spectral references, network pathways, enzymatic reactions, and associated literature equip the user with a knowledge base to enrich their research. Focusing on adding further value and functionality, EMBL-EBI is currently integrating online analysis tools to provide a complete workflow, from data management, processing and analysis through to data publication. MetaboLights strives to become the model resource for metabolomics and is therefore eager to develop and integrate with others where possible. As such, MetaboLights works closely with companies and Phenome Centres to ensure easy integration of data derived from commercially available kits into the database. MetaboLights actively collaborates in the development of tools that allow convenient discovery of metabolomics and multi-omics research such as MetabolomeXchange and OmicsDI.